# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 7 - Notebook 02: Model Evaluation - The Complete Guide
**Instructor:** Amir Charkhi |  **Goal:** Mastering ML Evaluation Metrics

> Format: theory → implementation → best practices → real-world application.

## How Do We Know If Our Model Is Good? 🎯

**Learning Objectives:**
- Master classification metrics: accuracy, precision, recall, F1-score, ROC-AUC
- Master regression metrics: MSE, RMSE, MAE, R²
- Understand when to use which metric (CRITICAL!)
- Interpret confusion matrices like a pro
- Avoid common evaluation pitfalls
- Connect metrics to real business decisions

**Prerequisites:** Notebook 01 (ML Fundamentals & Lifecycle)



## 🤔 Why Can't We Just Use Accuracy?

**Scenario:** You're building a fraud detection model for a bank.

Dataset:
- 990 legitimate transactions
- 10 fraudulent transactions

**Model A:**
```python
def predict(transaction):
    return "legitimate"  # Always predict legitimate!
```

**Accuracy of Model A:** 990/1000 = **99%** 🤯

But this model is USELESS! It never catches fraud!

**The lesson:** Different problems need different metrics. Accuracy alone can be dangerously misleading.
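
To make this concrete, here is a minimal sketch that scores Model A on the 990/10 dataset above. The label lists and variable names are illustrative (they are not from a real fraud dataset), and plain Python is used so the arithmetic stays visible.

```python
# Toy version of the scenario: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true_toy = [0] * 990 + [1] * 10

# Model A always predicts "legitimate" (0).
y_pred_a = [0] * len(y_true_toy)

# Accuracy: share of predictions that match the true label.
accuracy_a = sum(t == p for t, p in zip(y_true_toy, y_pred_a)) / len(y_true_toy)

# Recall: share of actual fraud cases that the model flagged.
fraud_caught = sum(1 for t, p in zip(y_true_toy, y_pred_a) if t == 1 and p == 1)
recall_a = fraud_caught / sum(y_true_toy)

print(f"Accuracy: {accuracy_a:.1%}")  # 99.0%
print(f"Recall:   {recall_a:.1%}")    # 0.0%
```

Same model, same predictions: 99% accuracy and 0% recall. Which number you report changes the story completely.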

This notebook will teach you:
- Which metrics to use for which problems
- How to interpret each metric
- How metrics connect to business outcomes

In [None]:
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.datasets import load_breast_cancer, load_diabetes, make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Evaluation metrics - Classification
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score,
    ConfusionMatrixDisplay
)

# Evaluation metrics - Regression
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("📚 Evaluation Metrics Guide:")
print("")
print("CLASSIFICATION METRICS:")
print("  - Accuracy: Overall correctness")
print("  - Precision: Of those we predicted positive, how many were right?")
print("  - Recall: Of all actual positives, how many did we find?")
print("  - F1-Score: Balance between precision and recall")
print("  - ROC-AUC: Overall ability to discriminate between classes")
print("")
print("REGRESSION METRICS:")
print("  - MSE: Mean Squared Error")
print("  - RMSE: Root Mean Squared Error (same units as target)")
print("  - MAE: Mean Absolute Error")
print("  - R²: Proportion of variance explained")
print("")
print("✅ All libraries loaded!")

---

## 🎯 Part 1: Classification Metrics Deep Dive

Let's use a real medical dataset: Breast Cancer Detection

### 1.1 Setup: Build a Model to Evaluate

In [None]:
print("🏥 BREAST CANCER DETECTION - Setup\n")

# Load data
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target, name='diagnosis')

# 0 = malignant (cancer), 1 = benign (not cancer)
# Let's flip it to make it more intuitive: 1 = cancer, 0 = no cancer
y = 1 - y

print(f"Dataset: {len(X)} patients")
print(f"Features: {X.shape[1]} measurements")
print(f"\nTarget distribution:")
print(f"  Cancer: {(y==1).sum()} patients ({(y==1).sum()/len(y)*100:.1f}%)")
print(f"  Healthy: {(y==0).sum()} patients ({(y==0).sum()/len(y)*100:.1f}%)")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability of cancer

print("\n✅ Model trained and predictions made!")
print("Now let's evaluate it with multiple metrics...")

### 1.2 The Confusion Matrix - Foundation of All Classification Metrics

In [None]:
print("🎭 THE CONFUSION MATRIX\n")

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract components
tn, fp, fn, tp = cm.ravel()

print("Understanding the Confusion Matrix:")
print("")
print("                    PREDICTED")
print("               Negative  Positive")
print("ACTUAL Negative    TN        FP     (FP = Type I Error)")
print("       Positive    FN        TP     (FN = Type II Error)")
print("")
print("Where:")
print("  TN (True Negative):  Correctly predicted no cancer")
print("  TP (True Positive):  Correctly predicted cancer")
print("  FP (False Positive): Predicted cancer but was healthy (False Alarm)")
print("  FN (False Negative): Predicted healthy but had cancer (Missed Case!)")
print("")
print("="*60)
print("\nOur Model's Confusion Matrix:")
print(f"  True Negatives (TN):  {tn:3d} - Correctly identified healthy")
print(f"  True Positives (TP):  {tp:3d} - Correctly identified cancer")
print(f"  False Positives (FP): {fp:3d} - Healthy but predicted cancer")
print(f"  False Negatives (FN): {fn:3d} - Cancer but predicted healthy ⚠️")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Healthy', 'Cancer'])
disp.plot(ax=axes[0], cmap='Blues', values_format='d')
axes[0].set_title('Confusion Matrix\n(Numbers)', fontsize=13, fontweight='bold')

# Normalized confusion matrix
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
disp_norm = ConfusionMatrixDisplay(confusion_matrix=cm_normalized, display_labels=['Healthy', 'Cancer'])
disp_norm.plot(ax=axes[1], cmap='Oranges', values_format='.2%')
axes[1].set_title('Confusion Matrix\n(Percentages)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print(f"   The model detects {cm_normalized[1,1]:.1%} of actual cancer cases (Recall)")
# Precision must come from raw counts: row-normalized cells cannot be combined down a column
print(f"   When it predicts cancer, it's right {tp / (tp + fp):.1%} of the time (Precision)")
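
If you want to see what the four cells mean without any library, the sketch below counts them by hand on a hypothetical set of ten patients. The helper `confusion_counts` and the `toy_*` names are made up for this illustration; in the notebook itself, `confusion_matrix(...).ravel()` does the same job.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels for ten patients (1 = cancer, 0 = healthy).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

toy_tn, toy_fp, toy_fn, toy_tp = confusion_counts(toy_true, toy_pred)

# Recall reads along the "actual positive" row, precision down the "predicted positive" column.
toy_recall = toy_tp / (toy_tp + toy_fn)
toy_precision = toy_tp / (toy_tp + toy_fp)

print(f"TN={toy_tn} FP={toy_fp} FN={toy_fn} TP={toy_tp}")
print(f"Recall={toy_recall:.2f}  Precision={toy_precision:.2f}")
```

On this toy data recall is 0.75 and precision is 0.60, a reminder that the two metrics read different slices of the same matrix.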

### 1.3 Accuracy - The Most Basic Metric

In [None]:
print("📊 ACCURACY\n")

accuracy = accuracy_score(y_test, y_pred)

print("Definition: (TP + TN) / Total")
print("Meaning: What percentage of predictions were correct?")
print("")
print(f"Calculation: ({tp} + {tn}) / {len(y_test)} = {accuracy:.4f}")
print(f"Accuracy: {accuracy:.1%}")
print("")
print("✅ Advantages:")
print("   - Easy to understand and explain")
print("   - Good when classes are balanced")
print("")
print("❌ Limitations:")
print("   - Misleading with imbalanced classes")
print("   - Doesn't distinguish between types of errors")
print("   - In medicine: Missing a cancer case is worse than a false alarm!")
print("")
print("🎯 Use when: Classes are balanced and all errors are equally bad")

### 1.4 Precision - "When I predict positive, am I usually right?"

In [None]:
print("🎯 PRECISION\n")

precision = precision_score(y_test, y_pred)

print("Definition: TP / (TP + FP)")
print("Meaning: Of all patients we predicted have cancer, what % actually have it?")
print("")
print(f"Calculation: {tp} / ({tp} + {fp}) = {precision:.4f}")
print(f"Precision: {precision:.1%}")
print("")
print("💡 Interpretation:")
print(f"   When our model says 'cancer', it's correct {precision:.1%} of the time")
print(f"   False discovery rate: {(1-precision)*100:.1f}% of 'cancer' predictions are false alarms")
print("")
print("🎯 Use when: False Positives are costly")
print("   Examples:")
print("   - Spam detection: Don't want important emails in spam")
print("   - Legal cases: Don't want to wrongly convict innocent people")
print("   - Marketing: Don't want to waste money on unlikely customers")

### 1.5 Recall (Sensitivity) - "Am I finding all the positives?"

In [None]:
print("🔍 RECALL (also called Sensitivity or True Positive Rate)\n")

recall = recall_score(y_test, y_pred)

print("Definition: TP / (TP + FN)")
print("Meaning: Of all patients who actually have cancer, what % did we catch?")
print("")
print(f"Calculation: {tp} / ({tp} + {fn}) = {recall:.4f}")
print(f"Recall: {recall:.1%}")
print("")
print("💡 Interpretation:")
print(f"   We catch {recall:.1%} of all cancer cases")
print(f"   Miss rate: {(1-recall)*100:.1f}% of cancer cases go undetected ⚠️")
print("")
print("🎯 Use when: False Negatives are costly (missing positives is bad!)")
print("   Examples:")
print("   - Cancer detection: Can't miss cancer cases!")
print("   - Fraud detection: Must catch fraud, even at the cost of some false alarms")
print("   - Security: Better safe than sorry")
print("   - COVID testing: Don't want to miss infected people")
print("")
print("⚖️ The Precision-Recall Tradeoff:")
print("   - High precision → fewer false alarms but might miss cases")
print("   - High recall → catch more cases but more false alarms")
print("   - Can't optimize both perfectly - must choose based on problem!")
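
The tradeoff is easiest to see by moving the decision threshold. The sketch below uses eight hypothetical scores and labels (not the cancer model's output) and a made-up helper, `precision_recall_at`, to show how both metrics shift as the threshold drops.

```python
# Hypothetical predicted probabilities of the positive class, with their true labels.
toy_scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.8, 0.5, 0.25):
    p_at, r_at = precision_recall_at(threshold, toy_scores, toy_labels)
    print(f"threshold={threshold:.2f}  precision={p_at:.2f}  recall={r_at:.2f}")
```

On this toy data, lowering the threshold from 0.80 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.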

### 1.6 F1-Score - Balancing Precision and Recall

In [None]:
print("⚖️ F1-SCORE\n")

f1 = f1_score(y_test, y_pred)

print("Definition: 2 × (Precision × Recall) / (Precision + Recall)")
print("Meaning: Harmonic mean of precision and recall")
print("")
print(f"Calculation: 2 × ({precision:.3f} × {recall:.3f}) / ({precision:.3f} + {recall:.3f})")
print(f"F1-Score: {f1:.4f} ({f1:.1%})")
print("")
print("💡 Why harmonic mean?")
print("   - Penalizes extreme imbalance")
print("   - If precision is 90% but recall is 10%, F1 is only 18%")
print("   - Forces you to care about both metrics")
print("")
print("🎯 Use when:")
print("   - You need a single metric but care about precision AND recall")
print("   - Classes are imbalanced")
print("   - You can't decide which error type is worse")
print("")

# Visualize all metrics together
metrics = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1
}

plt.figure(figsize=(10, 6))
bars = plt.bar(metrics.keys(), metrics.values(), 
               color=['skyblue', 'lightcoral', 'lightgreen', 'gold'],
               alpha=0.7, edgecolor='black', linewidth=1.5)
plt.ylim([0, 1])
plt.ylabel('Score', fontsize=12)
plt.title('Classification Metrics Comparison', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, (name, value) in zip(bars, metrics.items()):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.02,
            f'{value:.1%}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📋 Full Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Healthy', 'Cancer']))
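
To check the "90% precision, 10% recall" claim from the F1 discussion, the sketch below compares the harmonic mean with a plain average. The helper name `f1_from` is made up for this example; `f1_score` in scikit-learn computes the same quantity from labels.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p_val, r_val in [(0.9, 0.1), (0.5, 0.5), (0.9, 0.8)]:
    arithmetic = (p_val + r_val) / 2
    print(f"P={p_val:.1f} R={r_val:.1f}  "
          f"arithmetic mean={arithmetic:.2f}  F1={f1_from(p_val, r_val):.2f}")
```

For the lopsided pair the plain average is 0.50 but F1 is 0.18; when precision and recall are close, the two means nearly agree.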

### 1.7 ROC Curve & AUC - The Big Picture Metric

In [None]:
print("📈 ROC CURVE & AUC SCORE\n")

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print("ROC = Receiver Operating Characteristic")
print("AUC = Area Under the Curve")
print("")
print("What it shows:")
print("  - Trade-off between True Positive Rate (Recall) and False Positive Rate")
print("  - Model's ability to discriminate between classes across all thresholds")
print("")
print(f"Our model's ROC-AUC: {roc_auc:.4f}")
print("")
print("Interpretation of AUC:")
print("  1.0 = Perfect classifier")
print("  0.9-1.0 = Excellent")
print("  0.8-0.9 = Very Good")
print("  0.7-0.8 = Good")
print("  0.6-0.7 = Fair")
print("  0.5 = Random guessing (coin flip)")
print("  < 0.5 = Worse than random (something's wrong!)")
print("")

# Plot ROC curve
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, color='darkorange', lw=2, 
         label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', 
         label='Random Classifier (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve - Cancer Detection Model', fontsize=14, fontweight='bold')
plt.legend(loc="lower right", fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Key Insights:")
print("   - The closer the curve is to the top-left corner, the better the model")
print("   - AUC = probability that the model scores a random positive higher than a random negative")
print(f"   - Our AUC of {roc_auc:.2f} means: a randomly chosen cancer case gets a higher score than a randomly chosen healthy case about {roc_auc:.0%} of the time")
print("")
print("🎯 Use when:")
print("   - Comparing models at a glance")
print("   - You don't care about a specific threshold")
print("   - Binary classification problems")
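
The ranking interpretation of AUC can be verified directly by comparing every positive with every negative. This is a sketch for intuition only: `pairwise_auc` and the six toy scores are hypothetical, and the double loop is far too slow for real datasets, where `roc_auc_score` is the right tool.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive gets the higher score.

    Ties count as half a win. The cost is O(n_pos * n_neg), so use it on small data only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical model scores: one positive (0.6) is ranked below one negative (0.7).
auc_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_labels = [1, 1, 0, 1, 0, 0]

toy_auc = pairwise_auc(auc_scores, auc_labels)
print(f"Pairwise AUC: {toy_auc:.3f}")
```

Eight of the nine positive/negative pairs are ordered correctly here, so the AUC is 8/9 ≈ 0.889.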

---

## 📉 Part 2: Regression Metrics Deep Dive

Now let's look at metrics for predicting continuous values!

### 2.1 Setup: Build a Regression Model

In [None]:
print("🏥 DIABETES PROGRESSION PREDICTION - Setup\n")

# Load diabetes dataset
diabetes = load_diabetes()
X_reg = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y_reg = pd.Series(diabetes.target, name='progression')

print(f"Dataset: {len(X_reg)} patients")
print(f"Features: {X_reg.shape[1]} measurements")
print(f"Target: Disease progression (quantitative measure)")
print(f"  Range: {y_reg.min():.1f} to {y_reg.max():.1f}")
print(f"  Mean: {y_reg.mean():.1f}")
print(f"  Std: {y_reg.std():.1f}")

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)

# Make predictions
y_train_pred_reg = reg_model.predict(X_train_reg)
y_test_pred_reg = reg_model.predict(X_test_reg)

print("\n✅ Regression model trained!")

# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
axes[0].scatter(y_train_reg, y_train_pred_reg, alpha=0.5, s=30)
axes[0].plot([y_train_reg.min(), y_train_reg.max()], 
             [y_train_reg.min(), y_train_reg.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Progression', fontsize=11)
axes[0].set_ylabel('Predicted Progression', fontsize=11)
axes[0].set_title('Training Set: Predictions vs Actual', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test set
axes[1].scatter(y_test_reg, y_test_pred_reg, alpha=0.5, s=30, color='coral')
axes[1].plot([y_test_reg.min(), y_test_reg.max()], 
             [y_test_reg.min(), y_test_reg.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Progression', fontsize=11)
axes[1].set_ylabel('Predicted Progression', fontsize=11)
axes[1].set_title('Test Set: Predictions vs Actual', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Closer to the red line = better predictions")

### 2.2 Mean Absolute Error (MAE)

In [None]:
print("📏 MEAN ABSOLUTE ERROR (MAE)\n")

mae_train = mean_absolute_error(y_train_reg, y_train_pred_reg)
mae_test = mean_absolute_error(y_test_reg, y_test_pred_reg)

print("Definition: Average of absolute differences between predictions and actual")
print("Formula: (1/n) × Σ|actual - predicted|")
print("")
print(f"Training MAE: {mae_train:.2f}")
print(f"Test MAE: {mae_test:.2f}")
print("")
print("💡 Interpretation:")
print(f"   On average, predictions are off by {mae_test:.1f} units")
print(f"   As % of target range: {mae_test/(y_test_reg.max()-y_test_reg.min())*100:.1f}%")
print("")
print("✅ Advantages:")
print("   - Easy to interpret (same units as target)")
print("   - Less sensitive to outliers than MSE/RMSE")
print("   - All errors weighted equally")
print("")
print("❌ Limitations:")
print("   - Doesn't penalize large errors more than small ones")
print("")
print("🎯 Use when: You want an intuitive error metric and all errors are equally bad")

### 2.3 Mean Squared Error (MSE) & Root Mean Squared Error (RMSE)

In [None]:
print("📐 MEAN SQUARED ERROR (MSE) & ROOT MEAN SQUARED ERROR (RMSE)\n")

mse_train = mean_squared_error(y_train_reg, y_train_pred_reg)
mse_test = mean_squared_error(y_test_reg, y_test_pred_reg)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

print("MSE Definition: Average of squared differences")
print("Formula: (1/n) × Σ(actual - predicted)²")
print("")
print("RMSE Definition: Square root of MSE")
print("Formula: √MSE")
print("")
print("Scores:")
print(f"  Training MSE:  {mse_train:.2f}")
print(f"  Test MSE:      {mse_test:.2f}")
print(f"  Training RMSE: {rmse_train:.2f}")
print(f"  Test RMSE:     {rmse_test:.2f}")
print("")
print("💡 Key Differences from MAE:")
print(f"   MAE:  {mae_test:.2f} (average absolute error)")
print(f"   RMSE: {rmse_test:.2f} (penalizes large errors more)")
print("   RMSE is always ≥ MAE; the gap widens when a few errors are much larger than the rest")
print("")
print("✅ Advantages:")
print("   - Penalizes large errors heavily (good for critical applications)")
print("   - RMSE in same units as target (interpretable)")
print("   - Mathematically convenient (differentiable)")
print("")
print("❌ Limitations:")
print("   - Sensitive to outliers")
print("   - MSE not in same units (squared)")
print("")
print("🎯 Use when:")
print("   - Large errors are much worse than small errors")
print("   - Example: House price prediction (being off by $100k worse than off by $10k)")

# Visualize comparison
errors = np.abs(y_test_reg - y_test_pred_reg)
squared_errors = (y_test_reg - y_test_pred_reg) ** 2

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Error distribution
axes[0].hist(errors, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].axvline(mae_test, color='red', linestyle='--', linewidth=2, label=f'MAE = {mae_test:.1f}')
axes[0].axvline(rmse_test, color='orange', linestyle='--', linewidth=2, label=f'RMSE = {rmse_test:.1f}')
axes[0].set_xlabel('Absolute Error', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Distribution of Prediction Errors', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Comparison of errors
sample_errors = np.array([5, 10, 20, 50])
sample_mae = sample_errors
sample_mse = sample_errors ** 2
x = np.arange(len(sample_errors))
width = 0.35
axes[1].bar(x - width/2, sample_mae, width, label='Absolute Error', alpha=0.7, color='skyblue')
axes[1].bar(x + width/2, sample_mse, width, label='Squared Error', alpha=0.7, color='coral')
axes[1].set_xlabel('Error Magnitude', fontsize=11)
axes[1].set_ylabel('Penalized Value', fontsize=11)
axes[1].set_title('How MAE vs MSE Penalizes Errors', fontsize=12, fontweight='bold')
axes[1].set_xticks(x)
axes[1].set_xticklabels(['5', '10', '20', '50'])
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Notice: Squared error grows much faster for large errors!")
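
A quick way to feel the difference between MAE and RMSE is to hold the total absolute error fixed and concentrate it in one prediction. The two error lists below are invented for illustration, and `mae` / `rmse` are small helpers written for this sketch rather than scikit-learn functions.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [10, 10, 10, 10]   # every prediction is off by 10
one_big_miss = [2, 2, 2, 34]     # same total absolute error, one large miss

for name, errs in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name:12s}  MAE={mae(errs):.2f}  RMSE={rmse(errs):.2f}")
```

MAE is 10 in both cases, but RMSE jumps from 10 to about 17.09 once the error is concentrated in a single prediction. That gap is what "penalizes large errors more" means in practice.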

### 2.4 R² Score (Coefficient of Determination)

In [None]:
print("📊 R² SCORE (R-SQUARED)\n")

r2_train = r2_score(y_train_reg, y_train_pred_reg)
r2_test = r2_score(y_test_reg, y_test_pred_reg)

print("Definition: Proportion of variance in the target explained by the model")
print("Formula: 1 - (SS_residual / SS_total)")
print("")
print(f"Training R²: {r2_train:.4f}")
print(f"Test R²: {r2_test:.4f}")
print("")
print("💡 Interpretation:")
print(f"   Model explains {r2_test*100:.1f}% of variance in disease progression")
print(f"   {(1-r2_test)*100:.1f}% of variance is unexplained (other factors)")
print("")
print("Understanding R² values:")
print("  1.0 = Perfect predictions (explains 100% of variance)")
print("  0.9-1.0 = Excellent")
print("  0.7-0.9 = Good")
print("  0.5-0.7 = Moderate")
print("  0.3-0.5 = Weak")
print("  < 0.3 = Very weak")
print("  0.0 = No better than predicting the mean")
print("  < 0.0 = Worse than predicting the mean! (something's wrong)")
print("")
print("✅ Advantages:")
print("   - Scale-independent (always between -∞ and 1)")
print("   - Easy to interpret as percentage")
print("   - Shows model's explanatory power")
print("")
print("❌ Limitations:")
print("   - Can be misleading with non-linear relationships")
print("   - Never decreases on training data as features are added (use adjusted R² for this)")
print("")
print("🎯 Use when:")
print("   - You want to understand model's explanatory power")
print("   - Comparing models with same target variable")
print("   - Communicating to non-technical stakeholders")

# Visualize R² concept
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Model with different R² values
scenarios = [
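    # Mean plus seeded noise scores below zero on R²; the text box on the plot shows the actual value
    ('Poor Model (mean + noise)', y_test_reg, y_test_reg.mean() + np.random.default_rng(42).normal(0, 50, len(y_test_reg))),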
    ('Our Model (R²={:.2f})'.format(r2_test), y_test_reg, y_test_pred_reg),
    ('Perfect Model (R²=1.0)', y_test_reg, y_test_reg)
]

for i, (title, actual, predicted) in enumerate(scenarios):
    axes[i].scatter(actual, predicted, alpha=0.5, s=30)
    axes[i].plot([actual.min(), actual.max()], 
                 [actual.min(), actual.max()], 
                 'r--', lw=2)
    r2 = r2_score(actual, predicted)
    axes[i].set_xlabel('Actual', fontsize=10)
    axes[i].set_ylabel('Predicted', fontsize=10)
    axes[i].set_title(f'{title}', fontsize=11, fontweight='bold')
    axes[i].grid(True, alpha=0.3)
    axes[i].text(0.05, 0.95, f'R² = {r2:.2f}', 
                transform=axes[i].transAxes, 
                fontsize=12, fontweight='bold',
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()
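
The formula `1 - (SS_residual / SS_total)` is short enough to compute by hand, which also shows where 0 and negative values come from. The helper `r_squared` and the toy numbers below are assumptions made for this sketch; `r2_score` remains the function used in the notebook.

```python
def r_squared(actual, predicted):
    """1 - SS_residual / SS_total for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

toy_actual = [100, 150, 200, 250, 300]
mean_only = [200] * 5                    # always predict the mean of the target
close_fit = [110, 140, 210, 240, 310]    # every prediction is off by 10
bad_fit = [300, 250, 200, 150, 100]      # trend reversed

for name, preds in [("mean only", mean_only), ("close fit", close_fit), ("bad fit", bad_fit)]:
    print(f"{name:10s} R² = {r_squared(toy_actual, preds):.2f}")
```

Predicting the mean gives exactly 0, the close fit gives 0.98, and the reversed trend gives -3.00, so R² is bounded above by 1 but not below by 0.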

### 2.5 Regression Metrics Summary

In [None]:
print("📊 REGRESSION METRICS - COMPLETE SUMMARY\n")

# Calculate all metrics
metrics_summary = pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'RMSE', 'R²'],
    'Training': [
        f"{mae_train:.2f}",
        f"{mse_train:.2f}",
        f"{rmse_train:.2f}",
        f"{r2_train:.4f}"
    ],
    'Test': [
        f"{mae_test:.2f}",
        f"{mse_test:.2f}",
        f"{rmse_test:.2f}",
        f"{r2_test:.4f}"
    ],
    'Interpretation': [
        'Average absolute error',
        'Average squared error',
        'Root of average squared error (same units as target)',
        'Variance explained (at most 1, can be negative)'
    ]
})

print(metrics_summary.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Error metrics
error_metrics = ['MAE', 'RMSE']
train_errors = [mae_train, rmse_train]
test_errors = [mae_test, rmse_test]
x = np.arange(len(error_metrics))
width = 0.35
axes[0].bar(x - width/2, train_errors, width, label='Training', alpha=0.7, color='skyblue')
axes[0].bar(x + width/2, test_errors, width, label='Test', alpha=0.7, color='coral')
axes[0].set_ylabel('Error', fontsize=11)
axes[0].set_title('Error Metrics Comparison', fontsize=12, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(error_metrics)
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# R² score
r2_data = [r2_train, r2_test]
colors = ['skyblue', 'coral']
bars = axes[1].bar(['Training', 'Test'], r2_data, color=colors, alpha=0.7, width=0.5)
axes[1].set_ylim([0, 1])
axes[1].set_ylabel('R² Score', fontsize=11)
axes[1].set_title('R² Score Comparison', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
for bar, val in zip(bars, r2_data):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.02,
                f'{val:.3f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
if abs(r2_train - r2_test) < 0.1:
    print("   ✅ Training and test R² are similar - good generalization!")
else:
    print("   ⚠️ Gap between training and test R² - possible overfitting!")

---

## 🎯 Part 3: Which Metric Should You Use?

The ultimate decision framework!

In [None]:
print("🧭 METRIC SELECTION GUIDE\n")
print("="*70)

print("\n📋 CLASSIFICATION PROBLEMS:\n")

classification_guide = pd.DataFrame([
    ['Balanced classes, all errors equal', 'Accuracy'],
    ['Must minimize false alarms', 'Precision'],
    ['Cannot miss positive cases (life/death)', 'Recall'],
    ['Need balance, imbalanced classes', 'F1-Score'],
    ['Comparing models, threshold-independent', 'ROC-AUC'],
    ['Medical diagnosis', 'Recall + F1'],
    ['Spam detection', 'Precision'],
    ['Fraud detection', 'Recall + ROC-AUC'],
    ['Marketing (predict buyers)', 'Precision + F1']
], columns=['Situation', 'Best Metric(s)'])

print(classification_guide.to_string(index=False))

print("\n" + "="*70)
print("\n📈 REGRESSION PROBLEMS:\n")

regression_guide = pd.DataFrame([
    ['Simple interpretation needed', 'MAE or RMSE'],
    ['All errors equally bad', 'MAE'],
    ['Large errors much worse', 'MSE or RMSE'],
    ['Data contains outliers', 'MAE (more robust)'],
    ['Model comparison (same target)', 'R²'],
    ['Explaining model to stakeholders', 'R² + RMSE'],
    ['House price prediction', 'RMSE (in $)'],
    ['Stock price prediction', 'MAE + RMSE'],
    ['Weather forecasting', 'MAE']
], columns=['Situation', 'Best Metric(s)'])

print(regression_guide.to_string(index=False))

print("\n" + "="*70)
print("\n⚠️ COMMON MISTAKES TO AVOID:\n")
print("1. ❌ Using only accuracy for imbalanced classes")
print("2. ❌ Not considering the business cost of errors")
print("3. ❌ Optimizing for a metric that doesn't match your goal")
print("4. ❌ Looking at training metrics instead of test metrics")
print("5. ❌ Reporting only R² when the size of the errors (MAE/RMSE) is what matters")
print("")
print("🎯 GOLDEN RULE: Choose metrics based on business impact, not convenience!")
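
One way to apply the golden rule is to put a price on each error type and compare models by total cost instead of accuracy. Everything in the sketch below is hypothetical: the two models, their error counts, and the dollar costs are invented to illustrate the idea, and real costs would have to come from the business.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a cost per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn

N_PATIENTS = 1000  # both hypothetical models are scored on the same 1,000 patients

model_cautious = {"fp": 60, "fn": 2}   # many false alarms, few missed cases
model_strict = {"fp": 10, "fn": 12}    # few false alarms, more missed cases

# Assumed costs: a follow-up test for a false alarm vs a missed diagnosis.
COST_FP = 200
COST_FN = 20_000

cost_cautious = total_error_cost(model_cautious["fp"], model_cautious["fn"], COST_FP, COST_FN)
cost_strict = total_error_cost(model_strict["fp"], model_strict["fn"], COST_FP, COST_FN)

acc_cautious = 1 - (model_cautious["fp"] + model_cautious["fn"]) / N_PATIENTS
acc_strict = 1 - (model_strict["fp"] + model_strict["fn"]) / N_PATIENTS

print(f"Cautious: accuracy={acc_cautious:.1%}  cost=${cost_cautious:,}")
print(f"Strict:   accuracy={acc_strict:.1%}  cost=${cost_strict:,}")
```

Under these assumed costs the "strict" model wins on accuracy (97.8% vs 93.8%) but its mistakes cost $242,000 against $52,000, so the less accurate model is the better business choice.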

---

## 🎓 Key Takeaways

You now understand model evaluation inside and out! Here's your framework:

### For Classification:
1. **Confusion Matrix** - Foundation of all metrics
2. **Accuracy** - Overall correctness (use with balanced classes)
3. **Precision** - How reliable are positive predictions?
4. **Recall** - Are we catching all positives?
5. **F1-Score** - Balanced metric for imbalanced classes
6. **ROC-AUC** - Overall discrimination ability

### For Regression:
1. **MAE** - Average absolute error (intuitive, robust)
2. **MSE** - Average squared error (penalizes large errors)
3. **RMSE** - Square root of MSE (same units as target)
4. **R²** - Variance explained (at most 1; negative if worse than predicting the mean)

### Critical Principles:
- ✅ **Always use multiple metrics** - no single metric tells the whole story
- ✅ **Match metrics to business goals** - what error is most costly?
- ✅ **Evaluate on the test set** - training metrics can be misleading
- ✅ **Consider class imbalance** - accuracy can be deceptive
- ✅ **Understand tradeoffs** - precision vs recall, MAE vs RMSE

### Decision Framework:
```
1. What's my problem type? (Classification / Regression)
2. Are classes balanced? (If no → don't use accuracy alone)
3. What's more costly? (False positive / False negative)
4. What do stakeholders care about? (Interpretability / Performance)
5. Choose 2-3 metrics that capture these priorities
```

---

## 🚀 Next Steps

Now you know HOW to evaluate models. Next:

**Notebook 03**: Cross-validation and proper model selection
- Why single train/test split isn't enough
- K-fold cross-validation
- Comparing models fairly
- Avoiding overfitting

Then Weeks 8-12: Apply these evaluation techniques to different algorithms!

In [None]:
print("🎉 Congratulations! You've mastered model evaluation!")
print("")
print("📚 You learned:")
print("   ✅ All major classification metrics")
print("   ✅ All major regression metrics")
print("   ✅ When to use which metric")
print("   ✅ How to interpret confusion matrices")
print("   ✅ Real-world metric selection")
print("")
print("🎯 Next: Notebook 03 - Cross-Validation & Model Selection")
print("   Learn how to properly compare and select models!")