# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 7 - Notebook 03: Cross-Validation & Model Selection
**Instructor:** Amir Charkhi |  **Goal:** Proper Model Comparison & Selection

> Format: theory → implementation → best practices → real-world application.

## How Do We Choose the Best Model? 🏆

**Learning Objectives:**
- Understand why a single train/test split isn't enough
- Master K-fold cross-validation
- Learn stratified cross-validation for classification
- Interpret validation curves and learning curves
- Compare multiple models fairly
- Avoid common pitfalls in model selection

**Prerequisites:** Notebooks 01 (ML Fundamentals) and 02 (Model Evaluation)



## 🤔 The Problem with Single Train/Test Split

**Scenario:** You're comparing two models...

```
Split 1 (random_state=42):
Model A: 85% accuracy
Model B: 83% accuracy
→ Choose Model A! ✅

Split 2 (random_state=99):
Model A: 81% accuracy
Model B: 87% accuracy
→ Wait... Choose Model B? 🤔
```

**The Issue:** Measured performance depends on which data points happen to land in the test set!

**The Solution:** Cross-validation: test on several different splits and average the results.

This notebook will teach you:
- How to evaluate models robustly
- Techniques to avoid overfitting
- How to select models with confidence

In [None]:
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    cross_validate,
    KFold,
    StratifiedKFold,
    learning_curve,
    validation_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("📚 Cross-Validation Toolkit:")
print("")
print("KEY CONCEPTS:")
print("  - K-Fold CV: Split data into K parts, train on K-1, test on 1")
print("  - Stratified K-Fold: Maintains class proportions in each fold")
print("  - Validation Curve: Performance vs hyperparameter values")
print("  - Learning Curve: Performance vs training set size")
print("")
print("✅ All libraries loaded!")

---

## 📊 Part 1: Demonstrating the Problem

### 1.1 How Random Seed Affects Results

In [None]:
print("🎲 EXPERIMENT: Impact of Random Seed on Model Performance\n")

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Test with different random seeds
seeds = range(0, 50, 5)
results = []

for seed in seeds:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    
    results.append({'seed': seed, 'accuracy': accuracy})

results_df = pd.DataFrame(results)

print(f"Results across {len(seeds)} different random splits:")
print(f"  Min accuracy:  {results_df['accuracy'].min():.1%}")
print(f"  Max accuracy:  {results_df['accuracy'].max():.1%}")
print(f"  Mean accuracy: {results_df['accuracy'].mean():.1%}")
print(f"  Std deviation: {results_df['accuracy'].std():.1%}")
print(f"  Range:         {(results_df['accuracy'].max() - results_df['accuracy'].min())*100:.1f} percentage points!")

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(results_df['seed'], results_df['accuracy'], 'o-', linewidth=2, markersize=8, alpha=0.6)
plt.axhline(results_df['accuracy'].mean(), color='red', linestyle='--', 
            linewidth=2, label=f"Mean: {results_df['accuracy'].mean():.1%}")
plt.fill_between(results_df['seed'], 
                 results_df['accuracy'].mean() - results_df['accuracy'].std(),
                 results_df['accuracy'].mean() + results_df['accuracy'].std(),
                 alpha=0.2, color='red', label='±1 Std Dev')
plt.xlabel('Random Seed', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Model Performance Varies with Random Split!', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

acc_range = results_df['accuracy'].max() - results_df['accuracy'].min()
print("\n💡 Key Insight:")
print(f"   Same model, same data, but accuracy varies by {acc_range * 100:.1f} percentage points across these splits!")
print("   → A single split gives an unreliable estimate of true performance")
print("   → Solution: Cross-validation! Test on multiple splits and average.")
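
As a rough sanity check on the spread above, the sampling noise of an accuracy measured on `n` test samples can be approximated by the binomial standard error, sqrt(p(1-p)/n). The sketch below assumes a true accuracy of 0.95 (an illustrative value, not a measured result) and the 45-sample test set that `test_size=0.3` produces on the 150 Iris samples. The helper name `accuracy_standard_error` is made up for this example.

```python
import math


def accuracy_standard_error(p, n_test):
    """Binomial standard error of an accuracy estimate from n_test samples."""
    return math.sqrt(p * (1 - p) / n_test)


assumed_accuracy = 0.95  # illustrative assumption, not a result from this notebook
n_test = 45              # 30% of the 150 Iris samples

se = accuracy_standard_error(assumed_accuracy, n_test)
one_sample = 1 / n_test  # how much a single misclassified flower moves the score

print(f"Standard error with {n_test} test samples: {se:.1%}")
print(f"One extra mistake changes accuracy by:    {one_sample:.1%}")
```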

---

## 🔄 Part 2: K-Fold Cross-Validation

### 2.1 Understanding K-Fold CV

In [None]:
print("🔄 K-FOLD CROSS-VALIDATION EXPLAINED\n")

print("How it works:")
print("")
print("1. Split data into K equal parts (folds)")
print("2. For each fold:")
print("   - Use that fold as test set")
print("   - Use remaining K-1 folds as training set")
print("   - Train model and evaluate")
print("3. Average performance across all K folds")
print("")
print("Example with 5-Fold CV on 100 samples:")
print("")
print("Fold 1: Train[20-100], Test[0-20]   → Accuracy: 85%")
print("Fold 2: Train[0-20,40-100], Test[20-40] → Accuracy: 88%")
print("Fold 3: Train[0-40,60-100], Test[40-60] → Accuracy: 82%")
print("Fold 4: Train[0-60,80-100], Test[60-80] → Accuracy: 87%")
print("Fold 5: Train[0-80], Test[80-100]       → Accuracy: 84%")
print("                                         ───────────")
print("                              Mean CV Score: 85.2% ± 2.4%")
print("")
print("Benefits:")
print("  ✅ Every sample used for both training and testing")
print("  ✅ More reliable performance estimate")
print("  ✅ Get mean AND standard deviation")
print("  ✅ Reduces impact of lucky/unlucky splits")
print("")
print("Common K values:")
print("  - K=5: Good balance (most common)")
print("  - K=10: More robust but slower")
print("  - K=n (Leave-One-Out): Trains on almost all data each time, but very slow and the estimate can be noisy")
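
The three steps described above can be sketched from scratch with NumPy. This is only an illustration of the index bookkeeping, under the assumption of contiguous, unshuffled folds as in the 100-sample example; `kfold_indices` is a hypothetical helper, and scikit-learn's `KFold` remains the tool to use in practice.

```python
import numpy as np


def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k parts of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]
        # Training set = every fold except the one held out for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(kfold_indices(100, 5))
for fold, (train_idx, test_idx) in enumerate(splits, 1):
    print(f"Fold {fold}: train size={len(train_idx)}, "
          f"test rows {test_idx[0]} to {test_idx[-1]}")
```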

### 2.2 Implementing K-Fold CV

In [None]:
print("💻 IMPLEMENTING K-FOLD CROSS-VALIDATION\n")

# Load data
wine = load_wine()
X = wine.data
y = wine.target

print(f"Dataset: {len(X)} wine samples, {X.shape[1]} features, {len(np.unique(y))} classes")
print("")

# Create model
model = LogisticRegression(max_iter=10000)

# Method 1: Simple cross_val_score
# (for classifiers, cv=5 means StratifiedKFold with 5 folds and no shuffling)
print("Method 1: Simple cross_val_score")
print("="*50)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"5-Fold CV Scores: {[f'{score:.3f}' for score in cv_scores]}")
print(f"Mean CV Score: {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")
print("")

# Method 2: Manual K-Fold with more control
print("Method 2: Manual K-Fold (more control)")
print("="*50)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    fold_scores.append(score)
    
    print(f"Fold {fold}: Train size={len(train_idx)}, Test size={len(test_idx)}, Accuracy={score:.3f}")

print(f"\nMean: {np.mean(fold_scores):.3f} (±{np.std(fold_scores):.3f})")

# Visualize fold scores
plt.figure(figsize=(10, 6))
folds = range(1, len(fold_scores) + 1)
plt.bar(folds, fold_scores, alpha=0.6, color='skyblue', edgecolor='black')
plt.axhline(np.mean(fold_scores), color='red', linestyle='--', linewidth=2, 
            label=f'Mean: {np.mean(fold_scores):.3f}')
plt.xlabel('Fold', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('5-Fold Cross-Validation Results', fontsize=14, fontweight='bold')
plt.ylim([0, 1])
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3, axis='y')
for i, score in enumerate(fold_scores, 1):
    plt.text(i, score + 0.02, f'{score:.3f}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

print("\n💡 Notice: Averaging over 5 folds gives a steadier estimate than any single split, and the std shows how much it varies!")
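
`cross_validate` is imported at the top of the notebook but not used in the cell above. A minimal sketch of how it could report several metrics, plus training scores, in one pass is shown here. The choice of `f1_macro` as the second metric and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X_wine, y_wine = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One call returns a dict of per-fold arrays for every requested metric
cv_results = cross_validate(
    LogisticRegression(max_iter=10000),
    X_wine, y_wine,
    cv=cv,
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True,
)

for metric in ['accuracy', 'f1_macro']:
    test_scores = cv_results[f'test_{metric}']
    train_scores = cv_results[f'train_{metric}']
    print(f"{metric:10} test: {test_scores.mean():.3f} (±{test_scores.std():.3f})  "
          f"train: {train_scores.mean():.3f}")
```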

### 2.3 Stratified K-Fold for Classification

In [None]:
print("🎯 STRATIFIED K-FOLD CROSS-VALIDATION\n")

print("Problem with regular K-Fold for classification:")
print("  If you have imbalanced classes, some folds might not have all classes!")
print("")
print("Solution: Stratified K-Fold")
print("  → Ensures each fold has same class distribution as original dataset")
print("")

# Compare regular vs stratified
print("Comparison: Regular K-Fold vs Stratified K-Fold")
print("="*60)

# Regular K-Fold
kfold_regular = KFold(n_splits=5, shuffle=True, random_state=42)
scores_regular = cross_val_score(model, X, y, cv=kfold_regular, scoring='accuracy')

# Stratified K-Fold
kfold_stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_stratified = cross_val_score(model, X, y, cv=kfold_stratified, scoring='accuracy')

print(f"Regular K-Fold:     {scores_regular.mean():.4f} (±{scores_regular.std():.4f})")
print(f"Stratified K-Fold:  {scores_stratified.mean():.4f} (±{scores_stratified.std():.4f})")
print("")

# Show class distribution in folds
print("Class distribution in each fold:\n")
for fold, (train_idx, test_idx) in enumerate(kfold_stratified.split(X, y), 1):
    y_test_fold = y[test_idx]
    class_counts = pd.Series(y_test_fold).value_counts().sort_index()
    class_props = (class_counts / len(y_test_fold) * 100)
    print(f"Fold {fold}: ", end="")
    for cls in range(len(np.unique(y))):
        print(f"Class {cls}: {class_props[cls]:.1f}%  ", end="")
    print()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regular K-Fold
axes[0].bar(range(1, 6), scores_regular, alpha=0.6, color='lightcoral', edgecolor='black')
axes[0].axhline(scores_regular.mean(), color='red', linestyle='--', linewidth=2)
axes[0].set_xlabel('Fold', fontsize=11)
axes[0].set_ylabel('Accuracy', fontsize=11)
axes[0].set_title(f'Regular K-Fold\n(Mean: {scores_regular.mean():.3f} ±{scores_regular.std():.3f})', 
                  fontsize=12, fontweight='bold')
# Shared y-axis floor just below the lowest fold score, so no bar gets clipped
y_floor = max(0.0, min(scores_regular.min(), scores_stratified.min()) - 0.05)
axes[0].set_ylim([y_floor, 1.0])
axes[0].grid(True, alpha=0.3, axis='y')

# Stratified K-Fold
axes[1].bar(range(1, 6), scores_stratified, alpha=0.6, color='skyblue', edgecolor='black')
axes[1].axhline(scores_stratified.mean(), color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Fold', fontsize=11)
axes[1].set_ylabel('Accuracy', fontsize=11)
axes[1].set_title(f'Stratified K-Fold\n(Mean: {scores_stratified.mean():.3f} ±{scores_stratified.std():.3f})', 
                  fontsize=12, fontweight='bold')
axes[1].set_ylim([y_floor, 1.0])
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n🎯 Best Practice: Always use StratifiedKFold for classification!")
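
The "some folds might not have all classes" problem is easiest to see on data that is sorted by class. The sketch below uses Iris, whose rows come ordered as 50 + 50 + 50, and compares unshuffled `KFold` with `StratifiedKFold`. Using 3 folds is a choice made here to make the effect stark, and `classes_per_test_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X_iris, y_iris = load_iris(return_X_y=True)  # rows are ordered by class


def classes_per_test_fold(splitter, X, y):
    """Count how many distinct classes appear in each test fold."""
    return [len(np.unique(y[test_idx])) for _, test_idx in splitter.split(X, y)]


plain = classes_per_test_fold(KFold(n_splits=3), X_iris, y_iris)
strat = classes_per_test_fold(StratifiedKFold(n_splits=3), X_iris, y_iris)

print(f"Unshuffled KFold:  classes per test fold = {plain}")
print(f"StratifiedKFold:   classes per test fold = {strat}")
```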

---

## 🏆 Part 3: Comparing Multiple Models

### 3.1 Model Comparison with Cross-Validation

In [None]:
print("🏆 COMPARING MULTIPLE MODELS WITH CROSS-VALIDATION\n")

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Perform 5-fold CV for each model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = []

print("Evaluating models with 5-fold cross-validation...\n")
print("="*70)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    results.append({
        'Model': name,
        'Mean Score': scores.mean(),
        'Std Dev': scores.std(),
        'Min Score': scores.min(),
        'Max Score': scores.max()
    })
    print(f"{name:25} → {scores.mean():.4f} (±{scores.std():.4f})")

results_df = pd.DataFrame(results).sort_values('Mean Score', ascending=False)

print("\n" + "="*70)
print("\n📊 Summary Table:\n")
print(results_df.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Mean scores with error bars
axes[0].barh(results_df['Model'], results_df['Mean Score'], 
             xerr=results_df['Std Dev'], alpha=0.6, 
             color='skyblue', edgecolor='black', capsize=5)
axes[0].set_xlabel('Accuracy', fontsize=12)
axes[0].set_title('Model Comparison\n(Mean ± Std Dev)', fontsize=13, fontweight='bold')
axes[0].set_xlim([0.85, 1.0])
axes[0].grid(True, alpha=0.3, axis='x')

# Box plot showing distribution
all_scores = []
labels = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    all_scores.append(scores)
    labels.append(name)

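# Tick labels are set separately because boxplot's `labels=` keyword is deprecated in newer Matplotlib
bp = axes[1].boxplot(all_scores, patch_artist=True, vert=False)
axes[1].set_yticklabels(labels)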
for patch in bp['boxes']:
    patch.set_facecolor('lightcoral')
    patch.set_alpha(0.6)
axes[1].set_xlabel('Accuracy', fontsize=12)
axes[1].set_title('Score Distribution Across Folds', fontsize=13, fontweight='bold')
axes[1].set_xlim([0.85, 1.0])
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n🏆 Winner:", results_df.iloc[0]['Model'])
print(f"   Score: {results_df.iloc[0]['Mean Score']:.4f} (±{results_df.iloc[0]['Std Dev']:.4f})")
print("\n💡 Key Insight: Use cross-validation to compare models fairly!")
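
Because every model above is scored with the same `StratifiedKFold` object and a fixed `random_state`, the folds are identical across models, so scores can be compared fold by fold. The sketch below looks at the per-fold difference between two of the models. Comparing the mean difference with its standard deviation is an informal rule of thumb chosen for this example, not a formal significance test.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Fixed seed: both models are evaluated on exactly the same 5 splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_a = cross_val_score(LogisticRegression(max_iter=200), X_iris, y_iris, cv=cv)
scores_b = cross_val_score(KNeighborsClassifier(), X_iris, y_iris, cv=cv)

diff = scores_a - scores_b  # positive means Logistic Regression did better on that fold
for fold, d in enumerate(diff, 1):
    print(f"Fold {fold}: difference = {d:+.3f}")
print(f"Mean difference: {diff.mean():+.3f} (±{diff.std():.3f})")

# Informal check: is the average gap larger than the fold-to-fold noise?
if abs(diff.mean()) <= diff.std():
    print("The gap is within fold-to-fold noise; treat the models as roughly tied.")
else:
    print("The gap exceeds fold-to-fold noise on these folds.")
```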

---

## 📈 Part 4: Learning Curves & Validation Curves

### 4.1 Learning Curves - Diagnosing Bias vs Variance

In [None]:
print("📈 LEARNING CURVES - Understanding Model Behavior\n")

print("What are learning curves?")
print("  → Show model performance vs. training set size")
print("  → Help diagnose underfitting (bias) vs overfitting (variance)")
print("")
print("How to interpret:")
print("")
print("HIGH BIAS (Underfitting):")
print("  - Train and validation scores both low")
print("  - Scores plateau early")
print("  - Small gap between them")
print("  → Solution: More complex model, more features")
print("")
print("HIGH VARIANCE (Overfitting):")
print("  - Train score high, validation score low")
print("  - Large gap between them")
print("  - Gap doesn't close with more data")
print("  → Solution: More data, regularization, simpler model")
print("")
print("GOOD FIT:")
print("  - Both scores high")
print("  - Small gap between them")
print("  - Converge with more data")
print("")

# Generate learning curves for different model complexities
models_to_compare = [
    ('Underfitting (max_depth=1)', DecisionTreeClassifier(max_depth=1, random_state=42)),
    ('Good Fit (max_depth=5)', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('Overfitting (max_depth=20)', DecisionTreeClassifier(max_depth=20, random_state=42))
]

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, model) in enumerate(models_to_compare):
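    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=5,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy',
        shuffle=True,      # Iris is ordered by class; without this, small subsets contain a single class
        random_state=42    # only takes effect when shuffle=True
    )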
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    axes[idx].plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
    axes[idx].fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                           alpha=0.1, color='blue')
    axes[idx].plot(train_sizes, val_mean, 'o-', color='red', label='Validation score')
    axes[idx].fill_between(train_sizes, val_mean - val_std, val_mean + val_std, 
                           alpha=0.1, color='red')
    
    axes[idx].set_xlabel('Training Set Size', fontsize=10)
    axes[idx].set_ylabel('Accuracy', fontsize=10)
    axes[idx].set_title(name, fontsize=11, fontweight='bold')
    axes[idx].legend(loc='lower right', fontsize=9)
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim([0.3, 1.05])

plt.tight_layout()
plt.show()

print("\n💡 What to look for:")
print("   LEFT: Both scores low & close → Underfitting (high bias)")
print("   MIDDLE: Both scores high & close → Good fit!")
print("   RIGHT: Large gap between scores → Overfitting (high variance)")
print("   Note: Iris is small and clean, so the middle and right panels may look very similar.")

### 4.2 Validation Curves - Finding Optimal Hyperparameters

In [None]:
print("📊 VALIDATION CURVES - Hyperparameter Tuning\n")

print("What are validation curves?")
print("  → Show model performance vs. hyperparameter values")
print("  → Help find optimal hyperparameter settings")
print("")
print("Example: Finding optimal max_depth for Decision Tree")
print("")

# Generate validation curve
param_range = range(1, 21)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# Find optimal value
optimal_idx = np.argmax(val_mean)
optimal_depth = param_range[optimal_idx]
optimal_score = val_mean[optimal_idx]

print(f"Optimal max_depth: {optimal_depth}")
print(f"Validation score at optimal: {optimal_score:.4f}")

# Plot
plt.figure(figsize=(12, 6))
plt.plot(param_range, train_mean, 'o-', color='blue', label='Training score', linewidth=2)
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, 
                 alpha=0.1, color='blue')
plt.plot(param_range, val_mean, 'o-', color='red', label='Validation score', linewidth=2)
plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, 
                 alpha=0.1, color='red')
plt.axvline(optimal_depth, color='green', linestyle='--', linewidth=2, 
            label=f'Optimal (depth={optimal_depth})')
plt.xlabel('max_depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Validation Curve: Finding Optimal Tree Depth', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 How to read the curve:")
print(f"   - Below depth {optimal_depth}: Underfitting (both scores still improving)")
print(f"   - At depth {optimal_depth}: Sweet spot! (best validation score)")
print(f"   - Above depth {optimal_depth}: If validation flattens or drops while training stays high, the extra depth is overfitting")
print("")
print("🎯 Use validation curves to:")
print("   - Find optimal hyperparameter values")
print("   - Detect overfitting/underfitting")
print("   - Understand model sensitivity to parameters")
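
When several depths score almost the same, `np.argmax` picks whichever is ahead by a hair. One common rule of thumb, which the cell above does not use, is to prefer the simplest setting whose mean validation score is within one standard error of the best. The sketch below applies that rule as an assumption for illustration. The depth range of 1 to 10 and the variable names are choices made here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
depths = np.arange(1, 11)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X_iris, y_iris,
    param_name='max_depth',
    param_range=depths,
    cv=cv,
    scoring='accuracy',
)

val_mean = val_scores.mean(axis=1)
# Standard error of the mean across folds (sample std / sqrt(number of folds))
val_se = val_scores.std(axis=1, ddof=1) / np.sqrt(val_scores.shape[1])

best_idx = int(np.argmax(val_mean))
threshold = val_mean[best_idx] - val_se[best_idx]
# np.argmax on a boolean array returns the first True, i.e. the shallowest qualifying depth
simplest_idx = int(np.argmax(val_mean >= threshold))

print(f"Best mean score at depth {depths[best_idx]}: {val_mean[best_idx]:.3f}")
print(f"Simplest depth within one standard error: {depths[simplest_idx]}")
```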

---

## ⚠️ Part 5: Common Pitfalls & Best Practices

In [None]:
print("⚠️ CROSS-VALIDATION PITFALLS & BEST PRACTICES\n")
print("="*70)

print("\n❌ COMMON MISTAKES:\n")

pitfalls = pd.DataFrame([
    ['Not shuffling data before CV', 
     'Data might be ordered by class',
     'Use shuffle=True in KFold'],
    
    ['Using regular KFold for classification', 
     'Imbalanced folds',
     'Use StratifiedKFold'],
    
    ['Touching test set during CV', 
     'Data leakage!',
     'CV only on training set'],
    
    ['Preprocessing before CV split', 
     'Information leakage from test to train',
     'Use Pipeline or ColumnTransformer'],
    
    ['Using CV to select AND evaluate', 
     'Overly optimistic results',
     'Nested CV or separate test set'],
    
    ['Too few folds (K=2)', 
     'High variance in estimates',
     'Use K=5 or K=10'],
    
    ['Too many folds (K=n)', 
     'Computationally expensive',
     'K=5 or K=10 usually sufficient'],
    
    ['Not setting random_state', 
     'Results not reproducible',
     'Always set random_state']
], columns=['Mistake', 'Why Bad', 'Solution'])

print(pitfalls.to_string(index=False))

print("\n" + "="*70)
print("\n✅ BEST PRACTICES:\n")

print("1. 📊 Data Splitting Strategy:")
print("   Train (60-70%) → Validation (10-20%) → Test (10-20%)")
print("   OR")
print("   Train+Validation (80% with CV) → Test (20% held out)")
print("")

print("2. 🔄 Cross-Validation Guidelines:")
print("   - Use StratifiedKFold for classification")
print("   - Use K=5 or K=10 (good balance)")
print("   - Use shuffle=True (except for time-ordered data)")
print("   - Set random_state for reproducibility")
print("")

print("3. 🎯 Model Selection Workflow:")
print("   Step 1: Split → Train (80%) + Test (20%)")
print("   Step 2: Use CV on training set to:")
print("           - Compare models")
print("           - Tune hyperparameters")
print("   Step 3: Select best model")
print("   Step 4: Train on full training set")
print("   Step 5: Evaluate ONCE on test set (final score)")
print("")

print("4. 📈 Reporting Results:")
print("   ✅ Report mean ± std dev from CV")
print("   ✅ Show multiple metrics (accuracy, F1, etc.)")
print("   ✅ Include confidence intervals")
print("   ✅ Report final test set score separately")
print("   ❌ Don't just report best single score")
print("")

print("5. 🔍 When to Use What:")
print("   - Small dataset (<1000 samples): K=10 or Leave-One-Out")
print("   - Medium dataset (1000-10000): K=5 or K=10")
print("   - Large dataset (>10000): K=5 or simple train/val/test split")
print("   - Time series: TimeSeriesSplit (not covered today)")
print("")

print("="*70)
print("\n💡 Golden Rule:")
print("   Use CV to SELECT models, use test set to EVALUATE final model!")
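
The five-step workflow above, combined with the "use Pipeline" advice from the pitfalls table, can be sketched end to end as follows. This is a minimal example under stated assumptions: the Wine dataset, two candidate models, and `StandardScaler` as the preprocessing step are all choices made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Step 1: hold out a test set that cross-validation never sees
X_train, X_test, y_train, y_test = train_test_split(
    X_wine, y_wine, test_size=0.2, stratify=y_wine, random_state=42
)

# Step 2: compare candidates with CV on the training set only.
# The scaler sits inside each pipeline, so it is re-fit on the training folds of every split.
candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'K-Nearest Neighbors': make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_means = {
    name: cross_val_score(pipe, X_train, y_train, cv=cv, scoring='accuracy').mean()
    for name, pipe in candidates.items()
}
for name, mean_score in cv_means.items():
    print(f"{name:25} CV accuracy: {mean_score:.3f}")

# Step 3: select the model with the best mean CV score
best_name = max(cv_means, key=cv_means.get)

# Step 4: refit the selected pipeline on the full training set
final_model = candidates[best_name].fit(X_train, y_train)

# Step 5: evaluate ONCE on the held-out test set
test_accuracy = final_model.score(X_test, y_test)
print(f"\nSelected: {best_name}")
print(f"Final test accuracy: {test_accuracy:.3f}")
```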

In [None]:
print("🎉 Congratulations! You've mastered model evaluation and selection!")
print("")
print("📚 Week 7 Complete! You learned:")
print("")
print("   Notebook 01: ✅ ML Fundamentals & Lifecycle")
print("   Notebook 02: ✅ Classification & Regression Metrics")
print("   Notebook 03: ✅ Cross-Validation & Model Selection")
print("")
print("🎯 You now have a complete ML evaluation framework!")
print("")
print("🚀 Next Week: Linear Models & Tree-Based Methods")
print("   Apply everything you learned to real algorithms!")
print("")
print("💪 You're ready to build production-quality ML models!")