# Cross-validation: Reliable Model Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/28_cross_validation.ipynb)

This companion notebook provides hands-on exercises for the **Cross-validation** chapter. You'll learn why repeatedly evaluating on the test set leads to contamination, how k-fold cross-validation solves this problem, and how to implement the proper 5-stage workflow for honest model evaluation.

**What you'll practice**
- Understand test set contamination through examples
- Visualize how k-fold cross-validation works
- Implement cross-validation using `cross_val_score()` and `cross_validate()`
- Compare multiple models using CV instead of the test set
- Apply the proper 5-stage workflow: split, CV comparison, select, train, evaluate
- Choose appropriate k values and scoring metrics

**How to use**
- Run from top to bottom. When you see **🏃‍♂️ Try It Yourself**, add your code beneath the prompt.
- In Colab: `Runtime → Restart and run all` to test from a clean environment.

## 0) Setup

Install and import the required packages.

In [None]:
# If using Colab/a fresh env, uncomment to install
# !pip -q install scikit-learn pandas numpy matplotlib ISLP

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ISLP import load_data

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, load_breast_cancer

# Set random seed for reproducibility
np.random.seed(42)

## 1) The Peeking Problem: Demonstrating Test Set Contamination

Let's see exactly what happens when you repeatedly evaluate models on the test set to select the "best" one. This simulates a realistic scenario where you're trying different hyperparameter combinations.

In [None]:
# Create synthetic customer churn dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=2,
    weights=[0.8, 0.2],  # 20% churn rate
    random_state=42
)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

In [None]:
# THE WRONG WAY: Peeking at test set 20 times to select hyperparameters
best_test_score = 0
best_config = None
peek_count = 0

for n_estimators in [50, 100, 200, 300]:
    for max_depth in [5, 10, 15, 20, None]:
        rf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42
        )
        rf.fit(X_train, y_train)
        test_score = rf.score(X_test, y_test)  # ⚠️ Peeking!
        peek_count += 1
        
        if test_score > best_test_score:
            best_test_score = test_score
            best_config = (n_estimators, max_depth)

print(f"⚠️ WARNING: Peeked at test set {peek_count} times!")
print(f"Best config found: n_estimators={best_config[0]}, max_depth={best_config[1]}")
print(f"Test accuracy: {best_test_score:.3f}")
print("\n❌ This test score is CONTAMINATED - we can't trust it!")

**What went wrong?**

We selected the model specifically because it performed best on this test set. Some of that performance is genuine, but some is just luck: the model happened to work well with these specific test observations. Because the reported score is the maximum over 20 noisy estimates, it is biased upward, so performance will likely drop when the model is deployed to production.

This is **test set contamination** in action.
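To see how much of a "best" score can be pure luck, here is a minimal sketch under a deliberately artificial assumption: the labels are coin flips and every "model" guesses at random, so the true accuracy of each one is 50%. The sizes mirror the example above (300 test observations, 20 configurations); the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_test = 300     # matches the size of the test set above
n_configs = 20   # matches the number of configurations tried above

# Assumption for illustration: labels are coin flips and each "model"
# guesses at random, so no model has any real skill.
y_true = rng.integers(0, 2, size=n_test)
guesses = rng.integers(0, 2, size=(n_configs, n_test))

# Accuracy of each random "model" on the same test set
test_accuracies = (guesses == y_true).mean(axis=1)

print(f"Average accuracy across configs: {test_accuracies.mean():.3f}")
print(f"Best accuracy after 20 peeks:    {test_accuracies.max():.3f}")
```

The gap between the average and the best score is selection bias alone. The same effect inflates `best_test_score` in the search above, on top of whatever genuine skill the model has.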

### 🏃‍♂️ Try It Yourself
- Create a new random train/test split (use `random_state=99`) and re-run the same hyperparameter search
- Does the same configuration win? 
- What does this tell you about the stability of selecting models based on test set performance?

## 2) The Solution: Cross-Validation

Cross-validation solves the contamination problem by creating validation estimates entirely within the training set. Let's see how k-fold CV works.

### Visualizing 5-Fold Cross-Validation

In 5-fold CV:
- Training set is split into 5 folds of (roughly) equal size
- Each fold takes a turn being the validation set
- Model is trained on the other 4 folds
- We get 5 performance scores, then average them

In [None]:
import matplotlib.patches as mpatches

# Visualize 5-fold CV
fig, ax = plt.subplots(figsize=(9, 5))

train_color = '#4CAF50'  # Green
val_color = '#FFC107'     # Amber

k = 5
fold_labels = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5']

# Create visualization for each iteration
for iteration in range(k):
    y_pos = k - iteration - 1
    
    for fold in range(k):
        if fold == iteration:
            color = val_color
            label = 'Validate'
        else:
            color = train_color
            label = 'Train'
        
        rect = mpatches.Rectangle((fold, y_pos), 1, 0.8,
                                   facecolor=color,
                                   edgecolor='black',
                                   linewidth=2)
        ax.add_patch(rect)
        
        ax.text(fold + 0.5, y_pos + 0.4, label,
                ha='center', va='center',
                fontsize=10, fontweight='bold')
    
    # Add iteration label
    ax.text(-0.5, y_pos + 0.4, f'Iteration {iteration + 1}',
            ha='right', va='center', fontsize=11, fontweight='bold')
    ax.text(5.5, y_pos + 0.4, f'→ Score {iteration + 1}',
            ha='left', va='center', fontsize=10, style='italic')

# Add fold labels at top
for fold in range(k):
    ax.text(fold + 0.5, k + 0.2, fold_labels[fold],
            ha='center', va='bottom', fontsize=11, fontweight='bold')

# Add final averaging step
ax.text(2.5, -0.8, 'Final CV Score = Average of 5 scores',
        ha='center', va='top', fontsize=12, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='lightblue', 
                  edgecolor='black', linewidth=2))

ax.set_xlim(-1, 6.5)
ax.set_ylim(-1.2, k + 0.5)
ax.set_aspect('equal')
ax.axis('off')

plt.title('5-Fold Cross-Validation: Each Fold Takes a Turn as Validation Set',
          fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

**Key insight**: Every observation in the training set is used for:
- **Training** in 4 out of 5 iterations (80%)
- **Validation** in 1 out of 5 iterations (20%)

This maximizes data efficiency while providing honest performance estimates!
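The 4-out-of-5 and 1-out-of-5 counts can be checked directly with scikit-learn's `KFold` splitter. This is a small sketch on a toy array of 20 rows (the size and variable names are chosen only for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20  # toy size, chosen for illustration
X_toy = np.arange(n_samples).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each row lands in the training and validation parts
train_counts = np.zeros(n_samples, dtype=int)
val_counts = np.zeros(n_samples, dtype=int)

for i, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    train_counts[train_idx] += 1
    val_counts[val_idx] += 1
    print(f"Iteration {i}: validate on rows {sorted(val_idx.tolist())}")

print("Times each row was used for training:  ", train_counts)
print("Times each row was used for validation:", val_counts)
```

Every row should show up 4 times for training and exactly once for validation, which is the bookkeeping behind the diagram above.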

## 3) Implementing Cross-Validation with scikit-learn

Let's use the Default dataset to practice implementing CV.

In [None]:
# Load Default dataset
Default = load_data('Default')

# Prepare features and target
X = pd.get_dummies(Default[['balance', 'income', 'student']], drop_first=True)
y = (Default['default'] == 'Yes').astype(int)

# Create train/test split (lock away test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} observations")
print(f"Test set: {len(X_test)} observations (🔒 locked away)")
print(f"Default rate: {y_train.mean():.1%}")

### Basic Cross-Validation with `cross_val_score()`

In [None]:
# Create a model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform 5-fold cross-validation on TRAINING SET ONLY
cv_scores = cross_val_score(
    estimator=model,
    X=X_train,             # Training features only!
    y=y_train,             # Training labels only!
    cv=5,                  # Number of folds
    scoring='accuracy'     # Metric to compute
)

print(f"Individual fold scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
print(f"Std deviation: {cv_scores.std():.3f}")
print(f"\n✓ Test set was never touched!")

**Interpretation:**
- Mean: Our best estimate of model performance on new data
- Std dev: How variable performance is across different data splits
- Low std dev = stable, consistent model
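To make the fold scores less of a black box, the sketch below reproduces what `cross_val_score()` does for a classifier when `cv=5`: split with `StratifiedKFold`, fit a fresh copy of the model on four folds, and score it on the fifth. It uses synthetic data and a logistic regression purely for illustration; `X_demo`, `y_demo`, and `clf` are hypothetical names, not part of the notebook's workflow.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Manual 5-fold loop: fit on 4 folds, score on the held-out fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(clf)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The one-line equivalent
library_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')

print("Manual loop:    ", manual_scores.round(3))
print("cross_val_score:", library_scores.round(3))
```

The two arrays should match, and their mean and standard deviation are the summary statistics interpreted above.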

### 🏃‍♂️ Try It Yourself
- Try different k values: `cv=3` and `cv=10`
- How does the mean CV score change?
- How does the standard deviation change?
- Which k value gives the most stable estimates?

### Using Different Scoring Metrics

In [None]:
# Try different metrics for classification
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

print("Cross-Validation with Different Metrics:")
print("=" * 50)

for metric in metrics:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring=metric)
    print(f"{metric:12s}: {scores.mean():.3f} (±{scores.std():.3f})")

print("=" * 50)

### Advanced: Multiple Metrics with `cross_validate()`

In [None]:
# Evaluate multiple metrics simultaneously
cv_results = cross_validate(
    model, X_train, y_train,
    cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True  # Also get training scores
)

print("Detailed Cross-Validation Results:")
print("=" * 60)
print(f"CV Accuracy:  {cv_results['test_accuracy'].mean():.3f}")
print(f"CV Precision: {cv_results['test_precision'].mean():.3f}")
print(f"CV Recall:    {cv_results['test_recall'].mean():.3f}")
print(f"CV F1:        {cv_results['test_f1'].mean():.3f}")
print(f"CV ROC AUC:   {cv_results['test_roc_auc'].mean():.3f}")
print("\n--- Checking for overfitting ---")
print(f"Train Accuracy: {cv_results['train_accuracy'].mean():.3f}")
print(f"CV Accuracy:    {cv_results['test_accuracy'].mean():.3f}")
print(f"Gap:            {cv_results['train_accuracy'].mean() - cv_results['test_accuracy'].mean():.3f}")
print("=" * 60)

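**Detecting overfitting**: If train scores are much higher than CV scores, your model is likely overfitting: it has memorized patterns in the training folds that do not carry over to the held-out fold.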
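As a rough illustration of the train-versus-CV gap, this sketch compares an unconstrained decision tree with a very shallow one on noisy synthetic data. The dataset, the helper `train_cv_gap`, and the depth values are assumptions chosen for the demo, not part of the Default analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y randomly flips 10% of the labels)
X_gap, y_gap = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    flip_y=0.1, random_state=0
)

def train_cv_gap(max_depth):
    """Return mean train accuracy, mean CV accuracy, and their gap."""
    results = cross_validate(
        DecisionTreeClassifier(max_depth=max_depth, random_state=0),
        X_gap, y_gap, cv=5, scoring='accuracy', return_train_score=True
    )
    train_acc = results['train_score'].mean()
    cv_acc = results['test_score'].mean()
    return train_acc, cv_acc, train_acc - cv_acc

deep_train, deep_cv, deep_gap = train_cv_gap(max_depth=None)
shallow_train, shallow_cv, shallow_gap = train_cv_gap(max_depth=2)

print(f"Unconstrained tree: train={deep_train:.3f}  CV={deep_cv:.3f}  gap={deep_gap:.3f}")
print(f"Depth-2 tree:       train={shallow_train:.3f}  CV={shallow_cv:.3f}  gap={shallow_gap:.3f}")
```

The unconstrained tree can fit the training folds almost perfectly, noise included, so its gap should be much larger than the shallow tree's. A large gap is a warning sign rather than a verdict: the model with the larger gap can still have the better CV score.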

## 4) Comparing Models: The RIGHT Way

Now let's compare multiple models using CV instead of peeking at the test set.

In [None]:
# Create synthetic data with non-linear patterns
X_comp, y_comp = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    n_redundant=0,
    n_clusters_per_class=3,
    flip_y=0.1,
    class_sep=0.5,
    random_state=42
)

# Add polynomial interactions
X_poly = np.column_stack([
    X_comp,
    X_comp[:, 0] * X_comp[:, 1],
    X_comp[:, 2] ** 2,
    X_comp[:, 3] * X_comp[:, 4]
])

# Train/test split (lock away test set)
X_train_comp, X_test_comp, y_train_comp, y_test_comp = train_test_split(
    X_poly, y_comp, test_size=0.2, random_state=42, stratify=y_comp
)

print(f"Training set: {len(X_train_comp)} observations")
print(f"Test set: {len(X_test_comp)} observations (🔒 locked away)")

In [None]:
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Decision Tree (depth=10)': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest (n=50)': RandomForestClassifier(n_estimators=50, random_state=42),
    'Random Forest (n=100)': RandomForestClassifier(n_estimators=100, random_state=42),
    'Random Forest (n=200)': RandomForestClassifier(n_estimators=200, random_state=42)
}

# Evaluate each with 5-fold CV (NEVER touching test set)
print("Model Comparison Using Cross-Validation:")
print("=" * 70)
print(f"{'Model':<30s} {'Mean ROC AUC':<15s} {'Std Dev'}")
print("=" * 70)

cv_results_all = {}
for name, model in models.items():
    cv_scores = cross_val_score(
        model, X_train_comp, y_train_comp,
        cv=5,
        scoring='roc_auc'
    )
    
    cv_results_all[name] = {
        'mean': cv_scores.mean(),
        'std': cv_scores.std(),
        'model': model
    }
    
    print(f"{name:<30s} {cv_scores.mean():.4f}          (±{cv_scores.std():.4f})")

print("=" * 70)
print("\n✓ All comparisons done WITHOUT touching test set!")

### 🏃‍♂️ Try It Yourself
- Add a Random Forest with `max_depth=5` to the comparison
- Does limiting depth help or hurt performance?
- Which model would you select based on CV scores?

## 5) The Complete Proper Workflow (5 Stages)

Let's put everything together: the professional workflow you'll use for every ML project.

In [None]:
# Load breast cancer dataset for demonstration
data = load_breast_cancer()
X, y = data.data, data.target

print("Starting Complete 5-Stage Workflow...\n")

In [None]:
# ========== STAGE 1: Initial Setup ==========
print("STAGE 1: Initial Setup - Split data and lock away test set")
print("=" * 60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print("🔒 Test set locked away\n")

In [None]:
# ========== STAGE 2: Model Development ==========
print("STAGE 2: Model Development - Compare models using CV")
print("=" * 60)

models_workflow = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    # random_state makes the tree results reproducible across runs
    'Decision Tree (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Decision Tree (depth=10)': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

cv_results_workflow = {}
for name, model in models_workflow.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_results_workflow[name] = {
        'mean': cv_scores.mean(),
        'std': cv_scores.std(),
        'model': model
    }
    print(f"{name:30s} {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

print("")

In [None]:
# ========== STAGE 3: Select Best ==========
print("STAGE 3: Select Best - Choose model with highest CV score")
print("=" * 60)
best_name = max(cv_results_workflow, key=lambda k: cv_results_workflow[k]['mean'])
best_model = cv_results_workflow[best_name]['model']
best_cv_score = cv_results_workflow[best_name]['mean']
print(f"Selected model: {best_name}")
print(f"CV score: {best_cv_score:.4f}\n")

In [None]:
# ========== STAGE 4: Train Final Model ==========
print("STAGE 4: Train Final Model - Retrain on all training data")
print("=" * 60)
best_model.fit(X_train, y_train)
print(f"Trained {best_name} on {len(X_train)} training samples\n")

In [None]:
# ========== STAGE 5: Final Evaluation ==========
print("STAGE 5: Final Evaluation - Test set evaluation (ONLY ONCE)")
print("=" * 60)
test_score = best_model.score(X_test, y_test)
print(f"Cross-validation score (training): {best_cv_score:.4f}")
print(f"Test score (held-out data):        {test_score:.4f}")
print(f"\nDifference: {abs(test_score - best_cv_score):.4f}")

if abs(test_score - best_cv_score) < 0.02:
    print("✓ CV and test scores are close - good sign!")
else:
    print("⚠️ Larger gap between CV and test - check for issues")

print("\n🔓 Test set has now been used - project complete!")

### Key Takeaways from the 5-Stage Workflow:

1. **Test set touches = 1** (only in Stage 5)
2. All model comparison happens in Stage 2 using CV
3. Selection (Stage 3) is based on CV scores, NOT test scores
4. Final model (Stage 4) uses all training data
5. Test evaluation (Stage 5) gives an honest estimate of production performance

---

## ✅ Summary: The Proper Cross-Validation Workflow

**The Golden Rule**: Test set touches = EXACTLY ONE

**The 5-Stage Workflow:**
1. Split data → Lock away test set
2. Compare models → Use CV on training set ONLY
3. Select best → Based on CV scores
4. Train final model → On all training data
5. Evaluate once → On test set

**Key Functions:**
- `cross_val_score()` - Single metric evaluation
- `cross_validate()` - Multiple metrics + training scores

**Best Practices:**
- Use k=5 as default (k=10 for small datasets)
- Use stratified CV for classification (automatic in scikit-learn when `cv` is an integer and the estimator is a classifier)
- Choose scoring metric appropriate for your problem
- Check train vs CV scores to detect overfitting
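When you want control over shuffling or the random seed, you can pass a splitter object to `cv` instead of an integer. The sketch below, on imbalanced synthetic data (about 10% positives, an assumption for illustration), uses `StratifiedKFold` explicitly and checks that each validation fold keeps roughly the overall class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data, for illustration only
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Explicit stratified splitter with shuffling and a fixed seed
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive-class rate in each validation fold
fold_rates = np.array([
    y_imb[val_idx].mean() for _, val_idx in cv_splitter.split(X_imb, y_imb)
])

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_imb, y_imb,
    cv=cv_splitter, scoring='roc_auc'
)

print(f"Overall positive rate:  {y_imb.mean():.3f}")
print(f"Per-fold positive rate: {fold_rates.round(3)}")
print(f"Mean CV ROC AUC:        {scores.mean():.3f} (±{scores.std():.3f})")
```

Reusing the same splitter object (with a fixed `random_state`) for every model keeps the comparison fair, because each model is evaluated on identical folds.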

---

## ✅ End-of-Chapter Exercises

These exercises ask you to revisit your previous work using the proper cross-validation workflow.

### Exercise 1: Fixing the Baseball Salary Predictions

Revisit your Chapter 26 Exercise 1 work on the Hitters dataset using proper cross-validation.

**Your tasks:**

**Part A: Identify the problem**
1. How many times did you evaluate on the test set in your original work?
2. Why is this problematic?

**Part B: Implement proper workflow**
1. Load Hitters data and create train/test split
2. Compare 4+ model configurations using 5-fold CV on training set:
   - Decision tree with different depths
   - Random Forest with different parameters
3. Select best model based on CV scores
4. Train final model on full training set
5. Evaluate on test set ONCE

**Part C: Analysis**
1. Is the CV score close to the test score?
2. Which estimate would you trust for production?
3. Try k=3 and k=10. Does the same model win?

In [None]:
# TODO: Your code for Exercise 1 here
Hitters = load_data('Hitters')
# ... continue implementation

### Exercise 2: Proper Default Risk Assessment

Apply the complete 5-stage workflow to the Default dataset.

**Your tasks:**

**Part A: Clean slate with proper workflow**
1. Load Default dataset
2. Create train/test split (80/20, stratified)
3. Define 5 different models to compare
4. Use 5-fold stratified CV to evaluate all models
5. Report accuracy, precision, recall, F1, and ROC AUC

**Part B: Selection and evaluation**
1. Select best model (consider: which metric matters most for credit default?)
2. Train final model on full training set
3. Evaluate on test set once
4. Create confusion matrix
5. Is test performance consistent with CV?

**Part C: Business communication**
Write a memo explaining:
- Which model you selected and why
- Expected production accuracy (and why it's trustworthy)
- Why this approach is better than repeatedly using the test set

In [None]:
# TODO: Your code for Exercise 2 here

### Exercise 3: The Ames Housing Challenge

Apply the complete workflow to the Ames housing dataset (regression).

**Your tasks:**

**Part A: Setup**
1. Load Ames data from URL: `https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv`
2. Select 8+ features
3. Handle missing values
4. Create train/test split (80/20)

**Part B: Model comparison**
1. Compare 5+ approaches (linear regression, decision trees, random forests with different parameters)
2. Use 5-fold CV with R² and RMSE metrics
3. Create visualization comparing models
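One detail worth knowing for Part B: scikit-learn scorers follow a "higher is better" convention, so error metrics such as RMSE are exposed in negated form (`'neg_root_mean_squared_error'`). Here is a minimal sketch on synthetic regression data (not the Ames data; all names are illustrative) showing how to request both metrics and flip the sign back.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data, for illustration only
X_reg, y_reg = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=0
)

reg_results = cross_validate(
    LinearRegression(), X_reg, y_reg,
    cv=5,
    scoring=['r2', 'neg_root_mean_squared_error']
)

cv_r2 = reg_results['test_r2'].mean()
# Scores are negated RMSE values, so flip the sign to report RMSE
cv_rmse = -reg_results['test_neg_root_mean_squared_error'].mean()

print(f"CV R²:   {cv_r2:.3f}")
print(f"CV RMSE: {cv_rmse:.2f}")
```

When ranking models by the raw `neg_root_mean_squared_error` score, the highest (closest to zero) value is still the best model.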

**Part C: Final evaluation**
1. Select best model based on CV
2. Train on full training set
3. Evaluate on test set ONCE
4. Compare CV estimate to test performance
5. Create predicted vs actual scatter plot

**Part D: Critical analysis**
1. Try the WRONG way: select based on test set performance
2. Compare test scores from wrong vs right approach
3. Which would you trust for production?

In [None]:
# TODO: Your code for Exercise 3 here
# ames_url = "https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv"
# ames = pd.read_csv(ames_url)
# ...