# Day 14: Model Selection and Comparison - Making the Call Like a Professional

**MGMT 47400 - Predictive Analytics**  
**4-Week Online Course**  
**Day 14 - Friday June 4, 2027**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/14_model_selection_protocol.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Build a standardized model comparison workflow (same CV, same metric)
2. Use multiple metrics without "metric shopping"
3. Select a champion model and justify it (performance, stability, interpretability, cost)
4. Create a reproducible experiment log table
5. Prepare an improved-model plan for your project submission

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import make_scorer, roc_auc_score, accuracy_score, f1_score
import time
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

## 1. The Model Selection Problem

### Common Mistakes

**❌ Wrong:**
- Comparing models on different data splits
- Using different metrics for different models
- Choosing a model based on test set performance
- "Metric shopping" (trying metrics until one model wins)
- Ignoring computational cost
- Not documenting why a model was chosen

**✓ Right:**
- Same cross-validation folds for all models
- Single primary metric (with supporting metrics)
- Select based on CV, validate on test
- Document selection criteria upfront
- Consider multiple factors (performance, stability, cost, interpretability)
- Create reproducible experiment logs
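
To make the "choosing a model based on test set performance" mistake concrete, here is a minimal simulation on purely synthetic data. Every "model" is a coin flip, so its true accuracy is 0.5. The names `n_models`, `n_test`, and `n_fresh` are illustrative assumptions, not part of the course materials.

```python
import numpy as np

rng = np.random.default_rng(42)

n_models = 50      # hypothetical number of candidate "models"
n_test = 100       # size of the repeatedly reused test set
n_fresh = 100_000  # fresh sample standing in for future data

# Labels are random and every "model" guesses at random,
# so no model can truly beat 50% accuracy.
y_test = rng.integers(0, 2, size=n_test)
test_preds = rng.integers(0, 2, size=(n_models, n_test))

# Score every model on the SAME test set and keep the winner.
test_acc = (test_preds == y_test).mean(axis=1)
best_idx = int(test_acc.argmax())
best_test_acc = float(test_acc[best_idx])

# On data that played no role in selection, the winner is still a coin flip.
y_fresh = rng.integers(0, 2, size=n_fresh)
fresh_preds = rng.integers(0, 2, size=n_fresh)
fresh_acc = float((fresh_preds == y_fresh).mean())

print(f"Best-of-{n_models} accuracy on the reused test set: {best_test_acc:.3f}")
print(f"Winner's accuracy on fresh data: {fresh_acc:.3f}")
```

The gap between the two numbers is pure selection bias: the test score of the winner looks good only because the test set was used to pick it.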

### Fair Comparison Protocol

```
1. Define evaluation criteria BEFORE training
   - Primary metric
   - Supporting metrics
   - Constraints (time, interpretability, etc.)

2. Use SAME cross-validation for all models
   - Same CV object
   - Same random seed
   - Same data

3. Compare on PRIMARY metric
   - Use supporting metrics to break ties
   - Document tradeoffs

4. Select champion, THEN evaluate on test
   - Test set touched ONCE
```
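
Step 2 of the protocol can be checked directly. The sketch below assumes the breast cancer dataset and a 5-fold `StratifiedKFold`. It confirms that a splitter with `shuffle=True` and an integer `random_state` yields the same folds on every call, while a different seed (the value `7` is arbitrary) does not.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# With an integer random_state, repeated calls to split() are deterministic.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds_a = [test_idx for _, test_idx in cv.split(X, y)]
folds_b = [test_idx for _, test_idx in cv.split(X, y)]

same_folds = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
print(f"Identical folds on repeated split(): {same_folds}")

# A different seed produces different folds, so scores computed with it
# would not be directly comparable fold by fold.
cv_other = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
folds_c = [test_idx for _, test_idx in cv_other.split(X, y)]
differs = any(not np.array_equal(a, c) for a, c in zip(folds_a, folds_c))
print(f"Folds change with a different seed: {differs}")
```

Passing one such `cv` object to every model evaluation is what makes fold-level scores comparable.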

In [None]:
# Load data
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y)

print(f"Train: {len(X_train)} | Test: {len(X_test)} (LOCKED until final selection)")

## 2. Build Model Comparison Harness

### Comparison Harness Function

A harness ensures a fair comparison by:
- Using the same CV folds for every model
- Evaluating the same metrics for every model
- Tracking fit/score times
- Returning structured results
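
Before building the harness, it may help to see what `cross_validate` returns when it is given several metrics. This is a minimal sketch on the breast cancer data with a scaled logistic regression; the model choice is only an illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# A list of scorer names yields one test_<metric> (and train_<metric>) entry each.
cv_results = cross_validate(
    model, X, y, cv=cv,
    scoring=["roc_auc", "accuracy"],
    return_train_score=True,
)

# Every entry is an array with one value per fold.
for key in sorted(cv_results):
    print(f"{key:>15}: shape={cv_results[key].shape}")
```

The harness reads the `test_<metric>`, `train_<metric>`, and `fit_time` entries of this dictionary and reduces each to a mean and standard deviation.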

In [None]:
def compare_models_comprehensive(models_dict, X, y, cv, scoring_metrics, primary_metric='roc_auc'):
    """
    Comprehensive model comparison with multiple metrics.
    
    Parameters:
    -----------
    models_dict : dict
        Dictionary of {name: model} pairs
    X, y : array-like
        Training data
    cv : cross-validator
        CV splitter (SAME for all models)
    scoring_metrics : list
        List of metric names to evaluate
    primary_metric : str
        Primary metric for ranking
    
    Returns:
    --------
    pd.DataFrame : Comparison table
    """
    results = []
    
    for name, model in models_dict.items():
        print(f"Evaluating: {name}...")
        
        # Time the evaluation
        start_time = time.time()
        
        # Cross-validate with multiple metrics
        cv_results = cross_validate(
            model, X, y, cv=cv,
            scoring=scoring_metrics,
            return_train_score=True,
            n_jobs=-1
        )
        
        elapsed_time = time.time() - start_time
        
        # Build result row
        row = {'Model': name}
        
        # Add metrics
        for metric in scoring_metrics:
            test_scores = cv_results[f'test_{metric}']
            train_scores = cv_results[f'train_{metric}']
            row[f'{metric}_cv_mean'] = test_scores.mean()
            row[f'{metric}_cv_std'] = test_scores.std()
            row[f'{metric}_train_mean'] = train_scores.mean()
        
        # Add timing
        row['fit_time_mean'] = cv_results['fit_time'].mean()
        row['total_cv_time'] = elapsed_time
        
        # Overfitting gap for primary metric
        row['overfit_gap'] = row[f'{primary_metric}_train_mean'] - row[f'{primary_metric}_cv_mean']
        
        results.append(row)
    
    # Create DataFrame and sort by primary metric
    df = pd.DataFrame(results).sort_values(f'{primary_metric}_cv_mean', ascending=False)
    
    return df

print("✓ Comparison harness ready")

## 3. Define Candidate Models

In [None]:
# Define candidate models
candidate_models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=1.0, random_state=RANDOM_SEED, max_iter=1000))
    ]),
    
    'Logistic (L1)': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=RANDOM_SEED))
    ]),
    
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=RANDOM_SEED),
    
    'Random Forest': RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=RANDOM_SEED, n_jobs=-1
    ),
    
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=RANDOM_SEED
    )
}

print(f"Candidate models: {len(candidate_models)}")
for name in candidate_models.keys():
    print(f"  - {name}")

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Implement the comparison harness for 3 candidate models.

---

In [None]:
# Define evaluation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
metrics = ['roc_auc', 'accuracy', 'f1']
primary_metric = 'roc_auc'

print("=== EVALUATION PROTOCOL ===")
print(f"Cross-validation: {cv}")
print(f"Primary metric: {primary_metric}")
print(f"Supporting metrics: {[m for m in metrics if m != primary_metric]}")
print(f"\nRunning comparison...\n")

# Run comparison
comparison_df = compare_models_comprehensive(
    candidate_models, X_train, y_train, cv, metrics, primary_metric
)

print("\n=== COMPREHENSIVE MODEL COMPARISON ===")
display_cols = ['Model', 'roc_auc_cv_mean', 'roc_auc_cv_std', 'accuracy_cv_mean', 
                'f1_cv_mean', 'overfit_gap', 'total_cv_time']
print(comparison_df[display_cols].to_string(index=False))

## 4. Multi-Metric Reporting

In [None]:
# Visualize metric comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# ROC-AUC
axes[0, 0].barh(comparison_df['Model'], comparison_df['roc_auc_cv_mean'], 
                xerr=comparison_df['roc_auc_cv_std'], alpha=0.7)
axes[0, 0].set_xlabel('ROC-AUC')
axes[0, 0].set_title('ROC-AUC (Primary Metric)')
axes[0, 0].invert_yaxis()

# Accuracy
axes[0, 1].barh(comparison_df['Model'], comparison_df['accuracy_cv_mean'],
                xerr=comparison_df['accuracy_cv_std'], alpha=0.7, color='orange')
axes[0, 1].set_xlabel('Accuracy')
axes[0, 1].set_title('Accuracy')
axes[0, 1].invert_yaxis()

# F1
axes[1, 0].barh(comparison_df['Model'], comparison_df['f1_cv_mean'],
                xerr=comparison_df['f1_cv_std'], alpha=0.7, color='green')
axes[1, 0].set_xlabel('F1 Score')
axes[1, 0].set_title('F1 Score')
axes[1, 0].invert_yaxis()

# Training time
axes[1, 1].barh(comparison_df['Model'], comparison_df['total_cv_time'], alpha=0.7, color='red')
axes[1, 1].set_xlabel('Time (seconds)')
axes[1, 1].set_title('Total CV Time')
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

print("💡 Look for consistency across metrics")
print("💡 Consider time/performance tradeoffs")
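
The protocol says to use supporting metrics to break ties. One possible way to encode that rule is sketched below. The scores are made-up illustrative values (not results from this notebook), and `tie_tolerance` is an assumed threshold that should be fixed before any model is trained.

```python
import pandas as pd

# Illustrative, made-up scores (NOT results from this notebook)
scores = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "roc_auc_cv_mean": [0.9912, 0.9908, 0.9801],
    "f1_cv_mean": [0.9700, 0.9760, 0.9500],
    "fit_time_mean": [1.20, 0.05, 0.30],
})

tie_tolerance = 0.001  # assumed threshold, decided BEFORE looking at results

# Models within the tolerance of the best primary score count as tied.
best_primary = scores["roc_auc_cv_mean"].max()
tied = scores[scores["roc_auc_cv_mean"] >= best_primary - tie_tolerance]

# Among tied models, prefer higher F1, then shorter fit time.
ranked = tied.sort_values(["f1_cv_mean", "fit_time_mean"], ascending=[False, True])
champion = ranked.iloc[0]["Model"]

print(f"Tied on primary metric: {list(tied['Model'])}")
print(f"Champion after tie-break: {champion}")
```

Fixing the tolerance and the tie-break order in advance keeps this from turning into metric shopping.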

## 5. Champion Selection Memo

In [None]:
# Select champion
champion_row = comparison_df.iloc[0]
runner_up_row = comparison_df.iloc[1]

print("=== CHAMPION SELECTION MEMO ===")
print(f"\nSelected Model: {champion_row['Model']}")
print(f"\nPrimary Metric ({primary_metric}):")
print(f"  Champion: {champion_row['roc_auc_cv_mean']:.4f} ± {champion_row['roc_auc_cv_std']:.4f}")
print(f"  Runner-up ({runner_up_row['Model']}): {runner_up_row['roc_auc_cv_mean']:.4f} ± {runner_up_row['roc_auc_cv_std']:.4f}")
print(f"  Advantage: {(champion_row['roc_auc_cv_mean'] - runner_up_row['roc_auc_cv_mean']):.4f}")

print(f"\nSupporting Evidence:")
print(f"  Accuracy: {champion_row['accuracy_cv_mean']:.4f}")
print(f"  F1: {champion_row['f1_cv_mean']:.4f}")
print(f"  Overfit gap: {champion_row['overfit_gap']:.4f}")
print(f"  CV time: {champion_row['total_cv_time']:.2f}s")

print(f"\nJustification:")
print(f"  1. Best {primary_metric} by {(champion_row['roc_auc_cv_mean'] - runner_up_row['roc_auc_cv_mean'])*100:.2f} percentage points")
print(f"  2. Reasonable overfitting ({champion_row['overfit_gap']:.4f} gap)")
print(f"  3. Acceptable training time ({champion_row['total_cv_time']:.1f}s for 5-fold CV)")
print(f"  4. Consistent performance across folds (std = {champion_row['roc_auc_cv_std']:.4f})")

print(f"\nRisks/Limitations:")
if 'Gradient Boosting' in champion_row['Model'] or 'Random Forest' in champion_row['Model']:
    print(f"  - Less interpretable than linear models")
    print(f"  - Requires feature importance analysis for explanations")
if champion_row['total_cv_time'] > 10:
    print(f"  - Longer training time may impact iteration speed")
if champion_row['overfit_gap'] > 0.05:
    print(f"  - Moderate overfitting detected - monitor on new data")

print(f"\nRecommendation: Proceed with {champion_row['Model']} for final evaluation")
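
Because every model is scored on the same folds, the champion's advantage can also be examined fold by fold rather than only through the difference of means. The sketch below is one hedged way to do that. The two models are stand-ins for a champion and runner-up, and for self-containment it uses the full dataset where the notebook would use `X_train` and `y_train`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stand-ins for the champion and runner-up
model_a = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model_b = RandomForestClassifier(n_estimators=100, random_state=42)

# Same cv object, so fold i holds the same rows for both models.
scores_a = cross_val_score(model_a, X, y, cv=cv, scoring="roc_auc")
scores_b = cross_val_score(model_b, X, y, cv=cv, scoring="roc_auc")

diff = scores_a - scores_b
print("Per-fold difference (A - B):", diff.round(4))
print(f"Mean difference: {diff.mean():.4f} | Std of differences: {diff.std():.4f}")
```

If the mean difference is small relative to the spread of the per-fold differences, the "advantage" is within fold-to-fold noise, and the memo should lean on stability, cost, or interpretability instead.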

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Write a champion selection memo (5 bullets + 1 risk).

---

### YOUR CHAMPION SELECTION MEMO:

**Selected Champion:** [Model name]

**Supporting Evidence:**
1. [Primary metric performance]
2. [Supporting metric 1]
3. [Supporting metric 2]
4. [Stability (std)]
5. [Other consideration]

**Key Risk:**
[Most important limitation or risk]

---

## 6. Final Test Set Evaluation

In [None]:
# Train champion on full training set
champion_model = candidate_models[champion_row['Model']]
champion_model.fit(X_train, y_train)

# Evaluate on test set (FIRST AND ONLY TIME)
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, classification_report

y_pred_test = champion_model.predict(X_test)
y_proba_test = champion_model.predict_proba(X_test)[:, 1]

test_roc_auc = roc_auc_score(y_test, y_proba_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
test_f1 = f1_score(y_test, y_pred_test)

print("=== FINAL TEST SET EVALUATION ===")
print(f"\nChampion: {champion_row['Model']}")
print(f"\nTest Set Performance:")
print(f"  ROC-AUC: {test_roc_auc:.4f}")
print(f"  Accuracy: {test_accuracy:.4f}")
print(f"  F1: {test_f1:.4f}")

print(f"\nCV vs Test Comparison:")
print(f"  CV ROC-AUC: {champion_row['roc_auc_cv_mean']:.4f} ± {champion_row['roc_auc_cv_std']:.4f}")
print(f"  Test ROC-AUC: {test_roc_auc:.4f}")
print(f"  Difference: {abs(test_roc_auc - champion_row['roc_auc_cv_mean']):.4f}")

if abs(test_roc_auc - champion_row['roc_auc_cv_mean']) < 2 * champion_row['roc_auc_cv_std']:
    print(f"\n✓ Test performance within expected range (< 2 std from CV mean)")
else:
    print(f"\n⚠️ Test performance differs from CV - investigate further")

print(f"\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_test, target_names=data.target_names))
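
The "within 2 std of the CV mean" check above can be pulled into a small reusable helper. The function name `within_cv_range` and the example values are hypothetical, used only for illustration. Since the standard deviation comes from only five folds, treat the result as a rough heuristic rather than a formal test.

```python
def within_cv_range(test_score, cv_mean, cv_std, n_std=2.0):
    """Return True if test_score lies within n_std CV standard deviations of cv_mean."""
    return abs(test_score - cv_mean) < n_std * cv_std


# Illustrative values, not results from this notebook
close_case = within_cv_range(0.985, cv_mean=0.990, cv_std=0.005)
far_case = within_cv_range(0.950, cv_mean=0.990, cv_std=0.005)

print(f"Test 0.985 vs CV 0.990 ± 0.005: within range = {close_case}")
print(f"Test 0.950 vs CV 0.990 ± 0.005: within range = {far_case}")
```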

## 7. Experiment Log Template

In [None]:
# Create reproducible experiment log
experiment_log = comparison_df.copy()

# Add metadata
experiment_log['date'] = pd.Timestamp.now().strftime('%Y-%m-%d')
experiment_log['dataset'] = 'breast_cancer'
experiment_log['n_samples'] = len(X_train)
experiment_log['n_features'] = X_train.shape[1]
experiment_log['cv_folds'] = cv.n_splits
experiment_log['random_seed'] = RANDOM_SEED

# Mark champion
experiment_log['is_champion'] = experiment_log['Model'] == champion_row['Model']

# Add test score for champion
experiment_log['test_roc_auc'] = experiment_log['Model'].apply(
    lambda x: test_roc_auc if x == champion_row['Model'] else np.nan
)

print("=== EXPERIMENT LOG ===")
log_cols = ['date', 'Model', 'roc_auc_cv_mean', 'roc_auc_cv_std', 'test_roc_auc', 'is_champion']
print(experiment_log[log_cols].to_string(index=False))

# Save to CSV
# experiment_log.to_csv('model_experiments.csv', index=False)
print("\n💡 Experiment log ready for version control")
print("💡 Track all experiments for reproducibility")
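
To track experiments across several runs, the log can be appended to one CSV file instead of being overwritten each time. The sketch below shows one way to do this. The helper `append_to_log` is hypothetical, the rows are made-up values, and a temporary directory is used so nothing is written to your project folder.

```python
import os
import tempfile

import pandas as pd


def append_to_log(df, path):
    """Append rows to a CSV log, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)


# Illustrative rows with made-up values (NOT results from this notebook)
run_1 = pd.DataFrame({"date": ["2027-06-04"], "Model": ["A"], "roc_auc_cv_mean": [0.99]})
run_2 = pd.DataFrame({"date": ["2027-06-05"], "Model": ["B"], "roc_auc_cv_mean": [0.98]})

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "model_experiments.csv")
    append_to_log(run_1, log_path)
    append_to_log(run_2, log_path)
    full_log = pd.read_csv(log_path)

print(full_log)
```

This only works cleanly if every run writes the same columns in the same order, which is one more reason to fix the log schema up front.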

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Fair Comparison**: Same CV, same metrics, same data
2. **Multi-Metric Evaluation**: Primary metric + supporting evidence
3. **Champion Selection**: Documented, justified, reproducible
4. **Test Discipline**: Touch test set once, after selection
5. **Experiment Logging**: Track everything for reproducibility

### Critical Rules:

> **"Same CV folds for all models, always"**

> **"Select on CV, validate on test (once)"**

> **"Document selection criteria before training"**

### Next Steps:

- Day 15: Model interpretation + error analysis + Project Milestone 3
- We'll interpret the champion model
- Deliver improved model for project

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning: with Applications in Python* - Model Assessment and Selection (protocols for fair comparison)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Selection bias and repeated peeking hazards
- scikit-learn User Guide: [Model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)
- scikit-learn User Guide: [Parameter tuning best practices](https://scikit-learn.org/stable/modules/grid_search.html)

---

**End of Day 14 Notebook**