# Model Selection and Comparison - Making the Call Like a Professional

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/14_model_selection_protocol.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Build a standardized model comparison workflow (same CV, same metric)
2. Use multiple metrics without "metric shopping"
3. Select a champion model and justify it (performance, stability, interpretability, cost)
4. Create a reproducible experiment log table
5. Prepare the improved-model plan for your project submission

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import make_scorer, roc_auc_score, accuracy_score, f1_score
import time
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

**Reading the output:**

The setup cell imports `cross_validate` (not `cross_val_score`) because we need to evaluate **multiple metrics** simultaneously and retrieve training scores alongside test scores. The `make_scorer` import is available for custom metrics, though we will use scikit-learn's built-in scoring strings here.

The `time` module is imported to measure wall-clock training time for each model, which becomes an important selection criterion when models have similar accuracy but very different computational costs.

**Key takeaway:** This notebook shifts focus from "how do I build a model" to "how do I systematically choose between models." The imports reflect that shift: the emphasis is on evaluation infrastructure rather than new algorithms.
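
To make the difference from `cross_val_score` concrete, here is a minimal sketch of what `cross_validate` returns when given several scoring strings. The data is a small synthetic set from `make_classification`, used only for illustration; it is not the notebook's dataset, and the variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=cv_demo,
    scoring=["roc_auc", "accuracy"],  # several metrics in one pass
    return_train_score=True,          # adds the train_* entries
)

# The result is a dict with one array of length n_splits per key
for key in sorted(demo_results):
    print(key, demo_results[key].shape)
```

The dictionary holds `fit_time` and `score_time` plus a `test_<metric>` and `train_<metric>` entry for every requested metric, which is the naming the comparison harness in this notebook relies on.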

---

## 1. The Model Selection Problem

### Common Mistakes

**❌ Wrong:**
- Comparing models on different data splits
- Using different metrics for different models
- Choosing model based on test set performance
- "Metric shopping" (trying metrics until one model wins)
- Ignoring computational cost
- Not documenting why a model was chosen

**✓ Right:**
- Same cross-validation folds for all models
- Single primary metric (with supporting metrics)
- Select based on CV, validate on test
- Document selection criteria upfront
- Consider multiple factors (performance, stability, cost, interpretability)
- Create reproducible experiment logs

### Fair Comparison Protocol

```
1. Define evaluation criteria BEFORE training
   - Primary metric
   - Supporting metrics
   - Constraints (time, interpretability, etc.)

2. Use SAME cross-validation for all models
   - Same CV object
   - Same random seed
   - Same data

3. Compare on PRIMARY metric
   - Use supporting metrics to break ties
   - Document tradeoffs

4. Select champion, THEN evaluate on test
   - Test set touched ONCE
```
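
One way to make step 1 of the protocol binding is to write the criteria into an immutable object before any model is trained. The sketch below is one possible approach, not a required part of the workflow: the `EvaluationProtocol` class and the `max_cv_seconds` constraint are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class EvaluationProtocol:
    """Hypothetical container for criteria fixed before any model is trained."""
    primary_metric: str
    supporting_metrics: tuple
    n_splits: int
    random_seed: int
    max_cv_seconds: float  # example constraint; the value below is an assumption

protocol = EvaluationProtocol(
    primary_metric="roc_auc",
    supporting_metrics=("accuracy", "f1"),
    n_splits=5,
    random_seed=474,
    max_cv_seconds=60.0,
)

# Attempting to change the primary metric after the fact raises an error
try:
    protocol.primary_metric = "accuracy"
except FrozenInstanceError:
    print("Protocol is locked; the primary metric cannot be changed.")
```

Freezing the object does not stop anyone from creating a second protocol, but it makes a change of criteria an explicit, visible act rather than a silent edit.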

In [None]:
# Load data
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y)

print(f"Train: {len(X_train)} | Test: {len(X_test)} (LOCKED until final selection)")

**Reading the output:**

The data is split into **455 training** and **114 test** samples using an 80/20 split with stratification. Note that this notebook uses `test_size=0.2` rather than 0.3, giving slightly more training data. The test set is explicitly labeled as "LOCKED until final selection" to reinforce the discipline of not peeking.

The breast cancer dataset has **30 features** and a roughly 63/37 benign-to-malignant class distribution, which is preserved in both splits through stratification.

**Why this matters:** The test set lockdown is the single most important methodological point in this notebook. Every model will be evaluated solely on cross-validation during the selection phase. The test set exists only to provide a final, unbiased generalization estimate after the champion is chosen.
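
The claim that stratification preserves the class balance is easy to check without touching any test-set features or predictions. The sketch below repeats the same split parameters used in this notebook and compares the share of label 1 (benign in this dataset) across the full data and both splits. The variable names are placeholders chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=474, stratify=y_all
)

# Share of label 1 (benign) in the full data and in each split
for label, part in [("full", y_all), ("train", y_tr), ("test", y_te)]:
    print(f"{label:>5}: n={len(part):3d}  benign share={part.mean():.3f}")
```

Looking only at label counts like this does not leak information about model performance, so it is compatible with keeping the test set locked.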

---

## 2. Build Model Comparison Harness

### Comparison Harness Function

A harness ensures fair comparison by:
- Using same CV folds
- Evaluating same metrics
- Tracking fit/score times
- Returning structured results

In [None]:
def compare_models_comprehensive(models_dict, X, y, cv, scoring_metrics, primary_metric='roc_auc'):
    """
    Comprehensive model comparison with multiple metrics.
    
    Parameters:
    -----------
    models_dict : dict
        Dictionary of {name: model} pairs
    X, y : array-like
        Training data
    cv : cross-validator
        CV splitter (SAME for all models)
    scoring_metrics : list
        List of metric names to evaluate
    primary_metric : str
        Primary metric for ranking
    
    Returns:
    --------
    pd.DataFrame : Comparison table
    """
    results = []
    
    for name, model in models_dict.items():
        print(f"Evaluating: {name}...")
        
        # Time the evaluation
        start_time = time.time()
        
        # Cross-validate with multiple metrics
        cv_results = cross_validate(
            model, X, y, cv=cv,
            scoring=scoring_metrics,
            return_train_score=True,
            n_jobs=-1
        )
        
        elapsed_time = time.time() - start_time
        
        # Build result row
        row = {'Model': name}
        
        # Add metrics
        for metric in scoring_metrics:
            test_scores = cv_results[f'test_{metric}']
            train_scores = cv_results[f'train_{metric}']
            row[f'{metric}_cv_mean'] = test_scores.mean()
            row[f'{metric}_cv_std'] = test_scores.std()
            row[f'{metric}_train_mean'] = train_scores.mean()
        
        # Add timing
        row['fit_time_mean'] = cv_results['fit_time'].mean()
        row['total_cv_time'] = elapsed_time
        
        # Overfitting gap for primary metric
        row['overfit_gap'] = row[f'{primary_metric}_train_mean'] - row[f'{primary_metric}_cv_mean']
        
        results.append(row)
    
    # Create DataFrame and sort by primary metric
    df = pd.DataFrame(results).sort_values(f'{primary_metric}_cv_mean', ascending=False)
    
    return df

print("✓ Comparison harness ready")

**Reading the output:**

The cell defines the `compare_models_comprehensive()` function and prints `Comparison harness ready`. No models have been evaluated yet; this cell only sets up the infrastructure.

The function accepts a dictionary of named models, a CV splitter, and a list of scoring metrics. For each model it calls `cross_validate` with `return_train_score=True` to capture both train and validation performance, computes the overfit gap, and records wall-clock time. The results are returned as a sorted DataFrame with the best model on the primary metric at the top.

Using a single harness function guarantees that every model sees the exact same CV folds, the same metrics, and the same timing methodology. This eliminates the most common source of unfair model comparison: inconsistent evaluation setups.

**Key takeaway:** Writing a comparison harness function once and reusing it is a professional best practice. It prevents subtle bugs (e.g., accidentally using different CV objects for different models) that can silently invalidate your conclusions.
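
The "same CV folds" guarantee depends on the splitter having an integer `random_state`. The sketch below, on toy labels invented for illustration, shows that two successive `.split()` calls on one `StratifiedKFold` object then produce identical folds, which is what happens when the harness passes the same `cv` object to `cross_validate` for each model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (illustrative): 30 of class 0, 20 of class 1
y_toy = np.array([0] * 30 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant to the split

cv_shared = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)

# Each call to .split() stands in for one model's cross-validation run
folds_model_a = [test_idx.tolist() for _, test_idx in cv_shared.split(X_toy, y_toy)]
folds_model_b = [test_idx.tolist() for _, test_idx in cv_shared.split(X_toy, y_toy)]

print("Identical folds:", folds_model_a == folds_model_b)
```

With `shuffle=True` and `random_state=None`, successive calls can shuffle differently, so models would no longer be scored on the same folds.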

---

## 3. Define Candidate Models

A fair comparison requires defining all candidate models **before** seeing any results, just as a clinical trial registers its protocol before collecting data. This prevents "model shopping," where you keep trying new models until one happens to score well on this particular data split.

Our candidate pool spans the spectrum from simple to complex: two logistic regression variants (standard L2 and sparse L1), a single decision tree, a Random Forest, and Gradient Boosting. Each model is wrapped in a scikit-learn `Pipeline` where necessary so that preprocessing (e.g., scaling for logistic regression) is included in the cross-validation loop and does not leak information.

In [None]:
# Define candidate models
candidate_models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=1.0, random_state=RANDOM_SEED, max_iter=1000))
    ]),
    
    'Logistic (L1)': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=RANDOM_SEED))
    ]),
    
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=RANDOM_SEED),
    
    'Random Forest': RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=RANDOM_SEED, n_jobs=-1
    ),
    
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=RANDOM_SEED
    )
}

print(f"Candidate models: {len(candidate_models)}")
for name in candidate_models.keys():
    print(f"  - {name}")

**Reading the output:**

Five candidate models are printed: **Logistic Regression** (L2 regularization, C=1.0), **Logistic (L1)** (L1/Lasso regularization, C=0.1 for aggressive sparsity), **Decision Tree** (max_depth=5), **Random Forest** (200 trees, max_depth=10), and **Gradient Boosting** (100 trees, learning rate 0.1, max_depth=3).

The logistic regression models are wrapped in `Pipeline` objects with `StandardScaler` so that feature scaling happens inside each CV fold, preventing data leakage. The tree-based models do not need scaling because decision trees are invariant to monotone transformations of features.
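
To see what "scaling inside each CV fold" means in practice, the sketch below reproduces by hand what a `Pipeline` does within one fold and contrasts it with the leaky alternative. The feature is synthetic and the variable names are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 1))  # synthetic feature

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(iter(kfold.split(X_demo)))

# What a Pipeline does inside a fold: fit the scaler on the training rows only
fold_scaler = StandardScaler().fit(X_demo[train_idx])

# The leaky alternative: statistics computed from all rows, validation rows included
leaky_scaler = StandardScaler().fit(X_demo)

print("train-only mean:", fold_scaler.mean_[0])
print("all-rows mean:  ", leaky_scaler.mean_[0])
```

The two means differ because the leaky scaler has seen the validation rows. On a single well-behaved feature the effect is small, but the principle is the same for any preprocessing step that learns from data.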

Having both L1 and L2 logistic regression lets us test whether feature sparsity helps. L1 regularization can drive the coefficients of uninformative features to exactly zero, which may improve generalization if many of the 30 breast cancer features are redundant.

**Key takeaway:** Define your candidate pool before seeing any results. Resist the temptation to add or remove models after peeking at initial scores.
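
The sparsity effect of the L1 penalty can be inspected directly by counting coefficients that are exactly zero. The sketch below is a rough illustration that mirrors the two logistic regression settings from the candidate pool (`C=1.0` with the default L2 penalty, and `C=0.1` with `penalty='l1'` and `solver='liblinear'`) and fits them on the training portion only. The helper names are placeholders chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_fit, _, y_fit, _ = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=474, stratify=y_bc
)

models = {
    "L2": make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
    "L1": make_pipeline(
        StandardScaler(),
        LogisticRegression(C=0.1, penalty="l1", solver="liblinear"),
    ),
}

zero_counts = {}
for label, model in models.items():
    model.fit(X_fit, y_fit)
    coefs = model[-1].coef_.ravel()  # last pipeline step is the classifier
    zero_counts[label] = int(np.sum(coefs == 0))
    print(f"{label}: {zero_counts[label]} of {coefs.size} coefficients are exactly zero")
```

Whether the sparser model also generalizes better is an empirical question that the cross-validated comparison answers; the zero count only shows how many features the model is actually using.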

---

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Implement the comparison harness for 3 candidate models.

---

In [None]:
# Define evaluation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
metrics = ['roc_auc', 'accuracy', 'f1']
primary_metric = 'roc_auc'

print("=== EVALUATION PROTOCOL ===")
print(f"Cross-validation: {cv}")
print(f"Primary metric: {primary_metric}")
print(f"Supporting metrics: {[m for m in metrics if m != primary_metric]}")
print(f"\nRunning comparison...\n")

# Run comparison
comparison_df = compare_models_comprehensive(
    candidate_models, X_train, y_train, cv, metrics, primary_metric
)

print("\n=== COMPREHENSIVE MODEL COMPARISON ===")
display_cols = ['Model', 'roc_auc_cv_mean', 'roc_auc_cv_std', 'accuracy_cv_mean', 
                'f1_cv_mean', 'overfit_gap', 'total_cv_time']
print(comparison_df[display_cols].to_string(index=False))

**Reading the output:**

The evaluation protocol is printed first: 5-fold `StratifiedKFold` with `roc_auc` as the primary metric and `accuracy` and `f1` as supporting metrics. Then the harness evaluates each model and prints its name as it runs.

The comprehensive comparison table shows all five models ranked by CV ROC-AUC. Typical results: ensemble models (RF and GBM) lead with CV ROC-AUC near **0.99**, Logistic Regression follows closely at **0.98-0.99**, L1 Logistic may be slightly lower due to aggressive sparsity, and the Decision Tree trails at **0.95-0.97**.

The `overfit_gap` column is critical: it shows how much each model's training performance exceeds its CV performance. A gap above **0.05** signals potential overfitting. Ensemble methods typically have moderate gaps (0.01-0.03), while the Decision Tree may show a larger gap.

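The `total_cv_time` column reveals computational costs. Logistic Regression and Decision Trees are typically the cheapest to fit (often under 0.1 seconds), while Random Forest and Gradient Boosting take roughly 0.5-2 seconds for 5-fold CV on this small dataset. Because the harness runs folds in parallel with `n_jobs=-1`, the measured wall-clock time also includes worker start-up overhead and depends on your machine, so treat the absolute numbers as indicative. These differences scale dramatically on larger datasets.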

**Why this matters:** This table is the centerpiece of the model selection process. Every subsequent decision (champion selection, test evaluation, deployment) flows from this single, fair, multi-metric comparison.
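
The arithmetic behind one row of the table is simple enough to check by hand. The sketch below uses per-fold scores invented for illustration (they are not results from this notebook) and computes the CV mean, the CV standard deviation, and the overfit gap the same way the harness does.

```python
from statistics import fmean, pstdev

# Hypothetical per-fold ROC-AUC values, invented for illustration
train_scores = [0.999, 0.998, 1.000, 0.999, 0.999]
cv_scores = [0.985, 0.992, 0.978, 0.990, 0.980]

cv_mean = fmean(cv_scores)
cv_std = pstdev(cv_scores)  # population std (ddof=0), matching NumPy's default .std()
overfit_gap = fmean(train_scores) - cv_mean

print(f"CV mean: {cv_mean:.4f}")
print(f"CV std:  {cv_std:.4f}")
print(f"Gap:     {overfit_gap:.4f}")
print("Flag overfitting:", overfit_gap > 0.05)
```

Note that with only five folds the standard deviation is itself a noisy estimate, so small differences in `cv_std` between models should not be over-interpreted.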

---

## 4. Multi-Metric Reporting

Evaluating models on a single metric can be misleading. ROC-AUC measures ranking ability, accuracy captures overall correctness, and F1 balances precision and recall. A model that maximizes ROC-AUC might not maximize accuracy if the classification threshold is suboptimal. Reporting all three metrics (plus training time) gives a multi-dimensional view of each candidate.

The 2x2 visualization below plots each metric as a horizontal bar chart with error bars representing cross-validation standard deviation. The bottom-right panel adds training time so you can spot cases where a marginal performance gain comes at a steep computational cost.

In [None]:
# Visualize metric comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# ROC-AUC
axes[0, 0].barh(comparison_df['Model'], comparison_df['roc_auc_cv_mean'], 
                xerr=comparison_df['roc_auc_cv_std'], alpha=0.7)
axes[0, 0].set_xlabel('ROC-AUC')
axes[0, 0].set_title('ROC-AUC (Primary Metric)')
axes[0, 0].invert_yaxis()

# Accuracy
axes[0, 1].barh(comparison_df['Model'], comparison_df['accuracy_cv_mean'],
                xerr=comparison_df['accuracy_cv_std'], alpha=0.7, color='orange')
axes[0, 1].set_xlabel('Accuracy')
axes[0, 1].set_title('Accuracy')
axes[0, 1].invert_yaxis()

# F1
axes[1, 0].barh(comparison_df['Model'], comparison_df['f1_cv_mean'],
                xerr=comparison_df['f1_cv_std'], alpha=0.7, color='green')
axes[1, 0].set_xlabel('F1 Score')
axes[1, 0].set_title('F1 Score')
axes[1, 0].invert_yaxis()

# Training time
axes[1, 1].barh(comparison_df['Model'], comparison_df['total_cv_time'], alpha=0.7, color='red')
axes[1, 1].set_xlabel('Time (seconds)')
axes[1, 1].set_title('Total CV Time')
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

print("💡 Look for consistency across metrics")
print("💡 Consider time/performance tradeoffs")

**Reading the output:**

The 2x2 panel displays four perspectives on the same five models. The **top-left** (ROC-AUC) is the primary metric; models are ranked by this. The **top-right** (Accuracy) often tells a similar story but can diverge when class imbalance matters. The **bottom-left** (F1 Score) balances precision and recall, important when false negatives have different costs than false positives. The **bottom-right** (Total CV Time) shows the wall-clock cost of the full 5-fold run for each model.

Look for consistency: a model that ranks first in ROC-AUC but third in F1 may have a threshold calibration issue. Error bars (horizontal lines at each bar's tip) show one standard deviation of the fold scores. When two models' error bars overlap substantially, the gap between their means is small relative to fold-to-fold noise, so treat the ranking on that metric as tentative. This is a visual heuristic, not a formal significance test.
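
Because every model is scored on the same folds, scores can also be compared fold by fold rather than only through their means. The sketch below is a rough diagnostic under that assumption, not a formal significance test; it uses two of the candidate model types on the training portion of the data, and the variable names are placeholders chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_fit, _, y_fit, _ = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=474, stratify=y_bc
)

cv_paired = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tree = DecisionTreeClassifier(max_depth=5, random_state=474)

# Same splitter with an integer seed, so fold i contains the same rows for both models
auc_logit = cross_val_score(logit, X_fit, y_fit, cv=cv_paired, scoring="roc_auc")
auc_tree = cross_val_score(tree, X_fit, y_fit, cv=cv_paired, scoring="roc_auc")

fold_diff = auc_logit - auc_tree
print("Per-fold difference (logit - tree):", fold_diff.round(4))
print("Folds where logit wins:", int((fold_diff > 0).sum()), "of", len(fold_diff))
```

A model that wins on every fold is a more convincing leader than one whose higher mean comes from a single favorable fold.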

If two models have nearly identical ROC-AUC, the supporting metrics and training time break the tie. A Logistic Regression that scores 0.985 in 0.05 seconds may be preferable to a Gradient Boosting model that scores 0.990 in 1.5 seconds, depending on deployment constraints.

**Key takeaway:** Multi-metric visualization prevents "tunnel vision" on a single number. Always check that your champion is robust across multiple evaluation criteria before committing.
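
A tie-break rule is easiest to defend when it is written down as code before the results are in. The sketch below shows one possible rule: among models whose CV mean is within one standard deviation of the best, prefer the fastest. The function name, the one-standard-deviation tolerance, and the standard deviations in the table are assumptions made for illustration; the means and times echo the example in the text above.

```python
# Hypothetical summary rows; cv_std values are invented for illustration
candidates = [
    {"model": "Gradient Boosting", "cv_mean": 0.990, "cv_std": 0.008, "cv_time": 1.50},
    {"model": "Logistic Regression", "cv_mean": 0.985, "cv_std": 0.009, "cv_time": 0.05},
    {"model": "Decision Tree", "cv_mean": 0.950, "cv_std": 0.020, "cv_time": 0.04},
]

def pick_with_tiebreak(rows):
    """Return the fastest model whose CV mean is within one std of the best."""
    best = max(rows, key=lambda r: r["cv_mean"])
    threshold = best["cv_mean"] - best["cv_std"]
    near_best = [r for r in rows if r["cv_mean"] >= threshold]
    return min(near_best, key=lambda r: r["cv_time"])

choice = pick_with_tiebreak(candidates)
print(choice["model"])
```

Here the Decision Tree is the fastest overall but falls outside the tolerance, so it is never considered; the rule only trades accuracy for speed among near-equivalent models.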

---

## 5. Champion Selection Memo

Selecting a champion model is not just about picking the highest number; it is a documented business decision. The memo format below forces you to state the primary metric, quantify the margin over the runner-up, assess stability (std), check for overfitting (train-vs-CV gap), and flag risks. This structure ensures that the selection can be reviewed, challenged, and reproduced by a colleague.

In industry, model selection memos become part of the audit trail. Regulators, compliance teams, and business stakeholders can trace exactly why a particular model was deployed, making the process transparent and defensible.

In [None]:
# Select champion
champion_row = comparison_df.iloc[0]
runner_up_row = comparison_df.iloc[1]

print("=== CHAMPION SELECTION MEMO ===")
print(f"\nSelected Model: {champion_row['Model']}")
print(f"\nPrimary Metric ({primary_metric}):")
print(f"  Champion: {champion_row['roc_auc_cv_mean']:.4f} ± {champion_row['roc_auc_cv_std']:.4f}")
print(f"  Runner-up ({runner_up_row['Model']}): {runner_up_row['roc_auc_cv_mean']:.4f} ± {runner_up_row['roc_auc_cv_std']:.4f}")
print(f"  Advantage: {(champion_row['roc_auc_cv_mean'] - runner_up_row['roc_auc_cv_mean']):.4f}")

print(f"\nSupporting Evidence:")
print(f"  Accuracy: {champion_row['accuracy_cv_mean']:.4f}")
print(f"  F1: {champion_row['f1_cv_mean']:.4f}")
print(f"  Overfit gap: {champion_row['overfit_gap']:.4f}")
print(f"  CV time: {champion_row['total_cv_time']:.2f}s")

print(f"\nJustification:")
print(f"  1. Best {primary_metric} by {(champion_row['roc_auc_cv_mean'] - runner_up_row['roc_auc_cv_mean'])*100:.2f} percentage points")
print(f"  2. Reasonable overfitting ({champion_row['overfit_gap']:.4f} gap)")
print(f"  3. Acceptable training time ({champion_row['total_cv_time']:.1f}s for 5-fold CV)")
print(f"  4. Consistent performance across folds (std = {champion_row['roc_auc_cv_std']:.4f})")

print(f"\nRisks/Limitations:")
if 'Gradient Boosting' in champion_row['Model'] or 'Random Forest' in champion_row['Model']:
    print(f"  - Less interpretable than linear models")
    print(f"  - Requires feature importance analysis for explanations")
if champion_row['total_cv_time'] > 10:
    print(f"  - Longer training time may impact iteration speed")
if champion_row['overfit_gap'] > 0.05:
    print(f"  - Moderate overfitting detected - monitor on new data")

print(f"\nRecommendation: Proceed with {champion_row['Model']} for final evaluation")

**Reading the output:**

The champion selection memo is a structured report. It names the winning model, quantifies the margin over the runner-up on the primary metric (ROC-AUC), and provides supporting evidence from accuracy, F1, overfit gap, and CV time.

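The **Justification** section gives four numbered arguments for the selection. Look at argument 2 (overfitting gap): a gap below **0.02** is excellent, meaning the model's training performance is close to its validation performance. Argument 4 (fold stability) reports the standard deviation; values below **0.01** indicate very consistent performance across data subsets. Note that the code prints the wording of these arguments ("Reasonable", "Acceptable", "Consistent") unconditionally, so check that the reported numbers actually support each statement before you reuse the memo.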

The **Risks/Limitations** section is equally important. If the champion is an ensemble model, the memo flags reduced interpretability. If training time is high, it flags iteration speed. These risks must be weighed against the performance advantage.

The final recommendation to "Proceed with [Model] for final evaluation" explicitly gates the next step: only after writing this memo should you touch the test set.

**Why this matters:** In professional settings, this memo is the document that a manager, stakeholder, or regulator reviews. It transforms a subjective model choice into a transparent, evidence-based decision.

---

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Write a champion selection memo (5 bullets + 1 risk).

---

### YOUR CHAMPION SELECTION MEMO:

**Selected Champion:** [Model name]

**Supporting Evidence:**
1. [Primary metric performance]
2. [Supporting metric 1]
3. [Supporting metric 2]
4. [Stability (std)]
5. [Other consideration]

**Key Risk:**
[Most important limitation or risk]

---

## 6. Final Test Set Evaluation

The test set has been locked away since the initial data split. We now touch it **exactly once** to get an unbiased estimate of the champion model's real-world performance. This single-use discipline is critical: if you peek at the test set multiple times during model selection, you are implicitly optimizing for it and the final estimate will be overly optimistic.

After evaluation, we compare the test score to the CV estimate. If the two agree (the test score falls within roughly two standard deviations of the CV mean), we have good evidence that the model generalizes well. The fold-to-fold standard deviation is only a rough yardstick here, not a formal confidence interval. A large discrepancy in either direction warrants investigation.

In [None]:
# Train champion on full training set
champion_model = candidate_models[champion_row['Model']]
champion_model.fit(X_train, y_train)

# Evaluate on test set (FIRST AND ONLY TIME)
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, classification_report

y_pred_test = champion_model.predict(X_test)
y_proba_test = champion_model.predict_proba(X_test)[:, 1]

test_roc_auc = roc_auc_score(y_test, y_proba_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
test_f1 = f1_score(y_test, y_pred_test)

print("=== FINAL TEST SET EVALUATION ===")
print(f"\nChampion: {champion_row['Model']}")
print(f"\nTest Set Performance:")
print(f"  ROC-AUC: {test_roc_auc:.4f}")
print(f"  Accuracy: {test_accuracy:.4f}")
print(f"  F1: {test_f1:.4f}")

print(f"\nCV vs Test Comparison:")
print(f"  CV ROC-AUC: {champion_row['roc_auc_cv_mean']:.4f} ± {champion_row['roc_auc_cv_std']:.4f}")
print(f"  Test ROC-AUC: {test_roc_auc:.4f}")
print(f"  Difference: {abs(test_roc_auc - champion_row['roc_auc_cv_mean']):.4f}")

if abs(test_roc_auc - champion_row['roc_auc_cv_mean']) < 2 * champion_row['roc_auc_cv_std']:
    print(f"\n✓ Test performance within expected range (< 2 std from CV mean)")
else:
    print(f"\n⚠️ Test performance differs from CV - investigate further")

print(f"\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_test, target_names=data.target_names))

**Reading the output:**

The final evaluation reports three metrics on the 114 held-out test samples: **ROC-AUC**, **Accuracy**, and **F1**. These numbers are our best estimate of the model's real-world performance, untainted by selection bias because the test set played no role in choosing the champion.

The **CV vs Test Comparison** section is the key diagnostic. If the test ROC-AUC falls within two standard deviations of the CV mean (i.e., the check mark message appears), the model generalizes as expected. On the breast cancer dataset, typical test ROC-AUC is around **0.98-1.00**, consistent with the high CV scores.

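The detailed **Classification Report** breaks down precision, recall, and F1 for each class (malignant and benign). Pay special attention to the malignant class (the minority class): a model that achieves 99% accuracy but misses 10% of malignant cases may be unacceptable in a clinical setting. The recall for malignant tells you the sensitivity of the model. Keep in mind that this dataset encodes malignant as 0 and benign as 1, so the headline F1 above (and the `f1` scorer used in cross-validation) treats benign as the positive class; the malignant row of the report is where to read malignant-class recall.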
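
The label encoding matters when reading single-number metrics, because scikit-learn's binary metrics default to `pos_label=1`, which is benign in this dataset. The sketch below uses a tiny set of hypothetical labels, invented for illustration, to show how accuracy and the default F1 can look strong while recall for the malignant class (label 0) is noticeably lower.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels using the dataset's encoding: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # one malignant case is missed

acc = accuracy_score(y_true, y_pred)
f1_benign = f1_score(y_true, y_pred)                         # default pos_label=1
recall_malignant = recall_score(y_true, y_pred, pos_label=0)  # sensitivity for malignant

print("Accuracy:              ", acc)
print("F1 (default, benign=1):", round(f1_benign, 3))
print("Recall for malignant:  ", recall_malignant)
```

Passing `pos_label=0` explicitly is a simple way to report sensitivity for the clinically critical class without re-encoding the target.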

**Key takeaway:** The test set evaluation is the final checkpoint. If the test score is dramatically lower than the CV estimate, something went wrong in the selection process (data leakage, overfitting to CV, or an unusual test split). If it agrees, you can confidently report this performance to stakeholders.

---

## 7. Experiment Log Template

Professional data science teams track every model experiment in a structured log: dataset metadata, model name, hyperparameters, CV scores, test scores, and timestamps. This experiment log serves three purposes: (1) it prevents re-running experiments you have already tried, (2) it enables reproducibility months later, and (3) it provides evidence for model governance and audit.

The code below creates a DataFrame that captures all of this information in one table. In practice, you would append each new experiment to a persistent CSV or database, building a complete history of your modeling efforts.

In [None]:
# Create reproducible experiment log
experiment_log = comparison_df.copy()

# Add metadata
experiment_log['date'] = pd.Timestamp.now().strftime('%Y-%m-%d')
experiment_log['dataset'] = 'breast_cancer'
experiment_log['n_samples'] = len(X_train)
experiment_log['n_features'] = X_train.shape[1]
experiment_log['cv_folds'] = cv.n_splits
experiment_log['random_seed'] = RANDOM_SEED

# Mark champion
experiment_log['is_champion'] = experiment_log['Model'] == champion_row['Model']

# Add test score for champion
experiment_log['test_roc_auc'] = experiment_log['Model'].apply(
    lambda x: test_roc_auc if x == champion_row['Model'] else np.nan
)

print("=== EXPERIMENT LOG ===")
log_cols = ['date', 'Model', 'roc_auc_cv_mean', 'roc_auc_cv_std', 'test_roc_auc', 'is_champion']
print(experiment_log[log_cols].to_string(index=False))

# Save to CSV
# experiment_log.to_csv('model_experiments.csv', index=False)
print("\n💡 Experiment log ready for version control")
print("💡 Track all experiments for reproducibility")

**Reading the output:**

The experiment log table shows one row per model with columns for date, model name, CV ROC-AUC (mean and std), test ROC-AUC (only for the champion), and a boolean `is_champion` flag. Metadata columns record the dataset name, sample count (**455** training samples), feature count (**30**), CV folds (**5**), and random seed (**474**).

Only the champion model has a test score; all other rows show `NaN` for `test_roc_auc`. This is deliberate: non-champion models were never evaluated on the test set, preserving the single-use discipline.

In a real project, you would append this log to a persistent file (CSV, database, or experiment tracking tool like MLflow) after every modeling session. Over weeks of work, the log accumulates all experiments, making it easy to answer questions like "Did we already try L1 regularization with C=0.01?" or "What was our best score three weeks ago?"

**Key takeaway:** The experiment log is the institutional memory of a data science project. Without it, teams waste time re-running experiments and lose track of what has been tried.
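
Appending to a persistent CSV, as described above, needs one small precaution: the header should be written only when the file is first created. The sketch below shows one way to do that with pandas. The `append_to_log` helper and the example rows are hypothetical, and a temporary directory stands in for the project folder so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

def append_to_log(new_rows: pd.DataFrame, log_path: Path) -> None:
    """Append rows to a CSV log, writing the header only when the file is new."""
    write_header = not log_path.exists()
    new_rows.to_csv(log_path, mode="a", header=write_header, index=False)

# Hypothetical session results, invented for illustration
session = pd.DataFrame(
    {"Model": ["Logistic Regression", "Decision Tree"],
     "roc_auc_cv_mean": [0.985, 0.950]}
)

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "model_experiments.csv"
    append_to_log(session, log_file)  # first session creates the file
    append_to_log(session, log_file)  # later sessions append below it
    history = pd.read_csv(log_file)

print(history)
```

This sketch assumes every session writes the same columns in the same order; if the schema changes between sessions, a plain append will misalign the rows, which is one reason teams move to a database or an experiment tracking tool as a project grows.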

---

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Fair Comparison**: Same CV, same metrics, same data
2. **Multi-Metric Evaluation**: Primary metric + supporting evidence
3. **Champion Selection**: Documented, justified, reproducible
4. **Test Discipline**: Touch test set once, after selection
5. **Experiment Logging**: Track everything for reproducibility

### Critical Rules:

> **"Same CV folds for all models, always"**

> **"Select on CV, validate on test (once)"**

> **"Document selection criteria before training"**

### Next Steps:

- Next notebook: Model interpretation + error analysis + Project Milestone 3
- We'll interpret the champion model
- Deliver improved model for project

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Model Assessment and Selection (protocols for fair comparison)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Selection bias and repeated peeking hazards
- scikit-learn User Guide: [Model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)
- scikit-learn User Guide: [Parameter tuning best practices](https://scikit-learn.org/stable/modules/grid_search.html)

---



<center>

Thank you!

</center>