# Train/Validation/Test Rigor + Regression Metrics + Baseline Modeling

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/03_regression_metrics_baselines.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Choose regression metrics aligned to business loss (MAE vs RMSE)
2. Establish a baseline model and interpret it correctly
3. Run holdout evaluation without contaminating the test set
4. Use quick diagnostic plots to spot obvious modeling issues
5. Document evaluation decisions (metric, split, baseline, assumptions)

---

## 1. Setup

In [None]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)

print("✓ Setup complete!")

## 2. Load Data and Create Splits

In [None]:
# Load dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

# Create splits: hold out 20% for test, then 25% of the remaining 80% for validation (60/20/20 overall)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=RANDOM_SEED)

print(f"Train: {len(X_train)} | Validation: {len(X_val)} | Test: {len(X_test)} (LOCKED)")
print(f"\n⚠️ TEST SET IS LOCKED - Do not use until final evaluation!")
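
The two-step split is easy to get wrong, because the second `test_size` applies to the remaining 80%, not to the full dataset: 0.25 × 0.80 = 0.20. The sketch below checks the resulting proportions, and that the three parts do not overlap, on a made-up array of 1,000 row IDs. The `*_demo` names and the seed are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 1000
X_demo = np.arange(n).reshape(-1, 1)  # each row holds its own unique ID
y_demo = np.arange(n)

# Same two-step recipe as above: 20% test, then 25% of the remainder for validation.
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

fractions = {"train": len(X_tr) / n, "val": len(X_va) / n, "test": len(X_te) / n}
print(fractions)

# No row ID should appear in more than one part.
train_ids, val_ids, test_ids = (set(part.ravel()) for part in (X_tr, X_va, X_te))
no_overlap = (
    train_ids.isdisjoint(val_ids)
    and train_ids.isdisjoint(test_ids)
    and val_ids.isdisjoint(test_ids)
)
print("Disjoint parts:", no_overlap)
```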

## 3. Regression Metrics: MAE, RMSE, R²

### 3.1 Understanding the Metrics

**Mean Absolute Error (MAE)**
- Average of absolute differences
- Same units as target
- Less sensitive to outliers than RMSE
- Use when: All errors cost the same

**Root Mean Squared Error (RMSE)**
- Square root of average squared differences
- Same units as target
- More sensitive to outliers (penalizes large errors)
- Use when: Large errors are disproportionately costly

**R² (Coefficient of Determination)**
- Proportion of variance explained
- Scale: 1 is perfect; 0 is no better than always predicting the mean
- Negative when the model is worse than predicting the mean, so there is no lower bound
- Use when: You want a relative improvement metric
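
To make the difference between the metrics concrete, here is a minimal sketch on made-up numbers. The arrays and the helper names `mae`, `rmse`, and `r2` are illustrative and not part of the notebook's pipeline. Both prediction vectors have the same MAE, but the one with a single large miss has double the RMSE and a much lower R².

```python
import numpy as np

# Made-up targets and two hypothetical sets of predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_even = np.array([1.5, 2.5, 2.5, 3.5])     # four misses of 0.5 each
y_pred_outlier = np.array([1.0, 2.0, 3.0, 6.0])  # one miss of 2.0

def mae(y, p):
    """Mean absolute error: average size of the misses."""
    return np.mean(np.abs(y - p))

def rmse(y, p):
    """Root mean squared error: squaring weights large misses more heavily."""
    return np.sqrt(np.mean((y - p) ** 2))

def r2(y, p):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y."""
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

for label, pred in [("even", y_pred_even), ("outlier", y_pred_outlier)]:
    print(f"{label:8s} MAE={mae(y_true, pred):.2f} "
          f"RMSE={rmse(y_true, pred):.2f} R2={r2(y_true, pred):.2f}")
# even     MAE=0.50 RMSE=0.50 R2=0.80
# outlier  MAE=0.50 RMSE=1.00 R2=0.20
```

If a single large miss is much more costly than several small ones, RMSE reflects that; if not, MAE is the more direct summary.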

In [None]:
def evaluate_regression(y_true, y_pred, name="Model"):
    """
    Compute standard regression metrics.
    
    Parameters:
    -----------
    y_true : array-like
        True target values
    y_pred : array-like
        Predicted values
    name : str
        Name for reporting
    
    Returns:
    --------
    dict : Dictionary of metrics
    """
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    
    metrics = {
        'Model': name,
        'MAE': mae,
        'RMSE': rmse,
        'R²': r2
    }
    
    return metrics

print("✓ Evaluation function created")

## 4. Baseline Models

### 4.1 Why Baselines Matter

A baseline model provides a reference point. Any useful model should beat these simple strategies, which ignore the features entirely:
- **Mean baseline**: Predict the training set mean for all samples
- **Median baseline**: Predict the training set median for all samples

If your model doesn't beat the baseline, something is wrong.
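
One reason to fit both baselines: among constant predictions, the mean minimizes squared error (and therefore RMSE), while the median minimizes absolute error (MAE). A small grid search on made-up, right-skewed numbers illustrates this. The data and variable names below are assumptions for illustration only.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 15.0])  # mean = 5.0, median = 3.0

# Try every constant prediction c on a fine grid and score it both ways.
candidates = np.linspace(0.0, 16.0, 1601)   # step of 0.01
errors = y[None, :] - candidates[:, None]   # shape (1601, 5)
mae_per_c = np.abs(errors).mean(axis=1)
mse_per_c = (errors ** 2).mean(axis=1)

best_for_mae = candidates[np.argmin(mae_per_c)]
best_for_mse = candidates[np.argmin(mse_per_c)]

print(f"Best constant for MAE: {best_for_mae:.2f} (median = {np.median(y):.2f})")
print(f"Best constant for MSE: {best_for_mse:.2f} (mean   = {y.mean():.2f})")
```

So when MAE is the primary metric, the median baseline is the tougher constant to beat; when RMSE is primary, the mean baseline is.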

In [None]:
# Mean baseline
baseline_mean = DummyRegressor(strategy='mean')
baseline_mean.fit(X_train, y_train)
y_pred_mean_train = baseline_mean.predict(X_train)
y_pred_mean_val = baseline_mean.predict(X_val)

# Median baseline
baseline_median = DummyRegressor(strategy='median')
baseline_median.fit(X_train, y_train)
y_pred_median_train = baseline_median.predict(X_train)
y_pred_median_val = baseline_median.predict(X_val)

print("✓ Baseline models fitted")

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Apply `evaluate_regression(y_true, y_pred, name)`, which returns MAE/RMSE/R², to both baselines.

The function is already implemented above. Now:
1. Use it to evaluate both baselines on train and validation sets
2. Create a comparison table
3. Interpret the results

---

In [None]:
# Evaluate baselines
results = []

# Mean baseline - train
results.append(evaluate_regression(y_train, y_pred_mean_train, "Mean Baseline (Train)"))
# Mean baseline - validation
results.append(evaluate_regression(y_val, y_pred_mean_val, "Mean Baseline (Val)"))

# Median baseline - train
results.append(evaluate_regression(y_train, y_pred_median_train, "Median Baseline (Train)"))
# Median baseline - validation
results.append(evaluate_regression(y_val, y_pred_median_val, "Median Baseline (Val)"))

results_df = pd.DataFrame(results)
print("=== BASELINE COMPARISON ===")
print(results_df)

print(f"\n💡 Key insights:")
print(f"  - R² = 0 means the model is no better than predicting the mean")
print(f"  - MAE = {results_df.loc[1, 'MAE']:.3f} means on average we're off by about ${results_df.loc[1, 'MAE'] * 100_000:,.0f} (target is in units of $100,000)")
print(f"  - RMSE is always >= MAE; a large gap between them indicates some large errors in predictions")

## 5. Simple Linear Model

Now let's fit a simple linear regression and compare to baselines.

In [None]:
# Create pipeline with scaling + linear regression
linear_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit on training data
linear_pipeline.fit(X_train, y_train)

# Predict
y_pred_linear_train = linear_pipeline.predict(X_train)
y_pred_linear_val = linear_pipeline.predict(X_val)

print("✓ Linear model fitted")

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Compare baseline vs linear regression and interpret the delta.

---

In [None]:
# Add linear model results
results.append(evaluate_regression(y_train, y_pred_linear_train, "Linear Regression (Train)"))
results.append(evaluate_regression(y_val, y_pred_linear_val, "Linear Regression (Val)"))

# Create full comparison table
full_results_df = pd.DataFrame(results)
print("=== FULL MODEL COMPARISON ===")
print(full_results_df)

# Calculate improvements
baseline_mae = full_results_df.loc[full_results_df['Model'] == 'Mean Baseline (Val)', 'MAE'].values[0]
linear_mae = full_results_df.loc[full_results_df['Model'] == 'Linear Regression (Val)', 'MAE'].values[0]
improvement_pct = ((baseline_mae - linear_mae) / baseline_mae) * 100

print(f"\n=== IMPROVEMENT OVER BASELINE ===")
print(f"Baseline MAE: {baseline_mae:.4f}")
print(f"Linear MAE: {linear_mae:.4f}")
print(f"Improvement: {improvement_pct:.2f}%")
print(f"\nLinear R² on validation: {full_results_df.loc[full_results_df['Model'] == 'Linear Regression (Val)', 'R²'].values[0]:.4f}")
print(f"This means the model explains {full_results_df.loc[full_results_df['Model'] == 'Linear Regression (Val)', 'R²'].values[0]*100:.2f}% of variance")

### YOUR INTERPRETATION HERE:

**Observation 1: Performance**  
[How much better is linear vs baseline?]

**Observation 2: Overfitting Check**  
[Compare train vs validation scores - is there overfitting?]

**Observation 3: Business Context**  
[What does this MAE mean in practical terms?]

---

## 6. Residual Analysis

Residuals = True values - Predicted values, so a positive residual means the model under-predicted.

Good residuals should:
- Be centered around zero
- Show no patterns (random scatter)
- Have roughly constant variance
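
These checks can also be approximated numerically. The sketch below fits a straight line to synthetic data in two cases: one where the true relationship is linear and one where it is quadratic. It reports the residual mean and a rough "pattern score", defined here (as an assumption of this sketch, not a standard diagnostic) as the correlation between the residuals and the squared, centered predictions. A U-shaped residual plot gives a score far from zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500).reshape(-1, 1)
noise = rng.normal(0, 1, size=500)

def residual_summary(y):
    """Fit a line to (x, y); return the residual mean and a curvature score."""
    pred = LinearRegression().fit(x, y).predict(x)
    resid = y - pred
    # Hypothetical pattern score: correlation of residuals with (pred - mean)^2.
    pattern = np.corrcoef(resid, (pred - pred.mean()) ** 2)[0, 1]
    return resid.mean(), pattern

mean_lin, pattern_lin = residual_summary(2.0 * x.ravel() + noise)         # well specified
mean_quad, pattern_quad = residual_summary(0.5 * x.ravel() ** 2 + noise)  # misspecified

print(f"linear truth:    mean={mean_lin:.4f}, pattern={pattern_lin:.2f}")
print(f"quadratic truth: mean={mean_quad:.4f}, pattern={pattern_quad:.2f}")
```

Note that ordinary least squares with an intercept forces the training residual mean to zero, so the "centered around zero" check is only informative on validation data.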

In [None]:
# Calculate residuals
residuals_train = y_train - y_pred_linear_train
residuals_val = y_val - y_pred_linear_val

# Create residual plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Residuals vs Predicted (Train)
axes[0, 0].scatter(y_pred_linear_train, residuals_train, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residual Plot - Training Set')

# Plot 2: Residuals vs Predicted (Validation)
axes[0, 1].scatter(y_pred_linear_val, residuals_val, alpha=0.5, color='orange')
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residual Plot - Validation Set')

# Plot 3: Distribution of Residuals (Train)
axes[1, 0].hist(residuals_train, bins=50, alpha=0.7, edgecolor='black')
axes[1, 0].axvline(x=0, color='r', linestyle='--')
axes[1, 0].set_xlabel('Residual')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Residual Distribution - Training')

# Plot 4: Distribution of Residuals (Validation)
axes[1, 1].hist(residuals_val, bins=50, alpha=0.7, color='orange', edgecolor='black')
axes[1, 1].axvline(x=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Residual')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Residual Distribution - Validation')

plt.tight_layout()
plt.show()

print("=== RESIDUAL STATISTICS ===")
print(f"Training residuals: mean={residuals_train.mean():.6f}, std={residuals_train.std():.4f}")
print(f"Validation residuals: mean={residuals_val.mean():.6f}, std={residuals_val.std():.4f}")

## 7. Predicted vs Actual Plot

In [None]:
# Predicted vs Actual
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set
axes[0].scatter(y_train, y_pred_linear_train, alpha=0.5)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Values')
axes[0].set_ylabel('Predicted Values')
axes[0].set_title('Training Set: Predicted vs Actual')

# Validation set
axes[1].scatter(y_val, y_pred_linear_val, alpha=0.5, color='orange')
axes[1].plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual Values')
axes[1].set_ylabel('Predicted Values')
axes[1].set_title('Validation Set: Predicted vs Actual')

plt.tight_layout()
plt.show()

print("💡 Points close to the red line indicate good predictions")
print("   Points far from the line show prediction errors")

## 8. Test Set Lockbox Discipline

### The Test Set Rule:

> **"Touch the test set ONCE, at the very end"**

**Why?**
- Every time you look at test performance, you risk making decisions based on it
- This causes "leakage" of test information into your modeling choices
- The test set must remain a true "unseen" evaluation

**What to do instead:**
- Use validation set for all model selection and tuning
- Use cross-validation on the non-test data for more robust comparisons
- Only evaluate on test when you're completely done

**Warning signs you're peeking:**
- "Let me just check test performance real quick"
- "The test score is lower, let me adjust..."
- Running multiple experiments and looking at test each time
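
The cross-validation suggestion above can be sketched as follows. This is a minimal example on synthetic data from `make_regression`, used here only to keep the block self-contained; in this notebook you would pass the non-test rows instead, for example `X_temp` and `y_temp`. The 5-fold setup and the choice of MAE as the score are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the development (non-test) data.
X_dev, y_dev = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "mean_baseline": DummyRegressor(strategy="mean"),
    "linear": Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())]),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn scorers follow "greater is better", so MAE is returned negated.
    scores = cross_val_score(estimator, X_dev, y_dev, cv=cv,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

for name, value in cv_mae.items():
    print(f"{name:14s} CV MAE = {value:.2f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no fold's held-out rows influence its preprocessing. The test set is never touched.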

In [None]:
print("=== TEST SET STATUS ===")
print(f"Test set size: {len(X_test)} samples")
print(f"Test set is LOCKED 🔒")
print(f"\n✓ We will NOT evaluate on test until the final submission")
print(f"✓ All model development uses only train + validation")

## 9. Evaluation Documentation Template

Every evaluation should document:

### Evaluation Plan

**Primary Metric:** [MAE / RMSE / R²]  
**Rationale:** [Why this metric aligns with business goals]

**Split Strategy:**
- Training: 60% (for fitting)
- Validation: 20% (for selection)
- Test: 20% (for final evaluation)

**Baseline:** [Mean / Median predictor]  
**Baseline Performance:** [Validation score]

**Model:** [Linear Regression with Standard Scaling]  
**Model Performance:** [Validation score]  
**Improvement over Baseline:** [Percentage or absolute difference]

**Assumptions:**
- Features are available at prediction time
- Relationship is approximately linear
- No major data quality issues

**Risks:**
- Model may not generalize to different time periods
- Assumes no distribution shift
- Sensitive to outliers
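
If you prefer to keep the plan next to the code, the template can also be captured as a small record and saved as JSON. The `EvaluationPlan` class and its field names below are hypothetical, chosen to mirror the headings above. The metric and rationale shown are example choices, and measured scores are deliberately left as `None` until you fill them in.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class EvaluationPlan:
    primary_metric: str
    rationale: str
    split: dict
    baseline: str
    model: str
    baseline_val_score: Optional[float] = None  # fill in after validation
    model_val_score: Optional[float] = None     # fill in after validation
    assumptions: list = field(default_factory=list)
    risks: list = field(default_factory=list)

    def __post_init__(self):
        # Guard against split fractions that do not cover the whole dataset.
        if abs(sum(self.split.values()) - 1.0) > 1e-9:
            raise ValueError("split fractions must sum to 1")

plan = EvaluationPlan(
    primary_metric="MAE",
    rationale="All errors cost the same",
    split={"train": 0.60, "validation": 0.20, "test": 0.20},
    baseline="Mean predictor",
    model="Linear Regression with Standard Scaling",
    assumptions=["Features are available at prediction time"],
    risks=["Assumes no distribution shift"],
)

plan_json = json.dumps(asdict(plan), indent=2)
print(plan_json)
```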

## 10. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Metrics Matter**: Choose MAE/RMSE/R² based on the business loss function
2. **Baselines First**: Always establish a simple baseline for comparison
3. **Holdout Discipline**: Train on train, evaluate on validation, lock test away
4. **Residuals Tell Stories**: Use diagnostic plots to spot issues
5. **Document Everything**: Clear evaluation plans prevent confusion later

### Critical Rules:

> **"If your model can't beat the mean, debug before proceeding"**

> **"The test set is a lockbox - open it once"**

### Next Steps:

- Next notebook: Linear regression with features and diagnostics
- We'll build on today's evaluation framework
- Start thinking about which metrics matter for your project

---

## 11. Submission Instructions

### To Submit:

1. Run all cells
2. Complete both exercises
3. Write a 3-sentence evaluation note answering:
   - Which metric is most appropriate for this problem and why?
   - How much better is linear regression than the baseline?
   - What does the residual analysis suggest?
4. Submit Colab link + evaluation note in LMS

---

## Bibliography

- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Chapter on Model Assessment and Selection
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Test error, training error, bias-variance
- scikit-learn User Guide: [Regression metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)
- scikit-learn User Guide: [Common pitfalls](https://scikit-learn.org/stable/common_pitfalls.html)

---



<center>

Thank you!

</center>