<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/22_regression_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating Regression Models

This notebook contains code examples from the **Evaluating Regression Models** chapter (Chapter 22) of the BANA 4080 textbook. Follow along to practice model evaluation techniques using scikit-learn and Python.

## 📚 Chapter Overview

Building a regression model is only half the battle. The real question is: how good is your model? This chapter teaches you to measure model performance using various metrics, understand when each metric is most appropriate for business decision-making, and evaluate whether your models are ready for real-world deployment.

## 🎯 What You'll Practice

- Calculate and interpret error metrics (SSE, R², MSE, RMSE, MAE, MAPE)
- Apply train/test splits to evaluate model generalization
- Diagnose overfitting vs underfitting using performance comparisons
- Connect evaluation metrics to business decision contexts

## 💡 How to Use This Notebook

1. **Read the chapter first** - This notebook supplements the textbook; it does not replace it
2. **Run cells sequentially** - Code builds on previous examples
3. **Experiment freely** - Modify code to test your understanding
4. **Practice variations** - Try different approaches to reinforce learning

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
# Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
    root_mean_squared_error,
)
from sklearn.model_selection import train_test_split

# Create our advertising dataset used throughout the chapter
data = pd.DataFrame({
    "ad_spend": [400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300],
    "weekly_sales": [4200, 4400, 4100, 4800, 5600, 5200, 4900, 5500, 5300, 5900, 5700, 6300, 6900, 6200, 5800, 6600, 7100, 6800, 7300, 7800]
})

print("Advertising dataset shape:", data.shape)
data.head()

## Understanding Sum of Squared Errors (SSE)

Linear regression finds the "best-fit line" by minimizing the Sum of Squared Errors. Let's build a model and calculate SSE manually to understand this fundamental concept.
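
In symbols, the quantity being minimized can be written as follows. The notation is an assumption made here for illustration, not taken from the chapter: $y_i$ is the actual value for observation $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of observations.

$$
\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Each term is one squared residual, which is what the manual calculation in this section adds up.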

In [None]:
# Build a simple regression model
X = data[['ad_spend']]
y = data['weekly_sales']
model = LinearRegression()
model.fit(X, y)

# Make predictions
predictions = model.predict(X)

print(f"Intercept: {model.intercept_:.2f}")
print(f"Ad Spend Coefficient: {model.coef_[0]:.2f}")

In [None]:
# Calculate SSE manually
residuals = y - predictions
squared_residuals = residuals ** 2
sse_manual = np.sum(squared_residuals)

print(f"Sum of Squared Errors: {sse_manual:,.0f}")
print(f"Number of data points: {len(y)}")
print(f"Average squared error per point: {sse_manual/len(y):,.0f}")

# Show calculation breakdown for first 5 points
print(f"\nBreaking down the first 5 predictions:")
print(f"{'Point':<8} {'Actual':<8} {'Predicted':<10} {'Error':<8} {'Squared Error':<12}")
print(f"{'-'*50}")
for i in range(5):
    actual = y.iloc[i]
    predicted = predictions[i]
    error = actual - predicted
    squared_error = error ** 2
    print(f"{i+1:<8} ${actual:<7.0f} ${predicted:<9.0f} {error:<+7.0f} {squared_error:<11.0f}")

### 🏃‍♂️ Try It Yourself

Calculate the residuals for data points 10-15 and find their contribution to the total SSE.

In [None]:
# Your code here


## R-Squared: Measuring Goodness of Fit

R² converts error into an interpretable percentage that represents the proportion of variation explained by your model.
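
As a compact reference for the manual calculation in this section, the relationship can be sketched as follows. The symbols are assumed here for illustration: $\bar{y}$ is the mean of the actual values, and SSE is the sum of squared errors from the previous section.

$$
\text{TSS} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2, \qquad R^2 = 1 - \frac{\text{SSE}}{\text{TSS}}
$$

Read this way, R² compares the model's squared error with the squared error of always predicting the mean.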

In [None]:
# Calculate R² manually and compare with sklearn
y_mean = np.mean(y)
tss = np.sum((y - y_mean) ** 2)  # Total Sum of Squares
sse = np.sum((y - predictions) ** 2)  # Sum of Squared Errors
r_squared_manual = 1 - (sse / tss)

# Compare with sklearn's calculation
r_squared_sklearn = r2_score(y, predictions)

print(f"Manual R² calculation: {r_squared_manual:.4f}")
print(f"Sklearn R² calculation: {r_squared_sklearn:.4f}")
print(f"Model R² (from .score() method): {model.score(X, y):.4f}")

print(f"\nInterpretation: {r_squared_manual:.1%} of the variation in weekly sales")
print(f"is explained by advertising spend in our model.")

### 🏃‍♂️ Try It Yourself

Create a "null model" that always predicts the mean of y. Calculate its R² value. What should it be and why?

In [None]:
# Your code here


## Error Metrics for Business Decisions

Different error metrics emphasize different aspects of prediction accuracy. Let's calculate and compare MSE, RMSE, MAE, and MAPE.
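
Before relying on the scikit-learn helpers, it can help to see what each metric computes. The following is a minimal sketch, not the chapter's own code: it uses three made-up actual/predicted pairs (the `toy_` names and values are illustrative assumptions, not the advertising data) and compares a hand calculation in NumPy with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# Made-up values for illustration only (not from the chapter's dataset)
toy_actual = np.array([100.0, 200.0, 400.0])
toy_predicted = np.array([110.0, 180.0, 400.0])

toy_errors = toy_actual - toy_predicted  # residuals: -10, 20, 0

# Each metric is a different way of averaging the same residuals
mse_by_hand = np.mean(toy_errors ** 2)                # average squared error
rmse_by_hand = np.sqrt(mse_by_hand)                   # back in the original units
mae_by_hand = np.mean(np.abs(toy_errors))             # average absolute error
mape_by_hand = np.mean(np.abs(toy_errors) / np.abs(toy_actual)) * 100  # percent

print(f"MSE : {mse_by_hand:.2f}  (sklearn: {mean_squared_error(toy_actual, toy_predicted):.2f})")
print(f"RMSE: {rmse_by_hand:.2f}")
print(f"MAE : {mae_by_hand:.2f}  (sklearn: {mean_absolute_error(toy_actual, toy_predicted):.2f})")
print(f"MAPE: {mape_by_hand:.2f}% (sklearn: {mean_absolute_percentage_error(toy_actual, toy_predicted) * 100:.2f}%)")
```

RMSE is computed here as the square root of MSE so the sketch does not depend on a particular scikit-learn version.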

In [None]:
# Calculate all major error metrics
mse = mean_squared_error(y, predictions)
rmse = root_mean_squared_error(y, predictions)
mae = mean_absolute_error(y, predictions)
mape = mean_absolute_percentage_error(y, predictions) * 100  # Convert to percentage

print("Error Metrics Summary:")
print(f"{'Metric':<20} {'Value':<15} {'Interpretation'}")
print(f"{'-'*60}")
print(f"{'R²':<20} {r_squared_sklearn:<15.3f} {'Proportion of variation explained'}")
print(f"{'MSE':<20} {mse:<15,.0f} {'Average squared error'}")
print(f"{'RMSE':<20} ${rmse:<14,.0f} {'Typical prediction error (same units)'}")
print(f"{'MAE':<20} ${mae:<14,.0f} {'Average absolute error'}")
print(f"{'MAPE':<20} {mape:<14.1f}% {'Average percentage error'}")

print(f"\nBusiness Interpretation:")
print(f"A typical prediction misses actual weekly sales by about ${rmse:,.0f} (RMSE).")
print(f"On average, predictions are off by {mape:.1f}% of the actual sales value (MAPE).")

In [None]:
# Demonstrate difference between MAE and RMSE with an outlier
y_with_outlier = y.copy()
predictions_with_outlier = predictions.copy()
y_with_outlier.iloc[0] = 10000  # Change one actual value so that its prediction misses badly

mae_outlier = mean_absolute_error(y_with_outlier, predictions_with_outlier)
rmse_outlier = root_mean_squared_error(y_with_outlier, predictions_with_outlier)

print("Impact of Outliers on Error Metrics:")
print(f"{'Metric':<20} {'Original':<15} {'With Outlier':<15} {'Change'}")
print(f"{'-'*65}")
print(f"{'MAE':<20} ${mae:<14,.0f} ${mae_outlier:<14,.0f} {((mae_outlier/mae - 1)*100):+.0f}%")
print(f"{'RMSE':<20} ${rmse:<14,.0f} ${rmse_outlier:<14,.0f} {((rmse_outlier/rmse - 1)*100):+.0f}%")
print("\nRMSE is much more sensitive to outliers than MAE because it squares each error.")
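
To see where that sensitivity comes from, here is a minimal sketch with two hypothetical sets of prediction errors (made-up numbers, not the advertising data). Both sets have the same total absolute error, but one concentrates it in a single prediction. The helper names `toy_mae` and `toy_rmse` are introduced only for this illustration.

```python
import numpy as np

# Hypothetical prediction errors with the same total absolute error (400)
even_errors = np.array([100.0, 100.0, 100.0, 100.0])  # every prediction misses by 100
one_big_error = np.array([0.0, 0.0, 0.0, 400.0])      # a single prediction misses by 400


def toy_mae(errors):
    """Mean absolute error computed directly from an array of errors."""
    return np.mean(np.abs(errors))


def toy_rmse(errors):
    """Root mean squared error computed directly from an array of errors."""
    return np.sqrt(np.mean(errors ** 2))


for label, errs in [("Evenly spread", even_errors), ("One large miss", one_big_error)]:
    print(f"{label:<15} MAE = {toy_mae(errs):.0f}   RMSE = {toy_rmse(errs):.0f}")
```

MAE is 100 in both cases, while RMSE doubles from 100 to 200 when the error is concentrated in one prediction, because squaring gives the large miss far more weight.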

### 🏃‍♂️ Try It Yourself

Which error metric would be most appropriate for:
1. A financial risk model where large losses could bankrupt the company
2. A customer service staffing model where errors have roughly linear costs
3. A retail forecasting model comparing across different product categories

Justify your choices below:

In [None]:
# Your analysis here


## Train/Test Splits: Evaluating Generalization

The most critical aspect of model evaluation is testing performance on unseen data. Let's implement proper train/test evaluation.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=30
)

print(f"Total data points: {len(X)}")
print(f"Training set: {len(X_train)} points ({len(X_train)/len(X):.1%})")
print(f"Test set: {len(X_test)} points ({len(X_test)/len(X):.1%})")

# Train model on training data only
model_train = LinearRegression()
model_train.fit(X_train, y_train)

# Evaluate on both training and test sets
train_predictions = model_train.predict(X_train)
test_predictions = model_train.predict(X_test)

# Calculate metrics for both sets
print(f"\n{'Metric':<20} {'Training Set':<15} {'Test Set':<15}")
print(f"{'-'*50}")
print(f"{'R²':<20} {r2_score(y_train, train_predictions):<15.3f} {r2_score(y_test, test_predictions):<15.3f}")
print(f"{'RMSE':<20} {root_mean_squared_error(y_train, train_predictions):<15.0f} {root_mean_squared_error(y_test, test_predictions):<15.0f}")
print(f"{'MAE':<20} {mean_absolute_error(y_train, train_predictions):<15.0f} {mean_absolute_error(y_test, test_predictions):<15.0f}")
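
The same three metrics get computed twice, once per partition, so one option is to wrap them in a small helper. The sketch below is one possible arrangement, not the chapter's method: `summarize_fit` and the variable names are hypothetical, and the data are the 20 rows defined in the setup cell, recreated so the block runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def summarize_fit(fitted_model, X_part, y_part):
    """Return R², RMSE, and MAE for one partition (hypothetical helper)."""
    preds = fitted_model.predict(X_part)
    return {
        "r2": r2_score(y_part, preds),
        "rmse": mean_squared_error(y_part, preds) ** 0.5,  # sqrt of MSE
        "mae": mean_absolute_error(y_part, preds),
    }


# Same 20 rows as the notebook's setup cell, recreated to keep this self-contained
ad_df = pd.DataFrame({
    "ad_spend": list(range(400, 2400, 100)),
    "weekly_sales": [4200, 4400, 4100, 4800, 5600, 5200, 4900, 5500, 5300, 5900,
                     5700, 6300, 6900, 6200, 5800, 6600, 7100, 6800, 7300, 7800],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    ad_df[["ad_spend"]], ad_df["weekly_sales"], test_size=0.3, random_state=30
)

fitted = LinearRegression().fit(X_tr, y_tr)  # fit on the training rows only

# One column per partition makes the train/test gap easy to read
summary = pd.DataFrame({
    "train": summarize_fit(fitted, X_tr, y_tr),
    "test": summarize_fit(fitted, X_te, y_te),
})
print(summary.round(3))
```

Keeping the metric logic in one place means training and test numbers are always computed the same way, which matters when comparing them.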

### 🏃‍♂️ Try It Yourself

Try different random_state values (e.g., 42, 123, 999) and observe how train/test performance varies. What does this tell you about the reliability of your evaluation?

In [None]:
# Your code here


## Overfitting vs Underfitting Demonstration

Let's create synthetic data to clearly demonstrate the difference between underfitting, good fit, and overfitting using polynomial models of different complexities.

In [None]:
# Create synthetic non-linear data for demonstration
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import warnings

# Suppress warnings for high-degree polynomial demonstration
warnings.filterwarnings('ignore', category=RuntimeWarning)

np.random.seed(42)
X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
# Quadratic signal plus Gaussian noise (despite its name, y_true holds the noisy observations)
y_true = 2 * X_demo.ravel() + 0.5 * X_demo.ravel()**2 + np.random.normal(0, 8, 50)

# Split the demonstration data
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_true, test_size=0.3, random_state=42
)

# Create three models with different complexity
models = {
    'Linear (Underfit)': LinearRegression(),
    'Polynomial-2 (Good)': Pipeline([('poly', PolynomialFeatures(degree=2)), ('linear', LinearRegression())]),
    'Polynomial-15 (Overfit)': Pipeline([('poly', PolynomialFeatures(degree=15)), ('linear', LinearRegression())])
}

results = {}
for name, model in models.items():
    model.fit(X_demo_train, y_demo_train)
    train_rmse = root_mean_squared_error(y_demo_train, model.predict(X_demo_train))
    test_rmse = root_mean_squared_error(y_demo_test, model.predict(X_demo_test))
    results[name] = {'train_rmse': train_rmse, 'test_rmse': test_rmse}

# Display results
print("RMSE Comparison:")
print(f"{'Model':<25} {'Train RMSE':<12} {'Test RMSE':<12} {'Interpretation'}")
print(f"{'-'*70}")
for name, metrics in results.items():
    if 'Underfit' in name:
        interpretation = 'Poor on both'
    elif 'Good' in name:
        interpretation = 'Good on both'
    else:
        interpretation = 'Great on train, poor on test'
    print(f"{name:<25} {metrics['train_rmse']:<12.1f} {metrics['test_rmse']:<12.1f} {interpretation}")
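
One way to make "complexity" concrete is to count the columns that `PolynomialFeatures` produces for a single input feature. This is a minimal sketch: the small `x_col` array is made up for illustration, and degree 1 stands in for the plain linear model.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single made-up input column; only its shape matters here
x_col = np.linspace(0, 10, 5).reshape(-1, 1)

column_counts = {}
for degree in (1, 2, 15):
    expanded = PolynomialFeatures(degree=degree).fit_transform(x_col)
    column_counts[degree] = expanded.shape[1]  # includes the constant (bias) column
    print(f"degree={degree:<2} -> {expanded.shape[1]} columns")
```

With one input feature, a degree-`d` expansion yields `d + 1` columns, so the degree-15 pipeline has 16 columns to fit against the 35 training points in the demonstration above. That much flexibility lets the model chase noise in the training data.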

### 🏃‍♂️ Try It Yourself

Create a polynomial model with degree 5 and evaluate its performance. Does it underfit, overfit, or show good generalization? Explain your reasoning.

In [None]:
# Your code here


## Real-World Application: Advertising Dataset Evaluation

Let's apply comprehensive evaluation to the Advertising dataset from Chapter 21, building multiple models and comparing their performance.

In [None]:
# Load the full Advertising dataset
# You can use this GitHub URL or load locally if you have the file
advertising_url = "https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/Advertising.csv"
advertising = pd.read_csv(advertising_url)

print("Advertising dataset shape:", advertising.shape)
print("\nFirst few rows:")
print(advertising.head())
print("\nSummary statistics:")
print(advertising.describe())

In [None]:
# Build and evaluate three different models
models_to_compare = {
    'TV Only': ['TV'],
    'TV + Radio': ['TV', 'radio'],
    'All Channels': ['TV', 'radio', 'newspaper']
}

model_results = {}

for model_name, features in models_to_compare.items():
    # Prepare data
    X = advertising[features]
    y = advertising['sales']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    
    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Make predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    # Calculate metrics
    train_r2 = r2_score(y_train, train_pred)
    test_r2 = r2_score(y_test, test_pred)
    train_rmse = root_mean_squared_error(y_train, train_pred)
    test_rmse = root_mean_squared_error(y_test, test_pred)
    train_mae = mean_absolute_error(y_train, train_pred)
    test_mae = mean_absolute_error(y_test, test_pred)
    
    model_results[model_name] = {
        'train_r2': train_r2, 'test_r2': test_r2,
        'train_rmse': train_rmse, 'test_rmse': test_rmse,
        'train_mae': train_mae, 'test_mae': test_mae,
        'model': model
    }

# Display comparison
print("Model Comparison - R² Performance:")
print(f"{'Model':<15} {'Train R²':<12} {'Test R²':<12} {'Difference':<12}")
print(f"{'-'*55}")
for name, results in model_results.items():
    diff = results['train_r2'] - results['test_r2']
    print(f"{name:<15} {results['train_r2']:<12.3f} {results['test_r2']:<12.3f} {diff:<12.3f}")

print("\nModel Comparison - RMSE Performance:")
print(f"{'Model':<15} {'Train RMSE':<12} {'Test RMSE':<12} {'Generalization':<15}")
print(f"{'-'*65}")
for name, results in model_results.items():
    if results['test_rmse'] <= results['train_rmse'] * 1.1:  # Within 10%
        generalization = "Good"
    elif results['test_rmse'] <= results['train_rmse'] * 1.2:  # Within 20%
        generalization = "Fair"
    else:
        generalization = "Poor"
    print(f"{name:<15} {results['train_rmse']:<12.2f} {results['test_rmse']:<12.2f} {generalization:<15}")

### 🏃‍♂️ Try It Yourself

Based on the model comparison above:
1. Which model would you deploy for business use? Why?
2. Calculate MAPE for your chosen model and interpret it in business terms
3. If a marketing manager has a $50,000 advertising budget, how would you use your model to guide allocation?

In [None]:
# Your analysis here


## 🚀 Practice Challenges

Test your understanding with these additional exercises that combine multiple concepts from the chapter.

### Challenge 1: Multiple Random Splits Analysis

Instead of relying on a single train/test split, evaluate model stability by using multiple different random splits. Test the same model with 5 different random_state values and analyze the variation in performance metrics. What does this tell you about the reliability of your evaluation?

In [None]:
# Your solution here


### Challenge 2: Business Metric Design

Create a custom business metric that combines prediction accuracy with cost considerations. For example, if overestimating sales costs $10 per unit in excess inventory, while underestimating costs $50 per unit in lost opportunity, design a metric that reflects these asymmetric costs.

In [None]:
# Your solution here


### Challenge 3: Evaluation Across Time

If the advertising dataset represented time-series data (e.g., weeks 1-200), how would you modify your evaluation approach? Implement a time-aware train/test split and discuss the implications.

In [None]:
# Your solution here


## 📝 Chapter Summary

In this notebook, you practiced:

- ✅ Calculating SSE manually and understanding why regression minimizes squared errors
- ✅ Computing and interpreting R² as a measure of explained variation
- ✅ Applying different error metrics (MSE, RMSE, MAE, MAPE) and understanding their business contexts
- ✅ Implementing train/test splits to evaluate model generalization
- ✅ Diagnosing overfitting vs underfitting using train/test performance comparisons

## 🔗 Connections to Other Chapters

- **Previous chapters**: Chapter 21 (regression modeling) provided the foundation models that this chapter teaches you to evaluate
- **Upcoming chapters**: These evaluation techniques will apply to every machine learning algorithm you'll learn, from decision trees to neural networks

## 📚 Additional Resources

- [Scikit-learn Model Evaluation Documentation](https://scikit-learn.org/stable/modules/model_evaluation.html)
- [Cross-validation techniques](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Metrics for regression problems](https://scikit-learn.org/stable/modules/classes.html#regression-metrics)

## 🎯 Next Steps

1. **Review the chapter** to reinforce theoretical concepts
2. **Complete the end-of-chapter exercises** in the textbook
3. **Practice with your own datasets** to build confidence
4. **Apply these evaluation techniques** to future modeling projects