# Hyperparameter Tuning: Finding Optimal Model Configurations

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/29_hyperparameter_tuning.ipynb)

This companion notebook provides hands-on exercises for the **Hyperparameter Tuning** chapter. You'll explore the bias-variance tradeoff, use K-Nearest Neighbors to see how hyperparameters affect model performance, and systematically tune decision trees and random forests using grid search.

**What you'll practice**
- Understand the bias-variance tradeoff through visualization
- Tune K-Nearest Neighbors (KNN) regression to find optimal K
- Use GridSearchCV to systematically search hyperparameter spaces
- Tune decision trees and random forests with multiple hyperparameters
- Compare grid search vs. random search
- Apply the complete 5-stage workflow: split, tune, compare, train, evaluate

**How to use**
- Run from top to bottom. When you see **🏃‍♂️ Try It Yourself**, add your code beneath the prompt.
- In Colab: `Runtime → Restart and run all` to test from a clean environment.

## 0) Setup

Install and import the required packages. In local environments where these libraries are already installed, you can skip the install cell.

In [None]:
# If using Colab/a fresh env, uncomment to install
# !pip -q install scikit-learn pandas numpy matplotlib ISLP optuna

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ISLP import load_data

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from scipy.stats import randint

# Set random seed for reproducibility
np.random.seed(42)

## 1) The Bias-Variance Tradeoff: Visualizing Underfitting vs. Overfitting

We'll create synthetic data (a sine wave plus a linear trend, with added noise) and fit three models with different complexity levels to illustrate:
- **High bias (underfitting)**: Model too simple to capture the pattern
- **Balanced (good fit)**: Model captures the signal without memorizing noise
- **High variance (overfitting)**: Model memorizes training data including noise
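
To put rough numbers on these three cases, the sketch below refits a model on many freshly generated noisy copies of the same signal and measures squared bias and variance at the training points. It is an illustration under stated assumptions, not part of the chapter's code: the helper `bias_variance`, the names `f_true` and `n_repeats`, and the choice of 200 repeats are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
f_true = np.sin(x.ravel()) + 0.5 * x.ravel()  # same signal as the notebook

def bias_variance(make_model, n_repeats=200):
    """Estimate squared bias and variance at the training points (hypothetical helper)."""
    preds = np.empty((n_repeats, len(x)))
    for i in range(n_repeats):
        # New noise draw each time: a different training set from the same signal
        y_noisy = f_true + rng.normal(0, 0.4, size=len(x))
        preds[i] = make_model().fit(x, y_noisy).predict(x)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)  # how far the average fit is from the truth
    variance = np.mean(preds.var(axis=0))         # how much fits change between training sets
    return bias_sq, variance

lin_bias, lin_var = bias_variance(LinearRegression)
tree_bias, tree_var = bias_variance(lambda: DecisionTreeRegressor(random_state=0))

print(f"Linear model: bias^2 = {lin_bias:.3f}, variance = {lin_var:.3f}")
print(f"Deep tree:    bias^2 = {tree_bias:.3f}, variance = {tree_var:.3f}")
```

The linear model should show the larger squared bias and the fully grown tree the larger variance, which is the tradeoff the plots in this section visualize.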

In [None]:
# Generate synthetic data: sine wave with noise
np.random.seed(42)
X = np.linspace(0, 10, 100)
y_true = np.sin(X) + 0.5 * X  # True underlying pattern
y = y_true + np.random.normal(0, 0.4, 100)  # Add noise

# Reshape for sklearn
X_reshaped = X.reshape(-1, 1)
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Three models with different complexity
# 1. High bias (underfitting): Linear regression
bias_model = LinearRegression()
bias_model.fit(X_reshaped, y)
y_bias = bias_model.predict(X_plot)

# 2. Balanced (good fit): Random forest with limited depth
good_model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
good_model.fit(X_reshaped, y)
y_good = good_model.predict(X_plot)

# 3. High variance (overfitting): Deep decision tree
variance_model = DecisionTreeRegressor(max_depth=None, min_samples_split=2, random_state=42)
variance_model.fit(X_reshaped, y)
y_variance = variance_model.predict(X_plot)

In [None]:
# Visualize bias-variance tradeoff
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Left: High bias (underfitting)
axes[0].scatter(X, y, alpha=0.6, s=30, edgecolors='black', label='Training data')
axes[0].plot(X_plot, y_bias, 'r-', linewidth=2.5, label='Linear model')
axes[0].plot(X, y_true, 'g--', linewidth=1.5, alpha=0.7, label='True pattern')
axes[0].set_title('High Bias (Underfitting)\nToo Simple', fontsize=11, fontweight='bold')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)

# Middle: Balanced (good fit)
axes[1].scatter(X, y, alpha=0.6, s=30, edgecolors='black', label='Training data')
axes[1].plot(X_plot, y_good, 'b-', linewidth=2.5, label='Random forest')
axes[1].plot(X, y_true, 'g--', linewidth=1.5, alpha=0.7, label='True pattern')
axes[1].set_title('Balanced (Good Fit)\nJust Right', fontsize=11, fontweight='bold')
axes[1].set_xlabel('X')
axes[1].set_ylabel('y')
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

# Right: High variance (overfitting)
axes[2].scatter(X, y, alpha=0.6, s=30, edgecolors='black', label='Training data')
axes[2].plot(X_plot, y_variance, 'orange', linewidth=2.5, label='Deep tree')
axes[2].plot(X, y_true, 'g--', linewidth=1.5, alpha=0.7, label='True pattern')
axes[2].set_title('High Variance (Overfitting)\nToo Complex', fontsize=11, fontweight='bold')
axes[2].set_xlabel('X')
axes[2].set_ylabel('y')
axes[2].legend(fontsize=9)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🏃‍♂️ Try It Yourself
- What happens if you change `max_depth=5` to `max_depth=2` for the random forest? Does the model's bias or its variance increase?
- Try `max_depth=15` for the decision tree. Does overfitting improve or worsen?

## 2) K-Nearest Neighbors: A Case Study in Bias-Variance

KNN has a single intuitive hyperparameter: **K** (number of neighbors). Let's see how different K values affect performance on our sine wave data.
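
Before tuning K, it can help to see what a KNN regressor does with it. The sketch below is a hand-rolled version for a single feature, assuming uniform weights and absolute distance: it averages the targets of the K closest training points. The helper `knn_predict` and the `*_demo` data are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(0, 0.2, size=30)
x_new = np.array([[2.5], [7.0]])

def knn_predict(X_train, y_train, X_query, k):
    """Average the targets of the k nearest training points (single feature only)."""
    dists = np.abs(X_query - X_train.T)         # shape: (n_query, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest points
    return y_train[nearest].mean(axis=1)

manual = knn_predict(X_demo, y_demo, x_new, k=5)
sk = KNeighborsRegressor(n_neighbors=5).fit(X_demo, y_demo).predict(x_new)

print("Manual: ", manual)
print("sklearn:", sk)
```

With K=1 the prediction copies a single neighbor (noise included); with a large K it averages over much of the data and smooths away the pattern.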

In [None]:
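# Try different K values
from sklearn.model_selection import KFold

k_values = [1, 2, 5, 10, 25, 50, 75]

# X is sorted, so shuffle the folds; unshuffled folds would be contiguous blocks of X
cv = KFold(n_splits=5, shuffle=True, random_state=42)

print("K-Nearest Neighbors Regressor: Effect of K on Performance")
print("=" * 65)

for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    
    # Use cross-validation to evaluate (negative MSE)
    cv_scores = cross_val_score(knn, X_reshaped, y, cv=cv, scoring='neg_mean_squared_error')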
    
    # Also check training score to see overfitting
    knn.fit(X_reshaped, y)
    train_score = knn.score(X_reshaped, y)
    
    print(f"K={k:2d}  |  Train R²: {train_score:.3f}  |  CV MSE: {-cv_scores.mean():.3f} (±{cv_scores.std():.3f})")

**Key observations:**
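- **K=1**: Training R² is a perfect 1.0 (each point is its own nearest neighbor) but CV error is higher → Overfitting
- **Larger K**: Training R² decreases; CV error typically improves at first, then worsens as the model oversmooths
- **Optimal K**: Around K=5-7, where CV error is minimized (check the printed table for the exact value on your run)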

In [None]:
# Visualize different K values
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

k_examples = [75, 5, 1]
titles = ['High Bias (K=75)', 'Balanced (K=5)', 'High Variance (K=1)']
colors = ['red', 'green', 'blue']

for idx, (k, title, color) in enumerate(zip(k_examples, titles, colors)):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_reshaped, y)
    y_pred = knn.predict(X_plot)
    
    axes[idx].scatter(X, y, alpha=0.4, s=20, color='gray', label='Training data')
    axes[idx].plot(X_plot, y_pred, color=color, linewidth=2.5, label=f'KNN (K={k})')
    axes[idx].set_xlabel('X', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🏃‍♂️ Try It Yourself
- Plot training R² vs. CV MSE for all K values. Where do you see the biggest gap (overfitting)?
- Try K values from 1 to 100 in steps of 5. What K gives the best CV score?

## 3) Grid Search: Automating Hyperparameter Tuning

Instead of manually trying different K values, we can use **GridSearchCV** to systematically search and find the optimal hyperparameter.
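
Conceptually, grid search is a loop over every combination in the grid, scoring each one with cross-validation and keeping the best. The sketch below spells that loop out on a small synthetic dataset; the `*_demo` names and the tiny grid are assumptions for illustration, and `GridSearchCV` adds conveniences (result tables, parallelism, automatic refit) on top of this idea.

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(80, 1))
y_demo = np.sin(X_demo.ravel()) + 0.5 * X_demo.ravel() + rng.normal(0, 0.4, size=80)

# Every combination of the listed values (3 x 2 = 6 candidates)
grid = ParameterGrid({'n_neighbors': [1, 5, 20], 'weights': ['uniform', 'distance']})
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = []
for params in grid:
    model = KNeighborsRegressor(**params)
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring='neg_mean_squared_error')
    results.append((scores.mean(), params))

# Highest negative MSE means lowest MSE
best_score, best_params = max(results, key=lambda r: r[0])

# Refit the winning configuration on all of the data used for tuning
best_model = KNeighborsRegressor(**best_params).fit(X_demo, y_demo)

print(f"Best params: {best_params}")
print(f"Best CV MSE: {-best_score:.3f}")
```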

In [None]:
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_reshaped, y, test_size=0.3, random_state=42
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

In [None]:
# Step 1: Define the parameter grid
param_grid = {
    'n_neighbors': [1, 3, 5, 7, 10, 15, 20, 30, 50]
}

# Step 2: Create GridSearchCV object
knn = KNeighborsRegressor()
grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=5,                               # 5-fold cross-validation
    scoring='neg_mean_squared_error',   # Metric to optimize
    return_train_score=True,            # Also return training scores
    verbose=1                           # Show progress
)

# Step 3: Fit grid search (tries all combinations)
grid_search.fit(X_train, y_train)

# Step 4: View results
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score (MSE): {-grid_search.best_score_:.3f}")  # best_score_ is negative MSE, so flip the sign
print(f"\nBest model (already retrained on all training data):")
print(grid_search.best_estimator_)

In [None]:
# Convert results to DataFrame for easier viewing
results_df = pd.DataFrame(grid_search.cv_results_)

# Select relevant columns
results_summary = results_df[[
    'param_n_neighbors',
    'mean_train_score',
    'mean_test_score',
    'std_test_score'
]].copy()

# Scores are negative MSE, so values closer to 0 are better
results_summary.columns = ['K', 'Train Score', 'CV Score', 'CV Std']
results_summary = results_summary.sort_values('K')

print("\nGrid Search Results Summary:")
print(results_summary.to_string(index=False))

In [None]:
# Evaluate best model on test set (Stage 5: final evaluation)
from sklearn.metrics import mean_squared_error, r2_score

y_pred_test = grid_search.best_estimator_.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred_test)
test_r2 = r2_score(y_test, y_pred_test)

print(f"\nFinal Test Performance:")
print(f"Test MSE: {test_mse:.3f}")
print(f"Test R²: {test_r2:.3f}")

### 🏃‍♂️ Try It Yourself
- Add more hyperparameters to the grid: `'weights': ['uniform', 'distance']` and `'p': [1, 2]`
- How many total model fits does GridSearchCV perform with these additional parameters?
- Does adding these parameters improve CV performance?

## 4) Tuning Decision Trees

Decision trees have multiple hyperparameters. Let's tune them on the Default dataset (classification task).

In [None]:
# Load and prepare Default dataset
Default = load_data('Default')
X = pd.get_dummies(Default[['balance', 'income', 'student']], drop_first=True)
y = (Default['default'] == 'Yes').astype(int)

# Train/test split (stratified for imbalanced data)
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train_dt)} samples")
print(f"Test set: {len(X_test_dt)} samples")
print(f"\nClass distribution (training):")
print(y_train_dt.value_counts())

In [None]:
# Define parameter grid for decision tree
param_grid_tree = {
    'max_depth': [3, 5, 7, 10, 15, 20, None],
    'min_samples_split': [2, 10, 20, 50],
    'min_samples_leaf': [1, 5, 10, 20]
}

# Create GridSearchCV
tree = DecisionTreeClassifier(random_state=42)
grid_search_tree = GridSearchCV(
    estimator=tree,
    param_grid=param_grid_tree,
    cv=5,
    scoring='roc_auc',  # Better metric for imbalanced data
    verbose=1,
    n_jobs=-1  # Use all CPU cores
)

# Fit grid search
print("Searching for best decision tree hyperparameters...")
print(f"Total combinations to try: {len(param_grid_tree['max_depth']) * len(param_grid_tree['min_samples_split']) * len(param_grid_tree['min_samples_leaf'])}")
grid_search_tree.fit(X_train_dt, y_train_dt)

print(f"\nBest parameters: {grid_search_tree.best_params_}")
print(f"Best CV ROC AUC: {grid_search_tree.best_score_:.4f}")

**Note:** This grid contains 7 × 4 × 4 = 112 hyperparameter combinations. With 5-fold CV, that is 112 × 5 = 560 model fits, plus one final refit of the best configuration on the full training set.
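
If you want to check the size of a grid before launching a search, one option is to let scikit-learn's `ParameterGrid` enumerate it. The sketch below copies the decision tree grid from this section into a hypothetical `param_grid_tree_demo` so it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the decision tree grid used above
param_grid_tree_demo = {
    'max_depth': [3, 5, 7, 10, 15, 20, None],
    'min_samples_split': [2, 10, 20, 50],
    'min_samples_leaf': [1, 5, 10, 20],
}

n_combos = len(ParameterGrid(param_grid_tree_demo))  # 7 * 4 * 4
n_folds = 5
n_fits = n_combos * n_folds                          # CV fits, excluding the final refit

print(f"Combinations: {n_combos}")
print(f"CV fits:      {n_fits}")
```

Adding one more hyperparameter with four values would multiply both numbers by four, which is why grids grow expensive quickly.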

In [None]:
# Evaluate on test set
best_tree = grid_search_tree.best_estimator_
y_pred_proba_tree = best_tree.predict_proba(X_test_dt)[:, 1]
tree_test_auc = roc_auc_score(y_test_dt, y_pred_proba_tree)

print(f"\nDecision Tree Final Performance:")
print(f"  Best params: {grid_search_tree.best_params_}")
print(f"  CV ROC AUC:  {grid_search_tree.best_score_:.4f}")
print(f"  Test ROC AUC: {tree_test_auc:.4f}")

## 5) Tuning Random Forests

Random forests have additional hyperparameters like `n_estimators` and `max_features`.
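
One caveat worth checking for small feature sets: the Default feature matrix has only three columns, and `'sqrt'` and `'log2'` both appear to resolve to a single feature per split in that case, so the two settings behave identically. The sketch below inspects the fitted trees to confirm this, assuming current scikit-learn behavior; it uses a synthetic three-feature dataset (`X_demo`, `y_demo`) rather than the Default data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with three features, like the Default feature matrix
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

resolved = {}
for setting in ['sqrt', 'log2', None]:
    rf_demo = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=0
    ).fit(X_demo, y_demo)
    # max_features_ on a fitted tree is the number of features considered per split
    resolved[setting] = rf_demo.estimators_[0].max_features_

print(resolved)
```

On datasets with more features the two settings diverge, so searching over both becomes meaningful there.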

In [None]:
# Define parameter grid for random forest (smaller to save computation)
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 10, 20]
}

# Create GridSearchCV
rf = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(
    estimator=rf,
    param_grid=param_grid_rf,
    cv=5,
    scoring='roc_auc',
    verbose=1,
    n_jobs=-1
)

# Fit grid search
print("Searching for best random forest hyperparameters...")
print(f"Total combinations to try: {len(param_grid_rf['n_estimators']) * len(param_grid_rf['max_depth']) * len(param_grid_rf['max_features']) * len(param_grid_rf['min_samples_split'])}")
grid_search_rf.fit(X_train_dt, y_train_dt)

print(f"\nBest parameters: {grid_search_rf.best_params_}")
print(f"Best CV ROC AUC: {grid_search_rf.best_score_:.4f}")

In [None]:
# Evaluate on test set
best_rf = grid_search_rf.best_estimator_
y_pred_proba_rf = best_rf.predict_proba(X_test_dt)[:, 1]
rf_test_auc = roc_auc_score(y_test_dt, y_pred_proba_rf)

print(f"\nRandom Forest Final Performance:")
print(f"  Best params: {grid_search_rf.best_params_}")
print(f"  CV ROC AUC:  {grid_search_rf.best_score_:.4f}")
print(f"  Test ROC AUC: {rf_test_auc:.4f}")

In [None]:
# Compare decision tree vs. random forest
print("\nFinal Model Comparison:")
print("=" * 60)
print(f"Decision Tree:")
print(f"  Best params: {grid_search_tree.best_params_}")
print(f"  CV ROC AUC:  {grid_search_tree.best_score_:.4f}")
print(f"  Test ROC AUC: {tree_test_auc:.4f}")
print()
print(f"Random Forest:")
print(f"  Best params: {grid_search_rf.best_params_}")
print(f"  CV ROC AUC:  {grid_search_rf.best_score_:.4f}")
print(f"  Test ROC AUC: {rf_test_auc:.4f}")
print("=" * 60)

### 🏃‍♂️ Try It Yourself
- Are the CV and test scores close? What does this indicate about overfitting?
- Which model performs better? By how much?
- Try adding `min_samples_leaf` to the random forest parameter grid. Does it improve performance?

## 6) Beyond Grid Search: Random Search

When you have many hyperparameters or a large search space, random search can be more efficient than grid search.
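
To see what random search actually draws, you can sample a few candidates with `ParameterSampler`, which is the sampling utility scikit-learn provides for this purpose. The sketch below uses a small hypothetical `demo_distributions` dictionary; note that `scipy.stats.randint(low, high)` excludes the upper bound.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

demo_distributions = {
    'n_estimators': randint(50, 300),  # integers from 50 to 299 (upper bound excluded)
    'max_depth': [5, 10, None],        # lists are sampled uniformly
}

# Draw five candidate configurations, as a randomized search would with n_iter=5
samples = list(ParameterSampler(demo_distributions, n_iter=5, random_state=42))

for candidate in samples:
    print(candidate)
```

Each candidate is one point in the search space; the number of fits is set by `n_iter` times the number of folds, regardless of how large the space is.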

In [None]:
# Define parameter distributions (instead of fixed values)
param_distributions_rf = {
    'n_estimators': randint(50, 300),           # Random integers from 50 to 299 (upper bound excluded)
    'max_depth': [5, 10, 15, 20, None],         # Can still use lists
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': randint(2, 50),        # Random integers from 2 to 49
    'min_samples_leaf': randint(1, 20)          # Random integers from 1 to 19
}

# RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions_rf,
    n_iter=50,  # Number of random samples to try
    cv=5,
    scoring='roc_auc',
    random_state=42,
    verbose=1,
    n_jobs=-1
)

print("Random search: trying 50 random hyperparameter combinations...")
random_search.fit(X_train_dt, y_train_dt)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV ROC AUC: {random_search.best_score_:.4f}")

In [None]:
# Evaluate random search best model on test set
best_rf_random = random_search.best_estimator_
y_pred_proba_random = best_rf_random.predict_proba(X_test_dt)[:, 1]
rf_random_test_auc = roc_auc_score(y_test_dt, y_pred_proba_random)

print(f"\nRandom Search vs. Grid Search Comparison:")
print(f"Grid Search:   CV AUC = {grid_search_rf.best_score_:.4f}, Test AUC = {rf_test_auc:.4f}")
print(f"Random Search: CV AUC = {random_search.best_score_:.4f}, Test AUC = {rf_random_test_auc:.4f}")

### 🏃‍♂️ Try It Yourself
- Increase `n_iter` to 100. Does random search find better hyperparameters?
- Time both grid search and random search. Which is faster?
- When would you prefer random search over grid search?

## 7) (Optional) Bayesian Optimization with Optuna

For even more efficient hyperparameter tuning, you can use Bayesian optimization libraries like Optuna.

In [None]:
# Uncomment to install optuna
# !pip install optuna

import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)  # Reduce output

def objective(trial):
    # Optuna suggests hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 50),
    }
    
    # Train and evaluate
    rf = RandomForestClassifier(**params, random_state=42)
    score = cross_val_score(rf, X_train_dt, y_train_dt, cv=5, scoring='roc_auc').mean()
    return score

# Run optimization (seed the sampler so the search is reproducible)
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=20, show_progress_bar=True)

print(f"\nBest params: {study.best_params}")
print(f"Best CV ROC AUC: {study.best_value:.4f}")

---

## ✅ Summary: The Complete Workflow

1. **Train/test split** → Lock away test set
2. **Define models and grids** → Set up GridSearchCV for each candidate model
3. **Grid search with CV** → Find best hyperparameters using training set only
4. **Compare models** → Use CV scores to select best model type and configuration
5. **Final evaluation** → Test set evaluation EXACTLY ONCE
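
The five steps can be sketched end to end as follows. This is a compact illustration on synthetic data, not the chapter's analysis: the `candidates` dictionary, the small grids, and the `*_demo` names are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# 1. Train/test split: the test set is not touched until step 5
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 2. Define candidate models and their grids
candidates = {
    'tree': (DecisionTreeClassifier(random_state=0), {'max_depth': [2, 4, None]}),
    'forest': (RandomForestClassifier(n_estimators=25, random_state=0),
               {'max_depth': [2, 4, None]}),
}

# 3. Grid search with CV, using the training set only
searches = {
    name: GridSearchCV(est, grid, cv=5, scoring='roc_auc').fit(X_tr, y_tr)
    for name, (est, grid) in candidates.items()
}

# 4. Compare models on CV score and pick the winner
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_search = searches[best_name]

# 5. Final evaluation: score the refit winner on the test set, once
test_auc = roc_auc_score(y_te, best_search.predict_proba(X_te)[:, 1])

print(f"Selected: {best_name} with {best_search.best_params_}")
print(f"CV ROC AUC:   {best_search.best_score_:.4f}")
print(f"Test ROC AUC: {test_auc:.4f}")
```

Model selection in step 4 relies only on CV scores, so the test score in step 5 remains an honest estimate of performance on new data.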

**Key takeaways:**
- Hyperparameters control the bias-variance tradeoff
- Grid search systematically finds optimal hyperparameters using cross-validation
- Random search is often more efficient for large hyperparameter spaces
- Bayesian optimization (Optuna) can be even more efficient for expensive models
- Always evaluate on test set only once, after all tuning is complete

---

## ✅ End-of-Chapter Exercises

These exercises give you hands-on practice with hyperparameter tuning using grid search and cross-validation.

### Exercise 1: Tuning KNN for Regression

Use the [Ames housing dataset](https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/refs/heads/main/data/ames_clean.csv) to predict `SalePrice` using K-Nearest Neighbors regression.

**Your tasks:**
1. Load the Ames housing data and select at least 5 numerical features
2. Create a train/test split (80/20)
3. Use `GridSearchCV` to tune these KNN hyperparameters:
   - `n_neighbors`: [3, 5, 7, 10, 15, 20, 30, 50]
   - `weights`: ['uniform', 'distance']
   - `p`: [1, 2] (1 = Manhattan distance, 2 = Euclidean distance)
4. Use 5-fold CV and optimize for R² score
5. Report the best hyperparameters and CV score
6. Evaluate the best model on the test set
7. Create a visualization showing how `n_neighbors` affects performance

**Reflection questions:**
- How does the `weights` parameter affect performance?
- What does the `p` parameter control? Which distance metric worked better?
- Is there a large gap between CV and test performance? What does this indicate?

In [None]:
# TODO: Your code for Exercise 1 here

### Exercise 2: Decision Tree Depth Analysis

Systematically analyze how `max_depth` affects decision tree performance on the Default dataset.

**Your tasks:**
1. Load the Default dataset and prepare features
2. Create train/test split (80/20, stratified)
3. For each `max_depth` in [1, 2, 3, 5, 7, 10, 15, 20, None]:
   - Train a decision tree
   - Compute training accuracy and 5-fold CV accuracy
   - Store results
4. Create a line plot showing training vs. CV accuracy across depths
5. Identify the depth where overfitting begins (gap between train and CV widens)
6. Use `GridSearchCV` to tune multiple hyperparameters simultaneously:
   - `max_depth`: [3, 5, 7, 10, 15, None]
   - `min_samples_split`: [2, 10, 20, 50]
   - `min_samples_leaf`: [1, 5, 10, 20]
7. Compare the best tuned tree to your depth-only analysis

**Reflection questions:**
- At what depth does overfitting become apparent?
- Did tuning multiple hyperparameters improve performance over just tuning depth?
- Which hyperparameter had the largest impact on performance?

In [None]:
# TODO: Your code for Exercise 2 here

### Exercise 3: Random Forest Comprehensive Tuning

Perform comprehensive hyperparameter tuning for a random forest classifier on the Default dataset.

**Part A: Grid Search**
1. Define a parameter grid with:
   - `n_estimators`: [50, 100, 200, 300]
   - `max_depth`: [5, 10, 15, 20, None]
   - `max_features`: ['sqrt', 'log2']
   - `min_samples_split`: [2, 10, 20]
2. Use `GridSearchCV` with 5-fold CV and ROC AUC scoring
3. Report best parameters and CV score
4. Evaluate on test set

**Part B: Random Search**
1. Define parameter distributions:
   - `n_estimators`: integers from 50 to 500
   - `max_depth`: [3, 5, 7, 10, 15, 20, None]
   - `max_features`: ['sqrt', 'log2']
   - `min_samples_split`: integers from 2 to 100
   - `min_samples_leaf`: integers from 1 to 50
2. Use `RandomizedSearchCV` with `n_iter=100`
3. Compare results to grid search

**Part C: Analysis**
1. Create a bar chart comparing:
   - Default random forest (no tuning)
   - Grid search tuned
   - Random search tuned
2. Show both CV and test scores
3. Report computation time for each approach

**Reflection questions:**
- Did random search find better hyperparameters than grid search?
- Was random search faster? By how much?
- How much improvement did tuning provide over defaults?
- Would you recommend random search or grid search for this problem?

In [None]:
# TODO: Your code for Exercise 3 here