# Week 12 Lab: The Professional ML Workflow

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/12_wk12_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to Week 12! This week marks a turning point in your data science journey. Up until now, you've been learning the fundamentals of building machine learning models—and you've done great work. But this week, you're leveling up from "beginner data scientist" to "production-ready professional."

Here's the uncomfortable truth: **we've been breaking a fundamental rule of production machine learning**. Every time you compared different model settings or tuned hyperparameters based on test set performance, you were "peeking" at the test set—making your performance estimates less trustworthy. This week, you'll learn the **proper workflow** that ensures honest, reliable model performance.

In this lab, you'll practice three critical professional skills: **cross-validation** (comparing models without contaminating your test set), **hyperparameter tuning** (systematically finding optimal model configurations), and **feature engineering** (transforming raw data into powerful model inputs). By the end, you'll be able to build models the right way—with techniques that work in production, not just in the classroom.

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Implement k-fold cross-validation to compare models without touching the test set
- Use GridSearchCV to systematically tune hyperparameters across multiple parameters
- Apply feature engineering techniques including encoding, scaling, and creating new features
- Build end-to-end pipelines that prevent data leakage and ensure reproducible workflows
- Execute the complete 5-stage professional ML workflow from data preparation through final evaluation

## 📚 This Lab Reinforces
- **Chapter 28: Cross-Validation** — K-fold CV, test set contamination, the proper 5-stage workflow
- **Chapter 29: Hyperparameter Tuning** — Bias-variance tradeoff, GridSearchCV, systematic parameter search
- **Chapter 30: Feature Engineering** — Encoding strategies, feature scaling, pipelines, preventing data leakage

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Group work (2-4 students)

- **[0–30 min]** Part A: Guided Reinforcement — TA-led practice with cross-validation, GridSearchCV, and feature engineering
- **[30–40 min]** Class Q&A — Discussion and clarification of key concepts
- **[40–72 min]** Part B: Independent Challenges — 6 group challenges applying the complete professional workflow
- **[72–75 min]** Wrap-Up & Reflection — What you learned and next steps

You are encouraged to work in small groups of **2–4 students** and complete the lab together.

## 💡 Why This Matters
These are the skills that separate junior data scientists from professionals who build production-ready models. When you interview for data science positions, employers will ask about cross-validation, hyperparameter tuning, and feature engineering—they want to know you can build models that perform reliably in the real world, not just on classroom exercises.

More importantly, these techniques ensure your models are trustworthy. When you tell a business stakeholder "this model will achieve 85% accuracy in production," you need to be confident that estimate is honest and reliable. The professional workflow you'll practice today is what makes that confidence possible.

## Setup
We'll use the familiar Ames housing dataset for today's lab. You've worked with this data throughout the course, which means you can focus on learning the new workflow techniques rather than getting familiar with unfamiliar data.

Make sure you have the required libraries installed and the dataset accessible.

In [9]:
# Required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Display settings
pd.set_option('display.max_columns', None)
np.random.seed(42)

# Hide warnings for cleaner output
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


In [2]:
# Load the Ames housing data
# adjust path if running in Google Colab
try:
    # Try local path first
    ames = pd.read_csv('../data/ames_clean.csv')
except FileNotFoundError:
    # If in Colab, load from GitHub
    url = 'https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv'
    ames = pd.read_csv(url)

# Quick preview
print(f"Dataset shape: {ames.shape}")
print(f"\nFirst few rows:")
ames.head()

Dataset shape: (1460, 81)

First few rows:


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## Part A — Guided Reinforcement (30 minutes)

In this section, your TA will lead you through the key techniques step-by-step. Follow along, run the code, and ask questions as we go. This guided practice will prepare you for the independent challenges in Part B.

### Section 1: Cross-Validation Basics (10 minutes)

Let's start by understanding **why** cross-validation matters and **how** to implement it properly.

**The Problem:** In previous weeks, you evaluated models on the test set multiple times to make decisions. Each "peek" at the test set made your performance estimates less trustworthy.

**The Solution:** Cross-validation lets you compare models and tune hyperparameters using only the training data, keeping your test set pristine for final evaluation.

**📋 Step-by-step instructions:**
1. Split data into train/test sets (80/20 split)
2. Select a subset of features for simplicity
3. Use `cross_val_score()` to evaluate a model with 5-fold CV
4. Interpret the CV scores (mean and standard deviation)

In [3]:
# Step 1: Create features and target
# We'll use a few numerical features to start
features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'OverallQual']
X = ames[features]
y = ames['SalePrice']

# Step 2: Split into train and test (test set is LOCKED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print("\n🔒 Test set is now LOCKED. We won't touch it until the very end.")
print("\nLet's check out our training data features:")
X_train.head()

Training set size: 1168
Test set size: 292

🔒 Test set is now LOCKED. We won't touch it until the very end.

Let's check out our training data features:


Unnamed: 0,GrLivArea,YearBuilt,TotalBsmtSF,GarageCars,FullBath,OverallQual
254,1314,1957,1314,1,1,5
1066,1571,1993,799,2,2,6
638,796,1910,796,0,1,5
799,1768,1937,731,1,1,5
380,1691,1924,1026,1,2,5


In [4]:
# Step 3: Use cross-validation to evaluate a decision tree
# Notice: We ONLY use X_train and y_train here!
dt = DecisionTreeRegressor(max_depth=10, random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(
    dt, X_train, y_train, 
    cv=5, 
    scoring='neg_root_mean_squared_error'
)

# Convert to positive RMSE
cv_rmse = -cv_scores

print("Cross-Validation Results (5 folds):")
print(f"RMSE per fold: {cv_rmse}")
print(f"\nMean RMSE: ${cv_rmse.mean():,.0f}")
print(f"Std Dev: ${cv_rmse.std():,.0f}")
print("\n💡 The mean tells us expected performance, std dev tells us consistency")

Cross-Validation Results (5 folds):
RMSE per fold: [53211.61847297 53539.25783402 38889.51059663 38369.22929503
 35559.99196675]

Mean RMSE: $43,914
Std Dev: $7,809

💡 The mean tells us expected performance, std dev tells us consistency


### 🧠 Your Turn — Compare Two Models with Cross-Validation

Now you try! Use cross-validation to compare a shallow decision tree (`max_depth=5`) against a deeper tree (`max_depth=15`).

**Tasks:**
- Create two DecisionTreeRegressor models with different max_depth values
- Evaluate both using 5-fold cross-validation on the training data
- Compare the mean RMSE for both models
- Decide which model performs better based on CV scores (not test scores!)

💡 **Hint:** Use the same pattern as above, just change the `max_depth` parameter

In [None]:
# Your code here


### ✅ Check Your Understanding

**Questions to consider:**
- Which model had better (lower) mean RMSE?
- Which model had more consistent scores across folds (lower std dev)?
- Why is comparing models with CV better than repeatedly checking the test set?

**Expected Result:** You should see that one model generalizes better than the other, and you made this decision without ever touching the test set!

### Section 2: Hyperparameter Tuning with GridSearchCV (10 minutes)

Now let's automate the hyperparameter search process. Instead of manually trying different values one by one, GridSearchCV will systematically test all combinations and use cross-validation to find the best configuration.

**Why this matters:** Manual tuning is tedious, error-prone, and doesn't explore the full parameter space. GridSearchCV does this systematically and reproducibly.

### Guided Example: Tuning a Random Forest

Let's tune a Random Forest model by searching over multiple hyperparameters simultaneously.

**Example:** We'll search over `n_estimators`, `max_depth`, and `min_samples_split` to find the optimal combination.

In [5]:
# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5]
}

# Create the model
rf = RandomForestRegressor(random_state=42)

# Create GridSearchCV
grid_search = GridSearchCV(
    rf, 
    param_grid, 
    cv=5, 
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

# Fit on training data (this will try 2 × 3 × 2 = 12 configs, each with 5-fold CV = 60 models!)
print("Starting grid search... this may take a minute")
grid_search.fit(X_train, y_train)

# Display results
print("\n" + "="*50)
print("Grid Search Complete!")
print("="*50)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: ${-grid_search.best_score_:,.0f}")

Starting grid search... this may take a minute
Fitting 5 folds for each of 12 candidates, totalling 60 fits

Grid Search Complete!

Best parameters: {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 100}
Best CV RMSE: $33,509


### 🧪 Practice Exercise — Tune a Decision Tree

**Business Scenario:** You're building a decision tree model for your real estate company's pricing tool. You need to find the optimal complexity settings.

**Your Task:** Use GridSearchCV to tune a DecisionTreeRegressor across these parameters:
- `max_depth`: [5, 10, 15, 20]
- `min_samples_split`: [2, 5, 10]
- `min_samples_leaf`: [1, 2, 4]

**Step-by-step approach:**
1. Define the parameter grid as a dictionary
2. Create a DecisionTreeRegressor
3. Create GridSearchCV with cv=5
4. Fit on X_train and y_train
5. Print the best parameters and best CV score

In [None]:
# Your solution here


### Section 3: Feature Engineering Essentials (10 minutes)

Raw data often isn't in the best format for machine learning. Feature engineering transforms your data to help models learn better patterns. Let's practice the core techniques: encoding categorical variables and scaling numerical features.

#### Scaling Numerical Features

First, let's see how scaling may improve model performance. Let's first train a model without scaling, then apply StandardScaler and see if performance improves.

In [17]:
# Train a KNN regression model using 5-fold cross-validation
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)

cv_scores = cross_val_score(
    knn, X_train, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
cv_rmse_lr = -cv_scores

print("KNN Regression Cross-Validation Results (5 folds):")
print(f"Mean RMSE: ${cv_rmse.mean():,.0f}")
print(f"Std Dev: ${cv_rmse.std():,.0f}")
print(f"\n💡 Represents baseline performance with raw feature values.")

KNN Regression Cross-Validation Results (5 folds):
Mean RMSE: $33,793
Std Dev: $4,270

💡 Represents baseline performance with raw feature values.


In [18]:
# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training stats to transform test

# Train a KNN regression model using 5-fold cross-validation on scaled features
knn = KNeighborsRegressor(n_neighbors=5)
cv_scores = cross_val_score(
    knn, X_train_scaled, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
cv_rmse = -cv_scores

print("KNN Regression with Scaled Features (5 folds):")
print(f"Mean RMSE: ${cv_rmse.mean():,.0f}")
print(f"Std Dev: ${cv_rmse.std():,.0f}")
print("\n💡 Scaling numerical features can help some models like KNN perform better!")

KNN Regression with Scaled Features (5 folds):
Mean RMSE: $33,793
Std Dev: $4,270

💡 Scaling numerical features can help some models like KNN perform better!


**🎯 When scaling does matter**

| Model Type                             | Does Scaling Change Performance? | Why                                         |
| -------------------------------------- | -------------------------------- | ------------------------------------------- |
| **LinearRegression (OLS)**             | ❌ No                             | Closed-form solution, not iterative         |
| **Ridge / Lasso / ElasticNet**         | ✅ Yes                            | Regularization depends on coefficient sizes |
| **Logistic Regression (solver-based)** | ✅ Yes                            | Gradient descent & regularization           |
| **SVM / SVR**                          | ✅ Yes                            | Distance-based kernel computations          |
| **KNN / K-means / PCA / clustering**   | ✅ Yes                            | Distance/variance sensitive                 |
| **Neural networks**                    | ✅ Yes                            | Gradient descent stability                  |
| **Tree models (RF, XGBoost, etc.)**    | ❌ No                             | Trees are scale-invariant                   |
