# Week 12 Lab: The Professional ML Workflow

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/12_wk12_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to Week 12! This week marks a turning point in your data science journey. Up until now, you've been learning the fundamentals of building machine learning models—and you've done great work. But this week, you're leveling up from "beginner data scientist" to "production-ready professional."

Here's the uncomfortable truth: **we've been breaking a fundamental rule of production machine learning**. Every time you compared different model settings or tuned hyperparameters based on test set performance, you were "peeking" at the test set—making your performance estimates less trustworthy. This week, you'll learn the **proper workflow** that ensures honest, reliable model performance.

In this lab, you'll practice three critical professional skills: **cross-validation** (comparing models without contaminating your test set), **hyperparameter tuning** (systematically finding optimal model configurations), and **feature engineering** (transforming raw data into powerful model inputs). By the end, you'll be able to build models the right way—with techniques that work in production, not just in the classroom.

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Implement k-fold cross-validation to compare models without touching the test set
- Use GridSearchCV to systematically tune hyperparameters across multiple parameters
- Apply feature engineering techniques including encoding, scaling, and creating new features
- Build end-to-end pipelines that prevent data leakage and ensure reproducible workflows
- Execute the complete 5-stage professional ML workflow from data preparation through final evaluation

## 📚 This Lab Reinforces
- **Chapter 28: Cross-Validation** — K-fold CV, test set contamination, the proper 5-stage workflow
- **Chapter 29: Hyperparameter Tuning** — Bias-variance tradeoff, GridSearchCV, systematic parameter search
- **Chapter 30: Feature Engineering** — Encoding strategies, feature scaling, pipelines, preventing data leakage

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Group work (2-4 students)

- **[0–30 min]** Part A: Guided Reinforcement — TA-led practice with cross-validation, GridSearchCV, and feature engineering
- **[30–40 min]** Class Q&A — Discussion and clarification of key concepts
- **[40–72 min]** Part B: Independent Challenges — 6 group challenges applying the complete professional workflow
- **[72–75 min]** Wrap-Up & Reflection — What you learned and next steps

You are encouraged to work in small groups of **2–4 students** and complete the lab together.

## 💡 Why This Matters
These are the skills that separate junior data scientists from professionals who build production-ready models. When you interview for data science positions, employers will ask about cross-validation, hyperparameter tuning, and feature engineering—they want to know you can build models that perform reliably in the real world, not just on classroom exercises.

More importantly, these techniques ensure your models are trustworthy. When you tell a business stakeholder "this model will achieve 85% accuracy in production," you need to be confident that estimate is honest and reliable. The professional workflow you'll practice today is what makes that confidence possible.

## Setup
We'll use the familiar Ames housing dataset for today's lab. You've worked with this data throughout the course, so you can focus on the new workflow techniques rather than on learning a new dataset.

Make sure you have the required libraries installed and the dataset accessible.

In [None]:
# Required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Display settings
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✅ All libraries imported successfully!")

In [None]:
# Load the Ames housing data
ames = pd.read_csv('../data/ames_clean.csv')

# Quick preview
print(f"Dataset shape: {ames.shape}")
print("\nFirst few rows:")
ames.head()

## Part A — Guided Reinforcement (30 minutes)

In this section, your TA will lead you through the key techniques step-by-step. Follow along, run the code, and ask questions as we go. This guided practice will prepare you for the independent challenges in Part B.

### Section 1: Cross-Validation Basics (10 minutes)

Let's start by understanding **why** cross-validation matters and **how** to implement it properly.

**The Problem:** In previous weeks, you evaluated models on the test set multiple times to make decisions. Each "peek" at the test set made your performance estimates less trustworthy.

**The Solution:** Cross-validation lets you compare models and tune hyperparameters using only the training data, keeping your test set pristine for final evaluation.

**📋 Step-by-step instructions:**
1. Select a subset of features (and the target) for simplicity
2. Split data into train/test sets (80/20 split)
3. Use `cross_val_score()` to evaluate a model with 5-fold CV
4. Interpret the CV scores (mean and standard deviation)
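
To make the mechanics concrete, here is a minimal sketch of roughly what `cross_val_score()` does for you, written as an explicit loop over `KFold` splits. It runs on synthetic data from `make_regression` rather than the Ames data, and the names `X_demo`, `y_demo`, `manual_rmse`, and `auto_rmse` are hypothetical, chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a training set (assumption: NOT the Ames data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Manual 5-fold CV: train on 4 folds, validate on the held-out fold, repeat
kf = KFold(n_splits=5, shuffle=True, random_state=42)
manual_rmse = []
for train_idx, val_idx in kf.split(X_demo):
    model = DecisionTreeRegressor(max_depth=5, random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], preds)))
manual_rmse = np.array(manual_rmse)

# The same evaluation in one call, using the same splitter
auto_rmse = -cross_val_score(
    DecisionTreeRegressor(max_depth=5, random_state=42),
    X_demo, y_demo, cv=kf, scoring='neg_root_mean_squared_error'
)

print("Manual RMSE per fold:         ", manual_rmse.round(2))
print("cross_val_score RMSE per fold:", auto_rmse.round(2))
```

Both arrays should agree fold by fold, because the splitter and the model settings are identical. The one-line version is simply less code to get wrong.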

In [None]:
# Step 1: Create features and target
# We'll use a few numerical features to start
features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath']
X = ames[features]
y = ames['SalePrice']

# Step 2: Split into train and test (test set is LOCKED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print("\n🔒 Test set is now LOCKED. We won't touch it until the very end.")

In [None]:
# Step 3: Use cross-validation to evaluate a decision tree
# Notice: We ONLY use X_train and y_train here!
dt = DecisionTreeRegressor(max_depth=10, random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(
    dt, X_train, y_train, 
    cv=5, 
    scoring='neg_root_mean_squared_error'
)

# Convert to positive RMSE
cv_rmse = -cv_scores

print("Cross-Validation Results (5 folds):")
print(f"RMSE per fold: {cv_rmse}")
print(f"\nMean RMSE: ${cv_rmse.mean():,.0f}")
print(f"Std Dev: ${cv_rmse.std():,.0f}")
print("\n💡 The mean estimates expected performance; the std dev shows how consistent it is across folds")

### 🧠 Your Turn — Compare Two Models with Cross-Validation

Now you try! Use cross-validation to compare a shallow decision tree (`max_depth=5`) against a deeper tree (`max_depth=15`).

**Tasks:**
- Create two DecisionTreeRegressor models with different max_depth values
- Evaluate both using 5-fold cross-validation on the training data
- Compare the mean RMSE for both models
- Decide which model performs better based on CV scores (not test scores!)

💡 **Hint:** Use the same pattern as above, just change the `max_depth` parameter

In [None]:
# Your code here


### ✅ Check Your Understanding

**Questions to consider:**
- Which model had better (lower) mean RMSE?
- Which model had more consistent scores across folds (lower std dev)?
- Why is comparing models with CV better than repeatedly checking the test set?

**Expected Result:** You should see that one model generalizes better than the other, and you made this decision without ever touching the test set!

### Section 2: Hyperparameter Tuning with GridSearchCV (10 minutes)

Now let's automate the hyperparameter search process. Instead of manually trying different values one by one, GridSearchCV will systematically test all combinations and use cross-validation to find the best configuration.

**Why this matters:** Manual tuning is tedious, error-prone, and doesn't explore the full parameter space. GridSearchCV does this systematically and reproducibly.
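
Before running a search, it helps to know how many model fits a grid implies. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the combinations of a small, made-up grid. `demo_grid` and its values are assumptions for illustration, not the grid used later in this lab.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 2 values x 3 values
demo_grid = {
    'max_depth': [5, 10],
    'min_samples_split': [2, 5, 10],
}

# ParameterGrid expands the dictionary into every combination
combos = list(ParameterGrid(demo_grid))
n_folds = 5
total_fits = len(combos) * n_folds

print(f"{len(combos)} combinations x {n_folds} folds = {total_fits} model fits")
for params in combos:
    print(params)
```

GridSearchCV evaluates the same set of combinations, and with its default `refit=True` it adds one final fit of the best combination on all the data passed to `.fit()`.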

### Guided Example: Tuning a Random Forest

Let's tune a Random Forest model by searching over multiple hyperparameters simultaneously.

**Example:** We'll search over `n_estimators`, `max_depth`, and `min_samples_split` to find the optimal combination.

In [None]:
# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5]
}

# Create the model
rf = RandomForestRegressor(random_state=42)

# Create GridSearchCV
grid_search = GridSearchCV(
    rf, 
    param_grid, 
    cv=5, 
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

# Fit on training data: 2 × 3 × 2 = 12 configs × 5 folds = 60 model fits,
# plus one final refit of the best config on all of X_train
print("Starting grid search... this may take a minute")
grid_search.fit(X_train, y_train)

# Display results
print("\n" + "="*50)
print("Grid Search Complete!")
print("="*50)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: ${-grid_search.best_score_:,.0f}")

### 🧪 Practice Exercise — Tune a Decision Tree

**Business Scenario:** You're building a decision tree model for your real estate company's pricing tool. You need to find the optimal complexity settings.

**Your Task:** Use GridSearchCV to tune a DecisionTreeRegressor across these parameters:
- `max_depth`: [5, 10, 15, 20]
- `min_samples_split`: [2, 5, 10]
- `min_samples_leaf`: [1, 2, 4]

**Step-by-step approach:**
1. Define the parameter grid as a dictionary
2. Create a DecisionTreeRegressor
3. Create GridSearchCV with cv=5
4. Fit on X_train and y_train
5. Print the best parameters and best CV score

In [None]:
# Your solution here


### Section 3: Feature Engineering Essentials (10 minutes)

Raw data often isn't in the best format for machine learning. Feature engineering transforms your data to help models learn better patterns. Let's practice the core techniques: encoding categorical variables and scaling numerical features.

### Guided Example: Encoding and Scaling

Let's prepare features using proper encoding and scaling techniques.

**Example:** We'll encode a categorical variable (Neighborhood) and scale numerical features.

In [None]:
# Add a categorical feature to our feature set
print("Original Neighborhood (categorical):")
print(ames['Neighborhood'].head())
print(f"\nUnique neighborhoods: {ames['Neighborhood'].nunique()}")

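# Label encode the neighborhood (integers 0 to n_unique - 1, assigned in alphabetical order)
# Note: these integers imply an ordering that neighborhoods don't really have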
le = LabelEncoder()
ames['Neighborhood_Encoded'] = le.fit_transform(ames['Neighborhood'])

print("\nEncoded Neighborhood (numerical):")
print(ames[['Neighborhood', 'Neighborhood_Encoded']].head())
print("\n✅ Categorical variable converted to numerical format!")

In [None]:
# Now let's scale our numerical features
# First, create new feature set with encoded neighborhood
features_enhanced = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'Neighborhood_Encoded']
X_enhanced = ames[features_enhanced]

# Split again with enhanced features
X_train_enh, X_test_enh, y_train_enh, y_test_enh = train_test_split(
    X_enhanced, y, test_size=0.2, random_state=42
)

print("Before scaling (first 3 rows):")
print(X_train_enh.head(3))

# Scale the features (fit on training data only!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enh)
X_test_scaled = scaler.transform(X_test_enh)  # Use training stats to transform test

# Convert back to DataFrame for viewing
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=features_enhanced)

print("\nAfter scaling (first 3 rows):")
print(X_train_scaled_df.head(3))
print("\n💡 Notice: All features now have similar scales (mean≈0, std≈1)")

### 🧠 Your Turn — Build a Pipeline

Pipelines are the professional way to combine preprocessing and modeling. They ensure transformations are applied correctly and prevent data leakage.

**Your Task:** Create a pipeline that:
1. Scales features with StandardScaler
2. Fits a LinearRegression model
3. Evaluates performance using cross-validation

**Tasks:**
- Create a Pipeline with two steps: ('scaler', StandardScaler()) and ('model', LinearRegression())
- Use cross_val_score to evaluate the pipeline with 5-fold CV
- Print the mean RMSE

💡 **Hint:** The pipeline syntax is: `Pipeline([('step1_name', transformer1), ('step2_name', model)])`

In [None]:
# Your code here


## Class Discussion/Q&A (5-10 minutes)

Let's pause and discuss what we've learned so far. Your TA will facilitate a discussion around these key concepts.

**Discussion prompts:**
- Why is cross-validation more trustworthy than repeatedly checking the test set?
- How does GridSearchCV save time and reduce errors compared to manual tuning?
- What's the danger of fitting the scaler on the entire dataset before splitting?
- When would you use label encoding vs. dummy encoding for categorical variables?

**Common blockers and clarifications:**
- **"Why do we need so many folds in CV?"** Each fold gives one more estimate of performance, so averaging over several folds is more reliable than relying on a single split. More folds also means more model fits and longer computation time; 5-fold is a good default balance.
- **"How do I know which hyperparameters to tune?"** Read the documentation, start with complexity parameters (like max_depth, n_estimators), and use domain knowledge.
- **"What if GridSearchCV takes too long?"** Reduce your parameter grid, use fewer folds, or try RandomizedSearchCV for very large grids.
- **"Why does the pipeline matter?"** It prevents data leakage: inside cross-validation, the scaler is refit on each fold's training portion only and then applied to that fold's validation portion. For the final evaluation, it is fit on the training set and then used to transform the test set.
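
To ground the scaler discussion, here is a small sketch comparing a scaler fit on the whole dataset with one fit on the training rows only. The data is randomly generated (a hypothetical square-footage column), and `leaky_scaler` and `clean_scaler` are illustrative names, not part of the lab code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature: 100 made-up square-footage values
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=1500, scale=500, size=(100, 1))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Leaky: statistics computed from ALL rows, including the future test rows
leaky_scaler = StandardScaler().fit(X_demo)
# Clean: statistics computed from the training rows only
clean_scaler = StandardScaler().fit(X_tr)

print(f"Mean learned from all rows:      {leaky_scaler.mean_[0]:.2f}")
print(f"Mean learned from training rows: {clean_scaler.mean_[0]:.2f}")
```

The two means differ because the leaky scaler has already "seen" the test rows. The gap is small here, but it means the test set is no longer a fully independent check.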

## Part B — Independent Challenges (35-40 minutes)

Now it's your turn to apply the complete professional workflow! For the next several challenges...

* You will not be given starter code to work with; rather, you need to start from a blank cell.
* **DO NOT USE AI** to generate code for you. This is a group exercise, and you should be writing the code together.
* Work with your group to write the code.
* Feel free to ask questions or seek help from the instructor.
* We'll stop and walk through each challenge together after each time block.

**Timing for Part B:**
- Challenge 1-2: 10 minutes
- Challenge 3-4: 10 minutes  
- Challenge 5-6: 12 minutes
- Group discussion: 5-8 minutes

### Challenge 1 — Compare Three Models with Cross-Validation

**Question:** Your real estate company wants to choose between three modeling approaches for their pricing tool: Linear Regression, Decision Tree (max_depth=15), and Random Forest (n_estimators=100, max_depth=15). Which model should they use based on cross-validation performance?

**Context:** Use the enhanced features we created (including Neighborhood_Encoded) and 5-fold cross-validation. Compare models based on mean RMSE. Remember: lower RMSE is better!

In [None]:
# Your turn: write code here to compare three models using cross-validation


### Challenge 2 — Tune a Random Forest Comprehensively

**Question:** Find the optimal Random Forest configuration for predicting house prices. Search over these hyperparameters:
- `n_estimators`: [50, 100, 200]
- `max_depth`: [10, 15, 20, None]
- `min_samples_split`: [2, 5]
- `max_features`: ['sqrt', 'log2']

What are the best hyperparameters, and what's the expected CV RMSE?

**Hint:** This will search 3 × 4 × 2 × 2 = 48 configurations. With 5-fold CV, that's 240 models. It may take 2-3 minutes to run!

In [None]:
# Your turn: write code here to perform comprehensive hyperparameter tuning


### Challenge 3 — Engineer New Features

**Question:** Create three new features that might improve prediction accuracy:
1. **HouseAge**: Current year (2024) minus YearBuilt
2. **TotalBathrooms**: FullBath + (0.5 × HalfBath)
3. **QualityScore**: OverallQual × OverallCond (interaction of quality and condition)

Add these to your feature set, split the data, and evaluate a Decision Tree (max_depth=15) using cross-validation. Does adding these features improve performance compared to Challenge 1?

In [None]:
# Your turn: write code here to engineer new features and evaluate model


### Challenge 4 — Build a Complete Pipeline

**Question:** Create a pipeline that combines feature scaling and random forest modeling. The pipeline should:
1. Scale features using StandardScaler
2. Fit a RandomForestRegressor with your best hyperparameters from Challenge 2

Evaluate this pipeline using 5-fold cross-validation. Compare the RMSE to the unscaled Random Forest from Challenge 2. Does scaling help tree-based models?

In [None]:
# Your turn: write code here to build and evaluate a pipeline


### Challenge 5 — The Complete Professional Workflow

**Question:** Execute the full 5-stage professional ML workflow:

**Stage 1:** Split data into train/test (80/20)

**Stage 2:** Use cross-validation on the training set to:
- Compare at least 2 different models
- Tune hyperparameters for your best model
- Decide on your final model choice

**Stage 3:** Select your best model based on CV results

**Stage 4:** Retrain your chosen model on the FULL training set (not just the k-1 folds used in each CV round)

**Stage 5:** Evaluate on the test set ONCE and report final RMSE

Print results at each stage so we can see your decision-making process.

In [None]:
# Your turn: write code here to execute the complete 5-stage workflow


### Challenge 6 — Pipeline + GridSearchCV Integration

**Question:** Combine everything you've learned! Create a pipeline that includes:
1. StandardScaler for preprocessing
2. RandomForestRegressor for modeling

Then use GridSearchCV to tune the Random Forest hyperparameters WITHIN the pipeline. Search over:
- `model__n_estimators`: [100, 200]
- `model__max_depth`: [10, 15, 20]

Note the double underscore syntax: `model__parameter_name` is how you specify parameters for a specific step in a pipeline.

What are the best parameters? How does this compare to your previous results?

In [None]:
# Your turn: write code here to combine pipeline and GridSearchCV


## 🎯 (Optional) Extension Activities

If you finish early or want additional practice:

### Extension 1: RandomizedSearchCV
When the parameter space is too large for grid search, RandomizedSearchCV samples random combinations. Try using `RandomizedSearchCV` to search a much larger parameter space (e.g., n_estimators from 50-500, max_depth from 5-50) with only 20 iterations. Compare results to grid search.
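
As a starting point, here is a minimal sketch of the `RandomizedSearchCV` pattern on synthetic data. The ranges, `n_iter=5`, and `cv=3` are deliberately tiny so it runs in seconds. They are assumptions for illustration, not the settings suggested above.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the Ames data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)

# Distributions instead of fixed lists; randint's upper bound is exclusive
param_distributions = {
    'n_estimators': randint(10, 51),  # integers 10..50
    'max_depth': randint(2, 11),      # integers 2..10
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,            # number of random combinations to try
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=42,     # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV RMSE: {-random_search.best_score_:.2f}")
```

The cost is controlled by `n_iter` rather than by the size of the search space, which is why this approach scales to wide ranges.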

### Extension 2: Feature Importance Analysis
After training your best Random Forest model, extract feature importances using `model.feature_importances_`. Create a bar plot showing which features matter most for predicting house prices. Do your engineered features rank highly?
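
The extraction step might look like the sketch below, shown on synthetic data with made-up feature names (`feat_a` through `feat_d` are hypothetical). Pairing the importances with column names in a pandas Series makes them easy to sort and plot.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names (NOT the Ames columns)
feature_names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']
X_arr, y_demo = make_regression(
    n_samples=200, n_features=4, n_informative=2, noise=5.0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Pair each importance with its feature name and sort, largest first
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)

# For the bar plot, importances.plot.barh() is one option (requires matplotlib)
```

Random Forest importances are normalized to sum to 1, so each value can be read as a share of the total.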

### Extension 3: Analyze Cross-Validation Variance
Use `cross_validate()` with `return_train_score=True` to get both training and validation scores for each fold. Plot them to visualize if your model shows signs of overfitting (large gap between train and validation scores).
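
Here is a minimal sketch of that call on synthetic data, using an unrestricted decision tree so the train/validation gap is easy to see. Note that scikit-learn labels the validation-fold scores `test_score` in the returned dictionary. The variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy stand-in data (assumption: NOT the Ames data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=25.0, random_state=42)

results = cross_validate(
    DecisionTreeRegressor(random_state=42),  # no depth limit, so it can memorize
    X_demo, y_demo,
    cv=5,
    scoring='neg_root_mean_squared_error',
    return_train_score=True,
)

# Scores are negated RMSE; flip the sign to get RMSE
train_rmse = -results['train_score']
val_rmse = -results['test_score']  # "test" here means the validation fold

print(f"Mean train RMSE:      {train_rmse.mean():.2f}")
print(f"Mean validation RMSE: {val_rmse.mean():.2f}")
```

A training error far below the validation error is the overfitting signature this extension asks you to look for.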

### Extension 4: Brainstorm — What else is interesting?
Write **3 questions** you'd like to explore with this dataset in future weeks.

Examples to spark ideas:
- How would you handle missing data in features like GarageYrBlt or LotFrontage?
- Could you create polynomial features (e.g., GrLivArea²) to capture non-linear relationships?
- What happens if you use dummy encoding instead of label encoding for Neighborhood?

## 🎓 Lab Wrap-Up & Reflection

### ✅ What You Accomplished
In this lab, you practiced:
- **Cross-validation**: Comparing models using k-fold CV to get honest performance estimates without contaminating the test set
- **Hyperparameter tuning**: Using GridSearchCV to systematically search parameter spaces and find optimal model configurations
- **Feature engineering**: Encoding categorical variables, scaling numerical features, and creating new features based on domain knowledge
- **Pipelines**: Building end-to-end workflows that prevent data leakage and ensure reproducibility
- **Professional workflow**: Executing the complete 5-stage process from split through final evaluation

### 🤔 Reflection Questions
Take 2-3 minutes to consider:
- What concept from today clicked for you?
- What would you like more practice with?
- How might you use today's skills in a real business context?

### 🔗 Connection to Course Goals
This week represents a major milestone in your data science journey. You've moved from learning individual modeling techniques to understanding the **complete professional workflow** that ensures your models are trustworthy, reproducible, and production-ready. These skills are what employers look for when hiring data scientists—they want people who can build models that work reliably in the real world, not just on classroom exercises.

The techniques you practiced today (cross-validation, hyperparameter tuning, feature engineering, pipelines) are foundational to every data science project you'll work on in your career. Master these, and you'll be well-prepared for advanced topics and real-world applications.

### 📋 Next Steps
- **Homework:** Complete the Week 12 homework assignment applying these techniques to a new dataset
- **Quiz:** Week 12 reading quiz covering Chapters 28-30 (due Sunday)
- **Next Week:** We'll explore unsupervised learning techniques including clustering and dimensionality reduction
- **Additional Practice:** Revisit previous homework assignments and apply the proper workflow—compare your original results to what you get with cross-validation and hyperparameter tuning

---
**💾 Save your work** and be ready to share your approach and findings. During our wrap-up discussion, be prepared to explain:
- Which model performed best in your workflow and why
- What hyperparameters had the biggest impact on performance
- Whether your engineered features improved predictions
- Any surprises or unexpected results you discovered

## 🚨 Troubleshooting & Common Issues

**Issue 1:** "GridSearchCV is taking forever to run!"
- **Solution:** Reduce the parameter grid (try fewer values per parameter), use fewer CV folds (try cv=3 instead of cv=5), set `n_jobs=-1` to use all CPU cores, or consider RandomizedSearchCV for large grids.

**Issue 2:** "I'm getting a 'ValueError: could not convert string to float' error"
- **Solution:** You likely have a categorical variable in your feature set that hasn't been encoded. Use `LabelEncoder` or `pd.get_dummies()` to convert categorical variables to numerical format before modeling.
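
As a quick illustration of the two options, the sketch below encodes a tiny made-up table both ways. The four rows are invented for this example and are not taken from `ames_clean.csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up table (hypothetical rows for illustration only)
demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'CollgCr', 'OldTown', 'NAmes'],
    'GrLivArea': [1200, 1800, 1500, 1100],
})

# Option 1: label encoding -> one integer column (alphabetical order)
label_encoded = demo.assign(
    Neighborhood=LabelEncoder().fit_transform(demo['Neighborhood'])
)

# Option 2: dummy encoding -> one indicator column per category
dummy_encoded = pd.get_dummies(demo, columns=['Neighborhood'])

print(label_encoded)
print(dummy_encoded)
```

Label encoding keeps the table narrow but implies an order among categories. Dummy encoding avoids that at the cost of extra columns.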

**Issue 3:** "My cross-validation scores vary a lot between folds"
- **Solution:** Some variation is normal and indicates your model's performance is somewhat sensitive to the specific data it sees. A large standard deviation suggests either: (1) your dataset is small, (2) the model is overfitting, or (3) there are outliers. Try simplifying your model, investigating outliers, or averaging over repeated cross-validation runs (for example, with `RepeatedKFold`) for a more stable estimate.

**Issue 4:** "Pipeline syntax with GridSearchCV is confusing"
- **Solution:** Remember the double underscore syntax: `'step_name__parameter'`. For example, if your pipeline has a step named 'model', use `'model__max_depth'` in your param_grid. Print `pipeline.get_params()` to see all available parameters.
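
A short sketch of that lookup, using a hypothetical two-step pipeline with steps named `'scaler'` and `'model'`:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical pipeline; the step names are what appear before the "__"
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DecisionTreeRegressor(random_state=42)),
])

# List every parameter that belongs to the 'model' step
tunable = [name for name in demo_pipe.get_params() if name.startswith('model__')]
print(tunable)

# The same names work with set_params (and as keys in a param_grid)
demo_pipe.set_params(model__max_depth=5)
print(demo_pipe.named_steps['model'].max_depth)
```

If a key in your param_grid does not appear in this list, GridSearchCV will raise an error about an invalid parameter, so checking the names first can save a failed run.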

**Issue 5:** "My test set performance is much worse than CV performance"
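- **Solution:** This often suggests overfitting: either to the training data, or to the CV folds themselves after a very large hyperparameter search. It can also happen when preprocessing was fit outside the cross-validation loop, which makes CV scores look better than they should. Try: (1) simplifying your model (reduce max_depth, increase min_samples_split), (2) using more training data, (3) adding regularization, or (4) moving all preprocessing into a pipeline.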

**Issue 6:** "StandardScaler is giving me weird results"
- **Solution:** Make sure you fit the scaler on training data only (`scaler.fit_transform(X_train)`), then transform test data using those fitted parameters (`scaler.transform(X_test)`). NEVER fit on test data—that's data leakage! Using a pipeline helps prevent this mistake.

**General Debugging Tips:**
- **Start simple:** Get a basic version working before adding complexity
- **Check shapes:** Use `.shape` to verify your arrays are the right size at each step
- **Print intermediate results:** Don't just run everything at once—print outputs to understand what's happening
- **Use small grids first:** Test GridSearchCV with a small parameter grid (2-3 combinations) before running the full search
- **Read error messages carefully:** They often tell you exactly what's wrong (e.g., "expected 6 features but got 5")