# Week 12 Lab: The Professional ML Workflow

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/12_wk12_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to Week 12! This week marks a turning point in your data science journey. Up until now, you've been learning the fundamentals of building machine learning models, and you've done great work. But this week, you're leveling up from "beginner data scientist" to "production-ready professional."

Here's the uncomfortable truth: **we've been breaking a fundamental rule of production machine learning**. Every time you compared different model settings or tuned hyperparameters based on test set performance, you were "peeking" at the test set, which made your performance estimates less trustworthy. This week, you'll learn the **proper workflow** that ensures honest, reliable model performance.

In this lab, you'll practice three critical professional skills: **cross-validation** (comparing models without contaminating your test set), **hyperparameter tuning** (systematically finding optimal model configurations), and **feature engineering** (transforming raw data into powerful model inputs). By the end, you'll be able to build models the right way, with techniques that work in production, not just in the classroom.

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Implement k-fold cross-validation to compare models without touching the test set
- Use GridSearchCV to systematically tune hyperparameters across multiple parameters
- Apply feature engineering techniques including encoding, scaling, and creating new features
- Build end-to-end pipelines that prevent data leakage and ensure reproducible workflows
- Execute the complete 5-stage professional ML workflow from data preparation through final evaluation

## 📚 This Lab Reinforces
- **Chapter 28: Cross-Validation** - K-fold CV, test set contamination, the proper 5-stage workflow
- **Chapter 29: Hyperparameter Tuning** - Bias-variance tradeoff, GridSearchCV, systematic parameter search
- **Chapter 30: Feature Engineering** - Encoding strategies, feature scaling, pipelines, preventing data leakage

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Group work (2-4 students)

- **[0–30 min]** Part A: Guided Reinforcement - TA-led practice with cross-validation, GridSearchCV, and feature engineering
- **[30–40 min]** Class Q&A - Discussion and clarification of key concepts
- **[40–72 min]** Part B: Independent Challenges - 6 group challenges applying the complete professional workflow
- **[72–75 min]** Wrap-Up & Reflection - What you learned and next steps

You are encouraged to work in small groups of **2–4 students** and complete the lab together.

## 💡 Why This Matters
These are the skills that separate junior data scientists from professionals who build production-ready models. When you interview for data science positions, employers will ask about cross-validation, hyperparameter tuning, and feature engineering because they want to know you can build models that perform reliably in the real world, not just on classroom exercises.

More importantly, these techniques ensure your models are trustworthy. When you tell a business stakeholder "this model will achieve 85% accuracy in production," you need to be confident that estimate is honest and reliable. The professional workflow you'll practice today is what makes that confidence possible.

## Setup
We'll use the familiar Ames housing dataset for today's lab. You've worked with this data throughout the course, so you can focus on the new workflow techniques rather than on learning a new dataset.

Make sure you have the required libraries installed and the dataset accessible.

In [None]:
# Required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Display settings
pd.set_option('display.max_columns', None)
np.random.seed(42)

# Hide warnings for cleaner output
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

print("✅ All libraries imported successfully!")

In [None]:
# Load the Ames housing data
# adjust path if running in Google Colab
try:
    # Try local path first
    ames = pd.read_csv('../data/ames_clean.csv')
except FileNotFoundError:
    # If in Colab, load from GitHub
    url = 'https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv'
    ames = pd.read_csv(url)

# Quick preview
print(f"Dataset shape: {ames.shape}")
print(f"\nFirst few rows:")
ames.head()

## Part A: Guided Reinforcement (30 minutes)

In this section, your TA will lead you through the key techniques step-by-step. Follow along, run the code, and ask questions as we go. This guided practice will prepare you for the independent challenges in Part B.

### Section 1: Cross-Validation Basics (10 minutes)

Let's start by understanding **why** cross-validation matters and **how** to implement it properly.

**The Problem:** In previous weeks, you evaluated models on the test set multiple times to make decisions. Each "peek" at the test set made your performance estimates less trustworthy.

**The Solution:** Cross-validation lets you compare models and tune hyperparameters using only the training data, keeping your test set pristine for final evaluation.

**📋 Step-by-step instructions:**
1. Select a subset of features for simplicity
2. Split data into train/test sets (80/20 split)
3. Use `cross_val_score()` to evaluate a model with 5-fold CV
4. Interpret the CV scores (mean and standard deviation)

In [None]:
# Step 1: Create features and target
# We'll use a few numerical features to start
features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'OverallQual']
X = ames[features]
y = ames['SalePrice']

# Step 2: Split into train and test (test set is LOCKED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print("\n🔒 Test set is now LOCKED. We won't touch it until the very end.")
print("\nLet's check out our training data features:")
X_train.head()

In [None]:
# Step 3: Use cross-validation to evaluate a decision tree
# Notice: We ONLY use X_train and y_train here!
dt = DecisionTreeRegressor(max_depth=10, random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(
    dt, X_train, y_train, 
    cv=5, 
    scoring='neg_root_mean_squared_error'
)

# Convert to positive RMSE
cv_rmse = -cv_scores

print("Cross-Validation Results (5 folds):")
print(f"RMSE per fold: {cv_rmse}")
print(f"\nMean RMSE: ${cv_rmse.mean():,.0f}")
print(f"Std Dev: ${cv_rmse.std():,.0f}")
print("\n💡 The mean tells us expected performance, std dev tells us consistency")
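
To make the mechanics less of a black box, here is a minimal sketch of what `cross_val_score()` does with `cv=5` for a regressor: it splits the rows into five consecutive folds, fits on four, and scores on the fifth. The data is a synthetic stand-in from `make_regression` (an assumption for illustration, not the Ames data), and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Ames data)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

model = DecisionTreeRegressor(max_depth=5, random_state=42)
kf = KFold(n_splits=5)  # for regressors, cv=5 corresponds to an unshuffled 5-fold split

# Manual loop: fit on 4 folds, compute RMSE on the held-out fold
manual_rmse = []
for train_idx, val_idx in kf.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(np.mean((y_demo[val_idx] - preds) ** 2)))
manual_rmse = np.array(manual_rmse)

# The one-liner used in the lab
auto_rmse = -cross_val_score(model, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error')

print("Manual loop matches cross_val_score:", np.allclose(manual_rmse, auto_rmse))
```

Every row is used for validation exactly once, which is why the mean of the five scores is a more stable estimate than a single split.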

#### 🧠 Your Turn: Compare Two Models with Cross-Validation

Now you try! Use cross-validation to compare a shallow decision tree (`max_depth=5`) against a deeper tree (`max_depth=15`).

**Tasks:**
- Reuse the code above and change only the `max_depth` parameter to test each setting.
- Evaluate both models using 5-fold cross-validation on the training data.
- Compare the mean RMSE of the two models.
- Which model performs better based on CV scores (not test scores!)?

#### ✅ Check Your Understanding

**Questions to consider:**
- Which model had better (lower) mean RMSE?
- Which model had more consistent scores across folds (lower std dev)?
- Why is comparing models with CV better than repeatedly checking the test set?

**Expected Result:** You should see that one model generalizes better than the other, and you made this decision without ever touching the test set!

### Section 2: Hyperparameter Tuning with GridSearchCV (10 minutes)

Now let's automate the hyperparameter search process. Instead of manually trying different values one by one, GridSearchCV will systematically test all combinations and use cross-validation to find the best configuration.

**Why this matters:** Manual tuning is tedious, error-prone, and doesn't explore the full parameter space. GridSearchCV does this systematically and reproducibly.

#### Guided Example: Tuning a Random Forest

Let's tune a Random Forest model by searching over multiple hyperparameters simultaneously.

**Example:** We'll search over `n_estimators`, `max_depth`, and `min_samples_split` to find the optimal combination.

In [None]:
# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5]
}

# Create the model
rf = RandomForestRegressor(random_state=42)

# Create GridSearchCV
grid_search = GridSearchCV(
    rf, 
    param_grid, 
    cv=5, 
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

# Fit on training data (this will try 2 × 3 × 2 = 12 configs, each with 5-fold CV = 60 models!)
print("Starting grid search... this may take a minute")
grid_search.fit(X_train, y_train)

print("Grid Search Complete!")

# Check out the results
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results

In [None]:
# Display results
print("\n" + "="*50)
print("Best Hyperparameter Configuration")
print("="*50)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: ${-grid_search.best_score_:,.0f}")
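
The fit count quoted in the grid-search cell above can be checked before you commit to a long run. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the same grid; the names `demo_grid`, `combos`, and `n_folds` are hypothetical and used only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid in the guided example
demo_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5],
}

combos = list(ParameterGrid(demo_grid))  # one dict per configuration
n_folds = 5

print(len(combos))            # 12 configurations
print(len(combos) * n_folds)  # 60 cross-validation fits (the final refit is extra)
print(combos[0])              # inspect one configuration
```

Counting configurations first is a cheap way to decide whether a grid needs trimming before you start the search.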

#### 🧪 Practice Exercise: Tune a Decision Tree

**Business Scenario:** You're building a decision tree model for your real estate company's pricing tool. You need to find the optimal complexity settings.

**Your Task:** Fill in the blanks below to use GridSearchCV to tune a DecisionTreeRegressor across these parameters:
- `max_depth`: [5, 10, 15, 20]
- `min_samples_split`: [2, 5, 10]
- `min_samples_leaf`: [1, 2, 4]

**Step-by-step approach:**
1. Define the parameter grid as a dictionary
2. Create a DecisionTreeRegressor
3. Create GridSearchCV with cv=5
4. Fit on X_train and y_train
5. Print the best parameters and best CV score

In [None]:
# Step 1: Define the parameter grid
param_grid = {
    'max_depth': _________,           
    'min_samples_split': _________,       
    'min_samples_leaf': _________         
    }

# Step 2: Create the DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42)

# Step 3: Create GridSearchCV
grid_search = GridSearchCV(
    dt,
    ______,
    cv=__,                                   
    scoring='neg_root_mean_squared_error',
    verbose=1
)

# Step 4: Fit on training data
grid_search.fit(X_train, y_train)

# Step 5: Print best parameters and best CV score
print("Best parameters:", grid_search._______)
print("Best CV RMSE: $", -grid_search._______)

### Section 3: Feature Engineering Essentials (10 minutes)

Raw data often isn't in the best format for machine learning. Feature engineering transforms your data to help models learn better patterns. Let's practice the core techniques: encoding categorical variables and scaling numerical features.

#### Scaling Numerical Features

Let's see how scaling can affect model performance. We'll train a model on the raw features first, then apply StandardScaler and check whether performance improves.

In [None]:
# Train a KNN regression model using 5-fold cross-validation
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)

cv_scores = cross_val_score(
    knn, X_train, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
cv_rmse_unscaled = -cv_scores

print("KNN Regression Cross-Validation Results (5 folds):")
print(f"Mean RMSE: ${cv_rmse_unscaled.mean():,.0f}")
print(f"Std Dev: ${cv_rmse_unscaled.std():,.0f}")
print(f"\n💡 Represents baseline performance with raw feature values.")

In [None]:
# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training stats to transform test

# Train a KNN regression model using 5-fold cross-validation on scaled features
knn = KNeighborsRegressor(n_neighbors=5)
cv_scores = cross_val_score(
    knn, X_train_scaled, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
cv_rmse_scaled = -cv_scores

print("KNN Regression with Scaled Features (5 folds):")
print(f"Mean RMSE: ${cv_rmse_scaled.mean():,.0f}")
print(f"Std Dev: ${cv_rmse_scaled.std():,.0f}")
print("\n💡 Scaling numerical features can help some models like KNN perform better!")
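
One caveat about the cell above: the scaler is fit on all of `X_train` before cross-validation, so each validation fold has already influenced the scaling statistics. The effect is usually small, but the stricter approach is to put the scaler inside a pipeline so it is refit within every fold. A minimal sketch on synthetic stand-in data (an assumption, not the Ames data), using the hypothetical names `knn_pipe` and `pipe_rmse`:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Ames data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=42)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale, as raw housing features often differ

# The scaler is now part of the estimator, so it is fit on the training folds only
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

pipe_rmse = -cross_val_score(
    knn_pipe, X_demo, y_demo,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
print(f"Mean RMSE: {pipe_rmse.mean():.1f}")
print(f"Std Dev: {pipe_rmse.std():.1f}")
```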

**🎯 When scaling does matter**

| Model Type                             | Does Scaling Change Performance? | Why                                         |
| -------------------------------------- | -------------------------------- | ------------------------------------------- |
| **LinearRegression (OLS)**             | ❌ No                            | Closed-form solution, not iterative         |
| **Ridge / Lasso / ElasticNet**         | ✅ Yes                           | Regularization depends on coefficient sizes |
| **Logistic Regression (solver-based)** | ✅ Yes                           | Gradient descent & regularization           |
| **SVM / SVR**                          | ✅ Yes                           | Distance-based kernel computations          |
| **KNN / K-means / PCA / clustering**   | ✅ Yes                           | Distance/variance sensitive                 |
| **Neural networks**                    | ✅ Yes                           | Gradient descent stability                  |
| **Tree models (RF, XGBoost, etc.)**    | ❌ No                            | Trees are scale-invariant                   |
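
Two rows of the table can be checked directly. The sketch below fits a decision tree and a KNN model on raw and on standardized versions of the same synthetic data (an assumption for illustration; all names are hypothetical) and compares their predictions on held-out rows. In this setup the tree's predictions should be unchanged by scaling, while KNN's should change.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(0, 1, 300),     # small-scale feature that drives the target
    rng.normal(0, 1000, 300),  # large-scale feature with little signal
])
y_demo = 5 * X_demo[:, 0] + 0.001 * X_demo[:, 1] + rng.normal(0, 0.1, 300)

X_fit, X_new, y_fit = X_demo[:250], X_demo[250:], y_demo[:250]
scaler = StandardScaler().fit(X_fit)  # statistics come from the fitting rows only

def fit_predict(model, scale):
    """Fit on X_fit and predict X_new, optionally standardizing both."""
    A = scaler.transform(X_fit) if scale else X_fit
    B = scaler.transform(X_new) if scale else X_new
    return model.fit(A, y_fit).predict(B)

tree_same = np.allclose(
    fit_predict(DecisionTreeRegressor(max_depth=3, random_state=0), scale=False),
    fit_predict(DecisionTreeRegressor(max_depth=3, random_state=0), scale=True),
)
knn_same = np.allclose(
    fit_predict(KNeighborsRegressor(n_neighbors=5), scale=False),
    fit_predict(KNeighborsRegressor(n_neighbors=5), scale=True),
)
print("Tree predictions unchanged by scaling:", tree_same)
print("KNN predictions unchanged by scaling:", knn_same)
```

Standardizing is a monotonic transformation of each feature, so a tree finds the same partitions; KNN's distances, by contrast, are dominated by whichever feature has the largest raw scale.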

#### Encoding Categorical Variables

Most machine learning algorithms require numerical inputs. When you have categorical variables (like neighborhood names or house styles), you need to convert them to numbers. Let's practice the two most common approaches:

1. **One-Hot Encoding**: Create binary (0/1) columns for each category
2. **Label Encoding**: Assign each category a unique integer

**When to use which?**
- **One-hot encoding:** Best for nominal categories (no inherent order) with tree-based models or linear models
- **Label encoding:** Can work for ordinal categories (inherent order) or with tree-based models that can learn splits

In [None]:
# Let's add some categorical features to our feature set
# We'll use Neighborhood and BldgType as examples
# Step 1: Rebuild the feature set with the two categorical columns added
# (the target y is unchanged from Section 1)
features = ['Neighborhood', 'BldgType', 'GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'OverallQual']
X = ames[features]

# Step 2: Split into train and test (test set is LOCKED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# First, let's see what values these categorical variables have
print(f"Unique Neighborhoods: {X_train['Neighborhood'].nunique()}")
print(f"Unique Building Types: {X_train['BldgType'].nunique()}")

print("\nLet's check out our training data features:")
X_train.head(10)


In [None]:
from sklearn.preprocessing import OneHotEncoder

# Select categorical columns
cat_cols = ['BldgType', 'Neighborhood']

# Initialize encoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform on training data
X_cat_encoded = encoder.fit_transform(X_train[cat_cols])

# Get feature names for encoded columns
encoded_feature_names = encoder.get_feature_names_out(cat_cols)

# Create DataFrame for encoded categorical features
X_cat_encoded_df = pd.DataFrame(X_cat_encoded, columns=encoded_feature_names, index=X_train.index)

# Concatenate with numerical features
num_cols = [col for col in X_train.columns if col not in cat_cols]
X_encoded = pd.concat([X_train[num_cols], X_cat_encoded_df], axis=1)

print(f"One-hot encoding turned our {X_train.shape[1]} original features into {X_encoded.shape[1]} features.")
print("\nFirst few rows:")
X_encoded.head()

#### Alternative: Label Encoding for Categorical Variables

While one-hot encoding is a common approach for handling categorical variables, another option is **label encoding**. This technique assigns each unique category an integer value. Label encoding can be especially useful when you have a categorical feature with many unique values, like the `Neighborhood` column, which contains 25 unique neighborhoods.

Let's see how to apply label encoding to the `Neighborhood` column in the next code chunk.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
le = LabelEncoder()

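# Work on explicit copies so adding a column does not trigger a SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Fit and transform Neighborhood in training data
X_train['Neighborhood_LE'] = le.fit_transform(X_train['Neighborhood'])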

# Transform Neighborhood in test data using the same encoder
X_test['Neighborhood_LE'] = le.transform(X_test['Neighborhood'])

print(X_train[['Neighborhood', 'Neighborhood_LE']].head())
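
A note on the cell above: scikit-learn documents `LabelEncoder` as a tool for encoding target labels, and its `transform()` raises an error when the test data contains a category that was not seen during fitting. For feature columns, `OrdinalEncoder` is the counterpart and can map unseen categories to a sentinel value. A minimal sketch on toy frames (the values and the `_demo` names are hypothetical, not taken from the Ames data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (hypothetical values for illustration)
train_demo = pd.DataFrame({'Neighborhood': ['NAmes', 'Gilbert', 'NAmes', 'OldTown']})
test_demo = pd.DataFrame({'Neighborhood': ['Gilbert', 'Somerst']})  # 'Somerst' is unseen

# Unseen categories are encoded as -1 instead of raising an error
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = oe.fit_transform(train_demo[['Neighborhood']])  # expects 2-D input
test_codes = oe.transform(test_demo[['Neighborhood']])

print(oe.categories_[0])    # categories in sorted order
print(train_codes.ravel())  # [1. 0. 1. 2.]
print(test_codes.ravel())   # [ 0. -1.]
```

Because it accepts 2-D input, `OrdinalEncoder` also fits inside a `ColumnTransformer`, which `LabelEncoder` does not.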

#### 🏗️ Putting Feature Engineering Steps Together with Pipelines

When building machine learning models, it's critical to apply all preprocessing steps (like scaling and encoding) in a reproducible, leak-proof way. Scikit-learn's `Pipeline` lets you chain together feature engineering and modeling steps so that each transformation is fit only on the training data within each cross-validation fold. This minimizes data leakage and ensures your workflow is robust and production-ready.

**Benefits of using a pipeline:**
- All preprocessing steps are applied consistently and correctly
- Prevents accidental data leakage by fitting transformers only on training data
- Makes your workflow easier to reproduce and deploy
- Allows you to tune preprocessing and model hyperparameters together

Below is an example pipeline that:
- Standardizes numeric features
- One-hot encodes the `BldgType` and `Neighborhood` variables
- Tunes a Decision Tree model with `GridSearchCV` using 5-fold cross-validation

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define column types
numeric_features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'OverallQual']
onehot_features = ['BldgType', 'Neighborhood']          

# Build preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), onehot_features)
    ]
)

# Build full pipeline with DecisionTreeRegressor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', DecisionTreeRegressor(random_state=42))
])

# Set up hyperparameter grid for DecisionTreeRegressor
param_grid = {
    'model__max_depth': [5, 10, 15, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

# GridSearchCV for pipeline
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1
)

# Fit and evaluate
grid_search.fit(X_train, y_train)
cv_rmse = -grid_search.best_score_

print("Best parameters:", grid_search.best_params_)
print(f"Best CV RMSE: ${cv_rmse:,.0f}")


**💡 Why pipelines are essential:**
- ✅ Prevent data leakage by ensuring transformations are fit only on training data
- ✅ Make your workflow reproducible and easier to deploy
- ✅ Allow you to tune preprocessing and model hyperparameters together
- ✅ Simplify your code by combining multiple steps

## Class Discussion/Q&A (10 minutes)

Before moving to the independent challenges, let's discuss key concepts:

**Discussion prompts:**
- What's the difference between cross-validation and a simple train/test split?
- Why shouldn't we look at the test set until the very end?
- When would you use GridSearchCV vs. manually trying different hyperparameters?
- What happens if you fit a scaler on all your data before splitting into train/test?
- How do pipelines help prevent data leakage?

**Common blockers and clarifications:**
- **"My GridSearchCV is taking forever!"** Reduce the parameter grid size or use fewer CV folds during experimentation
- **"Do I need to scale features for Random Forest?"** No! Tree-based models are scale-invariant
- **"When do I use one-hot vs. label encoding?"** One-hot for nominal categories with no order; label encoding for ordinal categories or when working with tree models

## Part B: Independent Challenges (32 minutes)

Now it's time to apply the complete professional ML workflow! These challenges require you to integrate everything you've learned.

**Important rules for Part B:**
* You will not be given starter code to work with; rather, you need to start from a blank cell.
* **DO NOT USE AI** to generate code for you. This is a group exercise, and you should be writing the code together.
* Work with your group to write the code.
* Feel free to ask questions or seek help from the instructor.
* We'll stop and walk through each challenge together after each time block.

**The Professional ML Workflow (5 Stages):**
1. **Data Preparation**: Feature selection, encoding, handling missing values
2. **Initial Train/Test Split**: Lock away your test set (don't touch until Stage 5!)
3. **Model Comparison with CV**: Compare different model types using cross-validation
4. **Hyperparameter Tuning**: Optimize your chosen model with GridSearchCV
5. **Final Test Evaluation**: Evaluate your final model on the locked test set (ONCE!)

### Challenge 1: Compare Model Types (5 minutes)

**Business Context:** Your real estate company wants to choose the best model architecture for predicting house prices. They're considering three options: Linear Regression, Decision Trees, and Random Forests.

**Your Task:**
1. Using the numerical features from Part A, create a train/test split (80/20)
   - Numerical features: `['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'OverallQual']`
2. Use 5-fold cross-validation to compare these three models:
   - Linear Regression
   - DecisionTreeRegressor (max_depth=10)
   - RandomForestRegressor (n_estimators=100, max_depth=10)
3. Print the mean CV RMSE for each model
4. Which model performs best? Why do you think that is?

**Remember:** Use ONLY the training data for cross-validation! Always set `random_state=42`, both when splitting your data and when instantiating a model (e.g., `DecisionTreeRegressor(random_state=42)`).

In [None]:
# Your turn: write code here to compare model types


### Challenge 2: Systematic Hyperparameter Tuning (6 minutes)

**Business Context:** Now that you've identified which model type works best, your manager wants you to find the optimal hyperparameter settings for that model.

**Your Task:**
1. Take the Random Forest model from Challenge 1
2. Define a parameter grid that assesses the following values:
    - `n_estimators`: [100, 200, 300]
    - `max_depth`: [5, 10, 15, 20]
    - `min_samples_split`: [2, 5, 10]
    - `min_samples_leaf`: [1, 2, 5]
3. Use GridSearchCV with 5-fold CV to find the best configuration
4. Print the best parameters and best CV score
5. Did tuning improve performance compared to the first random forest model you trained?  If so, by how much?

**Tip:** Don't forget to set `random_state=42`!

**⚠️⚠️ Warning ⚠️⚠️**: This may take a couple of minutes to run since we are fitting 540 models (108 configurations × 5 folds)!

In [None]:
# Your turn: write code here to tune hyperparameters


### Challenge 3: Build a Complete Pipeline (6 minutes)
**Business Context:** Your model is performing well, but you think you can do better by adding more features and performing some feature engineering.  Let's do this but do it in a safe way by using a `Pipeline()` to ensure no data leakage and easy reproducibility.

**Your Task:**
1. Start from scratch with the original `ames` data and select the following features:
   - Numeric: GrLivArea, YearBuilt, TotalBsmtSF, GarageCars, FullBath, OverallQual, YearRemodAdd, BedroomAbvGr, TotRmsAbvGrd
   - Categorical: Neighborhood, HouseStyle
2. Split the data into train/test sets (80/20 split) using these features.
3. Build a pipeline that:
   - Applies StandardScaler to numeric features
   - Applies OneHotEncoder to categorical features (Neighborhood, HouseStyle)
   - Uses RandomForestRegressor with the optimal hyperparameters from Challenge 2
4. Use 5-fold cross-validation to evaluate the pipeline on the training data.
5. Print the mean CV RMSE and compare it to your previous Random Forest model.  
6. Does adding more features and applying feature engineering improve model performance? If so, by how much?

**Tip:** Don't forget to set `random_state=42`!

In [None]:
# Your turn: write code here to build a pipeline


### Challenge 4: Complete End-to-End Workflow (4 minutes)
**Business Context:** Your manager wants you to execute the complete professional ML workflow from start to finish using the same feature set as the previous challenge.

**Your Task:** Demonstrate the full end-to-end workflow:

1. **Data Preparation:**
   - Select features:  
     - Numeric: GrLivArea, YearBuilt, TotalBsmtSF, GarageCars, FullBath, OverallQual, YearRemodAdd, BedroomAbvGr, TotRmsAbvGrd  
     - Categorical: Neighborhood, HouseStyle
   - Target variable: SalePrice

2. **Initial Split:**
   - Split data into train/test sets (80/20)
   - Lock away the test set

3. **Pipeline & Feature Engineering:**
   - Build a pipeline that applies StandardScaler to numeric features and OneHotEncoder to categorical features

4. **Hyperparameter Tuning:**
   - Use GridSearchCV to tune a RandomForestRegressor with these settings:
      - `n_estimators`: [100, 200, 300]
      - `max_depth`: [5, 10, 15, 20]
      - `min_samples_split`: [2, 5, 10]
      - `min_samples_leaf`: [1, 2, 5]
   - Use 5-fold cross-validation to select the best model

5. **Final Evaluation:**
   - Retrain the best model on the full training set
   - Evaluate ONCE on the test set
   - Report the final test RMSE

**Success criteria:** Your code should demonstrate the complete workflow, use a pipeline for feature engineering, and report the final test RMSE with no data leakage!

**⚠️⚠️ Warning ⚠️⚠️**: This may take a couple of minutes to run since we are fitting 540 models (108 configurations × 5 folds)!

In [None]:
#Your Turn: write code here to complete the end-to-end workflow

### Challenge 5: Model Interpretation with Permutation Importance (5 minutes)

**Business Context:** Your manager wants to understand *why* your model makes its predictions. Which features are most influential for predicting house prices?

**Your Task:**
1. Take the best model you identified above.
2. Use the permutation importance approach (`sklearn.inspection.permutation_importance`) on the test set to identify the most influential features.
3. Create a bar chart that plots the permutation importance scores for all features.
4. Identify and answer: Which feature is the most important for your model?

**Tip:** Use the test set (`X_test`, `y_test`) for permutation importance. For plotting, feel free to use matplotlib, seaborn or any other library.

**Success criteria:** You should be able to visualize the feature importances and clearly identify the most influential feature.

In [None]:
# Your Turn: write code here to compute feature importance

### Challenge 6: Partial Dependence Plot (PDP) for Model Interpretation

**Business Context:** Now that you've identified the most important feature using permutation importance, your manager wants to understand *how* this feature influences the model's predictions. A Partial Dependence Plot (PDP) visualizes the average relationship between the feature and the predicted sale price, averaging over the observed values of all other features.

**Your Task:**
1. Use the best model and the most important feature identified in Challenge 5.
2. Generate a Partial Dependence Plot (PDP) for this feature using the test set.
3. Interpret the plot: How do changes in this feature affect the predicted house price? Is the relationship linear, monotonic, or more complex?

**Success criteria:** Your code should generate a clear PDP for the most important feature and provide insights into how the model uses this feature to make predictions.

In [None]:
# Your Turn: write code here to compute the PDP plot

## 🎓 Lab Wrap-Up & Reflection

### 💾 Save Your Work
You will use your findings to complete this week's homework quiz!

### ✅ What You Accomplished
In this lab, you practiced:
- Implementing cross-validation to compare models without touching the test set
- Using GridSearchCV to systematically tune hyperparameters
- Engineering features by encoding categorical variables and scaling numerical features
- Building pipelines to prevent data leakage and ensure reproducible workflows
- Executing the complete 5-stage professional ML workflow

### 🤔 Reflection Questions
Take 2-3 minutes to consider:
- What concept from today clicked for you?
- What would you like more practice with?
- How does the professional workflow differ from what you were doing before?
- Why is it critical to keep the test set locked until the very end?
- How might you use pipelines in future projects?

### 🔗 Connection to Course Goals
This lab represents a critical milestone in your data science journey. You've moved from learning individual modeling techniques to understanding how professional data scientists build production-ready models. The workflow you practiced today is exactly what you'll use in real-world ML projects:

1. **Cross-validation** ensures your model comparisons are trustworthy
2. **Hyperparameter tuning** finds optimal configurations systematically
3. **Feature engineering** extracts maximum predictive power from your data
4. **Pipelines** prevent data leakage and make your work reproducible
5. **Test set discipline** gives you honest performance estimates

These aren't just classroom skills; they're industry best practices.

## 🚨 Troubleshooting & Common Issues

**Issue 1: GridSearchCV is taking forever to run**
- **Solution:** Reduce the size of your parameter grid during experimentation. Instead of `[1, 10, 100, 1000]`, try `[10, 100]` first. You can also reduce `cv` from 5 to 3 for faster prototyping.

**Issue 2: "ValueError: Input contains NaN"**
- **Solution:** Some features in the Ames dataset have missing values. Either drop rows with missing values using `.dropna()`, or impute missing values using `SimpleImputer` from sklearn.
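
If you choose imputation, placing `SimpleImputer` inside a pipeline keeps the fill values from being computed on validation rows. A minimal sketch on synthetic data with artificially injected missing values (an assumption for illustration; `impute_pipe` and the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Knock out roughly 10% of the feature values to mimic missing data
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# The median used for filling is learned on the training folds only
impute_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LinearRegression()),
])

scores = cross_val_score(
    impute_pipe, X_demo, y_demo,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
print(f"Mean RMSE with imputation: {-scores.mean():.2f}")
```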

**Issue 3: One-hot encoding creates too many columns**
- **Solution:** This is normal for features with many categories (like Neighborhood). You can: (1) use only high-frequency categories, (2) use label encoding for tree-based models instead, or (3) accept the high dimensionality.

**Issue 4: "Arrays must have the same shape" when predicting on test set**
- **Solution:** Make sure you apply the same transformations to both train and test sets. If you used `pd.get_dummies()` on the train set, you need to apply it to the test set too, ensuring the same columns exist.
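
One way to guarantee matching columns with `pd.get_dummies()` is to reindex the test columns to the training layout. The sketch below uses toy frames (hypothetical values and names, not the Ames data); using `OneHotEncoder(handle_unknown='ignore')` in a pipeline, as in Part A, avoids the problem altogether.

```python
import pandas as pd

# Toy frames (hypothetical values for illustration)
train_demo = pd.DataFrame({'BldgType': ['1Fam', 'Duplex', 'Twnhs']})
test_demo = pd.DataFrame({'BldgType': ['1Fam', 'TwnhsE']})  # 'TwnhsE' never appears in train

train_dummies = pd.get_dummies(train_demo)
test_dummies = pd.get_dummies(test_demo)
print(list(test_dummies.columns))  # differs from the training columns

# Reindex to the training layout: columns missing from test are filled with 0,
# and columns for categories unseen in training are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```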

**Issue 5: Pipeline syntax is confusing**
- **Solution:** Remember the double underscore notation: `'step_name__parameter_name'`. To tune a Ridge model in a pipeline, use `'model__alpha'` where 'model' is the name you gave that pipeline step.
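
If you are unsure which names are valid, `get_params()` lists every tunable parameter of a pipeline, including the `step__parameter` forms. A minimal sketch with a hypothetical two-step pipeline (`ridge_pipe` and `ridge_grid` are illustrative names):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge()),
])

# List the parameter names that belong to the 'model' step
tunable = sorted(ridge_pipe.get_params().keys())
print([name for name in tunable if name.startswith('model__')])

# Any of these names can be used as a key in a GridSearchCV parameter grid
ridge_grid = {'model__alpha': [0.1, 1.0, 10.0]}
```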

**Issue 6: CV scores vary dramatically between folds**
- **Solution:** High variance in CV scores suggests your model is unstable or you have outliers. Try: (1) increasing training data, (2) simplifying your model, (3) checking for data quality issues, or (4) using stratified splits if appropriate.

**General Debugging Tips:**
- Print the shape of your data at each step to catch dimension mismatches early
- Start simple (fewer features, simpler models) and add complexity gradually
- Use `verbose=1` in GridSearchCV to see progress
- Check for missing values with `df.isnull().sum()`
- When encoding categorical variables, verify the columns match between train and test sets