# TA Guidance: Week 12 Lab - The Professional ML Workflow

## 🎯 Lab Overview and Teaching Philosophy

**Critical Understanding:** This lab represents a **MAJOR MILESTONE** in the students' data science journey. Until now, they've been "peeking" at the test set repeatedly, which is a fundamental violation of professional ML practice. This week, they learn the **proper 5-stage workflow** used in production environments.

### Learning Objectives
- Students will implement k-fold cross-validation to compare models without contaminating the test set
- Students will use GridSearchCV to systematically tune hyperparameters across multiple parameters
- Students will apply feature engineering techniques (encoding, scaling, feature creation)
- Students will build end-to-end pipelines that prevent data leakage
- Students will execute the complete professional ML workflow from data preparation through final evaluation

### Time Allocation & Teaching Strategy
- **[0–30 min]** Part A: Guided Reinforcement (TA-led practice with cross-validation, GridSearchCV, and feature engineering)
- **[30–40 min]** Class Q&A: Discussion and clarification of key concepts
- **[40–72 min]** Part B: Independent Challenges (6 group challenges applying the complete professional workflow)
- **[72–75 min]** Wrap-Up & Reflection: What students learned and next steps

### Content Alignment
- Directly reinforces Tuesday's slides on cross-validation and hyperparameter tuning
- Provides hands-on practice with concepts from Chapters 28-30
- Uses familiar Ames housing data to focus on workflow rather than data exploration
- Bridges beginner practices to professional production-ready techniques

## 🛠️ Pre-Lab Setup Instructions

**Technical Setup:**
- Ensure all students can access Google Colab and load the lab notebook
- Test that the Ames dataset loads correctly from both the local and GitHub paths
- Verify sklearn, pandas, numpy imports work properly
- **WARNING**: Challenges 2 and 4 involve 540 model fits; warn students this takes 2-3 minutes

**Important Teaching Approach:**
- **Part A**: Walk through guided examples step-by-step, ensure everyone follows along
- **Part B**: Students work in groups of 2-4; circulate to provide help but let them struggle productively
- **Group work encouraged**: This lab is designed for collaborative learning

**Materials Needed:**
- Lab notebook: `12_wk12_lab.ipynb`
- This TA guidance notebook
- Access to Ames housing dataset (`ames_clean.csv`)

## üìö Key Concepts to Emphasize

1. **Test Set Contamination**: Every "peek" at the test set makes performance estimates less trustworthy
2. **Cross-Validation Philosophy**: Compare models and tune hyperparameters using ONLY training data
3. **GridSearchCV Automation**: Systematic exploration of parameter space with built-in CV
4. **Feature Engineering Impact**: Encoding and scaling can dramatically improve model performance
5. **Pipeline Benefits**: Prevents data leakage and ensures reproducible workflows
6. **The 5-Stage Workflow**: Data prep → Split → Compare models (CV) → Tune (GridSearchCV) → Final test evaluation
7. **Random State Consistency**: Always use `random_state=42` for reproducibility
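The five stages above can be sketched end to end in a few dozen lines. This is a minimal illustration, assuming scikit-learn is installed; `make_friedman1` generates synthetic data standing in for the cleaned Ames features, and the two candidate models and their small grids are illustrative choices rather than the lab's own.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Stage 1: data prep (synthetic, nonlinear stand-in for the cleaned Ames features)
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=42)

# Stage 2: split once; from here on X_test / y_test are LOCKED
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each candidate: (model, hyperparameter grid to use if it wins Stage 3)
candidates = {
    'Linear Regression': (LinearRegression(), {'fit_intercept': [True, False]}),
    'Random Forest': (RandomForestRegressor(n_estimators=50, random_state=42),
                      {'max_depth': [3, 6, None]}),
}

# Stage 3: compare models with 5-fold CV on the training data only
cv_rmse = {}
for name, (model, _) in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()  # scores are negated RMSE; flip the sign
best_name = min(cv_rmse, key=cv_rmse.get)

# Stage 4: tune the winning model with GridSearchCV (still training data only)
model, grid = candidates[best_name]
grid_search = GridSearchCV(model, grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)

# Stage 5: unlock the test set ONCE, using the estimator refit on all of X_train
test_rmse = np.sqrt(mean_squared_error(y_test, grid_search.predict(X_test)))
print(f"CV RMSE: {cv_rmse}")
print(f"Chosen: {best_name}, best params: {grid_search.best_params_}")
print(f"Final test RMSE: {test_rmse:.2f}")
```

The design point to highlight is that `X_test` and `y_test` are used only in the final stage; every model-selection decision before that relies on `X_train` alone.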

## Part A Teaching Guide: Guided Reinforcement (30 minutes)

### Section 1: Cross-Validation Basics (10 minutes)

**Teaching Approach:**
- **Start with the problem**: "How many of you have tuned models by checking test set performance multiple times?" (Most will raise hands)
- **Explain the issue**: Each peek makes your performance estimate less reliable, because you're overfitting to the test set
- **Present the solution**: Cross-validation lets you compare models using only training data
- **Emphasize the mindset shift**: "The test set is LOCKED until the very end; we pretend it doesn't exist"

### Key Teaching Points:

**1. Why Cross-Validation Matters:**
- Simple train/test split gives ONE performance estimate‚Äîcould be lucky or unlucky
- Cross-validation gives MULTIPLE estimates‚Äîmore robust and reliable
- Using CV, we can compare models without ever touching the test set

**2. How K-Fold CV Works:**
- Data is split into K folds (usually 5 or 10)
- Train on K-1 folds, validate on the remaining fold
- Repeat K times so each fold serves as validation once
- Average the K performance scores
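To make the fold mechanics concrete, the sketch below (assuming scikit-learn's `KFold`; the 10-sample array is a toy stand-in) prints which rows are used for training and validation in each of 5 folds and counts how often each row is validated.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature

# For a regressor, cross_val_score(cv=5) uses this same unshuffled 5-fold splitter
kf = KFold(n_splits=5)
val_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Train on K-1 folds (8 samples), validate on the held-out fold (2 samples)
    print(f"Fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    val_counts[val_idx] += 1

# Every sample lands in a validation fold exactly once
print(val_counts)
```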

**3. Interpreting CV Results:**
- **Mean score**: Estimate of expected performance on unseen data
- **Standard deviation**: Consistency across different data splits
- Low std dev = stable model, high std dev = unstable model

### Guided Example Walkthrough:

**Cell 5 - Train/Test Split:**
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
**Teaching moment**: "Once we run this cell, we LOCK the test set. We won't touch X_test or y_test until Challenge 4!"

**Cell 6 - First CV Example:**
```python
cv_scores = cross_val_score(
    dt, X_train, y_train, 
    cv=5, 
    scoring='neg_root_mean_squared_error'
)
```
**Teaching points**:
- `cv=5` means 5-fold cross-validation
- `scoring='neg_root_mean_squared_error'` returns negative RMSE (convert to positive for interpretation)
- Notice we ONLY use X_train and y_train; the test set stays untouched!

**Expected Output Discussion:**
```
RMSE per fold: [34567, 35123, 33890, 34456, 35001]
Mean RMSE: $34,607
Std Dev: $438
```
- "What does the mean tell us?" → Expected performance on new data
- "What does std dev tell us?" → How consistent the model is
- "Why are the scores different across folds?" → Different validation sets, natural variation
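A quick way to answer the first two prompts is to recompute the summary from the per-fold values shown in the expected output. This sketch starts from the negated scores that `cross_val_score` returns. NumPy's `std` defaults to the population formula (`ddof=0`), the same one `cv_rmse.std()` uses in the notebook solutions; `ddof=1` gives the slightly larger sample estimate.

```python
import numpy as np

# Per-fold values from the expected output, negated as cross_val_score returns them
neg_scores = np.array([-34567, -35123, -33890, -34456, -35001])
rmse = -neg_scores  # flip the sign to get positive RMSE

mean_rmse = rmse.mean()
std_rmse = rmse.std()               # population std (NumPy default, ddof=0)
std_rmse_sample = rmse.std(ddof=1)  # sample std

print(f"Mean RMSE: ${mean_rmse:,.0f}")                # Mean RMSE: $34,607
print(f"Std Dev:   ${std_rmse:,.0f}")                 # Std Dev:   $438
print(f"Std Dev (ddof=1): ${std_rmse_sample:,.0f}")   # Std Dev (ddof=1): $490
```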

### 🧠 "Your Turn" Exercise Guidance:

**Give students 3-5 minutes** to modify `max_depth` and compare models.

**Circulate and check:**
- Are they changing only the `max_depth` parameter?
- Are they comparing mean RMSE values?
- Do they understand which model is "better" (lower RMSE)?

**Expected behavior (typical, not guaranteed):**
- `max_depth=5`: Higher RMSE (underfitting)
- `max_depth=15`: Lower RMSE (better fit)
- If the deeper tree scores worse instead, that is overfitting; either outcome illustrates the bias-variance tradeoff

**Discussion prompts:**
- "Which model had lower mean RMSE?"
- "Which model had more consistent scores (lower std dev)?"
- "How is this better than checking the test set repeatedly?"

**Key insight to emphasize**: "You just compared two models without EVER touching the test set. That's the power of cross-validation!"

### Section 2: Hyperparameter Tuning with GridSearchCV (10 minutes)

**Teaching Approach:**
- **Start with the pain point**: "Manually trying different hyperparameters one by one is tedious and error-prone"
- **Introduce automation**: "GridSearchCV systematically tests all combinations and uses CV to find the best"
- **Emphasize scale**: "We're going to train 60 models in one command‚Äîthis is the power of automation"

### Key Teaching Points:

**1. What GridSearchCV Does:**
- Takes a parameter grid (all combinations you want to test)
- For each combination, performs k-fold CV
- Returns the best parameters and best CV score
- Automatically retrains on full training set with best parameters

**2. Parameter Grid Structure:**
```python
param_grid = {
    'n_estimators': [100, 200],      # 2 values
    'max_depth': [10, 15, 20],       # 3 values
    'min_samples_split': [2, 5]      # 2 values
}
```
**Teaching moment**: "How many combinations will we test? 2 × 3 × 2 = 12 configurations. With 5-fold CV, that's 60 models!"
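The count can be checked directly, since a grid is the Cartesian product of the value lists. This sketch uses only the standard library's `itertools.product` (scikit-learn's `ParameterGrid` enumerates the same combinations); the 5-fold multiplier is the `cv=5` setting used in this lab.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5],
}

# Every combination of one value per parameter is one configuration
configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5

print(len(configs))            # 12
print(len(configs) * n_folds)  # 60 CV fits
print(configs[0])              # {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 2}
```

With the default `refit=True`, GridSearchCV also performs one extra fit of the winning configuration on the full training set, so the actual total is 61 fits.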

**3. Computational Cost Awareness:**
- More parameters = exponentially more models
- Start with coarse grid, then refine
- Use `n_jobs=-1` to parallelize across CPU cores
- Be strategic about parameter ranges

### Guided Example Walkthrough:

**Cell 11 - GridSearchCV Setup:**
```python
grid_search = GridSearchCV(
    rf, 
    param_grid, 
    cv=5, 
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)
```

**Teaching points**:
- `cv=5`: Each configuration tested with 5-fold CV
- `scoring`: The metric used to rank configurations (negated RMSE, so higher is better)
- `n_jobs=-1`: Speeds up computation by using all cores
- `verbose=1`: Shows progress during fitting

**Cell 11 - Fitting:**
```python
grid_search.fit(X_train, y_train)
```
**Teaching moment**: "This one line trains 60 models (plus one final refit on the full training set)! With `verbose=1`, it prints a line like `Fitting 5 folds for each of 12 candidates, totalling 60 fits`."

**Expected wait time**: 30-60 seconds depending on system

**Cell 12 - Results:**
```python
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: ${-grid_search.best_score_:,.0f}")
```

**Discussion prompts**:
- "What are the optimal hyperparameters GridSearchCV found?"
- "How does the best CV RMSE compare to our earlier models?"
- "Why is this better than manually trying different values?"

**Key insights to emphasize**:
- Systematic exploration finds better configurations than manual search
- Still haven't touched the test set!
- `grid_search.best_estimator_` is already retrained on full training data
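Beyond `best_params_`, students can inspect every configuration through `cv_results_`. A minimal sketch, assuming scikit-learn and pandas; `make_regression` data and the small grid stand in for the notebook's Ames data and Cell 11 grid.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Ames training data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 5]},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
grid_search.fit(X_train, y_train)

# One row per configuration; rank_test_score == 1 marks the best one
results = pd.DataFrame(grid_search.cv_results_)
results['mean_cv_rmse'] = -results['mean_test_score']  # back to positive RMSE
print(results.sort_values('rank_test_score')[['params', 'mean_cv_rmse', 'std_test_score']])

# best_estimator_ was already refit on all of X_train (refit=True is the default)
print(grid_search.best_params_)
```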

### 🧪 Practice Exercise Guidance:

**Give students 5-7 minutes** to complete the DecisionTree GridSearchCV exercise.

**Common issues to watch for:**
1. **Syntax errors in param_grid**: Check dictionary structure and list syntax
2. **Forgot `cv=5`**: Remind them this sets the number of folds
3. **Wrong attribute names**: `best_params_` and `best_score_` (note the underscores!)
4. **Forgot negative sign**: `best_score_` is negative, need `-grid_search.best_score_`

**Solution (for reference):**
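```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeRegressor(random_state=42)

grid_search = GridSearchCV(
    dt,
    param_grid,
    cv=5,                                    # 5-fold CV for every configuration
    scoring='neg_root_mean_squared_error',
    verbose=1
)

grid_search.fit(X_train, y_train)            # training data only; test set stays locked

print("Best parameters:", grid_search.best_params_)
print(f"Best CV RMSE: ${-grid_search.best_score_:,.0f}")  # flip the sign of the negated score
```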

**Expected result**: Best RMSE should be around $30,000-$35,000 depending on optimal params found.

**Computational note**: 4 × 3 × 3 = 36 configs × 5 folds = 180 models, which may take 30-45 seconds.

**After exercise discussion:**
- "How many total models did GridSearchCV train?"
- "What were the optimal hyperparameters?"
- "Did tuning improve performance significantly?"

### Section 3: Feature Engineering Essentials (10 minutes)

**Teaching Approach:**
- **Start with motivation**: "Raw data often isn't in the best format for ML‚Äîwe need to transform it"
- **Focus on practical techniques**: Scaling and encoding are the bread-and-butter of feature engineering
- **Emphasize when techniques matter**: Not all models benefit from all transformations

### Key Teaching Points:

**1. When Scaling Matters:**
- **Distance- and gradient-based models** (KNN, SVM, Neural Networks): REQUIRE scaling
- **Regularized models** (Ridge, Lasso, ElasticNet): REQUIRE scaling (the penalty depends on coefficient size, which depends on feature scale)
- **Tree-based models** (Decision Trees, Random Forest, XGBoost): DON'T need scaling
- **Plain Linear Regression**: Doesn't need scaling (rescaling a feature only rescales its coefficient; predictions are unchanged)

**2. StandardScaler Mechanics:**
- Transforms features to have mean=0, std=1
- Formula: `(x - mean) / std`
- **CRITICAL**: Fit on training data, transform both train and test
- **Data leakage warning**: Never fit scaler on combined train+test data!
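The fit-on-train, transform-both rule can be checked numerically. This sketch assumes scikit-learn and NumPy, with randomly generated values standing in for a column like GrLivArea, and compares `StandardScaler` against the `(x - mean) / std` formula computed from training statistics only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=1500, scale=400, size=(100, 1))  # GrLivArea-like column (synthetic)
X_test = rng.normal(loc=1600, scale=400, size=(25, 1))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from train only
X_test_scaled = scaler.transform(X_test)        # reuses the TRAINING mean and std

# The same formula by hand: (x - mean) / std, using training statistics for both sets
train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)
manual_train = (X_train - train_mean) / train_std
manual_test = (X_test - train_mean) / train_std

print(np.allclose(X_train_scaled, manual_train), np.allclose(X_test_scaled, manual_test))
print(X_train_scaled.mean().round(6), X_train_scaled.std().round(6))  # ~0 and 1
print(X_test_scaled.mean().round(2))  # generally not 0: test data uses train statistics
```

StandardScaler uses the population standard deviation (`ddof=0`), which is why the manual version matches NumPy's default `std`.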

**3. Encoding Categorical Variables:**
- **One-Hot Encoding**: Creates binary columns for each category
  - Best for nominal categories (no inherent order)
  - Can create many columns with high-cardinality features
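- **Label (Ordinal) Encoding**: Assigns integers to categories
  - Best for ordinal categories (natural order), but only if the integers follow that order: `OrdinalEncoder(categories=...)` lets you set it, while `LabelEncoder` sorts categories alphabetically and is designed for target labels
  - Works with tree-based models (they can learn splits)
  - Can mislead linear models (implies order/magnitude)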
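The difference between the encoders is easiest to see on a small ordinal column. This sketch assumes scikit-learn; the quality codes mimic the Ames Po/Fa/TA/Gd/Ex scale, and the five rows are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical quality ratings using Ames-style codes (Po < Fa < TA < Gd < Ex)
quality = np.array([['TA'], ['Gd'], ['Ex'], ['Fa'], ['TA']])

# One-hot: one binary column per observed category, no implied order
onehot = OneHotEncoder(handle_unknown='ignore')
print(onehot.fit_transform(quality).toarray())  # 5 rows x 4 columns
print(onehot.categories_)

# LabelEncoder sorts alphabetically: Ex=0, Fa=1, Gd=2, TA=3 (not the quality order)
le = LabelEncoder()
print(le.fit_transform(quality.ravel()))  # [3 2 0 1 3]

# OrdinalEncoder with an explicit order keeps the real ranking: Po=0 ... Ex=4
ordinal = OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex']])
print(ordinal.fit_transform(quality).ravel())  # [2. 3. 4. 1. 2.]
```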

### Guided Example Walkthrough:

**Cells 17-18 - Scaling Impact:**

**Without scaling (Cell 17):**
```python
knn = KNeighborsRegressor(n_neighbors=5)
cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')
```
**Expected result**: RMSE around $60,000-$70,000 (poor performance)

**Teaching moment**: "Why is KNN performing so poorly? Because features are on different scales: GrLivArea (1000-4000) spans thousands of units while YearBuilt (1900-2010) spans about a hundred, so GrLivArea dominates the distance calculation!"

**With scaling (Cell 18):**
```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
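**CRITICAL TEACHING POINT**: "Notice we FIT the scaler on training data, but TRANSFORM both train and test. This prevents the test set from leaking into training!" Note for TAs: running `cross_val_score` on the already-scaled `X_train` still lets each validation fold influence the scaler's mean and std; the pipelines in Cells 26-27 remove that subtler leak.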

**Expected result**: RMSE around $35,000-$45,000 (dramatic improvement!)

**Discussion prompts:**
- "Why did scaling help KNN but wouldn't help Random Forest?"
- "What would happen if we fit the scaler on all the data before splitting?"
- "When should we use scaling?"
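To answer the first two prompts with evidence, this sketch builds a synthetic dataset (an assumption, not Ames) where the useful feature lives on a small scale and an irrelevant feature on a large one, then compares KNN with and without scaling. Placing `StandardScaler` inside a pipeline means it is refit on each training fold, so the comparison itself is leakage-free.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400
signal = rng.uniform(0, 10, n)    # informative feature on a small scale
noise = rng.uniform(0, 5000, n)   # irrelevant feature on a large scale
X = np.column_stack([signal, noise])
y = 3 * signal + rng.normal(0, 1, n)

def cv_rmse(model):
    """Mean 5-fold CV RMSE (sign flipped from scikit-learn's negated score)."""
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
    return -scores.mean()

# Unscaled: distances are driven almost entirely by the irrelevant large-scale column
raw = cv_rmse(KNeighborsRegressor(n_neighbors=5))
# Scaled inside a pipeline: the scaler is refit on every training fold
scaled = cv_rmse(make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)))
print(f"unscaled RMSE: {raw:.2f}   scaled RMSE: {scaled:.2f}")
```

Swapping `KNeighborsRegressor` for `RandomForestRegressor` in both lines is a quick way to show students that scaling makes essentially no difference for trees.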

**Cell 19 - When Scaling Matters Table:**
Walk through this table carefully‚Äîstudents often misunderstand which models need scaling.

**Key takeaways:**
- **Always scale**: KNN, SVM, Neural Networks, Ridge/Lasso, PCA
- **Doesn't matter**: Plain Linear Regression, Decision Trees, Random Forest, XGBoost, Gradient Boosting

**Cells 21-22 - One-Hot Encoding:**

**Setup (Cell 21):**
```python
features = ['Neighborhood', 'BldgType', 'GrLivArea', ...]
```
**Teaching moment**: "We're adding categorical variables‚ÄîNeighborhood has 25 unique values, BldgType has 5."

**Encoding (Cell 22):**
```python
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_cat_encoded = encoder.fit_transform(X_train[cat_cols])
```

**Teaching points:**
- `handle_unknown='ignore'`: Won't crash if test set has new categories
- `sparse_output=False`: Returns dense array (easier to work with)
- Creates 30 new columns (25 neighborhoods + 5 building types)

**Expected output**: "One-hot encoding turned our 8 original features into 36 features."

**Discussion**: "Why did we go from 8 to 36 features? Is this always necessary?"

**Cell 24 - Label Encoding Alternative:**

```python
le = LabelEncoder()
X_train['Neighborhood_LE'] = le.fit_transform(X_train['Neighborhood'])
```

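**Teaching moment**: "Label encoding assigns each neighborhood a number 0-24 in alphabetical order. This works well for tree-based models but can mislead linear models, which read the numbers as an order and a magnitude." Also note that `LabelEncoder.transform` raises an error on neighborhoods it never saw during fitting.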

**When to use each:**
- **One-hot**: Linear models, neural networks, when categories are nominal
- **Label**: Tree-based models and high-cardinality features; for ordinal categories, use `OrdinalEncoder` with an explicit category order

**Cells 26-27 - Pipelines:**

**Why pipelines are critical:**
1. **Prevent data leakage**: Transformations fit only on training folds during CV
2. **Reproducibility**: Easy to apply same transformations to new data
3. **Cleaner code**: All steps bundled together
4. **Deployment-ready**: Can save and load entire pipeline

**Pipeline structure:**
```python
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), onehot_features)
    ]
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', DecisionTreeRegressor(random_state=42))
])
```

**Teaching points:**
- `ColumnTransformer`: Applies different transformations to different columns
- `Pipeline`: Chains preprocessing and modeling steps
- Can pass entire pipeline to `GridSearchCV`
- Use `model__parameter` syntax to tune model hyperparameters in pipeline

**Expected output**: Best CV RMSE around $25,000-$30,000 (improvement from feature engineering!)

**Key insight**: "We just built a production-ready pipeline that prevents data leakage and is easy to deploy!"

## Class Discussion/Q&A Section (10 minutes)

### Facilitation Strategy:

**Open with reflection prompt**: "Before we move to independent challenges, let's discuss some key concepts. I want to hear YOUR understanding."

### Discussion Prompts (Select 3-4):

**1. Cross-Validation vs. Train/Test Split**
- Prompt: "What's the difference between cross-validation and a simple train/test split?"
- Expected answers:
  - CV gives multiple performance estimates, train/test gives one
  - CV uses training data more efficiently
  - CV provides measure of model stability (std dev)
- **TA clarification**: "Both approaches use a test set! CV just helps us make better decisions about model selection and tuning using ONLY the training portion."

**2. Test Set Discipline**
- Prompt: "Why shouldn't we look at the test set until the very end?"
- Expected answers:
  - Prevents overfitting to test set
  - Keeps performance estimate honest
  - Simulates real-world deployment (you don't have future data!)
- **TA clarification**: "Every time you peek at test set performance and make a decision based on it, you're indirectly optimizing for that specific test set. Your performance estimate becomes less trustworthy."

**3. GridSearchCV vs. Manual Tuning**
- Prompt: "When would you use GridSearchCV vs. manually trying different hyperparameters?"
- Expected answers:
  - GridSearchCV: When you have multiple parameters to tune, want systematic exploration
  - Manual: When you have intuition about good values, doing quick experiments
- **TA clarification**: "In production ML, you'll typically use an automated search such as GridSearchCV or RandomizedSearchCV. Manual tuning doesn't scale and introduces human bias."

**4. Data Leakage from Scaling**
- Prompt: "What happens if you fit a scaler on all your data before splitting into train/test?"
- Expected answers (students often struggle here):
  - Information from test set leaks into training data
  - Performance estimates become overly optimistic
  - Model won't perform as well on truly new data
- **TA clarification**: "The scaler computes mean and std from the data. If you fit on all data, your training set 'knows' information about the test set. Always fit transformers on training data only!"

**5. Pipeline Benefits**
- Prompt: "How do pipelines help prevent data leakage?"
- Expected answers:
  - Automatically fit transformers only on training folds during CV
  - Ensures consistent transformation order
  - Makes it much harder to accidentally apply transformations incorrectly
- **TA clarification**: "When you pass a pipeline to GridSearchCV, the scaling/encoding is fit INSIDE each CV fold. This is the gold standard for preventing leakage."

### Common Blockers and Clarifications:

**1. "My GridSearchCV is taking forever!"**
- **Root cause**: Too many parameter combinations (exponential growth)
- **Solution**: "Start with a coarse grid. If you have 5 values each for 4 parameters, that's 5^4 = 625 configurations! Try 2-3 values per parameter first."
- **Teaching moment**: "In practice, you often do a coarse grid search first, then a fine-grained search around the best region."
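A coarse-to-fine search can be written as two small grid searches. This sketch assumes scikit-learn and uses synthetic `make_regression` data; the coarse depths and the ±2 refinement window are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_train, y_train = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

def search(depths):
    """Run a 5-fold grid search over max_depth only and return the fitted search."""
    gs = GridSearchCV(DecisionTreeRegressor(random_state=42), {'max_depth': depths},
                      cv=5, scoring='neg_root_mean_squared_error')
    gs.fit(X_train, y_train)
    return gs

# Pass 1: a few widely spaced values
coarse = search([2, 4, 8, 16])
best = coarse.best_params_['max_depth']

# Pass 2: the neighbors of the coarse winner (clipped so depth stays >= 1)
fine_depths = sorted({max(1, best + d) for d in (-2, -1, 0, 1, 2)})
fine = search(fine_depths)

print(best, fine_depths, fine.best_params_['max_depth'])
```

Because the fine grid includes the coarse winner and the folds are identical, the refined search can only match or improve the coarse CV score, at the cost of 5 configurations instead of a full dense grid.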

**2. "Do I need to scale features for Random Forest?"**
- **Answer**: "No! Tree-based models are scale-invariant. Their splits depend only on the ordering of values within each feature, not on magnitude."
- **Why students ask**: Confusion about when scaling matters
- **Clarification**: Refer back to the table in Cell 19

**3. "When do I use one-hot vs. label encoding?"**
- **Rule of thumb**:
  - Nominal categories (no order): One-hot encoding
  - Ordinal categories (natural order): `OrdinalEncoder` with an explicit category order (`LabelEncoder` sorts alphabetically)
  - High cardinality + tree models: Label encoding (avoids dimension explosion)
  - Linear models: Almost always one-hot (label encoding implies order)
- **Example**: "Neighborhood is nominal (no inherent order), so one-hot is safer. But with 25 neighborhoods, label encoding works fine for tree models."

**4. "I'm getting different results than my partner!"**
- **Root cause**: Inconsistent random states
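- **Solution**: "Always use `random_state=42` in ALL random operations: train_test_split and model initialization." GridSearchCV has no `random_state` of its own; with an integer `cv` on a regressor it uses unshuffled KFold, which is already deterministic. If you shuffle folds, pass `cv=KFold(n_splits=5, shuffle=True, random_state=42)`.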
- **Teaching moment**: "Reproducibility is critical in professional ML. Different random states = different results = can't compare fairly."
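Because `GridSearchCV` itself takes no seed, reproducible shuffled folds come from the splitter object. A minimal sketch assuming scikit-learn, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=3, random_state=42)

# Fix the seed on the splitter, since GridSearchCV has no random_state parameter
cv = KFold(n_splits=5, shuffle=True, random_state=42)

folds_a = [val.tolist() for _, val in cv.split(X)]
folds_b = [val.tolist() for _, val in cv.split(X)]
print(folds_a == folds_b)  # True: same seed, same folds on every call

gs = GridSearchCV(DecisionTreeRegressor(random_state=42),  # model seed fixed too
                  {'max_depth': [2, 4]}, cv=cv, scoring='neg_root_mean_squared_error')
gs.fit(X, y)
print(gs.best_params_)
```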

**5. "What's the difference between .fit(), .transform(), and .fit_transform()?"**
- **Explanation**:
  - `.fit()`: Learns parameters from data (e.g., mean/std for scaler)
  - `.transform()`: Applies learned transformation
  - `.fit_transform()`: Does both in one step (only use on training data!)
- **Critical rule**: "Fit on training data, transform on both train and test."

### Transition to Part B:

"Great discussion! Now you'll apply these concepts in 6 independent challenges. Work in groups of 2-4. These challenges don't provide starter code; you'll build everything from scratch using the patterns you learned in Part A. Don't worry, I'll be circulating to help. Let's get started!"

## Part B Teaching Strategy: Independent Challenges (32 minutes)

### 🚨 Critical Teaching Philosophy for Part B

**Group Work Approach:**
- **Encourage collaboration**: "Work in groups of 2-4. Discuss your approach before coding."
- **Productive struggle**: Let groups struggle for 3-5 minutes before intervening
- **Circulate actively**: Walk around, check progress, provide strategic hints
- **Don't give complete solutions**: Guide thinking, don't write code for them

**Time Management:**
- **Each challenge has a time allocation** (5-6 minutes per challenge)
- **Stop after each challenge**: Briefly discuss results as a class (1-2 minutes)
- **Adjust pacing**: If groups are struggling, provide hints earlier; if succeeding, let them continue
- **Priority challenges**: 1, 2, and 4 are most important‚Äîensure these are completed

**Your Role:**
1. **Monitor progress**: Who's stuck? Who's making good progress?
2. **Provide strategic hints**: Guide approach, not specific code
3. **Debug syntax issues**: Help with Python/sklearn syntax errors
4. **Facilitate discussion**: After each challenge, share interesting observations
5. **Keep energy high**: Encourage groups, celebrate successes

### General Support Guidelines:

**What YOU CAN Do:**
- ✅ Help with syntax errors: "You need to pass `random_state=42` to train_test_split"
- ✅ Clarify concepts: "Cross-validation is for comparing models, not for final evaluation"
- ✅ Provide general guidance: "Follow the same pattern as the guided example in Part A"
- ✅ Answer method questions: "Use `cross_val_score(model, X_train, y_train, cv=5, scoring='...')`"
- ✅ Suggest debugging approaches: "Print the shape of your data at each step"

**What YOU SHOULD NOT Do:**
- ❌ Write complete code solutions: Let students build code themselves
- ❌ Tell them exactly what to type: Guide their thinking instead
- ❌ Rush to help immediately: Allow productive struggle first
- ❌ Do the work for them: They learn by doing, not watching

### Challenge 1: Compare Model Types (5 minutes)

**Learning Goal**: Practice using cross-validation to compare different model architectures without touching the test set.

**Time Allocation**: 5 minutes (3 min work, 2 min discussion)

#### Expected Student Approach:
1. Create train/test split with specified features
2. Initialize three different model types
3. Use `cross_val_score()` for each model with 5-fold CV
4. Compare mean RMSE values

#### Common Issues & Strategic Hints:

**Issue 1: "How do I create the train/test split?"**
- **Hint**: "You did this in Part A, Cell 5. What was the pattern?"
- **If still stuck**: "Use `train_test_split(X, y, test_size=0.2, random_state=42)`"

**Issue 2: "How do I run cross-validation?"**
- **Hint**: "Look at Cell 6 in Part A. What function did we use?"
- **If still stuck**: "Use `cross_val_score(model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')`"

**Issue 3: "My RMSE values are negative!"**
- **Explanation**: "Scikit-learn scorers follow a 'greater is better' convention, so error metrics like RMSE come back negated. Multiply by -1 to convert to positive RMSE."
- **Code**: `cv_rmse = -cv_scores`

**Issue 4: "Which model should I expect to perform best?"**
- **Don't tell them!** Let them discover: "Run the code and see. What do YOUR results show?"
- **Expected result** (approximate): Random Forest should have the lowest RMSE (~$30k), followed by Decision Tree (~$35k), then Linear Regression (~$40k)

#### After 5 Minutes - Brief Class Discussion:

**Questions to ask:**
- "Which model performed best? Show of hands: Linear Regression? Decision Tree? Random Forest?"
- "Why do you think Random Forest outperformed the others?"
- "Did anyone compare standard deviations? What did you notice?"

**Key teaching moments:**
- Random Forest often performs best on tabular data (averaging many trees reduces variance)
- Linear Regression assumes linear relationships, which may not fit housing data well
- Decision Tree can overfit; Random Forest mitigates this
- **Most important**: "You compared three models without EVER touching the test set!"

#### Solution (FOR YOUR REFERENCE - Don't share unless necessary):

```python
# Challenge 1 Solution - FOR TA REFERENCE ONLY

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Load data (assuming already done)
# ames = pd.read_csv('../data/ames_clean.csv')

# 1. Select features and create train/test split
features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 'FullBath', 'OverallQual']
X = ames[features]
y = ames['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print("🔒 Test set is LOCKED\n")

# 2. Compare three models using cross-validation
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
}

results = {}
for name, model in models.items():
    cv_scores = cross_val_score(
        model, X_train, y_train,
        cv=5,
        scoring='neg_root_mean_squared_error'
    )
    cv_rmse = -cv_scores
    results[name] = cv_rmse

    print(f"{name}:")
    print(f"  Mean CV RMSE: ${cv_rmse.mean():,.0f}")
    print(f"  Std Dev:      ${cv_rmse.std():,.0f}\n")

# 3. Identify best model
best_model = min(results.items(), key=lambda x: x[1].mean())
print(f"✅ Best model: {best_model[0]} with RMSE ${best_model[1].mean():,.0f}")
```

### Challenge 2: Systematic Hyperparameter Tuning (6 minutes)

**Learning Goal**: Use GridSearchCV to systematically tune hyperparameters using cross-validation.

**Time Allocation**: 6 minutes (4 min work, 2 min discussion)

**⚠️ WARNING**: This challenge trains 540 models (3×4×3×3 = 108 configs × 5 folds, plus one final refit on the full training set). Warn students it will take 2-3 minutes to run!

#### Expected Student Approach:
1. Define parameter grid with specified ranges
2. Create GridSearchCV with RandomForestRegressor
3. Fit on training data
4. Print best parameters and best score
5. Compare to Challenge 1 results

#### Common Issues & Strategic Hints:

**Issue 1: "How do I create the parameter grid?"**
- **Hint**: "Look at Cell 11 in Part A. It's a dictionary where keys are parameter names and values are lists to try."
- **If still stuck**: Show syntax: `param_grid = {'n_estimators': [100, 200, 300], ...}`

**Issue 2: "It's taking forever to run!"**
- **Explanation**: "You're training 540 models! This is normal; professional ML often takes time."
- **Teaching moment**: "This is why we use `n_jobs=-1` to parallelize across CPU cores. In production, you might run a search like this overnight or on larger cloud machines."

**Issue 3: "How do I access the best parameters?"**
- **Hint**: "GridSearchCV stores the best parameters in an attribute. Check the documentation or look at Cell 12."
- **If still stuck**: "`grid_search.best_params_` and `grid_search.best_score_`"

**Issue 4: "How much did tuning improve performance?"**
- **Prompt**: "Compare the best CV RMSE from this challenge to the Random Forest RMSE from Challenge 1."
- **Expected**: Improvement of $2,000-$5,000 in RMSE
- **Teaching moment**: "Tuning doesn't always give huge improvements, but every bit counts in production!"

#### After 6 Minutes - Brief Class Discussion:

**Questions to ask:**
- "What optimal hyperparameters did GridSearchCV find?"
- "How much did tuning improve your RMSE compared to Challenge 1?"
- "How many total models did GridSearchCV train?"

**Key teaching moments:**
- Systematic tuning often finds better configurations than intuition
- Computational cost grows exponentially with parameter grid size
- GridSearchCV handles cross-validation automatically, so it is less error-prone than manual tuning
- Still haven't touched the test set!

#### Solution (FOR YOUR REFERENCE):

```python
# Challenge 2 Solution - FOR TA REFERENCE ONLY

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# 1. Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

# 2. Create GridSearchCV
rf = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(
    rf,
    param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

# 3. Fit (this will take 2-3 minutes!)
print("⚠️ Training 540 models... this will take 2-3 minutes")
grid_search.fit(X_train, y_train)
print("✅ Grid search complete!\n")

# 4. Print results
print("="*50)
print("Best Hyperparameters:")
print("="*50)
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

best_cv_rmse = -grid_search.best_score_
print(f"\nBest CV RMSE: ${best_cv_rmse:,.0f}")

# 5. Compare to Challenge 1
# Assuming Challenge 1 Random Forest had RMSE ~$30,000
challenge1_rmse = 30000  # Students should replace with their actual value
improvement = challenge1_rmse - best_cv_rmse
print(f"\n💡 Improvement over Challenge 1: ${improvement:,.0f}")
print(f"   ({improvement/challenge1_rmse*100:.1f}% reduction in error)")
```

### Challenge 3: Build a Complete Pipeline (6 minutes)

**Learning Goal**: Construct a pipeline that handles feature engineering and modeling without data leakage.

**Time Allocation**: 6 minutes (4 min work, 2 min discussion)

#### Expected Student Approach:
1. Start fresh with new feature set (numeric + categorical)
2. Create new train/test split
3. Build ColumnTransformer for preprocessing
4. Build Pipeline combining preprocessing and model
5. Evaluate with cross-validation
6. Compare to Challenge 2 results

#### Common Issues & Strategic Hints:

**Issue 1: "How do I build a ColumnTransformer?"**
- **Hint**: "Look at Cell 27 in Part A. You need to specify which transformation applies to which columns."
- **If still stuck**: Show structure:
  ```python
  preprocessor = ColumnTransformer(
      transformers=[
          ('num', StandardScaler(), numeric_features),
          ('cat', OneHotEncoder(...), categorical_features)
      ]
  )
  ```

**Issue 2: "Do I need to scale for Random Forest?"**
- **Answer**: "No, but it doesn't hurt. The exercise asks you to do it for practice‚Äîin production, you'd skip scaling for tree models."
- **Teaching moment**: "Pipelines let you include transformations that might not be strictly necessary but ensure consistency."

**Issue 3: "How do I combine the preprocessor with the model?"**
- **Hint**: "Use `Pipeline()` with two steps: preprocessing and modeling."
- **If still stuck**: `pipeline = Pipeline([('preprocessor', preprocessor), ('model', model)])`

**Issue 4: "How do I use the optimal hyperparameters from Challenge 2?"**
- **Hint**: "When creating your RandomForestRegressor, pass the best parameters as keyword arguments."
- **Example**: `RandomForestRegressor(n_estimators=200, max_depth=15, ...)`

**Issue 5: "Did adding features improve performance?"**
- **Prompt**: "Compare your CV RMSE to Challenge 2. Did it go down?"
- **Expected**: Should improve by $2,000-$5,000 due to additional informative features
- **If worse**: "Check for errors first. Relevant features usually help, so a clear jump in RMSE often signals a bug (for example, a wrong column list)."

#### After 6 Minutes - Brief Class Discussion:

**Questions to ask:**
- "Did adding more features and encoding categoricals improve performance?"
- "What was your CV RMSE? How does it compare to Challenge 2?"
- "Why are pipelines better than manually applying transformations?"

**Key teaching moments:**
- More informative features → better performance (if features are relevant)
- Categorical variables often contain valuable information
- Pipelines prevent data leakage by fitting transformers inside CV folds
- This approach is production-ready and deployment-friendly
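Leakage from scaling is small and hard to see on Ames, so a more dramatic demonstration swaps in feature selection, which makes the same mistake (fitting a transformer on all rows before CV). This is a minimal sketch on pure-noise synthetic data; `SelectKBest` is used here only to make the leak visible and is not part of the lab. The leaky version "finds" signal that does not exist, while the pipeline re-fits the selector inside each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: no feature is actually related to the target
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2000))
y = rng.normal(size=100)

# LEAKY: pick the 10 "best" features using ALL rows, then cross-validate
X_selected = SelectKBest(f_regression, k=10).fit_transform(X, y)
leaky_r2 = cross_val_score(LinearRegression(), X_selected, y, cv=5, scoring='r2').mean()

# CORRECT: selection is a pipeline step, so it is re-fit on each training fold only
pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
pipeline_r2 = cross_val_score(pipe, X, y, cv=5, scoring='r2').mean()

print(f"Leaky CV R^2:    {leaky_r2:.2f}")     # typically optimistic despite pure noise
print(f"Pipeline CV R^2: {pipeline_r2:.2f}")  # typically near or below zero
```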

#### Solution (FOR YOUR REFERENCE):

```python
# Challenge 3 Solution - FOR TA REFERENCE ONLY

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# 1. Start fresh with expanded feature set
numeric_features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 
                   'FullBath', 'OverallQual', 'YearRemodAdd', 'BedroomAbvGr', 'TotRmsAbvGrd']
categorical_features = ['Neighborhood', 'HouseStyle']
all_features = numeric_features + categorical_features

X = ames[all_features]
y = ames['SalePrice']

# 2. New train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Features: {len(all_features)} ({len(numeric_features)} numeric, {len(categorical_features)} categorical)")
print(f"Training: {len(X_train)}, Test: {len(X_test)}\n")

# 3. Build preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# 4. Build pipeline with optimal hyperparameters from Challenge 2
# (Students should use their actual best parameters)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=200,      # Replace with actual best params
        max_depth=15,
        min_samples_split=2,
        min_samples_leaf=1,
        random_state=42
    ))
])

# 5. Evaluate with cross-validation
cv_scores = cross_val_score(
    pipeline, X_train, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error'
)
cv_rmse = -cv_scores

print("Pipeline Cross-Validation Results:")
print(f"  Mean CV RMSE: ${cv_rmse.mean():,.0f}")
print(f"  Std Dev:      ${cv_rmse.std():,.0f}")

# 6. Compare to Challenge 2
challenge2_rmse = 28000  # Students should replace with their actual value
improvement = challenge2_rmse - cv_rmse.mean()
print(f"\n💡 Improvement over Challenge 2: ${improvement:,.0f}")
if improvement > 0:
    print("   Adding more features and encoding categoricals helped!")
else:
    print("   No improvement; the new features may not add much information.")
```

### Challenge 4: Complete End-to-End Workflow (4 minutes)

**Learning Goal**: Execute the full 5-stage professional ML workflow from data preparation through final test evaluation.

**Time Allocation**: 4 minutes work (most code is from Challenge 3), then 2-3 min discussion

**⚠️ WARNING**: GridSearchCV runs 540 model fits (108 parameter combinations × 5 folds) and will take 2-3 minutes!

#### Expected Student Approach:
This challenge combines all previous challenges:
1. **Data Preparation**: Select features (from Challenge 3)
2. **Split**: Create train/test split
3. **Pipeline**: Build preprocessing pipeline (from Challenge 3)
4. **Tune**: Use GridSearchCV with pipeline (from Challenge 2)
5. **Final Evaluation**: Evaluate ONCE on test set

#### Teaching Strategy:

**Key message before they start:**
"This challenge brings together everything you've learned. Most of the code you've already written in Challenges 2 and 3; you're just combining them and adding the final test evaluation. The new part is Stage 5: evaluating on the test set for the FIRST and ONLY time."

#### Common Issues & Strategic Hints:

**Issue 1: "This seems like a lot of work!"**
- **Encouragement**: "You've already done most of it! Combine your Challenge 3 pipeline with Challenge 2's GridSearchCV."
- **Hint**: "Copy your code from Challenges 2 and 3, then add the final test evaluation."

**Issue 2: "How do I tune a pipeline's hyperparameters?"**
- **Hint**: "Use the double underscore syntax: `'model__n_estimators'`, `'model__max_depth'`, etc."
- **Example**: 
  ```python
  param_grid = {
      'model__n_estimators': [100, 200, 300],
      'model__max_depth': [5, 10, 15, 20],
      ...
  }
  ```
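Two quick checks help students here: `get_params()` lists every valid `step__parameter` name for a pipeline, and `ParameterGrid` counts the combinations in a grid, which is where the 540-fit runtime warning comes from. This sketch mirrors the lab pipeline with shortened column lists (an assumption to keep it small); nothing needs to be fitted to inspect the names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same structure as the lab pipeline; column lists shortened for illustration
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['GrLivArea', 'OverallQual']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Neighborhood']),
])
pipeline = Pipeline([('preprocessor', preprocessor),
                     ('model', RandomForestRegressor(random_state=42))])

# Every tunable name follows '<step>__<parameter>'
model_params = [name for name in pipeline.get_params() if name.startswith('model__')]
print(model_params[:5])

param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 10, 15, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5],
}
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # cv=5
print(f"{n_candidates} combinations x 5 folds = {n_fits} fits")  # 108 combinations x 5 folds = 540 fits
```

With the default `refit=True`, GridSearchCV performs one extra fit of the best combination on the full training set after the search.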

**Issue 3: "Do I retrain the model before testing?"**
- **Answer**: "No! With the default `refit=True`, GridSearchCV automatically retrains the best model on the full training set. Just use `grid_search.best_estimator_` to predict on the test set."
- **Alternative**: Use `grid_search.score(X_test, y_test)` directly
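One catch with the `grid_search.score` route: it reuses the search's scorer, so with `scoring='neg_root_mean_squared_error'` it returns the *negative* RMSE. This minimal sketch, on synthetic data and a tiny grid (both assumptions to keep it fast), shows that both routes agree once the sign is flipped.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data and a deliberately tiny grid
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {'max_depth': [3, 6]},
    cv=3,
    scoring='neg_root_mean_squared_error',
)
grid_search.fit(X_train, y_train)  # refit=True (default) retrains the best model on all of X_train

# Route 1: predict with the refit best estimator and compute RMSE yourself
y_pred = grid_search.best_estimator_.predict(X_test)
rmse_manual = np.sqrt(mean_squared_error(y_test, y_pred))

# Route 2: score() uses the same scorer as the search, so it returns NEGATIVE RMSE
rmse_from_score = -grid_search.score(X_test, y_test)

print(f"{rmse_manual:.2f} vs {rmse_from_score:.2f}")
```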

**Issue 4: "Should my test RMSE match my CV RMSE?"**
- **Answer**: "Not exactly, but they should be close. CV gives you an ESTIMATE of test performance."
- **If very different**: "If test RMSE is much worse than CV RMSE, you might have data leakage or a lucky/unlucky split."
- **Teaching moment**: "This is why we use CV: to get a more reliable estimate before committing to the test set."

#### After Challenge 4 - Important Class Discussion (3-5 minutes):

**This is a CRITICAL teaching moment‚Äîdon't rush it!**

**Questions to ask:**
1. "What was your final test RMSE?"
2. "How close was it to your best CV RMSE from GridSearchCV?"
3. "Why did we wait until NOW to touch the test set?"
4. "What would have happened if we had repeatedly checked test performance while tuning?"

**Key teaching moments:**
- **Professional workflow**: "This is exactly how professional data scientists build production models."
- **Test set discipline**: "We made ALL our decisions‚Äîmodel type, hyperparameters, features‚Äîusing only CV. The test set was locked."
- **Honest evaluation**: "Because we didn't peek at the test set, our final RMSE is an honest estimate of how this model will perform on new data."
- **One-time test**: "In production, you report this test RMSE to stakeholders. You can't tune further and re-test; that would be cheating!"

**Connection to real-world:**
"In production, you'd deploy this model to predict prices on houses you haven't seen yet. The test set simulates that future data. By keeping it locked, you ensure your performance estimate is trustworthy."

#### Solution (FOR YOUR REFERENCE):

```python
# Challenge 4 Solution - FOR TA REFERENCE ONLY
# This combines Challenges 2 and 3

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

print("="*60)
print("PROFESSIONAL ML WORKFLOW: 5 STAGES")
print("="*60)

# STAGE 1: Data Preparation
print("\n[Stage 1] Data Preparation")
numeric_features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'GarageCars', 
                   'FullBath', 'OverallQual', 'YearRemodAdd', 'BedroomAbvGr', 'TotRmsAbvGrd']
categorical_features = ['Neighborhood', 'HouseStyle']
all_features = numeric_features + categorical_features

X = ames[all_features]
y = ames['SalePrice']
print(f"  ✅ Selected {len(all_features)} features")

# STAGE 2: Initial Split (LOCK THE TEST SET)
print("\n[Stage 2] Train/Test Split")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"  ✅ Training: {len(X_train)}, Test: {len(X_test)}")
print("  🔒 TEST SET IS NOW LOCKED")

# Build preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])
print("  ✅ Pipeline created (preprocessing + model)")

# STAGE 3: Model Comparison with CV
# (Skipped in this challenge since we already chose Random Forest)

# STAGE 4: Hyperparameter Tuning with GridSearchCV
print("\n[Stage 4] Hyperparameter Tuning with GridSearchCV")
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 10, 15, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 5]
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("  ⚠️  Running 540 model fits (108 combinations x 5 folds; this takes 2-3 minutes)...")
grid_search.fit(X_train, y_train)
print("  ✅ GridSearchCV complete")

best_cv_rmse = -grid_search.best_score_
print("\n  Best hyperparameters:")
for param, value in grid_search.best_params_.items():
    print(f"    {param}: {value}")
print(f"  Best CV RMSE: ${best_cv_rmse:,.0f}")

# STAGE 5: Final Test Evaluation (FIRST AND ONLY TIME)
print("\n[Stage 5] Final Test Evaluation")
print("  🔓 UNLOCKING TEST SET FOR FINAL EVALUATION")

# best_estimator_ has already been refit on all of X_train (refit=True by default)
y_pred_test = grid_search.best_estimator_.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print(f"\n  Final Test RMSE: ${test_rmse:,.0f}")
print(f"  Best CV RMSE:    ${best_cv_rmse:,.0f}")
print(f"  Difference:      ${abs(test_rmse - best_cv_rmse):,.0f}")

print("\n" + "="*60)
print("WORKFLOW COMPLETE")
print("="*60)
print("✅ We made ALL decisions using only training data (via CV)")
print("✅ Test set was evaluated ONCE at the very end")
print("✅ Our test RMSE is an honest estimate of real-world performance")
print("✅ This model is ready for production deployment!")
```

### Challenge 5: Model Interpretation - Permutation Importance (5 minutes)

**Learning Goal**: Understand which features drive model predictions using permutation importance.

**Time Allocation**: 5 minutes (3 min work, 2 min discussion)

#### Expected Student Approach:
1. Use the best model from Challenge 4
2. Apply `permutation_importance` from sklearn.inspection
3. Create a bar chart of feature importances
4. Identify the most important feature

#### Common Issues & Strategic Hints:

**Issue 1: "How do I import permutation_importance?"**
- **Hint**: "It's in `sklearn.inspection`. Check the sklearn documentation or search for 'permutation importance sklearn'."
- **If still stuck**: `from sklearn.inspection import permutation_importance`

**Issue 2: "What data should I use for permutation importance?"**
- **Answer**: "Use the test set (X_test, y_test). We want to understand feature importance on held-out data."
- **Teaching moment**: "We already evaluated on the test set once in Challenge 4, so it's okay to use it again for interpretation (we're not making decisions that affect the model)."

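**Issue 3: "How do I get feature names from the pipeline?"**
- **Key point**: When `permutation_importance` is called on the whole pipeline with `X_test`, it shuffles the *raw* input columns before any preprocessing, so it returns one importance per original column (11 here), not one per one-hot column
- **Hint**: "Use the column names of the DataFrame you passed in."
- **If stuck**: Provide code:
  ```python
  feature_names = X_test.columns
  ```
- **Common bug**: `named_steps['preprocessor'].get_feature_names_out()` returns the expanded one-hot names; their count won't match `importances_mean`, so building the importance DataFrame fails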

**Issue 4: "My bar chart is too crowded!"**
- **Solution**: "Show only the top 10-15 features, or rotate x-axis labels."
- **Code hints**:
  - Top N features: Sort by importance and slice
  - Rotate labels: `plt.xticks(rotation=90)`
  - Horizontal bar chart: `plt.barh()` works better for many features

**Issue 5: "What feature should I expect to be most important?"**
- **Don't tell them!** Let them discover: "Run the code and see what YOUR data shows."
- **Expected**: Usually `GrLivArea` or `OverallQual`, possibly `Neighborhood`
- **Teaching moment**: "Does this make business sense? Larger homes typically cost more!"

#### After 5 Minutes - Brief Class Discussion:

**Questions to ask:**
- "What was the most important feature for predicting house prices?"
- "Were you surprised by any features that ranked high or low?"
- "How does this help you trust or explain the model to stakeholders?"

**Key teaching moments:**
- Model interpretation builds trust with stakeholders
- Permutation importance shows feature contribution without making assumptions about model internals
- Important features often align with domain knowledge (GrLivArea, OverallQual make intuitive sense)
- The `Neighborhood` column (shuffled as a whole, not one dummy at a time) might rank high: location matters!

#### Solution (FOR YOUR REFERENCE):

```python
# Challenge 5 Solution - FOR TA REFERENCE ONLY

from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
import pandas as pd

# 1. Get the best model from Challenge 4
best_model = grid_search.best_estimator_

# 2. Compute permutation importance on test set
perm_importance = permutation_importance(
    best_model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    scoring='neg_root_mean_squared_error'
)

# 3. Get feature names. permutation_importance on the full pipeline shuffles the RAW
#    input columns (before one-hot encoding), so there is one score per column of X_test.
feature_names = X_test.columns

# 4. Create DataFrame for easier sorting and plotting
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': perm_importance.importances_mean
}).sort_values('importance', ascending=False)

# 5. Plot the top features (at most 15; this feature set has 11)
top_n = min(15, len(importance_df))
plt.figure(figsize=(10, 6))
plt.barh(range(top_n), importance_df['importance'].head(top_n))
plt.yticks(range(top_n), importance_df['feature'].head(top_n))
plt.xlabel('Permutation Importance (RMSE increase when shuffled)')
plt.title(f'Top {top_n} Most Important Features for House Price Prediction')
plt.gca().invert_yaxis()  # Highest importance at top
plt.tight_layout()
plt.show()

# 6. Identify most important feature
most_important = importance_df.iloc[0]
print(f"\n🏆 Most Important Feature: {most_important['feature']}")
print(f"   Importance Score: {most_important['importance']:.2f}")
print(f"\n💡 Interpretation: Shuffling this feature increases RMSE by ~${most_important['importance']:,.0f}")
print("   This means the model heavily relies on this feature for accurate predictions.")

# Show top 5 for discussion
print("\nTop 5 Features:")
for _, row in importance_df.head(5).iterrows():
    print(f"  {row['feature']:30s}: {row['importance']:,.2f}")
```

### Challenge 6: Partial Dependence Plot (PDP) (5 minutes)

**Learning Goal**: Visualize how the most important feature affects predictions using Partial Dependence Plots.

**Time Allocation**: 5 minutes (3 min work, 2 min discussion)

#### Expected Student Approach:
1. Use the most important feature identified in Challenge 5
2. Use `PartialDependenceDisplay` from sklearn.inspection
3. Generate PDP for that feature
4. Interpret the relationship between feature values and predictions

#### Common Issues & Strategic Hints:

**Issue 1: "How do I create a partial dependence plot?"**
- **Hint**: "Look up `PartialDependenceDisplay.from_estimator()` in sklearn documentation."
- **If still stuck**: Show import: `from sklearn.inspection import PartialDependenceDisplay`

**Issue 2: "What feature should I plot?"**
- **Answer**: "Use the most important feature from Challenge 5."
- **Note**: "If it's a categorical feature such as `Neighborhood`, choose a continuous feature instead for a clearer interpretation."
- **Suggestion**: "Try GrLivArea or OverallQual if your top feature is categorical."

**Issue 3: "How do I specify which feature to plot in the pipeline?"**
- **Common confusion**: Students may expect to need post-preprocessing (one-hot) names or indices, but the PDP is computed on the whole pipeline, which takes the raw columns
- **Hint**: "Pass the original column name, e.g. `['GrLivArea']`, or its index in `X_test`. The pipeline applies the preprocessing internally."
- **Note**: "Stick to numeric features that weren't one-hot encoded; they give the clearest plots."

**Issue 4: "How do I interpret the PDP?"**
- **Prompt questions**:
  - "Is the relationship linear or non-linear?"
  - "As the feature value increases, what happens to the predicted price?"
  - "Are there any plateaus or sharp changes?"
- **Expected for GrLivArea**: Generally increasing relationship (bigger homes = higher prices)
- **Expected for OverallQual**: Strong positive relationship (better quality = higher prices)
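To confirm the naming rule from Issue 3 without drawing a plot, `sklearn.inspection.partial_dependence` returns the numbers behind the PDP. This is a minimal sketch on a hypothetical two-column frame (invented data with Ames-style names); the plotted column is a float on purpose, since recent scikit-learn versions may warn about integer-dtype columns in partial dependence.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: price rises with living area, neighborhood is noise
rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    'GrLivArea': rng.uniform(800, 3000, n),  # float dtype on purpose
    'Neighborhood': rng.choice(['A', 'B', 'C'], n),
})
y = 100 * X['GrLivArea'] + rng.normal(0, 10000, n)

pipe = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), ['GrLivArea']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Neighborhood']),
    ])),
    ('model', RandomForestRegressor(n_estimators=50, random_state=42)),
]).fit(X, y)

# Refer to the feature by its ORIGINAL column name; the pipeline preprocesses internally
pd_result = partial_dependence(pipe, X, features=['GrLivArea'],
                               kind='average', grid_resolution=20)
average = pd_result['average']  # shape: (n_outputs, n_grid_points)

print(average.shape)
print(average[0, 0].round(0), average[0, -1].round(0))  # low vs high living area
```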

#### After 5 Minutes - Brief Class Discussion:

**Questions to ask:**
- "What feature did you create a PDP for?"
- "How does that feature affect the predicted sale price?"
- "Is the relationship linear, or more complex?"
- "Does this align with your intuition about real estate?"

**Key teaching moments:**
- PDPs show the marginal effect of a feature on predictions
- Helps explain "how" the model uses features, not just "which" features matter
- Can reveal non-linear relationships that might not be obvious
- Critical for explaining models to non-technical stakeholders
- "Interpretability builds trust in ML systems"
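If students ask what "marginal effect" means mechanically, the brute-force recipe is short: force the feature to one grid value in every row, predict, average, and repeat across the grid. The sketch below is written from scratch for teaching (it is not scikit-learn's internal code) and uses a linear model on synthetic data, because there the PDP must be a straight line whose slope equals the model's coefficient, which makes the result easy to check.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)

def manual_partial_dependence(model, X, feature, grid):
    """Average prediction when `feature` is set to each grid value in every row."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                      # overwrite the column for ALL rows
        averages.append(model.predict(X_mod).mean())   # then average the predictions
    return np.array(averages)

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 10)
pdp = manual_partial_dependence(model, X, feature, grid)

# For a linear model, the PDP slope equals the feature's coefficient
slopes = np.diff(pdp) / np.diff(grid)
print(slopes.round(3), model.coef_[feature].round(3))
```

For a Random Forest the same recipe produces a curve that can bend or plateau, which is exactly what students look for in their GrLivArea plots.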

**Real-world connection:**
"In production ML, you'll often need to explain model decisions to business stakeholders who aren't data scientists. PDPs and feature importance are your primary tools for making models interpretable and trustworthy."

#### Solution (FOR YOUR REFERENCE):

```python
# Challenge 6 Solution - FOR TA REFERENCE ONLY

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# 1. Use best model from Challenge 4
best_model = grid_search.best_estimator_

# 2. Choose feature to plot
# Option 1: Use the most important feature from Challenge 5
# Option 2: Choose a continuous feature for clearer interpretation
# We'll demonstrate with GrLivArea (usually a top feature, and continuous)
feature_to_plot = 'GrLivArea'  # or 'OverallQual', or top numeric feature from Challenge 5

# The pipeline takes raw columns, so the feature is referred to by its original name.
# Casting the plotted column to float avoids integer-dtype warnings in recent scikit-learn.
X_pdp = X_test.astype({feature_to_plot: float})

# 3. Create Partial Dependence Plot
fig, ax = plt.subplots(figsize=(10, 6))

PartialDependenceDisplay.from_estimator(
    best_model,
    X_pdp,
    [feature_to_plot],  # Original column name; preprocessing happens inside the pipeline
    ax=ax
)

plt.suptitle(f'Partial Dependence Plot: {feature_to_plot}', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# 4. Interpretation
print(f"\n📊 Partial Dependence Plot Interpretation for '{feature_to_plot}':\n")

if feature_to_plot == 'GrLivArea':
    print("What the plot shows:")
    print("  • X-axis: Above-grade living area (square feet)")
    print("  • Y-axis: Average predicted sale price (marginal effect)")
    print("\nTypical pattern:")
    print("  • Generally INCREASING relationship")
    print("  • As living area increases, predicted price increases")
    print("  • May show non-linearity (e.g., larger homes don't always increase price proportionally)")
    print("  • Possible plateau at very high square footage (fewer data points, less confident)")

elif feature_to_plot == 'OverallQual':
    print("What the plot shows:")
    print("  • X-axis: Overall quality rating (1-10)")
    print("  • Y-axis: Average predicted sale price (marginal effect)")
    print("\nTypical pattern:")
    print("  • Strong POSITIVE relationship")
    print("  • Higher quality homes command significantly higher prices")
    print("  • May be relatively linear or show accelerating returns at high quality levels")

else:
    print("  Look for:")
    print("  • Direction: Does the predicted price increase or decrease?")
    print("  • Linearity: Is the relationship straight or curved?")
    print("  • Magnitude: How much does price change across the feature range?")
    print("  • Business sense: Does this align with real-world intuition?")

print("\n💡 Why PDPs Matter:")
print("  • Shows HOW the model uses each feature to make predictions")
print("  • Helps identify non-linear relationships")
print("  • Makes black-box models more interpretable")
print("  • Builds stakeholder trust in model decisions")
```

## 🎓 Lab Wrap-Up & Reflection (3-5 minutes)

### Facilitation Strategy:

**Gather the class together** for final reflection and next steps.

### Key Messages to Deliver:

**1. Celebrate Progress:**
"You just executed the professional ML workflow used in production data science! This is a huge milestone in your journey from beginner to professional."

**2. Emphasize the Paradigm Shift:**
"Before today: You built models and checked test performance repeatedly.  
After today: You know the proper workflow that keeps test set locked until the end.  
This is the difference between classroom exercises and production ML."

**3. Recap the 5-Stage Workflow:**
1. **Data Preparation**: Feature selection, handling missing values
2. **Train/Test Split**: LOCK the test set
3. **Model Comparison**: Use CV to compare different model types
4. **Hyperparameter Tuning**: Use GridSearchCV to optimize
5. **Final Evaluation**: Test ONCE on held-out data

"This workflow isn't just for school; it's exactly what you'll do in industry."

**4. Key Takeaways:**
- ✅ **Cross-validation** lets you compare models without touching the test set
- ✅ **GridSearchCV** systematically finds optimal hyperparameters
- ✅ **Feature engineering** (encoding, scaling) often improves performance
- ✅ **Pipelines** prevent data leakage and ensure reproducibility
- ✅ **Test set discipline** gives honest performance estimates
- ✅ **Model interpretation** (feature importance, PDPs) builds stakeholder trust

### Save Your Work:
"Make sure you save your notebook! You'll use your findings for this week's homework quiz."

### Reflection Prompts (Optional, if time allows):
Ask 1-2 of these questions:
- "What concept from today clicked for you?"
- "What would you like more practice with?"
- "How will you approach ML projects differently after today?"
- "What surprised you most about the professional workflow?"

### Preview Next Week:
"Next week, we'll build on this foundation by exploring advanced ensemble methods and learning how to handle even more complex ML scenarios. The workflow you learned today will be your foundation for everything that follows."

### Final Encouragement:
"You've leveled up from beginner data scientist to professional practitioner. This is a skill that will serve you throughout your career. Great work today!"

## 🚨 Common Issues & Troubleshooting Guide

### Technical Issues:

**1. "GridSearchCV is taking forever to run"**
- **Root cause**: Too many parameter combinations
- **Solution**: Reduce parameter grid size during experimentation
- **Teaching moment**: "In production, you balance thoroughness with computational cost. Start coarse, then refine."
- **Quick fix**: Reduce cv from 5 to 3 for faster prototyping

**2. "ValueError: Input contains NaN"**
- **Root cause**: Missing values in selected features
- **Solution**: 
  - Check for missing values: `X_train.isnull().sum()`
  - Either drop problematic features or use `SimpleImputer`
- **Teaching moment**: "Real-world data is messy. Always check for missing values!"
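For the `SimpleImputer` route, the imputers belong inside the pipeline so their medians and modes are learned from training rows only, just like the scaler. This is a minimal sketch on a tiny hypothetical frame (Ames-style column names, invented values); the per-type sub-pipelines are one reasonable layout, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame with missing values in numeric and categorical columns
X = pd.DataFrame({
    'TotalBsmtSF': [850.0, np.nan, 1200.0, 0.0, 960.0, np.nan],
    'GarageCars': [2.0, 1.0, np.nan, 0.0, 2.0, 3.0],
    'Neighborhood': ['NAmes', 'CollgCr', np.nan, 'NAmes', 'OldTown', 'CollgCr'],
})
y = np.array([150000, 130000, 180000, 90000, 160000, 210000])

# Impute first, then scale / encode, separately for each column type
numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', numeric, ['TotalBsmtSF', 'GarageCars']),
    ('cat', categorical, ['Neighborhood']),
])
model = Pipeline([('preprocessor', preprocessor),
                  ('model', RandomForestRegressor(n_estimators=50, random_state=42))])

model.fit(X, y)  # no "Input contains NaN" error: NaNs are filled before scaling/encoding
preds = model.predict(X)
print(preds.round(0))
```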

**3. "One-hot encoding creates too many columns"**
- **Root cause**: High cardinality categorical feature (e.g., Neighborhood with 25 values)
- **This is normal!** One-hot encoding increases dimensionality
- **Alternatives**:
  - Use ordinal encoding (`OrdinalEncoder`) for tree-based models (`LabelEncoder` is meant for targets, not features)
  - Use only high-frequency categories
  - Accept the dimensionality (modern ML can handle it)

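**4. "KeyError / ValueError about a missing column when applying transformations"**
- **Root cause**: Column name mismatch between specified features and actual data
- **Typical messages**: pandas raises a `KeyError` ending in "not in index" when selecting columns; `ColumnTransformer` raises `ValueError: A given column is not a column of the dataframe`
- **Solution**: Verify feature names with `print(X_train.columns.tolist())`
- **Common mistake**: Typo in feature name or column doesn't exist in dataset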

**5. "My train and test transforms don't match"**
- **Root cause**: Fit scaler/encoder on test data instead of training data
- **Solution**: Always `fit_transform` on train, `transform` on test
- **Teaching moment**: "This is data leakage! Always fit on training data only."

### Conceptual Issues:

**1. "CV scores vary dramatically between folds"**
- **Explanation**: High variance suggests:
  - Model is unstable
  - Data has outliers
  - Small dataset size
- **Solutions**:
  - Try simpler model (reduce complexity)
  - Check for outliers in data
  - Use more CV folds (cv=10)
  - Ensure consistent random_state

**2. "Test RMSE is much worse than CV RMSE"**
- **Possible causes**:
  - Overfitting (model too complex)
  - Data leakage during preprocessing
  - Unlucky train/test split
  - Test set has different distribution
- **What to check**:
  - Are transformations fit only on training data?
  - Is model complexity reasonable?
  - Are train/test distributions similar?

**3. "Pipeline syntax is confusing"**
- **Reminder**: Double underscore notation for nested parameters
- **Pattern**: `'step_name__parameter_name'`
- **Example**: To tune Ridge alpha in a pipeline:
  - Pipeline step named 'model'
  - Parameter: 'alpha'
  - Grid parameter: `'model__alpha': [0.1, 1.0, 10.0]`
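The Ridge example above, end to end, as a minimal sketch on synthetic data (an assumption; in the lab the pipeline would include the ColumnTransformer). The step is named `'model'`, so the grid key is `'model__alpha'`, and `best_params_` comes back with the same prefixed name.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

pipeline = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
param_grid = {'model__alpha': [0.1, 1.0, 10.0]}  # step 'model' + '__' + parameter 'alpha'

grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(X, y)
print(grid_search.best_params_)
```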

**4. "When do I scale vs. not scale?"**
- **Quick reference**:
  - **MUST scale**: KNN, SVM, Neural Nets, Ridge/Lasso, PCA
  - **No need to scale**: Decision Trees, Random Forest, XGBoost
  - **Doesn't matter**: Plain Linear Regression
- **Rule of thumb**: When in doubt, scale for safety (won't hurt tree models)
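Why KNN is on the "must scale" list, in one experiment: when one irrelevant column has a much larger scale, it dominates the distance computation and the useful column is effectively ignored. This is a minimal sketch on synthetic data constructed for the purpose (an assumption, not Ames).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 300
signal = rng.normal(size=n)          # drives the target, small scale
noise = rng.normal(size=n) * 1000    # irrelevant, huge scale
X = np.column_stack([signal, noise])
y = 3 * signal + rng.normal(scale=0.1, size=n)

unscaled = cross_val_score(KNeighborsRegressor(), X, y, cv=5, scoring='r2').mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), KNeighborsRegressor()),
                         X, y, cv=5, scoring='r2').mean()

print(f"KNN without scaling: R^2 = {unscaled:.2f}")  # neighbors chosen by the noise column
print(f"KNN with scaling:    R^2 = {scaled:.2f}")
```

Running the same comparison with a Random Forest would show essentially no difference, matching the quick reference above.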

### Debugging Tips:

**1. Check data shapes at each step**
```python
print(f"X shape: {X.shape}")
print(f"X_train shape: {X_train.shape}")
print(f"X_train_scaled shape: {X_train_scaled.shape}")
```

**2. Verify random states are consistent**
```python
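# ALL of these should use random_state=42
train_test_split(..., random_state=42)
DecisionTreeRegressor(random_state=42)
RandomForestRegressor(random_state=42)
# GridSearchCV has no random_state parameter; to shuffle folds reproducibly,
# pass a seeded splitter (from sklearn.model_selection import KFold) as cv
GridSearchCV(..., cv=KFold(n_splits=5, shuffle=True, random_state=42))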
```

**3. Check for missing values**
```python
print(X_train.isnull().sum())
```

**4. Verify feature names match**
```python
print("Expected features:", all_features)
print("Actual columns:", X_train.columns.tolist())
```

**5. Use verbose mode to track progress**
```python
GridSearchCV(..., verbose=1)  # Shows progress
```

### Time Management Issues:

**If running behind schedule:**
- **Priority**: Ensure Challenges 1, 2, and 4 are completed (core workflow)
- **Optional**: Challenges 5 and 6 (interpretation) can be homework if needed
- **Skip**: Detailed discussion if necessary and focus on hands-on practice

**If ahead of schedule:**
- Encourage deeper exploration of parameter grids
- Discuss additional feature engineering ideas
- Compare additional model types (Gradient Boosting, SVM, etc.)
- Explore partial dependence plots for multiple features

## 📋 Post-Lab Checklist

### For TAs - After Lab is Complete:

**Student Understanding:**
- [ ] Students understand why test set should be locked until final evaluation
- [ ] Students can implement cross-validation to compare models
- [ ] Students can use GridSearchCV to tune hyperparameters
- [ ] Students understand when to scale features and when to encode categoricals
- [ ] Students can build pipelines to prevent data leakage
- [ ] Students completed the 5-stage professional ML workflow

**Technical Completion:**
- [ ] All students successfully ran cross-validation (Challenge 1)
- [ ] Most students completed GridSearchCV tuning (Challenge 2)
- [ ] Students built at least one working pipeline (Challenge 3)
- [ ] Students completed the end-to-end workflow (Challenge 4)
- [ ] Students saved their notebooks with results

**Homework Preparation:**
- [ ] Students know to save their numerical results for homework quiz
- [ ] Reminded students to use `random_state=42` for reproducibility
- [ ] Clarified any confusion about homework requirements

### For Students - What You Should Have:

**Completed Challenges:**
- [ ] Challenge 1: Compared 3 model types using cross-validation
- [ ] Challenge 2: Tuned Random Forest with GridSearchCV
- [ ] Challenge 3: Built a preprocessing pipeline
- [ ] Challenge 4: Executed complete end-to-end workflow with final test evaluation
- [ ] Challenge 5: Computed feature importance
- [ ] Challenge 6: Created partial dependence plot

**Understanding Checkpoints:**
- [ ] Can explain why we don't touch the test set until the end
- [ ] Understand the difference between cross-validation and train/test split
- [ ] Know when to use GridSearchCV vs. manual tuning
- [ ] Can identify when scaling is necessary vs. unnecessary
- [ ] Understand how pipelines prevent data leakage
- [ ] Can interpret feature importance and partial dependence plots

**Saved Artifacts:**
- [ ] Jupyter notebook with all code and outputs
- [ ] Numerical results (CV RMSEs, test RMSE, best parameters)
- [ ] Feature importance visualizations
- [ ] Partial dependence plot

### Notes for Next Time:

**What worked well:**
- (TA fills in after lab)

**What needs improvement:**
- (TA fills in after lab)

**Concepts that need reinforcement:**
- (TA fills in after lab)

**Student questions that came up frequently:**
- (TA fills in after lab)

---

## 🎯 Lab Success Metrics

**Lab is successful if:**
- Students can execute the 5-stage professional ML workflow independently
- Students understand test set discipline and can articulate why it matters
- Students can use cross-validation and GridSearchCV effectively
- Students recognize when feature engineering is needed and can implement it
- Students can interpret model results and explain them in business terms
- Students feel confident applying these techniques to their own ML projects

**This lab represents a critical milestone**: Students are now equipped to build production-ready ML models using industry best practices!