# Week 4 Lab: Model Preparation & Evaluation

Welcome to this week's lab on **Model Preparation and Evaluation**. We'll be covering the following key concepts:
- Train/Validation/Test Splits
- Avoiding Data Leakage
- Using Pipelines for cleaner workflows
- Model Evaluation Metrics (like MSE and R²)
- Hyperparameter Tuning with GridSearchCV

---
## Part 1: Splitting Data
Load the California Housing dataset and create Train/Validation/Test splits.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

data = fetch_california_housing(as_frame=True)
df = data.frame

X = df.drop(columns='MedHouseVal')
y = df['MedHouseVal']

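# TODO: Split into train/val/test (60/20/20)
# Step 1: hold out 20% of all rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: take 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2 of all rows).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)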
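
To check that a two-step split really produces 60/20/20, it can help to count the rows in each part. The sketch below uses a small synthetic array in place of the housing data (an assumption, so that it runs without downloading anything); the `*_demo` and `X_tr`/`X_va`/`X_te` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 3 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(size=1000)

# Step 1: hold out 20% of all rows as the test set.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Step 2: take 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)

# Report each part as a fraction of the full dataset.
n_total = len(X_demo)
fractions = {
    "train": len(X_tr) / n_total,
    "val": len(X_va) / n_total,
    "test": len(X_te) / n_total,
}
print(fractions)
```

The same check works on the lab's `X_train`, `X_val`, and `X_test`, although the fractions may be off by a row or so when the dataset size is not evenly divisible.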


## Part 2: Spot the Data Leakage
Look at this feature set. Can you spot the leaky column?

In [None]:
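# Simulate data leakage
# Work on a copy so the original df is not modified for later cells.
df_leaky = df.copy()
df_leaky['LeakyFeature'] = df_leaky['MedHouseVal'] * 12  # Simulated leakage
X_leaky = df_leaky.drop(columns='MedHouseVal')
print(X_leaky.head())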

**Q: What makes `LeakyFeature` a source of leakage? What could go wrong if it's included in training?**
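
One way to investigate a question like this is to look at how strongly each feature correlates with the target. The sketch below does that on a small synthetic frame; the column names and the 0.99 threshold are hypothetical choices for illustration, not a general rule, and a high correlation is only a hint that a column deserves a closer look.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): two ordinary features and one
# column that is an exact multiple of the target.
rng = np.random.default_rng(0)
n_rows = 500
target = pd.Series(rng.normal(size=n_rows), name="target")
features = pd.DataFrame({
    "honest_a": rng.normal(size=n_rows),
    "honest_b": 0.5 * target + rng.normal(size=n_rows),
    "leaky": target * 12,
})

# Absolute Pearson correlation of every feature with the target.
abs_corr = features.corrwith(target).abs().sort_values(ascending=False)
print(abs_corr)

# Flag columns that track the target almost perfectly (hypothetical threshold).
suspects = abs_corr[abs_corr > 0.99].index.tolist()
print("Suspicious columns:", suspects)
```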

## Part 3: Pipeline + Ridge Regression
Build a pipeline to standardize the features and apply Ridge regression.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# TODO: Build and fit the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

pipeline.fit(X_train, y_train)
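
A pipeline helps avoid leakage because the scaler's statistics are learned in `fit` and only reused in `predict`. The sketch below illustrates this on synthetic data (an assumption; the `pipe`, `X_tr`, and `X_new` names are illustrative): the stored means match the training data and do not change when the pipeline predicts on new, deliberately shifted rows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption) and new rows with a very different mean.
rng = np.random.default_rng(1)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
y_tr = X_tr @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(loc=50.0, scale=2.0, size=(50, 2))

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])
pipe.fit(X_tr, y_tr)

# The scaler inside the pipeline learned its means from X_tr only.
scaler = pipe.named_steps["scaler"]
means_after_fit = scaler.mean_.copy()
matches_training_mean = np.allclose(scaler.mean_, X_tr.mean(axis=0))

# Predicting transforms X_new with the stored statistics; nothing is refit.
pipe.predict(X_new)
unchanged_after_predict = np.array_equal(scaler.mean_, means_after_fit)

print("Means match training data:", matches_training_mean)
print("Means unchanged after predict:", unchanged_after_predict)
```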


## Part 4: Evaluate Model on Validation Set
Score the fitted pipeline on the validation set using MSE and R².

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = pipeline.predict(X_val)
print("Validation MSE:", mean_squared_error(y_val, y_pred))
print("Validation R²:", r2_score(y_val, y_pred))
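
To see what these two metrics measure, they can be computed by hand: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below uses four made-up values (an assumption, for illustration only) and compares the manual results with scikit-learn's.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions (assumption, for illustration only).
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: average squared residual.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("Manual MSE:", mse_manual, "| sklearn:", mean_squared_error(y_true, y_hat))
print("Manual R²:", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
```

An R² of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.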

## Part 5: Cross-Validation
Estimate how stable the pipeline's R² is across five folds of the training set.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
print("Cross-validated R² scores:", scores)
print("Mean R²:", scores.mean())
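
With an integer `cv` and a regressor, `cross_val_score` uses `KFold` without shuffling, so the folds follow the row order. If the rows might be ordered in some meaningful way, passing an explicit shuffled `KFold` is one option. The sketch below shows this on synthetic data from `make_regression` (an assumption; the `demo_*` names are illustrative) and also reports the spread across folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption), so the sketch runs offline.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Explicit splitter: 5 folds, rows shuffled once with a fixed seed.
demo_cv = KFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=demo_cv, scoring="r2")

print("R² per fold:", np.round(demo_scores, 3))
print(f"Mean R²: {demo_scores.mean():.3f} (std {demo_scores.std():.3f})")
```

Because the whole pipeline is passed to `cross_val_score`, the scaler is refit on the training portion of each fold, which keeps the held-out fold out of the scaling statistics.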

## Part 6: GridSearchCV
Try optimizing the Ridge regularization parameter.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
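# scikit-learn maximizes scores, so MSE is reported negated; flip the sign to read it as an MSE.
print("Best CV Score (neg MSE):", grid.best_score_)
print("Best CV MSE:", -grid.best_score_)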
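
The test split is usually kept for one final check after tuning is finished. The sketch below shows that last step on synthetic data (an assumption; the `demo_*` names are illustrative). With the default `refit=True`, `GridSearchCV` refits the best pipeline on the full training data, and calling `predict` on the search object uses that refit model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption), split into train and held-out test.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_grid.fit(X_tr, y_tr)  # tuning only ever sees the training rows

# Flip the sign to read the best cross-validated score as an MSE.
best_cv_mse = -demo_grid.best_score_

# Final, one-time evaluation on the untouched test rows.
y_te_pred = demo_grid.predict(X_te)
test_mse = mean_squared_error(y_te, y_te_pred)
test_r2 = r2_score(y_te, y_te_pred)

print("Best alpha:", demo_grid.best_params_["ridge__alpha"])
print("Best CV MSE:", best_cv_mse)
print("Test MSE:", test_mse)
print("Test R²:", test_r2)
```

In the lab itself, the equivalent step would be scoring `grid` on `X_test` and `y_test` once, after all model choices have been made.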