# Week 4 Lab: Model Preparation & Evaluation

Welcome to this week's lab on **Model Preparation and Evaluation**. We'll be covering the following key concepts:
- Train/Validation/Test Splits
- Avoiding Data Leakage
- Using Pipelines for cleaner workflows
- Model Evaluation Metrics (like MSE and R²)
- Hyperparameter Tuning with GridSearchCV

---

## Week 4 Challenge Lab

This lab is designed to give you more independence in applying what you've learned. You're encouraged to explore, test hypotheses, and compare models. Use the cells and prompts below to guide your process.

### Objective
Build and evaluate a regression model using the California Housing dataset. Follow best practices for data preparation and model evaluation.


## Step 1: Load and Explore the Data
Use the California Housing dataset.

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing(as_frame=True)
df = data.frame
df.describe()

## Step 2: Train/Validation/Test Split
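Split the data into 60% train, 20% validation, and 20% test using two calls to `train_test_split`. The first call holds out 20% as the test set. The second call uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full dataset.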

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='MedHouseVal')
y = df['MedHouseVal']

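# First split: hold out 20% of all rows as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)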
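
To check the arithmetic behind the two-stage split, the sketch below repeats the same two calls on synthetic stand-in data and prints the resulting fractions. The array names (`X_demo`, `X_tr`, and so on) and the 1000-row size are assumptions chosen for illustration; they are not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the lab dataset): 1000 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(size=1000)

# First split: hold out 20% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Second split: 25% of the remaining 80% is 20% of the original data.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Fraction of the original rows that ended up in each subset.
fractions = {
    name: len(part) / len(X_demo)
    for name, part in [("train", X_tr), ("val", X_va), ("test", X_te)]
}
print(fractions)
```

Running the same check on `X_train`, `X_val`, and `X_test` from the cell above is a quick way to confirm the 60/20/20 proportions before moving on.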


## Step 3: Choose and Create a Pipeline
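Combine `StandardScaler` with `Ridge`, `Lasso`, or `KNeighborsRegressor`. You can also add your own steps. Fit the pipeline on the training set only, so that the scaler's mean and variance are never computed from validation or test data.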

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor

# Choose one model and build your pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge())
])

pipeline.fit(X_train, y_train)
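
As one possible way to "add your own steps", the sketch below inserts a `PolynomialFeatures` step between the scaler and the regressor. It runs on synthetic data from `make_regression` rather than the lab dataset, and the step name `'poly'`, the degree, and the `_demo` variable names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption): 200 rows, 4 features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

# Same scaler + Ridge layout as above, with one extra feature-engineering step.
poly_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('regressor', Ridge(alpha=1.0)),
])
poly_pipeline.fit(X_demo, y_demo)

# Number of columns the regressor actually sees after the expansion.
n_expanded = poly_pipeline.named_steps['poly'].n_output_features_
print("Features seen by the regressor:", n_expanded)
```

Because the extra step lives inside the pipeline, it is fitted and applied together with the scaler and the model, which keeps the workflow the same when calling `fit` and `predict`.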

## Step 4: Evaluate Your Model
Use MSE, R², and any other metric you find useful to evaluate the model on the validation set.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = pipeline.predict(X_val)
print("Validation MSE:", mean_squared_error(y_val, y_pred))
print("Validation R²:", r2_score(y_val, y_pred))
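
To make the two metrics concrete, the sketch below computes MSE and R² by hand on four made-up values and compares them with the scikit-learn functions. It also computes RMSE and MAE as examples of "any other metric". The numbers in `y_true` and `y_hat` are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions (assumption, for illustration only).
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true - y_hat

# MSE: mean of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is in the same units as the target; MAE is less sensitive to large errors.
rmse = np.sqrt(mse_manual)
mae = mean_absolute_error(y_true, y_hat)

print("MSE  manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_hat))
print("R^2  manual vs sklearn:", r2_manual, r2_score(y_true, y_hat))
print("RMSE:", rmse, "MAE:", mae)
```

RMSE is often easier to interpret than MSE because it is expressed in the units of the target, which for `MedHouseVal` is hundreds of thousands of dollars.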

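## Step 5: Cross-Validation & GridSearchCV
Tune the Ridge regularization strength `alpha` with 5-fold cross-validation on the training set. Because scikit-learn maximizes scores, MSE is reported negated (`neg_mean_squared_error`), so values closer to zero are better. Passing the whole pipeline to `GridSearchCV` means the scaler is refit inside each fold, which avoids leaking information from the held-out fold.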

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'regressor__alpha': [0.01, 0.1, 1.0, 10.0]
}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Score (neg MSE):", grid.best_score_)
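
Beyond the single best score, `cv_results_` holds the score for every candidate. The sketch below, run on synthetic data with hypothetical `demo_` names, loads it into a DataFrame and flips the sign so the column reads as a plain MSE.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lab dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=15.0, random_state=0
)

demo_pipeline = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])
demo_grid = GridSearchCV(
    demo_pipeline,
    param_grid={'regressor__alpha': [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring='neg_mean_squared_error',
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate; negate the mean score to recover a positive MSE.
results = pd.DataFrame(demo_grid.cv_results_)
results['mean_cv_mse'] = -results['mean_test_score']
summary = results[
    ['param_regressor__alpha', 'mean_cv_mse', 'std_test_score', 'rank_test_score']
]
print(summary.sort_values('rank_test_score'))
```

Looking at the spread across folds (`std_test_score`) alongside the mean helps judge whether the differences between `alpha` values are meaningful or within noise.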

## Step 6: Final Evaluation
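Evaluate the best model on the test set. Do this only once, after all tuning decisions are final. With the default `refit=True`, `grid.best_estimator_` has already been refit on the full training set using the best parameters.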

In [None]:
final_model = grid.best_estimator_
y_test_pred = final_model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, y_test_pred))
print("Test R²:", r2_score(y_test, y_test_pred))
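
A common variation, not required by this lab, is to refit the selected configuration on the training and validation data combined before the final test evaluation, since the validation set is no longer needed for model selection. The sketch below shows the idea on synthetic data; `selected` is a hypothetical stand-in for `grid.best_estimator_`, and all `_demo`-style names are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption), split 60/20/20 as in Step 2.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0
)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Stand-in for grid.best_estimator_: a pipeline fitted on the training split only.
selected = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge(alpha=1.0)),
]).fit(X_tr, y_tr)

# clone() copies the hyperparameters but discards the fitted state.
refit_model = clone(selected)
refit_model.fit(np.concatenate([X_tr, X_va]), np.concatenate([y_tr, y_va]))

test_mse = mean_squared_error(y_te, refit_model.predict(X_te))
print("Test MSE after refitting on train + validation:", test_mse)
```

Whether the extra data helps depends on the dataset; either way, the test set should still be used only once.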

## Step 7: Reflection
- What worked well?
- What didn't?
- How would you improve this pipeline in a production setting?
- Try another model and compare!
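
One way to try another model and compare is to let `GridSearchCV` swap the `regressor` step itself by passing a list of parameter grids. The sketch below does this on synthetic data; the candidate values and the `compare_` names are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the lab dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=15.0, random_state=0
)

# The 'regressor' step is a placeholder that each grid below replaces.
compare_pipeline = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

# One dict per model family: the estimator itself plus its own hyperparameters.
compare_grid_params = [
    {'regressor': [Ridge()], 'regressor__alpha': [0.1, 1.0, 10.0]},
    {'regressor': [Lasso(max_iter=10000)], 'regressor__alpha': [0.01, 0.1, 1.0]},
    {'regressor': [KNeighborsRegressor()], 'regressor__n_neighbors': [3, 5, 10]},
]

compare_grid = GridSearchCV(
    compare_pipeline,
    param_grid=compare_grid_params,
    cv=5,
    scoring='neg_mean_squared_error',
)
compare_grid.fit(X_demo, y_demo)

best_regressor = compare_grid.best_estimator_.named_steps['regressor']
print("Best model family:", type(best_regressor).__name__)
print("Best CV MSE:", -compare_grid.best_score_)
```

Because every candidate is scored with the same folds and the same metric, the comparison stays fair across model families.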