# Week 4 Lab: Model Preparation & Evaluation

Welcome to this week's lab on **Model Preparation and Evaluation**. We'll be covering the following key concepts:
- Train/Validation/Test Splits
- Avoiding Data Leakage
- Using Pipelines for cleaner workflows
- Model Evaluation Metrics (like MSE and R²)
- Hyperparameter Tuning with GridSearchCV

---
## Part 1: Train/Validation/Test Split
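We'll use the California Housing dataset (20,640 rows, 8 numeric features). The target, `MedHouseVal`, is the median house value of a district in units of $100,000.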

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

# Load data
data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

Let's split the data into features and target.
- Features are the columns we use to predict
- The target is what we want to predict

In [None]:
X = df.drop(columns='MedHouseVal')
y = df['MedHouseVal']

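Now let's split the data into train, validation, and test sets using `train_test_split` from sklearn.
We'll use 60% for training, 20% for validation, and 20% for testing. Since `train_test_split` produces only two parts per call, we call it twice: first to hold out the test set, then to split the remainder into train and validation.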

In [None]:
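# Step 1: hold out 20% of the full data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: split the remaining 80% into train and validation (0.25 x 0.8 = 0.2 of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)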
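To double-check the 60/20/20 proportions, the two-step split can be wrapped in a small helper and its output sizes inspected. This is only a sketch: `three_way_split` is a hypothetical helper (not part of sklearn), and it runs on synthetic stand-in data rather than the housing data so that it works offline.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.2, test_frac=0.2, random_state=42):
    """Hypothetical helper: train/validation/test split via two train_test_split calls."""
    n = len(X)
    # An integer test_size is treated as an absolute row count,
    # which sidesteps fraction arithmetic in the second split.
    n_test = round(test_frac * n)
    n_val = round(val_frac * n)
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=n_val, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (assumption): 1,000 rows, 8 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = rng.normal(size=1000)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
split_sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
print(split_sizes)
```

On the lab data, the same check is simply `len(X_train)`, `len(X_val)`, and `len(X_test)` divided by `len(X)`.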

## Part 2: Pipelines
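Let's build a simple pipeline to standardize the features and apply Linear Regression. Because the scaler lives inside the pipeline, calling `fit` learns the scaling statistics from the training data only, which keeps validation and test information from leaking into preprocessing.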

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

pipeline.fit(X_train, y_train)
print("Model trained!")
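One way to see how a pipeline helps avoid data leakage is to inspect the fitted scaler and confirm that its statistics come from the training rows alone. The sketch below assumes synthetic data from `make_regression` as a stand-in for the housing data; the names `demo_pipeline`, `X_tr`, and `X_va` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), so the example runs offline
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler's learned means should equal the training-set column means,
# not the means of the full dataset (which would include validation rows).
fitted_scaler = demo_pipeline.named_steps['scaler']
uses_train_stats = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
uses_all_stats = np.allclose(fitted_scaler.mean_, X_demo.mean(axis=0))
print("Scaler means match training data:", uses_train_stats)
print("Scaler means match full data:", uses_all_stats)
```

The leaky alternative would be calling `StandardScaler().fit(X)` on the full dataset before splitting; the pipeline makes that mistake harder to commit.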

## Part 3: Evaluation
Now we'll check how well our model performs on the validation set using R² and MSE. Lower MSE is better; R² closer to 1 is better.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = pipeline.predict(X_val)
print("MSE:", mean_squared_error(y_val, y_pred))
print("R²:", r2_score(y_val, y_pred))
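To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy and compares against sklearn. The numbers in `y_true` and `y_hat` are made up for illustration and are not results from the lab.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration (assumption), not lab results
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true - y_hat

# MSE: the average squared residual
mse_manual = np.mean(residuals ** 2)
# RMSE: square root of MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE  manual vs sklearn:", mse_manual, mean_squared_error(y_true, y_hat))
print("R²   manual vs sklearn:", r2_manual, r2_score(y_true, y_hat))
print("RMSE:", rmse_manual)
```

An R² of 0 corresponds to always predicting the mean of `y_true`, and it can go negative for a model that does worse than that.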

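## Part 4: Optional - Try GridSearchCV
`GridSearchCV` fits the pipeline once per candidate value in a parameter grid, scores each candidate with k-fold cross-validation on the training set, and keeps the best one. Here we tune `alpha`, the regularization strength of Ridge regression.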

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

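# Use a new name so the Linear Regression pipeline from Part 2 is not overwritten
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

# '<step name>__<parameter>' targets a parameter of a pipeline step
params = {'ridge__alpha': [0.1, 1.0, 10.0]}
grid = GridSearchCV(ridge_pipeline, param_grid=params, cv=5)
grid.fit(X_train, y_train)

print("Best alpha:", grid.best_params_['ridge__alpha'])
# For regressors the default score is R², averaged over the 5 CV folds
print("Best mean CV score (R²):", grid.best_score_)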
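The test set held out in Part 1 has not been used yet. A natural last step, sketched below, is to look at the per-candidate cross-validation scores and then score the tuned model once on the test set. The sketch uses synthetic stand-in data and the illustrative name `search` so it runs on its own; with the lab variables it would be `grid.predict(X_test)` compared against `y_test`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), so the example runs offline
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())]),
    param_grid={'ridge__alpha': [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Mean cross-validated R² for each candidate alpha
for alpha, score in zip(search.cv_results_['param_ridge__alpha'],
                        search.cv_results_['mean_test_score']):
    print(f"alpha={alpha}: mean CV R² = {score:.4f}")

# With the default refit=True, the best pipeline is refit on all of X_tr,
# so the search object can predict directly.
y_te_pred = search.predict(X_te)
test_mse = mean_squared_error(y_te, y_te_pred)
test_r2 = r2_score(y_te, y_te_pred)
print("Test MSE:", test_mse)
print("Test R²:", test_r2)
```

Scoring on the test set only once, after all tuning is finished, keeps it an honest estimate of performance on unseen data.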