Lab 3: Regularized Regression Pipeline

Dataset: Ames Housing (House Prices)

Goal: Predict the SalePrice of a home using physical attributes while preventing overfitting through regularization.

1. Objective

Build an end-to-end regression pipeline. Your pipeline should scale the numeric features, encode the neighborhood information, and use cross-validation to decide which regularization method (Lasso or Ridge) better predicts home values. Then compare the result against a non-regularized model: which performs better?

2. Key Concepts and Terms

**High Cardinality**: The `Neighborhood` column has many unique values. We must encode this carefully.

**The Alpha ($\alpha$) Parameter**: In Scikit-Learn's Ridge and Lasso, $\alpha$ is the regularization strength.

$\alpha = 0$: Standard Linear Regression.

$\alpha > 0$: Increasing penalty on model complexity.
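To make the effect of $\alpha$ concrete, here is a minimal sketch on synthetic data (not the Ames dataset; `X_demo`, `y_demo`, and the chosen weights are made up for illustration). It fits `Ridge` at several values of $\alpha$ and records the size of the coefficient vector, which should shrink as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (hypothetical, not the Ames dataset): 3 features, known weights.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

alphas = [0.1, 1, 10, 100, 1000]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm of the coefficients measures how "large" the fitted weights are.
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>6}: ||coef|| = {coef_norms[-1]:.3f}")
```

A larger $\alpha$ trades some fit on the training data for smaller, more stable coefficients.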

**Data Leakage**: A critical error in which information from the test set (such as its mean house price) "leaks" into the training process.
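As a small illustration of leakage through preprocessing, the sketch below uses made-up values standing in for a living-area column (all names here are hypothetical). It contrasts a scaler fitted on every row with one fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for a single numeric column.
rng = np.random.default_rng(42)
X_all = rng.normal(loc=1500, scale=500, size=(100, 1))

X_tr, X_te = train_test_split(X_all, test_size=0.2, random_state=42)

# Leaky: the scaler's mean and variance are computed from all rows,
# including the rows that will later be used for testing.
leaky_scaler = StandardScaler().fit(X_all)

# Safe: statistics come from the training rows only and are reused on the test rows.
safe_scaler = StandardScaler().fit(X_tr)
X_te_scaled = safe_scaler.transform(X_te)

print("mean seen by leaky scaler:", leaky_scaler.mean_[0])
print("mean seen by safe scaler: ", safe_scaler.mean_[0])
```

Placing the scaler inside a `Pipeline` gives the safe behavior automatically, because the pipeline fits each step only on the data passed to `fit`.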

In [2]:
#Necessary imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import fetch_openml

**Task 1: Data Preparation**

Load the dataset and separate the target (`SalePrice`) from the features. Perform an 80/20 split.

In [3]:
housing = fetch_openml(name="house_prices", as_frame=True, parser='auto')
X = housing.data[['GrLivArea', 'LotArea', 'OverallQual', 'Neighborhood', 'HouseStyle']]
y = housing.target

# TODO: Split the data (80/20 split, random_state=42)

**Task 2: The Preprocessing Recipe**

Create a `ColumnTransformer` that applies different logic to different columns:

Numeric Features (`GrLivArea`, `LotArea`, `OverallQual`): These are on very different scales. Apply `StandardScaler`.

Categorical Features (`Neighborhood`, `HouseStyle`): Use `OneHotEncoder`. Use `handle_unknown='ignore'` because a test set might have a neighborhood the training set didn't see.
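The following sketch shows what `handle_unknown='ignore'` does in practice. The neighborhood labels are used purely for illustration, and the tiny arrays are hypothetical: a category that was absent during `fit` is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative labels only; "Blueste" never appears in the training rows.
train_hoods = np.array([["NAmes"], ["OldTown"], ["NAmes"]], dtype=object)
test_hoods = np.array([["OldTown"], ["Blueste"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_hoods)

# The encoder returns a sparse matrix by default; convert it for printing.
encoded = encoder.transform(test_hoods).toarray()
print(encoder.categories_[0])
print(encoded)
```

Without `handle_unknown='ignore'`, the call to `transform` would raise an error on the unseen label.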

In [4]:
numeric_features = ['GrLivArea', 'LotArea', 'OverallQual']
categorical_features = ['Neighborhood', 'HouseStyle']

# TODO: Create a numeric transformer (Imputer + Scaler)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', ...),  # YOUR CODE HERE: replace ... with the scaler
])

# TODO: Create a categorical transformer (Imputer + OneHotEncoder)
# Hint: remember drop='first' to avoid the dummy variable trap!
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', ...),  # YOUR CODE HERE: replace ... with the encoder
])

**Task 3: Pipeline**

Construct a `Pipeline` that connects your **preprocessor** to a regressor. You will test both **Ridge()** and **Lasso()**.

In [4]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

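# TODO: Create a pipeline that joins the preprocessor with Ridge regression
pipeline = Pipeline(steps=[
    ('preprocessor', ...),  # YOUR CODE HERE: replace ... with the preprocessor
    ('regressor', ...)      # YOUR CODE HERE: replace ... with the regressor
])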

**Task 4: Hyperparameter Tuning**

Use `GridSearchCV` to find the best `alpha` for your model. Test values across several orders of magnitude (e.g., 0.1, 1, 10, 100).
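When the estimator is a `Pipeline`, grid keys follow the pattern `<step name>__<parameter>`. The sketch below demonstrates this on synthetic data with hypothetical step names (`scale`, `model`) that differ from the ones in this lab, so the key must be adapted to your own pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical, not the Ames dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

# Step names are arbitrary labels chosen for this demo.
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", Ridge()),
])

# Key format: "<step name>__<parameter name>" (two underscores).
demo_grid = {"model__alpha": [0.1, 1, 10, 100]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="r2")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(f"mean CV R^2 at best alpha: {demo_search.best_score_:.4f}")
```

Note that `best_score_` is the mean cross-validated score on the training folds, which is a different number from the score on a held-out test set.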

In [4]:
param_grid = {
    # YOUR CODE HERE
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

**Task 5: Evaluation**

Print the best value of `alpha` and the corresponding $R^2$ score on the test set.

In [None]:
print(f"Best Alpha: {grid_search.best_params_}")
print(f"Test R^2 Score: {grid_search.score(X_test, y_test):.4f}")

3. Questions to Consider

*   Compare the results when the regressor is changed to `Lasso()`. Does the $R^2$ score improve?
*   Lasso can set coefficients exactly to zero. Why might that be good for datasets like this one, with 70+ features?
*   If we had not used `Pipeline` and had scaled our data before the train-test split, what implications would this have for the model's accuracy score?
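As a starting point for the Lasso question, here is a hedged sketch on synthetic data in which only 3 of 20 features actually influence the target. The data, the coefficient values, and `alpha=0.5` are all assumptions made for illustration. It counts how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (hypothetical): 20 features, only the first 3 matter.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -4.0, 3.0]
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=300)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each fitted model.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("zero coefficients, Lasso:", n_zero_lasso)
print("zero coefficients, Ridge:", n_zero_ridge)
```

Ridge shrinks coefficients toward zero but rarely makes them exactly zero, whereas Lasso can drop uninformative features entirely, which acts as a form of feature selection.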

