---
title: "Practice Activity 7.1: Cross-Validation and Tuning"
format: 
  html:
    theme: lux
---

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer

# Load the dataset (assuming the CSV file path is "/content/AmesHousing (1).csv")
ames = pd.read_csv("/content/AmesHousing (1).csv")

# Define features and target
X = ames[['Gr Liv Area', 'TotRms AbvGrd', 'Bldg Type']]
y = ames['SalePrice']

# Column transformer setup
ct_poly = ColumnTransformer(
    transformers=[
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("polynomial", PolynomialFeatures(), ["Gr Liv Area", "TotRms AbvGrd"])
    ],
    remainder="drop"
)

# Pipeline with PolynomialFeatures and LinearRegression
lr_pipeline_poly = Pipeline([
    ("preprocessing", ct_poly),
    ("linear_regression", LinearRegression())
]).set_output(transform="pandas")

# Degree range for tuning
degrees = {'preprocessing__polynomial__degree': np.arange(1, 10)}  # Degrees 1 to 9

# Grid search with cross-validation
gscv = GridSearchCV(lr_pipeline_poly, degrees, cv=5, scoring='r2', n_jobs=-1)
gscv_fitted = gscv.fit(X, y)

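# Extract the mean cross-validated R-squared for each degree.
# Reuse the search grid so the degree column always matches what was searched.
cv_results = pd.DataFrame({
    "degrees": degrees['preprocessing__polynomial__degree'],
    "scores": gscv_fitted.cv_results_['mean_test_score']
})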

# Display the best model and corresponding cross-validated score
best_model = gscv_fitted.best_estimator_
best_score = gscv_fitted.best_score_

print("Best Model:", best_model)
print("Best R-squared Score:", best_score)
print("\nCross-validated scores for each degree:\n", cv_results)


Best Model: Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('dummify',
                                                  OneHotEncoder(sparse_output=False),
                                                  ['Bldg Type']),
                                                 ('polynomial',
                                                  PolynomialFeatures(degree=3),
                                                  ['Gr Liv Area',
                                                   'TotRms AbvGrd'])])),
                ('linear_regression', LinearRegression())])
Best R-squared Score: 0.5410026448115971

Cross-validated scores for each degree:
    degrees      scores
0        1    0.532882
1        2    0.531259
2        3    0.541003
3        4    0.530984
4        5    0.399898
5        6   -1.410547
6        7  -20.793747
7        8 -132.190776
8        9 -568.868517


# Question 1
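The best model found by GridSearchCV is a linear regression on degree-3 polynomial features of Gr Liv Area and TotRms AbvGrd, with a mean cross-validated R-squared of about 0.541. Its margin is small: degrees 1 through 4 all score within about 0.01 of each other. The model includes:

- One-hot encoding for Bldg Type.
- A polynomial transformation with a degree of 3 for Gr Liv Area and TotRms AbvGrd.
- No scaling step: the printed pipeline contains no StandardScaler.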

# Question 2
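An exhaustive grid search fits every candidate once per fold: 9 degrees times 5 folds is 45 fits here, plus one refit of the best model, and the cost grows quickly with larger datasets, higher polynomial degrees, and additional hyperparameters. High degrees also overfit, which means the model does not work well on new data: the cross-validated R-squared drops to about 0.40 at degree 5 and is negative from degree 6 onward, so most of that computation is spent on candidates that were never competitive. To simplify, we could start with a smaller range of degrees (like 1 to 4) and only add complexity if the scores are still improving at the top of the range. Another approach is using RandomizedSearchCV to sample a fixed number of candidates from the range instead of testing every option, saving time while usually still finding good results.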