# Chapter 13 Practice Activities
## Complete the tasks below.
---

In [58]:
# Load Package Dependencies
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Set seed for reproducibility
np.random.seed(123)
random.seed(123)

In [59]:
# Load the Data
df_ames = pd.read_csv("Data/AmesHousing.csv")
df_ames.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [60]:
X = df_ames.drop("SalePrice", axis = 1)
y = df_ames["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y)
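
The split above depends on the global NumPy seed set earlier, so re-running only this cell produces a different partition. A minimal sketch, assuming one prefers to pin the split directly, passes `random_state` to `train_test_split()`. The toy arrays and the value 123 are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the Ames features and prices (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Passing random_state makes the split independent of the global NumPy seed
split_a = train_test_split(X_toy, y_toy, random_state=123)
split_b = train_test_split(X_toy, y_toy, random_state=123)

# Both calls return identical train/test partitions
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```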

# 13.2.5 P.A. 

## Consider four possible models for predicting house prices:

- Using only the size and number of rooms.
- Using size, number of rooms, and building type.
- Using size and building type, and their interaction.
- Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

*Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all four models.*
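
For reference, root mean squared error is the square root of the average squared residual. The small sketch below uses made-up prices (not taken from the Ames data) to show that the formula written out by hand agrees with `np.sqrt(mean_squared_error(...))`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices for illustration only
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 280000.0, 250000.0])

# RMSE by hand: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```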

In [61]:
# Model 1
ct1 = ColumnTransformer(
  [
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

model1_pipeline = Pipeline(
  [("preprocessing", ct1),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")

model1_pipeline

In [62]:
# Model 2
ct2 = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(drop = "first", sparse_output=False), ["Bldg Type"]),
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

model2_pipeline = Pipeline(
  [("preprocessing", ct2),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")

model2_pipeline

In [63]:
# Model 3 - Step 1: Preprocessing with dummification
ct3 = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(drop = "first", sparse_output=False), ["Bldg Type"]),
    ],
    remainder="passthrough"
).set_output(transform="pandas")

# Model 3 - Step 2: Create interaction terms
ct3_inter = ColumnTransformer(
    [
        ("interaction", PolynomialFeatures(interaction_only=True, include_bias=False), [
            "remainder__Gr Liv Area",
            "dummify__Bldg Type_2fmCon",
            "dummify__Bldg Type_Duplex",
            "dummify__Bldg Type_Twnhs",
            "dummify__Bldg Type_TwnhsE"
        ]),
    ],
    remainder="drop"
).set_output(transform="pandas")

# Model 3 - Final Pipeline
model3_pipeline = Pipeline(
    [
        ("preprocessing", ct3),
        ("interaction_terms", ct3_inter),
        ("linear_regression", LinearRegression())
    ]
).set_output(transform="pandas")

model3_pipeline

In [64]:
# Model 4 - Step 1: One-hot encode `Bldg Type`; standardize `Gr Liv Area` and `TotRms AbvGrd`
ct4 = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(drop = "first", sparse_output=False), ["Bldg Type"]),
        ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
    ],
    remainder="drop"  # Drop every column not listed above
).set_output(transform="pandas")


# Step 2: Generate polynomial features and pass through dummified `Bldg Type` columns
polynomial_features = ColumnTransformer(
    [
        ("gr_liv_area_poly", PolynomialFeatures(degree=5, include_bias=False), ["standardize__Gr Liv Area"]),
        ("tot_rms_abv_grd_poly", PolynomialFeatures(degree=5, include_bias=False), ["standardize__TotRms AbvGrd"]),
    ],
    remainder="passthrough"
).set_output(transform="pandas")

# Model 4 - Final Pipeline
model4_pipeline = Pipeline(
    [
        ("preprocess", ct4),               # One-hot encode `Bldg Type` and standardize the numeric columns
        ("polynomial_features", polynomial_features),  # Expand each numeric column to degree 5; pass the dummies through
        ("linear_regression", LinearRegression())  # Fit linear regression model
    ]
)

In [65]:
# Step 1: Fit and transform with ct4 to see the intermediate column names
ct4.fit(X_train)
X_train_dummified = ct4.transform(X_train)

# Step 2: Fit the polynomial step on the transformed data and list the resulting columns
print(polynomial_features.fit_transform(X_train_dummified).columns)

Index(['gr_liv_area_poly__standardize__Gr Liv Area',
       'gr_liv_area_poly__standardize__Gr Liv Area^2',
       'gr_liv_area_poly__standardize__Gr Liv Area^3',
       'gr_liv_area_poly__standardize__Gr Liv Area^4',
       'gr_liv_area_poly__standardize__Gr Liv Area^5',
       'tot_rms_abv_grd_poly__standardize__TotRms AbvGrd',
       'tot_rms_abv_grd_poly__standardize__TotRms AbvGrd^2',
       'tot_rms_abv_grd_poly__standardize__TotRms AbvGrd^3',
       'tot_rms_abv_grd_poly__standardize__TotRms AbvGrd^4',
       'tot_rms_abv_grd_poly__standardize__TotRms AbvGrd^5',
       'remainder__dummify__Bldg Type_2fmCon',
       'remainder__dummify__Bldg Type_Duplex',
       'remainder__dummify__Bldg Type_Twnhs',
       'remainder__dummify__Bldg Type_TwnhsE'],
      dtype='object')


In [66]:
# List of models to iterate over
pipelines = [model1_pipeline, model2_pipeline, model3_pipeline, model4_pipeline]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]
rmse_results = []

# Calculate RMSE for each model
for name, pipeline in zip(model_names, pipelines):
    # Fit the pipeline to the training data
    pipeline.fit(X_train, y_train)
    
    # Get predictions on the test set
    y_pred = pipeline.predict(X_test)
    
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    rmse_results.append((name, rmse))

# Display RMSE results in a DataFrame for easy comparison
pd.DataFrame(rmse_results, columns=["Model", "RMSE"])

Unnamed: 0,Model,RMSE
0,Model 1,50591.32327
1,Model 2,49047.620949
2,Model 3,48417.09863
3,Model 4,49092.350506


Model 3 achieved the lowest test RMSE (48,417). It uses size, building type, and their interaction, which suggests that the effect of size on price differs by building type. Model 2 (size, number of rooms, and building type, with no interactions) followed at 49,048. Model 4, which adds 5-degree polynomial terms for size and number of rooms to the Model 2 predictors, scored 49,092, marginally worse than Model 2, so the extra non-linear terms did not improve test performance here. Model 1, the simplest model using only size and number of rooms, had the highest RMSE (50,591), showing that adding building type helps. The gap between the best and worst model is only about 4% of the RMSE and comes from a single train/test split, so the ranking should be read as suggestive rather than conclusive.
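
One way to check whether a ranking like this is stable is to score the pipelines with k-fold cross-validation rather than one split. The sketch below does this for two Model 1 and Model 2 style pipelines on synthetic data. The data, the helper `build_pipeline`, and the price formula are all invented for illustration, so the numbers it prints say nothing about the Ames results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Ames columns (illustrative only, not real prices)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Gr Liv Area": rng.uniform(600, 3000, n),
    "TotRms AbvGrd": rng.integers(3, 10, n),
    "Bldg Type": rng.choice(["1Fam", "Duplex", "Twnhs"], n),
})
price = 50000 + 100 * toy["Gr Liv Area"] + rng.normal(0, 20000, n)

def build_pipeline(use_bldg_type):
    """Hypothetical helper: Model 1 style (False) or Model 2 style (True)."""
    steps = [("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])]
    if use_bldg_type:
        steps.append(("dummify", OneHotEncoder(drop="first"), ["Bldg Type"]))
    ct = ColumnTransformer(steps, remainder="drop")
    return Pipeline([("preprocessing", ct), ("linear_regression", LinearRegression())])

cv_rmse = {}
for name, flag in [("size + rooms", False), ("size + rooms + type", True)]:
    # scikit-learn reports negated RMSE so that larger is better; flip the sign
    scores = -cross_val_score(build_pipeline(flag), toy, price, cv=5,
                              scoring="neg_root_mean_squared_error")
    cv_rmse[name] = (scores.mean(), scores.std())

print(cv_rmse)
```

Comparing the mean RMSE across folds, together with its spread, gives a sense of whether a difference of a few hundred dollars between models is larger than the fold-to-fold noise.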

# 13.3.3 P.A. 

## Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

*Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.*


### Q1: Which model performed the best?

In [67]:
X = df_ames[["Gr Liv Area", "TotRms AbvGrd", "Bldg Type"]]
y = df_ames["SalePrice"]

# Column transformer for preprocessing
ct = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(handle_unknown='ignore'), ["Bldg Type"]),
        ("scaler", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
    ]
)

# Pipeline for regression with polynomial features
lr_pipeline = Pipeline(
    [
        ("preprocessing", ct),
        ("polynomial", PolynomialFeatures()),
        ("linear_regression", LinearRegression())
    ]
)

# GridSearchCV parameter grid
param_grid = {
    "polynomial__degree": np.arange(1, 11),  # Degrees 1 through 10
    "preprocessing__scaler__with_mean": [True, False]  # Control centering during scaling
}

# Perform the grid search
grid_search = GridSearchCV(lr_pipeline, param_grid, cv=5, scoring="r2")
grid_search.fit(X, y)

In [68]:
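# Extract results
cv_results = grid_search.cv_results_

# Label each score with the parameters that produced it.
# cv_results_ has one entry per parameter combination (20 here), ordered with
# with_mean varying fastest, so the scores cannot simply be paired with np.arange(1, 11).
results_df = pd.DataFrame(cv_results["params"]).rename(columns={
    "polynomial__degree": "degree",
    "preprocessing__scaler__with_mean": "with_mean"
})
results_df["mean_test_score"] = cv_results["mean_test_score"]

# Show the first 10 of the 20 combinations
results_df.head(10)

Unnamed: 0,degree,with_mean,mean_test_score
0,1,True,0.532816
1,1,False,0.532827
2,2,True,0.545071
3,2,False,0.545093
4,3,True,0.545692
5,3,False,0.546127
6,4,True,-0.380265
7,4,False,-0.254947
8,5,True,-46.00328
9,5,False,-46.381511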


In [69]:
print(grid_search.best_estimator_)
print("Best Score (R^2):", grid_search.best_score_)
print("Best Parameters:", grid_search.best_params_)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('dummify',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Bldg Type']),
                                                 ('scaler',
                                                  StandardScaler(with_mean=False),
                                                  ['Gr Liv Area',
                                                   'TotRms AbvGrd'])])),
                ('polynomial', PolynomialFeatures(degree=np.int64(3))),
                ('linear_regression', LinearRegression())])
Best Score (R^2): 0.5461269130054623
Best Parameters: {'polynomial__degree': np.int64(3), 'preprocessing__scaler__with_mean': False}


In [70]:
# Sanity check: count the individual fits performed by the grid search
# (20 parameter combinations x 5 cross-validation folds)
num_fits = len(cv_results['params']) * 5
print(f"Total model fitting steps: {num_fits}")

Total model fitting steps: 100
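
The 100 fits reported above are parameter combinations multiplied by folds, not 100 distinct models. As a quick sketch, `ParameterGrid` can enumerate a grid of the same shape without fitting anything, which shows both the number of combinations and the order in which `GridSearchCV` stores them in `cv_results_`. The name `demo_grid` is introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above: 10 degrees x 2 centering options
demo_grid = {
    "polynomial__degree": list(range(1, 11)),
    "preprocessing__scaler__with_mean": [True, False],
}

combos = list(ParameterGrid(demo_grid))

print(len(combos))   # number of distinct parameter combinations
print(combos[:4])    # first few combinations, in evaluation order
```

The combinations are generated with the keys sorted alphabetically and the last key varying fastest, so consecutive entries share a degree and differ only in `with_mean`.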


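- The best-performing model used `PolynomialFeatures` with degree 3. Because the polynomial step follows the column transformer, it expands all preprocessed columns together (the scaled 'Gr Liv Area' and 'TotRms AbvGrd' plus the 'Bldg Type' dummies), including their interactions.
- Best Parameters:
  - Polynomial degree: 3
  - Preprocessing: StandardScaler with with_mean=False (i.e., without centering the data).
- The cross-validated R-squared score for this model was approximately 0.546, meaning it explains about 55% of the variance in the target variable (SalePrice) on held-out folds, on average.
- This search covered 20 parameter combinations (10 degrees x 2 centering options), not the 100 combinations of separate degrees for size and number of rooms described in the task.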
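
The grid in In [67] tunes one shared degree for all columns. As a sketch of the alternative the hint points to, the pipeline below gives each numeric column its own `PolynomialFeatures` step so that the grid has two entries and 10 x 10 = 100 candidates. The step names (`size_poly`, `rooms_poly`, `scale`, `poly`) and the choice to scale before expanding are assumptions made for this illustration, not part of the original solution.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

def scaled_poly():
    """Hypothetical helper: standardize a column, then expand it polynomially."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(include_bias=False)),
    ])

# Separate polynomial expansions for each numeric column, plus dummies
ct_separate = ColumnTransformer(
    [
        ("size_poly", scaled_poly(), ["Gr Liv Area"]),
        ("rooms_poly", scaled_poly(), ["TotRms AbvGrd"]),
        ("dummify", OneHotEncoder(handle_unknown="ignore"), ["Bldg Type"]),
    ]
)

separate_pipeline = Pipeline(
    [("preprocessing", ct_separate), ("linear_regression", LinearRegression())]
)

# Two grid entries, one degree per feature: 10 x 10 = 100 candidate models
separate_grid = {
    "preprocessing__size_poly__poly__degree": list(range(1, 11)),
    "preprocessing__rooms_poly__poly__degree": list(range(1, 11)),
}

n_candidates = len(ParameterGrid(separate_grid))

# Check that both grid keys are valid parameter names for this pipeline
valid_keys = all(key in separate_pipeline.get_params() for key in separate_grid)

print(n_candidates, valid_keys)
```

Passing `separate_pipeline` and `separate_grid` to `GridSearchCV(..., cv=5)` would then fit 500 models in total, one per candidate per fold.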


### Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

- Downsides:
  - Computational Cost: Trying all possible combinations of polynomial degrees and preprocessing options can be computationally expensive and time-consuming (this search took about 2 minutes).
  - Risk of Overfitting: Exploring a wide range of parameter combinations may lead to overfitting on the cross-validation data, reducing the generalizability of the model.
  - Complexity: Evaluating and interpreting many combinations can make it harder to gain actionable insights or draw meaningful conclusions.

- Choosing a Smaller Number of Tuning Values:
  - Reduce Polynomial Degrees: Consider testing fewer degrees, such as key values (e.g., {1, 2, 3, 5}) based on prior knowledge or exploratory analysis.
  - Randomized Search: Use RandomizedSearchCV instead of GridSearchCV to randomly sample a subset of parameter combinations.
  - Feature Selection: Apply feature selection or dimensionality reduction techniques like PCA to focus on the most important features.
  - Use Domain Knowledge: Prioritize parameters and ranges that make sense based on the data and the problem context (e.g., if higher-degree polynomials rarely perform well, limit their exploration).
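
The randomized search option can be sketched as follows. This is a minimal example on synthetic one-feature data, assuming a simplified pipeline; `toy_pipeline`, the data, and `n_iter=4` are illustrative choices rather than recommendations for the Ames problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic one-feature regression problem (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(120, 1))
y_toy = 3 * X_toy[:, 0] ** 2 + rng.normal(0, 0.5, size=120)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("linear_regression", LinearRegression()),
])

# Sample 4 of the 10 candidate degrees instead of fitting all of them
search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions={"polynomial__degree": list(range(1, 11))},
    n_iter=4,
    cv=5,
    scoring="r2",
    random_state=123,
)
search.fit(X_toy, y_toy)

n_tried = len(search.cv_results_["params"])
print(n_tried, search.best_params_)
```

The trade-off is that the sampled subset may miss the best value, so `n_iter` is a budget decision: larger values cost more fits but cover more of the grid.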