## 13.2.5 Your Turn

### Practice Activity
Consider four possible models for predicting house prices:


1.   Using only the size and number of rooms.
2.   Using size, number of rooms, and building type.
3.   Using size and building type, and their interaction.
4.   Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function `train_test_split()` **one** time in your code; that is, we should be predicting on the **same** test set for all four models.
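As a minimal sketch of the "split once, reuse everywhere" idea in the note above, the following uses small synthetic data rather than the Ames housing file. The names `X_demo`, `y_demo`, and `candidates` are hypothetical, and `random_state=42` is only an assumption added to make the split reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two numeric predictors, one target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Split exactly once; fixing random_state makes the split reproducible.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Each candidate is fit on the same training rows and scored on the same test rows.
candidates = {
    "first column only": [0],
    "both columns": [0, 1],
}
rmse_by_model = {}
for name, cols in candidates.items():
    model = LinearRegression().fit(Xd_train[:, cols], yd_train)
    preds = model.predict(Xd_test[:, cols])
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(yd_test, preds)))

print(rmse_by_model)
```

Because every model sees the same test rows, differences in RMSE reflect the models rather than the luck of the split.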






In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [2]:

from google.colab import files
uploaded = files.upload()

import pandas as pd
import io

df = pd.read_csv(io.BytesIO(uploaded['AmesHousing (6).csv']))
df.head()

Saving AmesHousing.csv to AmesHousing (6).csv


Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


## Model 1: Using Only the Size and Number of Rooms

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


In [4]:
X = df.drop("SalePrice", axis = 1)
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [5]:
lr = LinearRegression()
std_s = StandardScaler()

ct = ColumnTransformer(
    [("standardize", std_s, ["Gr Liv Area", "TotRms AbvGrd"])],
    remainder = "drop"
).set_output(transform = "pandas")

pipe1 = Pipeline(
  [("standardize", ct),
  ("linear_regression", lr)]
).set_output(transform = "pandas")

fit_pipe1 = pipe1.fit(X_train, y_train)

y_preds = fit_pipe1.predict(X_test)

In [6]:
r2_m1 = r2_score(y_test, y_preds)
r2_m1

0.47426884836104066

In [7]:
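# RMSE (not MSE): take the square root of the mean squared error.
# np.sqrt also works on scikit-learn versions that no longer accept squared = False.
mse_m1 = np.sqrt(mean_squared_error(y_test, y_preds))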
mse_m1



55603.13209083621
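To make the relationship between MSE and RMSE concrete, here is a small hand-checkable sketch. The three prices in `y_true_demo` and `y_pred_demo` are made up for illustration and do not come from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical sale prices and predictions, chosen so the arithmetic is easy to check.
y_true_demo = np.array([200000.0, 150000.0, 300000.0])
y_pred_demo = np.array([210000.0, 140000.0, 330000.0])

# Errors are 10000, -10000, and 30000, so the squared errors are 1e8, 1e8, and 9e8.
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE is the square root of MSE, which puts it back in dollars.
rmse_demo = np.sqrt(mse_demo)

print(f"MSE  = {mse_demo:.0f}")
print(f"RMSE = {rmse_demo:.2f}")
```

The RMSE is on the same scale as the sale price, which is why it is easier to interpret than the raw MSE.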

In [8]:
predict_bedrooms_charge = pd.DataFrame({"Size": X_test["Gr Liv Area"],
                            "Number of Rooms": X_test["TotRms AbvGrd"],
                            "Test Charges": y_test,
                            "Prediction Charges": y_preds})
predict_bedrooms_charge.head()

Unnamed: 0,Size,Number of Rooms,Test Charges,Prediction Charges
2806,1948,8,195000,226500.876438
1539,2161,8,263400,257640.444216
2665,1200,7,125000,129788.482433
2052,1020,5,165000,128756.516941
1277,1800,10,130000,179580.831454


In [41]:
scores1 = cross_val_score(pipe1, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"CV RMSE = {round(scores1.mean()) * -1}")

CV RMSE = 55806
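The `* -1` above is needed because scikit-learn scorers follow a "higher is better" convention, so `neg_root_mean_squared_error` returns the RMSE with its sign flipped. A minimal sketch on synthetic data (the names `X_cv` and `y_cv` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): one predictor with a linear signal plus noise.
rng = np.random.default_rng(1)
X_cv = rng.normal(size=(100, 1))
y_cv = 2.0 * X_cv[:, 0] + rng.normal(scale=1.0, size=100)

# One score per fold; each is the negative of that fold's RMSE.
neg_rmse = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=5, scoring="neg_root_mean_squared_error"
)

# Flip the sign to recover ordinary RMSE values, then average across folds.
fold_rmse = -neg_rmse
cv_rmse = fold_rmse.mean()

print("Per-fold RMSE:", np.round(fold_rmse, 3))
print("Mean CV RMSE:", round(cv_rmse, 3))
```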


## Model 2: Using Size, Number of Rooms, and Building Type

In [10]:
enc = OneHotEncoder(sparse_output = False)

ct = ColumnTransformer(
    [("standardize", std_s, ["Gr Liv Area", "TotRms AbvGrd"]),
     ("dummify", enc, ["Bldg Type"])],
    remainder = "drop"
).set_output(transform = "pandas")

pipe2 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", lr)]
).set_output(transform = "pandas")

fit_pipe2 = pipe2.fit(X_train, y_train)
y_preds2 = fit_pipe2.predict(X_test)

In [11]:
r2_m2 = r2_score(y_test, y_preds2)
r2_m2

0.5245631493169479

In [12]:
# RMSE: square root of the mean squared error (avoids the squared = False argument).
mse_m2 = np.sqrt(mean_squared_error(y_test, y_preds2))
mse_m2



52876.63632841767

In [13]:
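predict_bldg_charge = pd.DataFrame({"Size": X_test["Gr Liv Area"],
                            "Number of Rooms": X_test["TotRms AbvGrd"],
                            "Building Type": X_test["Bldg Type"],
                            "Test Charges": y_test,
                            # Use Model 2's predictions (y_preds2); y_preds holds Model 1's.
                            # The table shown below was produced with y_preds.
                            "Prediction Charges": y_preds2})
predict_bldg_charge.head()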

Unnamed: 0,Size,Number of Rooms,Building Type,Test Charges,Prediction Charges
2806,1948,8,1Fam,195000,226500.876438
1539,2161,8,1Fam,263400,257640.444216
2665,1200,7,1Fam,125000,129788.482433
2052,1020,5,1Fam,165000,128756.516941
1277,1800,10,Duplex,130000,179580.831454


In [40]:
scores2 = cross_val_score(pipe2, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"CV RMSE = {round(scores2.mean()) * -1}")

CV RMSE = 54168


## Model 3: Using Size, Building Type, and their Interaction

In [15]:
enc = OneHotEncoder(sparse_output = False,handle_unknown = 'ignore')


ct_dummy = ColumnTransformer(
    [("standardize", std_s, ["Gr Liv Area"]),
     ("dummify", enc, ["Bldg Type"])],
    remainder = "passthrough"
).set_output(transform = "pandas")

ct_interaction = ColumnTransformer(
    [
        ("interaction", PolynomialFeatures(interaction_only= True), ['dummify__Bldg Type_1Fam', 'dummify__Bldg Type_2fmCon',
          'dummify__Bldg Type_Duplex', 'dummify__Bldg Type_Twnhs',
          'dummify__Bldg Type_TwnhsE', 'standardize__Gr Liv Area'])
    ],
  remainder = "drop"
).set_output(transform = "pandas")

pipe3 = Pipeline(
  [("dummify", ct_dummy),
   ("interaction", ct_interaction),
   ("linear regression", lr)]
).set_output(transform = "pandas")


fit_pipe3 = pipe3.fit(X_train, y_train)
y_preds3 = fit_pipe3.predict(X_test)

In [25]:
r2_m3 = r2_score(y_test, y_preds3)
r2_m3

0.5360087049301716

In [26]:
# RMSE: square root of the mean squared error (avoids the squared = False argument).
mse_m3 = np.sqrt(mean_squared_error(y_test, y_preds3))
mse_m3



52236.28906624211

In [39]:
scores3 = cross_val_score(pipe3, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"CV RMSE = {round(scores3.mean()) * -1}")

CV RMSE = 53437


## Model 4: Using a 5-degree Polynomial on Size, a 5-degree Polynomial on Number of Rooms, and also Building Type

In [19]:
enc = OneHotEncoder(sparse_output = False)

ct_preprocess = ColumnTransformer(
    [("dummify", enc, ["Bldg Type"]),
     ("polynomial", PolynomialFeatures(degree = 5), ["Gr Liv Area", "TotRms AbvGrd"])], remainder = "drop"
).set_output(transform = "pandas")

ct_preprocess.fit_transform(X_train)

pipe4 = Pipeline(
    [("preprocessing", ct_preprocess),
      ("linear regression", lr)]
).set_output(transform = "pandas")

fit_pipe4 = pipe4.fit(X_train, y_train)
y_preds4 = fit_pipe4.predict(X_test)

In [20]:
r2_m4 = r2_score(y_test, y_preds4)
r2_m4

-0.38352223325594936

In [21]:
# RMSE: square root of the mean squared error (avoids the squared = False argument).
mse_m4 = np.sqrt(mean_squared_error(y_test, y_preds4))
mse_m4



90200.81704524856

In [38]:
scores4 = cross_val_score(pipe4, X, y, cv=5, scoring='neg_root_mean_squared_error')
print(f"CV RMSE = {round(scores4.mean()) * -1}")

CV RMSE = 61255
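One detail worth checking in Model 4: passing both columns to a single `PolynomialFeatures(degree = 5)` also generates cross terms between size and rooms, which is more than "a 5-degree polynomial on each". The sketch below counts the columns both ways on random toy data; the step names `size_poly` and `rooms_poly` are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

# Toy data (hypothetical): two numeric columns standing in for size and rooms.
rng = np.random.default_rng(2)
toy = rng.uniform(1, 3, size=(10, 2))

# One transformer on both columns: all monomials up to total degree 5,
# including mixed terms such as size * rooms.
joint = PolynomialFeatures(degree=5, include_bias=False).fit(toy)
n_joint = joint.n_output_features_

# One transformer per column: only pure powers of each variable.
separate = ColumnTransformer([
    ("size_poly", PolynomialFeatures(degree=5, include_bias=False), [0]),
    ("rooms_poly", PolynomialFeatures(degree=5, include_bias=False), [1]),
])
n_separate = separate.fit_transform(toy).shape[1]

print("Joint expansion columns:   ", n_joint)
print("Separate expansion columns:", n_separate)
```

The joint version produces twice as many columns, and the extra flexibility may contribute to the poor test RMSE seen above.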


## 13.3.1 `cross_val_score` Observation

Model 3 has the lowest test-set RMSE, so it performed best.

In [28]:
print("Model 1 (Size and Number of Rooms) RMSE:", mse_m1)
print("Model 2 (Size, Number of Rooms, and Building Type) RMSE:", mse_m2)
print("Model 3 (Size, Building Type, and Interaction) RMSE:", mse_m3)
print("Model 4 (5 degree Polynomial on Size, 5 degree Polynomial on Number of Rooms, and Building Type) RMSE:", mse_m4)

Model 1 (Size and Number of Rooms) RMSE: 55603.13209083621
Model 2 (Size, Number of Rooms, and Building Type) RMSE: 52876.63632841767
Model 3 (Size, Building Type, and Interaction) RMSE: 52236.28906624211
Model 4 (5 degree Polynomial on Size, 5 degree Polynomial on Number of Rooms, and Building Type) RMSE: 90200.81704524856


In [43]:

if mse_m2 < mse_m3:
  print("Model 2 has a smaller RMSE than Model 3.")
elif mse_m3 < mse_m2:
  print("Model 3 has a smaller RMSE than Model 2.")
else:
  print("Model 2 and Model 3 have the same RMSE.")

Model 3 has a smaller RMSE than Model 2.


In [36]:
print("Model 1 (Size and Number of Rooms) CV RMSE:", round(scores1.mean() * -1))
print("Model 2 (Size, Number of Rooms, and Building Type) CV RMSE:", round(scores2.mean() * -1))
print("Model 3 (Size, Building Type, and Interaction) CV RMSE:", round(scores3.mean() * -1))
print("Model 4 (5 degree Polynomial on Size, 5 degree Polynomial on Number of Rooms, and Building Type) CV RMSE:", round(scores4.mean() * -1))

Model 1 (Size and Number of Rooms) CV RMSE: 55806
Model 2 (Size, Number of Rooms, and Building Type) CV RMSE: 54168
Model 3 (Size, Building Type, and Interaction) CV RMSE: 53437
Model 4 (5 degree Polynomial on Size, 5 degree Polynomial on Number of Rooms, and Building Type) CV RMSE: 61255


In [37]:
cv_rmse_scores = [round(scores1.mean() * -1), round(scores2.mean() * -1), round(scores3.mean() * -1), round(scores4.mean() * -1)]
lowest_cv_rmse = min(cv_rmse_scores)
best_model_index = cv_rmse_scores.index(lowest_cv_rmse)

print("CV RMSE scores for the four models:", cv_rmse_scores)
print("The lowest CV RMSE is:", lowest_cv_rmse)
print("The best performing model is model", best_model_index + 1)

CV RMSE scores for the four models: [55806, 54168, 53437, 61255]
The lowest CV RMSE is: 53437
The best performing model is model 3


Model 3 also has the lowest cross-validated RMSE (53437), so it is the model I would choose. This agrees with the earlier test-set comparison, where Model 3 had the smallest RMSE as well.

## 13.3.3 Tuning

Consider *one hundred* modeling options for house price:


*   House size, trying degrees 1 through 10
*   Number of rooms, trying degrees 1 through 10
*   Building Type

*Hint: The dictionary of possible values that you make to give to `GridSearchCV` will have two elements instead of one.*

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?
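To illustrate the hint about a two-element dictionary, the sketch below expands such a grid with `ParameterGrid`, which is what `GridSearchCV` iterates over internally. The key names assume a pipeline step called `preprocessing` containing transformers called `size_polynomial` and `rooms_polynomial`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Two hyperparameters with ten candidate values each (key names are assumptions).
grid = {
    "preprocessing__size_polynomial__degree": np.arange(1, 11),
    "preprocessing__rooms_polynomial__degree": np.arange(1, 11),
}

# ParameterGrid expands the dictionary into every combination of values.
combos = list(ParameterGrid(grid))
print("Number of combinations:", len(combos))

# The keys are iterated in sorted order, so the alphabetically last key
# (the size degree here) changes fastest from one candidate to the next.
for combo in combos[:3]:
    print(combo)
```

The order of candidates matters when matching scores back to parameter values, so it is usually safer to read the `param_...` entries of `cv_results_` than to rebuild the grid by hand.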



In [45]:
from sklearn.model_selection import GridSearchCV
enc = OneHotEncoder(sparse_output = False)


ct_tuning = ColumnTransformer(
  [("size_polynomial", PolynomialFeatures(), ["Gr Liv Area"]),
  ("rooms_polynomial", PolynomialFeatures(), ["TotRms AbvGrd"]),
   ("dummify", enc, ["Bldg Type"])],
  remainder = "drop"
)

lr_pipeline_tuning = Pipeline(
  [("preprocessing", ct_tuning),
  ("linear_regression", lr)]
).set_output(transform="pandas")

degrees = {'preprocessing__size_polynomial__degree': np.arange(1, 11),
           "preprocessing__rooms_polynomial__degree": np.arange(1, 11)}

gscv = GridSearchCV(lr_pipeline_tuning, degrees, cv = 5, scoring='neg_root_mean_squared_error')
gscv_fitted = gscv.fit(X, y)

In [46]:
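# cv_results_ is a dictionary; the mean negative RMSE per candidate is under 'mean_test_score'
gscv_fitted.cv_results_['mean_test_score']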


array([ -54168.08142919,  -53925.41781821,  -52781.98411754,
        -53199.00680115,  -58214.29263896,  -62623.65822039,
        -72233.78139036,  -97088.9905488 , -149965.53335844,
       -242581.14864938,  -54218.38403911,  -54152.70400263,
        -52837.4462235 ,  -53184.00062906,  -58214.29255545,
        -62623.65822039,  -72233.78139036,  -97088.98373309,
       -149965.54140738, -242581.1486491 ,  -53995.89644106,
        -54101.47839371,  -53003.26539061,  -53183.59071765,
        -55620.7387308 ,  -62623.65822039,  -72233.77978606,
        -97088.97812747, -149965.54140737, -242581.14864962,
        -53596.30184666,  -53951.82868716,  -53163.13364745,
        -52808.99922438,  -56077.28077025,  -62623.6581797 ,
        -72233.77791543,  -97088.97812747, -149965.54140735,
       -242581.14864976,  -53614.99780118,  -54253.59397717,
        -53386.55172603,  -52828.45108741,  -56302.85389305,
        -62623.65825143,  -72233.77791543,  -97088.97812746,
       -149965.54140736,

In [51]:
# Read the degrees directly from cv_results_ instead of rebuilding them by hand.
# GridSearchCV iterates parameter names in sorted order, so the rooms degree is the
# outer loop and the size degree the inner loop; np.repeat on size would swap the labels.
results = gscv_fitted.cv_results_

cv_results_df = pd.DataFrame({
    "size degrees": results["param_preprocessing__size_polynomial__degree"],
    "room degrees": results["param_preprocessing__rooms_polynomial__degree"],
    "scores": results["mean_test_score"] * -1   # flip sign: scores are negative RMSE
})

pd.set_option('display.float_format', '{:.0f}'.format)
cv_results_df.sort_values(by = "scores").head()

Unnamed: 0,size degrees,room degrees,scores
2,3,1,52782
33,4,4,52809
43,4,5,52828
12,3,2,52837
63,4,7,52959


The model that performed the best uses a polynomial of degree 3 for the size of the house and degree 1 for the number of rooms, with a CV RMSE of about 52782.


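One downside of trying all possible model options is computational cost: with 100 combinations and 5-fold cross-validation, the grid search above fits 500 models during cross-validation, and the count grows multiplicatively with each added hyperparameter or candidate value. This matters most for complex models or large datasets. A second downside is that comparing many candidates on the same folds can overfit the model selection itself: the winning combination may be the one that best matches noise in those particular folds rather than a generalizable pattern.

To choose a smaller set of tuning values, I would start with a coarse grid (for example degrees 1, 2, 3, and 5), drop the regions that clearly perform badly, and then search more finely around the best values. In the grid above, the CV RMSE climbs steeply at the highest degrees, so those could be excluded from the start.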
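One common way to try fewer combinations is a randomized search, which samples a fixed number of candidates from the grid instead of evaluating all of them. The sketch below is a minimal illustration on synthetic data, not a rerun of the housing analysis; the names `X_rs`, `y_rs`, `a_poly`, and `b_poly`, and the choice of `n_iter=6`, are all assumptions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical): quadratic in the first column, linear in the second.
rng = np.random.default_rng(3)
X_rs = rng.uniform(0, 2, size=(120, 2))
y_rs = X_rs[:, 0] ** 2 + X_rs[:, 1] + rng.normal(scale=0.1, size=120)

# Separate polynomial expansions per column, then a linear regression.
ct_demo = ColumnTransformer([
    ("a_poly", PolynomialFeatures(include_bias=False), [0]),
    ("b_poly", PolynomialFeatures(include_bias=False), [1]),
])
pipe_demo = Pipeline([
    ("preprocessing", ct_demo),
    ("linear_regression", LinearRegression()),
])

# 4 x 4 = 16 possible combinations, but only n_iter=6 of them are evaluated.
param_distributions = {
    "preprocessing__a_poly__degree": list(range(1, 5)),
    "preprocessing__b_poly__degree": list(range(1, 5)),
}
search = RandomizedSearchCV(
    pipe_demo,
    param_distributions,
    n_iter=6,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_rs, y_rs)

print("Candidates evaluated:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

A randomized search is not guaranteed to find the best combination in the grid, so it trades some thoroughness for a smaller, predictable number of model fits.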