---
title: Practice Activity 7.1
author: "Deepika Agarwal"
format:
  html:
    embed-resources: true
echo: true
---


In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [2]:
ames = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/AmesHousing.txt", sep="\t")
ames.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [3]:
# Specifying training and test data
X = ames.drop("SalePrice", axis=1)
y = ames["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 123)

# Practice Activity 1

### Part a - Using only the size and number of rooms (Model 1)

In [4]:
# Standardizing living area and number of rooms using a column transformer
ct = ColumnTransformer(
  [
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

# Creating a pipeline
pipeline1 = Pipeline(
    [("preprocessing", ct),
    ("linear_regression", LinearRegression())]
)

pipeline1

In [5]:
pipeline_fitted = pipeline1.fit(X_train, y_train)  # Fit the pipeline on the training data

y_preds = pipeline_fitted.predict(X_test)  # Make predictions on the test data

rmse_model1_test = root_mean_squared_error(y_test, y_preds)  # Calculate the test RMSE

rmse_model1_test

50591.3232703246
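As a quick check on what the RMSE above measures, here is a minimal sketch that computes it by hand on made-up numbers. The arrays `y_true` and `y_hat` are hypothetical toy values, not Ames data, and `mean_squared_error` is used here only so the sketch also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy prices and predictions (not taken from the Ames data)
y_true = np.array([200000.0, 150000.0, 310000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0])

# RMSE is the square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Every residual in this toy example is 10,000 in absolute value, so both calculations give an RMSE of 10,000, in the same units as the sale price.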

### Part b - Using size, number of rooms, and building type (Model 2)

In [6]:
# Standardizing living area and number of rooms and dummifying building type using a column transformer
ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

# Creating a pipeline
pipeline2 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

pipeline2

In [7]:
pipeline_fitted = pipeline2.fit(X_train, y_train)  # Fit the pipeline on the training data

y_preds = pipeline_fitted.predict(X_test)  # Make predictions on the test data
rmse_model2_test = root_mean_squared_error(y_test, y_preds)  # Calculate the test RMSE

rmse_model2_test

49047.620948660064

### Part c - Using size and building type, and their interaction (Model 3)

In [8]:
# Dummifying building type using column transformer
ct_dummies = ColumnTransformer(
  [("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"])],
  remainder = "passthrough"
).set_output(transform = "pandas")

In [9]:
# Creating interactions between living area and each building type dummy using a column transformer
ct_inter = ColumnTransformer(
  [
    ("interaction", PolynomialFeatures(interaction_only = True),
     ["remainder__Gr Liv Area", "dummify__Bldg Type_1Fam", "dummify__Bldg Type_2fmCon",
      "dummify__Bldg Type_Duplex", "dummify__Bldg Type_Twnhs", "dummify__Bldg Type_TwnhsE"]),
  ],
  remainder = "drop"
).set_output(transform = "pandas")

In [10]:
# Selecting the required interaction columns
wanted_cols = ['interaction__remainder__Gr Liv Area',
               'interaction__dummify__Bldg Type_1Fam',
               'interaction__dummify__Bldg Type_2fmCon',
               'interaction__dummify__Bldg Type_Duplex',
               'interaction__dummify__Bldg Type_Twnhs',
               'interaction__dummify__Bldg Type_TwnhsE',
               'interaction__remainder__Gr Liv Area dummify__Bldg Type_1Fam',
               'interaction__remainder__Gr Liv Area dummify__Bldg Type_2fmCon',
               'interaction__remainder__Gr Liv Area dummify__Bldg Type_Duplex',
               'interaction__remainder__Gr Liv Area dummify__Bldg Type_Twnhs',
               'interaction__remainder__Gr Liv Area dummify__Bldg Type_TwnhsE']

# Keep only the columns listed in wanted_cols and drop everything else
select_subset = ColumnTransformer(transformers=[("keep", "passthrough", wanted_cols)],
                                  remainder="drop"
                                  ).set_output(transform="pandas")

In [11]:
# Creating a pipeline
pipeline3 = Pipeline([
    ("make_dummies", ct_dummies),
    ("interactions", ct_inter),
    ("select_cols", select_subset),
    ("model", LinearRegression())
])

pipeline3

In [12]:
pipeline3.fit(X_train, y_train)  # Fit the pipeline on the training data

preds = pipeline3.predict(X_test)  # Make predictions on the test data
rmse_model3_test = root_mean_squared_error(y_test, preds)  # Calculate the test RMSE

rmse_model3_test

48417.098629975495

### Part d - Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type (Model 4)

In [13]:
# Creating a column transformer to dummify categorical variable and apply polynomial features to numeric variables
ct_dummy_poly = ColumnTransformer(
  [ ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("size_poly", PolynomialFeatures(degree=5, include_bias=False, interaction_only=False), ["Gr Liv Area"]),
    ("room_poly", PolynomialFeatures(degree=5, include_bias=False, interaction_only=False), ["TotRms AbvGrd"]),
  ],
  remainder = "drop"
)

# Creating a pipeline
pipeline4 = Pipeline(
  [
   ("dummy_poly", ct_dummy_poly),
  ("linear_regression", LinearRegression())]
)

pipeline4
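To make the degree-5 expansion concrete, the following sketch applies the same `PolynomialFeatures` settings to a single made-up column. The input values 2 and 3 are arbitrary and chosen only so the powers are easy to read.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical numeric column with two rows
x = np.array([[2.0], [3.0]])

# Same settings as size_poly and room_poly in Model 4
poly = PolynomialFeatures(degree=5, include_bias=False)
expanded = poly.fit_transform(x)

# Each row becomes [x, x^2, x^3, x^4, x^5]
print(expanded)
```

With one input column and `include_bias=False`, the transformer returns five columns per variable, so Model 4 feeds the regression ten polynomial columns plus the building type dummies.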

In [14]:
pipeline4.fit(X_train, y_train)  # Fit the pipeline on the training data
preds = pipeline4.predict(X_test)  # Make predictions on the test data
rmse_model4_test = root_mean_squared_error(y_test, preds)  # Calculate the test RMSE

rmse_model4_test

51878.4308803204

Analysis: Model 3 performs best on the test set, with the lowest RMSE (about 48,417), followed by Model 2 (about 49,048), Model 1 (about 50,591), and Model 4 (about 51,878).

# Practice Activity 2

### Part a - Using only the size and number of rooms.

In [15]:
# Using cross validation to find the cross-validated R-squared value for Model 1
scores = cross_val_score(pipeline1, X, y, cv=5, scoring='r2')
scores.mean()

np.float64(0.504208752508862)

In [16]:
# Using cross validation to find the cross-validated root mean squared error for Model 1
rmse_score1 = cross_val_score(pipeline1, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse_score1

array([-61608.03513075, -54133.82663151, -58982.46718389, -56380.660898  , -47926.64190218])

In [17]:
-rmse_score1.mean()  # Average the fold scores and negate to get a positive RMSE

np.float64(55806.32634926364)
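The scores above are negative because scikit-learn scorers follow a "higher is better" convention, so error metrics are reported with the sign flipped. The sketch below, on synthetic data (`X_demo` and `y_demo` are made up for illustration), suggests that negating the scores recovers the per-fold RMSE one would compute by hand with an unshuffled 5-fold split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(size=100)

# Scores come back negated so that larger values mean better models
neg_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=5, scoring="neg_root_mean_squared_error",
)

# Manual version: for a regressor, cv=5 corresponds to KFold(5) without shuffling
manual = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_preds = model.predict(X_demo[test_idx])
    manual.append(np.sqrt(mean_squared_error(y_demo[test_idx], fold_preds)))
manual = np.array(manual)

print(-neg_scores)
print(manual)
```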

### Part b - Using size, number of rooms, and building type.

In [18]:
# Using cross validation to find the cross-validated R-squared value for Model 2
scores = cross_val_score(pipeline2, X, y, cv=5, scoring='r2')
scores.mean()

np.float64(0.5328824390692034)

In [19]:
# Using cross validation to find the cross-validated root mean squared error for Model 2
rmse_score2 = cross_val_score(pipeline2, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse_score2

array([-59447.5945622 , -51677.04316677, -57660.55257913, -54423.45405505, -47631.76278281])

In [20]:
-rmse_score2.mean()  # Average the fold scores and negate to get a positive RMSE

np.float64(54168.08142919383)

### Part c - Using size and building type, and their interaction.

In [21]:
# Using cross validation to find the cross-validated R-squared value for Model 3

scores = cross_val_score(pipeline3, X, y, cv=5, scoring='r2')
scores.mean()

np.float64(0.5448672416905465)

In [22]:
# Using cross validation to find the cross-validated root mean squared error for Model 3
rmse_score3 = cross_val_score(pipeline3, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse_score3

array([-57971.79204856, -51125.29110265, -57526.13624288, -53354.32086375, -47177.0696188 ])

In [23]:
-rmse_score3.mean()  # Average the fold scores and negate to get a positive RMSE

np.float64(53430.92197532872)

### Part d - Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

In [24]:
# Using cross validation to find the cross-validated R-squared value for Model 4
scores = cross_val_score(pipeline4, X, y, cv=5, scoring='r2')
scores.mean()

np.float64(0.49713957610353887)

In [25]:
# Using cross validation to find the cross-validated root mean squared error for Model 4
rmse_score4 = cross_val_score(pipeline4, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse_score4

array([-60986.11547115, -56279.0971022 , -56121.1209209 , -57853.97686291, -50038.37136632])

In [26]:
-rmse_score4.mean()  # Average the fold scores and negate to get a positive RMSE

np.float64(56255.73634469775)

Analysis: Model 3 is the best model because it has the lowest cross-validated RMSE (about 53,431) and the highest cross-validated R-squared (about 0.545).
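For a side-by-side view, the sketch below collects the metrics reported in the outputs of Practice Activities 1 and 2 into one table. The numbers are copied from those outputs and rounded; the data frame name `summary` is just a label chosen for this illustration.

```python
import pandas as pd

# Values copied (rounded) from the outputs reported in this document
summary = pd.DataFrame(
    {
        "test_rmse": [50591.32, 49047.62, 48417.10, 51878.43],
        "cv_rmse": [55806.33, 54168.08, 53430.92, 56255.74],
        "cv_r2": [0.5042, 0.5329, 0.5449, 0.4971],
    },
    index=["Model 1", "Model 2", "Model 3", "Model 4"],
)

# Rank the models by cross-validated RMSE, lowest first
print(summary.sort_values("cv_rmse"))

best_by_test = summary["test_rmse"].idxmin()
best_by_cv = summary["cv_rmse"].idxmin()
print(best_by_test, best_by_cv)
```

Both the single train/test split and the 5-fold cross-validation rank the models in the same order here, with Model 3 first and Model 4 last.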

# 13.3.3 Your Turn

### Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

### Q1: Which model performed the best?

### Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?



In [27]:
# Creating a column transformer to dummify categorical variable and apply polynomial features to numeric variables
ct_poly = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"],),
    ("room_poly", PolynomialFeatures(), ["TotRms AbvGrd"]),
    ("size_poly", PolynomialFeatures(), ["Gr Liv Area"])
  ],
  remainder = "drop"
)

# Creating a pipeline
lr_pipeline_poly = Pipeline(
  [("preprocessing", ct_poly),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")

# Define a parameter grid to tune polynomial degrees from 1 to 10
degrees = {
  "preprocessing__room_poly__degree":  np.arange(1, 11),
  "preprocessing__size_poly__degree": np.arange(1, 11),
}

# Using GridSearchCV with 5-fold cross-validation to find the best degree settings
gscv = GridSearchCV(lr_pipeline_poly, degrees, cv = 5, scoring='r2')

In [28]:
gscv_fitted = gscv.fit(X, y)

In [29]:
pd.DataFrame(gscv_fitted.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_preprocessing__room_poly__degree,param_preprocessing__size_poly__degree,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003708,0.000281,0.002141,0.000407,1,1,"{'preprocessing__room_poly__degree': 1, 'prepr...",0.531978,0.532253,0.428295,0.565748,0.606138,0.532882,0.058968,17
1,0.003833,0.000577,0.002079,0.000461,1,2,"{'preprocessing__room_poly__degree': 1, 'prepr...",0.531716,0.526189,0.456095,0.582384,0.590975,0.537472,0.048296,10
2,0.003818,0.000485,0.001916,0.000121,1,3,"{'preprocessing__room_poly__degree': 1, 'prepr...",0.545224,0.534884,0.512144,0.592701,0.603250,0.557641,0.034789,1
3,0.003478,0.000106,0.001933,0.000229,1,4,"{'preprocessing__room_poly__degree': 1, 'prepr...",0.509180,0.454658,0.408596,0.547423,0.576227,0.499217,0.060912,29
4,0.003616,0.000163,0.001814,0.000028,1,5,"{'preprocessing__room_poly__degree': 1, 'prepr...",0.507440,0.445234,0.458415,0.509277,0.565332,0.497140,0.042656,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.007658,0.001944,0.003509,0.001137,10,6,"{'preprocessing__room_poly__degree': 10, 'prep...",0.487494,0.438350,-0.315753,0.169961,0.551866,0.266384,0.318792,50
96,0.005889,0.002124,0.002858,0.001074,10,7,"{'preprocessing__room_poly__degree': 10, 'prep...",0.446858,0.382049,0.185255,-0.206586,0.513025,0.264120,0.259630,51
97,0.006711,0.001576,0.003793,0.001086,10,8,"{'preprocessing__room_poly__degree': 10, 'prep...",0.387189,0.330723,-0.523899,-1.994890,0.453637,-0.269448,0.933394,70
98,0.006747,0.002005,0.002626,0.000947,10,9,"{'preprocessing__room_poly__degree': 10, 'prep...",0.317553,0.273531,-2.874989,-7.261625,0.382935,-1.832519,2.984240,90
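The full results table is long, so it can help to sort it by rank and keep only a few columns. The sketch below does this on a small synthetic grid search; the data, the pipeline step names `poly` and `lr`, and the variable names are assumptions made for illustration and are not the Ames pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cubic relationship, for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(200, 1))
y_demo = x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=200)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("lr", LinearRegression()),
])
search = GridSearchCV(pipe, {"poly__degree": np.arange(1, 6)}, cv=5, scoring="r2")
search.fit(x, y_demo)

# Sort by rank and keep the columns that matter for comparing candidates
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["param_poly__degree", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

Looking at `std_test_score` alongside `mean_test_score` gives a rough sense of how stable each candidate is across folds.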


In [30]:
gscv_fitted.best_params_ # Shows the best combination of polynomial degrees found during grid search

{'preprocessing__room_poly__degree': np.int64(1),
 'preprocessing__size_poly__degree': np.int64(3)}

In [31]:
gscv_fitted.best_score_  # Highest cross-validated R² score achieved during the grid search

np.float64(0.5576406176063344)

Analysis:

Q1: Which model performed the best?

The best model uses a degree-1 polynomial for number of rooms and a degree-3 polynomial for house size, with a cross-validated R-squared of about 0.5576.

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

1. Testing every combination is time-consuming and computationally expensive: this grid alone has 100 candidate models, each fitted on 5 folds, for 500 fits.

2. It increases the risk of overfitting to the validation folds. With so many options, a model might appear best just by random chance on the CV folds.

3. The results can become unstable or hard to interpret. For example, the very high-degree polynomials here produce negative R-squared values on some folds.

Better way:
Use random search: set reasonable ranges for the tuning values and test a random sample of combinations (e.g., 20–50). This saves time and usually finds good models without checking every single option.
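The random search idea can be sketched with scikit-learn's `RandomizedSearchCV`. The example below runs on synthetic data rather than the Ames data; the data frame `demo`, its columns `size` and `rooms`, the target `price`, and the choice of 20 iterations are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for house size and room count (not the Ames data)
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "size": rng.uniform(0.5, 3.0, n),
    "rooms": rng.integers(3, 11, n).astype(float),
})
price = 50 + 80 * demo["size"] + 5 * demo["rooms"] + rng.normal(0, 20, n)

ct = ColumnTransformer(
    [
        ("size_poly", PolynomialFeatures(include_bias=False), ["size"]),
        ("room_poly", PolynomialFeatures(include_bias=False), ["rooms"]),
    ],
    remainder="drop",
)
pipe = Pipeline([("preprocessing", ct), ("linear_regression", LinearRegression())])

# Same 10 x 10 grid of degrees, but only 20 randomly sampled combinations are fitted
param_distributions = {
    "preprocessing__size_poly__degree": list(range(1, 11)),
    "preprocessing__room_poly__degree": list(range(1, 11)),
}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=20, cv=5, scoring="r2", random_state=0
)
search.fit(demo, price)

print(search.best_params_)
print(len(search.cv_results_["params"]))
```

This fits 20 of the 100 possible combinations, so it costs roughly a fifth of the full grid search, at the price of possibly missing the single best combination.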