## **Practice Activities 7**

Consider four possible models for predicting house prices:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all four models.
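The "split once, reuse everywhere" rule in the note can be sketched as a loop over candidate pipelines that all see the same train and test rows. This is a minimal illustration on synthetic data, not the Ames data: the column names `size`, `rooms`, `type`, the helper `build`, and the dictionary `candidates` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a housing table (made-up values, not Ames).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "size": rng.uniform(500, 3000, n),
    "rooms": rng.integers(3, 10, n),
    "type": rng.choice(["A", "B"], n),
})
toy["price"] = 100 * toy["size"] + 5000 * toy["rooms"] + rng.normal(0, 10000, n)

X_toy, y_toy = toy.drop("price", axis=1), toy["price"]

# Split exactly once; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

def build(transformers):
    """Hypothetical helper: wrap a list of column transformers in a regression pipeline."""
    return Pipeline([
        ("preprocessing", ColumnTransformer(transformers, remainder="drop")),
        ("linear_reg", LinearRegression()),
    ])

candidates = {
    "numeric_only": build([("standardize", StandardScaler(), ["size", "rooms"])]),
    "with_type": build([
        ("dummify", OneHotEncoder(), ["type"]),
        ("standardize", StandardScaler(), ["size", "rooms"]),
    ]),
}

# Every candidate is fitted and scored on the same split.
rmse = {}
for name, pipe in candidates.items():
    preds = pipe.fit(X_tr, y_tr).predict(X_te)
    rmse[name] = np.sqrt(mean_squared_error(y_te, preds))

print(rmse)
```

Because the split is made once, differences in the reported errors reflect the models rather than different test rows.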

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import GridSearchCV

In [None]:
def return_same(X):
  return X

pass_variable = FunctionTransformer(return_same)
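As a side note, an identity `FunctionTransformer` is one way to leave columns untouched inside a `ColumnTransformer`; scikit-learn also accepts the string `"passthrough"` in place of a transformer. The sketch below uses a small made-up frame (the names `small`, `flag`, `area`, and `ct_demo` are hypothetical) to show the alternative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame: one column to keep as-is, one to standardize.
small = pd.DataFrame({
    "flag": [0.0, 1.0, 0.0, 1.0],
    "area": [10.0, 20.0, 30.0, 40.0],
})

ct_demo = ColumnTransformer(
    [
        ("untouched", "passthrough", ["flag"]),      # copied through unchanged
        ("standardize", StandardScaler(), ["area"]),  # centered and scaled
    ],
    remainder="drop",
)

transformed = ct_demo.fit_transform(small)
print(transformed)
```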

In [None]:
ames = pd.read_csv("AmesHousing.csv")

X = ames.drop("SalePrice", axis = 1)
y = ames["SalePrice"]



X_train, X_test, y_train, y_test = train_test_split(X, y)
ames.columns

Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
      

## Model 1

In [None]:
ct = ColumnTransformer(
  [
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

pipeline1 = Pipeline(
    [('preprocessing', ct),
        ('linear_reg', LinearRegression())]
)

fitted_pipe = pipeline1.fit(X_train,y_train)

ypreds = fitted_pipe.predict(X_test)

mse1 = mean_squared_error(y_test, ypreds)

## Model 2

In [None]:
ct = ColumnTransformer(
  [
      ('dummify', OneHotEncoder(), ['Bldg Type']),
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

pipeline2 = Pipeline(
    [('preprocessing', ct),
        ('linear_reg', LinearRegression())]
)

fitted_pipe = pipeline2.fit(X_train,y_train)

ypreds = fitted_pipe.predict(X_test)

mse2 = mean_squared_error(y_test, ypreds)

## Model 3

In [None]:
ct_dummies = ColumnTransformer(
  [("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"])],
  remainder = "passthrough"
).set_output(transform = "pandas")

ct = ColumnTransformer(
  [
    ('untouched_exp_var', pass_variable, ["dummify__Bldg Type_2fmCon","dummify__Bldg Type_Duplex","dummify__Bldg Type_Twnhs","dummify__Bldg Type_1Fam"]),
    ("standardize", StandardScaler(), ["remainder__Lot Area"]),
    ("interaction", PolynomialFeatures(interaction_only = True), ["remainder__Gr Liv Area", "dummify__Bldg Type_2fmCon"]),
    ("interaction2", PolynomialFeatures(interaction_only = True), ["remainder__Gr Liv Area", "dummify__Bldg Type_Duplex"]),
    ("interaction3", PolynomialFeatures(interaction_only = True), ["remainder__Gr Liv Area", "dummify__Bldg Type_Twnhs"]),
    ("interaction4", PolynomialFeatures(interaction_only = True), ["remainder__Gr Liv Area", "dummify__Bldg Type_1Fam"]),
  ],
  remainder = "drop"
)

pipeline3 = Pipeline(
    [
        ('preprocessing1', ct_dummies),
        ('preprocessing2', ct),
        ('linear_reg', LinearRegression())]
)

fitted_pipe = pipeline3.fit(X_train,y_train)

ypreds = fitted_pipe.predict(X_test)

mse3 = mean_squared_error(y_test, ypreds)

## Model 4

In [None]:


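ct = ColumnTransformer(
  [
    ('dummify', OneHotEncoder(), ['Bldg Type']),
    # degree=(1, 5) means minimum degree 1 and maximum degree 5, applied to both columns jointly
    ("polynomial", PolynomialFeatures(degree=(1, 5)), ["Gr Liv Area", "TotRms AbvGrd"]),
  ],
  remainder = "drop"
)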

pipeline4 = Pipeline(
    [
        ('preprocessing', ct),
        ('linear_reg', LinearRegression())]
)

fitted_pipe = pipeline4.fit(X_train,y_train)

ypreds = fitted_pipe.predict(X_test)

mse4 = mean_squared_error(y_test, ypreds)

In [None]:
# Display floats with two decimal places instead of scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)


table = pd.DataFrame({
    'Model': ['Model 1', 'Model 2','Model 3', 'Model 4'],
    'MSE': [mse1, mse2, mse3, mse4]
})

table

| | Model | MSE |
|---|---|---|
| 0 | Model 1 | 2967068612.3 |
| 1 | Model 2 | 2788427634.5 |
| 2 | Model 3 | 2715201092.79 |
| 3 | Model 4 | 2949835816.85 |


Model 3 performed best: it has the lowest test MSE, and because taking the square root preserves the ordering, it also has the lowest RMSE.
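The task asks for root mean squared error, while the table reports MSE. A minimal conversion, using the MSE values printed in the table above, might look like the following. The dictionary names are hypothetical, and because `train_test_split` was called without a `random_state`, a rerun of the notebook would give somewhat different numbers.

```python
import math

# MSE values copied from the results table above.
mse_by_model = {
    "Model 1": 2967068612.30,
    "Model 2": 2788427634.50,
    "Model 3": 2715201092.79,
    "Model 4": 2949835816.85,
}

# RMSE is the square root of MSE, which puts the error back in dollars.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}
best = min(rmse_by_model, key=rmse_by_model.get)

for name, value in rmse_by_model.items():
    print(f"{name}: {value:,.0f}")
print("Lowest RMSE:", best)
```

All four models end up with an RMSE in the low-to-mid $50,000s, which is easier to interpret than an MSE in the billions.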

# **Activity 2**

Once again consider four modeling options for house price:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?

In [None]:
scores1 = cross_val_score(pipeline1, X, y, cv=5, scoring='r2')

scores2 = cross_val_score(pipeline2, X, y, cv=5, scoring='r2')

scores3 = cross_val_score(pipeline3, X, y, cv=5, scoring='r2')

scores4 = cross_val_score(pipeline4, X, y, cv=5, scoring='r2')


scores1.mean(),scores2.mean(), scores3.mean(), scores4.mean()

(0.504208752508862, 0.533394662313337, 0.5493705863435482, 0.4055615704520127)

Model 3 again performed best, with the highest mean cross-validated R² (about 0.55; higher is better). This agrees with the conclusion from the test-set comparison above.
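The scores above are R² values, whereas the prompt asks for cross-validated RMSE. One way to get that from `cross_val_score` is the built-in scoring string `"neg_root_mean_squared_error"`: scikit-learn returns the negated error so that larger is always better, and flipping the sign recovers the RMSE. The sketch below uses synthetic data from `make_regression` rather than the Ames data, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("linear_reg", LinearRegression()),
])

# Each fold returns minus the RMSE on its held-out part.
neg_rmse = cross_val_score(pipe, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")

# Flip the sign and average to get the cross-validated RMSE.
cv_rmse = -neg_rmse.mean()
print(cv_rmse)
```

With this metric the preferred model is the one with the smallest `cv_rmse`, the opposite direction from R².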

# **Activity 3**

Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

The model that performed best used a degree-3 polynomial for house size and a degree-1 polynomial for number of rooms (TotRms AbvGrd).

A downside of trying every option is that it takes much more computing power, especially as the models become more complicated. With experience you develop an intuition for which values are plausible, and that helps in choosing a smaller number of tuning values to try.
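One possible way to try fewer options, offered here as an alternative and not as the method used in this notebook, is `RandomizedSearchCV`, which samples a fixed number of combinations from the grid instead of fitting all of them. The sketch below runs on synthetic data; the names `frame`, `target`, `poly_size`, `poly_rooms`, and `param_space` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target is quadratic in "size" and linear in "rooms".
rng = np.random.default_rng(1)
frame = pd.DataFrame({
    "size": rng.uniform(0, 1, 150),
    "rooms": rng.uniform(0, 1, 150),
})
target = 3 * frame["size"] ** 2 + frame["rooms"] + rng.normal(0, 0.1, 150)

ct_search = ColumnTransformer(
    [
        ("poly_size", PolynomialFeatures(), ["size"]),
        ("poly_rooms", PolynomialFeatures(), ["rooms"]),
    ],
    remainder="drop",
)
pipe_search = Pipeline([
    ("preprocessing", ct_search),
    ("linear_regression", LinearRegression()),
])

# Full space: 10 x 10 = 100 combinations of degrees.
param_space = {
    "preprocessing__poly_size__degree": list(range(1, 11)),
    "preprocessing__poly_rooms__degree": list(range(1, 11)),
}

# Sample only 10 of the 100 combinations.
search = RandomizedSearchCV(pipe_search, param_space, n_iter=10, cv=5,
                            scoring="r2", random_state=0)
search.fit(frame, target)

print(search.best_params_, search.best_score_)
```

This fits 10 x 5 = 50 models instead of 500. The trade-off is that the sampled subset may miss the best combination, so a coarse search followed by a finer grid around the best region is a common compromise.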

In [None]:
ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("polynomial1", PolynomialFeatures(), ["Gr Liv Area"]),
    ("polynomial2", PolynomialFeatures(), ["TotRms AbvGrd"])
  ],
  remainder = "drop"
)

pipeline5 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

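# np.arange(1, 10) yields degrees 1 through 9, so this grid has 9 x 9 = 81 combinations;
# np.arange(1, 11) would cover degrees 1 through 10 (the full one hundred).
degrees = {
    'preprocessing__polynomial1__degree': np.arange(1, 10),
    'preprocessing__polynomial2__degree': np.arange(1, 10)
}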

gscv = GridSearchCV(pipeline5, degrees, cv = 5, scoring='r2')

gscv_fitted = gscv.fit(X,y)

In [None]:
results_df = pd.DataFrame(gscv_fitted.cv_results_)


resultsdf = results_df[['param_preprocessing__polynomial1__degree','param_preprocessing__polynomial2__degree','mean_test_score', 'rank_test_score']]

resultsdf = resultsdf.rename(columns={
    'param_preprocessing__polynomial1__degree': 'house size polynomial',
    'param_preprocessing__polynomial2__degree': 'total number of rooms polynomial',
    'mean_test_score': 'Score'
})
resultsdf.sort_values('rank_test_score')

| | house size polynomial | total number of rooms polynomial | Score | rank_test_score |
|---|---|---|---|---|
| 18 | 3 | 1 | 0.56 | 1 |
| 19 | 3 | 2 | 0.56 | 2 |
| 30 | 4 | 4 | 0.56 | 3 |
| 31 | 4 | 5 | 0.56 | 4 |
| 20 | 3 | 3 | 0.55 | 5 |
| ... | ... | ... | ... | ... |
| 80 | 9 | 9 | -4.55 | 77 |
| 79 | 9 | 8 | -4.55 | 78 |
| 76 | 9 | 5 | -4.55 | 79 |
| 74 | 9 | 3 | -4.55 | 80 |
