# Part 1: Pipeline Predictions 
Consider four possible models for predicting house prices:

- Using only the size and number of rooms.
- Using size, number of rooms, and building type.
- Using size and building type, and their interaction.
- Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.
- Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all three models.

In [7]:
import pandas as pd
import numpy as np

house = pd.read_csv("/Users/dan/calpoly/BusinessAnalytics/GSB544MACHINE/Week7/data/AmesHousing.csv")
house.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [21]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn as sk
from sklearn.compose import ColumnTransformer

# Create training to use for each model
X = house[["Gr Liv Area", "TotRms AbvGrd", "Bldg Type"]]
y = house["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# model 1 
##############
ct1 = ColumnTransformer(
  [
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)

lr_pipeline = Pipeline(
  [("preprocessing", ct1),
  ("linear_regression", LinearRegression())]
)

# model 2
##############
ct2 = ColumnTransformer([
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"]),
    ("dummify", OneHotEncoder(), ["Bldg Type"])
])

lr_pipeline_2 = Pipeline([
    ("preprocessing", ct2),
    ("linear_regression", LinearRegression())
])

# model 3
##############
ct3 = ColumnTransformer([
    ("standardize", StandardScaler(), ["Gr Liv Area"]),
    ("dummify", OneHotEncoder(), ["Bldg Type"])
])

lr_pipeline_3 = Pipeline([
    ("preprocessing", ct3),
    ("interaction", PolynomialFeatures(degree=1, interaction_only=True)),
    ("linear_regression", LinearRegression())
])

# model 4
##############
ct4 = ColumnTransformer([
    ("polynomial", PolynomialFeatures(degree=5, include_bias=False), ["Gr Liv Area", "TotRms AbvGrd"]),
    ("dummify", OneHotEncoder(), ["Bldg Type"])
])

lr_pipeline_4 = Pipeline([
    ("preprocessing", ct4),
    ("linear_regression", LinearRegression())
])

# Helper function to calculate RMSE
def calculate_rmse(pipeline, X_train, X_test, y_train, y_test):
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    rmse = np.sqrt(sk.metrics.mean_squared_error(y_test, predictions))
    return rmse

# Calculate RMSE for each model
rmse_1 = calculate_rmse(lr_pipeline, X_train, X_test, y_train, y_test)
rmse_2 = calculate_rmse(lr_pipeline_2, X_train, X_test, y_train, y_test)
rmse_3 = calculate_rmse(lr_pipeline_3, X_train, X_test, y_train, y_test)
rmse_4 = calculate_rmse(lr_pipeline_4, X_train, X_test, y_train, y_test)

# Print the RMSE for each model
print("RMSE for Model 1:", rmse_1)
print("RMSE for Model 2:", rmse_2)
print("RMSE for Model 3:", rmse_3)
print("RMSE for Model 4:", rmse_4)

# Determine the best model based on RMSE
best_model_index = np.argmin([rmse_1, rmse_2, rmse_3, rmse_4]) + 1
print(f"The best model is Model {best_model_index} with the lowest RMSE.")

RMSE for Model 1: 57000.61907316218
RMSE for Model 2: 54417.78280257894
RMSE for Model 3: 53691.48847468926
RMSE for Model 4: 147284.0018411477
The best model is Model 3 with the lowest RMSE.


# Part 2 : Cross Validation
Once again consider four modeling options for house price:

Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?

In [22]:
from sklearn.model_selection import cross_val_score

# Find cross validation scores for each
scores1 = cross_val_score(lr_pipeline, X, y, cv=5, scoring='neg_root_mean_squared_error')
scores2 = cross_val_score(lr_pipeline_2, X, y, cv=5, scoring='neg_root_mean_squared_error')
scores3 = cross_val_score(lr_pipeline_3, X, y, cv=5, scoring='neg_root_mean_squared_error')
scores4 = cross_val_score(lr_pipeline_4, X, y, cv=5, scoring='neg_root_mean_squared_error')

# Find ME by taking negative of mean
rsme1 = -scores1.mean()
rsme2 = -scores2.mean()
rsme3 = -scores3.mean()
rsme4 = -scores4.mean()

# Print the RMSE for each model
print("RMSE for Model 1:", rmse_1)
print("RMSE for Model 2:", rmse_2)
print("RMSE for Model 3:", rmse_3)
print("RMSE for Model 4:", rmse_4)

# Determine the best model based on RMSE
best_model_index = np.argmin([rmse_1, rmse_2, rmse_3, rmse_4]) + 1
print(f"The best model is Model {best_model_index} with the lowest RMSE.")


RMSE for Model 1: 57000.61907316218
RMSE for Model 2: 54417.78280257894
RMSE for Model 3: 53691.48847468926
RMSE for Model 4: 147284.0018411477
The best model is Model 3 with the lowest RMSE.


Yes we get the same results. I liked teh second method better.
# Part 3 : Tuning
Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

In [34]:
from sklearn.model_selection import GridSearchCV

# Define the column transformer with polynomial and one-hot encoding
ct_poly = ColumnTransformer([
    ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
    ("poly_size", PolynomialFeatures(), ["Gr Liv Area"]),
    ("poly_rooms", PolynomialFeatures(), ["TotRms AbvGrd"])
], remainder="drop")

# Define the pipeline
lr_pipeline_poly = Pipeline([
    ("preprocessing", ct_poly),
    ("linear_regression", LinearRegression())
]).set_output(transform="pandas")

# Define the parameter grid for polynomial degrees
param_grid = {
    'preprocessing__poly_size__degree': np.arange(1, 10),
    'preprocessing__poly_rooms__degree': np.arange(1, 10)
}

# Initialize GridSearchCV with 5-fold cross-validation
gscv = GridSearchCV(lr_pipeline_poly, param_grid, cv=5, scoring='neg_root_mean_squared_error')
gscv_fitted = gscv.fit(X, y)

results_df = pd.DataFrame({
    "degrees_rooms": gscv_fitted.cv_results_["param_preprocessing__poly_rooms__degree"],
    "degrees_size": gscv_fitted.cv_results_["param_preprocessing__poly_size__degree"],
    "scores": gscv_fitted.cv_results_["mean_test_score"]
})
# Find minimum negative RSME
min_score = results_df['scores'].min()
min_score_row = results_df[results_df['scores'] == min_score]
print("Best RME Score:", -min_score)
print("Details of Minimum Score Row:\nDegrees Rooms:", min_score_row)

Best RME Score: 149965.52962182596
Details of Minimum Score Row:
Degrees Rooms:     degrees_rooms  degrees_size         scores
80              9             9 -149965.529622


1. Which model performed the best?

I found that the model that performed best had degree 9 for both rooms and size, leading to a maximum RSME of 149,965.53

2. What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

A downside to this method is that it takes a lot of time for the computer to go through and evaluate all of these models. I can imagine as there is more complexity, this brute force method wont be efficient. One thing you could consider would be to make the ranges smaller and perform restricted tests, this way you could run multiple smaller tests and narrow down the best range of degrees to run a final test on. 