## In this file, a final model for the data is made.

In [131]:
# Import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [132]:
# Load datasets
train = pd.read_csv('../datasets/train_cleaned.csv')
test = pd.read_csv('../datasets/test_cleaned.csv')

train.drop(columns = 'Unnamed: 0', inplace = True)
test.drop(columns = 'Unnamed: 0', inplace = True)

train = train.fillna('NA')
test = test.fillna('NA')

I will create binary columns for categorical variables based on the visualization results in file 2 (Introductory Visualizations).

In [133]:
# Encoding based on whether the garage is attached to the house
train['AttachedGarage'] = train['Garage Type'].isin(['Attchd', 'BuiltIn']).astype(int)
test['AttachedGarage'] = test['Garage Type'].isin(['Attchd', 'BuiltIn']).astype(int)

In [134]:
# Ordinal encoding of garage finish (more finished -> larger integer)
garage_finish = {'Fin': 3, 'RFn': 2, 'Unf': 1, 'NA': 0}
train['Garage Finish'] = train['Garage Finish'].map(lambda x: garage_finish.get(x))
test['Garage Finish'] = test['Garage Finish'].map(lambda x: garage_finish.get(x))
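
The dictionary mapping above assigns ordered integers to the categories, which is an ordinal encoding rather than a one-hot encoding. The following is a minimal sketch of the difference on a hypothetical toy column (the `finish` Series is invented for illustration; only the `garage_finish` dictionary comes from the notebook).

```python
import pandas as pd

# Hypothetical toy column; the real values come from train['Garage Finish']
finish = pd.Series(['Fin', 'Unf', 'NA', 'RFn', 'Unf'])

# Same ordering as the notebook: more finished -> larger integer
garage_finish = {'Fin': 3, 'RFn': 2, 'Unf': 1, 'NA': 0}

# Ordinal encoding: one numeric column that preserves the ordering.
# Passing the dict directly to .map() behaves like the lambda version.
encoded = finish.map(garage_finish)
print(encoded.tolist())

# One-hot encoding, for contrast: one indicator column per category
one_hot = pd.get_dummies(finish)
print(one_hot.columns.tolist())
```

The ordinal version keeps a single predictor and assumes the steps between categories are comparable, which is the assumption the linear model then relies on.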

In [135]:
# Encoding based on whether the building is residential, as residential buildings are more expensive
# Usage of isin follows the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
train['Residential'] = train['MS Zoning'].isin(['RL', 'RM', 'FV', 'RH']).astype(int)
test['Residential'] = test['MS Zoning'].isin(['RL', 'RM', 'FV', 'RH']).astype(int)

In [136]:
# Encoding neighborhoods on what seems most upscale based on the data
train['Expensive Nbrhd'] = train['Neighborhood'].isin(['StBr', 'NridgHt', 'NoRidge', 
                                                             'GrnHill', 'Veenker', 'Timber']).astype(int)
test['Expensive Nbrhd'] = test['Neighborhood'].isin(['StBr', 'NridgHt', 'NoRidge', 
                                                           'GrnHill', 'Veenker', 'Timber']).astype(int)

In [137]:
# Encoding based on proximity to positive off-site features
train['PosProx'] = ((train['Condition 1'].isin(['PosA', 'PosN'])) | (train['Condition 2'].isin(['PosA', 'PosN']))).astype(int)
test['PosProx'] = ((test['Condition 1'].isin(['PosA', 'PosN'])) | (test['Condition 2'].isin(['PosA', 'PosN']))).astype(int)
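
As a small illustration of how the `PosProx` flag combines two columns, here is a sketch on a hypothetical four-row frame (the `toy` data is invented). It also shows an equivalent formulation with `DataFrame.isin(...).any(axis=1)`, which is an alternative spelling and not what the notebook uses.

```python
import pandas as pd

# Hypothetical rows; the real data has the same two column names
toy = pd.DataFrame({
    'Condition 1': ['Norm', 'PosN', 'Feedr', 'Norm'],
    'Condition 2': ['Norm', 'Norm', 'PosA', 'Artery'],
})
positive = ['PosA', 'PosN']

# Notebook-style: flag the row if either column has a positive feature
toy['PosProx'] = (toy['Condition 1'].isin(positive)
                  | toy['Condition 2'].isin(positive)).astype(int)

# Equivalent alternative: test both columns at once, then reduce per row
alt = toy[['Condition 1', 'Condition 2']].isin(positive).any(axis=1).astype(int)

print(toy['PosProx'].tolist())
```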

In [138]:
# Encoding based on whether the roof is made of wood
train['WoodRoof'] = train['Roof Matl'].isin(['WdShngl', 'WdShake']).astype(int)
test['WoodRoof'] = test['Roof Matl'].isin(['WdShngl', 'WdShake']).astype(int)

In [139]:
# These are the variables that I have decided on using as predictors from the file 2 visualizations
X = train[['Garage Cars', 'AttachedGarage', 'Garage Finish', 
                 'Overall Qual', 'Overall Cond', 'Year Built', 'Total Bsmt SF',
                 'Residential', 'Expensive Nbrhd', 'PosProx', 'Gr Liv Area', 'WoodRoof']]

y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# Scale the data
ss = StandardScaler()
Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)
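
The scaler above is fitted on the training split only and then reused on the test split, so no information from the test rows leaks into the scaling. A minimal sketch with hypothetical one-column arrays (the `_toy` names are invented) shows what that means numerically:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

ss_toy = StandardScaler()
Xs_train_toy = ss_toy.fit_transform(X_train_toy)  # learns mean and std from train
Xs_test_toy = ss_toy.transform(X_test_toy)        # reuses the train mean and std

print(ss_toy.mean_, ss_toy.scale_)
print(Xs_test_toy)
```

The test value is expressed in units of the training standard deviation, so it can fall outside the range seen in the scaled training data.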

In [140]:
# Getting the baseline RMSE

y_avg = [y.mean()]*len(X)

metrics.mean_squared_error(y, y_avg, squared = False)

79239.33504161824

The baseline root mean squared error, obtained by predicting the mean sale price for every house, is about 79,239. A model has to achieve a lower RMSE than this to be considered passable.
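
Because the baseline predicts the mean for every row, its RMSE is the same quantity as the population standard deviation of the target. A short sketch with hypothetical prices (the `y_toy` values are invented) makes the equivalence explicit:

```python
import numpy as np

# Hypothetical sale prices
y_toy = np.array([100_000.0, 150_000.0, 200_000.0, 350_000.0])

# Baseline: predict the mean for every house
baseline_pred = np.full_like(y_toy, y_toy.mean())
baseline_rmse = np.sqrt(np.mean((y_toy - baseline_pred) ** 2))

# Same number as the population standard deviation (ddof=0)
print(baseline_rmse, y_toy.std())
```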

In [141]:
lr = LinearRegression()
lr.fit(Xs_train, y_train)

df = pd.DataFrame([lr.coef_], columns = X_train.columns, index = ['coef'])
df

| | Garage Cars | AttachedGarage | Garage Finish | Overall Qual | Overall Cond | Year Built | Total Bsmt SF | Residential | Expensive Nbrhd | PosProx | Gr Liv Area | WoodRoof |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coef | 8630.012708 | 1070.170465 | 2492.953705 | 24817.563527 | 6310.819013 | 11060.041431 | 10637.679766 | -915.51481 | 13204.14711 | 4068.523529 | 23440.441639 | 1239.134857 |


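The three garage parameters all have positive coefficients in the linear model, as expected, although only Garage Cars is large; AttachedGarage and Garage Finish are among the smallest coefficients. Some of the other coefficients are surprising.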

The Residential coefficient is negative, meaning that, with the other predictors held fixed, a residential building is predicted to sell for less than a non-residential one. This is odd because the boxes for residential buildings in the boxplot were higher than those for non-residential buildings.

It is also surprising that total basement area has a coefficient less than half that of overall quality and above-ground living area, given that the scatterplots looked as if all three increased price at around the same rate.
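
One thing to keep in mind when comparing these coefficients is that the model was fitted on standardized predictors, so each coefficient is the change in price per one standard deviation of that predictor, not per square foot or per quality point. The sketch below uses synthetic, noise-free data (all `_toy` names and the slopes 100 and 20,000 are invented) to show how dividing by the scaler's `scale_` recovers per-unit effects:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic predictors on very different scales
area = rng.uniform(500, 3000, size=200)
qual = rng.integers(1, 11, size=200).astype(float)
X_toy = np.column_stack([area, qual])

# Noise-free target with known per-unit slopes (assumed for illustration)
y_toy = 100.0 * area + 20_000.0 * qual

ss_toy = StandardScaler()
lr_toy = LinearRegression().fit(ss_toy.fit_transform(X_toy), y_toy)

per_sd = lr_toy.coef_                    # price change per standard deviation
per_unit = lr_toy.coef_ / ss_toy.scale_  # price change per original unit

print(per_sd)
print(per_unit)
```

Applied to the notebook's objects, the analogous expression would be `lr.coef_ / ss.scale_`.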

In [142]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test)

(0.7966144593847458, 0.8382867340555185)

Our R<sup>2</sup> of 0.80 for the training set and 0.84 for the test set indicates that the predictors I chose explain roughly 80% of the variance in sale price. Because the test score is no lower than the training score, there is no sign of overfitting, so we do not need to change any predictors.

In [143]:
ridge = Ridge(alpha = 10)
ridge.fit(Xs_train, y_train)

ridge.score(Xs_train, y_train), ridge.score(Xs_test, y_test)

(0.7966064676106996, 0.8382947231903262)

In [144]:
lasso = Lasso()
lasso.fit(Xs_train, y_train)

lasso.score(Xs_train, y_train), lasso.score(Xs_test, y_test)

(0.7966144581729373, 0.8382856643545282)

Using the Ridge and LASSO regularization tools (with alpha = 10 for Ridge and alpha = 1, the default, for LASSO) gives almost exactly the same scores as the linear model. As such, the linear model is fine to use for predictions.
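
Only one alpha per model was tried above. A rough way to see how sensitive the scores are to that choice is to sweep a few values; the sketch below does this for Ridge on synthetic data from `make_regression` (the data, the alpha grid, and the `_toy` names are all assumptions, not part of the original analysis).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing predictors
X_toy, y_toy = make_regression(n_samples=300, n_features=12,
                               noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

ss_toy = StandardScaler()
Xs_tr = ss_toy.fit_transform(X_tr)
Xs_te = ss_toy.transform(X_te)

# Fit one Ridge model per alpha and record the test R^2
alphas = [0.1, 1, 10, 100, 1000]
models = {a: Ridge(alpha=a).fit(Xs_tr, y_tr) for a in alphas}
test_scores = {a: m.score(Xs_te, y_te) for a, m in models.items()}

for a, s in test_scores.items():
    print(f"alpha={a:>6}: test R^2 = {s:.4f}")
```

If the scores barely move across several orders of magnitude, as they did for the two values tried in this notebook, regularization is not buying much and the unregularized model is a reasonable choice.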

In [145]:
lr_pred = lr.predict(Xs_train)
lr_test_pred = lr.predict(Xs_test)

print(metrics.mean_squared_error(y_train, lr_pred, squared = False))
print(metrics.mean_squared_error(y_test, lr_test_pred, squared = False))

35865.2473217198
31510.68790297132


In [146]:
ridge_pred = ridge.predict(Xs_train)
ridge_test_pred = ridge.predict(Xs_test)

print(metrics.mean_squared_error(y_train, ridge_pred, squared = False))
print(metrics.mean_squared_error(y_test, ridge_test_pred, squared = False))

35865.95195425227
31509.9095307043


In [147]:
lasso_pred = lasso.predict(Xs_train)
lasso_test_pred = lasso.predict(Xs_test)

print(metrics.mean_squared_error(y_train, lasso_pred, squared = False))
print(metrics.mean_squared_error(y_test, lasso_test_pred, squared = False))

35865.24742856567
31510.792121255807


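All three models lead to an RMSE of less than half the baseline for both the training and test data, indicating that they do their job well as predictors. The differences between them are negligible (under 1 in RMSE): the linear model has the lowest training RMSE, while Ridge has a marginally lower test RMSE. With nothing gained from regularization, the plain linear model is the simplest choice for predicting price.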
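
The RMSE and R<sup>2</sup> figures are two views of the same thing: on a given set of rows, R<sup>2</sup> = 1 - (RMSE / baseline RMSE)<sup>2</sup>. With the notebook's numbers, 1 - (35865 / 79239)<sup>2</sup> is roughly 0.795, close to the training R<sup>2</sup>; the match is not exact because the baseline was computed on the full data rather than the training split. A minimal sketch with hypothetical values (the arrays are invented) checks the identity against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted prices (in thousands)
y_true = np.array([200.0, 150.0, 320.0, 275.0, 180.0])
y_pred = np.array([210.0, 140.0, 300.0, 280.0, 195.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
baseline_rmse = np.sqrt(np.mean((y_true - y_true.mean()) ** 2))

# R^2 rebuilt from the two RMSE values
r2_from_rmse = 1 - (rmse / baseline_rmse) ** 2

print(r2_from_rmse, r2_score(y_true, y_pred))
```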

In [148]:
# Ordinal encoding of garage quality
garage_qual = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0}
train['Garage Qual'] = train['Garage Qual'].map(lambda x: garage_qual.get(x))
test['Garage Qual'] = test['Garage Qual'].map(lambda x: garage_qual.get(x))

In [149]:
# Ordinal encoding of garage condition
garage_cond = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': 0}
train['Garage Cond'] = train['Garage Cond'].map(lambda x: garage_cond.get(x))
test['Garage Cond'] = test['Garage Cond'].map(lambda x: garage_cond.get(x))

In [150]:
# Replacing overall quality and condition with garage quality and condition
X_garage = train[['Garage Cars', 'AttachedGarage', 'Garage Finish', 
                 'Garage Qual', 'Garage Cond', 'Year Built', 'Total Bsmt SF',
                 'Residential', 'Expensive Nbrhd', 'PosProx', 'Gr Liv Area', 'WoodRoof']]

y_garage = train['SalePrice']

X_garage_train, X_garage_test, y_garage_train, y_garage_test = train_test_split(X_garage, y_garage, random_state = 42)

# Scale the data
ss_garage = StandardScaler()
Xs_garage_train = ss_garage.fit_transform(X_garage_train)
Xs_garage_test = ss_garage.transform(X_garage_test)

# Get R2 score
lr_garage = LinearRegression()
lr_garage.fit(Xs_garage_train, y_garage_train)
print(lr_garage.score(Xs_garage_train, y_garage_train))
print(lr_garage.score(Xs_garage_test, y_garage_test))

# Get RMSE (separate names so the earlier lr_pred / lr_test_pred are not overwritten)
lr_garage_pred = lr_garage.predict(Xs_garage_train)
print(metrics.mean_squared_error(y_garage_train, lr_garage_pred, squared = False))

lr_garage_test_pred = lr_garage.predict(Xs_garage_test)
print(metrics.mean_squared_error(y_garage_test, lr_garage_test_pred, squared = False))

0.7489726275337555
0.7913189382481982
39845.04604877226
35795.34718233087


The R<sup>2</sup> for the model that uses garage quality and condition is around 0.05 lower, for both the training and test data, than for the model that uses overall quality and condition. This result indicates that the overall measures are better at predicting price. The fact that the RMSE is about 4,000 higher for both training and test data supports this point.
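
Since the two feature sets go through the same split, scale, fit, and score steps, the comparison can be wrapped in one helper. The following is a sketch only: `evaluate_features` is a hypothetical function name, and the toy frame and its coefficients are invented so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_features(df, features, target='SalePrice', random_state=42):
    """Split, scale, fit a linear model, and return R^2 and RMSE for both splits."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df[target], random_state=random_state)
    scaler = StandardScaler()
    Xs_tr = scaler.fit_transform(X_tr)
    Xs_te = scaler.transform(X_te)
    model = LinearRegression().fit(Xs_tr, y_tr)

    def rmse(y, pred):
        return float(np.sqrt(np.mean((y - pred) ** 2)))

    return {
        'train_r2': model.score(Xs_tr, y_tr),
        'test_r2': model.score(Xs_te, y_te),
        'train_rmse': rmse(y_tr, model.predict(Xs_tr)),
        'test_rmse': rmse(y_te, model.predict(Xs_te)),
    }


# Hypothetical toy data: price depends on overall quality, not garage quality
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Overall Qual': rng.integers(1, 11, 200),
    'Garage Qual': rng.integers(0, 6, 200),
})
toy['SalePrice'] = 20_000 * toy['Overall Qual'] + rng.normal(0, 5_000, 200)

overall = evaluate_features(toy, ['Overall Qual'])
garage = evaluate_features(toy, ['Garage Qual'])
print(overall)
print(garage)
```

Using the same `random_state` for every call keeps the rows in each split identical, so differences in the scores come from the feature set alone.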

In [151]:
# Predict the sale price for the test file using the best linear regression model
# (named X_sub / y_sub_pred so the validation split X_test / y_test is not overwritten)
X_sub = test[['Garage Cars', 'AttachedGarage', 'Garage Finish', 
                 'Overall Qual', 'Overall Cond', 'Year Built', 'Total Bsmt SF',
                 'Residential', 'Expensive Nbrhd', 'PosProx', 'Gr Liv Area', 'WoodRoof']]


Xs_sub = ss.transform(X_sub)

y_sub_pred = lr.predict(Xs_sub)

In [152]:
# Create new file with predicted sale price for each ID in the test set

data = {'ID': test['Id'], 'SalePrice': y_sub_pred}

test_data = pd.DataFrame(data)

test_data.to_csv('../datasets/sub_reg.csv', index = False)

## Conclusions & Recommendations

Three of the seven garage variables in the initial data have positive coefficients in the linear model: the number of cars that can fit, the interior finish, and the garage type (attached or not). Garage area probably has a positive effect as well, given that it is largely redundant with the number of cars. How much garage quality and condition affect price remains unclear, both because they are collinear with overall quality and condition and because the linear model built on the garage-specific variables is a weaker predictor than the one built on the overall measures. The seventh variable, the year the garage was built, did not have enough data to use as a predictor. Further study will involve researching garage construction years, as well as the material of the garage's roof: newer houses and houses with wood roofs are more expensive, and I would like to see whether the same holds for the garage.

It is notable that of the three garage variables in the model, the number of cars has the largest coefficient. Since total basement area and above-ground living area also have high coefficients, it may be that the size of the house and its enclosed components drives much of the price. This does not hold for open spaces: the visualizations showed that the area of the open porch and the pool does not correlate with price.

In conclusion, my recommendation to those who want to estimate house prices from the garage is to look first at the number of cars that can fit, then at how finished the interior is, and then at whether the garage is attached to the house. Going by these coefficients, a house with a detached three-car garage would be expected to sell for more than one with an attached garage that fits only one car.