- Load the houseprices data from Thinkful's database.
- Reimplement your model from the previous checkpoint.
- Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Which model is the best? Why?
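
The k-fold procedure the last task asks for can be sketched by hand as follows. This is a minimal illustration, not the checkpoint's solution: it assumes synthetic data from `make_regression` in place of the houseprices table, and the names `X_demo`, `y_demo`, and `candidate_alphas` are hypothetical. Each candidate alpha is scored by its mean validation R-squared across five folds, and the alpha with the highest mean is kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for the house price features (an assumption, not the real data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hypothetical grid of penalty strengths to compare.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

mean_cv_r2 = {}
for alpha in candidate_alphas:
    fold_scores = []
    for train_idx, val_idx in kf.split(X_demo):
        # Fit on four folds, score on the held-out fold.
        model = Ridge(alpha=alpha)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
    mean_cv_r2[alpha] = np.mean(fold_scores)

# Keep the alpha with the highest average validation R-squared.
best_alpha = max(mean_cv_r2, key=mean_cv_r2.get)
print("Best alpha by 5-fold CV:", best_alpha)
```

The test set plays no part in this loop, so it remains available for an unbiased final comparison between models.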

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [6]:
# Build each set of dummies once and reuse it for both the concat and the
# list of column names. dtype=float keeps the columns numeric for statsmodels.
mszoning_dummies = pd.get_dummies(house_df.mszoning, prefix="mszoning", drop_first=True, dtype=float)
street_dummies = pd.get_dummies(house_df.street, prefix="street", drop_first=True, dtype=float)
house_df = pd.concat([house_df, mszoning_dummies, street_dummies], axis=1)
dummy_column_names = list(mszoning_dummies.columns) + list(street_dummies.columns)

In [7]:
house_df['totalsqft'] = house_df['totalbsmtsf'] + house_df['firstflrsf'] + house_df['secondflrsf']

house_df['totalsqft_overallqual_interaction'] = house_df['totalsqft'] * house_df['overallqual']

# Y is the target variable
Y = np.log1p(house_df['saleprice'])
# X is the feature set
X = house_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalsqft', 'totalsqft_overallqual_interaction'] + dummy_column_names]

X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

results = sm.OLS(y_train, X_train).fit()

results.summary()

Dep. Variable:      saleprice           R-squared:             0.832
Model:              OLS                 Adj. R-squared:        0.831
Method:             Least Squares       F-statistic:           520.9
Date:               Tue, 06 Aug 2019    Prob (F-statistic):    0.0
Time:               19:28:14            Log-Likelihood:        463.99
No. Observations:   1168                AIC:                   -904.0
Df Residuals:       1156                BIC:                   -843.2
Df Model:           11
Covariance Type:    nonrobust

                                          coef     std err          t      P>|t|      [0.025      0.975]
const                                   9.9162       0.102     97.518      0.000       9.717      10.116
overallqual                             0.1893       0.009     20.123      0.000       0.171       0.208
grlivarea                             9.58e-05    1.89e-05      5.074      0.000    5.88e-05       0.000
garagecars                              0.0779       0.015      5.244      0.000       0.049       0.107
garagearea                              0.0001    5.04e-05      2.132      0.033    8.57e-06       0.000
totalsqft                               0.0003    2.58e-05     11.139      0.000       0.000       0.000
totalsqft_overallqual_interaction   -2.572e-05    3.02e-06     -8.526      0.000   -3.16e-05   -1.98e-05
mszoning_FV                             0.3911       0.065      6.055      0.000       0.264       0.518
mszoning_RH                             0.2650       0.074      3.593      0.000       0.120       0.410

Omnibus:            350.711             Durbin-Watson:         1.876
Prob(Omnibus):      0.0                 Jarque-Bera (JB):      2714.386
Skew:               -1.167              Prob(JB):              0.0
Kurtosis:           10.094              Cond. No.              533000.0


In [8]:
# We fit an OLS model using sklearn
lrm = LinearRegression()
lrm.fit(X_train, y_train)


# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in the training set is: 0.8321322553132751
-----Test set statistics-----
R-squared of the model in the test set is: 0.8249302330916699
Mean absolute error of the prediction is: 0.12570372872861824
Mean squared error of the prediction is: 0.02919212187135251
Root mean squared error of the prediction is: 0.17085702172094805
Mean absolute percentage error of the prediction is: 1.0503577667823942
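
Because the target is `np.log1p(saleprice)`, the errors above are in log units: the MAPE of about 1.05% is relative to the log price (roughly 12), not to the dollar price. A rough sketch of how the same metrics look after inverting the transform with `np.expm1` is given below. The prices and prediction errors are made-up illustrative values, not results from the houseprices data.

```python
import numpy as np

# Hypothetical sale prices and log-scale predictions (illustrative values only).
actual_price = np.array([120000.0, 185000.0, 250000.0, 320000.0])
y_true_log = np.log1p(actual_price)
y_pred_log = y_true_log + np.array([0.10, -0.05, 0.12, -0.15])

# np.expm1 is the inverse of np.log1p, so this returns predictions in dollars.
pred_price = np.expm1(y_pred_log)

mae_log = np.mean(np.abs(y_true_log - y_pred_log))
mae_dollars = np.mean(np.abs(actual_price - pred_price))
mape_dollars = np.mean(np.abs((actual_price - pred_price) / actual_price)) * 100

print("MAE on the log scale: {:.3f}".format(mae_log))
print("MAE in dollars: {:.0f}".format(mae_dollars))
print("MAPE in dollars: {:.2f}%".format(mape_dollars))
```

A log-scale error of around 0.1 corresponds to a price error of roughly 10%, which is a more intuitive way to read the test-set statistics.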


## Ridge Regression

In [11]:
from sklearn.linear_model import Ridge

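# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.
# Note: alpha is fixed by hand here rather than tuned. At 10**37 the penalty
# shrinks every coefficient to effectively zero, so the model predicts
# the training mean.
ridgeregr = Ridge(alpha=10**37)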
ridgeregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridgeregr.predict(X_train)
y_preds_test = ridgeregr.predict(X_test)

print("R-squared of the model on the training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))


R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0013312851260176561
Mean absolute error of the prediction is: 0.3178243812258433
Mean squared error of the prediction is: 0.1669676348189956
Root mean squared error of the prediction is: 0.4086167334055173
Mean absolute percentage error of the prediction is: 2.6437648221337366
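
A training R-squared of exactly 0.0 means the model is predicting the training mean for every house, which is what an extreme penalty produces. The sketch below reproduces that effect and then lets `RidgeCV` pick alpha by 5-fold cross-validation instead. It assumes synthetic data from `make_regression` rather than the houseprices table, and the alpha grid and the names `X_tr`, `X_te`, `huge_ridge`, and `ridge_cv` are illustrative choices, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the houseprices table).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

# Same magnitude as the alpha used above: slopes collapse to ~0, leaving only the intercept.
huge_ridge = Ridge(alpha=1e37).fit(X_tr, y_tr)
print("Training R-squared with alpha=1e37:", huge_ridge.score(X_tr, y_tr))

# Let 5-fold cross-validation choose alpha from a log-spaced grid (hypothetical grid).
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
print("Alpha chosen by CV:", ridge_cv.alpha_)
print("Test R-squared with the chosen alpha:", ridge_cv.score(X_te, y_te))
```

On the real data, `X_train` and `y_train` would take the place of `X_tr` and `y_tr`, and the grid may need widening if the chosen alpha lands on an edge.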


## Lasso Regression

In [9]:
from sklearn.linear_model import Lasso

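# alpha is fixed by hand rather than tuned. At this magnitude the L1 penalty
# sets every coefficient to zero, so the model predicts the training mean.
lassoregr = Lasso(alpha=10**20.5)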
lassoregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lassoregr.predict(X_train)
y_preds_test = lassoregr.predict(X_test)

print("R-squared of the model on the training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0013312851260176561
Mean absolute error of the prediction is: 0.3178243812258433
Mean squared error of the prediction is: 0.1669676348189956
Root mean squared error of the prediction is: 0.4086167334055173
Mean absolute percentage error of the prediction is: 2.6437648221337366
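
For lasso, a cross-validated alpha can be obtained with `LassoCV`. The following is a hedged sketch on synthetic data in which only some features are informative, so the L1 penalty has something to prune. The data, the alpha grid, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic data where only 4 of 10 features carry signal (an assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=15.0, random_state=2
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

# 5-fold CV over a log-spaced alpha grid (hypothetical grid).
lasso_cv = LassoCV(alphas=np.logspace(-3, 2, 30), cv=5, max_iter=10000)
lasso_cv.fit(X_tr, y_tr)

# Count how many coefficients the penalty drove exactly to zero.
n_zero = int(np.sum(lasso_cv.coef_ == 0))
print("Alpha chosen by CV:", lasso_cv.alpha_)
print("Coefficients set to zero:", n_zero, "of", lasso_cv.coef_.size)
print("Test R-squared:", lasso_cv.score(X_te, y_te))
```

Unlike the fixed alpha above, a cross-validated alpha typically leaves the informative coefficients in place and zeroes out only the weak ones.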


## ElasticNet Regression

In [10]:
from sklearn.linear_model import ElasticNet

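# alpha is fixed by hand rather than tuned; l1_ratio=0.5 weights the L1 and
# L2 penalties equally. At this magnitude every coefficient is shrunk to
# zero, so the model predicts the training mean.
elasticregr = ElasticNet(alpha=10**21, l1_ratio=0.5)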
elasticregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticregr.predict(X_train)
y_preds_test = elasticregr.predict(X_test)

print("R-squared of the model on the training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.0
-----Test set statistics-----
R-squared of the model on the test set is: -0.0013312851260176561
Mean absolute error of the prediction is: 0.3178243812258433
Mean squared error of the prediction is: 0.1669676348189956
Root mean squared error of the prediction is: 0.4086167334055173
Mean absolute percentage error of the prediction is: 2.6437648221337366
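
ElasticNet has two hyperparameters, `alpha` and `l1_ratio`, and `ElasticNetCV` can search both with k-fold cross-validation. The sketch below is one possible setup, assuming synthetic data and illustrative grids for both parameters. None of these values come from the houseprices model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the houseprices table).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=6, noise=15.0, random_state=3
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

# Hypothetical grids: l1_ratio near 0 behaves like ridge, 1.0 is pure lasso.
l1_ratios = [0.1, 0.5, 0.9, 1.0]
enet_cv = ElasticNetCV(
    l1_ratio=l1_ratios, alphas=np.logspace(-3, 2, 30), cv=5, max_iter=10000
)
enet_cv.fit(X_tr, y_tr)

print("Alpha chosen by CV:", enet_cv.alpha_)
print("l1_ratio chosen by CV:", enet_cv.l1_ratio_)
print("Test R-squared:", enet_cv.score(X_te, y_te))
```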


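OLS is the best model for this data, and not by a close margin: its test R-squared is 0.825 with an RMSE of 0.171, while the ridge, lasso, and ElasticNet models all report a test R-squared of about -0.001 and an RMSE of 0.409. The three penalized models produce identical metrics because their alpha values (10**37, 10**20.5, and 10**21) are so large that every coefficient is shrunk to zero and each model predicts only the training mean. These alphas were set by hand rather than chosen by k-fold cross-validation, so the comparison shows the effect of extreme regularization, not the best that each method can do.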