### House prices model

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the houseprices data from Thinkful's database.
- Reimplement your model from the previous checkpoint.
- Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do k-fold cross-validation to choose the best hyperparameter values for your models. Scikit-learn has RidgeCV, LassoCV, and ElasticNetCV that you can utilize to do this. Which model is the best? Why?

In [1]:
# Load the houseprices data from Thinkful's database.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'  
postgres_pw = '7*.8G9QH21'  
postgres_host = '142.93.121.174'  
postgres_port = '5432'  
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

hp_df = pd.read_sql_query('select * from houseprices', con=engine)
engine.dispose()

hp_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [2]:
# Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?

hp_sel = hp_df[['saleprice', 'lotfrontage', 'lotarea', 'masvnrarea', 'bsmtfinsf1', 
                'totalbsmtsf', 'firstflrsf', 'secondflrsf', 'grlivarea', 'garagearea']]
hp_sel.replace(0, 1, inplace=True)
hp_sel.fillna(1, inplace=True)
hp_sellog = np.log(hp_sel)
hp_sellog_picked = hp_sellog[['saleprice', 'firstflrsf', 'grlivarea']]
print(hp_sellog_picked.head())

   saleprice  firstflrsf  grlivarea
0  12.247694    6.752270   7.444249
1  12.109011    7.140453   7.140453
2  12.317167    6.824374   7.487734
3  11.849398    6.867974   7.448334
4  12.429216    7.043160   7.695303


In [3]:
# Let's bring the model with the best performance, 'My model' for this test.

X = hp_sellog_picked[['firstflrsf', 'grlivarea']]
Y = hp_sellog_picked['saleprice']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.7, random_state=42)

print("The number of observations in the training set is {}".format(X_train.shape[0]))
print("The number of observations in the test set is {}".format(X_test.shape[0]))

The number of observations in the training set is 438
The number of observations in the test set is 1022


**Let's start with OLS regression.**

In [5]:
from sklearn.linear_model import LinearRegression
lrm = LinearRegression()
lrm.fit(X_train, y_train)

lrm_y_train = lrm.predict(X_train)
lrm_y_test = lrm.predict(X_test)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, lrm_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, lrm_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, lrm_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - lrm_y_test) / y_test)) * 100))

R-squared of the model in the training set is: 0.5866184638252316
-----Test set statistics-----
R-squared of the model in the test set is: 0.5978770864509493
Mean absolute error of the prediction is: 0.1937960585344992
Mean squared error of the prediction is: 0.06518233114128323
Root mean squared error of the prediction is: 0.25530830605619403
Mean absolute percentage error of the prediction is: 1.6164132369121786


**This time, I will do Ridge Regression with built in cross-validation method.**

In [17]:
from sklearn import linear_model

alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

rgm_cv = linear_model.RidgeCV(alphas=alphas, cv=5)
rgm_cv.fit(X_train, y_train)

rgm_y_train = rgm_cv.predict(X_train)
rgm_y_test = rgm_cv.predict(X_test)

print("Best alpha value is: {}".format(rgm_cv.alpha_))
print("R-squared of the model in the training set is: {}".format(rgm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(rgm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, rgm_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, rgm_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, rgm_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - rgm_y_test) / y_test)) * 100))

Best alpha value is: 1.0
R-squared of the model in the training set is: 0.5763096807285116
-----Test set statistics-----
R-squared of the model in the test set is: 0.5884139603832022
Mean absolute error of the prediction is: 0.19378536745237496
Mean squared error of the prediction is: 0.06518983108133193
Root mean squared error of the prediction is: 0.2553229936400792
Mean absolute percentage error of the prediction is: 1.6160419382405196


**This time, I will do Lasso Regression with built in cross-validation method.**

In [19]:
lass_cv = linear_model.LassoCV(alphas=alphas, cv=5)
lass_cv.fit(X_train, y_train)

lass_y_train = lass_cv.predict(X_train)
lass_y_test = lass_cv.predict(X_test)

print("Best alpha value is: {}".format(lass_cv.alpha_))
print("R-squared of the model in the training set is: {}".format(lass_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lass_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, lass_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, lass_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, lass_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - lass_y_test) / y_test)) * 100))

Best alpha value is: 1e-10
R-squared of the model in the training set is: 0.5866184638252318
-----Test set statistics-----
R-squared of the model in the test set is: 0.597877086431229
Mean absolute error of the prediction is: 0.19379605854041032
Mean squared error of the prediction is: 0.06518233114447981
Root mean squared error of the prediction is: 0.25530830606245425
Mean absolute percentage error of the prediction is: 1.6164132369400568


**This time, I will do ElasticNet Regression with built in cross-validation method.**

In [20]:
elanet_cv = linear_model.ElasticNetCV(alphas=alphas, cv=5)
elanet_cv.fit(X_train, y_train)

elanet_y_train = elanet_cv.predict(X_train)
elanet_y_test = elanet_cv.predict(X_test)

print("Best alpha value is: {}".format(elanet_cv.alpha_))
print("R-squared of the model in the training set is: {}".format(elanet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(elanet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, elanet_y_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, elanet_y_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, elanet_y_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - elanet_y_test) / y_test)) * 100))

Best alpha value is: 0.0001
R-squared of the model in the training set is: 0.5866179570465635
-----Test set statistics-----
R-squared of the model in the test set is: 0.5978976999993088
Mean absolute error of the prediction is: 0.1937876078437501
Mean squared error of the prediction is: 0.06517898977701404
Root mean squared error of the prediction is: 0.25530176218940215
Mean absolute percentage error of the prediction is: 1.6163245391166192


Thress models are showing very similar model performance.  
I think this means that the data is quite neat and nice.