## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Scikit-learn has RidgeCV, LassoCV, and ElasticNetCV that you can utilize to do this. Which model is the best? Why?
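
To make the k-fold idea in the last task concrete, here is a minimal sketch of choosing a ridge penalty by 5-fold cross-validation. It uses only NumPy, a closed-form ridge solver, and synthetic data; the names `ridge_fit`, `kfold_mse`, `X_demo`, `y_demo`, and `alpha_grid` are illustrative assumptions and not part of the assignment or of scikit-learn. The `*CV` estimators handle the same kind of search for you.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept (data is centered first)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def kfold_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE of ridge(alpha) over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, intercept = ridge_fit(X[train_idx], y[train_idx], alpha)
        preds = X[val_idx] @ coef + intercept
        errors.append(np.mean((y[val_idx] - preds) ** 2))
    return float(np.mean(errors))

# Synthetic stand-in data (not the houseprices table): linear signal plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, 1.0, -0.3]) + rng.normal(scale=0.2, size=200)

# Score every candidate alpha with the same folds and keep the lowest CV error
alpha_grid = [10.0 ** p for p in range(-4, 5)]
cv_scores = {a: kfold_mse(X_demo, y_demo, a) for a in alpha_grid}
best_alpha = min(cv_scores, key=cv_scores.get)
print(f"Best alpha by 5-fold CV: {best_alpha}")
```

Each alpha is judged only on rows the model did not see while fitting, which is what keeps the chosen value from simply rewarding overfitting to the training data.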



In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import seaborn as sns
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import (LinearRegression, ElasticNet, Lasso, Ridge, RidgeCV, LassoCV, ElasticNetCV)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import warnings

sns.set_style('dark')

warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
hp_df = pd.read_sql_query('select * from houseprices',con=engine, index_col='id')

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [3]:
# get total home sq ft
hp_df['totalsqft'] = hp_df.totalbsmtsf + hp_df.firstflrsf + hp_df.secondflrsf

In [None]:
# current model

# adjusting target to log(x+1)
y = np.log1p(hp_df.saleprice)

# choosing features based on a Random Forest Regression I ran in another notebook
X = hp_df[['garagecars', 'overallqual', 'totalsqft']] 

# Splitting up train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


In [11]:
# Linear Regression
# adding constant
X_train = sm.add_constant(X_train)

results = sm.OLS(y_train, X_train).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.816
Model:                            OLS   Adj. R-squared:                  0.815
Method:                 Least Squares   F-statistic:                     1720.
Date:                Thu, 19 Mar 2020   Prob (F-statistic):               0.00
Time:                        20:51:13   Log-Likelihood:                 403.28
No. Observations:                1168   AIC:                            -798.6
Df Residuals:                    1164   BIC:                            -778.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          10.5421      0.023    454.624      

In [36]:
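# Add constant to X_test too!
X_test = sm.add_constant(X_test)

# Note: this fits a new OLS model on the test set rather than scoring the
# training fit, so the summary below is not a held-out evaluation
results = sm.OLS(y_test, X_test).fit()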

print(results.summary())


                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.748
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     284.9
Date:                Thu, 19 Mar 2020   Prob (F-statistic):           7.54e-86
Time:                        21:38:20   Log-Likelihood:                 55.130
No. Observations:                 292   AIC:                            -102.3
Df Residuals:                     288   BIC:                            -87.55
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          10.6892      0.051    208.290      

In [38]:
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

# Ridge Regression
ridge_rgr = RidgeCV(alphas=alphas, cv=5)
ridge_rgr.fit(X_train, y_train)

y_preds = ridge_rgr.predict(X_test)

print(f"The best alpha value is: {ridge_rgr.alpha_}")
print(f"R-squared of the model on the training set is: {ridge_rgr.score(X_train, y_train)}")
print("-----Test set statistics-----")
print(f"R-squared of the model on the test set is: {ridge_rgr.score(X_test, y_test)}")
print(f"Mean absolute error of the prediction is: {round(mean_absolute_error(y_test, y_preds), 2)}")
print(f"Mean squared error of the prediction is: {round(mse(y_test, y_preds),2)}")
# round() has no ndigits argument here, so the RMSE is rounded to a whole number (prints 0.0)
print(f"Root mean squared error of the prediction is: {round(rmse(y_test, y_preds))}")
print(f"Mean absolute percentage error of the prediction is: {round(np.mean(np.abs((y_test - y_preds) / y_test) * 100),2)}") 


The best alpha value is: 10.0
R-squared of the model on the training set is: 0.8159309108448491
-----Test set statistics-----
R-squared of the model on the test set is: 0.6845628966881296
Mean absolute error of the prediction is: 0.13
Mean squared error of the prediction is: 0.05
Root mean squared error of the prediction is: 0.0
Mean absolute percentage error of the prediction is: 1.12
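
Because the target was transformed with `np.log1p`, the error figures above are in log units, and the percentage error is a percentage of the log price rather than of the sale price. The sketch below shows one way to convert predictions back to price units with `np.expm1` before computing errors. The prices and log-scale errors are made-up illustrative values, not results from this notebook.

```python
import numpy as np

# Hypothetical sale prices and log-scale prediction errors, for illustration only
prices = np.array([120000.0, 185000.0, 250000.0, 310000.0])
y_true_log = np.log1p(prices)  # same transform as the target above
y_pred_log = y_true_log + np.array([0.10, -0.15, 0.05, -0.12])

# Error measured on the log scale (the scale the metrics above are reported on)
mae_log = np.mean(np.abs(y_true_log - y_pred_log))

# Undo log1p with expm1 to express the errors in the original price units
pred_prices = np.expm1(y_pred_log)
mae_price = np.mean(np.abs(prices - pred_prices))
mape_price = np.mean(np.abs((prices - pred_prices) / prices)) * 100

print(f"MAE on log scale: {mae_log:.3f}")
print(f"MAE in price units: {mae_price:,.0f}")
print(f"MAPE in price units: {mape_price:.1f}%")
```

A small-looking log-scale error can correspond to a much larger percentage miss on the actual price, so reporting both scales makes the results easier to interpret.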


In [39]:
# Lasso Regression
LASSO_rgr = LassoCV(alphas=alphas, cv=5)
LASSO_rgr.fit(X_train, y_train)

y_preds = LASSO_rgr.predict(X_test)

print(f"The best alpha value is: {LASSO_rgr.alpha_}")
print(f"R-squared of the model on the training set is: {LASSO_rgr.score(X_train, y_train)}")
print("-----Test set statistics-----")
print(f"R-squared of the model on the test set is: {LASSO_rgr.score(X_test, y_test)}")
print(f"Mean absolute error of the prediction is: {round(mean_absolute_error(y_test, y_preds), 2)}")
print(f"Mean squared error of the prediction is: {round(mse(y_test, y_preds),2)}")
# round() has no ndigits argument here, so the RMSE is rounded to a whole number (prints 0.0)
print(f"Root mean squared error of the prediction is: {round(rmse(y_test, y_preds))}")
print(f"Mean absolute percentage error of the prediction is: {round(np.mean(np.abs((y_test - y_preds) / y_test) * 100),2)}") 


The best alpha value is: 0.0001
R-squared of the model on the training set is: 0.8159467570960331
-----Test set statistics-----
R-squared of the model on the test set is: 0.6856341188855839
Mean absolute error of the prediction is: 0.13
Mean squared error of the prediction is: 0.05
Root mean squared error of the prediction is: 0.0
Mean absolute percentage error of the prediction is: 1.12


In [40]:
# ElasticNet Regression
ElasticNet_rgr = ElasticNetCV(alphas=alphas, cv=5)
ElasticNet_rgr.fit(X_train, y_train)

y_preds = ElasticNet_rgr.predict(X_test)

print(f"The best alpha value is: {ElasticNet_rgr.alpha_}")
print(f"R-squared of the model on the training set is: {ElasticNet_rgr.score(X_train, y_train)}")
print("-----Test set statistics-----")
print(f"R-squared of the model on the test set is: {ElasticNet_rgr.score(X_test, y_test)}")
print(f"Mean absolute error of the prediction is: {round(mean_absolute_error(y_test, y_preds), 2)}")
print(f"Mean squared error of the prediction is: {round(mse(y_test, y_preds),2)}")
# round() has no ndigits argument here, so the RMSE is rounded to a whole number (prints 0.0)
print(f"Root mean squared error of the prediction is: {round(rmse(y_test, y_preds))}")
print(f"Mean absolute percentage error of the prediction is: {round(np.mean(np.abs((y_test - y_preds) / y_test) * 100),2)}") 


The best alpha value is: 0.001
R-squared of the model on the training set is: 0.815941286733658
-----Test set statistics-----
R-squared of the model on the test set is: 0.6850689330060431
Mean absolute error of the prediction is: 0.13
Mean squared error of the prediction is: 0.05
Root mean squared error of the prediction is: 0.0
Mean absolute percentage error of the prediction is: 1.12


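Looks as though OLS was the best version: its R-squared on the test data (0.748) is higher than that of Ridge (0.685), Lasso (0.686), or ElasticNet (0.685). One caveat is that the OLS figure comes from a model fit on the test set itself, while the other three are scored on held-out data, so the comparison is not like for like. The three regularized models are nearly indistinguishable from one another.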
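
To put OLS on the same footing as the regularized models, its coefficients would need to be fit on the training rows and then scored on the held-out rows without refitting. The sketch below illustrates that pattern with NumPy on synthetic data; `r_squared`, `X_demo`, `y_demo`, and the simple 80/20 slice are illustrative assumptions, not the notebook's actual split or results. With statsmodels, the analogous step would presumably be calling `predict` on the training-fit results with the test design matrix.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (not the houseprices table)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 10.5 + X_demo @ np.array([0.1, 0.2, 0.3]) + rng.normal(scale=0.15, size=300)

# Simple 80/20 split by position
X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr, y_te = y_demo[:240], y_demo[240:]

# Fit OLS on the training rows only (the column of ones is the constant term)
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

# Score the training-set coefficients on the held-out rows, with no refitting
A_te = np.column_stack([np.ones(len(X_te)), X_te])
test_r2 = r_squared(y_te, A_te @ beta)
print(f"Held-out R-squared for OLS: {test_r2:.3f}")
```

A held-out score computed this way can be compared directly with the test-set R-squared values printed for Ridge, Lasso, and ElasticNet.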