## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?
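
As a reminder of what k-fold cross-validation does when it picks a hyperparameter, here is a minimal NumPy-only sketch. It is not the code used in this notebook (the cells below rely on scikit-learn's `LassoCV`, `RidgeCV`, and `ElasticNetCV`). The helper names `ridge_fit` and `kfold_mse`, the synthetic data, and the alpha grid are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def kfold_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE of ridge(alpha) over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[val] @ coef + intercept
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption; not the houseprices data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Score every candidate alpha on the same folds and keep the best one
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {a: kfold_mse(X_demo, y_demo, a) for a in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best alpha:", best_alpha)
```

Each candidate alpha is scored only on data that was held out while fitting, which is why the selected value is less prone to overfitting than one chosen on training error.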

In [13]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV


#display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [2]:
#importing data from SQL
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'
                       .format(postgres_user, postgres_pw, postgres_host,
                              postgres_port, postgres_db))

df = pd.read_sql_query('SELECT * FROM houseprices', con=engine)

engine.dispose()

# Reimplement Model (prepping model)

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse

#one-hot encoding categorical variables and creating correlation matrix
one_hot = pd.get_dummies(df, drop_first=True)
corr_df_0 = one_hot.corr()

#finding features whose absolute correlation with the target variable is below 0.05
low_corr = corr_df_0.loc[abs(corr_df_0['saleprice']) < .05]
low_corr_list = low_corr.index.tolist()

#dropping those features to shrink the feature space
df_pca = one_hot.drop(low_corr_list, axis=1)
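
The correlation filter above can be checked on a tiny frame where the answer is known in advance. This is only a sketch: the helper name `drop_weakly_correlated` and the toy columns are made up for illustration, and the 0.05 threshold is simply the one used in this notebook.

```python
import pandas as pd

def drop_weakly_correlated(frame, target, threshold=0.05):
    """Drop columns whose absolute correlation with `target` is below `threshold`."""
    corr_with_target = frame.corr()[target]
    weak = corr_with_target.index[corr_with_target.abs() < threshold].tolist()
    return frame.drop(columns=weak), weak

# Toy data (hypothetical): "noise" is built to be uncorrelated with "target"
toy = pd.DataFrame({
    "strong": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],
    "noise": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

reduced, weak_columns = drop_weakly_correlated(toy, "target")
print(weak_columns)
print(reduced.columns.tolist())
```

The target column always survives the filter because its correlation with itself is 1.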

In [4]:
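#filling missing values
#assigning the result back avoids chained in-place fills, which newer pandas versions warn about
#lotfrontage and masvnrarea: fill with the column mean
df_pca['lotfrontage'] = df_pca['lotfrontage'].fillna(df_pca['lotfrontage'].mean())
#garageyrblt: fill with the year the house was built
df_pca['garageyrblt'] = df_pca['garageyrblt'].fillna(df_pca['yearbuilt'])
df_pca['masvnrarea'] = df_pca['masvnrarea'].fillna(df_pca['masvnrarea'].mean())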

In [5]:
#Performing PCA
#dropping any rows that still have missing values
df_pca.dropna(inplace=True)

#standardizing every column, then keeping the first 5 principal components
scaled_df = StandardScaler().fit_transform(df_pca)
sklearn_pca = PCA(n_components=5)
pca_arrays = sklearn_pca.fit_transform(scaled_df)

#storing each component as a new column (pca_1 ... pca_5)
for i in range(5):
    df_pca['pca_{}'.format(i + 1)] = pca_arrays[:, i]
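
One thing to watch in the cell above: `df_pca` still contains `saleprice` when it is scaled and passed to PCA, so the components carry some information about the target. A possible alternative, sketched here on synthetic data, is to fit the scaler and PCA on the feature columns only. The toy column names and the choice of two components are assumptions for illustration, and the results in this notebook were not produced this way.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic frame (hypothetical columns), with a target that depends on "a"
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
toy["target"] = 3.0 * toy["a"] + rng.normal(size=100)

# Fit the scaler and PCA on the features only, leaving the target out
features = toy.drop(columns=["target"])
scaled_features = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_features)

toy["pca_1"] = components[:, 0]
toy["pca_2"] = components[:, 1]
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` is also a quick way to judge how many components are worth keeping.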

# OLS

In [6]:
#define target variable
Y = df_pca['saleprice']

#define predictive variables
X = df_pca[['pca_1','overallqual', 'garagearea']]

#splitting data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=465)

In [7]:
#adding constant to training features
X_train = sm.add_constant(X_train)

#fitting model using training data
model = sm.OLS(y_train, X_train).fit()

#adding constant to testing features
X_test = sm.add_constant(X_test)

#predicting using testing data
y_pred = model.predict(X_test)

#examining OLS regression results
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.721 |
| Model: | OLS | Adj. R-squared: | 0.72 |
| Method: | Least Squares | F-statistic: | 1001.0 |
| Date: | Sun, 06 Oct 2019 | Prob (F-statistic): | 1.25e-321 |
| Time: | 11:22:13 | Log-Likelihood: | -14081.0 |
| No. Observations: | 1168 | AIC: | 28170.0 |
| Df Residuals: | 1164 | BIC: | 28190.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.68e+04 | 1.01e+04 | 1.663 | 0.097 | -3026.256 | 3.66e+04 |
| pca_1 | 7605.5966 | 586.547 | 12.967 | 0.000 | 6454.789 | 8756.404 |
| overallqual | 2.326e+04 | 1536.063 | 15.143 | 0.000 | 2.02e+04 | 2.63e+04 |
| garagearea | 47.4803 | 7.752 | 6.125 | 0.000 | 32.270 | 62.691 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 544.692 | Durbin-Watson: | 1.845 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 8736.075 |
| Skew: | 1.736 | Prob(JB): | 0.0 |
| Kurtosis: | 15.94 | Cond. No. | 4350.0 |
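
The OLS cell computes `y_pred` on the test set but never scores it, while the cells that follow print R-squared, MAE, MSE, RMSE, and MAPE for the test set. A small helper along these lines could be applied to `y_test` and `y_pred` so that every model is compared on the same held-out data. The name `regression_report` and the three example values are hypothetical.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R-squared, MAE, MSE, RMSE and MAPE (in percent) as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(resid)),
        "mse": np.mean(resid ** 2),
        "rmse": np.sqrt(np.mean(resid ** 2)),
        "mape": np.mean(np.abs(resid / y_true)) * 100,
    }

# Tiny made-up example: prices and predictions
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
for name, value in report.items():
    print("{}: {:.3f}".format(name, value))
```

In this notebook the call would look like `regression_report(y_test, y_pred)`, assuming `y_pred` comes from the fitted OLS model above.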


# Lasso Regression (L1)

In [9]:
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)


print("Best alpha value is: {}".format(lasso_cv.alpha_))
print("R-squared of the model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 10584.068587862395
R-squared of the model in training set is: 0.7034258193807779
-----Test set statistics-----
R-squared of the model in test set is: 0.6912667247415787
Mean absolute error of the prediction is: 28141.67955669375
Mean squared error of the prediction is: 2072743849.4481595
Root mean squared error of the prediction is: 45527.3966908735
Mean absolute percentage error of the prediction is: 15.444157839113066
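
The selected alpha is large (about 10,584), which may partly reflect that the features are on very different scales. The L1 penalty acts on the raw coefficients, so its effect depends on each feature's units. A common way to address this is to standardize inside a pipeline before the penalized fit. The sketch below uses synthetic data with made-up columns `quality` and `area`; it is an assumption about what could be tried, not what was run above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical, not the real data)
rng = np.random.default_rng(7)
n = 300
quality = rng.integers(1, 11, size=n).astype(float)
area = rng.normal(500.0, 200.0, size=n)
X_demo = np.column_stack([quality, area])
y_demo = 20000.0 * quality + 50.0 * area + rng.normal(scale=10000.0, size=n)

# The scaler is refit inside each CV fold's training data by the pipeline's fit
scaled_lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
scaled_lasso.fit(X_demo, y_demo)

lasso_alpha = scaled_lasso.named_steps["lassocv"].alpha_
lasso_r2 = scaled_lasso.score(X_demo, y_demo)
print("alpha:", lasso_alpha, "R-squared:", lasso_r2)
```

With standardized inputs, alpha values become comparable across features and across models.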


# Ridge Regression (L2)

In [12]:
ridge_cv = RidgeCV(cv=5)
ridge_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)


print("Best alpha value is: {}".format(ridge_cv.alpha_))
print("R-squared of the model in training set is: {}".format(ridge_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 10.0
R-squared of the model in training set is: 0.7205599865814909
-----Test set statistics-----
R-squared of the model in test set is: 0.7004688187241412
Mean absolute error of the prediction is: 28975.240654065823
Mean squared error of the prediction is: 2010963713.5414124
Root mean squared error of the prediction is: 44843.77006387189
Mean absolute percentage error of the prediction is: 16.475745046198305
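
The best alpha reported above, 10.0, is the largest value in `RidgeCV`'s default grid of (0.1, 1.0, 10.0), so the true optimum may lie beyond the values that were searched. A wider, log-spaced grid is one way to check. This sketch uses synthetic data, and the grid bounds are an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data (hypothetical, not the houseprices data)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=150)

# 13 candidate alphas from 0.001 to 1000, evenly spaced on a log scale
alpha_grid = np.logspace(-3, 3, 13)
ridge_wide = RidgeCV(alphas=alpha_grid, cv=5)
ridge_wide.fit(X_demo, y_demo)

print("Best alpha value is: {}".format(ridge_wide.alpha_))
```

If the selected alpha again lands on an edge of the grid, the grid can be extended further in that direction.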


# ElasticNet Regression (L1 + L2) 

In [15]:
elasticnet_cv = ElasticNetCV(cv=5)

elasticnet_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

Best alpha value is: 21168.13717572479
R-squared of the model in training set is: 0.37619521672984835
-----Test set statistics-----
R-squared of the model in test set is: 0.36406552510978346
Mean absolute error of the prediction is: 42561.1127877334
Mean squared error of the prediction is: 4269475878.0938587
Root mean squared error of the prediction is: 65341.226481401914
Mean absolute percentage error of the prediction is: 25.06947098940044
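
`ElasticNetCV` searches over alpha only unless it is told otherwise. The mix between the L1 and L2 penalties, `l1_ratio`, stays at its default of 0.5. Passing a list of ratios lets cross-validation choose that as well, which might improve on the result above. The sketch below uses synthetic data, and the list of ratios is an assumption.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data (hypothetical): two informative and two irrelevant features
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Cross-validate over both the penalty mix and the penalty strength
l1_grid = [0.1, 0.5, 0.7, 0.9, 0.95, 1.0]
enet_wide = ElasticNetCV(l1_ratio=l1_grid, cv=5, random_state=0)
enet_wide.fit(X_demo, y_demo)

print("Best l1_ratio is: {}".format(enet_wide.l1_ratio_))
print("Best alpha value is: {}".format(enet_wide.alpha_))
```

An `l1_ratio` of 1.0 corresponds to a pure Lasso penalty, so including it lets the search fall back to Lasso when mixing in L2 does not help.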


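Based on these values, the OLS model has the highest R-squared (0.721), though this is a training-set figure and Ridge essentially matches it there (0.7206). Among the models scored on the test set, Ridge does best on R-squared (0.700) and RMSE (44,844), Lasso does best on mean absolute error (28,142) and MAPE (15.4%), and ElasticNet is clearly the weakest (test R-squared of 0.364). With only three predictors, regularization appears to add little over OLS, but a test-set score for the OLS model would be needed to confirm that it is the best.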