# 19.7 Overfitting and Regularization Assignment
In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?
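
As a rough illustration of the k-fold step in the last task, the sketch below scores a few ridge penalties with 5-fold cross-validation and keeps the one with the lowest mean validation error. The data is synthetic, and the names `X_demo`, `y_demo`, and `candidate_alphas` are placeholders rather than part of the assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the house-price features (an assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([0.5, 0.2, 0.0, 0.3, 0.1]) + rng.normal(scale=0.1, size=200)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for alpha in candidate_alphas:
    # scikit-learn reports negative MSE for this scorer, so flip the sign
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo,
                             cv=kfold, scoring="neg_mean_squared_error")
    cv_mse[alpha] = -scores.mean()

# The alpha with the lowest mean validation error wins
best_alpha = min(cv_mse, key=cv_mse.get)
print("Mean CV MSE per alpha:", cv_mse)
print("Best alpha:", best_alpha)
```

`RidgeCV`, `LassoCV`, and `ElasticNetCV` wrap this same loop, so the manual version is mainly useful for seeing what the `cv=5` argument does.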

## Import the Data

In [1]:
# Data handling and plotting
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine

# Modeling and evaluation
import statsmodels.api as sm
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from statsmodels.tools.eval_measures import mse, rmse

%matplotlib inline
# Silence library warnings in the notebook output
warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('select * from houseprices', con=engine)
engine.dispose()

df.head()

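| | id | mssubclass | mszoning | lotfrontage | lotarea | street | alley | lotshape | landcontour | utilities | ... | poolarea | poolqc | fence | miscfeature | miscval | mosold | yrsold | saletype | salecondition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |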


## Reimplement Prior Model: OLS

In [3]:
# Regression features
X = df[['overallqual', 'totalbsmtsf', 'firstflrsf', 'grlivarea',
        'garagecars']]

# Target variable, log-transformed with log1p
Y = np.log1p(df['saleprice'])

# Split into training (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=145)


lrm = LinearRegression()
lrm.fit(X_train, Y_train)

# Predictions of train and test data
Y_preds_train = lrm.predict(X_train)
Y_preds_test = lrm.predict(X_test)

# Print training R-squared, then test-set statistics
print("R-squared: {}".format(lrm.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared: {}".format(lrm.score(X_test, Y_test)))
print("MAE: {}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("MSE: {}".format(mse(Y_test, Y_preds_test)))
print("RMSE: {}".format(rmse(Y_test, Y_preds_test)))
print("MAPE: {}".format(np.mean(np.abs((Y_test-Y_preds_test)/Y_test))*100))

R-squared: 0.7824771950399767
-----Test set statistics-----
R-squared: 0.8396737856184402
MAE: 0.118975248334533
MSE: 0.02622661229142091
RMSE: 0.16194632534090087
MAPE: 0.9952869963860944
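
Because `Y` is `log1p(saleprice)`, the MAE, RMSE, and MAPE above are measured on the log scale, so a MAPE near 1% is a percentage of the log price, not of the dollar price. The sketch below shows how the two differ, assuming three sale prices taken from the `df.head()` output and made-up prediction errors (`log_preds` is hypothetical, not model output).

```python
import numpy as np

# Three sale prices from the preview above; the prediction errors are invented
prices = np.array([208500.0, 181500.0, 223500.0])
log_prices = np.log1p(prices)
log_preds = log_prices + np.array([0.05, -0.10, 0.02])

# Percentage error relative to the log price (what the MAPE line measures)
mape_log = np.mean(np.abs((log_prices - log_preds) / log_prices)) * 100

# Undo log1p with expm1 to express predictions in dollars
dollar_preds = np.expm1(log_preds)
mape_dollars = np.mean(np.abs((prices - dollar_preds) / prices)) * 100

print("MAPE on log scale (%):", mape_log)
print("MAPE in dollars (%):", mape_dollars)
```

Applying `np.expm1` to `Y_test` and `Y_preds_test` before computing the metrics would report errors in the original units.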


## Ridge 

In [4]:
# Candidate penalties: powers of ten from 1e-10 up to 1e49
alpha_values = [10.0**x for x in np.arange(-10, 50, 1)]

ridge = RidgeCV(alphas=alpha_values, cv=5) 
ridge.fit(X_train, Y_train)

# Predictions of train and test data
Y_preds_train = ridge.predict(X_train)
Y_preds_test = ridge.predict(X_test)

# Print training R-squared, then test-set statistics
print('Best Alpha: {}'.format(ridge.alpha_))
print("R-squared: {}".format(ridge.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared: {}".format(ridge.score(X_test, Y_test)))
print("MAE: {}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("MSE: {}".format(mse(Y_test, Y_preds_test)))
print("RMSE: {}".format(rmse(Y_test, Y_preds_test)))
print("MAPE: {}".format(np.mean(np.abs((Y_test-Y_preds_test)/Y_test))*100))

Best Alpha: 1e-10
R-squared: 0.7824771950399768
-----Test set statistics-----
R-squared: 0.8396737856184477
MAE: 0.11897524833452697
MSE: 0.02622661229141968
RMSE: 0.16194632534089706
MAPE: 0.9952869963860433
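
The selected alpha of 1e-10 is the smallest value in `alpha_values`, so the search stopped at the edge of the grid and the fit is numerically the same as OLS. Penalized regressions are also sensitive to feature scale, and the features here are unscaled. The sketch below is one possible way to standardize inside the cross-validation and flag an edge-of-grid choice. It uses synthetic data, and `alpha_grid`, `X_demo`, and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic features on very different scales (an assumption for the demo)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 500.0, 1000.0])
y_demo = (0.1 * X_demo[:, 0] + 0.0004 * X_demo[:, 1] + 0.0002 * X_demo[:, 2]
          + rng.normal(scale=0.2, size=300))

# A narrower log-spaced grid than the one in the notebook
alpha_grid = np.logspace(-4, 4, 9)

# Scaling inside the pipeline is refit on each training fold
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alpha_grid, cv=5))
model.fit(X_demo, y_demo)

chosen = model.named_steps["ridgecv"].alpha_
on_edge = chosen in (alpha_grid.min(), alpha_grid.max())
print("Chosen alpha:", chosen)
print("Chosen alpha is on the edge of the grid:", on_edge)
```

If `on_edge` is true, widening or shifting the grid is usually worth trying before drawing conclusions from the selected penalty.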


## Lasso

In [5]:
lasso = LassoCV(alphas=alpha_values, cv=5) 
lasso.fit(X_train, Y_train)

# Predictions of train and test data
Y_preds_train = lasso.predict(X_train)
Y_preds_test = lasso.predict(X_test)

# Print training R-squared, then test-set statistics
print('Best Alpha: {}'.format(lasso.alpha_))
print("R-squared: {}".format(lasso.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared: {}".format(lasso.score(X_test, Y_test)))
print("MAE: {}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("MSE: {}".format(mse(Y_test, Y_preds_test)))
print("RMSE: {}".format(rmse(Y_test, Y_preds_test)))
print("MAPE: {}".format(np.mean(np.abs((Y_test-Y_preds_test)/Y_test))*100))

Best Alpha: 1e-10
R-squared: 0.7824771950399767
-----Test set statistics-----
R-squared: 0.8396737856664666
MAE: 0.11897524830160777
MSE: 0.02622661228356463
RMSE: 0.16194632531664505
MAPE: 0.9952869961061893
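
Lasso differs from ridge in that it can shrink coefficients exactly to zero, so it helps to look at the fitted coefficients as well as the scores. The sketch below, on synthetic data with hypothetical names (`lasso_demo`, `true_coefs`), lets `LassoCV` build its own alpha grid and counts the zeroed coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
# Only the first two synthetic features carry signal
true_coefs = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# With no explicit alphas, LassoCV derives a grid from the data
lasso_demo = LassoCV(cv=5).fit(X_demo, y_demo)

n_zeroed = int(np.sum(lasso_demo.coef_ == 0))
print("Chosen alpha:", lasso_demo.alpha_)
print("Coefficients:", lasso_demo.coef_)
print("Coefficients set exactly to zero:", n_zeroed)
```

The same check on the notebook's model would be `np.sum(lasso.coef_ == 0)`.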


## ElasticNet

In [6]:
# Fit ElasticNet on the training data, choosing alpha by 5-fold CV
elasticnet = ElasticNetCV(alphas=alpha_values, cv=5) 
elasticnet.fit(X_train, Y_train)

# Predictions of train and test data
Y_preds_train = elasticnet.predict(X_train)
Y_preds_test = elasticnet.predict(X_test)

# Print training R-squared, then test-set statistics
print('Best Alpha: {}'.format(elasticnet.alpha_))
print("R-squared: {}".format(elasticnet.score(X_train, Y_train)))
print("-----Test set statistics-----")
print("R-squared: {}".format(elasticnet.score(X_test, Y_test)))
print("MAE: {}".format(mean_absolute_error(Y_test, Y_preds_test)))
print("MSE: {}".format(mse(Y_test, Y_preds_test)))
print("RMSE: {}".format(rmse(Y_test, Y_preds_test)))
print("MAPE: {}".format(np.mean(np.abs((Y_test-Y_preds_test)/Y_test))*100))

Best Alpha: 1e-10
R-squared: 0.7824771950399767
-----Test set statistics-----
R-squared: 0.839673785645547
MAE: 0.1189752483159985
MSE: 0.026226612286986712
RMSE: 0.16194632532721054
MAPE: 0.9952869962285239
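
ElasticNet has a second hyperparameter, `l1_ratio`, which sets the mix between the L1 and L2 penalties. The call above leaves it at scikit-learn's default of 0.5, so only alpha is tuned. A minimal sketch of searching both, on synthetic data with illustrative names (`enet_demo`, `l1_ratios`):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo @ np.array([1.0, 0.5, 0.0, 0.0, 0.2])
          + rng.normal(scale=0.1, size=200))

# Candidate mixes; values near 1.0 behave like Lasso
l1_ratios = [0.1, 0.5, 0.9, 1.0]

# Cross-validate over every (l1_ratio, alpha) pair
enet_demo = ElasticNetCV(l1_ratio=l1_ratios, cv=5).fit(X_demo, y_demo)
print("Chosen alpha:", enet_demo.alpha_)
print("Chosen l1_ratio:", enet_demo.l1_ratio_)
```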


## Conclusions
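
One way to compare the four models is to put their test-set statistics side by side. The sketch below copies the R-squared and RMSE values from the printed outputs above into a table. The `results` and `spread` names are illustrative.

```python
import pandas as pd

# Test-set statistics copied from the printed outputs above
results = pd.DataFrame(
    {
        "test_r2": [0.8396737856184402, 0.8396737856184477,
                    0.8396737856664666, 0.839673785645547],
        "test_rmse": [0.16194632534090087, 0.16194632534089706,
                      0.16194632531664505, 0.16194632532721054],
    },
    index=["OLS", "Ridge", "Lasso", "ElasticNet"],
)

# Largest gap between any two models on each metric
spread = results.max() - results.min()
print(results)
print(spread)
```

On these numbers the four models agree to about ten decimal places. That is consistent with every cross-validated search settling on the smallest penalty in the grid (1e-10), so none of the regularized fits differs meaningfully from OLS for this specification.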