## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/7.solution_overfitting_and_regularization.ipynb).

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet


In [2]:
import warnings
warnings.filterwarnings(action="ignore")

In [3]:
# %load ../utility/overhead.py
# Record the versions of the modules imported in cell 1.
def version_recorder():
    '''
    Only works if the imports are the first cell run. Returns a dictionary of {module: version}.
    '''
    import pkg_resources
    # keep only the import statements from cell 1
    resources = [line for line in In[1].splitlines() if line.startswith(("import ", "from "))]
    # the top-level package name is the second token, up to the first dot
    packages = [resource.split()[1].split(".")[0] for resource in resources]
    version_dict = {package: pkg_resources.get_distribution(package).version for package in packages}
    return version_dict
version_recorder()

{'numpy': '1.16.2',
 'pandas': '0.24.2',
 'sklearn': '0.0',
 'matplotlib': '3.0.3',
 'seaborn': '0.9.0',
 'sqlalchemy': '1.3.5',
 'scipy': '1.2.1',
 'statsmodels': '0.10.1'}
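
The `'sklearn': '0.0'` entry above most likely comes from looking up a distribution named `sklearn`, which is a placeholder package, rather than the installed scikit-learn release. A minimal sketch of an alternative, assuming each module exposes a `__version__` attribute (the helper name `module_versions` is hypothetical):

```python
import importlib

def module_versions(names):
    """Return {module name: version string} for the given import names."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # the module is not available in this environment
            versions[name] = "not installed"
            continue
        # not every module defines __version__, so fall back to a marker
        versions[name] = getattr(module, "__version__", "unknown")
    return versions

print(module_versions(["numpy", "pandas", "sklearn"]))
```

Reading the attribute from the imported module reports the version of the code that is actually loaded, independent of how the distribution is named.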

In [4]:
#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

engine = create_engine('{}://{}:{}@{}:{}/{}'.format(dialect, user, pw, host, port, db))
engine.table_names()

sql_query = '''
SELECT
    *
FROM
    houseprices
'''
source_df = pd.read_sql(sql_query, con=engine)
engine.dispose()
house_df = source_df.copy()

In [5]:
#create df with categorical variables to select some features from
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.DataFrame()
for feature in categorical_feat:
    new_categories_df = pd.concat([new_categories_df, 
                                   pd.get_dummies(house_df[feature], drop_first=True, prefix=feature)], axis=1)
#append numerical features to new df
new_categories_df = pd.concat([new_categories_df, house_df.select_dtypes(exclude='object')], axis=1)

#create X & y
X = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt"]].copy() #copy so the new column below is not set on a slice
#NOTE: this concat has no axis=1 and its result is not assigned, so the dummy columns never reach X
pd.concat([X, new_categories_df[['exterqual_TA', 'foundation_CBlock']]])
X["overalqual_x_year"] = house_df.overallqual * house_df.yearbuilt
y = house_df.saleprice
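
The cell above combines numeric columns, dummy variables, and an interaction term into one feature matrix. A minimal sketch of that pattern on a toy frame follows; the values and the names `toy_df`, `X_toy`, and `y_toy` are made up for illustration and are not the houseprices data.

```python
import pandas as pd

# Toy stand-in for house_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "overallqual": [7, 6, 8, 5],
    "yearbuilt": [2003, 1976, 2001, 1915],
    "exterqual": ["Gd", "TA", "Gd", "TA"],
    "saleprice": [208500, 181500, 223500, 140000],
})

# One-hot encode the categorical column, dropping the first level ("Gd").
dummies = pd.get_dummies(toy_df[["exterqual"]], drop_first=True)

# axis=1 joins column-wise on the shared index; the result has to be assigned.
X_toy = pd.concat([toy_df[["overallqual", "yearbuilt"]], dummies], axis=1)

# Interaction term between quality and construction year.
X_toy["overallqual_x_year"] = X_toy["overallqual"] * X_toy["yearbuilt"]
y_toy = toy_df["saleprice"]

print(X_toy.columns.tolist())
```

Without `axis=1`, `pd.concat` stacks the frames row-wise, which is rarely what a feature matrix needs.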

In [13]:
def test_model(X=X, y=y, kfold = 5, alpha_lambda_start=.01, alpha_lambda_stop=1_000, LASSO_amt=0, Ridge_amt=0):
    #NOTE: kfold is unused; each pass below is a single random 80/20 split, not a k-fold run
    #NOTE: np.logspace expects base-10 exponents, so passing natural logs gives a grid of
    #roughly 2.5e-5 to 8.1e6 rather than 0.01 to 1000
    for i, alpha_lambda in enumerate(np.logspace(np.log(alpha_lambda_start), np.log(alpha_lambda_stop), num=5)):
        #split into training and testing sets; the seed depends on lambda, so every lambda gets a different split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=int(42*(alpha_lambda**.5)))
        
        #sklModel keeps scikit-learn's default alpha; only the statsmodels fit uses alpha_lambda
        #no constant column is added, so the statsmodels fits have no intercept
        if LASSO_amt + Ridge_amt == 0:
            sklModel = LinearRegression()
            smModel = sm.OLS(y_train, X_train).fit()
        elif LASSO_amt == 1 and Ridge_amt == 0:
            sklModel = Lasso()
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
        elif LASSO_amt == 0 and Ridge_amt == 1:
            sklModel = Ridge()
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
        else:
            sklModel = ElasticNet()
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
               
        sklModel = sklModel.fit(X_train, y_train)
        
        #test-set predictions come from the statsmodels fit
        y_preds = smModel.predict(X_test)
        print('-----------{}-FOLD-------------'.format(i))
        print("lambda is {}".format(alpha_lambda))
        #sklModel.score returns the plain (unadjusted) R^2 on the training set
        print('adjusted R^2 is {}'.format(sklModel.score(X_train, y_train)))
        try:
            aic, bic = smModel.aic, smModel.bic
        except Exception:
            #regularized statsmodels results do not expose AIC/BIC
            aic = "not calculated"
            bic = "not calculated"
        print('AIC is {} and BIC is {}'.format(aic, bic))
        print("root mean squared error is: {}".format(rmse(y_test, y_preds)))
        print("mean squared error is: {}\n".format(mse(y_test, y_preds)))
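
The assignment asks for k-fold cross-validation, where every candidate lambda is scored on the same set of folds. A minimal sketch of that loop is below, assuming synthetic data from `make_regression` in place of the house-price features and scikit-learn's `Lasso` as the model; the names `X_demo`, `y_demo`, and `cv_rmse` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the house-price features (an assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# np.logspace takes base-10 exponents, so log10 gives a grid from 0.01 to 1000.
lambdas = np.logspace(np.log10(0.01), np.log10(1000), num=5)

# Reuse the same folds for every lambda so the scores are comparable.
folds = KFold(n_splits=5, shuffle=True, random_state=42)

cv_rmse = {}
for lam in lambdas:
    neg_mse = cross_val_score(Lasso(alpha=lam, max_iter=10_000), X_demo, y_demo,
                              cv=folds, scoring="neg_mean_squared_error")
    # scores are negated MSE per fold; convert to RMSE and average over folds
    cv_rmse[lam] = np.sqrt(-neg_mse).mean()

best_lambda = min(cv_rmse, key=cv_rmse.get)
print("best lambda:", best_lambda, "mean CV RMSE:", cv_rmse[best_lambda])
```

Holding the folds fixed means any difference in mean RMSE between two lambdas comes from the penalty rather than from the particular split.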
       

In [19]:
#Original Model
test_model(X, y, LASSO_amt=0, Ridge_amt=0)

-----------0-FOLD-------------
lambda is 2.4821602362195342e-05
adjusted R^2 is 0.7720052401391067
AIC is 27943.82126508257 and BIC is 27969.13650589951
root mean squared error is: 47355.158296741785
mean squared error is: 2242511017.3094726

-----------1-FOLD-------------
lambda is 0.018752586622904802
adjusted R^2 is 0.7420590116328556
AIC is 28098.335155602163 and BIC is 28123.650396419103
root mean squared error is: 37878.57620111166
mean squared error is: 1434786535.0234225

-----------2-FOLD-------------
lambda is 14.167477986237698
adjusted R^2 is 0.7454712541660629
AIC is 28120.617132577627 and BIC is 28145.932373394568
root mean squared error is: 35851.18727381184
mean squared error is: 1285307628.941928

-----------3-FOLD-------------
lambda is 10703.453156983138
adjusted R^2 is 0.7752317788376137
AIC is 28005.75657358243 and BIC is 28031.07181439937
root mean squared error is: 44691.359948722136
mean squared error is: 1997317654.0662448

-----------4-FOLD-------------
lambda

In [20]:
#Lasso Regression (LASSO_amt=1 sets L1_wt=1, a pure L1 penalty)
test_model(X, y, LASSO_amt=1, Ridge_amt=0)

-----------0-FOLD-------------
lambda is 2.4821602362195342e-05
adjusted R^2 is 0.7623524326063086
AIC is not calculated and BIC is not calculated
root mean squared error is: 49830.83036975853
mean squared error is: 2483111655.3396487

-----------1-FOLD-------------
lambda is 0.018752586622904802
adjusted R^2 is 0.733123480593342
AIC is not calculated and BIC is not calculated
root mean squared error is: 42194.605046887904
mean squared error is: 1780384695.062858

-----------2-FOLD-------------
lambda is 14.167477986237698
adjusted R^2 is 0.73667359401947
AIC is not calculated and BIC is not calculated
root mean squared error is: 40085.01500948741
mean squared error is: 1606808428.310831

-----------3-FOLD-------------
lambda is 10703.453156983138
adjusted R^2 is 0.7624770846457819
AIC is not calculated and BIC is not calculated
root mean squared error is: 47261.601490162706
mean squared error is: 2233658975.41495

-----------4-FOLD-------------
lambda is 8086401.093759918
adjusted R^2

In [21]:
#elasticnet
test_model(X, y, LASSO_amt=.5, Ridge_amt=.5)

-----------0-FOLD-------------
lambda is 2.4821602362195342e-05
adjusted R^2 is 0.7580737928321917
AIC is not calculated and BIC is not calculated
root mean squared error is: 49830.79153953853
mean squared error is: 2483107785.4569445

-----------1-FOLD-------------
lambda is 0.018752586622904802
adjusted R^2 is 0.7303924485695552
AIC is not calculated and BIC is not calculated
root mean squared error is: 42154.46112831652
mean squared error is: 1776998593.0187488

-----------2-FOLD-------------
lambda is 14.167477986237698
adjusted R^2 is 0.7341618152791721
AIC is not calculated and BIC is not calculated
root mean squared error is: 39974.31500466891
mean squared error is: 1597945860.0924983

-----------3-FOLD-------------
lambda is 10703.453156983138
adjusted R^2 is 0.7563862750032105
AIC is not calculated and BIC is not calculated
root mean squared error is: 44196.85555688533
mean squared error is: 1953362041.116186

-----------4-FOLD-------------
lambda is 8086401.093759918
adjusted

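All of the models produce similar results: across the runs shown, test RMSE ranges from roughly 36,000 to 50,000. In each run the lowest RMSE occurs at lambda ≈ 14.2, so lambda values around 10-100 seem best. Because each lambda is evaluated on a different train/test split, part of this variation reflects the split rather than the penalty.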
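
One way to compare the four model types on equal footing is to let each regularized model pick its own penalty by 5-fold cross-validation on the training set, then score all of them on one shared test set. The sketch below assumes synthetic data from `make_regression` and scikit-learn's `RidgeCV`, `LassoCV`, and `ElasticNetCV`; it does not reproduce the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the house-price features (an assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Candidate penalties from 0.01 to 1000 on a log scale.
alphas = np.logspace(-2, 3, num=20)
models = {
    "OLS": LinearRegression(),
    "Ridge": RidgeCV(alphas=alphas, cv=5),
    "Lasso": LassoCV(alphas=alphas, cv=5, max_iter=10_000),
    "ElasticNet": ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=10_000),
}

results = {}
for name, model in models.items():
    # the *CV estimators choose alpha by cross-validation inside fit()
    model.fit(X_train, y_train)
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    # OLS has no penalty, so its alpha is reported as None
    results[name] = {"alpha": getattr(model, "alpha_", None), "test_rmse": test_rmse}
    print(name, results[name])
```

Since every model sees the same training rows and the same test rows, the test RMSE values can be compared directly.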