## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/7.solution_overfitting_and_regularization.ipynb).
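The k-fold requirement in the task list can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression`, not the house prices model. The alpha grid, the five folds, the Ridge estimator, and the RMSE scoring are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the house prices features (assumption: 200 rows, 5 features).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate penalty strengths, evenly spaced on a log scale from 1e-3 to 1e3.
alphas = np.logspace(-3, 3, num=7)

# Five shuffled folds; every candidate alpha is scored on the same folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=folds,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_alpha, best_rmse)
```

Because every alpha sees the same five folds, the averaged scores are directly comparable, which is the property a single random split per alpha does not give.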

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet


In [2]:
import warnings
warnings.filterwarnings(action="ignore")

In [3]:
# %load ../utility/overhead.py
#record module versions used in cell 1
#
def version_recorder():
    '''
    only works if import is first cell run. prints and then returns dictionary with modules:version.
    '''
    import pkg_resources
    resources = In[1].splitlines()
    ##ADD: drop lines if not _from_ or _import_
    version_dict = { resource.split()[1].split(".")[0] : pkg_resources.get_distribution(resource.split()[1].split(".")[0]).version for resource in resources }
    return version_dict
version_recorder()

{'numpy': '1.16.2',
 'pandas': '0.24.2',
 'sklearn': '0.0',
 'matplotlib': '3.0.3',
 'seaborn': '0.9.0',
 'sqlalchemy': '1.3.5',
 'scipy': '1.2.1',
 'statsmodels': '0.10.1'}
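One possible way to handle the `##ADD` note above is sketched here. It is an alternative, not the notebook's helper: it skips lines that are not `import` or `from` statements and reads each module's `__version__` attribute instead of querying `pkg_resources`. The names `top_level_modules` and `module_versions` are hypothetical. Reading the version from the imported module also sidesteps cases where the import name differs from the installed distribution name, which is likely why `sklearn` shows `'0.0'` above.

```python
import importlib

def top_level_modules(source):
    """Return the top-level module names imported in a block of source code."""
    names = []
    for line in source.splitlines():
        parts = line.split()
        # Keep only 'import x' and 'from x import y' lines.
        if len(parts) >= 2 and parts[0] in ("import", "from"):
            name = parts[1].split(".")[0].rstrip(",")
            if name not in names:
                names.append(name)
    return names

def module_versions(source):
    """Map each imported module to its __version__ attribute, if it has one."""
    versions = {}
    for name in top_level_modules(source):
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions

# Example input with a blank line and a comment that should both be skipped.
example_cell = "import math\nfrom os.path import join\n\n# a comment line\nimport warnings"
print(module_versions(example_cell))
```

In a notebook, passing `In[1]` in place of `example_cell` would cover the imports from the first cell.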

In [4]:
#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

engine = create_engine('{}://{}:{}@{}:{}/{}'.format(dialect, user, pw, host, port, db))
engine.table_names()

sql_query = '''
SELECT
    *
FROM
    houseprices
'''
source_df = pd.read_sql(sql_query, con=engine)
engine.dispose()
house_df = source_df.copy()

In [5]:
#one-hot encode every categorical (object-dtype) column so features can be selected from them
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.DataFrame()
for feature in categorical_feat:
    new_categories_df = pd.concat([new_categories_df,
                                   pd.get_dummies(house_df[feature], drop_first=True, prefix=feature)], axis=1)
#append numerical features to new df
new_categories_df = pd.concat([new_categories_df, house_df.select_dtypes(exclude='object')], axis=1)

#create X & y
X = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt"]].copy()
#the concat below is not assigned back to X (and would stack rows without axis=1),
#so the dummy columns are not part of the X used in the rest of the notebook
pd.concat([X, new_categories_df[['exterqual_TA', 'foundation_CBlock']]])
X["overallqual_x_year"] = house_df.overallqual * house_df.yearbuilt
y = house_df.saleprice
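For reference, here is a small sketch of joining dummy columns onto a feature matrix. `pd.concat` returns a new frame, so the result has to be assigned, and `axis=1` is needed to add columns rather than rows. The toy frame and its values are made up for illustration; only the column names echo the house prices data.

```python
import pandas as pd

# Toy frame standing in for house_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "overallqual": [5, 7, 8],
    "yearbuilt": [1960, 1995, 2005],
    "exterqual": ["TA", "Gd", "TA"],
})

# One-hot encode the categorical column, dropping the first level ("Gd").
dummies = pd.get_dummies(toy_df["exterqual"], prefix="exterqual", drop_first=True)

# Join numeric features and dummies side by side, keeping the returned frame.
X_toy = pd.concat([toy_df[["overallqual", "yearbuilt"]], dummies], axis=1)

# Interaction term between quality and construction year.
X_toy["overallqual_x_year"] = toy_df["overallqual"] * toy_df["yearbuilt"]
print(X_toy.columns.tolist())
```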

In [6]:
def test_model(X=X, y=y, kfold = 5, alpha_lambda_start=100, alpha_lambda_stop=1_000_000, LASSO_amt=0, Ridge_amt=0):
    '''
    fit one model per penalty value and print train/test metrics for each.
    LASSO_amt is passed to statsmodels as L1_wt (1 = Lasso, 0 = Ridge, in between = ElasticNet).
    kfold is accepted but not used: each penalty is scored on a single random 80/20 split.
    '''
    #np.logspace expects base-10 exponents and np.log is the natural log,
    #so the grid runs from 10**ln(start) to 10**ln(stop) rather than from start to stop
    for i, alpha_lambda in enumerate(np.logspace(np.log(alpha_lambda_start), np.log(alpha_lambda_stop), num=5)):
        #split into training and testing sets; the seed depends on lambda, so every pass uses a different split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=int(42*(alpha_lambda**.5)))

        #sm.add_constant returns a new frame; its result is not kept, so the statsmodels fits have no intercept
        sm.add_constant(X_train)
        if LASSO_amt + Ridge_amt == 0:
            sklModel = LinearRegression()
            smModel = sm.OLS(y_train, X_train).fit()
        elif LASSO_amt == 1 and Ridge_amt == 0:
            sklModel = Lasso()
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
        elif LASSO_amt == 0 and Ridge_amt == 1:
            sklModel = Ridge()
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
        else:
            sklModel = ElasticNet()
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)

        #the sklearn model is fitted with its default penalty, not alpha_lambda
        sklModel = sklModel.fit(X_train, y_train)

        #test-set predictions come from the statsmodels fit
        y_preds = smModel.predict(X_test)
        print('-----------{}-FOLD-------------'.format(i))
        print("lambda is {}".format(alpha_lambda))
        #sklearn's score() is the plain (unadjusted) R^2 on the training data
        print('adjusted R^2 is {}'.format(sklModel.score(X_train, y_train)))
        try:
            aic, bic = smModel.aic, smModel.bic
        except Exception:
            #regularized statsmodels results do not provide aic/bic
            aic = "not calculated"
            bic = "not calculated"
        print('AIC is {} and BIC is {}'.format(aic, bic))
        print("root mean squared error is: {}".format(rmse(y_test, y_preds)))
        print("mean squared error is: {}\n".format(mse(y_test, y_preds)))
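Two details of the function above are easy to get wrong, so here is a small sketch of each, with hypothetical helper names. `penalty_grid` builds a log-spaced grid that actually starts and stops at the requested values, since `np.logspace` takes base-10 exponents. `adjusted_r2` applies the standard adjustment to a plain R^2, assuming the model includes an intercept. The sample size and feature count in the usage lines are illustrative values only.

```python
import numpy as np

def penalty_grid(start, stop, num=5):
    """Return `num` penalties spaced evenly on a log scale from start to stop."""
    # np.logspace takes base-10 exponents, so convert the endpoints with log10.
    return np.logspace(np.log10(start), np.log10(stop), num=num)

def adjusted_r2(r2, n_samples, n_features):
    """Adjust R^2 for the number of predictors (assumes an intercept is fitted)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

grid = penalty_grid(10, 1000, num=3)
adj = adjusted_r2(0.75, n_samples=100, n_features=5)
print(grid)
print(adj)
```

The adjusted value is always at or below the plain R^2 when there is at least one predictor, and the gap grows as features are added relative to the sample size.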
       

In [7]:
#Original model (OLS, no penalty)
test_model(X, y, alpha_lambda_start=10, alpha_lambda_stop=1_000, LASSO_amt=0, Ridge_amt=0)

-----------0-FOLD-------------
lambda is 200.71743249053017
adjusted R^2 is 0.7555160641583969
AIC is 28056.80910761802 and BIC is 28082.12434843496
root mean squared error is: 40615.77377008059
mean squared error is: 1649641078.9423661

-----------1-FOLD-------------
lambda is 2843.6598062637377
adjusted R^2 is 0.7426663155426882
AIC is 28103.617776532865 and BIC is 28128.933017349806
root mean squared error is: 37252.622822450925
mean squared error is: 1387757907.1517916

-----------2-FOLD-------------
lambda is 40287.487705590545
adjusted R^2 is 0.7588547816888873
AIC is 28041.590624299704 and BIC is 28066.905865116645
root mean squared error is: 41439.30764819615
mean squared error is: 1717216218.3618479

-----------3-FOLD-------------
lambda is 570772.0951897759
adjusted R^2 is 0.7529078091453837
AIC is 28152.521421440266 and BIC is 28177.836662257207
root mean squared error is: 33460.0301195766
mean squared error is: 1119573615.6029735

-----------4-FOLD-------------
lambda is 80

In [8]:
#Lasso regression (LASSO_amt=1, Ridge_amt=0)
test_model(X, y, alpha_lambda_start=10, alpha_lambda_stop=1_000_000, LASSO_amt=1, Ridge_amt=0)

-----------0-FOLD-------------
lambda is 200.71743249053017
adjusted R^2 is 0.7417182343455404
AIC is not calculated and BIC is not calculated
root mean squared error is: 42649.12381543864
mean squared error is: 1818947762.224615

-----------1-FOLD-------------
lambda is 151640.936978285
adjusted R^2 is 0.7622849412977937
AIC is not calculated and BIC is not calculated
root mean squared error is: 46835.817927350836
mean squared error is: 2193593840.923958

-----------2-FOLD-------------
lambda is 114563909.48373185
adjusted R^2 is 0.7376287989409189
AIC is not calculated and BIC is not calculated
root mean squared error is: 47629.51768843334
mean squared error is: 2268570955.2327847

-----------3-FOLD-------------
lambda is 86552415315.63667
adjusted R^2 is 0.7666555797032376
AIC is not calculated and BIC is not calculated
root mean squared error is: 196648.87956499474
mean squared error is: 38670781834.16781

-----------4-FOLD-------------
lambda is 65389882649161.6
adjusted R^2 is 0.

In [9]:
#elasticnet
test_model(X, y, alpha_lambda_start=10, alpha_lambda_stop=1_000_000, LASSO_amt=.5, Ridge_amt=.5)

-----------0-FOLD-------------
lambda is 200.71743249053017
adjusted R^2 is 0.737771355472431
AIC is not calculated and BIC is not calculated
root mean squared error is: 42859.470570931575
mean squared error is: 1836934217.6205494

-----------1-FOLD-------------
lambda is 151640.936978285
adjusted R^2 is 0.7566345992771841
AIC is not calculated and BIC is not calculated
root mean squared error is: 46560.31782607827
mean squared error is: 2167863196.0654225

-----------2-FOLD-------------
lambda is 114563909.48373185
adjusted R^2 is 0.7348960313131602
AIC is not calculated and BIC is not calculated
root mean squared error is: 68660.43781365936
mean squared error is: 4714255720.763384

-----------3-FOLD-------------
lambda is 86552415315.63667
adjusted R^2 is 0.7605857992283367
AIC is not calculated and BIC is not calculated
root mean squared error is: 196648.87956499474
mean squared error is: 38670781834.16781

-----------4-FOLD-------------
lambda is 65389882649161.6
adjusted R^2 is 0.

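All of the models have relatively similar results at the smaller penalties: training R^2 stays between about 0.73 and 0.77, and test RMSE runs from roughly 33,000 to 47,000. OLS has the lowest test RMSE in these runs (about 33,460 to 41,439), while Lasso and ElasticNet land near 42,600 to 46,800 at the two smallest lambda values and degrade sharply as lambda grows, both reaching an RMSE of about 196,649 at lambda ≈ 8.7e10. Each lambda was scored on a different random train/test split, so these gaps are suggestive rather than conclusive.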
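A like-for-like comparison could be sketched as below, using scikit-learn's built-in cross-validated estimators so that each regularized model picks its penalty by 5-fold cross-validation on the same training set and is then scored on the same held-out test set. The data is synthetic (`make_regression`), and the alpha grid, the `l1_ratio` of 0.5, and the split sizes are assumptions made for the example, not values taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices data (assumption: 300 rows, 8 features).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Shared grid of candidate penalties for the three regularized models.
alphas = np.logspace(-3, 3, num=13)
models = {
    "OLS": LinearRegression(),
    "Lasso": LassoCV(alphas=alphas, cv=5, max_iter=10_000),
    "Ridge": RidgeCV(alphas=alphas, cv=5),
    "ElasticNet": ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=10_000),
}

test_rmse = {}
chosen_alpha = {}
for name, model in models.items():
    # The *CV estimators select alpha by 5-fold CV on the training data only.
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    test_rmse[name] = float(np.sqrt(mean_squared_error(y_test, preds)))
    chosen_alpha[name] = getattr(model, "alpha_", None)  # OLS has no penalty

for name in models:
    print(name, chosen_alpha[name], test_rmse[name])
```

With one shared split and cross-validated penalties, the model with the lowest test RMSE is a defensible answer to "which model is best", though on real data the margin may be small enough that the simpler model is preferable.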