## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/7.solution_overfitting_and_regularization.ipynb).

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

In [2]:
# %load ../utility/overhead.py
#record module versions used in cell 1
#
def version_recorder():
    '''
    only works if import is first cell run. prints and then returns dictionary with modules:version.
    '''
    import pkg_resources
    resources = In[1].splitlines()
    ##ADD: drop lines if not _from_ or _import_
    version_dict = { resource.split()[1].split(".")[0] : pkg_resources.get_distribution(resource.split()[1].split(".")[0]).version for resource in resources }
    return version_dict
version_recorder()

{'numpy': '1.16.2',
 'pandas': '0.24.2',
 'sklearn': '0.0',
 'matplotlib': '3.0.3',
 'seaborn': '0.9.0',
 'sqlalchemy': '1.3.5',
 'scipy': '1.2.1',
 'statsmodels': '0.10.1'}

In [3]:
#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

engine = create_engine('{}://{}:{}@{}:{}/{}'.format(dialect, user, pw, host, port, db))
engine.table_names()

sql_query = '''
SELECT
    *
FROM
    houseprices
'''
source_df = pd.read_sql(sql_query, con=engine)
engine.dispose()
house_df = source_df.copy()

In [4]:
#create df with categorical variables to select some features from
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.DataFrame()
for feature in categorical_feat:
    new_categories_df = pd.concat([new_categories_df, 
                                   pd.get_dummies(house_df[feature], columns=categorical_feat, drop_first=True, prefix = feature)], axis=1)
#append numerical features to new df
new_categories_df = pd.concat([new_categories_df, 
                               house_df.filter(items=(house_df.columns[(house_df.dtypes.values != 'object').tolist()]), axis=1) ], 
                              axis=1) #tolist() needed to avoid hashability issue

#create X & y
X = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt",]]
pd.concat([X, new_categories_df[['exterqual_TA', 'foundation_CBlock']]])
X["overalqual_x_year"] = house_df.overallqual * house_df.yearbuilt
y = house_df.saleprice

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [92]:
def test_model(X=X, y=y, kfold = 5, alpha_lambda_start=100, alpha_lambda_stop=1_000_000, LASSO_amt=0, Ridge_amt=0):
    for i, alpha_lambda in enumerate(np.logspace(np.log(alpha_lambda_start), np.log(alpha_lambda_stop), num=5)):
        #split into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=int(42/(alpha_lambda**.5)))
        
        if LASSO_amt + Ridge_amt == 0:
            sklModel = LinearRegression()
            sm.add_constant(X_train)
            smModel = sm.OLS(y_train, X_train).fit()
            OLS_switch = True
        elif (LASSO_amt==1) & (Ridge_amt==0):
            sklModel = Lasso()
            sm.add_constant(X_train)
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
        elif (LASSO_amt==0) & (Ridge_amt==1):
            sklModel = Ridge()
            sm.add_constant(X_train)
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
        else:
            sklModel = ElasticNet()
            sm.add_constant(X_train)
            smModel = sm.OLS(y_train, X_train).fit_regularized(L1_wt=LASSO_amt, alpha=alpha_lambda)
               
        sklModel = sklModel.fit(X_train, y_train)
        
        y_preds = smModel.predict(X_test)
        print('-----------{}-FOLD-------------'.format(i))
        print('adjusted R^2 is {}'.format(sklModel.score(X_train, y_train)))
        try:
            aic, bic = smModel.aic, smModel.bic
        except:
            aic = "not calculated"
            bic = "not calculated"
        print('AIC is {} and BIC is {}'.format(aic, bic))
        print("root mean squared error is: {}".format(rmse(y_test, y_preds)))
        print("mean squared error is: {}\n".format(mse(y_test, y_preds)))
       

In [93]:
#Original Model
test_model(X, y, alpha_lambda_start=10, alpha_lambda_stop=1_000_000, LASSO_amt=0, Ridge_amt=0)

-----------0-FOLD-------------
adjusted R^2 is 0.7448195078485127
AIC is 28134.704376175985 and BIC is 28160.019616992926
root mean squared error is: 34692.07982273649
mean squared error is: 1203540402.4271202

-----------1-FOLD-------------
adjusted R^2 is 0.7720052401391067
AIC is 27943.82126508257 and BIC is 27969.13650589951
root mean squared error is: 47355.158296741785
mean squared error is: 2242511017.3094726

-----------2-FOLD-------------
adjusted R^2 is 0.7720052401391067
AIC is 27943.82126508257 and BIC is 27969.13650589951
root mean squared error is: 47355.158296741785
mean squared error is: 2242511017.3094726

-----------3-FOLD-------------
adjusted R^2 is 0.7720052401391067
AIC is 27943.82126508257 and BIC is 27969.13650589951
root mean squared error is: 47355.158296741785
mean squared error is: 2242511017.3094726

-----------4-FOLD-------------
adjusted R^2 is 0.7720052401391067
AIC is 27943.82126508257 and BIC is 27969.13650589951
root mean squared error is: 47355.15829

  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)


In [94]:
#Ridge Regression
test_model(X, y, alpha_lambda_start=10, alpha_lambda_stop=1_000_000, LASSO_amt=1, Ridge_amt=0)

  return ptp(axis=axis, out=out, **kwargs)
  positive)
  return ptp(axis=axis, out=out, **kwargs)


-----------0-FOLD-------------
adjusted R^2 is 0.7342035716170321
AIC is not calculated and BIC is not calculated
root mean squared error is: 36351.05735391397
mean squared error is: 1321399370.747543

-----------1-FOLD-------------
adjusted R^2 is 0.7623524326063086
AIC is not calculated and BIC is not calculated
root mean squared error is: 48940.74438199695
mean squared error is: 2395196460.663966

-----------2-FOLD-------------
adjusted R^2 is 0.7623524326063086
AIC is not calculated and BIC is not calculated
root mean squared error is: 56331.881509439765
mean squared error is: 3173280874.393562

-----------3-FOLD-------------
adjusted R^2 is 0.7623524326063086
AIC is not calculated and BIC is not calculated
root mean squared error is: 201164.81662151174
mean squared error is: 40467283446.36644

-----------4-FOLD-------------
adjusted R^2 is 0.7623524326063086
AIC is not calculated and BIC is not calculated
root mean squared error is: 201164.81662151174
mean squared error is: 404672

  positive)
  return ptp(axis=axis, out=out, **kwargs)
  positive)
  return ptp(axis=axis, out=out, **kwargs)
  positive)
  return ptp(axis=axis, out=out, **kwargs)
  positive)


In [95]:
#elasticnet
test_model(X, y, alpha_lambda_start=10, alpha_lambda_stop=1_000_000, LASSO_amt=.5, Ridge_amt=.5)

  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)


-----------0-FOLD-------------
adjusted R^2 is 0.7302676630179219
AIC is not calculated and BIC is not calculated
root mean squared error is: 36470.611062508906
mean squared error is: 1330105471.272797

-----------1-FOLD-------------
adjusted R^2 is 0.7580737928321917
AIC is not calculated and BIC is not calculated
root mean squared error is: 48701.95553925165
mean squared error is: 2371880473.3472443

-----------2-FOLD-------------
adjusted R^2 is 0.7580737928321917
AIC is not calculated and BIC is not calculated
root mean squared error is: 78977.73180532893
mean squared error is: 6237482121.114465

-----------3-FOLD-------------
adjusted R^2 is 0.7580737928321917
AIC is not calculated and BIC is not calculated
root mean squared error is: 201164.81662151174
mean squared error is: 40467283446.36644

-----------4-FOLD-------------
adjusted R^2 is 0.7580737928321917
AIC is not calculated and BIC is not calculated
root mean squared error is: 201164.81662151174
mean squared error is: 40467

  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)
  return ptp(axis=axis, out=out, **kwargs)
