## Assignment

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?

This is not a graded checkpoint, but you should discuss your solution with your mentor. After you've submitted your work, take a moment to compare your solution to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/7.solution_overfitting_and_regularization.ipynb).

In [22]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse

In [4]:
# %load ../utility/overhead.py
#record module versions used in cell 1
#
def version_recorder():
    '''
    only works if import is first cell run. prints and then returns dictionary with modules:version.
    '''
    import pkg_resources
    resources = In[1].splitlines()
    ##ADD: drop lines if not _from_ or _import_
    version_dict = { resource.split()[1].split(".")[0] : pkg_resources.get_distribution(resource.split()[1].split(".")[0]).version for resource in resources }
    return version_dict
version_recorder()

{'numpy': '1.16.4', 'pandas': '0.25.0', 'sklearn': '0.0', 'matplotlib': '3.1.1', 'seaborn': '0.9.0', 'sqlalchemy': '1.3.6', 'scipy': '1.3.0'}


{'numpy': '1.16.4',
 'pandas': '0.25.0',
 'sklearn': '0.0',
 'matplotlib': '3.1.1',
 'seaborn': '0.9.0',
 'sqlalchemy': '1.3.6',
 'scipy': '1.3.0'}

In [8]:
#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

engine = create_engine('{}://{}:{}@{}:{}/{}'.format(dialect, user, pw, host, port, db))
engine.table_names()

sql_query = '''
SELECT
    *
FROM
    houseprices
'''
source_df = pd.read_sql(sql_query, con=engine)
engine.dispose()
house_df = source_df.copy()

  return ptp(axis=axis, out=out, **kwargs)


0,1,2,3
Dep. Variable:,saleprice,R-squared (uncentered):,0.953
Model:,OLS,Adj. R-squared (uncentered):,0.953
Method:,Least Squares,F-statistic:,7390.0
Date:,"Thu, 25 Jul 2019",Prob (F-statistic):,0.0
Time:,20:01:57,Log-Likelihood:,-17642.0
No. Observations:,1460,AIC:,35290.0
Df Residuals:,1456,BIC:,35310.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
overallqual,3.27e+04,1074.204,30.442,0.000,3.06e+04,3.48e+04
grlivarea,52.9168,2.972,17.806,0.000,47.087,58.746
fullbath,4384.2970,2740.626,1.600,0.110,-991.700,9760.294
yearbuilt,-53.4819,2.694,-19.856,0.000,-58.766,-48.198

0,1,2,3
Omnibus:,350.75,Durbin-Watson:,1.98
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7694.705
Skew:,0.561,Prob(JB):,0.0
Kurtosis:,14.191,Cond. No.,6160.0


In [25]:
#create df with categorical variables to select some features from
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.DataFrame()
for feature in categorical_feat:
    new_categories_df = pd.concat([new_categories_df, 
                                   pd.get_dummies(house_df[feature], columns=categorical_feat, drop_first=True, prefix = feature)], axis=1)
#append numerical features to new df
new_categories_df = pd.concat([new_categories_df, 
                               house_df.filter(items=(house_df.columns[(house_df.dtypes.values != 'object').tolist()]), axis=1) ], 
                              axis=1) #tolist() needed to avoid hashability issue

#create X & y
X = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt",]]
pd.concat([X, new_categories_df[['exterqual_TA', 'foundation_CBlock']]])
X["overalqual_x_year"] = house_df.overallqual * house_df.yearbuilt
y = house_df.saleprice

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


In [26]:
def test_model(X=X, y=y, kfold = 5):
    X_overall = X
    sm.add_constant(X_overall)
    overall_model = sm.OLS(y, X_overall).fit()
    print('------------test-------------')
    print('adjusted R^2 is {}'.format(overall_model.rsquared_adj))
    print('AIC is {} and BIC is {}'.format(overall_model.aic, overall_model.bic))
    for k in range(1, kfold):
        #split into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42+k)

        sm.add_constant(X_train)
        model = sm.OLS(y_train, X_train).fit()

        y_preds = model.predict(X_test)
        
        print('-----------{}-FOLD-------------'.format(k))
        print("root mean squared error is: {}".format(rmse(y_test, y_preds)))
        print("mean squared error is: {}\n".format(mse(y_test, y_preds)))
    print(overall_model.summary())
        

In [21]:
test_model(X, y)

------------test-------------
adjusted R^2 is 0.9593178009510193
AIC is 35079.660951391044 and BIC is 35106.091909964554
-----------1-FOLD-------------
root mean squared error is: 34944.75385267113
mean squared error is: 1221135821.8237736

-----------2-FOLD-------------
root mean squared error is: 34144.21352594143
mean squared error is: 1165827317.3050814

-----------3-FOLD-------------
root mean squared error is: 35102.3078392356
mean squared error is: 1232172015.6404607

-----------4-FOLD-------------
root mean squared error is: 39946.941649133834
mean squared error is: 1595758147.1193032

                                 OLS Regression Results                                
Dep. Variable:              saleprice   R-squared (uncentered):                   0.959
Model:                            OLS   Adj. R-squared (uncentered):              0.959
Method:                 Least Squares   F-statistic:                              6887.
Date:                Thu, 25 Jul 2019   Prob (F

In [28]:
#Ridge Regression
sm.regression.linear_model?