## Assignment

In this exercise, you'll predict house prices using your model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Split your data into train and test sets.
* Estimate your model from the previous checkpoint in the train set. Assess the goodness of fit of your model.
* Predict the house prices in the test set, and evaluate the performance of your model using the metrics we mentioned in this checkpoint.
* Is the performance of your model satisfactory? Why?
* Try to improve your model in terms of predictive performance by adding or removing some variables.

Please submit a link to your work notebook. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to [this example solution](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/6.solution_making_predictions.ipynb).

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse

In [2]:
# record module versions used in cell 1
import pkg_resources
resources = In[1].splitlines()
version_dict = {resource.split()[1].split(".")[0]: pkg_resources.get_distribution(
    resource.split()[1].split(".")[0]).version for resource in resources}
version_dict

{'numpy': '1.16.4',
 'pandas': '0.25.0',
 'sklearn': '0.0',
 'matplotlib': '3.1.1',
 'seaborn': '0.9.0',
 'sqlalchemy': '1.3.6',
 'scipy': '1.3.0',
 'statsmodels': '0.10.1'}

In [3]:
# credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

db_addr = f'{dialect}://{user}:{pw}@{host}:{port}/{db}'
engine = create_engine(db_addr)

query = '''
SELECT
    *
FROM
    houseprices
'''

raw_df = pd.read_sql(query, con=engine)
engine.dispose()

In [9]:
house_df = raw_df.copy(deep=True)

In [10]:
# create X & y
X = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt"]]
y = house_df.saleprice

In [108]:
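def test_model(X, y, kfold=5):
    """Fit OLS on all rows, then report test-set error over repeated random splits."""
    X_overall = X
    # NOTE: sm.add_constant returns a new object and its result is not assigned
    # here, so every model in this function is fit without an intercept
    sm.add_constant(X_overall)
    overall_model = sm.OLS(y, X_overall).fit()
    print('------------test-------------')
    print('adjusted R^2 is {}'.format(overall_model.rsquared_adj))
    print('AIC is {} and BIC is {}'.format(
        overall_model.aic, overall_model.bic))
    # range(1, kfold) runs kfold - 1 times; each pass is an independent random
    # 80/20 split (a different random_state), not a disjoint cross-validation fold
    for k in range(1, kfold):
        # split into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42+k)

        sm.add_constant(X_train)  # result discarded, see note above
        model = sm.OLS(y_train, X_train).fit()

        # predict on the held-out rows
        y_preds = model.predict(X_test)

        print('-----------{}-FOLD-------------'.format(k))
        print("root mean squared error is: {}".format(rmse(y_test, y_preds)))
        print("mean squared error is: {}\n".format(mse(y_test, y_preds)))
    print(overall_model.summary())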

In [109]:
test_model(X=X, y=y)

------------test-------------
adjusted R^2 is 0.959304467958667
AIC is 35081.135585116026 and BIC is 35112.85273540424
-----------1-FOLD-------------
root mean squared error is: 35059.50025223219
mean squared error is: 1229168557.9362688

-----------2-FOLD-------------
root mean squared error is: 34128.9850591349
mean squared error is: 1164787621.1666534

-----------3-FOLD-------------
root mean squared error is: 35089.718158086944
mean squared error is: 1231288320.4139767

-----------4-FOLD-------------
root mean squared error is: 39934.916159303524
mean squared error is: 1594797528.6506016

                                 OLS Regression Results                                
Dep. Variable:              saleprice   R-squared (uncentered):                   0.959
Model:                            OLS   Adj. R-squared (uncentered):              0.959
Method:                 Least Squares   F-statistic:                              5737.
Date:                Thu, 25 Jul 2019   Prob (F-

In [110]:
# copy X so the new columns are not written back into X, which is itself a slice of house_df
X2 = X.copy()
X2["poolarea"] = house_df.poolarea
X2["overalqual_x_year"] = house_df.overallqual * house_df.yearbuilt


In [111]:
test_model(X2, y)

------------test-------------
adjusted R^2 is 0.959304467958667
AIC is 35081.135585116026 and BIC is 35112.85273540424
-----------1-FOLD-------------
root mean squared error is: 35059.50025223219
mean squared error is: 1229168557.9362688

-----------2-FOLD-------------
root mean squared error is: 34128.9850591349
mean squared error is: 1164787621.1666534

-----------3-FOLD-------------
root mean squared error is: 35089.718158086944
mean squared error is: 1231288320.4139767

-----------4-FOLD-------------
root mean squared error is: 39934.916159303524
mean squared error is: 1594797528.6506016

                                 OLS Regression Results                                
Dep. Variable:              saleprice   R-squared (uncentered):                   0.959
Model:                            OLS   Adj. R-squared (uncentered):              0.959
Method:                 Least Squares   F-statistic:                              5737.
Date:                Thu, 25 Jul 2019   Prob (F-

In [112]:
# one-hot encode every categorical (object-dtype) column to select some features from;
# numerical columns pass through unchanged
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.get_dummies(
    house_df, columns=categorical_feat, drop_first=True)
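
The encoding step is easier to follow on a small frame. The sketch below uses made-up rows (`toy_df` is hypothetical, not the houseprices data) to show how `pd.get_dummies` with `drop_first=True` names the resulting columns and drops the first level of each categorical variable.

```python
import pandas as pd

# Hypothetical rows; only the column naming pattern mirrors the notebook.
toy_df = pd.DataFrame({
    "foundation": ["CBlock", "PConc", "CBlock", "BrkTil"],
    "exterqual": ["TA", "Gd", "TA", "Ex"],
    "saleprice": [120000, 210000, 135000, 180000],
})

# Each listed column is replaced by one indicator per level, minus the first
# level (in sorted order), which becomes the implicit baseline.
toy_encoded = pd.get_dummies(
    toy_df, columns=["foundation", "exterqual"], drop_first=True)
print(toy_encoded.columns.tolist())
```

Dropping one level per variable avoids perfectly collinear indicator columns when the model also includes an intercept.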

In [113]:
new_categories_df.corr()[["saleprice"]].sort_values(by='saleprice')

                   saleprice
exterqual_TA       -0.589044
kitchenqual_TA     -0.519298
bsmtqual_TA        -0.452394
garagefinish_Unf   -0.410608
masvnrtype_None    -0.374468
garagetype_Detchd  -0.354141
foundation_CBlock  -0.343263
heatingqc_TA       -0.312677
mszoning_RM        -0.288065
lotshape_Reg       -0.267672
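
Sorting by the raw correlation puts the most negative features first and pushes strong positive ones to the bottom of the list. If the goal is to find the features most strongly related to the target in either direction, ranking by absolute value may be more convenient. A minimal sketch on synthetic data, where the frame `toy` and its column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(2)
toy = pd.DataFrame({"target": rs.normal(size=300)})
toy["pos_feat"] = toy["target"] + rs.normal(scale=0.3, size=300)    # strong positive
toy["neg_feat"] = -toy["target"] + rs.normal(scale=0.6, size=300)   # strong negative
toy["noise_feat"] = rs.normal(size=300)                             # unrelated

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

# Order by magnitude but keep the signed values for interpretation.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```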


In [114]:
# extend the X2 features with two of the dummy variables ranked above
X3 = pd.concat(
    [X2, new_categories_df[["exterqual_TA", 'foundation_CBlock']]], axis=1)

test_model(X3, y)

------------test-------------
adjusted R^2 is 0.9594926701716064
AIC is 35076.35831313656 and BIC is 35118.647846854175
-----------1-FOLD-------------
root mean squared error is: 34840.51676489821
mean squared error is: 1213861608.4451532

-----------2-FOLD-------------
root mean squared error is: 34040.719774624165
mean squared error is: 1158770602.774489

-----------3-FOLD-------------
root mean squared error is: 35017.457885453696
mean squared error is: 1226222356.7595234

-----------4-FOLD-------------
root mean squared error is: 39962.043619405486
mean squared error is: 1596964930.2392664

                                 OLS Regression Results                                
Dep. Variable:              saleprice   R-squared (uncentered):                   0.960
Model:                            OLS   Adj. R-squared (uncentered):              0.959
Method:                 Least Squares   F-statistic:                              4324.
Date:                Thu, 25 Jul 2019   Prob (

In [115]:
# work on a copy so the interaction term is not also added to X3
X4 = X3.copy()
X4["year_x_CBlock"] = X4.yearbuilt * X4.foundation_CBlock

In [116]:
test_model(X4, y)

------------test-------------
adjusted R^2 is 0.9594886763967176
AIC is 35077.49639705748 and BIC is 35125.0721224898
-----------1-FOLD-------------
root mean squared error is: 34823.904669477306
mean squared error is: 1212704336.4288435

-----------2-FOLD-------------
root mean squared error is: 34034.74764249327
mean squared error is: 1158364047.0882013

-----------3-FOLD-------------
root mean squared error is: 34981.33170166672
mean squared error is: 1223693567.622033

-----------4-FOLD-------------
root mean squared error is: 39930.6530657849
mean squared error is: 1594457054.2600768

                                 OLS Regression Results                                
Dep. Variable:              saleprice   R-squared (uncentered):                   0.960
Model:                            OLS   Adj. R-squared (uncentered):              0.959
Method:                 Least Squares   F-statistic:                              3843.
Date:                Thu, 25 Jul 2019   Prob (F-sta

In [118]:
X5 = X4.drop(columns=['poolarea', 'year_x_CBlock'])

In [119]:
test_model(X5, y)

------------test-------------
adjusted R^2 is 0.959503454064028
AIC is 35074.974742620594 and BIC is 35111.97808462351
-----------1-FOLD-------------
root mean squared error is: 34711.33386518356
mean squared error is: 1204876698.7002392

-----------2-FOLD-------------
root mean squared error is: 34056.06334529122
mean squared error is: 1159815450.5784883

-----------3-FOLD-------------
root mean squared error is: 35029.044913183796
mean squared error is: 1227033987.5298476

-----------4-FOLD-------------
root mean squared error is: 39976.52795205394
mean squared error is: 1598122787.1013498

                                 OLS Regression Results                                
Dep. Variable:              saleprice   R-squared (uncentered):                   0.960
Model:                            OLS   Adj. R-squared (uncentered):              0.960
Method:                 Least Squares   F-statistic:                              4943.
Date:                Thu, 25 Jul 2019   Prob (F-

