### 2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?
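
As a rough sketch of how the metrics named in these tasks relate to one another, the snippet below computes them by hand with NumPy for an OLS fit that includes an intercept, then compares three candidate specifications. The helper name `fit_metrics` and the synthetic data are illustrative assumptions and not part of the assignment. The formulas are the standard Gaussian-likelihood definitions, which should correspond to what a statsmodels summary reports for a model with a constant.

```python
import numpy as np

def fit_metrics(X, y):
    """Goodness-of-fit metrics for OLS with an intercept (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape                              # observations, regressors (no constant)
    design = np.column_stack([np.ones(n), X])   # prepend the intercept column
    p = k + 1                                   # estimated parameters incl. intercept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ssr = resid @ resid                         # residual sum of squares
    sst = ((y - y.mean()) ** 2).sum()           # total sum of squares around the mean
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    f_stat = ((sst - ssr) / k) / (ssr / (n - p))
    # Gaussian log-likelihood evaluated at the least-squares estimates
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return {"r2": r2, "adj_r2": adj_r2, "f_stat": f_stat,
            "aic": 2 * p - 2 * llf, "bic": p * np.log(n) - 2 * llf}

# Synthetic data (assumption): x1 drives y, x2 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 3.0 * x1 + rng.normal(size=200)

candidates = {"x1 only": x1[:, None],
              "x2 only": x2[:, None],
              "x1 + x2": np.column_stack([x1, x2])}
metrics = {name: fit_metrics(X, y) for name, X in candidates.items()}
for name, m in metrics.items():
    print(name, {key: round(float(val), 3) for key, val in m.items()})
```

Lower AIC and BIC indicate a better trade-off between fit and complexity, but only across models fit to the same `y`. Adding a regressor never lowers R-squared, which is why adjusted R-squared is the fairer number to compare across specifications of different size.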

In [4]:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode

#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

engine = create_engine('{}://{}:{}@{}:{}/{}'.format(dialect, user, pw, host, port, db))
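# List the available tables (Engine.table_names() was removed in SQLAlchemy 2.0)
from sqlalchemy import inspect
print(inspect(engine).get_table_names())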

sql_query = '''
SELECT
    *
FROM
    houseprices
'''
source_df = pd.read_sql(sql_query, con=engine)
engine.dispose()
house_df = source_df.copy()
for column in house_df.columns[house_df.dtypes== 'object']:
    print("Column {} has values {}".format(column, house_df[column].unique()))
# Fill missing values
missing_numerical = ['lotfrontage', 'masvnrarea', 'garageyrblt']
for miss in missing_numerical:
    house_df[miss] = house_df[miss].fillna(house_df[miss].mean())  # fill with column mean

# Categorical (object) columns that still contain missing values
missing_cat_ob = house_df.dtypes[house_df.isna().sum() > 0]
missing_categorical = missing_cat_ob[missing_cat_ob == 'object'].index
for miss in missing_categorical:
    house_df[miss] = house_df[miss].fillna(house_df[miss].value_counts().index[0])  # fill with most common value

# One-hot encode the categorical features. Passing the whole DataFrame lets
# get_dummies prefix each dummy with its source column, which avoids duplicate
# column names such as 'Norm' (condition1 and condition2) or 'Grvl' (street and alley).
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.concat(
    [pd.get_dummies(house_df[categorical_feat], drop_first=True, dtype=float),
     house_df.select_dtypes(exclude='object')],  # numerical columns, including saleprice
    axis=1)

# Flag highly correlated (> .95) numeric feature pairs; nothing is dropped here
house_corr_df = house_df.select_dtypes('number').corr()
house_corr_df[house_corr_df > .95].notna()

#standardize data and compute PCA
pca = PCA()
scaler = StandardScaler()
X = scaler.fit_transform(new_categories_df.drop(["saleprice"], axis=1))
y = new_categories_df.saleprice

pca.fit(X)
pca.explained_variance_ratio_

sns.set_style('darkgrid')
plt.figure(figsize=(15,5))
sns.lineplot(data=np.cumsum(pca.explained_variance_ratio_), marker="o")
plt.title("Cumulative Variance explained");

pca_75 = PCA(n_components=75)
X_pca = pca_75.fit_transform(X)
lrm = linear_model.LinearRegression()
lrm.fit(X_pca, y)


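import statsmodels.api as sm
# NOTE: sm.add_constant returns a new array and its result is discarded here,
# so the model is fit without an intercept (hence "uncentered" R-squared).
sm.add_constant(X_pca)
results = sm.OLS(y, X_pca).fit()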

results.summary()

results.pvalues[results.pvalues <.1 ]

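# QUESTION: how to do a PCA skipping a component?
pca_limited = PCA(n_components=4)
X_pca_limited = pca_limited.fit_transform(X)
# NOTE: the return value of sm.add_constant is discarded, so this model is
# fit without an intercept and the summary reports uncentered R-squared.
sm.add_constant(X_pca_limited)
results_limited = sm.OLS(y, X_pca_limited).fit()
results_limited.summary()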

Column mszoning has values ['RL' 'RM' 'C (all)' 'FV' 'RH']
Column street has values ['Pave' 'Grvl']
Column alley has values [None 'Grvl' 'Pave']
Column lotshape has values ['Reg' 'IR1' 'IR2' 'IR3']
Column landcontour has values ['Lvl' 'Bnk' 'Low' 'HLS']
Column utilities has values ['AllPub' 'NoSeWa']
Column lotconfig has values ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
Column landslope has values ['Gtl' 'Mod' 'Sev']
Column neighborhood has values ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
Column condition1 has values ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
Column condition2 has values ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
Column bldgtype has values ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
Column housestyle has values ['2Story' '1Stor

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared (uncentered): | 0.119 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.117 |
| Method: | Least Squares | F-statistic: | 49.24 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 6.72e-39 |
| Time: | 22:41:11 | Log-Likelihood: | -19782.0 |
| No. Observations: | 1460 | AIC: | 39570.0 |
| Df Residuals: | 1456 | BIC: | 39590.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.529e+04 | 1208.436 | 12.656 | 0.000 | 1.29e+04 | 1.77e+04 |
| x2 | 3213.8927 | 1767.414 | 1.818 | 0.069 | -253.057 | 6680.842 |
| x3 | 1.072e+04 | 1994.432 | 5.375 | 0.000 | 6806.832 | 1.46e+04 |
| x4 | 4606.8865 | 2145.497 | 2.147 | 0.032 | 398.291 | 8815.483 |

| | | | |
|---|---|---|---|
| Omnibus: | 506.03 | Durbin-Watson: | 0.094 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 26678.72 |
| Skew: | 0.814 | Prob(JB): | 0.0 |
| Kurtosis: | 23.878 | Cond. No. | 1.78 |


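The R-squared is bad (0.119, uncentered because the model was fit without a constant). The AIC (39570) and BIC (39590) cannot be judged on their own; they are only meaningful when compared with another model fit to the same target.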
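
The summary above reports "R-squared (uncentered)", which is what statsmodels shows when the design matrix has no constant column. A minimal NumPy sketch of the difference follows. The data are synthetic (an assumption, chosen to resemble centered PCA scores and a target with a large mean), and `np.column_stack` with a column of ones stands in for assigning the result of `sm.add_constant(X)` to a variable and passing that variable to `sm.OLS`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))     # zero-mean features, like PCA scores of scaled data
y = 180_000 + 20_000 * X[:, 0] + rng.normal(scale=10_000, size=n)

def ols_ssr(design, y):
    """Residual sum of squares of the least-squares fit of y on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

# Without an intercept: R-squared is measured against zero (uncentered)
r2_uncentered = 1 - ols_ssr(X, y) / (y @ y)

# With an intercept column: R-squared is measured against the mean of y
X_const = np.column_stack([np.ones(n), X])
r2_centered = 1 - ols_ssr(X_const, y) / ((y - y.mean()) ** 2).sum()

print(f"no intercept:   {r2_uncentered:.3f}")
print(f"with intercept: {r2_centered:.3f}")
```

Principal components of standardized data have mean zero, so a model without an intercept cannot capture the average sale price. That is one plausible reason for the low 0.119 figure, and refitting with the constant included would be needed to confirm it.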

In [6]:
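# Restrict to numeric columns; recent pandas versions raise on object columns in corr()
house_df.select_dtypes('number').corr()["saleprice"].sort_values()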

kitchenabvgr    -0.135907
enclosedporch   -0.128578
mssubclass      -0.084284
overallcond     -0.077856
yrsold          -0.028923
lowqualfinsf    -0.025606
id              -0.021917
miscval         -0.021190
bsmthalfbath    -0.016844
bsmtfinsf2      -0.011378
threessnporch    0.044584
mosold           0.046432
poolarea         0.092404
screenporch      0.111447
bedroomabvgr     0.168213
bsmtunfsf        0.214479
bsmtfullbath     0.227122
lotarea          0.263843
halfbath         0.284108
openporchsf      0.315856
secondflrsf      0.319334
wooddecksf       0.324413
lotfrontage      0.334901
bsmtfinsf1       0.386420
fireplaces       0.466929
garageyrblt      0.470177
masvnrarea       0.475241
yearremodadd     0.507101
yearbuilt        0.522897
totrmsabvgrd     0.533723
fullbath         0.560664
firstflrsf       0.605852
totalbsmtsf      0.613581
garagearea       0.623431
garagecars       0.640409
grlivarea        0.708624
overallqual      0.790982
saleprice        1.000000
Name: salepr

In [8]:
X_selected = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt"]]

In [10]:
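# NOTE: sm.add_constant returns a new DataFrame and its result is discarded here,
# so this model is also fit without an intercept (uncentered R-squared below).
sm.add_constant(X_selected)
model_selected = sm.OLS(y, X_selected).fit()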
model_selected.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared (uncentered): | 0.953 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.953 |
| Method: | Least Squares | F-statistic: | 7390.0 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 22:48:06 | Log-Likelihood: | -17642.0 |
| No. Observations: | 1460 | AIC: | 35290.0 |
| Df Residuals: | 1456 | BIC: | 35310.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| overallqual | 3.27e+04 | 1074.204 | 30.442 | 0.000 | 3.06e+04 | 3.48e+04 |
| grlivarea | 52.9168 | 2.972 | 17.806 | 0.000 | 47.087 | 58.746 |
| fullbath | 4384.2970 | 2740.626 | 1.600 | 0.110 | -991.700 | 9760.294 |
| yearbuilt | -53.4819 | 2.694 | -19.856 | 0.000 | -58.766 | -48.198 |

| | | | |
|---|---|---|---|
| Omnibus: | 350.75 | Durbin-Watson: | 1.98 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7694.705 |
| Skew: | 0.561 | Prob(JB): | 0.0 |
| Kurtosis: | 14.191 | Cond. No. | 6160.0 |


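All the criteria look better for this model: R-squared rises from 0.119 to 0.953, the F-statistic rises from 49.24 to 7390, and AIC and BIC fall from 39570 and 39590 to 35290 and 35310. It is also much simpler, since it uses four original features instead of principal components, and it provides useful, interpretable coefficients (although fullbath is not significant at the 5% level, p = 0.110). Both R-squared values are uncentered because neither model includes a constant, so they should be read with caution.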