### 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [3]:
# %load 19.2_linreg_drill.py
#!/usr/bin/env python
'''
output is lrm model
'''
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import mode

#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

engine = create_engine('{}://{}:{}@{}:{}/{}'.format(dialect, user, pw, host, port, db))
from sqlalchemy import inspect  # Engine.table_names() was removed in SQLAlchemy 2.0
inspect(engine).get_table_names()

sql_query = '''
SELECT
    *
FROM
    houseprices
'''
source_df = pd.read_sql(sql_query, con=engine)
engine.dispose()
house_df = source_df.copy()
for column in house_df.columns[house_df.dtypes== 'object']:
    print("Column {} has values {}".format(column, house_df[column].unique()))
#fillvalues
missing_numerical = ['lotfrontage', 'masvnrarea', 'garageyrblt']
for miss in missing_numerical: #column-wise
    house_df[miss] = house_df[miss].fillna(house_df[miss].mean()) #fill with column mean
    
missing_cat_ob = house_df.dtypes[house_df.isna().sum() > 0]
missing_categorical = missing_cat_ob[missing_cat_ob == 'object'].index
for miss in missing_categorical:
    house_df[miss] = house_df[miss].fillna(house_df[miss].value_counts().index[0])  #fill with most common value

categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
new_categories_df = pd.DataFrame()
for feature in categorical_feat:
    # prefix dummies with the feature name so shared labels
    # (e.g. 'Norm' in condition1 and condition2) stay distinct
    new_categories_df = pd.concat([new_categories_df,
                                   pd.get_dummies(house_df[feature], prefix=feature, drop_first=True)], axis=1)
# append the numeric (non-object) columns to the dummy columns
new_categories_df = pd.concat([new_categories_df,
                               house_df.select_dtypes(exclude='object')],
                              axis=1)

#Flag highly (> .95) correlated numeric features; nothing is dropped here
house_corr_df = house_df.select_dtypes(exclude='object').corr()
house_corr_df[house_corr_df > .95].notna()  #.any()

#standardize data and compute PCA
pca = PCA()
scaler = StandardScaler()
X = scaler.fit_transform(new_categories_df.drop(["saleprice"], axis=1))
y = new_categories_df.saleprice

pca.fit(X)
pca.explained_variance_ratio_

sns.set_style('darkgrid')
plt.figure(figsize=(15,5))
sns.lineplot(data=np.cumsum(pca.explained_variance_ratio_), marker="o")
plt.title("Cumulative Variance explained");

pca_75 = PCA(n_components=75)
X_pca = pca_75.fit_transform(X)
lrm = linear_model.LinearRegression()
lrm.fit(X_pca, y)
lrm

print('\nCoefficients: \n', lrm.coef_)
print('\nIntercept: \n', lrm.intercept_)

Column mszoning has values ['RL' 'RM' 'C (all)' 'FV' 'RH']
Column street has values ['Pave' 'Grvl']
Column alley has values [None 'Grvl' 'Pave']
Column lotshape has values ['Reg' 'IR1' 'IR2' 'IR3']
Column landcontour has values ['Lvl' 'Bnk' 'Low' 'HLS']
Column utilities has values ['AllPub' 'NoSeWa']
Column lotconfig has values ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
Column landslope has values ['Gtl' 'Mod' 'Sev']
Column neighborhood has values ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
Column condition1 has values ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
Column condition2 has values ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
Column bldgtype has values ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
Column housestyle has values ['2Story' '1Stor

In [4]:
%whos

Variable              Type                Data/Info
---------------------------------------------------
PCA                   ABCMeta             <class 'sklearn.decomposition.pca.PCA'>
StandardScaler        type                <class 'sklearn.preproces<...>ing.data.StandardScaler'>
X                     ndarray             1460x246: 359160 elems, type `float64`, 2873280 bytes (2.74017333984375 Mb)
X_pca                 ndarray             1460x75: 109500 elems, type `float64`, 876000 bytes (855.46875 kb)
categorical_feat      Index               Index(['mszoning', 'stree<...>],\n      dtype='object')
column                str                 salecondition
create_engine         function            <function create_engine at 0x00000178791E8E58>
db                    str                 houseprices
dialect               str                 postgresql
engine                Engine              Engine(postgresql://dsbc_<...>121.174:5432/houseprices)
feature               str                

In [5]:
lrm.coef_

array([15293.71342719,  3213.25273159, 10710.83003588,  4570.78886077,
         104.42211267,  5796.47721072, -5892.67746355,  1314.22816376,
       -1684.95680329, -2825.37836074,  -996.53745537,  1631.22594528,
        -477.15540483,   602.32879365,   400.64971486,   926.27880386,
        -624.26414729, -1373.72880582,  1834.13458121,   383.13225243,
         251.91665618, -1545.10993489,   839.00791669,  2600.21643291,
       -1427.23758014, -1446.22917025, -1750.64877197,  2481.44463395,
        3480.91479322,  1381.21978791,  -890.00871042,   -86.21154098,
          48.94140008, -1542.47398565,  4281.83118693,   330.21770707,
        -758.92889523,  -877.08418645, -1777.19083407,  1095.2728767 ,
         886.94973483,  -720.6872267 , -1794.74150167,  2168.70980048,
       -2130.08825175,  2972.64676821,  1951.25821544, -1549.84011548,
        1042.5869979 ,  2708.23185119,  1102.5274296 , -3402.06368852,
       -1553.50021346,  1695.91544799,  1171.9439866 , -1243.61672007,
      

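Because I included every variable, ran a PCA, and kept 75 components, I'm not sure the coefficients have an obvious interpretation: each component is a linear combination of all the standardized features rather than a single original feature.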
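
One way to recover some interpretability is to look at the loadings in `pca.components_`, which give the weight of each original feature in each component. The snippet below is only a minimal sketch on synthetic data: the column names `feat_a` to `feat_d` and the `demo_`-style variables are made up for illustration and are not part of the house prices data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded house data (hypothetical columns)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo_df = pd.DataFrame({
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.1, size=200),  # nearly duplicates feat_a
    "feat_c": rng.normal(size=200),
    "feat_d": rng.normal(size=200),
})

X_demo = StandardScaler().fit_transform(demo_df)
pca_demo = PCA(n_components=2).fit(X_demo)

# Rows are components, columns are the original features
loadings = pd.DataFrame(
    pca_demo.components_,
    columns=demo_df.columns,
    index=["pc1", "pc2"],
)

# Features with the largest absolute weight in the first component
top_pc1 = loadings.loc["pc1"].abs().sort_values(ascending=False)
print(top_pc1.head(2))
```

Applied to the real model, the same idea would use `pca_75.components_` with `new_categories_df.drop(["saleprice"], axis=1).columns` as the column labels, assuming those objects are still in memory.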

In [8]:
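import statsmodels.api as sm
# NOTE: add_constant returns a new array and the result is discarded here,
# so the model below is fit without an intercept (hence the uncentered R-squared).
sm.add_constant(X_pca)
results = sm.OLS(y, X_pca).fit()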

In [26]:
results.summary()

| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared (uncentered): | 0.135 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.088 |
| Method: | Least Squares | F-statistic: | 2.888 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 3.21e-14 |
| Time: | 11:19:25 | Log-Likelihood: | -19769.0 |
| No. Observations: | 1460 | AIC: | 39690.0 |
| Df Residuals: | 1385 | BIC: | 40080.0 |
| Df Model: | 75 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.529e+04 | 1227.669 | 12.458 | 0.000 | 1.29e+04 | 1.77e+04 |
| x2 | 3213.2527 | 1795.544 | 1.790 | 0.074 | -309.026 | 6735.532 |
| x3 | 1.071e+04 | 2026.164 | 5.286 | 0.000 | 6736.148 | 1.47e+04 |
| x4 | 4570.7889 | 2179.485 | 2.097 | 0.036 | 295.341 | 8846.237 |
| x5 | 104.4221 | 2436.355 | 0.043 | 0.966 | -4674.923 | 4883.767 |
| x6 | 5796.4772 | 2502.834 | 2.316 | 0.021 | 886.722 | 1.07e+04 |
| x7 | -5892.6775 | 2592.880 | -2.273 | 0.023 | -1.1e+04 | -806.281 |
| x8 | 1314.2282 | 2695.137 | 0.488 | 0.626 | -3972.764 | 6601.220 |
| x9 | -1684.9568 | 2701.461 | -0.624 | 0.533 | -6984.353 | 3614.440 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 533.677 | Durbin-Watson: | 0.058 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 80636.802 |
| Skew: | -0.578 | Prob(JB): | 0.0 |
| Kurtosis: | 39.39 | Cond. No. | 3.9 |


In [23]:
results.pvalues[results.pvalues <.1 ]

x1    7.683273e-34
x2    7.374132e-02
x3    1.448916e-07
x4    3.615789e-02
x6    2.070557e-02
x7    2.320069e-02
dtype: float64

In [24]:
# Q: How do you run a PCA that skips a component? A: Just don't.
pca_limited = PCA(n_components=4)
X_pca_limited = pca_limited.fit_transform(X)
sm.add_constant(X_pca_limited)  # result is discarded, so again no intercept is fit
results_limited = sm.OLS(y, X_pca_limited).fit()

In [25]:
results_limited.summary()

| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared (uncentered): | 0.119 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.117 |
| Method: | Least Squares | F-statistic: | 49.22 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 7e-39 |
| Time: | 11:19:01 | Log-Likelihood: | -19782.0 |
| No. Observations: | 1460 | AIC: | 39570.0 |
| Df Residuals: | 1456 | BIC: | 39590.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.529e+04 | 1208.470 | 12.655 | 0.000 | 1.29e+04 | 1.77e+04 |
| x2 | 3213.9114 | 1767.464 | 1.818 | 0.069 | -253.136 | 6680.958 |
| x3 | 1.071e+04 | 1994.479 | 5.370 | 0.000 | 6798.594 | 1.46e+04 |
| x4 | 4587.5557 | 2145.428 | 2.138 | 0.033 | 379.095 | 8796.016 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 506.277 | Durbin-Watson: | 0.094 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 26541.479 |
| Skew: | 0.816 | Prob(JB): | 0.0 |
| Kurtosis: | 23.824 | Cond. No. | 1.78 |


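The coefficients changed slightly, but they are remarkably similar to the original ones! This is expected: principal components are mutually uncorrelated, so dropping some of them should leave the coefficients of the remaining ones essentially unchanged.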
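
The near-identical coefficients can be checked on a small synthetic example. This is only a sketch under assumptions: the data is random, `svd_solver="full"` is set explicitly so that both PCA fits are exact, and the regression is done with `np.linalg.lstsq` without an intercept to mirror `sm.OLS(y, X_pca)`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(400, 8)) @ rng.normal(size=(8, 8))  # correlated features
y_demo = X_demo @ rng.normal(size=8) + rng.normal(size=400)

# With the exact solver, the leading components of both fits coincide
scores_8 = PCA(n_components=8, svd_solver="full").fit_transform(X_demo)
scores_3 = PCA(n_components=3, svd_solver="full").fit_transform(X_demo)

# Least squares without an intercept on each set of scores
coefs_8, *_ = np.linalg.lstsq(scores_8, y_demo, rcond=None)
coefs_3, *_ = np.linalg.lstsq(scores_3, y_demo, rcond=None)

# Score columns are mutually orthogonal, so the Gram matrix is diagonal
gram = scores_8.T @ scores_8
print(np.allclose(gram, np.diag(np.diag(gram)), atol=1e-6))
print(np.allclose(coefs_8[:3], coefs_3))
```

Small differences like the ones between the two summaries in this notebook may come from an approximate SVD solver being selected by default, though that is a guess rather than something verified here.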