## Sale Price Estimation Model Evaluation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

In [2]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
housing_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [3]:
housing_reg_df =  housing_df.copy()
housing_reg_df = pd.concat([housing_reg_df,pd.get_dummies(housing_reg_df.centralair, 
                                                            prefix="centralair", drop_first=True)], axis=1)
housing_reg_df = pd.concat([housing_reg_df,pd.get_dummies(housing_reg_df.mszoning, 
                                                            prefix="mszoning", drop_first=True)], axis=1)
housing_reg_df = pd.concat([housing_reg_df,pd.get_dummies(housing_reg_df.bldgtype, 
                                                            prefix="bldgtype", drop_first=True)], axis=1)
housing_reg_df = pd.concat([housing_reg_df,pd.get_dummies(housing_reg_df.exterqual, 
                                                            prefix="exterqual", drop_first=True)], axis=1)
housing_reg_df = pd.concat([housing_reg_df,pd.get_dummies(housing_reg_df.bsmtqual, 
                                                            prefix="bsmtqual", drop_first=True)], axis=1)
housing_reg_df = pd.concat([housing_reg_df,pd.get_dummies(housing_reg_df.salecondition, 
                                                            prefix="salecondition", drop_first=True)], axis=1)

In [4]:
# Y is the target variable
Y = housing_reg_df['saleprice']
# X is the feature set
X = housing_reg_df[['overallqual', 'grlivarea', 'garagecars', 'totalbsmtsf', 'centralair_Y', 'bldgtype_Duplex',
                     'bldgtype_Twnhs','bldgtype_TwnhsE','exterqual_Gd','exterqual_TA','bsmtqual_Fa','bsmtqual_Gd',
                     'bsmtqual_TA','salecondition_Normal','salecondition_Partial']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.802
Model:,OLS,Adj. R-squared:,0.8
Method:,Least Squares,F-statistic:,391.1
Date:,"Sun, 23 Jun 2019",Prob (F-statistic):,0.0
Time:,20:30:30,Log-Likelihood:,-17360.0
No. Observations:,1460,AIC:,34750.0
Df Residuals:,1444,BIC:,34840.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-7141.4517,9585.443,-0.745,0.456,-2.59e+04,1.17e+04
overallqual,1.671e+04,1174.129,14.231,0.000,1.44e+04,1.9e+04
grlivarea,45.0136,2.375,18.949,0.000,40.354,49.673
garagecars,1.345e+04,1675.413,8.025,0.000,1.02e+04,1.67e+04
totalbsmtsf,29.1075,2.695,10.802,0.000,23.822,34.393
centralair_Y,2.132e+04,4067.847,5.241,0.000,1.33e+04,2.93e+04
bldgtype_Duplex,-2.464e+04,5302.808,-4.646,0.000,-3.5e+04,-1.42e+04
bldgtype_Twnhs,-2.148e+04,5633.936,-3.813,0.000,-3.25e+04,-1.04e+04
bldgtype_TwnhsE,-1.668e+04,3696.526,-4.512,0.000,-2.39e+04,-9427.665

0,1,2,3
Omnibus:,738.694,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,101640.888
Skew:,-1.318,Prob(JB):,0.0
Kurtosis:,43.791,Cond. No.,23100.0


- R-squared and adjusted R-squared of the model are 0.802 and 0.800 respectively.
- F statistic and its associated p-value are 391.1 and 0.00 respectively.
- AIC and BIC of the model are 34750 and 34840 respectively.

According to the R-squared almost 80% of the variance in the target variable is explained by the model. Hence 20% of the variance in the target is not explained. In this sense, there is a large room for improvement.

In order to increase the goodness of fit of our model:
- We first log transform the dependent variable, applying log(1+x) transformation of NumPy.
- Second, create another variable by summing up the basement, first and the second floor areas.
- Third, add the interaction of the total area and the overall quality of the house.

In [7]:


housing_reg_df['totalsf'] = housing_reg_df['totalbsmtsf'] + housing_reg_df['firstflrsf'] + housing_reg_df['secondflrsf']

housing_reg_df['int_over_sf'] = housing_reg_df['totalsf'] * housing_reg_df['overallqual']

# Y is the target variable
Y = np.log1p(housing_reg_df['saleprice'])
# X is the feature set
X = housing_reg_df[['overallqual', 'grlivarea', 'garagecars','overallqual', 'grlivarea', 'garagecars', 'totalbsmtsf', 'centralair_Y', 'bldgtype_Duplex',
                     'bldgtype_Twnhs','bldgtype_TwnhsE','exterqual_Gd','exterqual_TA','bsmtqual_Fa','bsmtqual_Gd',
                     'bsmtqual_TA','salecondition_Normal','salecondition_Partial', 'totalsf', 'int_over_sf']]
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.847
Model:,OLS,Adj. R-squared:,0.845
Method:,Least Squares,F-statistic:,469.2
Date:,"Sun, 23 Jun 2019",Prob (F-statistic):,0.0
Time:,20:37:07,Log-Likelihood:,638.54
No. Observations:,1460,AIC:,-1241.0
Df Residuals:,1442,BIC:,-1146.0
Df Model:,17,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,10.2575,0.054,191.056,0.000,10.152,10.363
overallqual,0.0854,0.004,19.574,0.000,0.077,0.094
grlivarea,-1.646e-05,4.31e-05,-0.382,0.703,-0.000,6.81e-05
garagecars,0.0387,0.004,10.327,0.000,0.031,0.046
overallqual,0.0854,0.004,19.574,0.000,0.077,0.094
grlivarea,-1.646e-05,4.31e-05,-0.382,0.703,-0.000,6.81e-05
garagecars,0.0387,0.004,10.327,0.000,0.031,0.046
totalbsmtsf,-0.0001,8.94e-05,-1.233,0.218,-0.000,6.52e-05
centralair_Y,0.2071,0.018,11.429,0.000,0.172,0.243

0,1,2,3
Omnibus:,422.695,Durbin-Watson:,1.99
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3497.733
Skew:,-1.109,Prob(JB):,0.0
Kurtosis:,10.251,Cond. No.,2.34e+20


- R-squared and adjusted R-squared of the model are 0.847 and 0.845 respectively. These are improvements upon the first model.
- F statistic and its associated p-value are 469.2 and 0.00 respectively. This also indicates that the second model is better than the first one.
- AIC and BIC of the model are -1241 and -1146 respectively. These values are lower than the ones of the first model. Hence, the second model is better than the first model.

#### By the lower AIC and BIC, the greater and significant F-statistic, the second estimation model has better goodness of fit than the first model.