# 19.4 Interpreting Estimated Coefficient Assignment 2
In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

1. Load the houseprices data from Thinkful's database.
2. Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
3. Now, exclude the insignificant features from your model. Did anything change?
4. Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
5. Do the results sound reasonable to you? If not, try to explain the potential reasons.

## 1. Load House Price Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from sklearn import linear_model
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats import bartlett
from scipy.stats import levene
import warnings

%matplotlib inline
warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('select * from houseprices', con=engine)
engine.dispose()

df.head()

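| | id | mssubclass | mszoning | lotfrontage | lotarea | street | alley | lotshape | landcontour | utilities | ... | poolarea | poolqc | fence | miscfeature | miscval | mosold | yrsold | saletype | salecondition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |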


In [3]:
# Target variable
Y = df['saleprice']

# Regression features
X = df[['overallqual', 'totalbsmtsf', 'firstflrsf', 'grlivarea',
      'garagecars', 'garagearea']]

# Linear Regression model object
lrm = linear_model.LinearRegression()
lrm.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', lrm.coef_)
print('\nIntercept: \n', lrm.intercept_)


Coefficients: 
 [2.39970394e+04 2.43907676e+01 1.11859135e+01 4.31228864e+01
 1.45151932e+04 1.56639341e+01]

Intercept: 
 -102650.90069029017


In [4]:
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.762 |
| Model: | OLS | Adj. R-squared: | 0.761 |
| Method: | Least Squares | F-statistic: | 775.0 |
| Date: | Sat, 17 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:01:51 | Log-Likelihood: | -17496.0 |
| No. Observations: | 1460 | AIC: | 35010.0 |
| Df Residuals: | 1453 | BIC: | 35040.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.027e+05 | 4903.994 | -20.932 | 0.000 | -1.12e+05 | -9.3e+04 |
| overallqual | 2.4e+04 | 1083.393 | 22.150 | 0.000 | 2.19e+04 | 2.61e+04 |
| totalbsmtsf | 24.3908 | 4.318 | 5.649 | 0.000 | 15.921 | 32.860 |
| firstflrsf | 11.1859 | 5.032 | 2.223 | 0.026 | 1.315 | 21.057 |
| grlivarea | 43.1229 | 2.679 | 16.095 | 0.000 | 37.867 | 48.379 |
| garagecars | 1.452e+04 | 3018.621 | 4.809 | 0.000 | 8593.872 | 2.04e+04 |
| garagearea | 15.6639 | 10.475 | 1.495 | 0.135 | -4.884 | 36.212 |

| | | | |
|---|---|---|---|
| Omnibus: | 431.781 | Durbin-Watson: | 1.975 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 39208.253 |
| Skew: | -0.313 | Prob(JB): | 0.0 |
| Kurtosis: | 28.38 | Cond. No. | 11400.0 |


## 2. Interpreting the Coefficients
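Five of the six features are statistically significant at the 5% level: `overallqual`, `totalbsmtsf`, `grlivarea`, and `garagecars` have p-values that round to 0.000, and `firstflrsf` has a p-value of 0.026. The only feature that is not significant is `garagearea`, with a p-value of 0.135 and a 95% confidence interval (-4.884 to 36.212) that includes zero. The OLS regression is re-run below without this feature.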
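The significance check described above can be sketched in a few lines. This is only an illustration: the p-values and interval bounds are copied by hand from the first summary table, and the names `summary_rows` and `split_by_significance` are hypothetical helpers, not part of the notebook. In the live notebook the same numbers should be available from `results.pvalues` and `results.conf_int()`.

```python
# p-values and 95% confidence intervals copied from the first OLS summary table.
ALPHA = 0.05  # conventional 5% threshold (an assumption, not a fixed rule)

summary_rows = {
    'overallqual': {'p': 0.000, 'ci': (2.19e+04, 2.61e+04)},
    'totalbsmtsf': {'p': 0.000, 'ci': (15.921, 32.860)},
    'firstflrsf': {'p': 0.026, 'ci': (1.315, 21.057)},
    'grlivarea': {'p': 0.000, 'ci': (37.867, 48.379)},
    'garagecars': {'p': 0.000, 'ci': (8593.872, 2.04e+04)},
    'garagearea': {'p': 0.135, 'ci': (-4.884, 36.212)},
}


def split_by_significance(rows, alpha=ALPHA):
    """Return (significant, insignificant) feature names for a given alpha."""
    significant, insignificant = [], []
    for name, row in rows.items():
        if row['p'] < alpha:
            significant.append(name)
        else:
            insignificant.append(name)
    return significant, insignificant


significant, insignificant = split_by_significance(summary_rows)
print('Significant:', significant)
print('Not significant:', insignificant)

# A 95% interval that straddles zero tells the same story as p > 0.05.
straddles_zero = [
    name for name, row in summary_rows.items()
    if row['ci'][0] < 0 < row['ci'][1]
]
print('Interval includes zero:', straddles_zero)
```

Checking the confidence interval alongside the p-value is a cheap cross-check: both criteria flag the same single feature here.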

## 3. Removing the Insignificant Coefficients

In [5]:
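# X already contains the constant column added in the previous cell,
# so only the insignificant feature needs to be dropped before refitting.
# (The earlier scikit-learn refit was unused and is omitted here.)
X = X.drop('garagearea', axis=1)

results = sm.OLS(Y, X).fit()
results.summary()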

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.762 |
| Model: | OLS | Adj. R-squared: | 0.761 |
| Method: | Least Squares | F-statistic: | 928.8 |
| Date: | Sat, 17 Aug 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:01:51 | Log-Likelihood: | -17497.0 |
| No. Observations: | 1460 | AIC: | 35010.0 |
| Df Residuals: | 1454 | BIC: | 35040.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.03e+05 | 4901.609 | -21.006 | 0.000 | -1.13e+05 | -9.33e+04 |
| overallqual | 2.396e+04 | 1083.493 | 22.109 | 0.000 | 2.18e+04 | 2.61e+04 |
| totalbsmtsf | 25.0167 | 4.299 | 5.819 | 0.000 | 16.583 | 33.450 |
| firstflrsf | 11.6608 | 5.024 | 2.321 | 0.020 | 1.805 | 21.516 |
| grlivarea | 43.2993 | 2.678 | 16.170 | 0.000 | 38.047 | 48.552 |
| garagecars | 1.819e+04 | 1752.914 | 10.377 | 0.000 | 1.48e+04 | 2.16e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 417.21 | Durbin-Watson: | 1.973 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 35788.405 |
| Skew: | -0.254 | Prob(JB): | 0.0 |
| Kurtosis: | 27.25 | Cond. No. | 11100.0 |
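To answer "did anything change?" more concretely, one option is to compare the coefficients of the two fits side by side. The sketch below is an illustration only: the full-model values are copied from the `lrm.coef_` output, the reduced-model values from the summary table above (where the large coefficients are rounded to four significant figures), and `percent_change` is a hypothetical helper.

```python
# Full-model coefficients copied from the lrm.coef_ output in cell 3.
full = {
    'overallqual': 2.39970394e+04,
    'totalbsmtsf': 2.43907676e+01,
    'firstflrsf': 1.11859135e+01,
    'grlivarea': 4.31228864e+01,
    'garagecars': 1.45151932e+04,
}

# Reduced-model coefficients copied from the second summary table
# (rounded as displayed, so the percentages are approximate).
reduced = {
    'overallqual': 2.396e+04,
    'totalbsmtsf': 25.0167,
    'firstflrsf': 11.6608,
    'grlivarea': 43.2993,
    'garagecars': 1.819e+04,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


shifts = {name: percent_change(full[name], reduced[name]) for name in full}

# Print the features ordered by how much their coefficient moved.
for name, pct in sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f'{name:<12} {pct:+6.1f}%')

largest = max(shifts, key=lambda name: abs(shifts[name]))
print('Largest shift:', largest)
```

Under these copied values, `garagecars` moves by roughly a quarter while the other coefficients move by a few percent or less, which is consistent with `garagecars` absorbing the effect previously shared with `garagearea`.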


## 4. Interpreting the Coefficients
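With `garagearea` removed, every remaining feature is significant at the 5% level, while R-squared (0.762) and adjusted R-squared (0.761) are unchanged, so the dropped feature added essentially no explanatory power. The biggest change is to `garagecars`: its coefficient rose from about 14,500 to about 18,200 and its standard error fell from 3018.621 to 1752.914, which is what one would expect if `garagecars` and `garagearea` carry overlapping information. Holding the other features constant, each additional point of `overallqual` is associated with a sale price about $23,960 higher, and each additional car of garage capacity with about $18,190 more. The area features are priced per square foot: about $43.30 for `grlivarea`, $25.02 for `totalbsmtsf`, and $11.66 for `firstflrsf`. Per unit, the features with the most prominent effect on house prices are `overallqual` and `garagecars`, although a one-point or one-car change is not directly comparable to a one-square-foot change.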
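Because the features are measured in different units, it can help to translate each coefficient into a price change for a concrete increment. The following is a minimal sketch using the reduced-model coefficients as displayed in the summary table; the increments in `changes` (one quality point, one garage space, 500 square feet) are hypothetical choices for illustration, and `price_change` is not part of the notebook.

```python
# Reduced-model coefficients copied from the second summary table.
coefs = {
    'overallqual': 2.396e+04,
    'totalbsmtsf': 25.0167,
    'firstflrsf': 11.6608,
    'grlivarea': 43.2993,
    'garagecars': 1.819e+04,
}


def price_change(feature, delta):
    """Expected change in sale price when `feature` changes by `delta`,
    holding every other feature in the model fixed."""
    return coefs[feature] * delta


# Hypothetical increments chosen only to put the features on a comparable footing.
changes = {
    'overallqual': 1,    # one point on the quality scale
    'garagecars': 1,     # one additional car of garage capacity
    'grlivarea': 500,    # 500 additional square feet of living area
    'totalbsmtsf': 500,  # 500 additional square feet of basement
    'firstflrsf': 500,   # 500 additional square feet on the first floor
}

effects = {name: price_change(name, delta) for name, delta in changes.items()}
for name, value in sorted(effects.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<12} +{changes[name]:<4} -> {value:>10,.0f}')
```

With these increments, 500 extra square feet of living area is worth about 21,650, which is close to the 23,960 associated with a single quality point. The ranking therefore depends on what counts as a meaningful change in each feature.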

## 5. Conclusion
The results of the OLS regression seem reasonable. It was surprising to see that square footage is not the biggest factor in house prices. The area is the college town that is home to Iowa State University. Most of the owners rent their houses to college students, so it makes sense that the quality of the house and the amount of parking space would be key factors in prices.