# Interpreting Estimated Coefficients - House Price Model
In this exercise, we'll work with the housing price data from the previous checkpoint. 

## Load the dataset from Thinkful's database

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import os
import matplotlib.pyplot as plt
%matplotlib inline
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

# use the credentials to start a connection
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

# Use the connection to extract SQL data
house_price = pd.read_sql_query('SELECT * FROM houseprices', con=engine)

#Close the connection after query is complete
engine.dispose()

## Clean and transform the data

In [3]:
#Drop the 19 features with the largest share of missing values, plus the 'id' column
drop_list = list((house_price.isnull().sum()/house_price.isnull().count()).sort_values(ascending=False).head(19).index)
drop_list.append('id')

house_price = house_price.drop(drop_list, axis=1)

#List of features that are string categoricals
str_cat_cols = list(house_price.describe(include=['O']).columns)

#Uniques within each variable
uniques = pd.DataFrame()
uni_col = []
num_uni = []
avgdiff_uni = []

for col in list(house_price.columns):
    uni_col.append(list(np.unique(house_price[col])))
    num_uni.append(len(np.unique(house_price[col])))
    try:
        avgdiff_uni.append(np.mean(np.diff(np.unique(house_price[col]))))
    except TypeError:
        #Differences are undefined for string columns
        avgdiff_uni.append('N/A')
    
uniques['Category'] = list(house_price.columns)
uniques['Unique Values'] = uni_col
uniques['Num Uniques'] = num_uni
uniques['Avg Diff Among Uniques'] = avgdiff_uni


#List of features that are numerical categoricals
#If a numerical variable is categorical, its unique values will tend to be close to each other and there shouldn't
#be too many unique values 
num_cat_cols = []
for col in list(house_price.columns):
    if col not in str_cat_cols:   
        row = uniques[uniques['Category'] == col].iloc[0]
        avg_diff = row['Avg Diff Among Uniques']
        num_unique = row['Num Uniques']
        if avg_diff < 2 or (num_unique < 20 and avg_diff < 20):
            num_cat_cols.append(col)
            
#List of features that are numerical continuous
cont_cols = []
for col in list(house_price.columns):
    if ((col not in str_cat_cols) and (col not in num_cat_cols)):
        cont_cols.append(col)
        
#Create new dataframe containing features of interest
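#.copy() gives an independent dataframe, so the column assignments below do not write to a view of house_price
sale_df = house_price[['saleprice', 'grlivarea', 'garagearea', 'totalbsmtsf', 'overallqual', 'paveddrive', 'centralair', 'yearremodadd']].copy()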

#Convert string categoricals ('paveddrive' and 'centralair') to integer codes
#'paveddrive' becomes an ordinal code (N=0, P=1, Y=2); 'centralair' becomes a 0/1 dummy (N=0, Y=1)
sale_df['paveddrive'] = sale_df['paveddrive'].map({'N': 0, 'P': 1, 'Y': 2})

sale_df['centralair'] = sale_df['centralair'].map({'N': 0, 'Y': 1})
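
The integer codes for 'paveddrive' (N=0, P=1, Y=2) implicitly assume that moving from N to P is worth the same as moving from P to Y. A common alternative is one-hot encoding, where each level gets its own 0/1 column. The sketch below uses a small made-up dataframe (`toy` is hypothetical, not the house price data) to show what that would look like with `pd.get_dummies`; it is an illustration, not what the model in this notebook uses.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the houseprices table)
toy = pd.DataFrame({
    'paveddrive': ['Y', 'N', 'P', 'Y'],
    'centralair': ['Y', 'N', 'Y', 'Y'],
})

# One 0/1 column per level; drop_first=True drops the first level ('N')
# of each variable so the dummies are not collinear with the constant.
encoded = pd.get_dummies(toy, columns=['paveddrive', 'centralair'],
                         drop_first=True, dtype=int)
print(encoded)
```

With this encoding, each dummy coefficient would be read as the price difference relative to the dropped 'N' level.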

## Build a linear regression model where your target variable is saleprice. 

In [5]:
target_var = 'saleprice'
feature_set = list(sale_df.columns.drop('saleprice'))

# X is the feature set 
X = sale_df[feature_set]
# Y is the target variable
Y = sale_df[target_var]

# We add a constant column so that the model estimates an intercept
# (statsmodels OLS does not include one by default)
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.765
Method:                 Least Squares   F-statistic:                     680.2
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        14:11:56   Log-Likelihood:                -17483.
No. Observations:                1460   AIC:                         3.498e+04
Df Residuals:                    1452   BIC:                         3.502e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -8.876e+05   1.16e+05     -7.684   

In [6]:
print('The estimated model is:')

# Look up coefficients by label rather than by position
equation = str(round(results.params['const'], 2))
for col in feature_set:
    equation = equation + ' + ' + str(round(results.params[col], 2)) + '(' + col + ')'

print('Sale Price ($) = ', equation)

The estimated model is:
Sale Price ($) =  -887633.97 + 47.8(grlivarea) + 51.07(garagearea) + 29.89(totalbsmtsf) + 21257.74(overallqual) + 3900.3(paveddrive) + 4995.45(centralair) + 402.44(yearremodadd)


According to the OLS summary, only 'paveddrive' and 'centralair' have p-values larger than 0.05, so those two features are not statistically significant at the 5% level, while the other five are. Their coefficients (3900.3 and 4995.45) look large next to the per-square-foot coefficients, but that mostly reflects units: these variables only take the values 0 to 2, whereas the area features range over hundreds or thousands of square feet, so a large coefficient here does not by itself mean a large effect on the model. All of the coefficients are positive, which makes sense: one would expect larger square footage, higher overall quality, and more recent remodels to increase the value of a house in the eyes of potential buyers.

## Now, exclude the insignificant features from your model. Did anything change?

In [7]:
target_var = 'saleprice'
feature_set = list(sale_df.columns.drop(['saleprice', 'paveddrive', 'centralair']))

# X is the feature set 
X = sale_df[feature_set]
# Y is the target variable
Y = sale_df[target_var]

# We add a constant column so that the model estimates an intercept
# (statsmodels OLS does not include one by default)
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.765
Method:                 Least Squares   F-statistic:                     948.6
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        14:28:17   Log-Likelihood:                -17486.
No. Observations:                1460   AIC:                         3.498e+04
Df Residuals:                    1454   BIC:                         3.502e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -9.076e+05   1.14e+05     -7.942   

In [8]:
print('The new estimated model is:')

# Look up coefficients by label rather than by position
equation = str(round(results.params['const'], 2))
for col in feature_set:
    equation = equation + ' + ' + str(round(results.params[col], 2)) + '(' + col + ')'

print('Sale Price ($) = ', equation)

The new estimated model is:
Sale Price ($) =  -907645.92 + 46.93(grlivarea) + 53.72(garagearea) + 30.54(totalbsmtsf) + 21552.9(overallqual) + 417.3(yearremodadd)


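Not much changes when the insignificant features are removed. R-squared moves from 0.766 to 0.765 and adjusted R-squared stays at 0.765, so the reduced model explains essentially the same share of the variance with two fewer features. The constant becomes more negative (from -887633.97 to -907645.92), while the remaining coefficients keep their positive signs and similar values (for example, grlivarea goes from 47.8 to 46.93 and overallqual from 21257.74 to 21552.9).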
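
To make the reduced model concrete, the sketch below plugs a hypothetical house into the printed equation. The coefficients are copied from the output above; the house itself (1,500 sq ft of living area and so on) is made up for illustration.

```python
# Coefficients copied from the printed reduced model
intercept = -907645.92
coefs = {
    'grlivarea': 46.93,
    'garagearea': 53.72,
    'totalbsmtsf': 30.54,
    'overallqual': 21552.9,
    'yearremodadd': 417.3,
}

def predict_price(house):
    """Evaluate the fitted linear equation for one house (a dict of feature values)."""
    return intercept + sum(coefs[name] * house[name] for name in coefs)

# Hypothetical house, not a row from the dataset
house = {
    'grlivarea': 1500,
    'garagearea': 480,
    'totalbsmtsf': 1000,
    'overallqual': 6,
    'yearremodadd': 2000,
}

base = predict_price(house)
# Same house with overall quality raised by one point, everything else fixed
better = predict_price({**house, 'overallqual': 7})

print('Predicted price:', round(base, 2))
print('Change from +1 overallqual:', round(better - base, 2))
```

The predicted price comes out at roughly $182,992, and the one-point quality increase changes the prediction by exactly the overallqual coefficient, which is what "holding the other variables constant" means in practice.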

### Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?

In [9]:
feature_set = list(sale_df.columns.drop(['paveddrive', 'centralair']))
sale_df[feature_set].corr()['saleprice']

saleprice       1.000000
grlivarea       0.708624
garagearea      0.623431
totalbsmtsf     0.613581
overallqual     0.790982
yearremodadd    0.507101
Name: saleprice, dtype: float64

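Holding the other features constant, the reduced model estimates that each additional square foot of above-ground living area (grlivarea) is associated with about $46.93 more in sale price, each square foot of garage area with about $53.72, and each square foot of basement with about $30.54. A one-point increase in overall quality is associated with about $21,552.90, and each later year of remodeling with about $417.30. Ranked by raw coefficient, the order from highest to lowest is overallqual, yearremodadd, garagearea, grlivarea, and totalbsmtsf, but the features are measured in different units (rating points, years, square feet), so this ordering is only a rough guide. The correlations above give a unit-free comparison and order the features as overallqual (0.79), grlivarea (0.71), garagearea (0.62), totalbsmtsf (0.61), and yearremodadd (0.51). Both views put overall quality first, so the sale price of a house depends most strongly on its overall quality, and also on its size and on how recently it was remodeled, which makes sense, as home-buyers generally will pay more for larger, high-quality, recently renovated buildings.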
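
Raw coefficients depend on the units of each feature, which is why they are hard to compare directly. The sketch below illustrates this with a one-feature regression on made-up numbers (not the house price data): expressing area in hundreds of square feet multiplies the slope by 100, while the standardized slope, computed as slope × sd(x) / sd(y), stays the same. The helper names `ols_slope` and `standardized` are hypothetical.

```python
from statistics import mean, pstdev

def ols_slope(x, y):
    """Least-squares slope of y on a single feature x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def standardized(slope, x, y):
    """Slope in standard-deviation units: sd(y) change per one sd(x) change."""
    return slope * pstdev(x) / pstdev(y)

# Made-up values for illustration only
area_sqft = [900, 1200, 1500, 1800, 2400]
price = [110000, 150000, 170000, 210000, 260000]
area_hundreds = [a / 100 for a in area_sqft]  # same data, different unit

slope_sqft = ols_slope(area_sqft, price)
slope_hundreds = ols_slope(area_hundreds, price)

print('Slope per sq ft:        ', round(slope_sqft, 2))
print('Slope per 100 sq ft:    ', round(slope_hundreds, 2))
print('Standardized (sq ft):   ', round(standardized(slope_sqft, area_sqft, price), 4))
print('Standardized (100 sqft):', round(standardized(slope_hundreds, area_hundreds, price), 4))
```

The same rescaling could be applied to the fitted model by multiplying each coefficient by the standard deviation of its feature and dividing by the standard deviation of saleprice, which would put all five features on a comparable scale.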