# Final Model Construction

This notebook builds a multiple linear regression model that predicts housing prices in King County. The inputs were cleaned in preceding notebooks and saved as separate CSV files; here they are loaded and combined to fit the first model. Further iterations follow in later subsections of this notebook.

### Import Packages

In [1]:
import scipy.stats as stats
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import median_absolute_error, mean_squared_error 
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression


### Import and Combine Cleaned Data

In [2]:
categorical_ohe = pd.read_csv("./data/cat_hot_dataframe")
categorical_ordinal = pd.read_csv("./data/cat_ordinal_dataframe")
numeric = pd.read_csv("./data/initial_numeric_inputs")
target = pd.read_csv("./data/house_price_target_natlog")

### Construct Model

In [3]:
all_predictors = pd.concat([numeric, categorical_ohe], axis=1)
exog = sm.add_constant(all_predictors)
endog = target

# Keep the fitted results object so its attributes (params, resid, ...) stay available
first_model = sm.OLS(endog, exog).fit()
first_model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | ln_price | R-squared: | 0.648 |
| Model: | OLS | Adj. R-squared: | 0.648 |
| Method: | Least Squares | F-statistic: | 1655.0 |
| Date: | Tue, 25 Oct 2022 | Prob (F-statistic): | 0.0 |
| Time: | 18:28:22 | Log-Likelihood: | -5517.1 |
| No. Observations: | 21597 | AIC: | 11080.0 |
| Df Residuals: | 21572 | BIC: | 11280.0 |
| Df Model: | 24 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.3377 | 0.090 | 103.931 | 0.000 | 9.162 | 9.514 |
| bedrooms | -0.0374 | 0.003 | -12.215 | 0.000 | -0.043 | -0.031 |
| bathrooms | 0.0871 | 0.005 | 17.387 | 0.000 | 0.077 | 0.097 |
| ln_sqft_living | 0.4146 | 0.014 | 29.751 | 0.000 | 0.387 | 0.442 |
| sqft_lot | 2.654e-07 | 7.41e-08 | 3.581 | 0.000 | 1.2e-07 | 4.11e-07 |
| floors | 0.1163 | 0.006 | 20.768 | 0.000 | 0.105 | 0.127 |
| ln_sqft_above | -0.1776 | 0.012 | -14.554 | 0.000 | -0.202 | -0.154 |
| age | 0.0062 | 9.9e-05 | 62.310 | 0.000 | 0.006 | 0.006 |
| renovated | 0.0115 | 0.005 | 2.192 | 0.028 | 0.001 | 0.022 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 58.161 | Durbin-Watson: | 1.964 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 75.08 |
| Skew: | -0.008 | Prob(JB): | 4.97e-17 |
| Kurtosis: | 3.288 | Cond. No. | 7440000.0 |
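
Because the dependent variable is `ln_price`, each coefficient acts multiplicatively on price rather than additively. The sketch below converts two coefficients from the summary into approximate percentage effects, holding the other predictors fixed. The helper function names are hypothetical and only serve this illustration.

```python
import math

# Coefficients copied from the summary table (target is ln_price).
coef_bathrooms = 0.0871       # predictor entered in levels
coef_ln_sqft_living = 0.4146  # predictor entered as a natural log


def pct_change_per_unit(coef):
    """Percent change in price for a one-unit increase in a level predictor."""
    return (math.exp(coef) - 1) * 100


def pct_change_for_log_predictor(coef, pct_increase):
    """Percent change in price when a logged predictor rises by pct_increase percent."""
    return ((1 + pct_increase / 100) ** coef - 1) * 100


bath_effect = pct_change_per_unit(coef_bathrooms)
sqft_effect = pct_change_for_log_predictor(coef_ln_sqft_living, 10)

print(f"One extra bathroom: about {bath_effect:.1f}% higher price")
print(f"10% more living area: about {sqft_effect:.1f}% higher price")
```

For a log-log pair such as `ln_sqft_living` and `ln_price`, the coefficient can be read roughly as an elasticity.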


### Evaluate Model Performance

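(1) The Jarque-Bera statistic (75.08) has a p-value of 4.97e-17, so the residuals are not exactly normally distributed. Skew is close to zero (-0.008), so the departure comes mostly from slightly heavy tails (kurtosis of 3.288).
(2) The condition number (7.44e+06) is very large, which points to strong multicollinearity, features on very different scales, or both.
(3) The R-squared of 0.648 is decent, but we will try to improve it.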
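
The Jarque-Bera statistic combines sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4). As a rough check, the sketch below plugs in the rounded values from the summary to see which term contributes more. Because the inputs are rounded, the result only approximates the reported 75.08.

```python
# Values copied (rounded) from the regression summary.
n_obs = 21597
skew = -0.008
kurtosis = 3.288

# JB = n/6 * (S^2 + (K - 3)^2 / 4), split into its two contributions.
skew_term = n_obs / 6 * skew ** 2
kurtosis_term = n_obs / 6 * (kurtosis - 3) ** 2 / 4
jb_approx = skew_term + kurtosis_term

print(f"skew contribution:     {skew_term:.2f}")
print(f"kurtosis contribution: {kurtosis_term:.2f}")
print(f"approximate JB:        {jb_approx:.2f}  (reported: 75.08)")
```

Nearly all of the statistic comes from the kurtosis term, which suggests that the residuals' tails, not their asymmetry, drive the rejection of normality.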


### Assess Potential Model Improvements

(1) Scale the features so they are on a comparable scale. This makes coefficient magnitudes comparable and should help reduce the condition number.
(2) Assess the features for collinearity and remove them as needed.
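
One common way to assess collinearity is the variance inflation factor (VIF). The sketch below is a minimal NumPy version run on synthetic stand-in data, not on the King County features; the `vif` helper and the column names `a`, `b`, `c` are hypothetical. statsmodels also ships `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which could be used instead.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ beta
        r_squared = 1 - residuals.var() / y.var()
        factors.append(1 / (1 - r_squared))
    return np.array(factors)


# Synthetic stand-in data (assumption: not the notebook's real features).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # independent of a and b

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(1))))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that a feature is largely explained by the others; here `a` and `b` should stand out while `c` stays near 1.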


# 2nd Model

In [4]:
ss = StandardScaler()
ss.fit(numeric)
num_scaled = ss.transform(numeric)
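
`StandardScaler.transform` returns a plain NumPy array, so the column names are lost. If the scaled features are to be passed back into `sm.OLS` with readable coefficient labels, one option is to wrap the array in a DataFrame again. The sketch below shows this on a small hypothetical stand-in frame (`numeric_demo`), not on the notebook's actual `numeric` data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the `numeric` frame loaded earlier.
numeric_demo = pd.DataFrame({
    "bedrooms": [2, 3, 4, 3],
    "sqft_lot": [5000, 7200, 10000, 6500],
})

ss_demo = StandardScaler()
scaled_values = ss_demo.fit_transform(numeric_demo)  # plain NumPy array

# Restore column names and the original index.
num_scaled_demo = pd.DataFrame(
    scaled_values,
    columns=numeric_demo.columns,
    index=numeric_demo.index,
)
print(num_scaled_demo.round(3))
```

Keeping the original index matters when the scaled frame is later concatenated with other frames along `axis=1`, because `pd.concat` aligns rows by index.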