# Some potential audiences are:

1. Homeowners who want to increase the sale price of their homes through home improvement projects
2. Advocacy groups who want to promote affordable housing
3. Local elected officials who want to understand how their policy ideas (e.g. zoning changes, permitting) might impact home prices
4. Real estate investors looking for potential "fixer-uppers" or "tear-downs"

# Three things to be sure you establish during this phase are:

1. **Objectives:** what questions are you trying to answer, and for whom?
2. **Project plan:** you may want to establish more formal project management practices, such as daily stand-ups or using a Trello board, to plan the time you have remaining. Regardless, you should determine the division of labor, communication expectations, and timeline.
3. **Success criteria:** what does a successful project look like? How will you know when you have achieved it?

# READ THIS: Import the following data files from https://info.kingcounty.gov/assessor/DataDownload/default.aspx
## Download the files to local repo data directory
> 1) Real Property Sales (.ZIP, csv) <BR>
> 2) Parcel (.ZIP, csv) <BR>
> 3) Residential Building (.ZIP, csv) <BR>
> 4) Unit Breakdown (.ZIP)<BR>


In [35]:
import os
import sys

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import linear_rainbow, het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

from eda.eda import *
from lr_model.build_lr import *

### Andrew's scratchwork below:
____

In [36]:
df_merged = consolidate_data(year=2019, create=True)
cols = list(df_merged.columns)
# cols = cols[2:4] + cols[6:7] + cols[10:11] + cols[27:29] + cols[35:36] + cols[43:44] + cols[48:50] 
cols = cols[2:4] +  cols[6:7] +  cols[10:11] + cols[27:29] + cols[35:36] + cols[43:44] + cols[48:50] 
df = df_merged[cols]
df.isna().sum()

df_merged.to_csv ('~\Downloads\test.csv', index = False, header=True)

Done eading Sales data.... (41818, 6)
Before EXTR_Parcel.csv:  (616089, 81)
After filtering KING county rows (103217, 27)
Filtering Residential and Condo data.... (98156, 27)
After reading EXTR_ResBldg.csv:  (517554, 30)
Done reading EXTR_LookUP.csv:  (1208, 3)
Merging....
After Merging files.csv:  (98156, 26)
Created merged file...s
Merging....Done


OSError: [Errno 22] Invalid argument: 'C:\\Users\\awyeh\\Downloads\test.csv'

In [37]:
df = df_merged.copy()
df = df.dropna().reset_index(drop = True)
# df.drop(columns = ['DistrictName', 'PropType'], inplace = True)
print(df.shape)
df.head()


(3884, 58)


Unnamed: 0,Merged_Key,DocumentDate,SalePrice,PropertyType,PrincipalUse,PropertyClass,Area,SubArea,SqFtLot,WaterSystem,...,SqFtDeck,HeatSystem,Bedrooms,BathHalfCount,Bath3qtrCount,BathFullCount,FpSingleStory,FpMultiStory,YrRenovated,PcntComplete
0,98400000450,2019,409950,11,6,8,51.0,6.0,7875.0,2.0,...,140.0,5.0,3.0,1.0,0.0,2.0,0.0,1.0,0.0,0.0
1,797320002320,2019,540000,3,6,8,23.0,4.0,8621.0,2.0,...,0.0,5.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
2,82607009096,2019,930000,11,6,8,70.0,3.0,212911.0,1.0,...,0.0,5.0,3.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0
3,410200000075,2019,379950,11,6,8,40.0,9.0,14149.0,1.0,...,520.0,4.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,868229001120,2019,620000,14,6,8,95.0,10.0,4046.0,2.0,...,0.0,5.0,3.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0


In [48]:
col = []
dic = df_merged.corr()['SalePrice'].to_dict()
for x in dic:
    if dic[x] >= 0.10:
        col.append(x)
        print(x)

SalePrice
NbrLivingUnits
Stories
SqFt1stFloor
SqFt2ndFloor
SqFtTotLiving
SqFtGarageAttached
SqFtOpenPorch
Bedrooms
BathFullCount


In [49]:
# # create a smaller df to save space and processing power
fsm_df = df_merged[col].copy()
fsm_df.dropna(inplace=True)
fsm = ols(formula="SalePrice ~ NbrLivingUnits + Stories + SqFt1stFloor + SqFt2ndFloor + SqFtTotLiving + SqFtGarageAttached + SqFtOpenPorch + Bedrooms + BathFullCount", data=fsm_df)
fsm_results = fsm.fit()
fsm_results.summary()

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.101
Model:,OLS,Adj. R-squared:,0.1
Method:,Least Squares,F-statistic:,375.7
Date:,"Tue, 29 Sep 2020",Prob (F-statistic):,0.0
Time:,10:47:51,Log-Likelihood:,-467020.0
No. Observations:,30199,AIC:,934100.0
Df Residuals:,30189,BIC:,934100.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.309e+06,6.18e+04,-21.175,0.000,-1.43e+06,-1.19e+06
NbrLivingUnits,1.446e+06,4.31e+04,33.525,0.000,1.36e+06,1.53e+06
Stories,1.745e+05,2.29e+04,7.606,0.000,1.3e+05,2.19e+05
SqFt1stFloor,64.3776,26.880,2.395,0.017,11.692,117.063
SqFt2ndFloor,-101.9707,23.247,-4.386,0.000,-147.536,-56.405
SqFtTotLiving,380.9222,18.034,21.122,0.000,345.574,416.270
SqFtGarageAttached,12.1660,32.655,0.373,0.709,-51.839,76.171
SqFtOpenPorch,307.7534,63.810,4.823,0.000,182.683,432.824
Bedrooms,-1.217e+05,9962.601,-12.217,0.000,-1.41e+05,-1.02e+05

0,1,2,3
Omnibus:,57055.139,Durbin-Watson:,0.757
Prob(Omnibus):,0.0,Jarque-Bera (JB):,152232768.592
Skew:,14.418,Prob(JB):,0.0
Kurtosis:,349.63,Cond. No.,27500.0


In [16]:
# # create a smaller df to save space and processing power
fsm_df = df_merged[['SalePrice', 'SqFt1stFloor', 'SqFt2ndFloor', 'SqFtTotLiving', 'SqFtGarageAttached','SqFtGarageAttached','SqFtOpenPorch','SqFtEnclosedPorch','Bedrooms','BathHalfCount','BathFullCount']].copy()
fsm_df.dropna(inplace=True)
fsm = ols(formula="SalePrice ~ SqFt1stFloor + SqFt2ndFloor + SqFtTotLiving + SqFtGarageAttached + SqFtGarageAttached + SqFtOpenPorch + SqFtEnclosedPorch + Bedrooms + BathHalfCount + BathFullCount", data=fsm_df)
fsm_results = fsm.fit()
fsm_results.summary()

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.066
Model:,OLS,Adj. R-squared:,0.066
Method:,Least Squares,F-statistic:,236.9
Date:,"Tue, 29 Sep 2020",Prob (F-statistic):,0.0
Time:,10:28:34,Log-Likelihood:,-467590.0
No. Observations:,30199,AIC:,935200.0
Df Residuals:,30189,BIC:,935300.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.001e+05,3.05e+04,9.826,0.000,2.4e+05,3.6e+05
SqFt1stFloor,6.5141,25.506,0.255,0.798,-43.480,56.508
SqFt2ndFloor,-19.5031,20.158,-0.968,0.333,-59.013,20.007
SqFtTotLiving,371.5526,18.300,20.303,0.000,335.684,407.421
SqFtGarageAttached[0],-30.1488,16.653,-1.810,0.070,-62.789,2.492
SqFtGarageAttached[1],-30.1488,16.653,-1.810,0.070,-62.789,2.492
SqFtOpenPorch,287.7428,65.025,4.425,0.000,160.291,415.194
SqFtEnclosedPorch,884.2497,204.802,4.318,0.000,482.830,1285.669
Bedrooms,-8.779e+04,1.01e+04,-8.713,0.000,-1.08e+05,-6.8e+04

0,1,2,3
Omnibus:,58498.659,Durbin-Watson:,0.671
Prob(Omnibus):,0.0,Jarque-Bera (JB):,155297047.098
Skew:,15.297,Prob(JB):,0.0
Kurtosis:,352.976,Cond. No.,5.25e+16
