# Some potential audiences are:

1. Homeowners who want to increase the sale price of their homes through home improvement projects
2. Advocacy groups who want to promote affordable housing
3. Local elected officials who want to understand how their policy ideas (e.g. zoning changes, permitting) might impact home prices
4. Real estate investors looking for potential "fixer-uppers" or "tear-downs"

# Three things to be sure you establish during this phase are:

1. **Objectives:** what questions are you trying to answer, and for whom?
2. **Project plan:** you may want to establish more formal project management practices, such as daily stand-ups or using a Trello board, to plan the time you have remaining. Regardless, you should determine the division of labor, communication expectations, and timeline.
3. **Success criteria:** what does a successful project look like? How will you know when you have achieved it?

# READ THIS: Import the following data files from https://info.kingcounty.gov/assessor/DataDownload/default.aspx
## Download the files to local repo data directory
> 1) Real Property Sales (.ZIP, csv) <BR>
> 2) Parcel (.ZIP, csv) <BR>
> 3) Residential Building (.ZIP, csv) <BR>
> 4) Unit Breakdown (.ZIP)<BR>


In [1]:
import os
import sys

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import linear_rainbow, het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

from eda.eda import *
from lr_model.build_lr import *

### Andrew's scratchwork below:
____

In [2]:
df_merged = consolidate_data(year=2019, create=True)
cols = list(df_merged.columns)
# cols = cols[2:4] + cols[6:7] + cols[10:11] + cols[27:29] + cols[35:36] + cols[43:44] + cols[48:50] 
cols = cols[2:4] +  cols[6:7] +  cols[10:11] + cols[27:29] + cols[35:36] + cols[43:44] + cols[48:50] 
df = df_merged[cols]
df.isna().sum()

df_merged.to_csv ('~\Downloads\test.csv', index = False, header=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Done eading Sales data.... (41818, 6)
Before EXTR_Parcel.csv:  (616089, 81)
After filtering KING county rows (103217, 27)
Filtering Residential and Condo data.... (98156, 27)
After reading EXTR_ResBldg.csv:  (517554, 30)
Done reading EXTR_LookUP.csv:  (1208, 3)
Merging....
After Merging files.csv:  (98156, 26)
Created merged file...s
Merging....Done


OSError: [Errno 22] Invalid argument: 'C:\\Users\\awyeh\\Downloads\test.csv'

In [36]:
df = df_merged.copy()
df = df.dropna().reset_index(drop = True)
df.drop(columns = ['DocumentDate', 'DistrictName', 'Address', 'Merged_Key', 'PropertyType'], inplace = True)
print(df.shape)
df.head()


(3884, 55)


Unnamed: 0,SalePrice,PrincipalUse,PropertyClass,PropType,Area,SubArea,SqFtLot,WaterSystem,SewerSystem,Access,...,SqFtDeck,HeatSystem,Bedrooms,BathHalfCount,Bath3qtrCount,BathFullCount,FpSingleStory,FpMultiStory,YrRenovated,PcntComplete
0,409950,6,8,R,51.0,6.0,7875.0,2.0,2.0,4.0,...,140.0,5.0,3.0,1.0,0.0,2.0,0.0,1.0,0.0,0.0
1,540000,6,8,R,23.0,4.0,8621.0,2.0,2.0,4.0,...,0.0,5.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
2,930000,6,8,R,70.0,3.0,212911.0,1.0,1.0,1.0,...,0.0,5.0,3.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0
3,379950,6,8,R,40.0,9.0,14149.0,1.0,1.0,4.0,...,520.0,4.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,620000,6,8,R,95.0,10.0,4046.0,2.0,2.0,4.0,...,0.0,5.0,3.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3884 entries, 0 to 3883
Data columns (total 55 columns):
SalePrice                 3884 non-null int64
PrincipalUse              3884 non-null int64
PropertyClass             3884 non-null int64
PropType                  3884 non-null object
Area                      3884 non-null float64
SubArea                   3884 non-null float64
SqFtLot                   3884 non-null float64
WaterSystem               3884 non-null float64
SewerSystem               3884 non-null float64
Access                    3884 non-null float64
SeattleSkyline            3884 non-null float64
LakeWashington            3884 non-null float64
LakeSammamish             3884 non-null float64
SmallLakeRiverCreek       3884 non-null float64
OtherView                 3884 non-null float64
WfntLocation              3884 non-null float64
WfntFootage               3884 non-null float64
WfntBank                  3884 non-null float64
WfntPoorQuality           3884 non-n

In [38]:
df.corr()

Unnamed: 0,SalePrice,PrincipalUse,PropertyClass,Area,SubArea,SqFtLot,WaterSystem,SewerSystem,Access,SeattleSkyline,...,SqFtDeck,HeatSystem,Bedrooms,BathHalfCount,Bath3qtrCount,BathFullCount,FpSingleStory,FpMultiStory,YrRenovated,PcntComplete
SalePrice,1.0,0.015982,-0.060807,0.082551,0.00907,-0.066596,-0.009903,-0.048746,-0.050652,0.001672,...,0.074741,0.080091,0.106218,0.113627,0.036762,0.175239,0.073894,0.088082,-0.005312,-0.013907
PrincipalUse,0.015982,1.0,-0.333725,-0.016969,0.02812,-0.003165,0.01123,0.008502,0.00849,-0.000364,...,-0.015732,0.011709,-0.018247,0.001389,-0.018158,-0.01612,0.012974,-0.013239,-0.005288,-0.000785
PropertyClass,-0.060807,-0.333725,1.0,0.001053,0.03599,0.005296,0.007609,0.017246,-0.006798,0.000833,...,0.009775,-0.017378,0.011313,0.030645,-0.029856,0.012483,-0.01003,-0.000211,0.017821,-0.16837
Area,0.082551,-0.016969,0.001053,1.0,0.043277,0.172436,-0.348267,-0.456295,-0.135217,0.001348,...,0.123736,-0.125731,-0.196714,0.071476,-0.043885,0.055523,0.03799,0.031162,0.058255,0.019493
SubArea,0.00907,0.02812,0.03599,0.043277,1.0,-0.051866,0.100274,0.029407,0.045362,-0.018902,...,-0.019356,0.031184,0.03833,0.101288,-0.044598,0.086381,0.064035,0.011829,-0.018884,-0.026043
SqFtLot,-0.066596,-0.003165,0.005296,0.172436,-0.051866,1.0,-0.515045,-0.34922,0.039697,-0.002387,...,-0.081998,-0.484187,-0.432102,-0.109309,-0.103091,-0.265125,-0.004351,-0.07465,-0.02435,-0.003923
WaterSystem,-0.009903,0.01123,0.007609,-0.348267,0.100274,-0.515045,1.0,0.451093,0.232462,0.00794,...,-0.080622,0.271566,0.300557,0.091833,0.039588,0.161226,0.087418,0.078535,-0.060439,-0.031428
SewerSystem,-0.048746,0.008502,0.017246,-0.456295,0.029407,-0.34922,0.451093,1.0,0.201584,0.022046,...,-0.184354,0.1738,0.203807,0.023998,-0.001605,0.11631,0.048915,-0.068877,-0.058989,-0.056486
Access,-0.050652,0.00849,-0.006798,-0.135217,0.045362,0.039697,0.232462,0.201584,1.0,0.006003,...,-0.140421,-0.069822,0.001912,-0.071928,-0.084175,-0.052999,0.04101,-0.027839,-0.052267,-0.015798
SeattleSkyline,0.001672,-0.000364,0.000833,0.001348,-0.018902,-0.002387,0.00794,0.022046,0.006003,1.0,...,0.03086,0.027096,0.01171,-0.013798,0.013011,0.03276,0.034069,-0.009361,-0.003739,-0.000555


In [39]:
col = []
dic = df_merged.corr()['SalePrice'].to_dict()
for x in dic:
    if dic[x] >= abs(0.10):
        col.append(x)
        print(x)

SalePrice
NbrLivingUnits
Stories
SqFt1stFloor
SqFt2ndFloor
SqFtTotLiving
SqFtGarageAttached
SqFtOpenPorch
Bedrooms
BathFullCount


In [40]:
# # create a smaller df to save space and processing power
fsm_df = df_merged[col].copy()
fsm_df.dropna(inplace=True)
fsm = ols(formula="SalePrice ~ NbrLivingUnits + Stories + SqFt1stFloor + SqFt2ndFloor + SqFtTotLiving + SqFtGarageAttached + SqFtOpenPorch + Bedrooms + BathFullCount", data=fsm_df)
fsm_results = fsm.fit()
fsm_results.summary()

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.101
Model:,OLS,Adj. R-squared:,0.1
Method:,Least Squares,F-statistic:,375.7
Date:,"Tue, 29 Sep 2020",Prob (F-statistic):,0.0
Time:,13:43:57,Log-Likelihood:,-467020.0
No. Observations:,30199,AIC:,934100.0
Df Residuals:,30189,BIC:,934100.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.309e+06,6.18e+04,-21.175,0.000,-1.43e+06,-1.19e+06
NbrLivingUnits,1.446e+06,4.31e+04,33.525,0.000,1.36e+06,1.53e+06
Stories,1.745e+05,2.29e+04,7.606,0.000,1.3e+05,2.19e+05
SqFt1stFloor,64.3776,26.880,2.395,0.017,11.692,117.063
SqFt2ndFloor,-101.9707,23.247,-4.386,0.000,-147.536,-56.405
SqFtTotLiving,380.9222,18.034,21.122,0.000,345.574,416.270
SqFtGarageAttached,12.1660,32.655,0.373,0.709,-51.839,76.171
SqFtOpenPorch,307.7534,63.810,4.823,0.000,182.683,432.824
Bedrooms,-1.217e+05,9962.601,-12.217,0.000,-1.41e+05,-1.02e+05

0,1,2,3
Omnibus:,57055.139,Durbin-Watson:,0.757
Prob(Omnibus):,0.0,Jarque-Bera (JB):,152232768.592
Skew:,14.418,Prob(JB):,0.0
Kurtosis:,349.63,Cond. No.,27500.0


In [41]:
# # create a smaller df to save space and processing power
fsm_df = df_merged[['SalePrice', 'SqFt1stFloor', 'SqFt2ndFloor', 'SqFtTotLiving', 'SqFtGarageAttached','SqFtGarageAttached','SqFtOpenPorch','SqFtEnclosedPorch','Bedrooms','BathHalfCount','BathFullCount']].copy()
fsm_df.dropna(inplace=True)
fsm = ols(formula="SalePrice ~ SqFt1stFloor + SqFt2ndFloor + SqFtTotLiving + SqFtGarageAttached + SqFtGarageAttached + SqFtOpenPorch + SqFtEnclosedPorch + Bedrooms + BathHalfCount + BathFullCount", data=fsm_df)
fsm_results = fsm.fit()
fsm_results.summary()

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.066
Model:,OLS,Adj. R-squared:,0.066
Method:,Least Squares,F-statistic:,236.9
Date:,"Tue, 29 Sep 2020",Prob (F-statistic):,0.0
Time:,13:43:57,Log-Likelihood:,-467590.0
No. Observations:,30199,AIC:,935200.0
Df Residuals:,30189,BIC:,935300.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.001e+05,3.05e+04,9.826,0.000,2.4e+05,3.6e+05
SqFt1stFloor,6.5141,25.506,0.255,0.798,-43.480,56.508
SqFt2ndFloor,-19.5031,20.158,-0.968,0.333,-59.013,20.007
SqFtTotLiving,371.5526,18.300,20.303,0.000,335.684,407.421
SqFtGarageAttached[0],-30.1488,16.653,-1.810,0.070,-62.789,2.492
SqFtGarageAttached[1],-30.1488,16.653,-1.810,0.070,-62.789,2.492
SqFtOpenPorch,287.7428,65.025,4.425,0.000,160.291,415.194
SqFtEnclosedPorch,884.2497,204.802,4.318,0.000,482.830,1285.669
Bedrooms,-8.779e+04,1.01e+04,-8.713,0.000,-1.08e+05,-6.8e+04

0,1,2,3
Omnibus:,58498.659,Durbin-Watson:,0.671
Prob(Omnibus):,0.0,Jarque-Bera (JB):,155297047.098
Skew:,15.297,Prob(JB):,0.0
Kurtosis:,352.976,Cond. No.,5.25e+16


In [42]:
df

Unnamed: 0,SalePrice,PrincipalUse,PropertyClass,PropType,Area,SubArea,SqFtLot,WaterSystem,SewerSystem,Access,...,SqFtDeck,HeatSystem,Bedrooms,BathHalfCount,Bath3qtrCount,BathFullCount,FpSingleStory,FpMultiStory,YrRenovated,PcntComplete
0,409950,6,8,R,51.0,6.0,7875.0,2.0,2.0,4.0,...,140.0,5.0,3.0,1.0,0.0,2.0,0.0,1.0,0.0,0.0
1,540000,6,8,R,23.0,4.0,8621.0,2.0,2.0,4.0,...,0.0,5.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
2,930000,6,8,R,70.0,3.0,212911.0,1.0,1.0,1.0,...,0.0,5.0,3.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0
3,379950,6,8,R,40.0,9.0,14149.0,1.0,1.0,4.0,...,520.0,4.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,620000,6,8,R,95.0,10.0,4046.0,2.0,2.0,4.0,...,0.0,5.0,3.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3879,740000,6,8,R,84.0,2.0,72465.0,2.0,1.0,4.0,...,150.0,5.0,3.0,1.0,0.0,2.0,1.0,0.0,0.0,0.0
3880,480000,6,8,R,23.0,5.0,7500.0,2.0,2.0,4.0,...,0.0,5.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3881,522000,6,8,R,100.0,5.0,201682.0,1.0,1.0,4.0,...,0.0,4.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
3882,945000,6,8,R,95.0,5.0,165528.0,1.0,1.0,3.0,...,0.0,5.0,4.0,0.0,2.0,1.0,0.0,2.0,0.0,0.0


In [43]:
df.select_dtypes(include = 'object').head()

Unnamed: 0,PropType,WfntAccessRights,WfntProximityInfluence,PowerLines,OtherNuisances,BuildingNumber,ZipCode,DaylightBasement
0,R,N,N,N,N,12704,98058,Y
1,R,N,N,N,N,10246,98146,
2,R,N,N,N,N,30606,98019,
3,R,N,N,N,N,37316,98022,
4,R,N,N,N,N,13753,98053,N


In [45]:
from sklearn.preprocessing import LabelEncoder


In [49]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["WfntAccessRights"])
label_encoder.classes_
df["WfntAccessRights_Encoded"] = status_labels

In [50]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["WfntProximityInfluence"])
label_encoder.classes_
df["WfntProximityInfluence_Encoded"] = status_labels

In [51]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["PowerLines"])
label_encoder.classes_
df["PowerLines"] = status_labels

In [52]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["OtherNuisances"])
label_encoder.classes_
df["OtherNuisances_Encoded"] = status_labels

In [53]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["BuildingNumber"])
label_encoder.classes_
df["BuildingNumber_Encoded"] = status_labels

In [54]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["ZipCode"])
label_encoder.classes_
df["ZipCode_Encoded"] = status_labels

In [55]:
label_encoder = LabelEncoder()
status_labels = label_encoder.fit_transform(df["DaylightBasement"])
label_encoder.classes_
df["DaylightBasement_Encoded"] = status_labels

In [56]:
df.corr()

Unnamed: 0,SalePrice,PrincipalUse,PropertyClass,Area,SubArea,SqFtLot,WaterSystem,SewerSystem,Access,SeattleSkyline,...,FpSingleStory,FpMultiStory,YrRenovated,PcntComplete,OtherNuisances_Encoded,WfntAccessRights_Encoded,WfntProximityInfluence_Encoded,BuildingNumber_Encoded,ZipCode_Encoded,DaylightBasement_Encoded
SalePrice,1.0,0.015982,-0.060807,0.082551,0.00907,-0.066596,-0.009903,-0.048746,-0.050652,0.001672,...,0.073894,0.088082,-0.005312,-0.013907,0.015179,-0.013141,0.000289,-0.011847,0.013079,-0.021474
PrincipalUse,0.015982,1.0,-0.333725,-0.016969,0.02812,-0.003165,0.01123,0.008502,0.00849,-0.000364,...,0.012974,-0.013239,-0.005288,-0.000785,-0.003965,-0.002618,-0.001264,0.015835,-0.004468,-0.022831
PropertyClass,-0.060807,-0.333725,1.0,0.001053,0.03599,0.005296,0.007609,0.017246,-0.006798,0.000833,...,-0.01003,-0.000211,0.017821,-0.16837,0.009069,-0.005135,0.00289,0.00777,-0.003172,0.012028
Area,0.082551,-0.016969,0.001053,1.0,0.043277,0.172436,-0.348267,-0.456295,-0.135217,0.001348,...,0.03799,0.031162,0.058255,0.019493,0.016722,0.03289,0.009271,0.052375,-0.144141,-0.159492
SubArea,0.00907,0.02812,0.03599,0.043277,1.0,-0.051866,0.100274,0.029407,0.045362,-0.018902,...,0.064035,0.011829,-0.018884,-0.026043,-0.020166,-0.058128,-0.0282,0.247305,-0.458182,0.018686
SqFtLot,-0.066596,-0.003165,0.005296,0.172436,-0.051866,1.0,-0.515045,-0.34922,0.039697,-0.002387,...,-0.004351,-0.07465,-0.02435,-0.003923,-0.022193,-0.0166,-0.00684,0.005228,-0.051488,-0.133097
WaterSystem,-0.009903,0.01123,0.007609,-0.348267,0.100274,-0.515045,1.0,0.451093,0.232462,0.00794,...,0.087418,0.078535,-0.060439,-0.031428,-0.010095,0.022379,-0.033448,-0.099692,0.152763,0.136653
SewerSystem,-0.048746,0.008502,0.017246,-0.456295,0.029407,-0.34922,0.451093,1.0,0.201584,0.022046,...,0.048915,-0.068877,-0.058989,-0.056486,-0.087861,-0.017812,-0.034773,-0.114597,0.1717,0.099454
Access,-0.050652,0.00849,-0.006798,-0.135217,0.045362,0.039697,0.232462,0.201584,1.0,0.006003,...,0.04101,-0.027839,-0.052267,-0.015798,-0.017934,0.037509,-0.025418,-0.056144,0.063782,0.00315
SeattleSkyline,0.001672,-0.000364,0.000833,0.001348,-0.018902,-0.002387,0.00794,0.022046,0.006003,1.0,...,0.034069,-0.009361,-0.003739,-0.000555,-0.002803,-0.001851,-0.000893,-0.010783,0.004012,0.004234
