# Some potential audiences are:

1. Homeowners who want to increase the sale price of their homes through home improvement projects
2. Advocacy groups who want to promote affordable housing
3. Local elected officials who want to understand how their policy ideas (e.g. zoning changes, permitting) might impact home prices
4. Real estate investors looking for potential "fixer-uppers" or "tear-downs"

# Three things to be sure you establish during this phase are:

1. **Objectives:** what questions are you trying to answer, and for whom?
2. **Project plan:** you may want to establish more formal project management practices, such as daily stand-ups or using a Trello board, to plan the time you have remaining. Regardless, you should determine the division of labor, communication expectations, and timeline.
3. **Success criteria:** what does a successful project look like? How will you know when you have achieved it?

# READ THIS: Import the following data files from https://info.kingcounty.gov/assessor/DataDownload/default.aspx
## Download the files to the local repo's data directory
> 1) Real Property Sales (.ZIP, csv) <BR>
> 2) Parcel (.ZIP, csv) <BR>
> 3) Residential Building (.ZIP, csv) <BR>
> 4) Unit Breakdown (.ZIP)<BR>
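
Once the archives are downloaded, they still need to be unpacked before `pd.read_csv` can find the csv files. The following is a minimal sketch of one way to do that with the standard library, assuming the archives were saved directly into the data directory. The helper name `extract_archives` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir="data"):
    """Extract every .zip/.ZIP archive in data_dir into data_dir itself.

    Returns the names of all extracted members. The function name is an
    illustrative assumption, not part of the original notebook.
    """
    data_path = Path(data_dir)
    extracted = []
    # Match the suffix case-insensitively, since the downloads are listed as .ZIP.
    archives = sorted(p for p in data_path.iterdir() if p.suffix.lower() == ".zip")
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_path)
            extracted.extend(zf.namelist())
    return extracted
```

Calling `extract_archives("data")` from the repo root would leave the extracted files next to the archives, which matches the `data/...` paths used in the cells below.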


In [2]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import linear_rainbow, het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

### TODO: 1) Read EXTR_RPSale.csv only once. First check whether df_modelview has already been created; if so, skip reading the file.
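
One possible way to address this TODO is to guard the expensive read with a check for a cached copy on disk. This is only a sketch under assumptions: the helper `load_once` and its `build_fn` argument are hypothetical names, and the cache is assumed to be a plain csv file.

```python
import os
import pandas as pd


def load_once(cache_path, build_fn):
    """Return the cached DataFrame if cache_path exists; otherwise build and cache it.

    `load_once` and `build_fn` are illustrative names, not part of the
    original notebook.
    """
    if os.path.exists(cache_path):
        # A cached view is already on disk, so skip the expensive raw read.
        return pd.read_csv(cache_path)
    df = build_fn()
    df.to_csv(cache_path, index=False)
    return df
```

In this notebook the call might look like `df_modelview = load_once('data/par_rps_rb_ub.csv', build_model_view)`, where `build_model_view` would be a (hypothetical) function that reads and joins the raw files.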

In [3]:
# Data File: EXTR_RPSale.csv
#Table: EXTR_RPSale 
#Keys: Major, Minor
#Fields: SalePrice, PropertyType, PrincipalUse


df_rp_sales = pd.read_csv('data/EXTR_RPSale.csv', encoding = "ISO-8859-1", low_memory=False)
print("Before filter EXTR_RPSale.csv: ", df_rp_sales.shape)

# Keep the following columns from the EXTR_RPSale table, selected by position:
# Primary key: 'Major', 'Minor'
# Selected fields: 'DocumentDate', 'SalePrice', 'PropertyType', 'PrincipalUse', 'PropertyClass'
cols = list(df_rp_sales.columns)
df_rp_sales = df_rp_sales[cols[1:5] + cols[14:16] + cols[22:23]]
print("After filter EXTR_RPSale.csv: ", df_rp_sales.shape)

Before filter EXTR_RPSale.csv:  (2107155, 24)
After filter EXTR_RPSale.csv:  (2107155, 7)
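
Since the raw sales file has about 2.1 million rows and 24 columns, it may be worth selecting columns by name at read time instead of slicing by position afterwards. The sketch below uses the `usecols` parameter of `pd.read_csv` on a tiny in-memory stand-in for the real file. The sample values are made up, and only column names that appear in this notebook's `df_rp_sales.columns` output are used.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for data/EXTR_RPSale.csv (values are made up).
sample_csv = io.StringIO(
    "ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,PropertyType,PrincipalUse\n"
    "1001,100,10,01/15/2019,450000,11,6\n"
    "1002,200,20,03/02/2019,0,3,6\n"
)

keep = ["Major", "Minor", "DocumentDate", "SalePrice", "PropertyType", "PrincipalUse"]

# usecols limits parsing to the named columns, so unused ones are never loaded.
df_sample = pd.read_csv(sample_csv, usecols=keep)
print(df_sample.shape)
```

Selecting by name also keeps the cell working if the assessor adds or reorders columns in a later extract, which positional slices such as `cols[14:16]` would not survive.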



In [None]:
### TODO: 2) Read EXTR_Parcel.csv only once. First check whether df_modelview has already been created; if so, skip reading the file.

In [109]:
#Data File: EXTR_Parcel.csv
#Table: EXTR_Parcel
#Keys: Major, Minor
#Fields: PropType, Area, SubArea, DistrictName, SqFtLot, WaterSystem, SewerSystem, Access, WaterProblems, AirportNoise, TrafficNoise, PowerLines, LandSlideHazard, SeismicHazard

df_parcel = pd.read_csv('data/EXTR_Parcel.csv', encoding = "ISO-8859-1", low_memory=False)
print("Before EXTR_Parcel.csv: ", df_parcel.shape)
df_parcel.columns
# Keep the following columns from the EXTR_Parcel table, selected by position:
# Primary key: 'Major', 'Minor'
# Selected by the slice below: PropType (cols[10]) and DistrictName (cols[15])
# Listed but not yet selected: Area, SubArea, SqFtLot, WaterSystem, SewerSystem, Access, WaterProblems, AirportNoise, TrafficNoise, PowerLines, LandSlideHazard, SeismicHazard
cols = list(df_parcel.columns)
df_par = df_parcel[cols[:2] + cols[10:11] + cols[15:16]]
print("After filter EXTR_Parcel.csv: ", df_par.shape)


Before EXTR_Parcel.csv:  (616089, 81)


Index(['Major', 'Minor', 'PropName', 'PlatName', 'PlatLot', 'PlatBlock',
       'Range', 'Township', 'Section', 'QuarterSection', 'PropType', 'Area',
       'SubArea', 'SpecArea', 'SpecSubArea', 'DistrictName', 'LevyCode',
       'CurrentZoning', 'HBUAsIfVacant', 'HBUAsImproved', 'PresentUse',
       'SqFtLot', 'WaterSystem', 'SewerSystem', 'Access', 'Topography',
       'StreetSurface', 'RestrictiveSzShape', 'InadequateParking',
       'PcntUnusable', 'Unbuildable', 'MtRainier', 'Olympics', 'Cascades',
       'Territorial', 'SeattleSkyline', 'PugetSound', 'LakeWashington',
       'LakeSammamish', 'SmallLakeRiverCreek', 'OtherView', 'WfntLocation',
       'WfntFootage', 'WfntBank', 'WfntPoorQuality', 'WfntRestrictedAccess',
       'WfntAccessRights', 'WfntProximityInfluence', 'TidelandShoreland',
       'LotDepthFactor', 'TrafficNoise', 'AirportNoise', 'PowerLines',
       'OtherNuisances', 'NbrBldgSites', 'Contamination', 'DNRLease',
       'AdjacentGolfFairway', 'AdjacentGreenbelt', 

In [108]:
#Data File: EXTR_ValueHistory_V.csv
#Table: EXTR_ValueHistory
#Keys: Major, Minor
#Fields: TaxYr, LandVal, ImpsVal

df_value_history = pd.read_csv('data/EXTR_ValueHistory_V.csv', encoding = "ISO-8859-1")
print("EXTR_ValueHistory_V.csv: ", df_value_history.shape)

EXTR_ValueHistory_V.csv:  (23212862, 16)


In [96]:
#Data File: EXTR_ResBldg.csv
#Table: EXTR_ResBldg
#Keys: Major, Minor
#Fields: BldgNbr, NbrLivingUnits, Address, BuildingNumber, Stories, BldgGrade, SqFt1stFloor, SqFtHalfFloor, SqFt2ndFloor, SqFtUpperFloor, SqFtTotLiving, SqFtTotBasement, SqFtFinBasement, SqFtOpenPorch, SqFtEnclosedPorch, SqFtDeck, HeatSystem, HeatSource, Bedrooms, BathHalfCount, Bath3qtrCount, BathFullCount, FpSingleStory, FpMultiStory, YrBuilt, YrRenovated

df_res_bldg = pd.read_csv('data/EXTR_ResBldg.csv', encoding = "ISO-8859-1", low_memory=False)
print("EXTR_ResBldg.csv: ", df_res_bldg.shape)

EXTR_ResBldg.csv:  (517554, 50)


In [41]:
#Data File: EXTR_UnitBreakdown.csv
#Table: EXTR_UnitBreakdown
#Keys: Major, Minor
#Fields:  'UnitTypeItemId', 'NbrThisType', 'SqFt','NbrBedrooms', 'NbrBaths'

df_unit_breakdown = pd.read_csv('data/EXTR_UnitBreakdown.csv', encoding = "ISO-8859-1", low_memory=False)
print("EXTR_UnitBreakdown: ", df_unit_breakdown.shape)

EXTR_UnitBreakdown:  (25382, 7)


## READ: Explore the column names from each
> 1) Real Property Sales <BR>
> 2) Parcel <BR>
> 3) Residential Building  <BR>
> 4) Unit Breakdown <BR>

> As you will notice, every table uses Major+Minor as its key.<BR>
> We will not use all the columns from each table in our regression model.<BR>
> Moreover, each file has a huge number of records, all keyed on the Major+Minor fields.<BR>
> We will read these tables and create ONE joined view of the data needed to answer many business questions.<BR>
> With that in mind, we will create ONE csv file from the data extracted from each of these csv files and store it in the data directory as "par_rps_rb_ub.csv".
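
The join described above can be sketched with `DataFrame.merge` on the shared Major+Minor key. The frames below are made-up miniature stand-ins for the sales, parcel, and residential building tables, and the choice of inner joins is an assumption: it keeps only parcels present in every table, which may or may not be what the final view needs.

```python
import os
import tempfile
import pandas as pd

# Made-up miniature versions of the sales, parcel, and building tables.
sales = pd.DataFrame({"Major": [100, 100, 200], "Minor": [10, 10, 20],
                      "SalePrice": [450000, 470000, 610000]})
parcel = pd.DataFrame({"Major": [100, 200, 300], "Minor": [10, 20, 30],
                       "SqFtLot": [5000, 7200, 4000]})
res_bldg = pd.DataFrame({"Major": [100, 200], "Minor": [10, 20],
                         "SqFtTotLiving": [1800, 2400]})

keys = ["Major", "Minor"]

# Inner joins keep only parcels that appear in every table.
view = sales.merge(parcel, on=keys, how="inner").merge(res_bldg, on=keys, how="inner")

# Write the combined view once so later runs can reload it directly.
# A temp directory is used here; in the notebook the target would be the data directory.
out_path = os.path.join(tempfile.gettempdir(), "par_rps_rb_ub.csv")
view.to_csv(out_path, index=False)
print(view.shape)
```

Note that a parcel with several recorded sales produces one row per sale in the joined view, so row counts follow the sales table rather than the parcel table.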

In [37]:
df_rp_sales.columns

Index(['ExciseTaxNbr', 'Major', 'Minor', 'DocumentDate', 'SalePrice',
       'RecordingNbr', 'Volume', 'Page', 'PlatNbr', 'PlatType', 'PlatLot',
       'PlatBlock', 'SellerName', 'BuyerName', 'PropertyType', 'PrincipalUse',
       'SaleInstrument', 'AFForestLand', 'AFCurrentUseLand', 'AFNonProfitUse',
      dtype='object')

In [31]:
df_res_bldg.columns

Index(['Major', 'Minor', 'BldgNbr', 'NbrLivingUnits', 'Address',
       'BuildingNumber', 'Fraction', 'DirectionPrefix', 'StreetName',
       'StreetType', 'DirectionSuffix', 'ZipCode', 'Stories', 'BldgGrade',
       'BldgGradeVar', 'SqFt1stFloor', 'SqFtHalfFloor', 'SqFt2ndFloor',
       'SqFtUpperFloor', 'SqFtUnfinFull', 'SqFtUnfinHalf', 'SqFtTotLiving',
       'SqFtTotBasement', 'SqFtFinBasement', 'FinBasementGrade',
       'SqFtGarageBasement', 'SqFtGarageAttached', 'DaylightBasement',
       'SqFtOpenPorch', 'SqFtEnclosedPorch', 'SqFtDeck', 'HeatSystem',
       'HeatSource', 'BrickStone', 'ViewUtilization', 'Bedrooms',
       'BathHalfCount', 'Bath3qtrCount', 'BathFullCount', 'FpSingleStory',
       'FpMultiStory', 'FpFreestanding', 'FpAdditional', 'YrBuilt',
       'YrRenovated', 'PcntComplete', 'Obsolescence', 'PcntNetCondition',
       'Condition', 'AddnlCost'],
      dtype='object')

In [40]:
df_value_history.columns

Index(['Major', 'Minor', 'TaxYr', 'OmitYr', 'ApprLandVal', 'ApprImpsVal',
       'ApprImpIncr', 'LandVal', 'ImpsVal', 'TaxValReason', 'TaxStatus',
       'LevyCode', 'ChangeDate', 'ChangeDocId', 'Reason', 'SplitCode'],
      dtype='object')

In [42]:
df_unit_breakdown.columns

Index(['Major', 'Minor', 'UnitTypeItemId', 'NbrThisType', 'SqFt',
       'NbrBedrooms', 'NbrBaths'],
      dtype='object')

In [75]:
df_rp_sales.columns

Index(['Minor', 'DocumentDate', 'SalePrice', 'PropertyType'], dtype='object')