# Some potential audiences are:

1. Homeowners who want to increase the sale price of their homes through home improvement projects
2. Advocacy groups who want to promote affordable housing
3. Local elected officials who want to understand how their policy ideas (e.g. zoning changes, permitting) might impact home prices
4. Real estate investors looking for potential "fixer-uppers" or "tear-downs"

# Three things to be sure you establish during this phase are:

1. **Objectives:** what questions are you trying to answer, and for whom?
2. **Project plan:** you may want to establish more formal project management practices, such as daily stand-ups or using a Trello board, to plan the time you have remaining. Regardless, you should determine the division of labor, communication expectations, and timeline.
3. **Success criteria:** what does a successful project look like? How will you know when you have achieved it?

# READ THIS: Import the following data files from https://info.kingcounty.gov/assessor/DataDownload/default.aspx
## Download the files to the local repo's data directory
> 1) Real Property Sales (.ZIP, csv) <BR>
> 2) Parcel (.ZIP, csv) <BR>
> 3) Residential Building (.ZIP, csv) <BR>
> 4) Unit Breakdown (.ZIP)<BR>
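
Once the archives are downloaded, they can be unpacked in place. The following is a minimal sketch, assuming the .ZIP files were saved directly under `data/`; the helper name `extract_archives` is hypothetical and not part of this notebook.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir="data"):
    """Extract every .zip archive found in data_dir into data_dir itself.

    Hypothetical helper: assumes the downloaded archives sit directly
    inside data_dir.
    """
    data_path = Path(data_dir)
    extracted = []
    # sorted() materializes the listing before any files are extracted.
    # The suffix check is case-insensitive because the files are listed as .ZIP.
    for archive in sorted(data_path.iterdir()):
        if archive.suffix.lower() != ".zip":
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_path)
            extracted.extend(zf.namelist())
    return extracted
```

Calling `extract_archives()` returns the names of the extracted files, which can be compared against the CSV names the reader functions expect.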


In [4]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import linear_rainbow, het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
# parse_2019(df) takes a dataframe, looks for a date column
# ('DocumentDate' or 'ChangeDate'), and keeps only the rows from 2019.
# A dataframe with neither column keeps all of its rows.

def parse_2019(df):
    if 'DocumentDate' in df.columns:
        df = df[pd.to_datetime(df['DocumentDate']).dt.year == 2019]
    elif 'ChangeDate' in df.columns:
        df = df[df['ChangeDate'].astype(str).str[:4] == '2019']
    # reset_index returns a new dataframe, so return its result
    return df.reset_index(drop=True)
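
A short usage sketch on invented rows may help show what the filter does. The function is restated here, returning the re-indexed frame, so that the snippet runs on its own; the sample values are made up for illustration.

```python
import pandas as pd


def parse_2019(df):
    """Keep only rows dated 2019 (restated so the snippet is self-contained)."""
    if 'DocumentDate' in df.columns:
        df = df[pd.to_datetime(df['DocumentDate']).dt.year == 2019]
    elif 'ChangeDate' in df.columns:
        df = df[df['ChangeDate'].astype(str).str[:4] == '2019']
    return df.reset_index(drop=True)


# Toy rows, invented purely for illustration
toy_sales = pd.DataFrame({
    'DocumentDate': ['12/20/2019', '03/01/2018', '07/15/2019'],
    'SalePrice': [560000, 410000, 725000],
})

sales_2019 = parse_2019(toy_sales)
print(sales_2019)  # two rows remain, re-indexed from 0
```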

In [3]:
## Create a function get_data(create_csv)
## If create_csv = True:
##    create a combined file rp_cons.csv from the other csv files and return a dataframe rp_cons
## If create_csv = False:
##    return a dataframe with all columns from rp_cons.csv

import pandas as pd

def get_data(create_csv):

    if not create_csv:
        # Reuse the previously combined file
        return pd.read_csv("data/rp_cons.csv")

    df_rp_sales = get_sale()
    df_parcel = get_parcel()
    df_res_bldg = get_resBldg()
    df_unit_breakdown = get_unitbreakdown()

    # TODO: combine the tables into rp_cons and write data/rp_cons.csv;
    # until then, return the filtered sales table
    return df_rp_sales
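
The combining step itself is not shown in this notebook. One possible approach, sketched under the assumption that the tables should be inner-joined on their shared `Major`/`Minor` keys, is below. The helper name `combine_tables`, the `out_path` parameter, and the sample rows are all hypothetical.

```python
import pandas as pd


def combine_tables(df_sales, df_parcel, df_res_bldg, out_path=None):
    """Join the three tables on the shared (Major, Minor) parcel key.

    Assumption: an inner join is wanted, so only parcels present in
    all three tables survive.
    """
    keys = ['Major', 'Minor']
    combined = (
        df_sales
        .merge(df_parcel, on=keys, how='inner')
        .merge(df_res_bldg, on=keys, how='inner')
    )
    if out_path is not None:
        # Write without the index so the file reloads with the same columns
        combined.to_csv(out_path, index=False)
    return combined


# Invented rows for illustration only
sales = pd.DataFrame({'Major': [1, 2], 'Minor': [10, 20], 'SalePrice': [500000, 650000]})
parcel = pd.DataFrame({'Major': [1, 2], 'Minor': [10, 20], 'SqFtLot': [5000, 7200]})
bldg = pd.DataFrame({'Major': [1], 'Minor': [10], 'SqFtTotLiving': [1800]})

rp_cons = combine_tables(sales, parcel, bldg)
print(rp_cons)
```

With the real files, the key columns may need a consistent dtype across tables before merging, and a different join type may suit the analysis better.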

### 1) Read EXTR_RPSale.csv 

In [9]:
# Data File: EXTR_RPSale.csv -------------------------------------------------------------
#Table: EXTR_RPSale 
#Keys: Major, Minor
#Fields: SalePrice, PropertyType, PrincipalUse
def get_sale():
    df_rp_sales = pd.read_csv('data/EXTR_RPSale.csv', encoding = "ISO-8859-1", low_memory=False)
    print("Before filter EXTR_RPSale.csv: ", df_rp_sales.shape)

    # Keep the following columns from the EXTR_RPSale table
    # Primary key: 'Major', 'Minor'
    # Select fields: 'DocumentDate', 'SalePrice', 'PropertyType', 'PrincipalUse', 'PropertyClass'
    cols = list(df_rp_sales.columns)
    df_rp_sales = df_rp_sales[cols[1:5] + cols[14:16] + cols[22:23]]
    df_rp_sales = parse_2019(df_rp_sales)
    print("After filter EXTR_RPSale.csv: ", df_rp_sales.shape)
    return df_rp_sales
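
Positional slices such as `cols[1:5]` depend on the column order of the downloaded file. A possible alternative is to select columns by name, sketched here with the names that the `df_rp_sales.columns` output in this notebook reports. The in-memory CSV and its `ExtraA`/`ExtraB` columns are placeholders, not the real file layout.

```python
import io
import pandas as pd

# Column names taken from the df_rp_sales.columns output in this notebook
SALE_COLUMNS = ['Major', 'Minor', 'DocumentDate', 'SalePrice',
                'PropertyType', 'PrincipalUse', 'PropertyClass']

# Tiny in-memory stand-in for EXTR_RPSale.csv; ExtraA and ExtraB are
# placeholder columns that the selection should drop
toy_csv = io.StringIO(
    "ExtraA,Major,Minor,DocumentDate,SalePrice,ExtraB,"
    "PropertyType,PrincipalUse,PropertyClass\n"
    "1,213043,120,12/20/2019,560000,x,3,6,8\n"
)

# usecols reads only the named columns; indexing with the list afterwards
# fixes the column order regardless of the order in the file
df_sales_named = pd.read_csv(toy_csv, usecols=SALE_COLUMNS)[SALE_COLUMNS]
print(df_sales_named.columns.tolist())
```

Selecting by name also makes pandas raise an error if an expected column is missing, rather than silently picking a different one.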

### 2) Read EXTR_Parcel.csv 

In [5]:
#Data File: EXTR_Parcel.csv
#Table: EXTR_Parcel
#Keys: Major, Minor
#Fields: PropType, Area, SubArea,DistrictName, SqFtLot, WaterSystem, SewerSystem, Access, WaterProblems, AirportNoise, TrafficNoise,PowerLines,  LandSlideHazard, SeismicHazard

def get_parcel():
    df_parcel = pd.read_csv('data/EXTR_Parcel.csv', encoding = "ISO-8859-1", low_memory=False)
    print("Before filter EXTR_Parcel.csv: ", df_parcel.shape)

    # Keep the following columns from the EXTR_Parcel table
    # Primary key: 'Major', 'Minor'
    # Select fields: PropType, Area, SubArea, DistrictName, SqFtLot, WaterSystem, SewerSystem, Access, WaterProblems, AirportNoise, TrafficNoise, PowerLines, LandSlideHazard, SeismicHazard
    cols = list(df_parcel.columns)
    # Currently keeps only Major, Minor, PropType, and DistrictName
    df_parcel = df_parcel[cols[:2] + cols[10:11] + cols[15:16]]  ######## Change this
    # The selected columns include no date column, so parse_2019 drops no rows
    df_parcel = parse_2019(df_parcel)
    print("After filter EXTR_Parcel.csv: ", df_parcel.shape)
    return df_parcel

### 3) Read EXTR_ResBldg.csv 

In [6]:
#Data File: EXTR_ResBldg.csv
#Table: EXTR_ResBldg
#Keys: Major, Minor
#Fields: BldgNbr, NbrLivingUnits, Address, BuildingNumber, Stories, BldgGrade, SqFt1stFloor, SqFtHalfFloor, SqFt2ndFloor, SqFtUpperFloor, SqFtTotLiving, SqFtTotBasement, SqFtFinBasement, SqFtOpenPorch, SqFtEnclosedPorch, SqFtDeck, HeatSystem, HeatSource, Bedrooms, BathHalfCount, Bath3qtrCount, BathFullCount, FpSingleStory, FpMultiStory, YrBuilt, YrRenovated
def get_resBldg():
    df_res_bldg = pd.read_csv('data/EXTR_ResBldg.csv', encoding = "ISO-8859-1", low_memory=False)
    print("Before filter EXTR_ResBldg.csv: ", df_res_bldg.shape)

    # Keep the following columns from the EXTR_ResBldg table
    # Primary key: 'Major', 'Minor'
    # Select fields: the fields listed in the header comment of this cell
    cols = list(df_res_bldg.columns)
    df_res_bldg = df_res_bldg[cols[:2] + cols[10:11] + cols[15:16]]  ######## Change this
    df_res_bldg = parse_2019(df_res_bldg)
    print("After filter EXTR_ResBldg.csv: ", df_res_bldg.shape)
    return df_res_bldg

### 4) Read EXTR_UnitBreakdown.csv 

In [18]:
#Data File: EXTR_UnitBreakdown.csv
#Table: EXTR_UnitBreakdown
#Keys: Major, Minor
#Fields:  'UnitTypeItemId', 'NbrThisType', 'SqFt','NbrBedrooms', 'NbrBaths'
def get_unitbreakdown():
    df_unit_breakdown = pd.read_csv('data/EXTR_UnitBreakdown.csv', encoding = "ISO-8859-1", low_memory=False)
    print("EXTR_UnitBreakdown: ", df_unit_breakdown.shape)
    ###todo extract column
    return df_unit_breakdown

### Andrew's scratchwork below:
____

In [18]:
df_rp_sales.columns

Index(['Major', 'Minor', 'DocumentDate', 'SalePrice', 'PropertyType',
       'PrincipalUse', 'PropertyClass'],
      dtype='object')

In [82]:
df_par.columns

Index(['Major', 'Minor', 'PropType', 'DistrictName'], dtype='object')

In [19]:
df_res_bldg.columns

Index(['Major', 'Minor', 'BldgNbr', 'NbrLivingUnits', 'Address',
       'BuildingNumber', 'Fraction', 'DirectionPrefix', 'StreetName',
       'StreetType', 'DirectionSuffix', 'ZipCode', 'Stories', 'BldgGrade',
       'BldgGradeVar', 'SqFt1stFloor', 'SqFtHalfFloor', 'SqFt2ndFloor',
       'SqFtUpperFloor', 'SqFtUnfinFull', 'SqFtUnfinHalf', 'SqFtTotLiving',
       'SqFtTotBasement', 'SqFtFinBasement', 'FinBasementGrade',
       'SqFtGarageBasement', 'SqFtGarageAttached', 'DaylightBasement',
       'SqFtOpenPorch', 'SqFtEnclosedPorch', 'SqFtDeck', 'HeatSystem',
       'HeatSource', 'BrickStone', 'ViewUtilization', 'Bedrooms',
       'BathHalfCount', 'Bath3qtrCount', 'BathFullCount', 'FpSingleStory',
       'FpMultiStory', 'FpFreestanding', 'FpAdditional', 'YrBuilt',
       'YrRenovated', 'PcntComplete', 'Obsolescence', 'PcntNetCondition',
       'Condition', 'AddnlCost'],
      dtype='object')

In [20]:
df_value_history.columns

Index(['Major', 'Minor', 'TaxYr', 'OmitYr', 'ApprLandVal', 'ApprImpsVal',
       'ApprImpIncr', 'LandVal', 'ImpsVal', 'TaxValReason', 'TaxStatus',
       'LevyCode', 'ChangeDate', 'ChangeDocId', 'Reason', 'SplitCode'],
      dtype='object')

In [21]:
df_unit_breakdown.columns

Index(['Major', 'Minor', 'UnitTypeItemId', 'NbrThisType', 'SqFt',
       'NbrBedrooms', 'NbrBaths'],
      dtype='object')

In [56]:
df = df_rp_sales.head()

In [58]:
df['DocumentDate'] = pd.to_datetime(df['DocumentDate'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [62]:
df[df['DocumentDate'].dt.year == 2019 ]

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,PropertyType,PrincipalUse,PropertyClass
4,213043,120,2019-12-20,560000,3,6,8
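
The "A value is trying to be set on a copy of a slice" message above is pandas' SettingWithCopyWarning, raised because `df` was taken from `df_rp_sales.head()` and then assigned to. One common way to avoid it, sketched here on invented rows, is to take an explicit copy before assigning.

```python
import pandas as pd

# Invented rows standing in for df_rp_sales
df_rp_sales = pd.DataFrame({
    'DocumentDate': ['12/20/2019', '03/01/2018'],
    'SalePrice': [560000, 410000],
})

# .copy() makes df an independent dataframe, so the assignment below
# changes df only and leaves df_rp_sales untouched
df = df_rp_sales.head().copy()
df['DocumentDate'] = pd.to_datetime(df['DocumentDate'])

print(df[df['DocumentDate'].dt.year == 2019])
```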


In [63]:
df['DocumentDate'] = pd.to_datetime(df['DocumentDate'])
df[df['DocumentDate'].dt.year == 2019 ]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Major,Minor,DocumentDate,SalePrice,PropertyType,PrincipalUse,PropertyClass
4,213043,120,2019-12-20,560000,3,6,8


In [97]:
df_value_history['ChangeDate'][0][:4]

'1994'

In [104]:
df_value_history = df_value_history[df_value_history['ChangeDate'].astype(str).str[:4] == '2019']
df_value_history.reset_index(drop = True)