# Objective:
The goal of this notebook is to create a standardized method for processing our data. The original datasets downloaded from the King County website are extremely large (see Data Download section of the readme), but by the end of this notebook they will be much smaller, allowing all further analysis to be performed easily with Pandas. 

## Process Overview:
### 1. Import
All data in the project will be stored in a 'data' folder found in the project's main directory.

### 2. Filtering
Once imported, the data will then be filtered to include only sales and properties that would be of interest to our target audience, which is primarily prospective homeowners.

### 3. Export
Once properly filtered, the new dataframes will be exported with the suffix '_final'. For instance, EXTR_RPSale.csv will become EXTR_RPSale_final.csv.

# 1. Import

### A. Set-up
Before we begin the data import, we'll set up our notebook with a few important modules and variables. First we'll add our repository home directory to our path so that we can import our custom modules:

In [None]:
#let's add the project directory to our module path
import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

Then we'll import the remainder of the modules we'll be using and set our data_folder variable to the path of our data folder.

In [None]:
#also import the rest of our modules
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import stats
from src import data_cleaning

data_folder = '../../data/'

### B. Data Import

We'll be looking at only two of our CSVs for the initial data cleaning process. These CSVs are quite large and will take some time to load.

#### Note on Unique Identifiers
Note the parameters passed to the read_csv method below. The data types of a few of the columns MUST be declared through the dtype parameter. This is because the unique identifier for a piece of land will be created using the Major and Minor columns. For example, the first entry in our sales data will have a Major value of 919715 and a Minor value of 0200, so our unique identifier would be 9197150200. These values are numeric, but they are also zero-padded, so importing them as integers would remove the zero padding and result in an incorrect unique identifier of 919715200. We'll go into more detail on the unique identifiers (PINs) later on.

In [None]:
rp_sale = pd.read_csv(data_folder+'EXTR_RPSale.csv', dtype={'Major': 'string', 'Minor':'string'})
res_bldg = pd.read_csv(data_folder+'EXTR_ResBldg.csv', dtype={'Major': 'string', 'Minor':'string', 'ZipCode': 'string'})
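
To see the zero-padding problem in isolation, here is a minimal sketch that reads a tiny in-memory CSV twice, once with default dtypes and once with string dtypes. The first row reuses the Major and Minor values from the example above; the second row and the SalePrice values are made up purely for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the sales CSV (second row and prices are illustrative only)
csv_text = "Major,Minor,SalePrice\n919715,0200,500000\n000123,0045,350000\n"

# Default import: Major and Minor are parsed as integers and lose their leading zeros
as_int = pd.read_csv(io.StringIO(csv_text))

# Import with explicit string dtypes: the zero padding is preserved
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'Major': 'string', 'Minor': 'string'})

# Build the identifier both ways to compare
pin_from_int = as_int['Major'].astype(str) + as_int['Minor'].astype(str)
pin_from_str = as_str['Major'] + as_str['Minor']

print(pin_from_int.tolist())
print(pin_from_str.tolist())
```

Only the string import yields ten-character identifiers, which is why the dtype argument is passed in the cell above.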

# 2. Filtering

The motivation behind filtering the data set comes from our target audience, home buyers. Our original data set contains sales deeds that may not provide accurate information for home buyers, such as foreclosure deeds as well as sales pertaining to commercial or public property.

Specifically, we are looking at the following properties:
- A. Property that was sold in 2019, to see the factors that affect the current market.

- B. Property that was sold, as opposed to foreclosed or transferred as part of a settlement.

- C. Property that is residential, as opposed to commercial.

- D. Property that was not 'sold' for a value of zero dollars, as these represent sales circumstances such as inheritance, which do not provide an accurate portrayal of a property's value.

- E. Property that has one building on a given tax parcel. Our data does not include an accurate way to differentiate the values for multiple homes on a single parcel, so these will be excluded to maintain accuracy.

- F. Property that is not a mansion, simply because first-time homebuyers are likely not investing in mansions.

Each of the previous criteria represents one filter that will be applied to the data sets.

### A. Was sold in 2019

This filter cuts the largest number of entries from our data set, which records sales going back at least a few decades. It uses a filter function that first converts all document dates to pandas datetime values, then filters out all sales that are not from 2019.

In [None]:
filter_a = data_cleaning.filter_data_by_year(rp_sale, 2019)
filter_a
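
The body of data_cleaning.filter_data_by_year is not shown in this notebook. As a rough sketch of the approach described above, a function along these lines would do the job. The name filter_by_year_sketch, the 'DocumentDate' column name, and the demo rows are all assumptions made for illustration, not the project's actual code or data.

```python
import pandas as pd

def filter_by_year_sketch(df, year, date_column='DocumentDate'):
    """Hypothetical stand-in for data_cleaning.filter_data_by_year."""
    out = df.copy()
    # Convert the date strings to pandas datetime values
    out[date_column] = pd.to_datetime(out[date_column])
    # Keep only the rows whose date falls in the requested year
    return out[out[date_column].dt.year == year]

# Made-up rows for demonstration
demo = pd.DataFrame({
    'DocumentDate': ['08/21/2014', '03/15/2019', '12/31/2019'],
    'SalePrice': [0, 450000, 615000],
})
demo_2019 = filter_by_year_sketch(demo, 2019)
print(demo_2019)
```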

It cuts about 2 million entries.

In [None]:
print(f'Change in dataset size: {len(filter_a) - len(rp_sale)}')

Ultimately we'll need normally distributed data in order for our regression model to fit, so let's keep track of what these filters do to our data. Here's a look at what the data looks like right now.

In [None]:
sns.distplot(filter_a.SalePrice)

We'll keep track of that as we go throughout our filtering process.

### B. Is listed as some sale other than foreclosures, settlements, etc. 
i.e. a standard sale.

A note on lookup codes: for many of our filters, we will be using the data set's 'LookUp' codes found in EXTR_LookUp.csv. To use lookup codes to find the sale reason, we first check the column description in this data set's .doc file (found in the project's 'references/data_documentation' folder). The description for the Sale Reason column tells us that it is assigned the lookup code 5. Now we can check EXTR_LookUp.csv to find the meaning of each value found in the sale reason column:

- 1 - None 
- 2 - Assumption         
- 3 - Mortgage Assumption 
- 4 - Foreclosure   
- 5 - Trust     
- 6 - Executor-to admin guardian 
- 7 - Testamentary Trust 
- 8 - Estate Settlement   
- 9 - Settlement      
- 10 - Property Settlement
- 11 - Divorce Settlement      
- 12 - Tenancy Partition    
- 13 - Community Prop Established    
- 14 - Partial Int - love,aff,gft  
- 15 - Easement       
- 16 - Correction (refiling) 
- 17 - Trade         
- 18 - Other      
- 19 - Quit Claim Deed - gift/full or part interest

For the purposes of this filter, we want to exclude everything that does not fall under category 1 (None, no reason listed) or 18 (Other, not specified). All other categories are considered non-standard sales.

In [None]:
# keep only sale reasons 1 (None) and 18 (Other)
filter_b = filter_a[filter_a['SaleReason'].isin([1, 18])]
filter_b
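
Before filtering on raw codes, it can help to attach the human-readable descriptions and count how often each one occurs. The sketch below maps a few of the codes from the list above onto a made-up SaleReason column. The dictionary is typed in by hand from that list rather than read from EXTR_LookUp.csv, and the demo rows are illustrative.

```python
import pandas as pd

# A hand-copied subset of the sale reason descriptions listed above
sale_reason_labels = {1: 'None', 4: 'Foreclosure', 8: 'Estate Settlement', 18: 'Other'}

# Made-up sales rows for demonstration
demo = pd.DataFrame({'SaleReason': [1, 1, 4, 18, 8]})

# Attach a readable label and count how often each reason appears
demo['SaleReasonLabel'] = demo['SaleReason'].map(sale_reason_labels)
reason_counts = demo['SaleReasonLabel'].value_counts()
print(reason_counts)

# Keep only the standard sales (codes 1 and 18)
standard = demo[demo['SaleReason'].isin([1, 18])]
print(standard)
```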

This removes very little of our data. This is plausibly because many non-standard sales are recorded with a zero sale price rather than a distinguishing sale reason, and those are removed by the non-zero filter in section D.

In [None]:
print(f'Change in dataset size: {len(filter_b) - len(filter_a)}')

In [None]:
sns.distplot(filter_b.SalePrice)

### C. Is residential
Because we are targeting prospective homeowners, we don't want to look at commercial buildings. The lookup codes assign residential buildings to the PropertyClass values of 7 and 8.

In [None]:
# keep only the residential property classes (7 and 8)
filter_c = filter_b[filter_b['PropertyClass'].isin([7, 8])]
filter_c

This is a substantial filter, removing approximately 25% of our data.

In [None]:
print(f'Change in dataset size: {len(filter_c) - len(filter_b)}')
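
Since we describe several filters in terms of the share of data they remove, a small helper that reports both the row count and the percentage can make those statements easy to check. This is a sketch only: describe_change is a hypothetical name and is not part of the project's data_cleaning module.

```python
def describe_change(before, after):
    """Return (rows removed, percent removed) between two collections or dataframes."""
    removed = len(before) - len(after)
    # Guard against dividing by zero when the starting data is empty
    pct = 100 * removed / len(before) if len(before) else 0.0
    return removed, pct

# Stand-in data: 200 rows before a filter, 150 rows after
removed, pct = describe_change(range(200), range(150))
print(f'Removed {removed} rows ({pct:.1f}% of the data)')
```

The same call works on dataframes, for example describe_change(filter_b, filter_c).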

In [None]:
sns.distplot(filter_c.SalePrice)

We can start to see a clearer picture of our distribution now.

### D. Was not sold for zero dollars
Most zero-valued sales are the result of some non-standard sale such as inheritance. The reason these sales were not caught by the previous filter B is not known. A further investigation into the nature of these zero-valued sales can be found in a separate notebook in this folder.

In [None]:
filter_d = filter_c[filter_c['SalePrice'] != 0]
filter_d

This is our second most substantial filter, removing about a third of our 2019 sales data.

In [None]:
print(f'Change in dataset size: {len(filter_d) - len(filter_c)}')

In [None]:
sns.distplot(filter_d.SalePrice)

### E. Property is on a parcel that contains only one building.

Because the sales data does not specify which building on a given parcel is being sold, we limited our data set to only include parcels that have one building on them. This process requires joining our sales data to our residential building data. In order to do this, we need to create a PIN column on which we can join these two data sets.

#### Creating Our Unique Identifier: PINs
The process of creating the PIN column simply involves concatenating the Major and Minor columns. This is the reason these columns must always be imported as strings; otherwise the zero padding will be removed and the PINs may become mismatched.

A function found in the project's data_cleaning module creates the PIN columns for us.

In [None]:
res_bldg = data_cleaning.add_PIN_column(res_bldg)

filter_d = data_cleaning.add_PIN_column(filter_d)
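
The add_PIN_column function lives in the project's data_cleaning module and is not reproduced here. A minimal sketch of the concatenation it is described as performing might look like the following. The name add_pin_column_sketch is hypothetical, and the length check assumes a six-character Major and a four-character Minor, as in the earlier example.

```python
import pandas as pd

def add_pin_column_sketch(df):
    """Hypothetical stand-in for data_cleaning.add_PIN_column."""
    out = df.copy()
    # Concatenate the zero-padded Major and Minor strings into one identifier
    out['PIN'] = out['Major'].astype('string') + out['Minor'].astype('string')
    return out

# Illustrative rows; the first reuses the Major/Minor example from earlier
demo = pd.DataFrame({'Major': ['919715', '000123'], 'Minor': ['0200', '0045']}, dtype='string')
demo = add_pin_column_sketch(demo)

# Sanity check: every PIN should be ten characters if the padding survived the import
all_ten = bool((demo['PIN'].str.len() == 10).all())
print(demo['PIN'].tolist(), all_ten)
```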

### Filtering by number of buildings per parcel

Each row in the residential building data set represents a single building, and each building is assigned to a parcel via its PIN. If we simply count the number of rows assigned to each parcel, we can keep only the parcels that contain exactly one building.

In [None]:
number_of_buildings = res_bldg.groupby('PIN').BldgNbr.count()
number_of_buildings.sort_values()

### List of useable PINs
Once I filter out all multi-building parcels, I will create a dataframe containing only the PINs of the buildings I want to look at.

In [None]:
one_building_parcels = pd.DataFrame(number_of_buildings[number_of_buildings==1].index) 
one_building_parcels = one_building_parcels.set_index('PIN')
one_building_parcels

Then I can perform an inner merge with the sales data to quickly filter out all multi-building parcels from the sales data.

In [None]:
filter_e = pd.merge(filter_d, one_building_parcels, how='inner', on='PIN')
filter_e
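
The count-then-merge steps above can be tried end to end on a toy example. The sketch below uses made-up PINs, building numbers, and prices, and also shows that filtering with isin keeps the same sales rows as the inner merge, which may be a simpler alternative when no extra columns are needed.

```python
import pandas as pd

# Made-up building rows: parcel 0000010002 has two buildings
buildings = pd.DataFrame({
    'PIN': ['0000010001', '0000010002', '0000010002', '0000010003'],
    'BldgNbr': [1, 1, 2, 1],
})
# Made-up sales rows: parcel 0000010004 has no building record at all
sales = pd.DataFrame({
    'PIN': ['0000010001', '0000010002', '0000010003', '0000010004'],
    'SalePrice': [400000, 900000, 525000, 610000],
})

# Count building rows per parcel and keep parcels with exactly one
counts = buildings.groupby('PIN').BldgNbr.count()
single = counts[counts == 1]

# Option 1: inner merge against the list of single-building PINs
merged = pd.merge(sales, single.reset_index()[['PIN']], how='inner', on='PIN')

# Option 2: boolean filter with isin
sales_single = sales[sales['PIN'].isin(single.index)]

print(merged['PIN'].tolist())
print(sales_single['PIN'].tolist())
```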

This has a noticeable effect on the data.

In [None]:
len(filter_e) - len(filter_d)

In [None]:
sns.distplot(filter_e.SalePrice)

### F. Removing mansions

In [None]:
filter_e = pd.merge(filter_e, res_bldg, how='inner', on='PIN')

In [None]:
filter_f = filter_e[filter_e.BldgGrade<12]
filter_f

In [None]:
sns.distplot(filter_f.SalePrice)

# Normalizing the Sales Data

Based on the distribution plot we received after the final filter was applied, we are still not operating on a normal distribution. The following steps are performed here simply to demonstrate the process that will be used with each iteration of our model testing.

We will first log-transform the sales data and assign those values to a new column:

In [None]:
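# work on an explicit copy so the new column is not assigned to a slice of filter_e
filter_f = filter_f.copy()
filter_f['LogSalePrice'] = np.log(filter_f.SalePrice)
sns.distplot(filter_f.LogSalePrice)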

This is a lot more normal than our right-skewed distribution from before. Now let's remove the outliers.

In [None]:
z_scores = np.abs(stats.zscore(filter_f.LogSalePrice))
no_outliers = filter_f[z_scores < 3]

In [None]:
sns.distplot(no_outliers.LogSalePrice)

This is much closer to the type of distribution that will lend itself to linear regression.
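
To see the effect of these two steps in numbers rather than plots, the sketch below applies them to synthetic, right-skewed prices and compares the skewness before and after the log transform. The lognormal parameters and the sample size are arbitrary choices for illustration and are not derived from the King County data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed "prices" (parameters are arbitrary, for illustration only)
rng = np.random.default_rng(0)
prices = pd.DataFrame({'SalePrice': rng.lognormal(mean=13, sigma=0.5, size=1000)})

# Step 1: log-transform the prices
prices['LogSalePrice'] = np.log(prices['SalePrice'])

# Step 2: drop rows more than 3 standard deviations from the mean log price
z = np.abs(stats.zscore(prices['LogSalePrice']))
trimmed = prices[z < 3]

# Skewness near zero indicates a roughly symmetric distribution
skew_raw = stats.skew(prices['SalePrice'])
skew_log = stats.skew(prices['LogSalePrice'])
print(f'skew before: {skew_raw:.2f}, skew after log: {skew_log:.2f}')
print(f'rows dropped as outliers: {len(prices) - len(trimmed)}')
```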

# Final PINs creation

I will create a CSV containing only the unique PINs of the properties that fall under our criteria. The PIN CSV will be created from the unique values of the PIN column of my filter_f dataframe. This PINS.csv file can be used to filter any additional datasets we have.

In [None]:
unique_pins = filter_f.PIN.unique()

PINS = pd.DataFrame(unique_pins, columns=['PIN']).set_index('PIN')
PINS.to_csv(data_folder+'PINS.csv')
PINS

# Export

In [None]:
res_bldg_final = pd.merge(res_bldg, PINS, on='PIN', how='inner')
res_bldg_final

# We use filter_d because it is the most recent dataframe that doesn't have the res_bldg data merged into it. 
# We want to keep only the columns from the original sales csv
sales_final = pd.merge(filter_d, PINS, on='PIN', how='inner')
sales_final

In [None]:
res_bldg_final.to_csv(data_folder+'EXTR_ResBldg_final.csv')

sales_final.to_csv(data_folder+'EXTR_RPSale_final.csv')

# Adding more data

First, make sure you import the original CSV with the right arguments. It may need a special encoding, and you may need to specify data types.

In [None]:
#note that some CSVs need to be read with the encoding argument set to 'latin-1'
parcel = pd.read_csv(data_folder+'EXTR_Parcel.csv', dtype={'Major': 'string', 'Minor':'string'}, encoding='latin-1')
accessory = pd.read_csv(data_folder+'EXTR_Accessory_V.csv', dtype={'Major': 'string', 'Minor':'string'}, encoding='latin-1')

Then, add a PIN column to each dataframe so it can be filtered by our PINs.

In [None]:
parcel = data_cleaning.add_PIN_column(parcel)
accessory = data_cleaning.add_PIN_column(accessory)

Check to make sure it looks okay.

In [None]:
parcel.head()

In [None]:
accessory.head()

Filter with the list of PINs

In [None]:
PINS = pd.read_csv(data_folder+'PINS.csv', dtype={'PIN': 'string'})
PINS = PINS.set_index('PIN')


parcel_final = parcel.join(PINS, how='inner', on='PIN')
print('finished parcels')

accessory_final = accessory.join(PINS, how='inner', on='PIN')
print('finished accessory')
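
The dtype argument matters when reading PINS.csv for the same zero-padding reason as before. The sketch below round-trips a tiny PIN list through an in-memory CSV and then uses it to filter a made-up parcel table with the same join pattern. The PIN values and the SqFtLot column are illustrative only.

```python
import io
import pandas as pd

# Write a small PIN list to an in-memory CSV, as PINS.to_csv would to disk
pins = pd.DataFrame({'PIN': ['0001230045', '9197150200']}).set_index('PIN')
buffer = io.StringIO()
pins.to_csv(buffer)
buffer.seek(0)

# Read it back as strings so the leading zeros survive the round trip
reloaded = pd.read_csv(buffer, dtype={'PIN': 'string'}).set_index('PIN')

# Made-up parcel rows; only the first PIN appears in the PIN list
parcel_demo = pd.DataFrame({
    'PIN': pd.array(['0001230045', '5555550000'], dtype='string'),
    'SqFtLot': [5000, 7200],
})

# Inner join of the PIN column against the PIN index keeps matching parcels only
kept = parcel_demo.join(reloaded, how='inner', on='PIN')
print(kept)
```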

Export the results to the data folder, with the suffix '_final'.

In [None]:
parcel_final.to_csv(data_folder+'EXTR_Parcel_final.csv')
accessory_final.to_csv(data_folder+'EXTR_Accessory_V_final.csv')