# Merging PLUTO and Census data with HPD data
In this notebook, we merge the processed HPD data with the PLUTO and census data. We first merge the HPD and PLUTO data, using BBL (the borough-block-lot identifier) as the merge key. We then merge the result with the census income data, using BoroughID and census tract (CT2010) as keys.
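
As a minimal sketch of this two-stage merge, the toy frames below stand in for the PLUTO, HPD, and census tables. The key values are borrowed from the outputs in this notebook, but the income figure is made up, the frame names are hypothetical, and the `validate` argument (available in recent pandas versions) is an optional safeguard that the notebook's own code does not use.

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical names, illustrative values).
pluto_toy = pd.DataFrame({
    'BBL': [2022600018, 2022610045],
    'CT2010': [1900, 1900],
    'UnitsRes': [8, 12],
})
hpd_toy = pd.DataFrame({
    'BBL': [2022600018, 2022600018, 2022610045],
    'BoroughID': [2, 2, 2],
    'ComplaintID': [7449790, 7449789, 7422700],
})
income_toy = pd.DataFrame({
    'CT2010': [1900],
    'BoroughID': [2],
    'Median_income': [50000],  # made-up value
})

# Stage 1: one PLUTO row per lot, possibly many HPD rows per lot.
stage1 = pd.merge(pluto_toy, hpd_toy, on='BBL', how='inner',
                  validate='one_to_many')

# Stage 2: one census row per (tract, borough) pair.
stage2 = pd.merge(income_toy, stage1, on=['CT2010', 'BoroughID'],
                  how='inner', validate='one_to_many')

print(stage2.shape)
```

With `validate` set, pandas raises an error if a key that is expected to be unique on the left side turns out to be duplicated, which would otherwise silently multiply rows.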

In [47]:
import pandas as pd
from get_clean_pluto_bbl_data import *
import re

pluto = get_clean_pluto_data()
hpd = pd.read_csv('data/merged_complaints_problems_violations.csv')

merged_hpd_pluto = pd.merge(pluto, hpd, on='BBL', how='inner')
merged_hpd_pluto.head(10)

,UnitsRes,AssessTot,YearBuilt,BBL,CT2010,YearLastAlter,Avg_value_per_res_unit,Unnamed: 0,ProblemID,ComplaintID,...,MajorCategoryID,MinorCategoryID,CodeID,StatusDate,StatusDescriptionID,BoroughID,ReceivedDate,Tot_A_violations,Tot_B_violations,Tot_C_violations
0,8,104219,1920,2022600018,1900,2009,13027.375,40661,15420728,7449790,...,59,348,2713,2015-04-30,5,2,2015-04-28,1,0,4
1,8,104219,1920,2022600018,1900,2009,13027.375,40662,15420727,7449789,...,59,349,2716,2015-04-30,5,2,2015-04-28,1,0,4
2,12,347400,1925,2022610045,1900,0,28950.0,57662,15360614,7422700,...,63,375,2817,2015-05-04,1,2,2015-04-04,4,8,7
3,12,347400,1925,2022610045,1900,0,28950.0,57663,15360602,7422690,...,59,349,2717,2015-04-08,2,2,2015-04-04,4,8,7
4,12,347400,1925,2022610045,1900,0,28950.0,57664,15471137,7469150,...,63,375,2817,2015-06-15,1,2,2015-05-21,4,8,7
5,12,347400,1925,2022610045,1900,0,28950.0,57665,15471138,7469150,...,63,375,2817,2015-06-15,1,2,2015-05-21,4,8,7
6,12,347400,1925,2022610045,1900,0,28950.0,57666,15471139,7469150,...,58,343,2691,2015-06-12,1,2,2015-05-21,4,8,7
7,12,347400,1925,2022610045,1900,0,28950.0,57667,15541483,7496114,...,63,375,2817,2015-07-18,3,2,2015-06-24,4,8,7
8,12,347400,1925,2022610045,1900,0,28950.0,57668,15541484,7496114,...,58,343,2686,2015-07-18,2,2,2015-06-24,4,8,7
9,12,347400,1925,2022610045,1900,0,28950.0,57669,15541485,7496114,...,9,65,2536,2015-07-18,3,2,2015-06-24,4,8,7


In [81]:
merged_hpd_pluto.shape

(278448, 22)

In [82]:
merged_hpd_pluto.columns

Index([u'UnitsRes', u'AssessTot', u'YearBuilt', u'BBL', u'CT2010',
       u'YearLastAlter', u'Avg_value_per_res_unit', u'ProblemID',
       u'ComplaintID', u'UnitTypeID', u'SpaceTypeID', u'TypeID',
       u'MajorCategoryID', u'MinorCategoryID', u'CodeID', u'StatusDate',
       u'StatusDescriptionID', u'BoroughID', u'ReceivedDate',
       u'Tot_A_violations', u'Tot_B_violations', u'Tot_C_violations'],
      dtype='object')

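Note that the row index from the HPD file has been carried into the merged dataset as a column, Unnamed: 0 (it has 278448 unique values in a dataframe with 278448 rows), so we drop it before proceeding. Judging by the cell numbers, the shape and column listing above appear to have been produced after this drop, which is why they do not include it.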

In [73]:
len(merged_hpd_pluto['Unnamed: 0'].unique())

278448

In [74]:
merged_hpd_pluto = merged_hpd_pluto.drop('Unnamed: 0', axis=1)
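
A stray `Unnamed: 0` column typically appears when a frame is written with `to_csv` using its default `index=True` and then read back without `index_col`. The sketch below reproduces this on a toy frame with an in-memory buffer (it assumes Python 3 and a hypothetical two-row table) and shows two common ways to avoid the extra column.

```python
import io
import pandas as pd

toy = pd.DataFrame({'BBL': [2022600018, 2022610045], 'UnitsRes': [8, 12]})

# Default to_csv writes the index as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))

# Option 1: skip the index when writing.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: tell read_csv that the first column is the index.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)
```

Either option leaves only the real data columns, so no drop is needed after reading.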

In [141]:
from get_income_data_from_census import *
income = get_clean_income_data()

merged_pluto_hpd_census = pd.merge(income, merged_hpd_pluto, on=['CT2010','BoroughID'], how='inner')

In [142]:
merged_pluto_hpd_census.head(7)

,Median_income,State,CT2010,BoroughID,UnitsRes,AssessTot,YearBuilt,BBL,YearLastAlter,Avg_value_per_res_unit,...,TypeID,MajorCategoryID,MinorCategoryID,CodeID,StatusDate,StatusDescriptionID,ReceivedDate,Tot_A_violations,Tot_B_violations,Tot_C_violations
0,69514,36,200,2,2,7363,1940,2034410083,0,3681.5,...,4,10,341,2678,2015-05-04,1,2015-04-17,0,0,0
1,69514,36,200,2,2,26280,2002,2034430058,0,13140.0,...,1,10,71,2461,2015-07-23,1,2015-06-28,0,0,0
2,69514,36,200,2,2,26280,2002,2034430058,0,13140.0,...,1,9,68,647,2015-07-23,1,2015-06-28,0,0,0
3,69514,36,200,2,3,20750,1945,2034520056,0,6916.666667,...,3,10,341,2681,2015-10-01,1,2015-09-15,0,0,0
4,69514,36,200,2,2,20966,1945,2034530034,1994,10483.0,...,1,63,375,2817,2015-09-21,1,2015-09-01,0,0,0
5,69514,36,200,2,4,67574,1945,2034530044,0,16893.5,...,1,59,349,2715,2015-04-10,5,2015-04-09,3,6,5
6,69514,36,200,2,4,67574,1945,2034530044,0,16893.5,...,1,59,349,2715,2015-04-30,5,2015-04-29,3,6,5


In [143]:
print(merged_pluto_hpd_census.shape)

(278448, 24)
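
The row count is the same before and after the census merge (278448), which suggests that every record found a matching tract. One way to check this directly is a left merge with `indicator=True`, sketched here on toy frames. The frame names are hypothetical, and tract 1900 is deliberately left without an income row so that the check has something to report.

```python
import pandas as pd

records = pd.DataFrame({
    'CT2010': [200, 200, 1900],
    'BoroughID': [2, 2, 2],
    'BBL': [2034410083, 2034430058, 2022600018],
})
income_toy = pd.DataFrame({
    'CT2010': [200],
    'BoroughID': [2],
    'Median_income': [69514],
})

# A left merge keeps every record; the _merge column reports
# whether a matching census row was found.
checked = pd.merge(records, income_toy, on=['CT2010', 'BoroughID'],
                   how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched))
```

An empty `unmatched` frame on the real data would confirm that the inner merge discards nothing.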


In [144]:
merged_pluto_hpd_census.columns

Index([u'Median_income', u'State', u'CT2010', u'BoroughID', u'UnitsRes',
       u'AssessTot', u'YearBuilt', u'BBL', u'YearLastAlter',
       u'Avg_value_per_res_unit', u'ProblemID', u'ComplaintID', u'UnitTypeID',
       u'SpaceTypeID', u'TypeID', u'MajorCategoryID', u'MinorCategoryID',
       u'CodeID', u'StatusDate', u'StatusDescriptionID', u'ReceivedDate',
       u'Tot_A_violations', u'Tot_B_violations', u'Tot_C_violations'],
      dtype='object')

Now, to save on disk space, let's drop ID and other non-informative features, including State, which takes the same value (36) in every row shown above. Note that we retain ComplaintID to preserve the hierarchical structure in our data, but drop ProblemID (which is a unique identifier). Similarly, we retain ReceivedDate to capture potential temporal relationships in complaint outcome, but drop StatusDate. Finally, we retain BBL and CT2010 to allow for additional merges in the future (and because CT2010 might capture informative neighborhood variability in complaint outcome).

In [145]:
merged_pluto_hpd_census = merged_pluto_hpd_census.drop(['State', 'ProblemID','StatusDate'],axis=1)
merged_pluto_hpd_census.shape

(278448, 21)

In [146]:
merged_pluto_hpd_census.columns

Index([u'Median_income', u'CT2010', u'BoroughID', u'UnitsRes', u'AssessTot',
       u'YearBuilt', u'BBL', u'YearLastAlter', u'Avg_value_per_res_unit',
       u'ComplaintID', u'UnitTypeID', u'SpaceTypeID', u'TypeID',
       u'MajorCategoryID', u'MinorCategoryID', u'CodeID',
       u'StatusDescriptionID', u'ReceivedDate', u'Tot_A_violations',
       u'Tot_B_violations', u'Tot_C_violations'],
      dtype='object')
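
To illustrate the hierarchy that `ComplaintID` preserves, the sketch below counts the problems filed under each complaint. The IDs are taken from the first merged output in this notebook; the two-column frame itself is a simplified, hypothetical stand-in.

```python
import pandas as pd

problems = pd.DataFrame({
    'ProblemID': [15360614, 15360602, 15471137, 15471138, 15471139],
    'ComplaintID': [7422700, 7422690, 7469150, 7469150, 7469150],
})

# Number of problem rows recorded under each complaint.
problems_per_complaint = problems.groupby('ComplaintID').size()
print(problems_per_complaint)
```

Since each row of the merged table appears to correspond to one problem, the same count should still work after ProblemID is dropped.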

Now, before we write this final dataset to disk, let's see if it has any missing values:

In [147]:
merged_pluto_hpd_census.isnull().any(axis=1).sum()

1

It looks like we still have one record with missing values. Since it represents a negligible fraction of the dataset (1 of 278448 rows), we drop it.

In [148]:
merged_pluto_hpd_census = merged_pluto_hpd_census[~(merged_pluto_hpd_census.isnull().any(axis=1))]

In [149]:
merged_pluto_hpd_census.shape

(278447, 21)
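
Before dropping rows, it can help to see which columns hold the missing values. This is a minimal sketch on a toy frame with a made-up gap; it also shows that a boolean mask over `isnull().any(axis=1)` and `dropna()` give the same result.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Median_income': [69514.0, np.nan, 69514.0],
    'UnitsRes': [2, 2, 3],
})

# Missing values per column, then the number of rows with any missing value.
print(toy.isnull().sum())
print(toy.isnull().any(axis=1).sum())

# Two equivalent ways to drop incomplete rows.
masked = toy[~toy.isnull().any(axis=1)]
dropped = toy.dropna()
print(masked.equals(dropped))
```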

In [150]:
merged_pluto_hpd_census.head(5)

,Median_income,CT2010,BoroughID,UnitsRes,AssessTot,YearBuilt,BBL,YearLastAlter,Avg_value_per_res_unit,ComplaintID,...,SpaceTypeID,TypeID,MajorCategoryID,MinorCategoryID,CodeID,StatusDescriptionID,ReceivedDate,Tot_A_violations,Tot_B_violations,Tot_C_violations
0,69514,200,2,2,7363,1940,2034410083,0,3681.5,7438405,...,543,4,10,341,2678,1,2015-04-17,0,0,0
1,69514,200,2,2,26280,2002,2034430058,0,13140.0,7498988,...,543,1,10,71,2461,1,2015-06-28,0,0,0
2,69514,200,2,2,26280,2002,2034430058,0,13140.0,7498988,...,543,1,9,68,647,1,2015-06-28,0,0,0
3,69514,200,2,3,20750,1945,2034520056,0,6916.666667,7561878,...,543,3,10,341,2681,1,2015-09-15,0,0,0
4,69514,200,2,2,20966,1945,2034530034,1994,10483.0,7550202,...,543,1,63,375,2817,1,2015-09-01,0,0,0
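
`ReceivedDate` is likely still stored as text here, since the HPD file is read without date parsing. If it is later used to study temporal patterns, one option is to parse it and derive calendar features. This is a sketch only: the derived column names are hypothetical and are not part of the saved dataset.

```python
import pandas as pd

toy = pd.DataFrame({'ReceivedDate': ['2015-04-17', '2015-06-28', '2015-09-15']})

# Parse the ISO-formatted strings into datetimes.
received = pd.to_datetime(toy['ReceivedDate'], format='%Y-%m-%d')

# Hypothetical derived features.
toy['ReceivedMonth'] = received.dt.month
toy['ReceivedDayOfWeek'] = received.dt.dayofweek  # Monday=0, Sunday=6
print(toy)
```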


In [152]:
merged_pluto_hpd_census.to_csv('data/merged_hpd_census_pluto.csv')