445 of them from Census data. how?

In [32]:
from __future__ import division
import pandas as pd
from IPython.display import display
from censusgeocode import CensusGeocode


## Preprocessing

In [33]:
cg = CensusGeocode()

In [72]:
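def get_block_group_id(lat, long_, cg):
    # Look up the 2010 Census block containing the point, then join the
    # tract code and the block group digit into a 7-character ID.
    d = cg.coordinates(x=long_, y=lat)[0]['2010 Census Blocks']
    return d[0]['TRACT'] + d[0]['BLKGRP']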

In [2]:
df = pd.read_hdf('../data/data_w_descs_and_census.h5')
df.shape

(905650, 155)

In [3]:
new_df = df[df.value_total.isnull()]

In [4]:
old_df = df
df = new_df

In [19]:
df_census = pd.read_pickle('../data/census_data_aggregated.pkl')['tract_and_block_group']
df_census.head(1).T

0    0001001
Name: tract_and_block_group, dtype: object

In [5]:
df.shape

(445, 155)

In [29]:
df.head(1).T

Unnamed: 0,74
CASE_ENQUIRY_ID,101000382293
OPEN_DT,2012-02-08 12:55:07
TARGET_DT,NaT
CLOSED_DT,2012-02-08 12:55:54
OnTime_Status,ONTIME
CASE_STATUS,Closed
CLOSURE_REASON,Case Closed Case Resolved potholes patched.
CASE_TITLE,Highway Maintenance
SUBJECT,Public Works Department
REASON,Highway Maintenance


## Looking at rows with incorrect Census block group IDs

In [12]:
df[['Property_Type', 'CASE_ENQUIRY_ID']].groupby('Property_Type').count()

Property_Type,CASE_ENQUIRY_ID
Address,143
Intersection,302


In [16]:
df[['TYPE', 'CASE_ENQUIRY_ID']].groupby('TYPE').count().sort_values('CASE_ENQUIRY_ID', ascending=False)

TYPE,CASE_ENQUIRY_ID
Request for Pothole Repair,60
Traffic Signal Repair,42
Street Light Outages,39
Pothole Repair (Internal),32
Requests for Street Cleaning,30
Request for Snow Plowing,28
Sign Repair,14
Schedule a Bulk Item Pickup,14
Graffiti Removal,12
Tree Maintenance Requests,10


In [110]:
df[['tract_and_block_group', 'CASE_ENQUIRY_ID']].groupby('tract_and_block_group').count().sort_values('CASE_ENQUIRY_ID', ascending=False)

tract_and_block_group,CASE_ENQUIRY_ID
9811003,130
4001001,129
4002003,40
4006001,34
4161022,24
4003003,15
3731001,11
3424004,11
3515001,9
4012003,9


In [30]:
with pd.option_context("display.max_rows", 200):
    display(df.tail()[['TYPE', 'LOCATION_STREET_NAME', 'Property_Type', 'tract_and_block_group', 'LATITUDE', 'LONGITUDE']])

Unnamed: 0,TYPE,LOCATION_STREET_NAME,Property_Type,tract_and_block_group,LATITUDE,LONGITUDE
892138,Traffic Signal Repair,INTERSECTION Commonwealth Ave & Essex St,Intersection,4001001,42.3504,-71.1109
892665,Request for Pothole Repair,INTERSECTION Commonwealth Ave & Winslow Rd,Intersection,4002003,42.3518,-71.1224
893215,Request for Pothole Repair,INTERSECTION Commonwealth Ave & University Rd,Intersection,4001001,42.3502,-71.1094
899810,Unshoveled Sidewalk,64 Corey Rd,Address,4006001,42.3402,-71.1428
902884,Schedule a Bulk Item Pickup,7 Sherman St,Address,3501031,42.3872,-71.0765


In [73]:
get_block_group_id(42.3402, -71.1428, cg)

u'0005042'

For 64 Corey Rd, the Census geocoding API returns block group `0005042`, whereas the SHP file assigned `4006001`.

In [40]:
'0005042' in df_census.tolist()

True

And this ID is in our dataset.
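The seven-character IDs used here are the six-digit tract code followed by the one-digit block group, which is how `get_block_group_id` builds them. As a minimal sketch, assuming the standard Census GEOID layout (state, county, tract, block), the same ID can be sliced out of a full GEOID. The helper name `block_group_id_from_geoid` and the sample GEOID are illustrative and do not come from the notebook's data.

```python
def block_group_id_from_geoid(geoid):
    """Return tract (6 digits) + block group (1 digit) from a Census GEOID."""
    if not geoid.isdigit() or len(geoid) < 12:
        raise ValueError('expected a GEOID of at least 12 digits: %r' % (geoid,))
    # Layout: state (2) + county (3) + tract (6) + block (4).
    # The block group is the first digit of the block number, so
    # characters 5..11 hold the tract plus the block group.
    return geoid[5:12]


# Illustrative 15-digit block GEOID, not taken from the notebook's data.
print(block_group_id_from_geoid('250250005042000'))  # prints 0005042
```

Keeping the IDs as strings matters here: leading zeros such as those in `0005042` would be lost if the values were cast to integers.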

Let's overwrite the incorrect Census block groups using a source other than the SHP files, and check whether all of the resulting block groups are in our dataset.

## Are newly created Census block group IDs in our dataset?

In [44]:
df.shape

(445, 155)

In [46]:
df_lat_longs = df[['LONGITUDE', 'LATITUDE']].drop_duplicates()
df_lat_longs.shape

(48, 2)

Many of the lat/long points with incorrect Census block group IDs are the exact same points: the 445 rows contain only 48 distinct coordinates. Weird. Let's investigate further.

Makes sense, as many of these issues deal with traffic lights, which are in the exact same location.
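Because so many rows share coordinates, each distinct point needs to be geocoded only once. The following is a rough sketch of that idea in plain Python, not the notebook's actual code. `geocode_unique` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a real call such as `get_block_group_id`.

```python
def geocode_unique(points, lookup):
    """Call lookup once per distinct (lat, lon) pair and reuse the result."""
    cache = {}
    results = []
    for lat, lon in points:
        key = (lat, lon)
        if key not in cache:
            # Only the first occurrence of a coordinate triggers a lookup.
            cache[key] = lookup(lat, lon)
        results.append(cache[key])
    return results, len(cache)


def fake_lookup(lat, lon):
    # Hypothetical stand-in for a network call to the geocoding API.
    return 'bg(%s,%s)' % (lat, lon)


points = [(42.3504, -71.1109), (42.3504, -71.1109), (42.3502, -71.1094)]
ids, n_lookups = geocode_unique(points, fake_lookup)
print(n_lookups)  # prints 2
```

The result list has one entry per input row, so it can be assigned straight back onto the original rows.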

In [47]:
with pd.option_context("display.max_rows", 200):
    display(df.tail(20)[['TYPE', 'LOCATION_STREET_NAME', 'Property_Type', 'tract_and_block_group', 'LATITUDE', 'LONGITUDE']].sort_values('tract_and_block_group'))

Unnamed: 0,TYPE,LOCATION_STREET_NAME,Property_Type,tract_and_block_group,LATITUDE,LONGITUDE
902884,Schedule a Bulk Item Pickup,7 Sherman St,Address,3501031,42.3872,-71.0765
893215,Request for Pothole Repair,INTERSECTION Commonwealth Ave & University Rd,Intersection,4001001,42.3502,-71.1094
892138,Traffic Signal Repair,INTERSECTION Commonwealth Ave & Essex St,Intersection,4001001,42.3504,-71.1109
890710,General Lighting Request,INTERSECTION Commonwealth Ave & Essex St,Intersection,4001001,42.3504,-71.1109
888377,Traffic Signal Repair,INTERSECTION Commonwealth Ave & Saint Marys St,Intersection,4001001,42.3499,-71.1066
886922,Street Light Outages,INTERSECTION Commonwealth Ave & University Rd,Intersection,4001001,42.3502,-71.1094
871095,Request for Snow Plowing,INTERSECTION Commonwealth Ave & Essex St,Intersection,4001001,42.3504,-71.1109
875513,Traffic Signal Inspection,INTERSECTION Commonwealth Ave & Essex St,Intersection,4001001,42.3504,-71.1109
875168,Request for Pothole Repair,INTERSECTION Commonwealth Ave & Saint Marys St,Intersection,4001001,42.3499,-71.1066
883555,Tree Maintenance Requests,1066 Commonwealth Ave,Address,4002003,42.3516,-71.1233


In [79]:
df_lat_longs.head(2).apply(lambda row: row['LATITUDE'], axis=1)

74      42.2932
4613    42.3518
dtype: float64

In [84]:
new_block_grp_ids = df_lat_longs.apply(lambda row: get_block_group_id(row['LATITUDE'], row['LONGITUDE'], cg), axis=1)

In [90]:
aa = pd.DataFrame(new_block_grp_ids)
aa.columns = ['new_block_grp_ids']
aa.head(2)

Unnamed: 0,new_block_grp_ids
74,9811003
4613,4002003


In [92]:
aa['is_in_dataset'] = aa['new_block_grp_ids'].isin(df_census.tolist())
aa.head(2)

Unnamed: 0,new_block_grp_ids,is_in_dataset
74,9811003,False
4613,4002003,False


In [94]:
aa['old_block_grp_ids'] = df['tract_and_block_group']
aa.head()

Unnamed: 0,new_block_grp_ids,is_in_dataset,old_block_grp_ids
74,9811003,False,9811003
4613,4002003,False,4002003
6762,4161022,False,4161022
7090,9811003,False,9811003
7571,4001001,False,4001001


In [95]:
aa

Unnamed: 0,new_block_grp_ids,is_in_dataset,old_block_grp_ids
74,9811003,False,9811003
4613,4002003,False,4002003
6762,4161022,False,4161022
7090,9811003,False,9811003
7571,4001001,False,4001001
12551,4001001,False,4001001
13649,1101031,True,9811003
19450,3012,True,3731001
25490,9811003,False,9811003
27623,7031,True,4002003


Let's look at data point 74.

In [106]:
df.loc[74:75, "LATITUDE":].T

Unnamed: 0,74
LATITUDE,42.2932
LONGITUDE,-71.0965
Source,Employee Generated
Geocoded_Location,"(42.2932, -71.0965)"
case_enquiry_id,1.01e+11
description,
specific_location,
title,Highway Maintenance at Intersection of Canterb...
tract_and_block_group,9811003
bedroom_total_ppl,


In [109]:
df_census.tail(20)

625    1805002
626    1805003
627    1805004
628    9801011
629    9803001
630    9807001
631    9810001
632    9811001
633    9811002
634    9811004
635    9812011
636    9812021
637    9813001
638    9813002
639    9815011
640    9815021
641    9816001
642    9817001
643    9818001
644    9901010
Name: tract_and_block_group, dtype: object

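Only some of the new block group IDs are in our dataset, unfortunately. For example, `9811003` (data point 74) is not in `df_census`, which lists only `9811001`, `9811002`, and `9811004` for that tract. This means that the other NA values have a different cause, the main one being the inner join.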

## Reason for missing X values

I had joined the Census data with an inner join, so any Census block group with missing values was omitted. This resulted in missing X values for ~250 of the 445 points.
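To illustrate the effect, here is a small sketch with two made-up Census tables; the column `income_median` and all values are hypothetical. An inner join keeps only block groups present in both tables, while an outer join with `indicator=True` shows which ones would have been dropped.

```python
import pandas as pd

# Two made-up Census tables keyed by block group; values are illustrative only.
income = pd.DataFrame({
    'tract_and_block_group': ['0001001', '9811003'],
    'income_median': [50000.0, 42000.0],
})
housing = pd.DataFrame({
    'tract_and_block_group': ['0001001'],
    'value_total': [100.0],
})

# The inner join silently drops 9811003 because it has no housing row.
inner = income.merge(housing, on='tract_and_block_group', how='inner')

# The outer join keeps every block group; the _merge column records
# whether a key was found in both tables or only one of them.
outer = income.merge(housing, on='tract_and_block_group',
                     how='outer', indicator=True)
dropped = outer.loc[outer['_merge'] != 'both', 'tract_and_block_group'].tolist()

print(len(inner), len(outer), dropped)
```

Any service request located in a dropped block group would then have no Census values to join against, which matches the NA pattern seen above.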

There also seem to be mislabeled block groups, affecting ~200 of the data points. When I use the `censusgeocode` package, I get different block groups for these rows.
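One way to separate the two causes is to compare the old and new IDs row by row and check whether the new ID exists in the Census data. This is a sketch on toy values, not a reproduction of the notebook's results; `census_ids` is a made-up stand-in for `df_census`.

```python
import pandas as pd

# Made-up stand-in for the list of block groups present in the Census data.
census_ids = ['0005042', '1101031']

rows = pd.DataFrame({
    'old_block_grp_ids': ['9811003', '4006001', '9811003'],
    'new_block_grp_ids': ['9811003', '0005042', '1101031'],
})

# A row is relabeled when the geocoder disagrees with the original ID.
rows['relabeled'] = rows['old_block_grp_ids'] != rows['new_block_grp_ids']
# A row is fixable when the new ID actually has Census data.
rows['new_in_dataset'] = rows['new_block_grp_ids'].isin(census_ids)

# Count rows in each combination of the two flags.
summary = rows.groupby(['relabeled', 'new_in_dataset']).size()
print(summary)
```

Rows that are not relabeled and whose ID is absent from the Census data point to the inner join, while relabeled rows point to the original block group assignment.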

Since the points are so few, and the process of fixing the PKL file is laborious, I will drop these NAs for the time being.