## Objective

I'm most interested in which factors are most strongly associated with whether an issue is unresolved, so a random forest seems like the most appropriate model choice: I can read those associations off its variable importance plot.

Logistic regression is a close second, but random forests usually have better predictive performance.

Random forests can also do a better job of getting signal out of raw latitude and longitude. Logistic regression would need a linear relationship between latitude/longitude and the log-odds of the issue being unresolved, which is a taller order.
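To make the linearity point concrete, here is a minimal sketch using only the standard library. The coefficients, coordinates, and the tree-style rule are hypothetical values chosen for illustration; nothing here is fitted to the 311 data. With log-odds that are linear in latitude, the predicted probability can only rise or fall steadily as you move north, whereas a tree-style rule can single out a band in the middle.

```python
import math

def logistic_probability(lat, long, intercept, coef_lat, coef_long):
    """Logistic model: the log-odds are a linear function of lat and long."""
    log_odds = intercept + coef_lat * lat + coef_long * long
    return 1.0 / (1.0 + math.exp(-log_odds))

def tree_style_probability(lat):
    """Hypothetical tree-like rule: two splits on latitude isolate a band."""
    return 0.8 if 42.32 < lat < 42.37 else 0.2

# Hypothetical coefficients, for illustration only (not estimated from data).
INTERCEPT, COEF_LAT, COEF_LONG = -84.0, 2.0, 0.0

# Walk north along a fixed longitude.
lats = [42.30, 42.33, 42.36, 42.39]
logistic_probs = [
    logistic_probability(lat, -71.06, INTERCEPT, COEF_LAT, COEF_LONG)
    for lat in lats
]
tree_probs = [tree_style_probability(lat) for lat in lats]

# The logistic probabilities change monotonically along the walk;
# the tree-style rule goes low, high, high, low.
print(logistic_probs)
print(tree_probs)
```

A random forest averages many such piecewise rules, which is why it can pick up a localized pocket of unresolved issues without any hand-built features.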

In [35]:
from __future__ import division
import pandas as pd
from datetime import timedelta, datetime

In [42]:
df = pd.read_pickle('../data/data_w_transformed_census_and_removed_invalid_rows_and_cols.pkl')
df.shape

(716310, 36)

In [43]:
df.head(1).T

Unnamed: 0,0
CASE_ENQUIRY_ID,101000958209
OPEN_DT,2013-11-01 09:27:19
TARGET_DT,2013-11-15 09:27:19
CLOSED_DT,2013-11-27 10:15:45
CASE_TITLE,Sign Repair
SUBJECT,Transportation - Traffic Division
REASON,Signs & Signals
TYPE,Sign Repair
Department,BTDT
SubmittedPhoto,False


## Preprocessing

In [33]:
df.TARGET_DT.head(100).isnull().sum()

21

In [40]:
df.LOCATION_ZIPCODE.head(100).isnull().sum()

14

In [None]:
# Whole days between each issue's open date and the reference date, 1 Feb 2016.
df['days_from_feb_2016'] = (datetime(year=2016, month=2, day=1) - df.OPEN_DT).map(lambda x: x.days)
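
As a sanity check on what this feature holds, the same subtraction can be done by hand for the sample row shown earlier (OPEN_DT of 2013-11-01 09:27:19). This is a sketch with plain `datetime` objects rather than a pandas column; the variable names are illustrative.

```python
from datetime import datetime

REFERENCE_DATE = datetime(year=2016, month=2, day=1)

# OPEN_DT from the sample row printed by df.head(1).T
open_dt = datetime(2013, 11, 1, 9, 27, 19)

# Subtracting two datetimes gives a timedelta; .days keeps only whole days,
# so the partial day (14:32:41 here) is dropped.
elapsed = REFERENCE_DATE - open_dt
days_from_feb_2016 = elapsed.days

print(days_from_feb_2016)  # 821
```

Because `.days` truncates, two issues opened on the same calendar day get the same value regardless of the time of day.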

In [44]:
df = df.drop(
    ['OPEN_DT',
     'TARGET_DT',
     'CLOSED_DT',
     'COMPLETION_TIME',
     'Property_ID',
     'LOCATION_STREET_NAME',
     'CASE_TITLE',
     'CASE_ENQUIRY_ID'
    ], 
    axis=1
)

In [45]:
df.shape

(716310, 30)

In [53]:
df['LOCATION_ZIPCODE'] = df['LOCATION_ZIPCODE'].astype('object').fillna('other')

In [46]:
df.head(1).T

Unnamed: 0,0
CASE_ENQUIRY_ID,101000958209
CASE_TITLE,Sign Repair
SUBJECT,Transportation - Traffic Division
REASON,Signs & Signals
TYPE,Sign Repair
Department,BTDT
SubmittedPhoto,False
neighborhood,Downtown / Financial District
LOCATION_ZIPCODE,
Property_Type,Intersection


In [50]:
df[['LOCATION_ZIPCODE', 'tract_and_block_group']].dtypes

LOCATION_ZIPCODE         float64
tract_and_block_group     object
dtype: object

## Scratch: investigating race for 0701018

In [54]:
aa = pd.read_pickle('../data/census_data_aggregated.pkl')
aa.head(2)

Unnamed: 0,tract_and_block_group,bedroom_total_ppl,bedroom_0,bedroom_1,bedroom_2,bedroom_3,bedroom_4,bedroom_5+,school_total,school_0_none,...,value_175000_199999,value_200000_249999,value_250000_299999,value_300000_399999,value_400000_499999,value_500000_749999,value_750000_999999,value_1000000_1499999,value_1500000_1999999,value_2000000+
0,1001,560,8,120,193,183,56,0,960,23,...,13,38,22,5,51,41,0,0,0,0
1,1002,485,9,30,239,132,36,39,794,27,...,0,9,11,55,19,8,0,0,0,0


In [57]:
aa[aa.tract_and_block_group == '0701018'].T.loc['race_total':]

Unnamed: 0,214
race_total,210
race_white,163
race_black,19
race_asian,0
race_hispanic,14
race_other,14
income_total,122
income_0_10000,0
income_10000_14999,0
income_15000_19999,0
