## Objective

What I'm most interested in is which factors are most associated with whether an issue is unresolved, so a random forest seems like the most appropriate model choice: I can look at its variable importance plot.

Logistic regression is a close second, but random forests usually perform better.

Random forests can also do a better job of getting signal out of raw latitude and longitude. For logistic regression to do the same, there would need to be a linear relationship between latitude/longitude and the log-odds of the issue being unresolved, which is a taller order.

The same goes for my engineered feature `days_from_feb_2016`: logistic regression wouldn't do a good job with it unless its effect on the log-odds were roughly linear, but a tree-based model can split on it at arbitrary thresholds.
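
To make the point about non-linear features concrete, here is a minimal sketch on synthetic data (not the 311 data). It uses a single made-up feature whose positive class sits in the middle of its range, loosely like a seasonal effect in `days_from_feb_2016`. The data, sample size, and hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: the label is True only in the middle of the range,
# so the log-odds are not a linear (or even monotone) function of x.
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(2000, 1))
y = np.abs(x[:, 0]) < 0.5

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=0)

logit = LogisticRegression().fit(x_tr, y_tr)
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(x_tr, y_tr)

logit_acc = logit.score(x_te, y_te)
forest_acc = forest.score(x_te, y_te)
print("logistic regression accuracy: %.3f" % logit_acc)
print("random forest accuracy:       %.3f" % forest_acc)
```

A single linear boundary on `x` can get at most about three quarters of these points right, while a shallow tree only needs two splits to isolate the middle of the range.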

In [35]:
from __future__ import division
import pandas as pd
from datetime import timedelta, datetime

In [36]:
import warnings
import seaborn as sns

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from pylab import rcParams
rcParams['figure.figsize'] = 20, 5

import matplotlib.pyplot as plt

%matplotlib inline

In [72]:
df = pd.read_pickle('../data/data_w_transformed_census_and_removed_invalid_rows_and_cols.pkl')
df.shape

(786867, 36)

In [68]:
df.head(1).T

Unnamed: 0,905425
REASON,Street Cleaning
TYPE,Request for Snow Plowing
SubmittedPhoto,True
neighborhood,Dorchester
LOCATION_ZIPCODE,2124
Property_Type,Address
LATITUDE,42.2809
LONGITUDE,-71.068
Source,Citizens Connect App
race_white,0.242399


## Preprocessing

In [None]:
df.TARGET_DT.head(100).isnull().sum()

In [None]:
df.LOCATION_ZIPCODE.head(100).isnull().sum()

In [40]:
df.race_white.isnull().sum()

0

In [41]:
df.poverty_pop_w_ssi.isnull().sum()

0

In [75]:
df['days_from_feb_2016'] = (datetime(year=2016, month=2, day=1) - df.OPEN_DT).map(lambda x: x.days)
df['LOCATION_ZIPCODE'] = df['LOCATION_ZIPCODE'].astype('object').fillna('other')
df['neighborhood'] = df['neighborhood'].fillna('other')
df['is_snow_in_type'] = df.TYPE.map(lambda txt: 'snow' in txt.lower())
df['Property_Type'] = df['Property_Type'].fillna('other')
# df['race_majority'] = df[[i for i in df.columns if 'race' in i]].idxmax(axis=1) # not useful bc will be dummified

In [76]:
df = df.drop(
    ['OPEN_DT',
     'TARGET_DT',
     'CLOSED_DT',
     'COMPLETION_TIME',
     'Property_ID',
     'LOCATION_STREET_NAME',
     'CASE_TITLE',
     'CASE_ENQUIRY_ID',
     'SUBJECT',
     'Department',
     'tract_and_block_group'
    ], 
    axis=1
)

In [44]:
df.shape

(716310, 27)

In [7]:
df.head(1).T

Unnamed: 0,0
REASON,Signs & Signals
TYPE,Sign Repair
SubmittedPhoto,False
neighborhood,Downtown / Financial District
LOCATION_ZIPCODE,other
Property_Type,Intersection
LATITUDE,42.3537
LONGITUDE,-71.058
Source,Constituent Call
race_white,0.77619


### Dummifying because sklearn's random forest implementation doesn't accept string values

In [45]:
def dummify(df, column):
    # from Darren's linear regression slides
    # get_dummies orders its columns by sorted level, so the dropped (last) level is the baseline
    levels = sorted(df[column].unique())
    print('{} is your baseline'.format(levels[-1]))
    dummy = (pd.get_dummies(df[column])
             .rename(columns=lambda x: column + '_' + str(x))
             .iloc[:, :len(levels) - 1])  # keep every level except the baseline
    df = df.drop(column, axis=1)  # not inplace, so the caller's DataFrame is left untouched
    return pd.concat([df, dummy], axis=1)
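
For comparison, pandas can do the same kind of one-hot encoding in a single call. This is a minimal sketch on a made-up four-row frame, not the real data. Note that `drop_first=True` drops the first level in sorted order as the baseline, whereas `dummify` drops the last.

```python
import pandas as pd

# Toy frame with made-up rows; the Source values are ones that appear in this data set.
toy = pd.DataFrame({
    'Source': ['Citizens Connect App', 'Constituent Call', 'Twitter', 'Constituent Call'],
    'SubmittedPhoto': [True, False, False, True],
})

# Columns not listed in `columns` are passed through unchanged, ahead of the dummies.
encoded = pd.get_dummies(toy, columns=['Source'], drop_first=True)
print(list(encoded.columns))
```

Passing several column names in `columns` would encode them all at once, which could replace the chain of `dummify` calls if the choice of baseline level does not matter.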

In [77]:
df1 = dummify(df, 'TYPE')
df2 = dummify(df1, 'neighborhood')
df3 = dummify(df2, 'LOCATION_ZIPCODE')
df4 = dummify(df3, 'Property_Type')
df5 = dummify(df4, 'Source')
df6 = dummify(df5, 'school')
df7 = dummify(df6, 'housing')
df8 = dummify(df7, 'REASON')

Zoning is your baseline
other is your baseline
other is your baseline
other is your baseline
Twitter is your baseline
8_6th_grade is your baseline
rent is your baseline
Weights and Measures is your baseline


In [47]:
df8.shape

(716310, 356)

In [78]:
df.isnull().sum()

REASON                             0
TYPE                               0
SubmittedPhoto                     0
neighborhood                       0
LOCATION_ZIPCODE                   0
Property_Type                      0
LATITUDE                           0
LONGITUDE                          0
Source                             0
race_white                         0
race_black                         0
race_asian                         0
race_hispanic                      0
race_other                         0
poverty_pop_below_poverty_level    0
poverty_pop_w_public_assistance    0
poverty_pop_w_food_stamps          0
poverty_pop_w_ssi                  0
school                             0
housing                            0
bedroom                            0
value                              0
rent                               0
income                             0
is_issue_unresolved                0
days_from_feb_2016                 0
is_snow_in_type                    0
dtype: int64

## OK, let's put it into the model!

In [81]:
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import ShuffleSplit


Let's first split the data into train and test sets, 80/20.

In [79]:
X_train, X_test, y_train, y_test = train_test_split(
    df8.drop('is_issue_unresolved', axis=1), 
    df8.is_issue_unresolved, 
    test_size=0.2, 
    random_state=300
)

In [53]:
pipe = make_pipeline(
    RandomForestClassifier(
        bootstrap=True, 
        class_weight='balanced_subsample', 
        n_estimators=10
    )
)

In [82]:
cv = ShuffleSplit(X_train.shape[0], n_iter=1, test_size=0.2, random_state=300)

In [83]:
# First attempt at a grid; this was taking too long:
# params = {
#     'randomforestclassifier__max_depth': [10, 20],
#     'randomforestclassifier__min_samples_leaf': [101, 1001],
# }

params = {
    'randomforestclassifier__max_depth': [10],
    # odd for binary classification, to avoid ties; 1% of num_rows for max bound
    'randomforestclassifier__min_samples_leaf': [1001],
}  # Brad said hyperparams don't matter _that_ much for RF

model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=True)
model.fit(X_train, y_train);

Fitting 1 folds for each of 1 candidates, totalling 1 fits


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.3min finished


Let's look at the performance of our best model.

In [86]:
y_predict = model.predict(X_test)
y_predict

array([False, False, False, ...,  True, False,  True], dtype=bool)

In [88]:
confusion_matrix(y_test, y_predict)

array([[106477,  36795],
       [  2508,  11594]])
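
The raw counts are easier to read as rates. A minimal sketch that takes the four numbers from the confusion matrix output and computes accuracy, recall, and precision for the unresolved class, assuming sklearn's usual layout (rows are true labels, columns are predictions, `False` before `True`):

```python
# Counts copied from the confusion matrix output.
# sklearn's layout: rows are the true class, columns the predicted class,
# with False (resolved) first and True (unresolved) second.
tn, fp = 106477, 36795
fn, tp = 2508, 11594

total = tn + fp + fn + tp
accuracy = (tn + tp) / float(total)
recall = tp / float(tp + fn)      # share of unresolved issues the model catches
precision = tp / float(tp + fp)   # share of "unresolved" predictions that are right

print("accuracy:  %.3f" % accuracy)   # 0.750
print("recall:    %.3f" % recall)     # 0.822
print("precision: %.3f" % precision)  # 0.240
```

So the forest catches about 82% of the unresolved issues, but only about 24% of its "unresolved" predictions are correct. That trade-off is plausibly a consequence of `class_weight='balanced_subsample'`, which pushes the model toward the minority class.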

In [None]:
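# model is a GridSearchCV over a pipeline, so the fitted forest has to be pulled out of it
forest = model.best_estimator_.named_steps['randomforestclassifier']
feature_importances = forest.feature_importances_.argsort()
print("top five: {}".format(list(X_train.columns[feature_importances[-1:-6:-1]])))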

In [None]:
# Calculate the standard deviation for feature importances across all trees
import numpy as np

n = 10  # top 10 features

# The fitted forest lives inside the best pipeline found by the grid search
forest = model.best_estimator_.named_steps['randomforestclassifier']
features = X_train.columns
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1][:n]  # positions of the n most important features

# Print the feature ranking
print("Feature ranking:")

for rank, idx in enumerate(indices, start=1):
    print("%d. %s (%f)" % (rank, features[idx], importances[idx]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(n), importances[indices], yerr=std[indices], color="r", align="center")
plt.xticks(range(n), features[indices], rotation=90)
plt.xlim([-1, n])
plt.show()

In [None]:
pd.DataFrame(model.cv_results_)

The forest in the pipeline was fit with only 10 trees (`n_estimators=10`); was that enough?

In [None]:
model.cv_results_['params']

Let's make a variable importance graph.

A downside of this graph is that highly correlated features will split importance between them, so each one looks less important than the underlying signal really is.
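
Here is a minimal sketch of that effect on synthetic data (all names and values here are made up): one informative column, one noise column, and then the same data with the informative column duplicated. `max_features=None` is an assumption chosen so that every split sees all columns and the comparison is clean.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, hypothetical data: the label depends on `signal` only.
rng = np.random.RandomState(0)
signal = rng.normal(size=1000)
noise = rng.normal(size=1000)
y = signal > 0

def importances(X):
    # max_features=None lets every split consider all columns
    forest = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)
    return forest.fit(X, y).feature_importances_

single = importances(np.column_stack([signal, noise]))
duplicated = importances(np.column_stack([signal, signal, noise]))

print("signal, noise:              %s" % np.round(single, 2))
print("signal, signal copy, noise: %s" % np.round(duplicated, 2))
```

With the duplicate present, the importance that went to `signal` alone is shared between the two identical columns, even though the model is no better or worse.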

## Using Logistic Regression with L1 (Lasso) Regularization

which is my favorite way to do feature subset selection at the moment :)

There are probably issues when features are collinear. I'll need to read up on this.

In [None]:
# http://nbviewer.jupyter.org/github/JWarmenhoven/ISLR-python/blob/master/Notebooks/Chapter%206.ipynb#6.6.2-The-Lasso
from sklearn.linear_model import Lasso

In [None]:
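import numpy as np
from sklearn.preprocessing import scale

# Assumption: `alphas` is never defined in this notebook; this grid is assumed
# to match the one in the ISLR-python notebook linked above.
alphas = 10**np.linspace(10, -2, 100) * 0.5

X_scaled = scale(X_train)  # standardize once rather than on every iteration
lasso = Lasso(max_iter=10000)
coefs = []

for a in alphas*2:
    lasso.set_params(alpha=a)
    lasso.fit(X_scaled, y_train.astype(float))
    coefs.append(lasso.coef_)

ax = plt.gca()
ax.plot(alphas*2, coefs)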
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization');
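
Note that `Lasso` is L1-penalized linear regression. For an L1-penalized logistic regression, which is what this section's title describes, sklearn's `LogisticRegression` accepts `penalty='l1'` with the `liblinear` solver. A minimal sketch on synthetic data follows; the data, the value of `C`, and the variable names are all assumptions for illustration, not results from the 311 data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

# Synthetic, hypothetical data: only the first two of five columns carry signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 5))
y = (2 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=2000)) > 0

# In sklearn, C is the inverse of the regularization strength, so a small C
# means a strong penalty. The liblinear solver supports the L1 penalty.
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.01)
clf.fit(scale(X), y)

coefs = clf.coef_.ravel()
kept = np.flatnonzero(coefs)  # columns whose coefficient was not shrunk to zero
print(np.round(coefs, 3))
print("columns kept: %s" % list(kept))
```

The columns with a coefficient of exactly zero are the ones the penalty has dropped, which is what makes this usable for feature subset selection.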