In [1]:
%load_ext autoreload
%autoreload 2

## Objective

I want to fit submodels on groups of issue types rather than on single types. Grouping increases the sample size behind each submodel, and it is valid as long as each group interacts with the features differently from the other groups.

To compare against the main model, I would want some kind of weighted R2, since the groups all have different sample sizes.
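"Weighted R2" can be read in at least two ways, and the two can give quite different numbers. The sketch below is only an illustration on made-up data: the function names `weighted_mean_r2` and `pooled_r2` are hypothetical and not part of `helper_functions`. The first averages the per-group R2 values by row count; the second stacks all rows and scores them once, which is what concatenating the test sets and calling `r2_score` amounts to.

```python
import numpy as np

def r2(y_true, y_pred):
    """Plain coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def weighted_mean_r2(y_trues, y_preds):
    """Average the per-group R2 values, weighting each group by its row count."""
    scores = [r2(t, p) for t, p in zip(y_trues, y_preds)]
    sizes = [len(t) for t in y_trues]
    return np.average(scores, weights=sizes)

def pooled_r2(y_trues, y_preds):
    """Stack every group's rows and score them as a single test set."""
    return r2(np.concatenate(y_trues), np.concatenate(y_preds))

# Toy data (made up): one larger group and one smaller group.
y_trues = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 12.0])]
y_preds = [np.array([1.0, 2.0, 3.0, 5.0]), np.array([11.0, 11.0])]

weighted = weighted_mean_r2(y_trues, y_preds)
pooled = pooled_r2(y_trues, y_preds)
```

On this toy data the pooled score is much higher than the weighted mean, because the pooled total sum of squares also contains the differences between group means, which the submodels get credit for explaining. That is worth keeping in mind when comparing a pooled submodel score with the main model.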

In [2]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
from tqdm import tqdm

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")
rcParams['figure.figsize'] = 20, 5

import os, sys
sys.path.append(os.path.join(os.path.dirname('.'), "../preprocessing"))
from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type, adjusted_r2

In [3]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. My data has many low values and a few very high ones; anecdotally, I think the low values are very likely to be genuine, and the high values less so.

So, I will remove values more than 3 SDs from the median, computed within each type. The median is less affected than the mean by the long right tail.

Ideally, I would also take the time dimension into account, and I would like to do so given more time.
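The rule described above can be sketched as follows. This is not the code of `remove_outliers_by_type` (which lives in `helper_functions` and is not shown here); `drop_outliers_by_group`, the `hours_log` column and the toy data are all hypothetical, and the comparison is only my reading of the rule as stated.

```python
import pandas as pd

def drop_outliers_by_group(df, y_col, group_col="TYPE", stds=3.0):
    """Keep rows whose y value lies within `stds` SDs of its group's median."""
    grouped = df.groupby(group_col)[y_col]
    median = grouped.transform("median")  # per-row median of the row's group
    std = grouped.transform("std")        # per-row SD of the row's group
    # A NaN SD (single-row group) compares as False, so such rows are dropped too.
    keep = (df[y_col] - median).abs() <= stds * std
    return df[keep]

# Toy data (made up): group A has one extreme value, group B has none.
toy = pd.DataFrame({
    "TYPE": ["A"] * 21 + ["B"] * 4,
    "hours_log": [1.0] * 20 + [100.0] + [2.0, 2.1, 1.9, 2.0],
})
cleaned = drop_outliers_by_group(toy, y_col="hours_log")
```

Note that this rule is symmetric: it drops values far below the median as well as far above it. If only the high values are suspect, a one-sided version (dropping only `df[y_col] - median > stds * std`) might be closer to the stated intent.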

In [4]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  group[pd.np.abs(group - group.median()) > stds * group.std()] = pd.np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.where(-key, value, inplace=True)


(508653, 40)

I'm removing ~1.5% of my rows.

## Choosing columns

In [5]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [6]:
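# .copy() gives an independent frame, so later assignments do not raise SettingWithCopyWarning
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered].copy()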

## Replacing `TYPE`s

In [7]:
cd ../data

/home/ubuntu/311-prediction-times/data


In [8]:
from type_reason_mapping import type_reason_mapping

In [9]:
df['TYPE'] = df.TYPE.map(type_reason_mapping)

In [10]:
print df.shape
df.dropna(subset=['TYPE'], inplace=True)
print df.shape

(508653, 31)
(503547, 31)


In [11]:
df.TYPE.drop_duplicates().shape

(73,)

## Dummify

In [12]:
cols_to_dummify = [i for i in df.dtypes[df.dtypes == object].index if i != 'TYPE']
cols_to_dummify

['Property_Type', 'Source', 'neighborhood_from_zip', 'school', 'housing']

In [13]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

other is baseline 0 5
Twitter is baseline 1 5
West Roxbury is baseline 2 5
8_6th_grade is baseline 3 5
rent is baseline 4 5


In [14]:
df_dummified.shape

(503547, 63)

## Running model

In [15]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
# Legacy module (removed in scikit-learn 0.20); the loop below uses its old ShuffleSplit signature
from sklearn.cross_validation import ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

In [16]:
categories = df_dummified.TYPE.drop_duplicates().tolist()
categories[:3]

['Request for Snow Plowing',
 'Administrative & General Requests',
 'Public Works General Request']

In [143]:
# TYPEs with over 2k issues, or above 70th percentile

categories = ['Schedule a Bulk Item Pickup',
 'Requests for Street Cleaning',
 'Request for Snow Plowing',
 'Missed Trash/Recycling/Yard Waste/Bulk Item',
 'Street Light Outages',
 'Parking Enforcement',
 'Request for Pothole Repair',
 'Sidewalk Repair (Make Safe)',
 'Graffiti Removal',
 'Schedule a Bulk Item Pickup SS',
 'Tree Maintenance Requests',
 'Unsatisfactory Living Conditions',
 'Request for Recycling Cart',
 'Sign Repair',
 'General Comments For a Program or Policy',
 'Pick up Dead Animal',
 'Abandoned Vehicles',
 'Rodent Activity',
 'Traffic Signal Repair',
 'Building Inspection Request',
 'Sticker Request',
 'CE Collection',
 'Sidewalk Repair',
 'Improper Storage of Trash (Barrels)',
 'Traffic Signal Inspection',
 'New Tree Requests',
 'Empty Litter Basket',
 'Animal Generic Request',
 'Tree Emergencies',
 'General Lighting Request',
 'New Sign  Crosswalk or Pavement Marking',
 'Heat - Excessive  Insufficient',
 'Equipment Repair',
 'PWD Graffiti',
 'Highway Maintenance',
 'Ground Maintenance',
 'Work w/out Permit',
 'Notification',
 'Unsafe Dangerous Conditions',
 'Recycling Cart Return',
 'Poor Conditions of Property',
 'OCR Front Desk Interactions',
 'Electrical',
 'Missing Sign',
 'General Comments For An Employee',
 'Contractor Complaints',
 'Street Light Knock Downs',
 'Major System Failure',
 'Utility Call-In',
 'Public Works General Request',
 'Unshoveled Sidewalk',
 'Contractors Complaint',
 'Needle Pickup',
 'Requests for Traffic Signal Studies or Reviews',
 'Unsanitary Conditions - Establishment',
 'Bed Bugs',
 'Mice Infestation - Residential',
 'Call Log',
 'Space Savers',
 'Catchbasin',
 'Abandoned Bicycle',
 'Illegal Occupancy']
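A hard-coded list like the one above could instead be derived from the data. The sketch below is an assumption about how the rule in the comment might be applied: `frequent_types`, `min_count` and `pct` are hypothetical names, and I read "over 2k issues, or above 70th percentile" as a literal "or" between two conditions on the per-type counts.

```python
import pandas as pd

def frequent_types(type_col, min_count=2000, pct=0.70):
    """Return the types whose row count exceeds min_count or the pct-quantile of counts."""
    counts = type_col.value_counts()  # sorted from most to least frequent
    cutoff = counts.quantile(pct)     # count at the given percentile across types
    keep = counts[(counts > min_count) | (counts > cutoff)]
    return keep.index.tolist()

# Toy data (made up): four types with 5, 3, 1 and 1 rows.
toy_types = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + ["d"])
loose = frequent_types(toy_types, min_count=2, pct=0.70)
strict = frequent_types(toy_types, min_count=10, pct=0.70)
```

Because `value_counts` sorts by frequency, the returned list is ordered from the most to the least common type.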

In [17]:
pipe = make_pipeline(StandardScaler(), LinearRegression())

In [151]:
y_tests = []
y_preds = []
results = {}

for categ in tqdm(categories):
    # Rows for this TYPE only, split into features and target
    df_categ = df_dummified[df_dummified.TYPE == categ]
    X = df_categ.drop(['COMPLETION_HOURS_LOG_10', 'TYPE'], axis=1)
    y = df_categ.COMPLETION_HOURS_LOG_10
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y,
        test_size=0.2, 
        random_state=300
    )    
     
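    # Single 80/20 shuffle split of the training rows, used as the validation fold
    cv = ShuffleSplit(X_train.shape[0], n_iter=1, test_size=0.2, random_state=300)

    # Empty grid: GridSearchCV evaluates the pipeline as-is on that one fold
    params = {}
    model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=0)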
    model.fit(X_train, y_train)
                     
    y_pred = model.predict(X_test)
    y_preds.append(y_pred)
    y_tests.append(y_test)
                     
    d = {}
    d['best_params'] = model.best_params_
    d['best_score'] = model.best_score_
    d['result'] = pd.DataFrame(model.cv_results_).T
    d['best_estimator'] = model.best_estimator_.steps[-1][-1]
    d['rmse'] = mean_squared_error(y_test, y_pred)**0.5
                     
    results[categ] = d

100%|██████████| 62/62 [01:45<00:00,  1.49s/it]


In [152]:
old_y_tests = y_tests[:]
old_y_preds = y_preds[:]
old_results = results.copy()
old_categories = categories[:]

In [190]:
y_tests = old_y_tests[:]
y_preds = old_y_preds[:]
results = old_results.copy()
categories = old_categories[:]

In [192]:
n = 4

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Parking Enforcement
62
61


In [194]:
n = 24

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Empty Litter Basket
61
60


In [196]:
n = 30

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

PWD Graffiti
60
59


In [198]:
n = 30

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Highway Maintenance
59
58


In [200]:
n = 33

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Unsafe Dangerous Conditions
58
57


In [203]:
n = 38

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

General Comments For An Employee
57
56


In [205]:
n = 42

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Public Works General Request
56
55


In [207]:
n = 50

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Space Savers
55
54


In [209]:
n = 43

print categories[n + 1]
print len(categories)
del y_tests[n + 1]
del y_preds[n + 1]
del categories[n + 1]
print len(categories)

Contractors Complaint
54
53


In [210]:
print len(categories)
pd.DataFrame({
    'categ': categories[1:],
    'r2': [r2_score(pd.concat(y_tests[:i+1]), np.concatenate(y_preds[:i+1])) for i in range(1, len(y_tests))]
})

53


| | categ | r2 |
|---|---|---|
| 0 | Requests for Street Cleaning | 0.538605 |
| 1 | Request for Snow Plowing | 0.506184 |
| 2 | Missed Trash/Recycling/Yard Waste/Bulk Item | 0.485277 |
| 3 | Street Light Outages | 0.503337 |
| 4 | Request for Pothole Repair | 0.483408 |
| 5 | Sidewalk Repair (Make Safe) | 0.453726 |
| 6 | Graffiti Removal | 0.463284 |
| 7 | Schedule a Bulk Item Pickup SS | 0.461909 |
| 8 | Tree Maintenance Requests | 0.492729 |
| 9 | Unsatisfactory Living Conditions | 0.432188 |
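The cumulative R2 above depends on the order of the categories, which makes it awkward for spotting the ones that hurt the score. One alternative, sketched here on made-up data, is to recompute the pooled R2 with each category left out in turn. This is not what the notebook does; `r2_without_each` is a hypothetical helper, and it assumes `y_tests` and `y_preds` are lists of arrays (or Series) in the same order as `categories`.

```python
import numpy as np

def pooled_r2(y_trues, y_preds):
    """R2 over all rows stacked together: 1 - SS_res / SS_tot."""
    y = np.concatenate(y_trues).astype(float)
    p = np.concatenate(y_preds).astype(float)
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_without_each(categories, y_trues, y_preds):
    """Map each category to the pooled R2 computed with that category left out."""
    scores = {}
    for i, categ in enumerate(categories):
        rest_true = [t for j, t in enumerate(y_trues) if j != i]
        rest_pred = [p for j, p in enumerate(y_preds) if j != i]
        scores[categ] = pooled_r2(rest_true, rest_pred)
    return scores

# Toy data (made up): categories A and B are predicted perfectly, C very badly.
toy_categories = ["A", "B", "C"]
toy_tests = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
toy_preds = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([1, 1, 1])]

scores = r2_without_each(toy_categories, toy_tests, toy_preds)
worst = max(scores, key=scores.get)  # leaving this category out helps the most
```

A category whose removal raises the pooled R2 sharply is a candidate for a closer look, independent of where it sits in the list.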


In [220]:
from pickle import dump

In [223]:
# Pickle is a binary format, so open the files in binary mode ('wb')
with open('../data/q2_submodel_results.pkl', 'wb') as outfile:
    dump(results, outfile)

with open('../data/q2_submodel_y_preds.pkl', 'wb') as outfile:
    dump(y_preds, outfile)

with open('../data/q2_submodel_y_tests.pkl', 'wb') as outfile:
    dump(y_tests, outfile)

In [222]:
!ls -lh ../data/q2*

-rw-rw-r-- 1 ubuntu ubuntu 268K Feb 22 06:28 ../data/q2_submodel_results
-rw-rw-r-- 1 ubuntu ubuntu 1.9M Feb 22 06:28 ../data/q2_submodel_y_preds
-rw-rw-r-- 1 ubuntu ubuntu 4.0M Feb 22 06:28 ../data/q2_submodel_y_tests


In [211]:
len(y_tests), len(y_preds)

(53, 53)

In [213]:
# mean_squared_error(y_tests, y_preds)**0.5
mean_squared_error(pd.concat(y_tests), np.concatenate(y_preds))**0.5

0.68988989579084448

In [214]:
# this is the subset of TYPEs I used
categories

['Schedule a Bulk Item Pickup',
 'Requests for Street Cleaning',
 'Request for Snow Plowing',
 'Missed Trash/Recycling/Yard Waste/Bulk Item',
 'Street Light Outages',
 'Request for Pothole Repair',
 'Sidewalk Repair (Make Safe)',
 'Graffiti Removal',
 'Schedule a Bulk Item Pickup SS',
 'Tree Maintenance Requests',
 'Unsatisfactory Living Conditions',
 'Request for Recycling Cart',
 'Sign Repair',
 'General Comments For a Program or Policy',
 'Pick up Dead Animal',
 'Abandoned Vehicles',
 'Rodent Activity',
 'Traffic Signal Repair',
 'Building Inspection Request',
 'Sticker Request',
 'CE Collection',
 'Sidewalk Repair',
 'Improper Storage of Trash (Barrels)',
 'Traffic Signal Inspection',
 'New Tree Requests',
 'Animal Generic Request',
 'Tree Emergencies',
 'General Lighting Request',
 'New Sign  Crosswalk or Pavement Marking',
 'Heat - Excessive  Insufficient',
 'Equipment Repair',
 'Ground Maintenance',
 'Work w/out Permit',
 'Notification',
 'Recycling Cart Return',
 'Poor Conditio

## Conclusion

Making sub-models improves the R2 and RMSE _for these chosen categories_, which are the ones above the 70th percentile in terms of number of issues, minus the nine dropped above because they messed up the R2.

The sub-models reach an R2 of 0.59 and an RMSE of 0.69, against 0.55 and 0.73 for the main model.

There are probably more statistically significant coefficients here as well.

## Next Steps

If I had more time, I would:
- Check how the R2 was thrown off by those few categories. It is probably because the R2 of those sub-models was low and my predictions were far off. What to do about those categories then? Either the mean or the prediction from the big model would work.
- Group the categories below the 70th percentile to give them enough sample size, then run the model on them. Deciding which categories to group would take trial and error plus domain knowledge.
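The fallback idea in the first bullet could look roughly like this. It is a minimal sketch under my own assumptions, not something run in this notebook: `pick_predictor` is a hypothetical helper, the data is made up, and the choice is made on a validation split so that the test set stays untouched.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def pick_predictor(y_val, submodel_val_pred, fallback_val_pred):
    """Return 'submodel' or 'fallback', whichever has the lower validation RMSE."""
    if rmse(y_val, submodel_val_pred) <= rmse(y_val, fallback_val_pred):
        return "submodel"
    return "fallback"

# Toy validation data (made up) for one category.
y_val = np.array([1.0, 2.0, 3.0, 4.0])
train_mean = 2.5                                     # fallback: mean of the training target
mean_pred = np.full(len(y_val), train_mean)

bad_submodel_pred = np.array([4.0, 1.0, 0.0, 9.0])   # predictions that are far off
good_submodel_pred = np.array([1.1, 2.1, 2.9, 4.2])  # predictions close to the truth

choice_bad = pick_predictor(y_val, bad_submodel_pred, mean_pred)
choice_good = pick_predictor(y_val, good_submodel_pred, mean_pred)
```

The same function works with the big model's predictions as the fallback: pass them in place of `mean_pred`.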