## Objective

Let's take the feature subset selected by Lasso (L1) regularization and fit OLS on it, since OLS is more interpretable. Hopefully this won't put me in a high-variance situation.

In [101]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
import string

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")
rcParams['figure.figsize'] = 20, 5

from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type

In [3]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. My data has many low values and a few very high ones; anecdotally, I believe the low values are very likely genuine and the high values less so.

So I will remove values more than 3 SDs from the median, computed separately for each issue type.

Ideally I would also take the time dimension into account, and I would like to do so given more time.
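`remove_outliers_by_type` lives in `helper_functions` and is not shown in this notebook. The sketch below is only a guess at the rule described above: within each `TYPE`, keep rows whose target lies within 3 SDs of that type's median. The function name, the `type_col` and `n_sd` parameters, and the choice to keep single-row groups are all my assumptions.

```python
import pandas as pd


def remove_outliers_by_type_sketch(df, y_col, type_col="TYPE", n_sd=3.0):
    """Hypothetical stand-in for the helper: drop rows far from the per-type median."""
    grouped = df.groupby(type_col)[y_col]
    median = grouped.transform("median")  # per-type median, aligned to the rows of df
    sd = grouped.transform("std")         # per-type sample SD (NaN for one-row groups)
    within_band = (df[y_col] - median).abs() <= n_sd * sd
    # Assumption: a type with a single row has no defined SD, so keep that row.
    keep = within_band | sd.isna()
    # Return an explicit copy so later assignments do not hit a view of df.
    return df.loc[keep].copy()


demo = pd.DataFrame({
    "TYPE": ["A"] * 12 + ["B"] + ["C"] * 3,
    "COMPLETION_HOURS_LOG_10": [1.0] * 11 + [100.0] + [5.0] + [2.0] * 3,
})
cleaned = remove_outliers_by_type_sketch(demo, y_col="COMPLETION_HOURS_LOG_10")
print(cleaned.shape)
```

Returning a copy is one way to avoid the `SettingWithCopyWarning` family of messages when the result is modified later.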

In [4]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._where(-key, value, inplace=True)


(508653, 40)

I'm removing ~1.5% of my rows.
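As a quick check of that figure, using the two shapes printed above:

```python
rows_before = 516406  # df_orig.shape[0]
rows_after = 508653   # df_outliers_removed.shape[0]

rows_removed = rows_before - rows_after
pct_removed = 100.0 * rows_removed / rows_before
print("{} rows removed ({:.2f}%)".format(rows_removed, pct_removed))
```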

## Choosing columns

In [5]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [6]:
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered]

## Dummify

In [7]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip',
       u'school', u'housing'],
      dtype='object')

In [8]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Zoning is baseline 0 6
other is baseline 1 6
Twitter is baseline 2 6
West Roxbury is baseline 3 6
8_6th_grade is baseline 4 6
rent is baseline 5 6


In [9]:
df_dummified.shape

(508653, 253)
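`dummify_cols_and_baselines` is also a helper that is not shown here. The printed baselines (`Zoning`, `other`, `Twitter`, `West Roxbury`, `8_6th_grade`, `rent`) look consistent with dropping the last category in sorted order for each column, but that is an inference from the output, not something the notebook states. A minimal sketch under that assumption, with a hypothetical function name:

```python
import pandas as pd


def dummify_with_baselines_sketch(df, cols):
    """Hypothetical stand-in: one-hot encode cols, dropping one baseline level each."""
    out = df.copy()
    baselines = []
    for col in cols:
        dummies = pd.get_dummies(out[col], prefix=col)
        # Assumption: the baseline is the last dummy column in sorted order.
        baseline = sorted(dummies.columns)[-1]
        baselines.append(baseline)
        out = pd.concat(
            [out.drop(col, axis=1), dummies.drop(baseline, axis=1)], axis=1
        )
    return out, baselines


demo = pd.DataFrame({"x": [1, 2, 3], "Source": ["App", "Twitter", "Call"]})
demo_dummified, demo_baselines = dummify_with_baselines_sketch(demo, ["Source"])
print(list(demo_dummified.columns), demo_baselines)
```

Dropping one level per categorical column keeps the dummies from being perfectly collinear with the intercept, which matters for OLS.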

## Removing columns as per L1 (Lasso) results

In [22]:
col_blacklist = ['Property_Type_Address',
 'Property_Type_Intersection',
 'Source_Constituent Call',
 'SubmittedPhoto',
 'TYPE_ADA',
 'TYPE_Alert Boston',
 'TYPE_Animal Noise Disturbances',
 'TYPE_Automotive Noise Disturbance',
 'TYPE_BWSC General Request',
 'TYPE_BWSC Pothole',
 'TYPE_Big Buildings Online Request',
 'TYPE_Billing Complaint',
 'TYPE_Bridge Maintenance',
 'TYPE_CE Collection',
 'TYPE_Cemetery Maintenance Request',
 'TYPE_City/State Snow Issues',
 'TYPE_Contractor Complaints',
 'TYPE_Corporate or Community Group Service Day Clean Up',
 'TYPE_Downed Wire',
 'TYPE_Dumpster & Loading Noise Disturbances',
 'TYPE_Fire Department Request',
 'TYPE_Fire Hydrant',
 'TYPE_Fire in Food Establishment',
 'TYPE_Follow-Up',
 'TYPE_Food Alert - Confirmed',
 'TYPE_Food Alert - Unconfirmed',
 'TYPE_General Traffic Engineering Request',
 'TYPE_Ground Maintenance',
 'TYPE_HP Sign Application New',
 'TYPE_HP Sign Application Renewal',
 'TYPE_Heat/Fuel Assistance',
 'TYPE_Idea Collection',
 'TYPE_Knockdown Replacement',
 'TYPE_Loud Parties/Music/People',
 'TYPE_Mechanical',
 'TYPE_Misc. Snow Complaint',
 'TYPE_Mosquitoes (West Nile)',
 'TYPE_Municipal Parking Lot Complaints',
 'TYPE_New Tree Warrantee Inspection',
 'TYPE_News Boxes',
 'TYPE_No Utilities - Food Establishment - Electricity',
 'TYPE_No Utilities - Food Establishment - Flood',
 'TYPE_No Utilities - Food Establishment - Sewer',
 'TYPE_No Utilities - Food Establishment - Water',
 'TYPE_No Utilities Residential - Electricity',
 'TYPE_No Utilities Residential - Gas',
 'TYPE_No Utilities Residential - Water',
 'TYPE_OCR Metrolist',
 'TYPE_Occupying W/Out A Valid CO/CI',
 'TYPE_One Boston Day',
 'TYPE_PWD Graffiti',
 'TYPE_Parking Meter Repairs',
 'TYPE_Parks General Request',
 'TYPE_Pavement Marking Inspection',
 'TYPE_Phone Bank Service Inquiry',
 'TYPE_Planting',
 'TYPE_Poor Ventilation',
 'TYPE_Private Parking Lot Complaints',
 'TYPE_Public Events Noise Disturbances',
 'TYPE_Rat Bite',
 'TYPE_Rental Unit Delivery Conditions',
 'TYPE_Request for Litter Basket Installation',
 'TYPE_Roadway Flooding',
 'TYPE_Rooftop & Mechanical Disturbances',
 'TYPE_Schedule a Bulk Item Pickup SS',
 'TYPE_Senior Shoveling',
 'TYPE_Sewage/Septic Back-Up',
 'TYPE_Sidewalk Cover / Manhole',
 'TYPE_Sidewalk Repair (Make Safe)',
 'TYPE_Sign Shop WO',
 'TYPE_Snow Removal',
 'TYPE_Snow/Ice Control',
 'TYPE_Student Overcrowding',
 'TYPE_Transfer Not Completed',
 'TYPE_Undefined Noise Disturbance',
 'TYPE_Unit Pricing Wrong/Missing',
 'TYPE_Unsanitary Conditions - Employees',
 'TYPE_Unsanitary Conditions - Establishment',
 'TYPE_Unsanitary Conditions - Food',
 'TYPE_Utility Casting Repair',
 'TYPE_Valet Parking Problems',
 'TYPE_Walk-In Service Inquiry',
 'TYPE_Watermain Break',
 'TYPE_Work Hours-Loud Noise Complaints',
 'TYPE_Yardwaste Asian Longhorned Beetle Affected Area',
 'bedroom',
 'bedroom_std_dev',
 'earned_income_per_capita',
 'housing_own',
 'housing_std_dev',
 'income',
 'income_std_dev',
 'is_description',
 'neighborhood_from_zip_Allston / Brighton',
 'neighborhood_from_zip_Back Bay',
 'neighborhood_from_zip_Beacon Hill',
 'neighborhood_from_zip_Brookline',
 'neighborhood_from_zip_Charlestown',
 'neighborhood_from_zip_Chestnut Hill',
 'neighborhood_from_zip_Dorchester',
 'neighborhood_from_zip_Downtown / Financial District',
 'neighborhood_from_zip_Fenway / Kenmore / Audubon Circle / Longwood',
 'neighborhood_from_zip_Hyde Park',
 'neighborhood_from_zip_Jamaica Plain',
 'neighborhood_from_zip_Mattapan',
 'neighborhood_from_zip_Mission Hill',
 'neighborhood_from_zip_Roslindale',
 'neighborhood_from_zip_Roxbury',
 'neighborhood_from_zip_South Boston',
 'neighborhood_from_zip_South Boston / South Boston Waterfront',
 'neighborhood_from_zip_South End',
 'neighborhood_from_zip_West End',
 'poverty_pop_below_poverty_level',
 'poverty_pop_w_food_stamps',
 'poverty_pop_w_public_assistance',
 'poverty_pop_w_ssi',
 'queue_wk',
 'race_asian',
 'race_black',
 'race_hispanic',
 'race_other',
 'race_white',
 'rent',
 'rent_std_dev',
 'school_0_none',
 'school_11_9th_grade',
 'school_13_11th_grade',
 'school_14_12th_grade_no_diploma',
 'school_15_hs_diploma',
 'school_18_some_college_no_degree',
 'school_19_associates',
 'school_20_bachelors',
 'school_21_masters',
 'school_22_professional_school',
 'school_std_dev',
 'value',
 'value_std_dev']

In [114]:
df_dummified_and_filtered = df_dummified.drop(col_blacklist, axis=1)
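The notebook does not show how `col_blacklist` was produced. One plausible route, assuming it holds the features whose Lasso coefficients were shrunk to exactly zero, is sketched below on synthetic data. The feature names, the `alpha` value and the data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only x0 and x1 actually drive the target.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=["x0", "x1", "x2", "x3", "x4"])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
coefs = pd.Series(lasso.coef_, index=X.columns)

# Features with a coefficient of exactly zero become the blacklist.
col_blacklist_sketch = list(coefs[coefs == 0].index)
X_filtered = X.drop(col_blacklist_sketch, axis=1)
print(col_blacklist_sketch)
```

With real data the features would normally be standardized first, so that the penalty treats them on the same scale.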

## Running a model

In [31]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [52]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_dummified_and_filtered.drop('COMPLETION_HOURS_LOG_10', axis=1), 
    df_dummified_and_filtered.COMPLETION_HOURS_LOG_10, 
    test_size=0.2, 
    random_state=300
)

In [115]:
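# Strip punctuation and spaces from every column name except the target.
# Note: str.translate(None, chars) is the Python 2 signature.
df_dummified_and_filtered.columns = [
    col if col == 'COMPLETION_HOURS_LOG_10' else col.translate(None, string.punctuation).replace(' ', '')
    for col in df_dummified_and_filtered.columns]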

In [113]:
list(df_dummified_and_filtered.columns)

['COMPLETION_HOURS_LOG_10',
 'queuewkopen',
 'TYPEAbandoned Bicycle',
 'TYPEAbandoned Building',
 'TYPEAbandoned Vehicles',
 'TYPEAnimal Found',
 'TYPEAnimal Generic Request',
 'TYPEAnimal Lost',
 'TYPEBed Bugs',
 'TYPEBicycle Issues',
 'TYPEBreathe Easy',
 'TYPEBuilding Inspection Request',
 'TYPECall Log',
 'TYPECarbon Monoxide',
 'TYPECatchbasin',
 'TYPECheckin',
 'TYPEChronic DampnessMold',
 'TYPEConstruction Debris',
 'TYPEContractors Complaint',
 'TYPECross Metering  SubMetering',
 'TYPEEgress',
 'TYPEElectrical',
 'TYPEEmpty Litter Basket',
 'TYPEEquipment Repair',
 'TYPEExceeding Terms of Permit',
 'TYPEGeneral Comments For An Employee',
 'TYPEGeneral Comments For a Program or Policy',
 'TYPEGeneral Lighting Request',
 'TYPEGraffiti Removal',
 'TYPEHeat  Excessive  Insufficient',
 'TYPEHighway Maintenance',
 'TYPEHousing Discrimination Intake Form',
 'TYPEIllegal Auto Body Shop',
 'TYPEIllegal Dumping',
 'TYPEIllegal Occupancy',
 'TYPEIllegal Posting of Signs',
 'TYPEIllegal Ro

In [116]:
col_list = ' + '.join(df_dummified_and_filtered.drop('COMPLETION_HOURS_LOG_10', axis=1).columns[15:17])

est = smf.ols(
    'COMPLETION_HOURS_LOG_10 ~ {}'.format(col_list), 
    pd.concat([X_train, y_train], axis=1)).fit()
est.summary()

NameError: name 'TYPEChronicDampnessMold' is not defined
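A likely cause of this `NameError`: `X_train` was created in `In [52]`, before the rename in `In [115]`, so the training frame still carries the original column names while the formula uses the cleaned ones. When a formula term is not a column of the data, it is looked up as a Python variable instead, and that lookup fails. A minimal sketch of one fix is to clean the names on the frame that is actually passed to the model and build the formula from that same frame. The tiny frame and the `clean_name` helper are made up for illustration.

```python
import re

import pandas as pd

TARGET = "COMPLETION_HOURS_LOG_10"


def clean_name(col):
    # Hypothetical helper: keep only letters and digits so the name is one formula term.
    return re.sub(r"[^0-9a-zA-Z]", "", col)


train = pd.DataFrame(
    [[1.2, 3, 1], [0.4, 5, 0]],
    columns=[TARGET, "queue_wk_open", "TYPE_Chronic Dampness/Mold"],
)

# Rename on the training frame itself, leaving the target name untouched.
train.columns = [c if c == TARGET else clean_name(c) for c in train.columns]

predictors = [c for c in train.columns if c != TARGET]
formula = "{} ~ {}".format(TARGET, " + ".join(predictors))
print(formula)
# The formula and frame can then go to smf.ols(formula, data=train).fit() as in the cell above.
```

Doing the rename before `train_test_split` would have the same effect, since the split frames would then inherit the cleaned names.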

## Running model

In [28]:
# sklearn.cross_validation is deprecated; the same names live in sklearn.model_selection.
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score


In [26]:
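# Baseline pipeline: scale, then OLS. Note that the next assignment overwrites it.
pipe = make_pipeline(StandardScaler(), LinearRegression())
# Variant with degree-2 polynomial features (the PolynomialFeatures default) before scaling.
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), LinearRegression())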

In [27]:
# Empty grid: GridSearchCV just runs 3-fold cross-validation on the single pipeline.
params = {}
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=3)
model.fit(X_train, y_train);

In [30]:
pd.DataFrame(model.cv_results_).T.head()

                         0
mean_fit_time      2.34785
mean_score_time   0.228844
mean_test_score   0.555478
mean_train_score  0.555907
params                  {}


We will want to run a model with just the above features to find out which ones are statistically significant, but we already get a sense here that these factors are likely to be significant:

- when source is from the mobile app or desktop website
- neighborhoods of East Boston and the North End
- the number of issues in the workers' queue at the time
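One way to get that kind of sense from a scaled linear pipeline is to rank the fitted coefficients by absolute value. The sketch below uses synthetic data and made-up feature names (`feature_b`, `feature_c`); it shows the idea only, not the notebook's actual coefficients, and a large coefficient is not the same thing as statistical significance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: queue_wk_open matters most, feature_c not at all.
rng = np.random.RandomState(0)
X = pd.DataFrame(
    rng.normal(size=(500, 3)),
    columns=["queue_wk_open", "feature_b", "feature_c"],
)
y = 2.0 * X["queue_wk_open"] - 0.5 * X["feature_b"] + rng.normal(scale=0.1, size=500)

pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefs = pd.Series(pipe.named_steps["linearregression"].coef_, index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

For a fitted `GridSearchCV`, the same pipeline is available as `model.best_estimator_`. Because the features are standardized first, the coefficient magnitudes are comparable across features.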

## Conclusion

We didn't get a better $R^2$, which makes sense: we weren't overfitting anyway when we tried this regularization parameter.

We did subset our features and got some indication of which ones are more likely to be significantly correlated with completion time. We also avoided crazy predictions that would have hurt our $R^2$, at least for this particular random seed.
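The overfitting claim can be checked against the cross-validation scores reported earlier. For this pipeline the default score is $R^2$, and the gap between the training and validation means is tiny:

```python
mean_train_score = 0.555907  # from model.cv_results_
mean_test_score = 0.555478   # from model.cv_results_

gap = mean_train_score - mean_test_score
print("train/validation R^2 gap: {:.6f}".format(gap))
```

A gap this small suggests the model is limited by bias rather than variance, so regularization or feature subsetting alone would not be expected to raise $R^2$.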