## Objective

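Build a null model: predict each request's log10 completion time as the mean for its `TYPE`, to serve as a baseline for later models.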

In [1]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")
rcParams['figure.figsize'] = 20, 5

import os, sys
sys.path.append(os.path.join(os.path.dirname('.'), "../preprocessing"))
from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type, adjusted_r2

In [2]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. This dataset has many low values and a few very high ones; my informal impression is that the low values are mostly genuine, while the very high ones are less trustworthy.

Because the mean is pulled toward those high values, I will instead remove values more than 3 SDs from the median, computed separately for each request type.

Ideally, I would also take the time dimension into account, and would like to do so given more time.
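
The body of `remove_outliers_by_type` is not shown in this notebook, so the following is only a minimal sketch of the median-based rule described above. The function name `drop_outliers_by_group`, its parameters, and the toy data are hypothetical.

```python
import pandas as pd

def drop_outliers_by_group(df, y_col, group_col, n_stds=3):
    """Hypothetical helper: drop rows whose y_col lies more than
    n_stds group standard deviations away from the group median."""
    grouped = df.groupby(group_col)[y_col]
    median = grouped.transform("median")  # per-row median of the row's group
    std = grouped.transform("std")        # per-row std of the row's group
    is_outlier = (df[y_col] - median).abs() > n_stds * std
    # Groups with a single row have a NaN std, so they are never flagged.
    return df[~is_outlier].copy()

# Toy data: one extreme value in type "a", none in type "b".
toy = pd.DataFrame({
    "TYPE": ["a"] * 20 + ["b"] * 20,
    "COMPLETION_HOURS_LOG_10": [1.0] * 19 + [50.0] + [2.0] * 20,
})
cleaned = drop_outliers_by_group(toy, "COMPLETION_HOURS_LOG_10", "TYPE")
```

Returning a filtered copy, rather than assigning into a group slice, is one way to sidestep pandas' chained-assignment warnings.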

In [3]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  group[pd.np.abs(group - group.median()) > stds * group.std()] = pd.np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.where(-key, value, inplace=True)


(508653, 40)

This removes 7,753 of 516,406 rows, about 1.5%.

## Choosing columns

In [4]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE']

In [5]:
df = df_outliers_removed[cols_orig_dataset]

## Running model

In [6]:
# sklearn.cross_validation was removed in scikit-learn 0.20;
# the same utilities now live in sklearn.model_selection.
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score




In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('COMPLETION_HOURS_LOG_10', axis=1), 
    df.COMPLETION_HOURS_LOG_10, 
    test_size=0.2, 
    random_state=300
)

In [16]:
from utilities import scale

In [8]:
d = {}

categs = df.TYPE.drop_duplicates().tolist()

for categ in categs:
    mean = df[df.TYPE == categ].COMPLETION_HOURS_LOG_10.mean()
    d[categ] = mean

In [10]:
y_pred = X_test.TYPE.map(d)
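
The per-type means in `d` are computed over all of `df`, which includes the test rows. A stricter variant would estimate the means from the training split only. The sketch below shows that idea on toy data; the `*_toy` names and the fallback to the overall training mean for unseen types are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the train/test split (hypothetical data).
X_train_toy = pd.DataFrame({"TYPE": ["a", "a", "b", "b"]})
y_train_toy = pd.Series([1.0, 3.0, 10.0, 20.0])
X_test_toy = pd.DataFrame({"TYPE": ["a", "b", "c"]})

# Mean target per type, estimated from training rows only.
type_means = y_train_toy.groupby(X_train_toy["TYPE"]).mean()

# Types never seen in training (here "c") map to NaN,
# so fall back to the overall training mean for those rows.
overall_mean = y_train_toy.mean()
y_pred_toy = X_test_toy["TYPE"].map(type_means).fillna(overall_mean)
```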

In [13]:
r2_score(y_test, y_pred)

0.54523173983379714

In [15]:
mean_squared_error(y_test, y_pred)**0.5

0.75511885449691629
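
Because the target is log10 of hours, an RMSE of about 0.755 corresponds to predictions that are typically off by a factor of roughly 10^0.755 ≈ 5.7 in hours. To make the two metrics concrete, here is a minimal sketch of how they are defined; the function names and toy values are illustrative, and `r2_score` and `mean_squared_error` from scikit-learn remain the calls actually used in this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.0, 2.0, 3.0, 5.0]
toy_rmse = rmse(toy_true, toy_pred)
toy_r2 = r2(toy_true, toy_pred)
```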

In [24]:
X_test.shape

(101731, 252)

In [23]:
adjusted_r2(y_test, y_pred, X_test.shape[1])

0.54046083765815289

How many coefficients go to 0?

In [17]:
pd.Series(model.best_estimator_.steps[-1][-1].coef_).describe()

count    252.000000
mean      -0.004569
std        0.049632
min       -0.363193
25%        0.000000
50%       -0.000000
75%        0.000000
max        0.145756
dtype: float64

In [29]:
len(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ != 0])

115

In [30]:
len(X_train.columns)

252

137 of the 252 coefficients (about 54%) go to zero at `α=0.01`.
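
As a toy illustration of how an L1 penalty drives coefficients to exactly zero, the sketch below fits scikit-learn's `Lasso` to synthetic data in which only the first two of ten features matter. The data, the seed, and `alpha=0.1` are arbitrary assumptions and are not the settings used for `model` in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_syn = rng.normal(size=(200, 10))
# Only features 0 and 1 influence the target; the rest are pure noise.
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X_syn, y_syn)

# Count coefficients that are exactly zero versus non-zero.
n_nonzero = int(np.count_nonzero(lasso.coef_))
n_zero = lasso.coef_.size - n_nonzero
print(n_nonzero, "non-zero,", n_zero, "zero")
```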

Which don't go to zero?

In [27]:
sorted(list(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ != 0]))

['Source_Citizens Connect App',
 'Source_Self Service',
 'TYPE_Abandoned Bicycle',
 'TYPE_Abandoned Building',
 'TYPE_Abandoned Vehicles',
 'TYPE_Animal Found',
 'TYPE_Animal Generic Request',
 'TYPE_Animal Lost',
 'TYPE_Bed Bugs',
 'TYPE_Bicycle Issues',
 'TYPE_Breathe Easy',
 'TYPE_Building Inspection Request',
 'TYPE_Call Log',
 'TYPE_Carbon Monoxide',
 'TYPE_Catchbasin',
 'TYPE_Checkin',
 'TYPE_Chronic Dampness/Mold',
 'TYPE_Construction Debris',
 'TYPE_Contractors Complaint',
 'TYPE_Cross Metering - Sub-Metering',
 'TYPE_Egress',
 'TYPE_Electrical',
 'TYPE_Empty Litter Basket',
 'TYPE_Equipment Repair',
 'TYPE_Exceeding Terms of Permit',
 'TYPE_General Comments For An Employee',
 'TYPE_General Comments For a Program or Policy',
 'TYPE_General Lighting Request',
 'TYPE_Graffiti Removal',
 'TYPE_Heat - Excessive  Insufficient',
 'TYPE_Highway Maintenance',
 'TYPE_Housing Discrimination Intake Form',
 'TYPE_Illegal Auto Body Shop',
 'TYPE_Illegal Dumping',
 'TYPE_Illegal Occupancy',


We will want to fit a model with just the features above to find out which ones are statistically significant, but the non-zero coefficients already suggest that these factors are likely to be significant:

- the request source is the mobile app or the desktop website
- the neighborhood is East Boston or the North End
- the number of issues in the workers' queue at the time

## Conclusion

We didn't get a better $R^2$, which makes sense: the model wasn't overfitting in the first place when we tried this regularization parameter.

We did narrow down our features and got some indication of which ones are more likely to be significantly correlated with completion time. We also avoided extreme predictions that would have hurt our $R^2$, at least for this particular random seed.

## Appendix

These columns went to zero at `α=0.01`.

In [31]:
sorted(list(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ == 0]))

['Property_Type_Address',
 'Property_Type_Intersection',
 'Source_Constituent Call',
 'SubmittedPhoto',
 'TYPE_ADA',
 'TYPE_Alert Boston',
 'TYPE_Animal Noise Disturbances',
 'TYPE_Automotive Noise Disturbance',
 'TYPE_BWSC General Request',
 'TYPE_BWSC Pothole',
 'TYPE_Big Buildings Online Request',
 'TYPE_Billing Complaint',
 'TYPE_Bridge Maintenance',
 'TYPE_CE Collection',
 'TYPE_Cemetery Maintenance Request',
 'TYPE_City/State Snow Issues',
 'TYPE_Contractor Complaints',
 'TYPE_Corporate or Community Group Service Day Clean Up',
 'TYPE_Downed Wire',
 'TYPE_Dumpster & Loading Noise Disturbances',
 'TYPE_Fire Department Request',
 'TYPE_Fire Hydrant',
 'TYPE_Fire in Food Establishment',
 'TYPE_Follow-Up',
 'TYPE_Food Alert - Confirmed',
 'TYPE_Food Alert - Unconfirmed',
 'TYPE_General Traffic Engineering Request',
 'TYPE_Ground Maintenance',
 'TYPE_HP Sign Application New',
 'TYPE_HP Sign Application Renewal',
 'TYPE_Heat/Fuel Assistance',
 'TYPE_Idea Collection',
 'TYPE_Knockdown R

What are the coefficient values?

In [70]:
coef_values = pd.DataFrame({
    'name': X_train.columns,
    'coef': model.best_estimator_.steps[-1][-1].coef_
})

In [75]:
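# Combine both conditions into one mask so the boolean index matches the frame being filtered.
nonzero_non_type = (coef_values.coef != 0) & ~coef_values.name.str.contains('TYPE')
coef_values[nonzero_non_type].sort_values('coef')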


| | coef | name |
|---|---|---|
| 217 | -0.021895 | Source_Citizens Connect App |
| 234 | -5.4e-05 | neighborhood_from_zip_North End |
| 228 | 0.006279 | neighborhood_from_zip_East Boston |
| 219 | 0.012568 | Source_Self Service |
| 22 | 0.130668 | queue_wk_open |
