In [1]:
%load_ext autoreload
%autoreload 2

## Objective

I want to group the submodels, both to increase the sample size of each one and because grouping is valid when different groups interact with the features differently from one another.

To compare against the main model, I would want some kind of weighted $R^2$, since the groups all have different sample sizes.
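One way to make that comparison, sketched here under assumptions: score each submodel on its own group, then average the per-group $R^2$ values weighted by group size. The names `r2` and `weighted_r2` and the toy numbers are hypothetical and not part of this notebook's helpers.

```python
import numpy as np

def r2(y_true, y_pred):
    """Plain coefficient of determination for one group."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def weighted_r2(groups):
    """Sample-size-weighted mean of per-group R^2.

    `groups` is a list of (y_true, y_pred) pairs, one per submodel.
    """
    sizes = np.array([len(y_true) for y_true, _ in groups], dtype=float)
    scores = np.array([r2(y_true, y_pred) for y_true, y_pred in groups])
    return np.sum(sizes * scores) / np.sum(sizes)

# Toy example (made-up numbers): a small group fit perfectly,
# and a larger group where the model only predicts the group mean.
small = ([1.0, 2.0], [1.0, 2.0])                      # R^2 = 1
large = ([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])  # R^2 = 0
score = weighted_r2([small, large])                   # (2*1 + 4*0) / 6
```

An alternative would be to concatenate every group's predictions and compute a single pooled $R^2$, which is directly comparable to the main model's score on the same rows.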

In [2]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")
rcParams['figure.figsize'] = 20, 5

import os, sys
sys.path.append(os.path.join(os.path.dirname('.'), "../preprocessing"))
from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type, adjusted_r2

In [3]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. My data has many low values and a few very high ones. Anecdotally, I think the low values are very likely to be genuine and the high values much less so.

So, I will instead remove values more than 3 SDs from the median, by type, since the median is pulled less by the extreme high values than the mean is.

Ideally, I would also take the time dimension into account, and I would like to do so given more time.

In [4]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  group[pd.np.abs(group - group.median()) > stds * group.std()] = pd.np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.where(-key, value, inplace=True)


(508653, 40)

I'm removing ~1.5% of my rows.

## Choosing columns

In [5]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [6]:
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered]

## Replacing `TYPE`s

In [7]:
cd ../data

/home/ubuntu/311-prediction-times/data


In [11]:
from type_reason_mapping import type_reason_mapping

In [24]:
df['TYPE'] = df.TYPE.map(type_reason_mapping)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [27]:
print(df.shape)
df.dropna(subset=['TYPE'], inplace=True)
print(df.shape)

(508653, 31)
(503547, 31)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [29]:
df.TYPE.drop_duplicates().shape

(73,)

## Dummify

In [30]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip',
       u'school', u'housing'],
      dtype='object')

In [31]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Work w/out Permit is baseline 0 6
other is baseline 1 6
Twitter is baseline 2 6
West Roxbury is baseline 3 6
8_6th_grade is baseline 4 6
rent is baseline 5 6


In [32]:
df_dummified.shape

(503547, 134)

## Running model

In [33]:
# sklearn.cross_validation is deprecated; the same tools live in sklearn.model_selection.
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

In [34]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1),
    df_dummified.COMPLETION_HOURS_LOG_10,
    test_size=0.2,
    random_state=300
)

In [46]:
# The LassoCV pipeline was used for the earlier run (cells In [38] to In [44]);
# it is overwritten here by the plain linear regression pipeline.
pipe = make_pipeline(StandardScaler(), LassoCV())
pipe = make_pipeline(StandardScaler(), LinearRegression())

In [53]:
X_train.shape

(402837, 133)

In [36]:
# A single shuffled 80/20 split of the training set, used as the validation fold.
cv = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)

In [47]:
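# Alpha grid used for the earlier LassoCV run (results in cells In [38] to In [44]).
params = {'lassocv__alphas': make_alphas(-2, -4)}
# params = {'lassocv__alphas': [[0.01]]}
# Empty grid for the LinearRegression pipeline: one fit, nothing to tune.
params = {}
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=5)
model.fit(X_train, y_train);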

Fitting 1 folds for each of 1 candidates, totalling 1 fits
[CV]  ................................................................
[CV] ............... , score=-690121205215257728.000000, total=  13.5s


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   20.2s finished


In [48]:
print(model.best_score_)
print(model.best_params_)

-6.90121205215e+17
{}


In [50]:
model.score(X_test, y_test)

0.55399622137502469

In [49]:
pd.DataFrame(model.cv_results_).T

| | 0 |
|---|---|
| mean_fit_time | 12.8303 |
| mean_score_time | 0.639778 |
| mean_test_score | -6.90121e+17 |
| mean_train_score | 0.54982 |
| params | {} |
| rank_test_score | 1 |
| split0_test_score | -6.90121e+17 |
| split0_train_score | 0.54982 |
| std_fit_time | 0 |
| std_score_time | 0 |


In [54]:
y_test.describe()

count    100710.000000
mean          1.711108
std           1.121643
min          -2.954243
25%           1.350562
50%           1.863026
75%           2.368733
max           4.586019
Name: COMPLETION_HOURS_LOG_10, dtype: float64

In [44]:
print(model.best_score_)
print(model.best_params_)

0.554923033094
{'lassocv__alphas': [0.0001]}


In [38]:
pd.DataFrame(model.cv_results_).T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mean_fit_time | 27.3412 | 27.27 | 49.2425 | 32.7091 | 91.5081 |
| mean_score_time | 1.41861 | 1.37996 | 0.994238 | 1.30914 | 0.977914 |
| mean_test_score | 0.544238 | 0.508494 | 0.553598 | 0.551151 | 0.554923 |
| mean_train_score | 0.539395 | 0.503698 | 0.548536 | 0.546209 | 0.549749 |
| param_lassocv__alphas | [0.01] | [0.03] | [0.001] | [0.003] | [0.0001] |
| params | {u'lassocv__alphas': [0.01]} | {u'lassocv__alphas': [0.03]} | {u'lassocv__alphas': [0.001]} | {u'lassocv__alphas': [0.003]} | {u'lassocv__alphas': [0.0001]} |
| rank_test_score | 4 | 5 | 2 | 3 | 1 |
| split0_test_score | 0.544238 | 0.508494 | 0.553598 | 0.551151 | 0.554923 |
| split0_train_score | 0.539395 | 0.503698 | 0.548536 | 0.546209 | 0.549749 |
| std_fit_time | 0 | 0 | 0 | 0 | 0 |


In [39]:
model.score(X_test, y_test)

0.55400011672517624

In [51]:
y_pred = model.predict(X_test)

In [52]:
mean_squared_error(y_test, y_pred)**0.5

0.74906904394988794

In [40]:
y_pred = model.predict(X_test)

In [41]:
mean_squared_error(y_test, y_pred)**0.5

0.74906577279757391

In [None]:
X_test.shape

In [None]:
adjusted_r2(y_test, y_pred, X_test.shape[1])
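For reference, adjusted $R^2$ penalizes $R^2$ for the number of predictors $p$ given $n$ observations: $1 - (1 - R^2)\frac{n-1}{n-p-1}$. The sketch below is one plausible implementation matching the call signature above. The body of the real `adjusted_r2` helper is not shown in this notebook, so the name `adjusted_r2_sketch` and its internals are assumptions.

```python
def adjusted_r2_sketch(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    mean_y = sum(y_true) / float(n)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / float(n - n_features - 1)

# Toy example (made-up numbers): 5 observations, 1 feature.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 4.0]
adj = adjusted_r2_sketch(y_true, y_pred, n_features=1)  # R^2 is 0.9 here
```

With roughly 100,000 test rows and 133 features, the factor $\frac{n-1}{n-p-1}$ is about 1.001, so the adjusted value should differ from the plain $R^2$ by less than 0.001.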

How many coefficients go to 0?

In [None]:
pd.Series(model.best_estimator_.steps[-1][-1].coef_).describe()

In [None]:
len(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ != 0])

In [None]:
len(X_train.columns)

About half at `α=0.01`.

Which don't go to zero?

In [None]:
sorted(list(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ != 0]))

We will want to run a model with just the above features to find out which ones are statistically significant, but we get a sense here that these factors are likely to be significant:

- when source is from the mobile app or desktop website
- neighborhoods of East Boston and the North End
- the number of issues in the workers' queue at the time
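One way to follow up on the significance question, sketched under assumptions: refit ordinary least squares on only the surviving columns and look at each coefficient's t-statistic (coefficient divided by its standard error). The function `ols_t_statistics` and the synthetic data are hypothetical; a statistics library such as statsmodels reports the same quantities along with p-values.

```python
import numpy as np

def ols_t_statistics(X, y):
    """Fit OLS with an intercept and return (coefficients, t-statistics).

    Index 0 of each returned array is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    residuals = y - design.dot(beta)
    dof = n - design.shape[1]                         # residual degrees of freedom
    sigma2 = residuals.dot(residuals) / dof           # noise variance estimate
    cov = sigma2 * np.linalg.inv(design.T.dot(design))
    std_err = np.sqrt(np.diag(cov))
    return beta, beta / std_err

# Synthetic data (made up): y depends on the first feature only.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(size=500)
beta, t_stats = ols_t_statistics(X, y)
```

A large absolute t-statistic, as for the first feature in this synthetic example, suggests the coefficient is distinguishable from zero, while values near zero do not.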

## Conclusion

We didn't get a better $R^2$, which makes sense: we weren't overfitting in the first place when we tried this regularization parameter.

We did subset our features and got some indication of which ones are more likely to be significantly correlated with completion time than others. We also avoided the wild predictions that would otherwise have hurt our $R^2$, at least for this particular random seed.

## Appendix

These columns went to zero at `α=0.01`.

In [None]:
sorted(list(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ == 0]))

What are the coef values?

In [None]:
coef_values = pd.DataFrame({
    'name': X_train.columns,
    'coef': model.best_estimator_.steps[-1][-1].coef_
})

In [None]:
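# Combine both filters into one mask so it aligns with the frame being indexed.
nonzero_non_type = (coef_values.coef != 0) & ~coef_values.name.str.contains('TYPE')
coef_values[nonzero_non_type].sort_values('coef')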