## Objective

Now that I know my submodels perform slightly better than a single model fit on the entire dataset, I can use them to find the features whose coefficients are most strongly associated with the response variable.

In [2]:
CATEGORY_GROUPS_IN_QUESTION = [
    ['Pick up Dead Animal'],
    ['Abandoned Vehicles', 'Abandoned Bicycle'],
    ['Rodent Activity', 'Bed Bugs', 'Mice Infestation - Residential'],
    ['Sidewalk Repair', 'Sidewalk Repair (Make Safe)'],
    ['Needle Pickup'],
    ['Unsatisfactory Living Conditions', 'Poor Conditions of Property', 'Unsanitary Conditions - Establishment', 'Illegal Occupancy', 'Heat - Excessive  Insufficient'],
    ['Request for Pothole Repair'],
    ['Graffiti Removal'],
]

## Objective

Does this submodel have statistically significant coefficients, assuming homoskedasticity, a linear predictor-response relationship, normally distributed residuals, and no perfect collinearity?
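
The residual assumptions listed above can be eyeballed numerically before trusting any p-values. The following is a rough sketch on synthetic data, not a formal test and not part of the original analysis: `residual_diagnostics` and the demo arrays are hypothetical names, and the checks (skewness, excess kurtosis, and the correlation between fitted values and absolute residuals) are informal heuristics only.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Fit OLS with an intercept and return informal residual checks."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    fitted = A.dot(beta)
    resid = y - fitted
    z = (resid - resid.mean()) / resid.std()
    return {
        # Both are close to 0 for normally distributed residuals.
        'skew': (z ** 3).mean(),
        'excess_kurtosis': (z ** 4).mean() - 3.0,
        # Close to 0 when residual spread does not trend with the fit.
        'spread_trend': np.corrcoef(fitted, np.abs(resid))[0, 1],
    }

# Synthetic data (an assumption for illustration): linear signal plus
# homoskedastic normal noise, so all three checks should be near zero.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 2))
y_demo = 1.0 + X_demo.dot([0.5, -0.2]) + rng.normal(scale=0.3, size=2000)
diag = residual_diagnostics(X_demo, y_demo)
```

Values far from zero would suggest that the reported standard errors and p-values deserve less trust.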

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
from tqdm import tqdm

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")
rcParams['figure.figsize'] = 20, 5

import os, sys
sys.path.append(os.path.join(os.path.dirname('.'), "../../preprocessing"))
sys.path.append(os.path.join(os.path.dirname('.'), ".."))
from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type, adjusted_r2, transform_school, get_vifs

from utilities import remove_one_feature



In [5]:
df_orig = pd.read_pickle('../../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

In [6]:
df_orig = transform_school(df_orig)
df_orig.shape

  df.school = df.school.str.extract(r'(\d\d?)').astype(int)


(516406, 40)

## Filtering by `TYPE`

In [7]:
i = 0
print CATEGORY_GROUPS_IN_QUESTION[i]
df_orig = df_orig[df_orig.TYPE.isin(CATEGORY_GROUPS_IN_QUESTION[i])]
df_orig.shape

['Pick up Dead Animal']


(7454, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. Because this dataset has many low values and a few very high ones, I suspect (without hard evidence) that the low values are mostly genuine and the very high values are not.

So I will remove values more than 3 SDs from the median, by type.

Ideally I would also take the time dimension into account, and I would like to do so given more time.
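
The rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of `remove_outliers_by_type` from `helper_functions` (which is not shown here); `remove_outliers_sketch` and the demo frame are hypothetical.

```python
import pandas as pd

def remove_outliers_sketch(df, y_col, type_col='TYPE', n_sd=3):
    """Keep rows within n_sd standard deviations of their group's median."""
    grouped = df.groupby(type_col)[y_col]
    center = grouped.transform('median')   # per-type median, aligned to rows
    spread = grouped.transform('std')      # per-type sample standard deviation
    # Note: a type with a single row has std NaN, so that row is dropped.
    keep = (df[y_col] - center).abs() <= n_sd * spread
    return df[keep]

# Made-up data: type A has one extreme value, type B has none.
demo = pd.DataFrame({
    'TYPE': ['A'] * 20 + ['B'] * 3,
    'COMPLETION_HOURS_LOG_10': [1.0] * 19 + [100.0] + [1.9, 2.0, 2.1],
})
cleaned = remove_outliers_sketch(demo, 'COMPLETION_HOURS_LOG_10')
```

Centering on the median rather than the mean keeps the cut-off from being dragged toward the extreme values it is meant to remove, although the standard deviation itself is still inflated by them.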

In [8]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

(7372, 40)

## Remove `TYPE` col

In [9]:
df_outliers_removed.drop('TYPE', axis=1, inplace=True)

## Choosing columns

In [None]:
['race_black',
 'race_other',
 'earned_income_per_capita',
 'poverty_pop_w_public_assistance',
 'poverty_pop_w_ssi',
 'bedroom',
 'bedroom_std_dev',
 'value_std_dev',
 'rent',
 'income',
 'is_description']

In [281]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'SubmittedPhoto']
cols_census = [
     'poverty_pop_below_poverty_level',
]
cols_engineered = ['queue_wk', 'queue_wk_open']

In [282]:
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered]
df.shape

(7372, 5)

## Removing NAs for cols like `school_std_dev`

In [283]:
aa = df.isnull().any().reset_index()
nas = aa[aa[0] == True]['index']
print nas

Series([], Name: index, dtype: object)


In [284]:
# this is a bad temporary band-aid
df = df.dropna(subset=nas.tolist())
df.shape

(7372, 5)

## Dummify

In [285]:
cols_to_dummify = [i for i in df.dtypes[df.dtypes == object].index if i != 'TYPE']
cols_to_dummify

[]

In [286]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify, chosen_col_i=2)

In [287]:
df_dummified.shape

(7372, 5)

## Checking for multicollinearity

In [259]:
df_dummified.head(1).T

Unnamed: 0,905400
COMPLETION_HOURS_LOG_10,0.0124857
SubmittedPhoto,False
poverty_pop_below_poverty_level,0.262473
rent_std_dev,0.0878398
income_std_dev,0.0461804
queue_wk,12873
queue_wk_open,1


In [19]:
get_vifs(df_dummified.drop(['SubmittedPhoto', 'is_description'], axis=1), 'COMPLETION_HOURS_LOG_10')

poverty_pop_w_food_stamps          4.003574
race_black                         3.204188
poverty_pop_below_poverty_level    3.189694
earned_income_per_capita           3.176380
income                             2.522978
race_hispanic                      2.204418
school                             1.954809
poverty_pop_w_ssi                  1.857534
school_std_dev                     1.718347
rent_std_dev                       1.688859
queue_wk                           1.651506
housing_std_dev                    1.645407
race_asian                         1.581442
value_std_dev                      1.535014
poverty_pop_w_public_assistance    1.485504
income_std_dev                     1.480597
rent                               1.435624
bedroom                            1.417223
Source_Citizens Connect App        1.386203
queue_wk_open                      1.362655
value                              1.344422
bedroom_std_dev                    1.152301
race_other                      
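
For reference, a variance inflation factor is conventionally defined as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below assumes that definition; it is not the `get_vifs` helper itself, and `vif_sketch` and the synthetic columns are hypothetical.

```python
import numpy as np

def vif_sketch(X):
    """Return one VIF per column of X, using VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other columns
        beta = np.linalg.lstsq(A, target, rcond=None)[0]
        resid = target - A.dot(beta)
        r2 = 1.0 - resid.dot(resid) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic check: column 1 is nearly a copy of column 0, column 2 is independent.
rng = np.random.RandomState(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.1 * rng.normal(size=500)
x2 = rng.normal(size=500)
vifs = vif_sketch(np.column_stack([x0, x1, x2]))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic collinearity, which none of the values listed above reach.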

## Running model

In [20]:
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.cross_validation import ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import string
from StringIO import StringIO


In [21]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1), 
    df_dummified.COMPLETION_HOURS_LOG_10, 
    test_size=0.2, 
    random_state=300
)

## Use LassoCV to find col subsets

In [23]:
pipe = make_pipeline(StandardScaler(), LassoCV())
cv = ShuffleSplit(X_train.shape[0], n_iter=1, test_size=0.2, random_state=300)

In [28]:
params = {'lassocv__alphas': make_alphas(-3, 0)}
# params = {'lassocv__alphas': make_alphas(-2, -2)}
model = GridSearchCV(pipe, param_grid=params, n_jobs=1, cv=cv, verbose=0)
model.fit(X_train, y_train)
pd.DataFrame(model.cv_results_).T.iloc[2:5]

Unnamed: 0,0,1,2,3,4,5,6
mean_test_score,0.0180734,0.0183029,0.0171696,0.00765176,-0.00271772,-0.00271772,-0.00271772
mean_train_score,0.0202094,0.0189817,0.0155542,0.00732475,0,0,0
param_lassocv__alphas,[0.001],[0.003],[0.01],[0.03],[0.1],[0.3],[1.0]


In [29]:
model.best_params_

{'lassocv__alphas': [0.0030000000000000001]}

In [30]:
'{} cols go to zero out of {}'.format(
    len(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ == 0]),
    len(X_train.columns)
)

'11 cols go to zero out of 26'

In [31]:
cols_zero = list(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ == 0])
cols_zero

['race_black',
 'race_other',
 'earned_income_per_capita',
 'poverty_pop_w_public_assistance',
 'poverty_pop_w_ssi',
 'bedroom',
 'bedroom_std_dev',
 'value_std_dev',
 'rent',
 'income',
 'is_description']

## Use subsetted cols to run lin reg

In [288]:
df_dummified.columns = [col.translate(None, string.punctuation).replace(' ', '') if col != 'COMPLETION_HOURS_LOG_10' else col for col in df_dummified.columns]

In [289]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop(['COMPLETION_HOURS_LOG_10'], axis=1), 
    df_dummified.COMPLETION_HOURS_LOG_10, 
    test_size=0.2, 
    random_state=300
)

In [290]:
col_list = ' + '.join(df_dummified.drop(['COMPLETION_HOURS_LOG_10'], axis=1).columns)

est = smf.ols(
    'COMPLETION_HOURS_LOG_10 ~ {}'.format(col_list), 
    pd.concat([X_train, y_train], axis=1)).fit()

In [291]:
est.summary().tables[0]

0,1,2,3
Dep. Variable:,COMPLETION_HOURS_LOG_10,R-squared:,0.019
Model:,OLS,Adj. R-squared:,0.018
Method:,Least Squares,F-statistic:,28.71
Date:,"Sat, 04 Mar 2017",Prob (F-statistic):,1.16e-23
Time:,14:40:28,Log-Likelihood:,-4891.1
No. Observations:,5897,AIC:,9792.0
Df Residuals:,5892,BIC:,9826.0
Df Model:,4,,
Covariance Type:,nonrobust,,


### Getting adjusted $R^2$ on test set

In [294]:
y_pred = est.predict(X_test)

In [295]:
adjusted_r2(y_test, y_pred, num_features=X_test.shape[1])

0.020443590379377725
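
The `adjusted_r2` helper is imported from `helper_functions` and is not shown here. Assuming it uses the standard definition, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ with $n$ observations and $p$ features, an equivalent sketch would be the following (`adjusted_r2_sketch` and the toy values are hypothetical).

```python
def adjusted_r2_sketch(y_true, y_pred, num_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    mean_y = sum(y_true) / float(n)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / float(n - num_features - 1)

# Toy values, made up for illustration only.
toy_adj_r2 = adjusted_r2_sketch(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 1.9, 3.2, 3.9, 5.1],
    num_features=1,
)
```

With only 4 features and over a thousand test rows, the adjustment is tiny, so the test-set figure above is close to the plain $R^2$.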

In [296]:
mean_squared_error(y_test, y_pred)**0.5

0.57095618044238183

## Interpreting model

Which features are most associated with completion time?

In [292]:
df_results = pd.read_csv(StringIO(est.summary().tables[1].as_csv()), index_col=0).reset_index()
df_results.columns = ['coef_name'] + [i.rstrip().lstrip() for i in df_results.columns][1:]
df_results.coef_name = df_results.coef_name.map(lambda x: x.strip())
df_results = df_results.sort_values('P>|t|')
df_results['pct_diff_for_y'] = (10**df_results.coef - 1) * 100
df_results['pct_diff_for_y_abs'] = np.abs((10**df_results.coef - 1) * 100)
df_results.sort_values('pct_diff_for_y_abs', inplace=True, ascending=False)
df_results.shape

(5, 8)

In [293]:
df_results.sort_values('P>|t|')

Unnamed: 0,coef_name,coef,std err,t,P>|t|,[95.0% Conf. Int.],pct_diff_for_y,pct_diff_for_y_abs
0,Intercept,0.4404,0.024,18.654,0.0,0.394 0.487,175.676661,175.676661
1,SubmittedPhoto[T.True],0.113,0.025,4.573,0.0,0.065 0.161,29.717927,29.717927
4,queuewkopen,-0.0015,0.0,-3.92,0.0,-0.002 -0.001,-0.344792,0.344792
3,queuewk,-1.6e-05,2e-06,-6.814,0.0,-2.06e-05 -1.14e-05,-0.003689,0.003689
2,povertypopbelowpovertylevel,-0.1046,0.052,-2.003,0.045,-0.207 -0.002,-21.40408,21.40408
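
The `pct_diff_for_y` column follows from the response being $\log_{10}$ of completion hours: a one-unit increase in a predictor adds its coefficient $\beta$ to $\log_{10}(\text{hours})$, which multiplies hours by $10^{\beta}$, a change of $(10^{\beta} - 1) \times 100$ percent. A small sketch reproducing two rows of the table above (`pct_change_in_hours` is a hypothetical helper name):

```python
def pct_change_in_hours(coef):
    """Percent change in completion hours per one-unit predictor increase,
    for a model whose response is log10(hours)."""
    return (10 ** coef - 1) * 100

# Coefficients copied from the summary table above.
photo_effect = pct_change_in_hours(0.113)       # SubmittedPhoto[T.True]
poverty_effect = pct_change_in_hours(-0.1046)   # povertypopbelowpovertylevel
```

Note that "one unit" for a proportion such as `poverty_pop_below_poverty_level` means moving from 0 to 1, so its percentage is not directly comparable to that of a 0/1 flag like `SubmittedPhoto`.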


In [154]:
df_results[df_results['P>|t|'] < 0.1]

Unnamed: 0,coef_name,coef,std err,t,P>|t|,[95.0% Conf. Int.],pct_diff_for_y,pct_diff_for_y_abs
0,Intercept,2.4058,0.026,91.46,0.0,2.354 2.457,25356.576639,25356.576639
1,isdescription[T.True],-0.0998,0.014,-7.336,0.0,-0.126 -0.073,-20.530588,20.530588
2,raceblack,0.0723,0.018,4.061,0.0,0.037 0.107,18.113625,18.113625
4,racehispanic,0.0715,0.03,2.358,0.018,0.012 0.131,17.896252,17.896252
9,SourceConstituentCall,-0.0428,0.014,-2.97,0.003,-0.071 -0.015,-9.38502,9.38502
8,SourceCitizensConnectApp,-0.0395,0.019,-2.113,0.035,-0.076 -0.003,-8.693856,8.693856
7,queuewkopen,0.001,6.7e-05,14.848,0.0,0.001 0.001,0.230524,0.230524
6,rent,-1.7e-05,8e-06,-2.035,0.042,-3.32e-05 -6.2e-07,-0.003889,0.003889


In [280]:
scores = []

for col in X_train.columns:
    if col != 'Intercept':
        score = remove_one_feature([col], df_dummified)
        scores.append((col, score))
        
sorted(scores, key=lambda x: x[1])[::-1]        

[('rentstddev', 13),
 ('SubmittedPhoto', 12.831),
 ('queuewkopen', 12.765),
 ('povertypopbelowpovertylevel', 12.239),
 ('queuewk', 8.798)]

## Interpretation

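The most we can read from these coefficients is their sign, plus relative magnitude for coefficients measured on comparable scales. Ranking coefficients across different units is meaningless, e.g. for rent (a continuous variable) vs. "source: mobile app" (a 0/1 dummy).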
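
To see why rank depends on units, consider refitting the same relationship after rescaling a predictor. The numbers below are made up purely for illustration, and `ols_slope` is a hypothetical one-variable helper, not part of the analysis above.

```python
def ols_slope(x, y):
    """Slope of a one-variable least-squares fit: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Made-up values: the same predictor expressed in two different units.
rent_dollars = [800.0, 1200.0, 1500.0, 2000.0, 2600.0]
rent_thousands = [r / 1000.0 for r in rent_dollars]
log_hours = [2.41, 2.40, 2.38, 2.37, 2.35]

slope_per_dollar = ols_slope(rent_dollars, log_hours)
slope_per_thousand = ols_slope(rent_thousands, log_hours)
```

The fit is identical in both cases, yet the coefficient is 1000 times larger in the second, so sorting raw coefficients by magnitude mostly reflects the choice of units.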

All of the results are counterintuitive.

### Weird results
- Having more **issues in the queue** is associated with a **better** (shorter) completion time.
- Having a higher proportion of **people below the poverty level** is associated with a **better** (shorter) completion time.
- **Submitting a photo** is associated with a **worse** (longer) completion time.

In [130]:
df_dummified.COMPLETION_HOURS_LOG_10.map(lambda x: 10**x).describe()

count    9390.000000
mean      383.803573
std       405.297599
min         2.275000
25%       171.718333
50%       307.753611
75%       479.439375
max      7198.562222
Name: COMPLETION_HOURS_LOG_10, dtype: float64