## Objective

Now that I know my submodels perform a bit better than the model fit on the entire dataset, I can use them to find the coefficients most strongly associated with the response variable.

In [2]:
CATEGORY_GROUPS_IN_QUESTION = [
    ['Pick up Dead Animal', 'Animal Generic Request'],
    ['Abandoned Vehicles', 'Abandoned Bicycle'],
    ['Rodent Activity', 'Bed Bugs', 'Mice Infestation - Residential'],
    ['Sidewalk Repair', 'Sidewalk Repair (Make Safe)'],
    ['Needle Pickup'],
    ['Unsatisfactory Living Conditions', 'Poor Conditions of Property', 'Unsanitary Conditions - Establishment', 'Illegal Occupancy', 'Heat - Excessive  Insufficient'],
    ['Request for Pothole Repair'],
    ['Graffiti Removal'],
]

## Objective

Does this submodel have statistically significant coefficients, assuming homoscedasticity, a linear predictor-response relationship, normally distributed residuals, and no perfect collinearity?
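For reference, the model behind those assumptions can be written out as below. The notation ($\beta_j$, $x_{ij}$, $\varepsilon_i$) is mine rather than the notebook's, and this is only the textbook form of OLS with normal errors, where $y_i$ is `COMPLETION_HOURS_LOG_10` for request $i$:

$$
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{independently}
$$

Under these assumptions, each coefficient's $t$ statistic tests $H_0\colon \beta_j = 0$, which is what the `P>|t|` column in the regression summary reports.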

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
from tqdm import tqdm

from utilities import remove_one_feature

warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")
rcParams['figure.figsize'] = 20, 5

import os, sys
sys.path.append(os.path.join(os.path.dirname('.'), "../preprocessing"))
from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type, adjusted_r2, transform_school, get_vifs



In [5]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

In [6]:
df_orig = transform_school(df_orig)
df_orig.shape

(516406, 40)

## Filtering by `TYPE`

In [7]:
i = 2
print CATEGORY_GROUPS_IN_QUESTION[i]
df_orig = df_orig[df_orig.TYPE.isin(CATEGORY_GROUPS_IN_QUESTION[i])]
df_orig.shape

['Rodent Activity', 'Bed Bugs', 'Mice Infestation - Residential']


(12336, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. This data has many low values and a few very high ones, and my informal read is that the low values are very likely genuine while the high values are less trustworthy.

So I will instead remove values more than 3 SDs from the median, by type. Centering on the median keeps the center of the cutoff from being pulled upward by the extreme high values.

Ideally I would also take the time dimension into account, and would like to do so given more time.
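As a rough illustration of the rule described above, here is a minimal standard-library sketch. It is not the code of `remove_outliers_by_type` (that helper lives in `helper_functions` and is not shown here); the function name and the example numbers are hypothetical.

```python
from statistics import median, stdev


def keep_within_sds_of_median(values, n_sds=3):
    """Keep values no further than n_sds sample SDs from the median."""
    center = median(values)
    spread = stdev(values)  # sample SD (ddof=1), like pandas' default
    return [v for v in values if abs(v - center) <= n_sds * spread]


# Hypothetical response values, grouped by request type.
hours_by_type = {
    'Rodent Activity': [1.0, 1.2, 0.8, 1.1, 0.9] * 4 + [50.0],
    'Bed Bugs': [2.0, 2.1, 1.9, 2.0],
}

# Apply the filter separately within each type.
cleaned = {
    request_type: keep_within_sds_of_median(values)
    for request_type, values in hours_by_type.items()
}

for request_type, values in cleaned.items():
    print(request_type, len(hours_by_type[request_type]), '->', len(values))
```

Filtering within each type matters because a value that is extreme for one request type can be ordinary for another.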

In [8]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  group[pd.np.abs(group - group.median()) > stds * group.std()] = pd.np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.where(-key, value, inplace=True)


(12200, 40)

## Remove `TYPE` col

In [9]:
df_outliers_removed.drop('TYPE', axis=1, inplace=True)

## Choosing columns

In [10]:
['SubmittedPhoto',
 'race_asian',
 'race_other',
 'earned_income_per_capita',
 'poverty_pop_w_public_assistance',
 'poverty_pop_w_food_stamps',
 'poverty_pop_w_ssi',
 'school',
 'school_std_dev',
 'bedroom',
 'bedroom_std_dev',
 'value',
 'rent',
 'rent_std_dev',
 'income',
 'income_std_dev']

In [11]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'Property_Type', 'Source']
cols_census = [
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'housing',
     'housing_std_dev',
     'value_std_dev',
]
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']

In [12]:
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered]
df.shape

(12200, 13)

## Removing NAs for cols like `school_std_dev`

In [13]:
aa = df.isnull().any().reset_index()
nas = aa[aa[0] == True]['index']
print nas

8    housing_std_dev
9      value_std_dev
Name: index, dtype: object


In [14]:
# this is a bad temporary band-aid
df = df.dropna(subset=nas.tolist())
df.shape

(11808, 13)

## Dummify

In [15]:
cols_to_dummify = [i for i in df.dtypes[df.dtypes == object].index if i != 'TYPE']
cols_to_dummify

['Property_Type', 'Source', 'housing']

In [16]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify, chosen_col_i=2)

Intersection is baseline 0 3
Self Service is baseline 1 3
own is baseline 2 3
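The idea behind dummifying against a baseline can be sketched in plain Python. This is an assumption about what `dummify_cols_and_baselines` does conceptually, not its actual implementation; the function below and its example rows are hypothetical.

```python
def dummify_with_baseline(rows, col, baseline):
    """One-hot encode `col`, omitting the `baseline` level.

    `rows` is a list of dicts. Every other level of `col` becomes a 0/1
    column named '<col>_<level>'; the baseline is the all-zeros case.
    """
    levels = sorted({row[col] for row in rows} - {baseline})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != col}
        for level in levels:
            new_row['{}_{}'.format(col, level)] = int(row[col] == level)
        encoded.append(new_row)
    return encoded


# Hypothetical rows; the level names echo the notebook's `Source` column.
rows = [
    {'Source': 'Self Service', 'queue_wk': 10},
    {'Source': 'Constituent Call', 'queue_wk': 12},
    {'Source': 'Citizens Connect App', 'queue_wk': 7},
]

for row in dummify_with_baseline(rows, 'Source', baseline='Self Service'):
    print(row)
```

Dropping one level per categorical column avoids perfect collinearity with the intercept, and each remaining dummy coefficient reads as a difference from the baseline level.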


In [17]:
df_dummified.shape

(11808, 13)

## Checking for multicollinearity

In [96]:
df_dummified.head(1).T

Unnamed: 0,905400
COMPLETION_HOURS_LOG_10,0.0124857
SubmittedPhoto,False
poverty_pop_below_poverty_level,0.262473
bedroom,2
queue_wk,12873
queue_wk_open,1
is_description,True


In [18]:
get_vifs(df_dummified.drop(['SubmittedPhoto', 'is_description'], axis=1), 'COMPLETION_HOURS_LOG_10')

poverty_pop_w_food_stamps          4.791899
earned_income_per_capita           3.418244
race_black                         3.066740
poverty_pop_below_poverty_level    3.022113
income                             2.569122
housing_std_dev                    2.322690
school                             2.255340
race_hispanic                      2.200435
poverty_pop_w_ssi                  2.048711
school_std_dev                     1.937076
race_asian                         1.793934
value_std_dev                      1.751350
poverty_pop_w_public_assistance    1.747422
Source_Citizens Connect App        1.712687
bedroom                            1.678416
rent_std_dev                       1.544883
rent                               1.539295
value                              1.500641
income_std_dev                     1.478636
Source_Constituent Call            1.458455
queue_wk                           1.416018
queue_wk_open                      1.293717
bedroom_std_dev                 
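For context, the standard definition of the variance inflation factor is given below. I am assuming `get_vifs` follows it, since its implementation is not shown in this notebook:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Here $R_j^2$ is the $R^2$ from regressing predictor $j$ on all the other predictors. A value of 1 means predictor $j$ is uncorrelated with the rest, and common rules of thumb flag values above 5 or 10. Every value listed in the output above is below 5.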

## Running model

In [20]:
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.cross_validation import ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import string
from StringIO import StringIO


In [21]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1), 
    df_dummified.COMPLETION_HOURS_LOG_10, 
    test_size=0.2, 
    random_state=300
)

## Use LassoCV to find col subsets

In [22]:
pipe = make_pipeline(StandardScaler(), LassoCV())

In [23]:
cv = ShuffleSplit(X_train.shape[0], n_iter=1, test_size=0.2, random_state=300)

In [30]:
params = {'lassocv__alphas': make_alphas(-2, 4)}
model = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=cv, verbose=1)
model.fit(X_train, y_train);

Fitting 1 folds for each of 13 candidates, totalling 13 fits


[Parallel(n_jobs=-1)]: Done  13 out of  13 | elapsed:    0.8s finished


In [31]:
pd.DataFrame(model.cv_results_).T.iloc[2:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
mean_test_score,0.0460184,0.0363292,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688,-0.000671688
mean_train_score,0.047898,0.0373485,0,0,0,0,0,0,0,0,0,0,0
param_lassocv__alphas,[0.01],[0.03],[0.1],[0.3],[1.0],[3.0],[10.0],[30.0],[100.0],[300.0],[1000.0],[3000.0],[10000.0]


In [29]:
model.best_params_

{'lassocv__alphas': [0.0001]}

In [32]:
'{} cols go to zero out of {}'.format(
    len(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ == 0]),
    len(X_train.columns)
)

'16 cols go to zero out of 27'

In [33]:
cols_zero = list(X_train.columns[model.best_estimator_.steps[-1][-1].coef_ == 0])
cols_zero

['SubmittedPhoto',
 'race_asian',
 'race_other',
 'earned_income_per_capita',
 'poverty_pop_w_public_assistance',
 'poverty_pop_w_food_stamps',
 'poverty_pop_w_ssi',
 'school',
 'school_std_dev',
 'bedroom',
 'bedroom_std_dev',
 'value',
 'rent',
 'rent_std_dev',
 'income',
 'income_std_dev']

## Use subsetted cols to run lin reg

In [22]:
df_dummified.columns = [col.translate(None, string.punctuation).replace(' ', '') if col != 'COMPLETION_HOURS_LOG_10' else col for col in df_dummified.columns]

In [23]:
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop(['COMPLETION_HOURS_LOG_10'], axis=1), 
    df_dummified.COMPLETION_HOURS_LOG_10, 
    test_size=0.2, 
    random_state=300
)

In [24]:
col_list = ' + '.join(df_dummified.drop(['COMPLETION_HOURS_LOG_10'], axis=1).columns)

est = smf.ols(
    'COMPLETION_HOURS_LOG_10 ~ {}'.format(col_list), 
    pd.concat([X_train, y_train], axis=1)).fit()

In [61]:
est.summary().tables[0]

0,1,2,3
Dep. Variable:,COMPLETION_HOURS_LOG_10,R-squared:,0.053
Model:,OLS,Adj. R-squared:,0.051
Method:,Least Squares,F-statistic:,43.69
Date:,"Thu, 23 Feb 2017",Prob (F-statistic):,1.31e-101
Time:,08:07:14,Log-Likelihood:,-7467.1
No. Observations:,9446,AIC:,14960.0
Df Residuals:,9433,BIC:,15050.0
Df Model:,12,,
Covariance Type:,nonrobust,,


### Getting adjusted $R^2$ on test set

In [25]:
y_pred = est.predict(X_test)

In [26]:
adjusted_r2(y_test, y_pred, num_features=X_test.shape[1])

0.037585441784451626
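The usual adjusted $R^2$ formula is $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ features. A minimal sketch follows, assuming the `adjusted_r2` helper uses this standard formula (its source is not shown); the function name and toy data are hypothetical.

```python
def adjusted_r2_sketch(y_true, y_pred, num_features):
    """Adjusted R^2: penalizes R^2 for the number of features used."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - num_features - 1)


# Toy data, not from the notebook.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(adjusted_r2_sketch(y_true, y_pred, num_features=1), 4))
```

On a held-out test set the plain $R^2$ can already be negative, so the adjusted value is best read as a conservative summary rather than a strict proportion of variance explained.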

In [27]:
mean_squared_error(y_test, y_pred)**0.5

0.52784835063471025

## Interpreting model

Which features are most associated with completion time?

In [30]:
df_results = pd.read_csv(StringIO(est.summary().tables[1].as_csv()), index_col=0).reset_index()
df_results.columns = ['coef_name'] + [i.rstrip().lstrip() for i in df_results.columns][1:]
df_results.coef_name = df_results.coef_name.map(lambda x: x.strip())
df_results = df_results.sort_values('P>|t|')
df_results['pct_diff_for_y'] = (10**df_results.coef - 1) * 100
df_results['pct_diff_for_y_abs'] = pd.np.abs((10**df_results.coef - 1) * 100)
df_results.sort_values('pct_diff_for_y_abs', inplace=True, ascending=False)
df_results.shape

(13, 8)
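The `pct_diff_for_y` column relies on the response being $\log_{10}$ of completion hours: raising a predictor by one unit adds its coefficient to $\log_{10}(y)$, which multiplies $y$ by $10^{\text{coef}}$. A small sketch of that conversion is below. The helper name is hypothetical, and the coefficients are the ones reported in the regression table.

```python
def pct_change_in_y(coef, base=10):
    """Percent change in y per one-unit increase in x, for a model
    fit on log_base(y)."""
    return (base ** coef - 1) * 100


# Coefficients as reported in the regression summary.
reported = [
    ('SourceCitizensConnectApp', -0.1846),
    ('isdescription', -0.1719),
    ('PropertyTypeAddress', 0.0791),
]

for name, coef in reported:
    print('{}: {:+.1f}%'.format(name, pct_change_in_y(coef)))
```

Note that the conversion is per unit of the predictor, so the size of `pct_diff_for_y` depends on each predictor's scale. Proportions that range from 0 to 1 are not directly comparable with counts such as `queue_wk`.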

In [32]:
df_results

Unnamed: 0,coef_name,coef,std err,t,P>|t|,[95.0% Conf. Int.],pct_diff_for_y,pct_diff_for_y_abs
0,Intercept,2.4981,0.035,70.648,0.0,2.429 2.567,31384.731934,31384.731934
7,valuestddev,0.6195,0.256,2.419,0.016,0.118 1.122,316.389721,316.389721
11,SourceCitizensConnectApp,-0.1846,0.031,-6.045,0.0,-0.244 -0.125,-34.626761,34.626761
1,isdescription[T.True],-0.1719,0.013,-12.803,0.0,-0.198 -0.146,-32.686837,32.686837
5,raceother,-0.1385,0.102,-1.361,0.174,-0.338 0.061,-27.30576,27.30576
6,housingstddev,0.0917,0.038,2.395,0.017,0.017 0.167,23.509397,23.509397
10,PropertyTypeAddress,0.0791,0.022,3.659,0.0,0.037 0.121,19.977553,19.977553
4,racehispanic,0.0762,0.038,1.998,0.046,0.001 0.151,19.179072,19.179072
3,raceasian,0.0727,0.051,1.438,0.151,-0.026 0.172,18.222462,18.222462
12,SourceConstituentCall,0.0546,0.018,2.969,0.003,0.019 0.091,13.396591,13.396591


In [31]:
df_results[(df_results['P>|t|'] < 0.1) & (df_results.coef_name != 'Intercept')]

Unnamed: 0,coef_name,coef,std err,t,P>|t|,[95.0% Conf. Int.],pct_diff_for_y,pct_diff_for_y_abs
7,valuestddev,0.6195,0.256,2.419,0.016,0.118 1.122,316.389721,316.389721
11,SourceCitizensConnectApp,-0.1846,0.031,-6.045,0.0,-0.244 -0.125,-34.626761,34.626761
1,isdescription[T.True],-0.1719,0.013,-12.803,0.0,-0.198 -0.146,-32.686837,32.686837
6,housingstddev,0.0917,0.038,2.395,0.017,0.017 0.167,23.509397,23.509397
10,PropertyTypeAddress,0.0791,0.022,3.659,0.0,0.037 0.121,19.977553,19.977553
4,racehispanic,0.0762,0.038,1.998,0.046,0.001 0.151,19.179072,19.179072
12,SourceConstituentCall,0.0546,0.018,2.969,0.003,0.019 0.091,13.396591,13.396591
2,raceblack,-0.0515,0.024,-2.115,0.034,-0.099 -0.004,-11.182202,11.182202
9,queuewkopen,0.0003,4.8e-05,5.348,0.0,0.000 0.000,0.069101,0.069101
8,queuewk,-6e-06,2e-06,-3.109,0.002,-9.23e-06 -2.09e-06,-0.001303,0.001303


In [50]:
scores = []

for col in X_train.columns:
    if col != 'Intercept':
        score = remove_one_feature([col], df_dummified)
        scores.append((col, score))
        
sorted(scores, key=lambda x: x[1])[::-1]        

[('povertypopbelowpovertylevel', 34.675),
 ('housingstddev', 32.474000000000004),
 ('raceasian', 30.57),
 ('valuestddev', 30.568),
 ('isdescription', 30.462),
 ('raceother', 30.351),
 ('racehispanic', 30.161),
 ('raceblack', 28.481),
 ('SourceCitizensConnectApp', 28.402),
 ('SourceConstituentCall', 28.344),
 ('PropertyTypeAddress', 28.279),
 ('queuewk', 28.198999999999998),
 ('queuewkopen', 26.027)]

## Conclusion

- **More diverse housing values** in an area is associated with **worse** completion time.
- **More diversity in whether people own or rent** is associated with **worse** completion time.


- **More Hispanic** areas are associated with **worse** completion time.
- **More Black** areas are associated with **better** completion time: the opposite direction from Hispanic, and a somewhat smaller effect (about -11% vs. +19%).


- **From app** is associated with **better** completion time compared to website.
- **From call** is associated with **worse** completion time compared to website (coefficient +0.0546, about +13%), a smaller effect than from app.
- **Description** is associated with **better** completion time, to about the same degree as from app.


- **Location: address** is associated with **worse** completion time, compared to intersection.


- **Lots of open issues in queue** is associated with **worse** completion time.


### Notable
- **More diverse housing values** in an area is associated with **worse** completion time.
- **More diversity in whether people own or rent** is associated with **worse** completion time.

- **From app** is associated with **better** completion time compared to website, while **from call** is associated with **worse** completion time.