## Objective

Fit a Decision Tree with pre-pruning to check whether the relationship in the data is non-linear, and to make use of the lat/long points. The goal is to improve on the $R^2$ of 0.56 from Linear Regression with outliers removed.

In [45]:
from __future__ import division
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
from scipy.stats import beta, binom

rcParams['figure.figsize'] = 20, 5
warnings.filterwarnings("ignore", category=DeprecationWarning)
sns.set_style("whitegrid")
sns.set_context("poster")

from helper_functions import dummify_cols_and_baselines, make_alphas, remove_outliers_by_type

In [2]:
df_orig = pd.read_pickle('../data/data_from_remove_from_dataset.pkl')
df_orig.shape

(516406, 40)

## Removing outliers

A standard procedure is to remove values more than 3 standard deviations from the mean. My data has many low values and a few very high ones, and anecdotally I think the low values are very likely to be genuine and the high values much less so.

So, I will remove values more than 3 SDs from the median instead, computed separately within each `TYPE`. The median is less affected by the extreme high values than the mean.

Ideally, I would also take the time dimension into account. I would like to do so given more time.
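The helper `remove_outliers_by_type` is imported from `helper_functions` and its source is not shown here. The following is only a minimal sketch of the rule described above, assuming the grouping column is `TYPE`; the function name `remove_outliers_sketch` and its parameters are hypothetical, not the actual helper.

```python
import pandas as pd


def remove_outliers_sketch(df, y_col, group_col="TYPE", stds=3):
    """Drop rows whose y value is more than `stds` SDs from its group's median."""
    grouped = df.groupby(group_col)[y_col]
    center = grouped.transform("median")  # per-group median, aligned to df's rows
    spread = grouped.transform("std")     # per-group standard deviation
    is_outlier = (df[y_col] - center).abs() > stds * spread
    # .copy() returns an independent frame, so later assignments do not
    # trigger pandas' SettingWithCopyWarning.
    return df.loc[~is_outlier].copy()


# Toy data: group "a" has one extreme value, group "b" has none.
toy = pd.DataFrame({
    "TYPE": ["a"] * 21 + ["b"] * 4,
    "y": [1.0] * 20 + [100.0] + [2.0] * 4,
})
cleaned = remove_outliers_sketch(toy, y_col="y")
print(toy.shape, cleaned.shape)
```

Building a boolean mask and selecting with `.loc` avoids assigning into a slice of a group, which is what produces the "set on a copy of a slice" warning.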

In [3]:
df_outliers_removed = remove_outliers_by_type(df_orig, y_col='COMPLETION_HOURS_LOG_10')
df_outliers_removed.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  group[pd.np.abs(group - group.median()) > stds * group.std()] = pd.np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._where(-key, value, inplace=True)


(508653, 40)

I'm removing ~1.5% of my rows.
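As a quick check of that figure, the fraction can be recomputed from the two shapes printed above (the variable names are just for illustration):

```python
rows_before = 516406  # df_orig.shape[0]
rows_after = 508653   # df_outliers_removed.shape[0]

rows_removed = rows_before - rows_after
fraction_removed = rows_removed / rows_before
print(rows_removed, "{:.2%}".format(fraction_removed))
```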

## Choosing columns

In [4]:
df_orig.head(1).loc[:, :]

| | CASE_ENQUIRY_ID | OPEN_DT | CLOSED_DT | TYPE | SubmittedPhoto | LOCATION_ZIPCODE | Property_Type | LATITUDE | LONGITUDE | Source | ... | housing | housing_std_dev | bedroom | bedroom_std_dev | value | value_std_dev | rent | rent_std_dev | income | income_std_dev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 905425 | 101001983786 | 2017-01-07 10:51:37 | 2017-01-07 11:46:43 | Request for Snow Plowing | True | 2124.0 | Address | 42.2809 | -71.068 | Citizens Connect App | ... | own | 26.870058 | 3 | 61.512329 | 350000.0 | 20.979404 | 1750 | 19.162161 | 112500 | 28.61672 |


In [5]:
cols_orig_dataset = ['COMPLETION_HOURS_LOG_10', 'TYPE', 'SubmittedPhoto', 'Property_Type', 'Source', 'neighborhood_from_zip']
cols_census = ['race_white',
     'race_black',
     'race_asian',
     'race_hispanic',
     'race_other',
     'poverty_pop_below_poverty_level',
     'earned_income_per_capita',
     'poverty_pop_w_public_assistance',
     'poverty_pop_w_food_stamps',
     'poverty_pop_w_ssi',
     'school',
     'school_std_dev',
     'housing',
     'housing_std_dev',
     'bedroom',
     'bedroom_std_dev',
     'value',
     'value_std_dev',
     'rent',
     'rent_std_dev',
     'income',
     'income_std_dev']
cols_engineered = ['queue_wk', 'queue_wk_open', 'is_description']
cols_for_trees = ['LATITUDE', 'LONGITUDE']

In [6]:
df = df_outliers_removed[cols_orig_dataset + cols_census + cols_engineered + cols_for_trees]

## Dummify

In [7]:
cols_to_dummify = df.dtypes[df.dtypes == object].index
cols_to_dummify

Index([u'TYPE', u'Property_Type', u'Source', u'neighborhood_from_zip',
       u'school', u'housing'],
      dtype='object')

In [8]:
df_dummified, baseline_cols = dummify_cols_and_baselines(df, cols_to_dummify)

Zoning is baseline 0 6
other is baseline 1 6
Twitter is baseline 2 6
West Roxbury is baseline 3 6
8_6th_grade is baseline 4 6
rent is baseline 5 6


In [9]:
df_dummified.shape

(508653, 255)
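`dummify_cols_and_baselines` also comes from `helper_functions` and is not shown. The printed baselines above (`Zoning`, `Twitter`, `West Roxbury`, ...) look like the last level of each column in sort order, so the sketch below assumes that rule. It is a hypothetical reconstruction of the general idea (one-hot encode each categorical column and drop one level as the baseline), not the actual helper.

```python
import pandas as pd


def dummify_sketch(df, cols):
    """One-hot encode `cols`, dropping the last level of each as its baseline."""
    out = df.copy()
    baselines = []
    for col in cols:
        dummies = pd.get_dummies(out[col], prefix=col)
        baseline = dummies.columns[-1]  # assumption: last level is the baseline
        baselines.append(baseline)
        out = pd.concat(
            [out.drop(col, axis=1), dummies.drop(baseline, axis=1)], axis=1
        )
    return out, baselines


toy = pd.DataFrame({
    "Source": ["App", "Twitter", "Phone"],
    "y": [1.0, 2.0, 3.0],
})
toy_dummified, toy_baselines = dummify_sketch(toy, ["Source"])
print(list(toy_dummified.columns), toy_baselines)
```

Dropping one level per column matters mainly for the linear model, where a full set of dummies is collinear with the intercept; a tree can split on any of the remaining indicator columns either way.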

## Running model

In [13]:
# sklearn.cross_validation is deprecated; everything needed lives in sklearn.model_selection
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


In [11]:
# Hold out 20% of the rows as a final test set
X_train, X_test, y_train, y_test = train_test_split(
    df_dummified.drop('COMPLETION_HOURS_LOG_10', axis=1),
    df_dummified.COMPLETION_HOURS_LOG_10,
    test_size=0.2,
    random_state=300
)

In [15]:
pipe = make_pipeline(DecisionTreeRegressor(splitter='random'))

In [32]:
# A single 80/20 split of the training set serves as the validation fold
cv = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)

In [58]:
params = {'decisiontreeregressor__max_depth': binom(100, 0.83),
          'decisiontreeregressor__min_samples_split': beta(1, 50), # http://homepage.divms.uiowa.edu/~mbognar/applets/beta.html
          'decisiontreeregressor__min_samples_leaf': beta(3, 40)}
model = RandomizedSearchCV(pipe, param_distributions=params, n_jobs=-1, cv=cv, verbose=100, n_iter=100)
model.fit(X_train, y_train);

Fitting 1 folds for each of 100 candidates, totalling 100 fits
Pickling array (shape=(254,), dtype=object).
Pickling array (shape=(406922,), dtype=object).
Pickling array (shape=(2, 406922), dtype=bool).
Memmaping (shape=(19, 406922), dtype=float64) to new file /dev/shm/joblib_memmaping_pool_52444_140214301504848/52444-140214389388432-7bbbe223bab168b8bb13d66f9c531fb8.pkl
Memmaping (shape=(5, 406922), dtype=int64) to new file /dev/shm/joblib_memmaping_pool_52444_140214301504848/52444-140214389388432-389b7c8adcba4caca8208567368f53bd.pkl
Memmaping (shape=(228, 406922), dtype=uint8) to new file /dev/shm/joblib_memmaping_pool_52444_140214301504848/52444-140214389388432-0dd6e0cfba23dae21160906f37814f3c.pkl
Pickling array (shape=(2,), dtype=object).
Pickling array (shape=(19,), dtype=object).
Pickling array (shape=(5,), dtype=object).
Pickling array (shape=(228,), dtype=object).
Pickling array (shape=(19,), dtype=int64).
Pickling array (shape=(5,), dtype=int64).
Pickling array (shape=(406922,
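To get a feel for which region of hyperparameter space the randomized search above actually explores, one can draw from the same three distributions and summarize the samples. This is only an illustrative sketch; the sample size and seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import beta, binom

rng = np.random.RandomState(300)
n_draws = 10000

# Same distributions as in the search: integer depths, fractional sample limits
depth_draws = binom(100, 0.83).rvs(size=n_draws, random_state=rng)
split_draws = beta(1, 50).rvs(size=n_draws, random_state=rng)
leaf_draws = beta(3, 40).rvs(size=n_draws, random_state=rng)

for name, draws in [("max_depth", depth_draws),
                    ("min_samples_split", split_draws),
                    ("min_samples_leaf", leaf_draws)]:
    low, high = np.percentile(draws, [5, 95])
    print(name, "mean:", draws.mean(), "5%-95%:", low, high)
```

Float values of `min_samples_split` and `min_samples_leaf` are interpreted by scikit-learn as fractions of the number of training samples, so these draws describe minimum node sizes relative to the training set rather than absolute row counts.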

In [67]:
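# `model2` is not defined in the cells shown; the sampled values below are consistent with the randomized search above
results = pd.DataFrame(model2.cv_results_)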
results[results.mean_test_score > 0.3].T

| | 8 |
|---|---|
| mean_fit_time | 10.8678 |
| mean_score_time | 1.05293 |
| mean_test_score | 0.313319 |
| mean_train_score | 0.314407 |
| param_decisiontreeregressor__max_depth | 85 |
| param_decisiontreeregressor__min_samples_leaf | 0.016263 |
| param_decisiontreeregressor__min_samples_split | 0.00741915 |
| params | {u'decisiontreeregressor__max_depth': 85, u'de... |
| rank_test_score | 1 |
| split0_test_score | 0.313319 |


In [19]:
results = pd.DataFrame(model.cv_results_).T
results

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mean_fit_time | 8.68406 | 8.76414 | 12.1061 | 8.32671 | 7.00678 |
| mean_score_time | 1.05191 | 1.04649 | 1.04655 | 1.03941 | 1.01848 |
| mean_test_score | 0.170211 | 0.170211 | 0.354086 | 0.170068 | 0.0663755 |
| mean_train_score | 0.170712 | 0.170712 | 0.354199 | 0.170543 | 0.0656133 |
| param_decisiontreeregressor__max_depth | 50 | 50 | 75 | 10 | 10 |
| param_decisiontreeregressor__min_samples_leaf | 0.05 | 0.05 | 0.01 | 0.05 | 0.1 |
| param_decisiontreeregressor__min_samples_split | 0.05 | 0.1 | 0.05 | 0.05 | 0.05 |
| params | {u'decisiontreeregressor__max_depth': 50, u'de... | {u'decisiontreeregressor__max_depth': 50, u'de... | {u'decisiontreeregressor__max_depth': 75, u'de... | {u'decisiontreeregressor__max_depth': 10, u'de... | {u'decisiontreeregressor__max_depth': 10, u'de... |
| rank_test_score | 2 | 2 | 1 | 4 | 5 |
| split0_test_score | 0.170211 | 0.170211 | 0.354086 | 0.170068 | 0.0663755 |
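In the results above, the candidates with `max_depth` 50 and 10 score almost identically at a leaf fraction of 0.05, which suggests `min_samples_leaf` is the binding constraint rather than depth. A float `min_samples_leaf` is a fraction of the training rows, so every leaf must hold at least that share of the data and the tree can have at most about 1 / fraction leaves, whatever `max_depth` allows. The sketch below illustrates this on synthetic data; the data, sizes and seeds are made up for the example, and it uses the default splitter rather than `splitter='random'`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(5000, 5))
y = np.sin(6 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=5000)

leaf_counts = {}
for frac in (0.1, 0.05, 0.01):
    tree = DecisionTreeRegressor(max_depth=75, min_samples_leaf=frac,
                                 random_state=0)
    tree.fit(X, y)
    # Leaves are the nodes without a left child (marked with -1)
    n_leaves = int((tree.tree_.children_left == -1).sum())
    leaf_counts[frac] = n_leaves
    print("min_samples_leaf:", frac,
          "leaves:", n_leaves,
          "depth reached:", tree.tree_.max_depth,
          "train R^2:", tree.score(X, y))
```

With a leaf fraction of 0.01 the tree can have at most about 100 leaves, so asking for a depth of 75 rather than 50 changes little; loosening the leaf constraint is what lets the tree grow.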


In [20]:
model.best_params_

{'decisiontreeregressor__max_depth': 75,
 'decisiontreeregressor__min_samples_leaf': 0.01,
 'decisiontreeregressor__min_samples_split': 0.05}

In [44]:
model.best_score_

0.35408592034503239

In [57]:
model.score(X_test, y_test)

0.35378645355806615

## Conclusion

After much trial and error, the best pre-pruned decision tree I could come up with has an $R^2$ score of 0.35, well below the 0.56 from Linear Regression.

I don't appear to be overfitting, since the CV_train and CV_test $R^2$s are nearly identical. When the tree is bigger, its $R^2$ gets lower, for both the CV_train and CV_test sets. I'm not exactly sure why at the moment. I suspect it has to do with the tree-building stopping rule (node purity) being different from minimizing RSS. Alternatively, the link between minimizing RSS and $R^2$ may not be what I expect. I'll need to read up more on this.
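On the relationship between RSS and $R^2$: by definition $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, where TSS is the total sum of squares around the mean of the targets. On a fixed dataset TSS is constant, so a lower RSS always means a higher $R^2$. `DecisionTreeRegressor` uses squared error as its default split criterion, so its notion of node purity is based on the same quantity. A small check of the identity on synthetic numbers (the data here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.5, size=200)  # noisy "predictions"

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - rss / tss
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

This does not by itself explain the scores above, but it rules out a mismatch between the split criterion and the metric as the cause.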