Let's try a super-simple decision tree model with pruning (via `max_depth`), to see to what extent the data is non-linear. For comparison, Linear Regression's RMSE for one of the folds was 1,881 and its $R^2$ was 0.24.

In [41]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_hdf('../data/data_till_june_2016_w_descs_and_census.h5')
df = df.dropna(subset=['CLOSED_DT'])
df.shape

(718936, 155)

The performance between the CV and test sets will probably vary widely, because decision trees are very sensitive to the data they are trained on.

## Splitting the data

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('COMPLETION_TIME', axis=1), 
    df.COMPLETION_TIME, 
    test_size=0.2, 
    random_state=300
)

In [4]:
ss = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)
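As a quick illustration of what `ShuffleSplit` yields, here is a minimal sketch on a toy array. The array `y_toy` and the name `ss_demo` are made up for this example and stand in for `y_train` and `ss`.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy target with 10 samples, standing in for y_train (hypothetical data).
y_toy = np.arange(10)

ss_demo = ShuffleSplit(n_splits=1, test_size=0.2, random_state=300)

# Each iteration yields two arrays of positional indices: train and test.
for train_idx, test_idx in ss_demo.split(y_toy):
    print(len(train_idx), len(test_idx))
```

The indices are positions, not labels, which is why the cells below use `.iloc` to select rows.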

## Making the regressors

In [5]:
label_encoder = LabelEncoder()
label_encoder.fit(df['TYPE']);
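For readers unfamiliar with `LabelEncoder`, here is a minimal sketch of what the encoding step does. The category names are invented for illustration, since the real `TYPE` values are not shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical request types, standing in for df['TYPE'].
types = np.array(['Pothole', 'Graffiti', 'Pothole', 'Streetlight'])

encoder = LabelEncoder()
encoder.fit(types)

codes = encoder.transform(types)   # 1-D array of integer codes
X = codes.reshape(-1, 1)           # same result as codes[:, np.newaxis]

print(encoder.classes_)            # categories in sorted order
print(codes.shape, X.shape)
```

The integer codes impose an arbitrary (alphabetical) order on the categories. A tree can still isolate individual categories with enough threshold splits, which is one reason tree depth matters for this feature. The extra axis is needed because scikit-learn estimators expect a 2-D feature matrix.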

In [6]:
pipe = make_pipeline(DecisionTreeRegressor(max_depth=5))

In [7]:
training_scores = []
test_scores = []

for CV_train_index, CV_test_index in ss.split(y_train):
    # Encode TYPE as integers and reshape into a single-column 2-D feature matrix
    X_CV_train = label_encoder.transform(X_train.iloc[CV_train_index]['TYPE']).reshape(-1, 1)
    X_CV_test = label_encoder.transform(X_train.iloc[CV_test_index]['TYPE']).reshape(-1, 1)
    y_CV_train = y_train.iloc[CV_train_index]
    y_CV_test = y_train.iloc[CV_test_index]

    pipe.fit(X_CV_train, y_CV_train)

    # For a regressor, score() returns R^2
    training_score = pipe.score(X_CV_train, y_CV_train)
    test_score = pipe.score(X_CV_test, y_CV_test)

    training_scores.append(training_score)
    test_scores.append(test_score)
    print("Training: {}. Test: {}".format(training_score, test_score))

Training: 0.217100930488. Test: 0.212150795535
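A more compact way to collect training and test scores over a splitter is `cross_validate`. The sketch below runs on synthetic data (`X_demo`, `y_demo` are made up), so its numbers say nothing about the real dataset; it only shows the shape of the call.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: one integer-coded feature and a noisy target.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 20, size=(1000, 1))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(0, 3, size=1000)

cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=300)

# return_train_score=True adds 'train_score' next to 'test_score'.
scores = cross_validate(
    DecisionTreeRegressor(max_depth=5),
    X_demo, y_demo, cv=cv, return_train_score=True,
)
print(scores['train_score'], scores['test_score'])
```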


I'm predicting quite different CV scores for each fold, and perhaps better performance than Linear Regression's $R^2$ of 0.24.

Funny: the $R^2$ scores are all pretty similar. I'm not exactly sure why, and would need to investigate and think about that more. Also, `max_depth=2` results in $R^2$ values around 0.11, while `max_depth=5` results in values around 0.21.

I should think about what would be good, intuitive values for `max_depth`.
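One way to build intuition for `max_depth` with a single integer-coded feature: a tree can never have more leaves than there are distinct feature values, so past some depth extra levels cannot change anything. The sketch below checks this on synthetic data. `n_categories = 50` is a made-up number (the real count of `TYPE` values is not shown here), and `get_n_leaves()` assumes scikit-learn 0.21 or later.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n_categories = 50  # hypothetical number of distinct TYPE codes

# Each category gets its own mean; the target is that mean plus noise.
codes = rng.randint(0, n_categories, size=5000)
category_means = rng.uniform(0, 100, size=n_categories)
y = category_means[codes] + rng.normal(0, 10, size=codes.size)
X = codes.reshape(-1, 1)

leaf_counts = {}
for depth in [2, 5, 10, 20, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    leaf_counts[depth] = tree.get_n_leaves()
    print(depth, leaf_counts[depth], round(tree.score(X, y), 3))
```

A depth-`d` tree has at most $2^d$ leaves, so if the real data has on the order of a few hundred request types, depths far beyond roughly $\log_2$ of that count are unlikely to help unless the alphabetical code order forces unbalanced splits.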

## Trying GridSearchCV

In [33]:
params = {
    'decisiontreeregressor__splitter': ['best', 'random'],
    'decisiontreeregressor__max_depth': [5, 15, 26, 36, 47, 57, 68, 78, 89, 100]  # 10 evenly spaced depths from 5 to 100
}

grid_search = GridSearchCV(pipe, param_grid=params, n_jobs=-1, cv=5)

In [34]:
grid_search.fit(
    label_encoder.transform(X_train['TYPE']).reshape(-1, 1),  # single-column 2-D feature matrix
    y_train
)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   10.9s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('decisiontreeregressor', DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'decisiontreeregressor__max_depth': array([  5,  15,  26,  36,  47,  57,  68,  78,  89, 100]), 'decisiontreeregressor__splitter': ['random']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)

In [35]:
grid_search.best_params_

{'decisiontreeregressor__max_depth': 26,
 'decisiontreeregressor__splitter': 'random'}
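Besides `best_params_`, the fitted search object exposes `cv_results_`, which makes it possible to see how close the runner-up settings were. The sketch below uses a small synthetic dataset and a reduced grid (both made up for illustration), so only the pattern carries over, not the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the single encoded TYPE feature.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 30, size=(2000, 1))
y_demo = np.sin(X_demo[:, 0]) * 10 + rng.normal(0, 1, size=2000)

demo_search = GridSearchCV(
    make_pipeline(DecisionTreeRegressor(random_state=0)),
    param_grid={
        'decisiontreeregressor__splitter': ['best', 'random'],
        'decisiontreeregressor__max_depth': [2, 5, 10],
    },
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, with mean and std of the fold scores.
results = pd.DataFrame(demo_search.cv_results_)
cols = ['param_decisiontreeregressor__max_depth',
        'param_decisiontreeregressor__splitter',
        'mean_test_score', 'std_test_score']
print(results[cols].sort_values('mean_test_score', ascending=False))
```

If the top few rows differ by less than their `std_test_score`, the choice between them is probably not meaningful.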

Let's take the best-performing model and compare its training and test scores with the previous models.

In [39]:
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=300)
pipe = make_pipeline(DecisionTreeRegressor(splitter='random', max_depth=26))
# Alternatively, grid_search.best_estimator_ holds a pipeline with these
# parameters, already refit on the full training set (refit=True)

In [40]:
training_scores = []
test_scores = []

for CV_train_index, CV_test_index in ss.split(y_train):
    # Encode TYPE as integers and reshape into a single-column 2-D feature matrix
    X_CV_train = label_encoder.transform(X_train.iloc[CV_train_index]['TYPE']).reshape(-1, 1)
    X_CV_test = label_encoder.transform(X_train.iloc[CV_test_index]['TYPE']).reshape(-1, 1)
    y_CV_train = y_train.iloc[CV_train_index]
    y_CV_test = y_train.iloc[CV_test_index]

    pipe.fit(X_CV_train, y_CV_train)

    # For a regressor, score() returns R^2
    training_score = pipe.score(X_CV_train, y_CV_train)
    test_score = pipe.score(X_CV_test, y_CV_test)

    training_scores.append(training_score)
    test_scores.append(test_score)
    print("Training: {}. Test: {}".format(training_score, test_score))

Training: 0.254834348481. Test: 0.248655835452
Training: 0.254828274779. Test: 0.248929497224
Training: 0.254584520211. Test: 0.249920065375
Training: 0.254775321356. Test: 0.248635337174
Training: 0.25436498845. Test: 0.250574475084


Across all five folds, the training $R^2$ is about 0.255 and the test $R^2$ is about 0.25, slightly better than Linear Regression's 0.24, and better than the 0.21 from our un-optimized Decision Tree.
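To summarize the five folds in one line each, the mean and standard deviation of the scores can be computed directly. The values below are copied from the printed output above and rounded to six decimal places.

```python
import numpy as np

# R^2 values from the five ShuffleSplit folds printed above (rounded).
train_r2 = np.array([0.254834, 0.254828, 0.254585, 0.254775, 0.254365])
test_r2 = np.array([0.248656, 0.248929, 0.249920, 0.248635, 0.250574])

print("train: mean={:.4f}, std={:.4f}".format(train_r2.mean(), train_r2.std()))
print("test:  mean={:.4f}, std={:.4f}".format(test_r2.mean(), test_r2.std()))
print("gap:   {:.4f}".format(train_r2.mean() - test_r2.mean()))
```

The small spread across folds and the small train/test gap suggest this model is not overfitting much, which is plausible given that it only sees one feature.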

## Checking the RMSE
Linear regression had an RMSE of 1,881. I'm expecting this model's RMSE to be slightly better.

In [43]:
pipe.predict(label_encoder.transform(X_test.head()['TYPE']).reshape(-1, 1))

array([ 1338.01899432,     8.83411144,  2940.6833601 ,   116.96636007,
         495.57189188])

In [45]:
mse = mean_squared_error(
    y_test,
    pipe.predict(label_encoder.transform(X_test['TYPE']).reshape(-1, 1))
)

mse**0.5

1886.4624048482124

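The decision tree's RMSE is slightly higher (worse) than Linear Regression's, at 1,886. Note that `pipe` here was last fit on the final CV training split rather than on all of `X_train`.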
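As a reminder of what the RMSE measures, here is a minimal sketch that computes it by hand and via `mean_squared_error` on three made-up values (they are not taken from the dataset). Because errors are squared before averaging, a single large miss dominates the result.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical completion times and predictions, for illustration only.
y_true = np.array([10.0, 200.0, 3000.0])
y_pred = np.array([12.0, 180.0, 2500.0])

# RMSE = sqrt(mean((y - y_hat)^2)), in the same units as the target.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(rmse_manual, rmse_sklearn)
```

Here the error of 500 on the largest value accounts for nearly all of the RMSE, so a model can improve $R^2$ slightly on one split and still show a similar or worse RMSE on another if a few long completion times are predicted badly.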