## Summary

This notebook applies a random forest model to the problem of predicting customer value.

In [2]:
import time

import sys
sys.path.append('../../common_routines')

from relevant_functions import (get_train_data,
                                get_test_data,
                                get_all_predictor_cols,
                                get_rel_cols)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

import numpy as np

In [3]:
INPUT_DIR = '../../input/'

In [4]:
ts = time.time()
train = get_train_data(INPUT_DIR)
time.time() - ts

5.264724016189575

### Model using all predictors.

Let us construct a RandomForestRegressor using all available predictors and see how it performs.

In [6]:
all_predictor_cols = get_all_predictor_cols(train)

In [18]:
X = train[all_predictor_cols]
Y = train[['log_target']].values.ravel() 

In [21]:
ts = time.time()
cross_val_scores = cross_val_score(RandomForestRegressor(n_estimators=10), X, Y, cv=5, 
                                   scoring='neg_mean_squared_error')
time.time() - ts

93.69904685020447

In [20]:
print(np.sqrt(-cross_val_scores))
print(np.sqrt(-cross_val_scores.mean()))

[1.45926763 1.5459559  1.4605605  1.45519944 1.58377132]
1.5019064974970413
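
The second printed value is the square root of the mean per-fold MSE, which is not the same as the average of the per-fold RMSEs. A small sketch, using the fold values printed above (copied by hand; the variable names are illustrative), shows the difference:

```python
import numpy as np

# Per-fold RMSEs copied from the output above (n_estimators=10).
fold_rmse = np.array([1.45926763, 1.5459559, 1.4605605, 1.45519944, 1.58377132])

# cross_val_score with 'neg_mean_squared_error' returns negated MSEs,
# so rebuild them here by squaring the RMSEs and negating.
neg_mse = -(fold_rmse ** 2)

pooled_rmse = np.sqrt(-neg_mse.mean())     # the single number the notebook reports
mean_fold_rmse = np.sqrt(-neg_mse).mean()  # plain average of the per-fold RMSEs

print(pooled_rmse, mean_fold_rmse)
```

The pooled value is never smaller than the average of the per-fold RMSEs, so the two should not be mixed when comparing runs.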


In [22]:
ts = time.time()
cross_val_scores = cross_val_score(RandomForestRegressor(n_estimators=20), X, Y, cv=5, 
                                   scoring='neg_mean_squared_error')
time.time() - ts

182.92887115478516

In [23]:
print(np.sqrt(-cross_val_scores))
print(np.sqrt(-cross_val_scores.mean()))

[1.43088676 1.50581471 1.409959   1.45473708 1.52704562]
1.4663579048875914


In [24]:
ts = time.time()
cross_val_scores = cross_val_score(RandomForestRegressor(n_estimators=30), X, Y, cv=5, 
                                   scoring='neg_mean_squared_error')
time.time() - ts

273.5930070877075

In [25]:
print(np.sqrt(-cross_val_scores))
print(np.sqrt(-cross_val_scores.mean()))

[1.42211505 1.51459686 1.40415326 1.43688826 1.50210224]
1.45664136017895


In [26]:
ts = time.time()
cross_val_scores = cross_val_score(RandomForestRegressor(n_estimators=40), X, Y, cv=5, 
                                   scoring='neg_mean_squared_error')
time.time() - ts

377.41454696655273

In [27]:
print(np.sqrt(-cross_val_scores))
print(np.sqrt(-cross_val_scores.mean()))

[1.41280729 1.47719827 1.40252617 1.43266864 1.50997465]
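
The four runs above repeat the same cell with a different n_estimators. The sweep could be written as one loop, as in the sketch below. The synthetic data from make_regression and the helper name rmse_by_n_estimators are assumptions for illustration only; in the notebook, X and Y would take their place.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's X and Y (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

def rmse_by_n_estimators(X, y, tree_counts, cv=5, random_state=0):
    """Return {n_estimators: cross-validated RMSE} for each tree count."""
    results = {}
    for n in tree_counts:
        model = RandomForestRegressor(n_estimators=n, random_state=random_state)
        scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
        # Scores are negated MSEs; take the root of the mean MSE, as in the cells above.
        results[n] = float(np.sqrt(-scores.mean()))
    return results

rmse_by_trees = rmse_by_n_estimators(X_demo, y_demo, [10, 20])
print(rmse_by_trees)
```

Fixing random_state makes the runs repeatable, which the cells above do not do, so their scores will vary slightly between executions.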
1.447603654374002


### Fine-tuning parameters

The default values of the other parameters, such as max_depth and min_samples_leaf, look reasonable, so we leave them unchanged for now.
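
For reference, the defaults in question can be read straight from an unfitted estimator with get_params. This is only a quick inspection sketch; the exact defaults depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestRegressor

# Read the default hyperparameters from a freshly constructed estimator.
defaults = RandomForestRegressor().get_params()
for name in ('max_depth', 'min_samples_leaf', 'min_samples_split'):
    print(name, '=', defaults[name])
```

With max_depth left at None, each tree is grown until its leaves are pure or too small to split, so the trees are deep by default.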

### Training on denser columns

Let us see whether training the random forest regressor only on the denser columns gives a better result.
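
The helper get_rel_cols is imported from relevant_functions and its definition is not shown in this notebook. One plausible reading, and it is only an assumption, is that it keeps the columns whose share of non-zero values reaches a given percentage. The hypothetical dense_columns function below sketches that idea on a toy frame:

```python
import pandas as pd

def dense_columns(df, min_percent_nonzero, exclude=()):
    """Hypothetical filter: keep columns with at least the given percent of non-zero values."""
    candidates = [c for c in df.columns if c not in exclude]
    percent_nonzero = (df[candidates] != 0).mean() * 100.0
    return percent_nonzero[percent_nonzero >= min_percent_nonzero].index.tolist()

# Toy data for illustration only.
demo = pd.DataFrame({
    'a': [1, 0, 0, 0],   # 25% non-zero
    'b': [1, 2, 0, 3],   # 75% non-zero
    'c': [0, 0, 0, 0],   # 0% non-zero
    'log_target': [0.1, 0.2, 0.3, 0.4],
})

dense_demo_cols = dense_columns(demo, 20, exclude=('log_target',))
print(dense_demo_cols)
```

Raising the threshold drops more columns, which matches the direction of the experiment that follows.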

In [40]:
percentages_fill_of_data_cols = [0, 1, 2, 5, 10, 15, 20, 25, 30, 35]
#percentages_fill_of_data_cols = [2]
percent_to_cross_val_score = dict()

In [41]:
ts = time.time()
for percent in percentages_fill_of_data_cols:
    rel_cols  = get_rel_cols(percent, train)
    X = train[rel_cols]
    Y = train[['log_target']].values.ravel()
    cross_val_scores = cross_val_score(RandomForestRegressor(random_state=0, n_estimators=10), 
                                       X, 
                                       Y,
                                       cv=5, 
                                       scoring='neg_mean_squared_error')
    percent_to_cross_val_score[percent] = np.sqrt(-cross_val_scores.mean())

time.time() - ts

293.7400279045105

In [42]:
percent_to_cross_val_score

{0: 1.4886070358264116,
 1: 1.5063663807614536,
 2: 1.4955048213960929,
 5: 1.512796560770603,
 10: 1.518216323411931,
 15: 1.52604954208457,
 20: 1.537076621849582,
 25: 1.5373016475326733,
 30: 1.5373016475326733,
 35: 1.6951832984270436}

### Conclusion

It looks like the model performs best when it uses the maximum number of predictors: the cross-validated RMSE generally worsens as the fill threshold rises, from about 1.49 at 0 percent to about 1.70 at 35 percent. I do not see any further fine-tuning worth doing at this point. Hence, let us train the model on the entire training set and generate predictions on the test set.

In [48]:
ts = time.time()
X = train[all_predictor_cols]
Y = train[['log_target']].values.ravel()
my_model = RandomForestRegressor(random_state=0, n_estimators=200)
my_model.fit(X, Y)
time.time() - ts

498.724182844162

In [49]:
ts = time.time()
test = get_test_data(INPUT_DIR)
time.time() - ts

71.0817358493805

In [50]:
ts = time.time()
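new_X = test[all_predictor_cols]
test_log_predictions = my_model.predict(new_X)
# Clip negative log-scale predictions to zero so the final target is never negative.
test_log_predictions[test_log_predictions < 0] = 0

# Undo the log transform (assuming log_target = log(1 + target)).
test['target'] = np.exp(test_log_predictions) - 1.0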
time.time() - ts

8.671373128890991
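
The back-transformation in the cell above can be checked in isolation. The sketch below assumes that log_target was built as log(1 + target), which is what the exp(x) - 1 step suggests but which this notebook does not show. The function name log_predictions_to_target is hypothetical.

```python
import numpy as np

def log_predictions_to_target(log_predictions):
    """Clip negative log-scale predictions to zero, then invert log(1 + target)."""
    clipped = np.clip(log_predictions, 0.0, None)
    # np.expm1(x) computes exp(x) - 1 with better precision near zero.
    return np.expm1(clipped)

# Round trip on toy values: log1p followed by the inverse should return the input.
demo_target = np.array([0.0, 9.0, 99.0])
round_trip = log_predictions_to_target(np.log1p(demo_target))
print(round_trip)
```

Clipping before the inverse guarantees a non-negative target, since expm1 of any non-negative number is itself non-negative.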

In [51]:
test[['ID', 'target']].to_csv('submission_random_forest_sklearn.csv', index=False)