## Summary

This notebook applies gradient boosting, using the h2o package, to the problem of predicting customer value.

In [31]:
import time
import sys
sys.path.append('../../common_routines/')

from relevant_functions import (get_train_data,
                                get_test_data,
                                get_all_predictor_cols,
                                get_rel_cols)
import h2o
import numpy as np

In [2]:
INPUT_DIR = '../../input/'

In [3]:
ts = time.time()
train = get_train_data(INPUT_DIR)
time.time() - ts

5.704378128051758

In [4]:
train.drop(columns=['ID', 'target'], inplace=True)

### Start an h2o instance and build relevant h2o frames


In [6]:
h2o.init(nthreads=-1, max_mem_size=15)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "11.0.5" 2019-10-15 LTS; Java(TM) SE Runtime Environment 18.9 (build 11.0.5+10-LTS); Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.5+10-LTS, mixed mode)
  Starting server from /Users/babs4JESUS/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/cz/3nvpl4mj0g5ds3hlsc15wxdr0000gn/T/tmp4qlvh0xx
  JVM stdout: /var/folders/cz/3nvpl4mj0g5ds3hlsc15wxdr0000gn/T/tmp4qlvh0xx/h2o_babs4JESUS_started_from_python.out
  JVM stderr: /var/folders/cz/3nvpl4mj0g5ds3hlsc15wxdr0000gn/T/tmp4qlvh0xx/h2o_babs4JESUS_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 02 secs |
| H2O cluster timezone | America/New_York |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.26.0.10 |
| H2O cluster version age | 24 days |
| H2O cluster name | H2O_from_python_babs4JESUS_6u8a56 |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 15 Gb |
| H2O cluster total cores | 8 |
| H2O cluster allowed cores | 8 |


In [7]:
ts = time.time()
train_h2o = h2o.H2OFrame(train)
time.time() - ts

Parse progress: |█████████████████████████████████████████████████████████| 100%


15.05705189704895

### Build a model with all predictors.

Let us build a baseline model and look at its cross-validation results.

In [8]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [10]:
X_COLUMNS  = get_all_predictor_cols(train)
Y_COLUMN = 'log_target'

In [12]:
ts = time.time()
cross_val_model = H2OGradientBoostingEstimator(model_id='cross_val_model', seed=1, nfolds=2, max_depth=5, ntrees=50)
cross_val_model.train(x=X_COLUMNS, y=Y_COLUMN, training_frame=train_h2o)
time.time() - ts

gbm Model Build progress: |███████████████████████████████████████████████| 100%


88.71518898010254

In [13]:
cross_val_model

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  cross_val_model


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 50.0 | 50.0 | 14439.0 | 5.0 | 5.0 | 5.0 | 7.0 | 28.0 | 18.2 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1.7808523491659962
RMSE: 1.3344857995370338
MAE: 1.0595151148828377
RMSLE: 0.09016652770745107
Mean Residual Deviance: 1.7808523491659962

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 2.3185859998530303
RMSE: 1.5226903821371665
MAE: 1.226609234733006
RMSLE: 0.10228831454670782
Mean Residual Deviance: 2.3185859998530303

Cross-Validation Metrics Summary: 


| metric | mean | sd | cv_1_valid | cv_2_valid |
|---|---|---|---|---|
| mae | 1.2266421 | 0.0037743356 | 1.229311 | 1.2239733 |
| mean_residual_deviance | 2.318553 | 0.0037960603 | 2.3158686 | 2.321237 |
| mse | 2.318553 | 0.0037960603 | 2.3158686 | 2.321237 |
| r2 | 0.24340093 | 0.009393007 | 0.23675908 | 0.2500428 |
| residual_deviance | 2.318553 | 0.0037960603 | 2.3158686 | 2.321237 |
| rmse | 1.5226792 | 0.0012465069 | 1.5217979 | 1.5235606 |
| rmsle | 0.10228882 | 5.9027152e-05 | 0.10233056 | 0.10224708 |
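
To make the `mean` and `sd` columns of the summary concrete, the sketch below recomputes them for the `mse` row from the two per-fold values copied from the table. It assumes `sd` is the sample standard deviation across folds; that assumption is consistent with the numbers shown here but is not taken from the H2O documentation.

```python
import statistics

# Per-fold MSE values copied from the cross-validation summary above.
fold_mse = [2.3158686, 2.321237]

# Mean across the two validation folds.
mse_mean = statistics.mean(fold_mse)

# Sample standard deviation (n - 1 in the denominator); with only two
# folds this equals |a - b| / sqrt(2).
mse_sd = statistics.stdev(fold_mse)

print(f"mean = {mse_mean:.6f}, sd = {mse_sd:.7f}")
```

With only two folds the standard deviation is a very rough estimate of fold-to-fold variability, so small differences between runs should be read with caution.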



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2019-12-02 11:37:58 | 59.277 sec | 0.0 | 1.75075 | 1.438681 | 3.065127 |
| 2019-12-02 11:37:58 | 1 min 0.101 sec | 1.0 | 1.719974 | 1.415904 | 2.958311 |
| 2019-12-02 11:37:59 | 1 min 0.706 sec | 2.0 | 1.697094 | 1.397148 | 2.880127 |
| 2019-12-02 11:38:00 | 1 min 1.322 sec | 3.0 | 1.668388 | 1.373293 | 2.78352 |
| 2019-12-02 11:38:00 | 1 min 1.952 sec | 4.0 | 1.650176 | 1.357531 | 2.72308 |
| 2019-12-02 11:38:01 | 1 min 2.558 sec | 5.0 | 1.630022 | 1.340471 | 2.656971 |
| 2019-12-02 11:38:01 | 1 min 3.157 sec | 6.0 | 1.615083 | 1.326749 | 2.608494 |
| 2019-12-02 11:38:06 | 1 min 7.353 sec | 13.0 | 1.52058 | 1.23688 | 2.312164 |
| 2019-12-02 11:38:10 | 1 min 11.440 sec | 20.0 | 1.45035 | 1.171573 | 2.103515 |
| 2019-12-02 11:38:14 | 1 min 15.464 sec | 27.0 | 1.407056 | 1.129327 | 1.979807 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f190486d6 | 5405.516113 | 1.0 | 0.179344 |
| 58e2e02e6 | 2186.076172 | 0.404416 | 0.07253 |
| eeb9cd3aa | 1392.38208 | 0.257585 | 0.046196 |
| 15ace8c9f | 1139.797363 | 0.210858 | 0.037816 |
| 9fd594eec | 1009.970459 | 0.186841 | 0.033509 |
| 58232a6fb | 846.819702 | 0.156658 | 0.028096 |
| 1702b5bf0 | 683.994751 | 0.126536 | 0.022694 |
| 6eef030c1 | 651.400085 | 0.120507 | 0.021612 |
| 20aa07010 | 609.879211 | 0.112825 | 0.020235 |
| f514fdb2e | 504.996796 | 0.093422 | 0.016755 |



See the whole table with table.as_data_frame()




In [14]:
ts = time.time()
cross_val_model = H2OGradientBoostingEstimator(model_id='cross_val_model', seed=1, nfolds=2, max_depth=5, ntrees=100)
cross_val_model.train(x=X_COLUMNS, y=Y_COLUMN, training_frame=train_h2o)
time.time() - ts

gbm Model Build progress: |███████████████████████████████████████████████| 100%


182.18479800224304

In [15]:
cross_val_model

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  cross_val_model


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 100.0 | 100.0 | 25820.0 | 5.0 | 5.0 | 5.0 | 6.0 | 28.0 | 15.77 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1.5139818970277212
RMSE: 1.2304397169417611
MAE: 0.961454532210082
RMSLE: 0.08319628149549935
Mean Residual Deviance: 1.5139818970277212

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 2.2984272652199564
RMSE: 1.516056484838199
MAE: 1.2141520319978834
RMSLE: 0.10171707243355507
Mean Residual Deviance: 2.2984272652199564

Cross-Validation Metrics Summary: 


| metric | mean | sd | cv_1_valid | cv_2_valid |
|---|---|---|---|---|
| mae | 1.2141579 | 0.0006822369 | 1.2146404 | 1.2136756 |
| mean_residual_deviance | 2.298327 | 0.01150892 | 2.2901888 | 2.306465 |
| mse | 2.298327 | 0.01150892 | 2.2901888 | 2.306465 |
| r2 | 0.25001892 | 0.0067833415 | 0.24522237 | 0.25481546 |
| residual_deviance | 2.298327 | 0.01150892 | 2.2901888 | 2.306465 |
| rmse | 1.516021 | 0.0037957653 | 1.513337 | 1.518705 |
| rmsle | 0.10171634 | 8.2094884e-05 | 0.10165829 | 0.10177439 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2019-12-02 11:41:18 | 2 min 4.346 sec | 0.0 | 1.75075 | 1.438681 | 3.065127 |
| 2019-12-02 11:41:19 | 2 min 5.258 sec | 1.0 | 1.719974 | 1.415904 | 2.958311 |
| 2019-12-02 11:41:20 | 2 min 5.870 sec | 2.0 | 1.697094 | 1.397148 | 2.880127 |
| 2019-12-02 11:41:21 | 2 min 6.542 sec | 3.0 | 1.668388 | 1.373293 | 2.78352 |
| 2019-12-02 11:41:21 | 2 min 7.172 sec | 4.0 | 1.650176 | 1.357531 | 2.72308 |
| 2019-12-02 11:41:22 | 2 min 7.771 sec | 5.0 | 1.630022 | 1.340471 | 2.656971 |
| 2019-12-02 11:41:26 | 2 min 12.415 sec | 12.0 | 1.531287 | 1.247027 | 2.344841 |
| 2019-12-02 11:41:31 | 2 min 16.718 sec | 18.0 | 1.47108 | 1.19149 | 2.164077 |
| 2019-12-02 11:41:35 | 2 min 20.905 sec | 25.0 | 1.416229 | 1.13849 | 2.005705 |
| 2019-12-02 11:41:39 | 2 min 24.922 sec | 32.0 | 1.384825 | 1.107995 | 1.917741 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f190486d6 | 5491.446777 | 1.0 | 0.150841 |
| 58e2e02e6 | 2243.213623 | 0.408492 | 0.061617 |
| eeb9cd3aa | 1395.883545 | 0.254192 | 0.038343 |
| 15ace8c9f | 1163.8302 | 0.211935 | 0.031969 |
| 9fd594eec | 1114.294434 | 0.202915 | 0.030608 |
| 58232a6fb | 870.31842 | 0.158486 | 0.023906 |
| 1702b5bf0 | 700.070251 | 0.127484 | 0.01923 |
| 6eef030c1 | 686.174377 | 0.124953 | 0.018848 |
| 20aa07010 | 620.95752 | 0.113077 | 0.017057 |
| 491b9ee45 | 511.703705 | 0.093182 | 0.014056 |



See the whole table with table.as_data_frame()




In [16]:
ts = time.time()
cross_val_model = H2OGradientBoostingEstimator(model_id='cross_val_model', seed=1, nfolds=2, max_depth=7, ntrees=100)
cross_val_model.train(x=X_COLUMNS, y=Y_COLUMN, training_frame=train_h2o)
time.time() - ts

gbm Model Build progress: |███████████████████████████████████████████████| 100%


222.59902119636536

In [17]:
cross_val_model

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  cross_val_model


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 100.0 | 100.0 | 42978.0 | 7.0 | 7.0 | 7.0 | 9.0 | 58.0 | 29.19 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1.2415494493812296
RMSE: 1.1142483786756119
MAE: 0.8440805115649507
RMSLE: 0.07547570613780785
Mean Residual Deviance: 1.2415494493812296

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 2.293171443701004
RMSE: 1.5143221069841792
MAE: 1.2090826627334046
RMSLE: 0.1014516281998931
Mean Residual Deviance: 2.293171443701004

Cross-Validation Metrics Summary: 


| metric | mean | sd | cv_1_valid | cv_2_valid |
|---|---|---|---|---|
| mae | 1.209023 | 0.0068453727 | 1.2041825 | 1.2138634 |
| mean_residual_deviance | 2.2927306 | 0.050550405 | 2.256986 | 2.328475 |
| mse | 2.2927306 | 0.050550405 | 2.256986 | 2.328475 |
| r2 | 0.25193468 | 0.0059826346 | 0.25616503 | 0.24770431 |
| residual_deviance | 2.2927306 | 0.050550405 | 2.256986 | 2.328475 |
| rmse | 1.5141305 | 0.016692882 | 1.5023268 | 1.5259342 |
| rmsle | 0.10143946 | 0.0010708619 | 0.10068225 | 0.10219668 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2019-12-02 11:45:36 | 2 min 24.216 sec | 0.0 | 1.75075 | 1.438681 | 3.065127 |
| 2019-12-02 11:45:37 | 2 min 25.325 sec | 1.0 | 1.709734 | 1.407883 | 2.923189 |
| 2019-12-02 11:45:38 | 2 min 26.182 sec | 2.0 | 1.682019 | 1.385716 | 2.829188 |
| 2019-12-02 11:45:39 | 2 min 27.110 sec | 3.0 | 1.65016 | 1.359121 | 2.723029 |
| 2019-12-02 11:45:40 | 2 min 28.044 sec | 4.0 | 1.618128 | 1.332239 | 2.618338 |
| 2019-12-02 11:45:44 | 2 min 32.281 sec | 9.0 | 1.511893 | 1.236054 | 2.285819 |
| 2019-12-02 11:45:48 | 2 min 36.313 sec | 14.0 | 1.438257 | 1.165441 | 2.068583 |
| 2019-12-02 11:45:52 | 2 min 40.403 sec | 17.0 | 1.406821 | 1.134943 | 1.979145 |
| 2019-12-02 11:45:57 | 2 min 45.112 sec | 23.0 | 1.358132 | 1.085399 | 1.844524 |
| 2019-12-02 11:46:02 | 2 min 49.896 sec | 29.0 | 1.319701 | 1.045702 | 1.741611 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f190486d6 | 5458.003906 | 1.0 | 0.126988 |
| 58e2e02e6 | 2638.195068 | 0.483363 | 0.061381 |
| 9fd594eec | 2027.576294 | 0.371487 | 0.047174 |
| eeb9cd3aa | 1218.794067 | 0.223304 | 0.028357 |
| 15ace8c9f | 888.589172 | 0.162805 | 0.020674 |
| 58232a6fb | 703.36261 | 0.128868 | 0.016365 |
| 2288333b4 | 624.84845 | 0.114483 | 0.014538 |
| 1702b5bf0 | 570.123413 | 0.104456 | 0.013265 |
| 6eef030c1 | 542.253784 | 0.09935 | 0.012616 |
| 20aa07010 | 508.855438 | 0.093231 | 0.011839 |



See the whole table with table.as_data_frame()




In [18]:
ts = time.time()
cross_val_model = H2OGradientBoostingEstimator(model_id='cross_val_model', seed=1, nfolds=2, max_depth=7, ntrees=150)
cross_val_model.train(x=X_COLUMNS, y=Y_COLUMN, training_frame=train_h2o)
time.time() - ts

gbm Model Build progress: |███████████████████████████████████████████████| 100%


337.2153468132019

In [19]:
cross_val_model

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  cross_val_model


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 150.0 | 150.0 | 59238.0 | 7.0 | 7.0 | 7.0 | 8.0 | 58.0 | 26.42 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1.0778218085002096
RMSE: 1.0381819727293522
MAE: 0.7711573618093022
RMSLE: 0.07039231465041787
Mean Residual Deviance: 1.0778218085002096

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 2.2947532782754685
RMSE: 1.5148443082625582
MAE: 1.2074094197648644
RMSLE: 0.10143854322413627
Mean Residual Deviance: 2.2947532782754685

Cross-Validation Metrics Summary: 


| metric | mean | sd | cv_1_valid | cv_2_valid |
|---|---|---|---|---|
| mae | 1.2073301 | 0.009088827 | 1.2009034 | 1.2137569 |
| mean_residual_deviance | 2.2943168 | 0.0500595 | 2.2589192 | 2.329714 |
| mse | 2.2943168 | 0.0500595 | 2.2589192 | 2.329714 |
| r2 | 0.25141597 | 0.005815165 | 0.25552788 | 0.247304 |
| residual_deviance | 2.2943168 | 0.0500595 | 2.2589192 | 2.329714 |
| rmse | 1.5146551 | 0.016525049 | 1.5029701 | 1.5263401 |
| rmsle | 0.10142666 | 0.0010505634 | 0.1006838 | 0.10216952 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2019-12-02 11:51:58 | 3 min 50.379 sec | 0.0 | 1.75075 | 1.438681 | 3.065127 |
| 2019-12-02 11:51:59 | 3 min 51.383 sec | 1.0 | 1.709734 | 1.407883 | 2.923189 |
| 2019-12-02 11:52:00 | 3 min 52.116 sec | 2.0 | 1.682019 | 1.385716 | 2.829188 |
| 2019-12-02 11:52:01 | 3 min 52.902 sec | 3.0 | 1.65016 | 1.359121 | 2.723029 |
| 2019-12-02 11:52:02 | 3 min 53.710 sec | 4.0 | 1.618128 | 1.332239 | 2.618338 |
| 2019-12-02 11:52:06 | 3 min 58.221 sec | 10.0 | 1.498233 | 1.222307 | 2.244703 |
| 2019-12-02 11:52:11 | 4 min 2.678 sec | 16.0 | 1.415714 | 1.143557 | 2.004247 |
| 2019-12-02 11:52:15 | 4 min 7.097 sec | 22.0 | 1.361817 | 1.089266 | 1.854544 |
| 2019-12-02 11:52:19 | 4 min 11.308 sec | 28.0 | 1.32401 | 1.050306 | 1.753003 |
| 2019-12-02 11:52:24 | 4 min 15.820 sec | 34.0 | 1.296105 | 1.022044 | 1.679888 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f190486d6 | 5495.141602 | 1.0 | 0.117318 |
| 58e2e02e6 | 2670.307129 | 0.48594 | 0.057009 |
| 9fd594eec | 2045.2677 | 0.372196 | 0.043665 |
| eeb9cd3aa | 1243.671631 | 0.226322 | 0.026552 |
| 15ace8c9f | 932.321899 | 0.169663 | 0.019904 |
| 58232a6fb | 731.685791 | 0.133151 | 0.015621 |
| 2288333b4 | 627.647827 | 0.114219 | 0.0134 |
| 1702b5bf0 | 596.453613 | 0.108542 | 0.012734 |
| 6eef030c1 | 571.546814 | 0.104009 | 0.012202 |
| 20aa07010 | 521.569031 | 0.094915 | 0.011135 |



See the whole table with table.as_data_frame()




In [25]:
ts = time.time()
cross_val_model = H2OGradientBoostingEstimator(model_id='cross_val_model', 
                                               seed=1, 
                                               nfolds=2, 
                                               max_depth=7, 
                                               learn_rate=0.01,
                                               ntrees=150)
cross_val_model.train(x=X_COLUMNS, y=Y_COLUMN, training_frame=train_h2o)
time.time() - ts

gbm Model Build progress: |███████████████████████████████████████████████| 100%


1646.698215007782

In [26]:
cross_val_model

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  cross_val_model


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 150.0 | 150.0 | 96502.0 | 7.0 | 7.0 | 7.0 | 27.0 | 61.0 | 46.02 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.0428042820369052
RMSE: 1.4292670436405175
MAE: 1.1580878510339347
RMSLE: 0.09661482651658675
Mean Residual Deviance: 2.0428042820369052

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 2.440520566289461
RMSE: 1.562216555503577
MAE: 1.2756068927687714
RMSLE: 0.10509527607688122
Mean Residual Deviance: 2.440520566289461

Cross-Validation Metrics Summary: 


| metric | mean | sd | cv_1_valid | cv_2_valid |
|---|---|---|---|---|
| mae | 1.2755895 | 0.0020020644 | 1.2741737 | 1.2770051 |
| mean_residual_deviance | 2.440189 | 0.038041506 | 2.4132893 | 2.4670882 |
| mse | 2.440189 | 0.038041506 | 2.4132893 | 2.4670882 |
| r2 | 0.2037863 | 0.0012244628 | 0.20465213 | 0.20292048 |
| residual_deviance | 2.440189 | 0.038041506 | 2.4132893 | 2.4670882 |
| rmse | 1.5620866 | 0.0121765025 | 1.5534766 | 1.5706967 |
| rmsle | 0.105084404 | 0.0009832624 | 0.10438913 | 0.10577967 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2019-12-02 12:06:23 | 4 min 0.337 sec | 0.0 | 1.75075 | 1.438681 | 3.065127 |
| 2019-12-02 12:06:24 | 4 min 1.357 sec | 1.0 | 1.7465 | 1.435446 | 3.050261 |
| 2019-12-02 12:06:25 | 4 min 2.156 sec | 2.0 | 1.742263 | 1.432208 | 3.03548 |
| 2019-12-02 12:06:25 | 4 min 2.956 sec | 3.0 | 1.73815 | 1.429109 | 3.021167 |
| 2019-12-02 12:06:26 | 4 min 3.755 sec | 4.0 | 1.734401 | 1.426308 | 3.008146 |
| 2019-12-02 12:06:31 | 4 min 8.219 sec | 10.0 | 1.713828 | 1.410512 | 2.937206 |
| 2019-12-02 12:06:35 | 4 min 12.704 sec | 16.0 | 1.694705 | 1.395418 | 2.872026 |
| 2019-12-02 12:06:40 | 4 min 17.177 sec | 22.0 | 1.677683 | 1.381528 | 2.81462 |
| 2019-12-02 12:06:44 | 4 min 21.750 sec | 28.0 | 1.658171 | 1.365583 | 2.749531 |
| 2019-12-02 12:06:49 | 4 min 26.388 sec | 34.0 | 1.639299 | 1.349818 | 2.687301 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f190486d6 | 53523.726562 | 1.0 | 0.233322 |
| 58e2e02e6 | 22502.982422 | 0.42043 | 0.098096 |
| eeb9cd3aa | 16023.227539 | 0.299367 | 0.069849 |
| 9fd594eec | 14319.862305 | 0.267542 | 0.062424 |
| 2288333b4 | 4876.493652 | 0.091109 | 0.021258 |
| 15ace8c9f | 4853.660156 | 0.090682 | 0.021158 |
| 20aa07010 | 3590.814697 | 0.067088 | 0.015653 |
| f514fdb2e | 3123.293945 | 0.058353 | 0.013615 |
| ced6a7e91 | 2624.224121 | 0.049029 | 0.01144 |
| 715fa74a4 | 2547.499023 | 0.047596 | 0.011105 |



See the whole table with table.as_data_frame()




### Plateauing performance

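The cross-validated RMSE is plateauing at around 1.51. Going from max_depth=5, ntrees=100 (1.516) to max_depth=7 with 100 or 150 trees (1.514 and 1.515) barely changes it, while the training RMSE keeps falling (1.23 to 1.04), so the extra capacity mostly fits the training data more closely. Lowering learn_rate to 0.01 with 150 trees is worse (1.562). Hence let us build a model with max_depth=7 and ntrees=100 over the entire training set and use it to generate predictions on the test data.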
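
As a quick check of the plateau, the sketch below tabulates the training and cross-validated RMSE of the five runs, with values copied (rounded to four decimals) from the outputs above. The `runs` list and the `"default"` label for runs that did not set `learn_rate` are illustrative names, not part of the notebook.

```python
# RMSE values copied (rounded) from the five cross-validation runs above.
# "default" marks runs where learn_rate was not passed to the estimator.
runs = [
    {"max_depth": 5, "ntrees": 50,  "learn_rate": "default", "train_rmse": 1.3345, "cv_rmse": 1.5227},
    {"max_depth": 5, "ntrees": 100, "learn_rate": "default", "train_rmse": 1.2304, "cv_rmse": 1.5161},
    {"max_depth": 7, "ntrees": 100, "learn_rate": "default", "train_rmse": 1.1142, "cv_rmse": 1.5143},
    {"max_depth": 7, "ntrees": 150, "learn_rate": "default", "train_rmse": 1.0382, "cv_rmse": 1.5148},
    {"max_depth": 7, "ntrees": 150, "learn_rate": "0.01",    "train_rmse": 1.4293, "cv_rmse": 1.5622},
]

# Configuration with the lowest cross-validated RMSE.
best = min(runs, key=lambda r: r["cv_rmse"])

# Spread of cross-validated RMSE among the runs that kept the default learn_rate.
default_cv = [r["cv_rmse"] for r in runs if r["learn_rate"] == "default"]
default_spread = max(default_cv) - min(default_cv)

for r in runs:
    gap = r["cv_rmse"] - r["train_rmse"]  # a growing gap suggests overfitting
    print(f"depth={r['max_depth']} ntrees={r['ntrees']:>3} lr={r['learn_rate']:>7} "
          f"train={r['train_rmse']:.4f} cv={r['cv_rmse']:.4f} gap={gap:.4f}")

print("best:", best["max_depth"], best["ntrees"], "spread:", round(default_spread, 4))
```

The differences between the default-learn_rate runs are smaller than the fold-to-fold standard deviation reported for the deeper models, so the ranking among them should not be over-interpreted.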

In [29]:
ts = time.time()
model_full_data = H2OGradientBoostingEstimator(model_id='model_full_data', seed=1, max_depth=7, ntrees=100)
model_full_data.train(x=X_COLUMNS, y=Y_COLUMN, training_frame=train_h2o)
time.time() - ts

gbm Model Build progress: |███████████████████████████████████████████████| 100%


75.62345480918884

In [34]:
ts = time.time()
test = get_test_data(INPUT_DIR)
time.time() - ts

74.46290802955627

In [36]:
ts = time.time()
test_prediction_out = model_full_data.predict(h2o.H2OFrame(test[X_COLUMNS]))
time.time() - ts

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


141.9186339378357

In [37]:
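ts = time.time()
# Pull the log-scale predictions out of the H2OFrame as a plain Python list.
test_log_predictions = test_prediction_out.as_data_frame()['predict'].values.tolist()
# Clip negative log-scale predictions to 0 so the back-transformed target is never negative.
test_log_predictions = [x if x > 0 else 0 for x in test_log_predictions]
# Undo the log transform: target = exp(log_target) - 1.
test['target'] = np.exp(test_log_predictions) - 1.0
time.time() - ts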

0.11169886589050293
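
The post-processing step above can be sketched in isolation as follows. This assumes `log_target` was built as log(1 + target), which is what the `np.exp(...) - 1.0` inverse in the cell suggests; `back_transform` is a hypothetical helper name, and it uses `math.expm1` from the standard library instead of NumPy.

```python
import math

def back_transform(log_predictions):
    """Clip negative log-scale predictions to 0, then invert log(1 + y)."""
    return [math.expm1(max(p, 0.0)) for p in log_predictions]

# A negative prediction, a zero prediction, and the log-scale value of 2500.
example = back_transform([-0.3, 0.0, math.log1p(2500.0)])
print(example)
```

Clipping before inverting guarantees a non-negative target, at the cost of mapping every negative log-scale prediction to exactly zero.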

In [38]:
test[['ID', 'target']].to_csv('submission_gradient_boosting_h2o.csv', index=False)