# Decision Tree

In the H2O version used here (3.24.0.3), there is no class dedicated to creating a single decision tree.

However, we can create a single decision tree with a Gradient Boosting Estimator by appropriately setting some parameters:
- ntrees = 1 (obviously)
- sample_rate = 1: Use all of the data.
- col_sample_rate = 1: Use all of the features.

We now have two parameters to control the complexity:
- min_rows: The minimum number of observations allowed in a leaf. A split is only made if each resulting child keeps at least this many observations.
- max_depth: The maximum depth to which the tree can be grown.

There is no cost-complexity pruning with a cp parameter (as there is with rpart in R). So we will simply fix max_depth to be quite large and then try different values for min_rows.

The best tree is the one with the highest AUC, estimated by cross-validation with 5 folds.
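AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that definition on toy data follows. The function name `auc_by_pairs` and the toy arrays are illustrative and are not part of this notebook or of H2O, and the all-pairs comparison is only practical for small inputs.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_toy, s_toy))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.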


**Import packages**

I am not familiar with Colab, but some students last year used it. At that time it did not have H2O preinstalled, but they were able to install it quite simply with the following code:

 - !apt-get install default-jre
 - !java -version
 - !pip install h2o

In [1]:
import os
import numpy as np
import pandas as pd
import pickle


In [2]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

In [3]:
# Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData   = "../PData/"
dirPOutput = "../POutput/"

/home/jovyan/Projects/final_assignment/PCode


In [4]:
f_name = dirPData + '01_df_250k.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

df_train = dict_['df_train']
df_test  = dict_['df_test']

del f_name, dict_

In [5]:
f_name = dirPData + '01_vars.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

vars_ind_numeric     = dict_['vars_ind_numeric']
vars_ind_hccv        = dict_['vars_ind_hccv']
vars_ind_categorical = dict_['vars_ind_categorical']
vars_notToUse        = dict_['vars_notToUse']
var_dep              = dict_['var_dep']

del f_name, dict_

**Start the h2o JVM and load our data if it is not already there**

In [6]:
h2o.init(port=54321)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_212"; OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03); OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpuudw3b_y
  JVM stdout: /tmp/tmpuudw3b_y/h2o_jovyan_started_from_python.out
  JVM stderr: /tmp/tmpuudw3b_y/h2o_jovyan_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| | |
|---|---|
| H2O cluster uptime: | 03 secs |
| H2O cluster timezone: | Etc/UTC |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.24.0.3 |
| H2O cluster version age: | 1 year, 2 months and 6 days !!! |
| H2O cluster name: | H2O_from_python_jovyan_pc65jn |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.257 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |


In [7]:
#remove all data loaded in JVM
# for key in h2o.ls()['key']:
#     h2o.remove(key)

In [8]:
# Is df_train already in the JVM?
h2o.ls()

Unnamed: 0,key


In [9]:
# If it is, then just create a handle:
#h2o_df_train = h2o.get_frame('df_train')

In [10]:
# Otherwise run this code.
h2o_df_train = h2o.H2OFrame(df_train[vars_ind_numeric + vars_ind_categorical + var_dep],
                           destination_frame='df_train')

Parse progress: |█████████████████████████████████████████████████████████| 100%


H2O needs the target to be an enum (categorical) type for classification. With an integer target it would treat the problem as a regression, so we convert the target to a factor.

In [11]:
h2o_df_train[var_dep].types

{'target': 'int'}

In [12]:
h2o_df_train[var_dep] = h2o_df_train[var_dep].asfactor()
h2o_df_train[var_dep].types

{'target': 'enum'}

**Define the features we will use**

In this quick notebook, we ignore the hccv's. For the main assignment you should deal sensibly with them.

In [13]:
# Need some proper way to deal with hccv (e.g. target encoding).  For now just remove.
features = vars_ind_categorical + vars_ind_numeric
features = [var for var in features if var not in vars_ind_hccv]

**GridSearch**


I have more or less randomly chosen the list below to search for min leaf size.

In [14]:
[2**idx for idx in 7+np.arange(10)]

[128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
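A rough way to read these values: if every leaf must hold at least min_rows observations, a tree fitted on n rows can have at most n // min_rows leaves. The sketch below works out this bound for the grid, assuming 250,000 training rows (the size of this data set) and binary splits. `max_leaves` is a hypothetical helper written for illustration, not an H2O function.

```python
def max_leaves(n_rows, min_rows, max_depth=20):
    """Upper bound on the number of leaves of a binary tree.

    Each leaf needs at least `min_rows` observations, and a binary tree of
    depth `max_depth` has at most 2 ** max_depth leaves.
    """
    by_rows = n_rows // min_rows
    by_depth = 2 ** max_depth
    return max(1, min(by_rows, by_depth))

n_train = 250_000  # assumption: rows in the training frame used here
for min_rows in [2 ** k for k in range(7, 17)]:
    print(min_rows, max_leaves(n_train, min_rows))
```

With min_rows = 65536 the bound is only 3 leaves, so the largest values in the grid force very small trees. During 5-fold cross-validation each model sees roughly 200,000 rows, which makes the bound slightly tighter still.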

In [15]:
hyper_params = {'min_rows' : [2**idx for idx in 7+np.arange(10)]} 
search_criteria = {'strategy': "Cartesian"}

In [16]:
grid_dt = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        seed = 2020,   
                        nfolds = 5,
                        ntrees = 1,
                        max_depth = 20,
                        #min_rows = 1,
                        sample_rate = 1,
                        col_sample_rate = 1,
                        ),
                    grid_id = 'grid_dt',
                    search_criteria = search_criteria,
                    hyper_params = hyper_params)

In [17]:
grid_dt.train(x=features,
              y= 'target',
              training_frame=h2o_df_train,
              seed=2020)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [18]:
grid_dt = grid_dt.get_grid(sort_by='auc', decreasing=True)
df_perf_auc = grid_dt.sorted_metric_table()
df_perf_auc.head(10)

| | min_rows | model_ids | auc |
|---|---|---|---|
| 0 | 256.0 | grid_dt_model_2 | 0.7388888093104397 |
| 1 | 128.0 | grid_dt_model_1 | 0.7388733476630628 |
| 2 | 512.0 | grid_dt_model_3 | 0.7350174964254721 |
| 3 | 1024.0 | grid_dt_model_4 | 0.7246645448035562 |
| 4 | 2048.0 | grid_dt_model_5 | 0.7138846337796211 |
| 5 | 4096.0 | grid_dt_model_6 | 0.7025770576720012 |
| 6 | 8192.0 | grid_dt_model_7 | 0.6907673391093535 |
| 7 | 16384.0 | grid_dt_model_8 | 0.6629129630702371 |
| 8 | 32768.0 | grid_dt_model_9 | 0.6011916855332662 |
| 9 | 65536.0 | grid_dt_model_10 | 0.5455296151905261 |


**Best Decision Tree**

 - The Decision Tree with the highest AUC score was the one with min_rows = 256 (max_depth was fixed at 20).
 - The cross-validated AUC of this model is about 0.739 (first row of the grid table above).
 - Now train once on the full data with this setting
 - Then predict on test
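The choice of min_rows can also be read off programmatically rather than by eye. The sketch below rebuilds the grid results from the table above as a plain pandas DataFrame (the name `grid_results` is illustrative) and picks the row with the highest AUC. In the notebook itself one would work from `df_perf_auc`, whose column types may differ between H2O versions.

```python
import pandas as pd

# Values copied from the grid search table above.
grid_results = pd.DataFrame({
    "min_rows": [256, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536],
    "auc": [0.7388888093104397, 0.7388733476630628, 0.7350174964254721,
            0.7246645448035562, 0.7138846337796211, 0.7025770576720012,
            0.6907673391093535, 0.6629129630702371, 0.6011916855332662,
            0.5455296151905261],
})

# Rank by AUC and take the top row as the chosen setting.
ranked = grid_results.sort_values("auc", ascending=False).reset_index(drop=True)
best_min_rows = int(ranked.loc[0, "min_rows"])
best_auc = float(ranked.loc[0, "auc"])
gap_to_runner_up = best_auc - float(ranked.loc[1, "auc"])

print(best_min_rows, round(best_auc, 4), gap_to_runner_up)
```

The gap between min_rows = 256 and min_rows = 128 is in the fifth decimal place, so the two settings are practically tied.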



In [19]:
model_dt = H2OGradientBoostingEstimator(
                        model_id = 'model_dt',
                        seed = 2020,   
                        sample_rate = 1,
                        col_sample_rate = 1,
                        ntrees = 1,
                        min_rows = 256,
                        max_depth = 20
                        )

In [20]:
model_dt.train(x=features,
               y='target',
               training_frame = h2o_df_train,
               )

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [21]:
model_dt

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  model_dt


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.24001580625489133
RMSE: 0.48991408048237534
LogLoss: 0.6731428655995333
Mean Per-Class Error: 0.31219079716152864
AUC: 0.7617381057846201
pr_auc: 0.7505776106968719
Gini: 0.5234762115692402
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.47823293876597356: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 58709.0 | 68460.0 | 0.5383 | (68460.0/127169.0) |
| 1 | 16737.0 | 106094.0 | 0.1363 | (16737.0/122831.0) |
| Total | 75446.0 | 174554.0 | 0.3408 | (85197.0/250000.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4782329 | 0.7135128 | 275.0 |
| max f2 | 0.4594794 | 0.8381674 | 358.0 |
| max f0point5 | 0.4969149 | 0.6878134 | 173.0 |
| max accuracy | 0.4926204 | 0.687896 | 199.0 |
| max precision | 0.5421061 | 1.0 | 0.0 |
| max recall | 0.4435804 | 1.0 | 398.0 |
| max specificity | 0.5421061 | 1.0 | 0.0 |
| max absolute_mcc | 0.4906506 | 0.3755764 | 210.0 |
| max min_per_class_accuracy | 0.4915839 | 0.6869520 | 205.0 |


Gains/Lift Table: Avg response rate: 49.13 %, avg score: 49.13 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.010092 | 0.5393555 | 2.0103090 | 2.0103090 | 0.9877130 | 0.5408853 | 0.9877130 | 0.5408853 | 0.0202880 | 0.0202880 | 101.0308961 | 101.0308961 |
| 2 | 0.020624 | 0.5375537 | 1.9533785 | 1.9812364 | 0.9597417 | 0.5381050 | 0.9734290 | 0.5394655 | 0.0205730 | 0.0408610 | 95.3378503 | 98.1236444 |
| 3 | 0.030236 | 0.5355548 | 1.9167382 | 1.9607325 | 0.9417395 | 0.5363142 | 0.9633549 | 0.5384637 | 0.0184237 | 0.0592847 | 91.6738226 | 96.0732513 |
| 4 | 0.040432 | 0.5325417 | 1.8724276 | 1.9384641 | 0.9199686 | 0.5341472 | 0.9524139 | 0.5373752 | 0.0190913 | 0.0783760 | 87.2427594 | 93.8464088 |
| 5 | 0.051108 | 0.5305422 | 1.8156948 | 1.9128187 | 0.8920944 | 0.5313709 | 0.9398137 | 0.5361209 | 0.0193844 | 0.0977603 | 81.5694770 | 91.2818685 |
| 6 | 0.101624 | 0.5231772 | 1.7118644 | 1.8129269 | 0.8410801 | 0.5262847 | 0.8907345 | 0.5312315 | 0.0864765 | 0.1842369 | 71.1864378 | 81.2926851 |
| 7 | 0.150092 | 0.5158770 | 1.5817924 | 1.7382885 | 0.7771726 | 0.5199058 | 0.8540628 | 0.5275741 | 0.0766663 | 0.2609032 | 58.1792397 | 73.8288464 |
| 8 | 0.200988 | 0.5111226 | 1.4468281 | 1.6644822 | 0.7108614 | 0.5132802 | 0.8178001 | 0.5239545 | 0.0736378 | 0.3345410 | 44.6828090 | 66.4482231 |
| 9 | 0.301068 | 0.5038509 | 1.3302772 | 1.5533869 | 0.6535971 | 0.5075546 | 0.7632163 | 0.5185029 | 0.1331341 | 0.4676751 | 33.0277215 | 55.3386937 |



Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|
| 2020-07-13 17:06:59 | 0.016 sec | 0.0 | 0.4999247 | 0.6929966 | 0.5 | 0.0 | 1.0 | 0.508676 |
| 2020-07-13 17:07:02 | 3.396 sec | 1.0 | 0.4899141 | 0.6731429 | 0.7617381 | 0.7505776 | 2.0103090 | 0.340788 |


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f09 | 2706.9526367 | 1.0 | 0.2073145 |
| e20 | 1624.2966309 | 0.6000462 | 0.1243983 |
| f27 | 1418.1507568 | 0.5238920 | 0.1086104 |
| f13 | 1347.3355713 | 0.4977315 | 0.1031870 |
| e11 | 1020.5000610 | 0.3769922 | 0.0781560 |
| ... | ... | ... | ... |
| e24 | 0.0 | 0.0 | 0.0 |
| e25 | 0.0 | 0.0 | 0.0 |
| f30 | 0.0 | 0.0 | 0.0 |



See the whole table with table.as_data_frame()
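Several of the figures reported above can be reproduced from one another, which is a useful sanity check when reading H2O output. The sketch below assumes the usual relation Gini = 2 * AUC - 1 and recomputes the error rates from the confusion matrix counts shown above. The variable names are illustrative.

```python
# Training AUC as reported in the model summary.
auc = 0.7617381057846201
gini = 2 * auc - 1

# Confusion matrix counts at the max-f1 threshold (rows: actual, columns: predicted).
tn, fp = 58709, 68460    # actual 0
fn, tp = 16737, 106094   # actual 1
n_total = tn + fp + fn + tp

err_class0 = fp / (tn + fp)         # share of actual 0s predicted as 1
err_class1 = fn / (fn + tp)         # share of actual 1s predicted as 0
err_total = (fp + fn) / n_total     # overall classification error
response_rate = (fn + tp) / n_total # share of actual 1s in the data

print(round(gini, 7))           # 0.5234762
print(round(err_class0, 4))     # 0.5383
print(round(err_class1, 4))     # 0.1363
print(round(err_total, 4))      # 0.3408
print(round(response_rate, 4))  # 0.4913
```

The overall error of 0.3408 matches the training_classification_error in the scoring history, and the response rate matches the 49.13 % quoted in the Gains/Lift header.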




**Create Predictions**

When H2O makes predictions, we get a warning telling us that some features in the test set contain factor levels that were not present in the training dataset. There is not a lot we can do about this. You should make sure you understand how H2O makes predictions in such a case.
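Before predicting, it can help to see which categorical columns are affected. The sketch below compares the levels in two pandas DataFrames and reports the test levels that never occur in training. The helper `unseen_levels` and the toy frames are hypothetical; in this notebook one would presumably call it with `df_train`, `df_test` and `vars_ind_categorical`.

```python
import pandas as pd

def unseen_levels(train, test, cat_cols):
    """Map each categorical column to the set of test levels absent from train."""
    result = {}
    for col in cat_cols:
        train_levels = set(train[col].dropna().unique())
        test_levels = set(test[col].dropna().unique())
        new_levels = test_levels - train_levels
        if new_levels:
            result[col] = new_levels
    return result

# Toy data: 'green' appears only in the test frame.
toy_train = pd.DataFrame({"colour": ["red", "blue", "red"], "size": ["S", "M", "L"]})
toy_test = pd.DataFrame({"colour": ["red", "green"], "size": ["M", "S"]})

print(unseen_levels(toy_train, toy_test, ["colour", "size"]))  # {'colour': {'green'}}
```

Columns that do not appear in the result have no new levels in the test data.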

In [22]:
h2o_df_test = h2o.H2OFrame(df_test[vars_ind_numeric + vars_ind_categorical],
                           destination_frame='df_test')

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [23]:
preds = model_dt.predict(h2o_df_test)
# There is no need to round your predictions


gbm prediction progress: |████████████████████████████████████████████████| 100%




In [24]:
# Column index 2 of the prediction frame holds the predicted probability of class 1
df_test['Predicted'] = np.round(preds[2].as_data_frame(), 5)
df_preds_dt = df_test[['unique_id', 'Predicted']].copy()
df_preds_dt.to_csv(dirPOutput + '04b_df_preds_dt_250k.csv', index=False)

Now you can submit 04b_df_preds_dt_250k.csv on Kaggle. You should get an AUROC of around 0.75.

**Note**

If you shut down your h2o JVM in this session, then any other open Python notebooks will also lose the JVM, since they all connect to the same JVM!

In [25]:
h2o.cluster().shutdown()

H2O session _sid_892e closed.
