# Decision Tree

H2O has no class dedicated to fitting a single decision tree.

However, we can create a single decision tree with a Gradient Boosting Estimator by appropriately setting some parameters:
- ntrees = 1 (obviously)
- sample_rate = 1: Use all of the data.
- col_sample_rate = 1: Use all of the features.

We now have two parameters to control the complexity:
- min_rows: The minimum number of observations required in a leaf. Larger values give smaller trees.
- max_depth: The maximum depth to which the tree can be grown.

There is no cost-complexity pruning with a cp parameter (as there is with rpart in R). So we simply fix max_depth to be quite large and then try different values for min_rows.

The best tree is the one with the highest AUC under 5-fold cross-validation.
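
As a reminder of what the selection metric measures, here is a minimal sketch of AUC written in plain Python. The function name `auc_score` and the toy data are hypothetical and only for illustration; H2O computes the metric itself during cross-validation, and this quadratic loop would be far too slow on real data.

```python
def auc_score(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y_true, y_score))  # 0.75
```

Because AUC depends only on the ordering of the scores, rescaling or rounding the predicted probabilities changes it only if ties are created.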


**Import packages**

I am not familiar with Colab, but some students used it last year. At that time it did not ship with H2O, but they were able to install it quite simply with the following commands:

 - !apt-get install default-jre
 - !java -version
 - !pip install h2o

In [1]:
import os
import numpy as np
import pandas as pd
import pickle


In [2]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

In [3]:
# Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData   = "../PData/"
dirPOutput = "../POutput/"

/home/jovyan/Projects/final_assignment/PCode


In [None]:
f_name = dirPData + '01_df_250k.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

df_train = dict_['df_train']
df_test  = dict_['df_test']

del f_name, dict_

In [None]:
f_name = dirPData + '01_vars.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

vars_ind_numeric     = dict_['vars_ind_numeric']
vars_ind_hccv        = dict_['vars_ind_hccv']
vars_ind_categorical = dict_['vars_ind_categorical']
vars_notToUse        = dict_['vars_notToUse']
var_dep              = dict_['var_dep']

del f_name, dict_

**Start the h2o JVM and load our data if it is not already there**

In [None]:
h2o.init(port=54321)

In [None]:
# Is df_train already in the JVM?
h2o.ls()

In [None]:
# If it is, then just create a handle:
#h2o_df_train = h2o.get_frame('df_train')

In [None]:
# Otherwise run this code.
h2o_df_train = h2o.H2OFrame(df_train[vars_ind_numeric + vars_ind_categorical + var_dep],
                           destination_frame='df_train')

H2O expects the target to be an enum (factor) type for classification. With a numeric target it fits a regression model instead, so we convert it explicitly.

In [None]:
h2o_df_train[var_dep].types

In [None]:
h2o_df_train[var_dep] = h2o_df_train[var_dep].asfactor()
h2o_df_train[var_dep].types

**Define the features we will use**

In this quick notebook, we ignore the hccv's.  For the main assignment you should deal sensibly with them.

In [None]:
# Need some proper way to deal with the hccv's (e.g. target encoding).  For now just remove them.
features = vars_ind_categorical + vars_ind_numeric
features = [var for var in features if var not in vars_ind_hccv]
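
Since target encoding is mentioned as one option for the hccv's, here is a minimal sketch of the idea in plain Python. It is not what this notebook does. The function names, the `smoothing` parameter, and the toy data are all hypothetical, and a real implementation should fit the encoding inside each cross-validation fold to avoid leaking the target.

```python
from collections import defaultdict


def fit_target_encoding(levels, targets, smoothing=20.0):
    """Map each level to a smoothed mean of a 0/1 target.

    `smoothing` is a hypothetical prior weight: levels with few rows
    are pulled towards the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, target in zip(levels, targets):
        sums[level] += target
        counts[level] += 1
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def apply_target_encoding(levels, encoding, global_mean):
    # Levels never seen when fitting fall back to the global mean.
    return [encoding.get(level, global_mean) for level in levels]


# Toy data: the global mean of the target is 0.6.
levels = ['a', 'a', 'a', 'b', 'c']
targets = [1, 1, 0, 0, 1]
enc, prior = fit_target_encoding(levels, targets, smoothing=1.0)
encoded_test = apply_target_encoding(['a', 'z'], enc, prior)
print(enc, encoded_test)
```

The encoded column is numeric, so a high-cardinality variable no longer needs one split candidate per level.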

**GridSearch**


I have more or less randomly chosen the list below to search for min leaf size.

In [None]:
[2**idx for idx in 7+np.arange(10)]

In [None]:
hyper_params = {'min_rows' : [2**idx for idx in 7+np.arange(10)]} 
search_criteria = {'strategy': "Cartesian"}
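
For reference, the same candidate list can be written without numpy, which makes the actual values easy to see and keeps them as plain Python integers. The name `min_rows_grid` is hypothetical and used only in this sketch.

```python
# Candidate leaf sizes: powers of two from 2**7 up to 2**16.
min_rows_grid = [2 ** k for k in range(7, 17)]
print(min_rows_grid)
# [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
```

Doubling at each step covers a wide range with only ten models. The largest values allow very few splits, so those trees will be much simpler than the ones at the small end of the grid.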

In [None]:
grid_dt = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        seed = 2020,   
                        nfolds = 5,
                        ntrees = 1,
                        max_depth = 20,
                        #min_rows = 1,
                        sample_rate = 1,
                        col_sample_rate = 1,
                        ),
                    grid_id = 'grid_dt',
                    search_criteria = search_criteria,
                    hyper_params = hyper_params)

In [None]:
grid_dt.train(x=features,
              y= 'target',
              training_frame=h2o_df_train,
              seed=2020)

In [None]:
grid_dt = grid_dt.get_grid(sort_by='auc', decreasing=True)
df_perf_auc = grid_dt.sorted_metric_table()
df_perf_auc.head(10)

**Best Decision Tree**

 - The decision tree with the highest AUC score was the one with min_rows = 256.
 - The average CV performance of this model is ?
 - Now train once on the full data with this setting
 - Then predict on test



In [None]:
model_dt = H2OGradientBoostingEstimator(
                        model_id = 'model_dt',
                        seed = 2020,   
                        sample_rate = 1,
                        col_sample_rate = 1,
                        ntrees = 1,
                        min_rows = 256,
                        max_depth = 20
                        )

In [None]:
model_dt.train(x=features,
               y='target',
               training_frame = h2o_df_train,
               )

In [None]:
model_dt

**Create Predictions**

When H2O makes predictions, we get a warning telling us that some features in the test data contain factor levels that were not present in the training dataset.  There is not a lot we can do about this.  You should make sure you understand how H2O makes predictions in such a case.
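
To see which levels trigger the warning, a small check like the following can help. It is a sketch in plain Python; the helper name `unseen_levels` and the toy columns are hypothetical. As far as I know, H2O's GBM treats unseen levels like missing values, but check that against the H2O documentation for your version.

```python
def unseen_levels(train_values, test_values):
    """Return the levels that occur in test but never in train."""
    # key=str keeps the sort working if a column mixes strings and NaN.
    return sorted(set(test_values) - set(train_values), key=str)


# Toy columns standing in for one categorical feature.
train_col = ['red', 'green', 'blue', 'green']
test_col = ['green', 'purple', 'red', 'orange']
print(unseen_levels(train_col, test_col))  # ['orange', 'purple']
```

With the data in this notebook, the same helper could be called as `unseen_levels(df_train[var], df_test[var])` for each `var` in `vars_ind_categorical`.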

In [None]:
h2o_df_test = h2o.H2OFrame(df_test[vars_ind_numeric + vars_ind_categorical],
                           destination_frame='df_test')

In [None]:
preds = model_dt.predict(h2o_df_test)
# There is no need to round your predictions


In [None]:
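# Take the third prediction column (the probability of the second class level) as a plain
# array, so the assignment does not depend on df_test's index matching the H2O frame.
df_test['Predicted'] = np.round(preds[2].as_data_frame().iloc[:, 0].to_numpy(), 5)
df_preds_dt = df_test[['unique_id', 'Predicted']].copy()
df_preds_dt.to_csv(dirPOutput + '04b_df_preds_dt_250k.csv', index=False)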

Now you can submit 04b_df_preds_dt_250k.csv on Kaggle.  You should get an AUROC of around 0.75.

**Note**

If you shut down your h2o JVM in this session, then any other open Python notebooks will also lose the JVM, since they all connect to the same JVM!

In [None]:
h2o.cluster().shutdown()