# GBM and XGBoost Gridsearch Demo

In this tutorial, we walk through a step-by-step workflow that shows how to use H2O's GBM and XGBoost estimators with grid search.

## Start the H2O-3 cluster

The `os` commands below check whether this notebook is being run on the Aquarium platform. We use the `h2o.init` command to connect to the H2O-3 cluster, starting it if it is not already up. The parameters passed to `h2o.init` will depend on your specific environment.

In [None]:
import os
import h2o

startup = '/home/h2o/bin/aquarium_startup'
if os.path.exists(startup):
    os.system(startup)
    local_url = 'http://localhost:54321/h2o'
    aquarium = True
    !sleep 5
else:
    local_url = 'http://localhost:54321'
    aquarium = False

h2o.init(url=local_url)

Note: The method you use for starting and stopping an H2O-3 cluster will depend on how H2O is installed and configured on your system. Regardless of how H2O is installed, if you start a cluster, you will need to ensure that it is shut down when you are done.

## Titanic Data Set

We will look at the famous Titanic passenger data set and try to predict who lived and who died.

In [None]:
if aquarium:
    input_csv = "/home/h2o/data/titanic/titanic.csv"
else:
    input_csv = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"

titanic = h2o.import_file(path = input_csv)

In [None]:
titanic.head()

Set `survived` as a factor so that H2O can build a classification model. Also cast `ticket` as a factor rather than numeric.

In [None]:
titanic["survived"] = titanic["survived"].asfactor()
titanic["ticket"] = titanic["ticket"].asfactor()

Set the predictors and response variables. Note that we exclude `name` because it is a text variable. We also exclude `boat` and `body`, because those variables would not have been known at the time of setting sail. Including those is a classic example of *data leakage*.

In [None]:
# Set predictors and response variable
response = "survived"
exclude = ["name", "survived", "boat", "body"]
# not including boat or body due to data leakage

predictors = list(set(titanic.col_names) - set(exclude))
predictors
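
One caveat with the set-difference approach: Python sets do not preserve order, so the predictor list may come back in a different order than the columns appear in the frame. A minimal sketch of an order-preserving alternative follows. The `col_names` list is a hard-coded stand-in for `titanic.col_names`, assumed here so the snippet runs without a cluster.

```python
# Stand-in for titanic.col_names (assumed column list, for illustration only)
col_names = ["pclass", "survived", "name", "sex", "age", "sibsp", "parch",
             "ticket", "fare", "cabin", "embarked", "boat", "body", "home.dest"]

exclude = {"name", "survived", "boat", "body"}

# Keep every column that is not excluded, in its original order
predictors_ordered = [col for col in col_names if col not in exclude]
print(predictors_ordered)
```

A stable column order makes printed output and variable-importance tables easier to compare between runs.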

Now create training and test data sets. Rather than creating a validation data set, we will use k-fold cross-validation.

In [None]:
train, test = titanic.split_frame(seed = 1234, 
                                  ratios = [0.75], 
                                  destination_frames = ["train.hex", "test.hex"])
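
To make the k-fold idea concrete, the sketch below assigns rows to folds by row index modulo `k` and lists the train/holdout split for each fold. It is an illustration in plain Python, not H2O's implementation: H2O chooses the fold assignment internally (configurable through `fold_assignment`), and `n_rows` is a made-up value.

```python
k = 5          # number of folds, matching nfolds = 5 used later
n_rows = 12    # tiny hypothetical data set

# Assign each row to a fold by index modulo k
fold_of = [i % k for i in range(n_rows)]

splits = []
for fold in range(k):
    holdout = [i for i in range(n_rows) if fold_of[i] == fold]
    training = [i for i in range(n_rows) if fold_of[i] != fold]
    splits.append((training, holdout))
    print(f"fold {fold}: train on {len(training)} rows, hold out {holdout}")
```

Every row is held out exactly once, so the cross-validation metrics are computed on data that the corresponding fold model never saw.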

## Default GBM Model

Build a GBM model with default settings.

In [None]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set only the seed and the number of CV folds; everything else is left at its default
gbm_model = H2OGradientBoostingEstimator(seed = 1234, nfolds = 5)
gbm_model.train(x = predictors
                , y = response
                , training_frame = train
                , validation_frame = test
                , model_id = "gbm_default.hex"
               )

# Show a detailed model summary
print(gbm_model)

In [None]:
%matplotlib inline
gbm_model.plot()

In [None]:
print("Training Data")
gbm_model.model_performance(train = True).plot()
print("Cross-Validation")
gbm_model.model_performance(xval = True).plot()
print("Testing Data")
gbm_model.model_performance(valid = True).plot()

The default GBM model overfits fairly severely: it performs much better on the training data than on the cross-validation or test data.
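
One way to quantify overfitting is to compare AUC across the three data splits. The helper below is a sketch: `auc_gap` is a hypothetical function, and the numbers in the example are invented, not results from this notebook. With a live model, a dictionary of this shape should be obtainable from `gbm_model.auc(train = True, valid = True, xval = True)`.

```python
def auc_gap(aucs):
    """Return how far training AUC exceeds each held-out AUC."""
    return {name: round(aucs["train"] - value, 4)
            for name, value in aucs.items() if name != "train"}

# Invented numbers for illustration only
example_aucs = {"train": 0.98, "valid": 0.86, "xval": 0.85}
gaps = auc_gap(example_aucs)
print(gaps)
```

A large positive gap on both held-out splits suggests the model has memorized the training data rather than learned patterns that generalize.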

## Default XGBoost Models

Build an XGBoost model with default settings.

In [None]:
from h2o.estimators import H2OXGBoostEstimator

param = {"seed": 1234,
         "nfolds": 5
        }

xgboost_model = H2OXGBoostEstimator(**param)
xgboost_model.train(x = predictors
                    , y = response
                    , training_frame = train
                    , validation_frame = test
                    , model_id = "xgb_default.hex"
                   )

In [None]:
print(xgboost_model)

In [None]:
xgboost_model.plot()

In [None]:
print("Training Data")
xgboost_model.model_performance(train = True).plot()
print("Cross-Validation")
xgboost_model.model_performance(xval = True).plot()
print("Testing Data")
xgboost_model.model_performance(valid = True).plot()

The default XGBoost model gives us a better result. Let's use gridsearch with early stopping on both models to see if we can improve their performance.

## GBM Gridsearch

### Notes on parameter values

Our strategy is to start with a large number of trees and a small learning rate in combination with early stopping.

- Early stopping kicks in if the AUC fails to improve by at least 0.001 (a relative tolerance) over 5 consecutive scoring events.
- We begin with a not-so-small learning rate of 0.05, but use `learn_rate_annealing` to decrease the learning rate by 1% after each tree. (Alternatively, we could set annealing to 1 and make the learning rate smaller.)
- We sample 80% of rows per tree (`sample_rate`).
- We sample 80% of columns per split (`col_sample_rate`).
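
To see what the annealing setting implies, the sketch below computes the effective learning rate after a given number of trees. It assumes the rate is multiplied by `learn_rate_annealing` once per tree, so the tree built after `n` earlier trees uses `learn_rate * learn_rate_annealing ** n`. The helper name `effective_rate` is made up for this illustration.

```python
learn_rate = 0.05
learn_rate_annealing = 0.99

def effective_rate(n_trees_built):
    """Learning rate applied to the next tree after n_trees_built trees."""
    return learn_rate * learn_rate_annealing ** n_trees_built

for n in (0, 10, 100, 500):
    print(f"after {n:>3} trees: {effective_rate(n):.6f}")
```

With a factor of 0.99 the rate shrinks quickly, so later trees contribute very little. That is one reason a large `ntrees` budget combined with early stopping is a reasonable pairing here.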

In [None]:
from h2o.grid.grid_search import H2OGridSearch

gbm_params = {'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]
              , 'ntrees': [5000]
              , 'learn_rate': [0.05]
              , 'learn_rate_annealing': [0.99]
              , 'sample_rate': [0.8]
              , 'col_sample_rate': [0.8]
              , 'stopping_metric': ['AUC']
              , 'stopping_rounds': [5]
              , 'stopping_tolerance': [0.001]
             }

gbm_grid = H2OGridSearch(model = H2OGradientBoostingEstimator,
                         hyper_params = gbm_params
                        )

Early stopping is only reproducible if we use `score_tree_interval`; here we set it to score every 10 trees.

In [None]:
gbm_grid.train(x = predictors, y = response
               , training_frame = train
               , validation_frame = test
               , score_tree_interval = 10
               , seed = 1234
               , grid_id = "gbm_grid"
              )
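
The early stopping settings used in this grid can be pictured with a small sketch. The function below is a simplified moving-average rule for a metric where higher is better, written in plain Python. It is an illustration of the idea, not H2O's exact implementation, and `should_stop` and the score lists are hypothetical.

```python
def should_stop(scores, stopping_rounds=5, stopping_tolerance=0.001):
    """Simplified moving-average stopping rule (higher metric is better).

    scores holds the metric at each scoring event, oldest first.
    """
    k = stopping_rounds
    if len(scores) < 2 * k:
        return False  # not enough scoring events yet

    def moving_avg(end):
        # Mean of the k scores ending just before index `end`
        return sum(scores[end - k:end]) / k

    n = len(scores)
    reference = moving_avg(n - k)  # window just before the last k events
    recent_best = max(moving_avg(n - i) for i in range(k))  # k most recent windows
    # Stop when the best recent average has not beaten the reference
    # by at least the relative tolerance
    return recent_best < reference * (1 + stopping_tolerance)

# Invented AUC histories, one value per scoring event
improving = [0.70, 0.75, 0.79, 0.82, 0.84, 0.85, 0.86, 0.87, 0.875, 0.88]
plateau = improving + [0.88] * 10

print(should_stop(improving))
print(should_stop(plateau))
```

Because each scoring event here corresponds to 10 trees (`score_tree_interval = 10`), a rule like this would look back over a window of 50 trees at a time.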

## XGBoost Gridsearch

Let's do the same with XGBoost, this time also varying the learning rate.

In [None]:
xgboost_params = {'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]
                  , 'ntrees': [5000]
                  , 'learn_rate': [1, 0.1, 0.01, 0.001]
                  , 'sample_rate': [0.8]
                  , 'col_sample_rate': [0.8]
                  , 'stopping_metric': ['AUC']
                  , 'stopping_rounds': [5]
                  , 'stopping_tolerance': [0.001]
                 }

xgboost_grid = H2OGridSearch(model = H2OXGBoostEstimator
                             , hyper_params = xgboost_params
                            )
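
Because a Cartesian grid trains one model per combination of hyperparameter values, it helps to count the combinations before launching the search. The sketch below does that in plain Python for the varying parameters of both grids. The single-valued parameters are left out because they do not change the count, and `count_models` is a hypothetical helper.

```python
from itertools import product

max_depths = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]

gbm_varying = {"max_depth": max_depths}
xgboost_varying = {"max_depth": max_depths,
                   "learn_rate": [1, 0.1, 0.01, 0.001]}

def count_models(varying):
    """Number of combinations in a Cartesian grid over the given values."""
    return len(list(product(*varying.values())))

print("GBM models:", count_models(gbm_varying))
print("XGBoost models:", count_models(xgboost_varying))
```

The XGBoost grid is four times larger than the GBM grid, so expect it to take correspondingly longer to train.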

In [None]:
xgboost_grid.train(x = predictors, y = response
                   , training_frame = train
                   , validation_frame = test
                   , score_tree_interval = 10
                   , seed = 1234
                   , grid_id = "xgboost_grid"
                  )

## Grid summary

### GBM Grid

In [None]:
## sort the grid models by decreasing AUC
sorted_gbm_grid = gbm_grid.get_grid(sort_by="auc", decreasing = True)
sorted_gbm_grid

In [None]:
best_gbm = sorted_gbm_grid.models[0]
best_gbm_perf = best_gbm.model_performance(test)
best_gbm_perf.auc()

### XGBoost Grid

In [None]:
## sort the grid models by decreasing AUC
sorted_xgboost_grid = xgboost_grid.get_grid(sort_by="auc", decreasing = True)
sorted_xgboost_grid

In [None]:
best_xgboost = sorted_xgboost_grid.models[0]
best_xgboost_perf = best_xgboost.model_performance(test)
best_xgboost_perf.auc()

Even with gridsearch, XGBoost does a better job than GBM.

## Stop H2O Cluster

In [None]:
h2o.cluster().shutdown()