This notebook compares H2O's built-in random search with the hyperparameter optimization provided by h2ohyperopt. Both approaches tune a Gradient Boosting Machine (GBM) as the base estimator, and both optimize over very similar search spaces.

In [9]:
#Imports
import h2o
import h2ohyperopt
import time
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()

H2O cluster uptime: 21 hours 2 minutes 29 seconds 54 milliseconds
H2O cluster version: 3.8.3.3
H2O cluster name: H2O_started_from_python_abhishek_eqq491
H2O cluster total nodes: 1
H2O cluster total free memory: 0 B
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321


For this example, we use the Titanic dataset. The evaluation metric is the area under the ROC curve (AUC), since this is a binary classification problem.
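As a reminder of what the metric measures: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal pure-Python sketch of that definition. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; H2O computes AUC internally and this is not its implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three survivors (1) and two non-survivors (0)
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print("%.4f" % toy_auc)  # 4 of 6 pairs are ordered correctly -> 0.6667
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the scores later in this notebook should be read.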

In [2]:
def data():
    """
    Function to process the example titanic dataset.
    Train-Valid-Test split is 60%, 20% and 20% respectively.
    Output
    ---------------------
    trainFr: Training H2OFrame.
    testFr: Test H2OFrame.
    validFr: Validation H2OFrame.
    predictors: List of predictor columns for the Training frame.
    response: String defining the response column for Training frame.
    """
    titanic_df = h2o.import_file(path="https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")

    # Basic preprocessing
    # columns_to_be_used - List of columns which are used in the training/test
    # data
    # columns_to_factorize - List of columns with categorical variables
    columns_to_be_used = ['pclass', 'age', 'sex', 'sibsp', 'parch', 'ticket',
                          'embarked', 'fare', 'survived']
    columns_to_factorize = ['pclass', 'sex', 'sibsp', 'embarked', 'survived']
    # Factorizing the columns in the columns_to_factorize list
    for col in columns_to_factorize:
        titanic_df[col] = titanic_df[col].asfactor()
    # Selecting only the columns we need
    titanic_frame = titanic_df[columns_to_be_used]
    trainFr, testFr, validFr = titanic_frame.split_frame([0.6, 0.2],
                                                         seed=1234)
    predictors = trainFr.names[:]
    # Removing the response column from the list of predictors
    predictors.remove('survived')
    response = 'survived'
    return trainFr, testFr, validFr, predictors, response
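
The call `split_frame([0.6, 0.2], seed=1234)` passes two ratios and gets three frames back, with the third receiving the remaining 20%. H2O documents this split as probabilistic rather than exact, so the frame sizes only approximate 60/20/20. The idea can be sketched in plain Python as follows. The helper `probabilistic_split` and the 1000-row stand-in dataset are hypothetical, and the sketch does not reproduce H2O's actual row assignment for a given seed.

```python
import random


def probabilistic_split(rows, ratios, seed):
    """Assign each row to a split with one uniform draw; the last split takes the remainder."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.6, 0.2] -> [0.6, 0.8]
    bounds = []
    total = 0.0
    for r in ratios:
        total += r
        bounds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        u = rng.random()
        idx = len(bounds)  # default: the remainder split
        for i, b in enumerate(bounds):
            if u < b:
                idx = i
                break
        splits[idx].append(row)
    return splits


# Hypothetical stand-in for the dataset: 1000 row indices
parts = probabilistic_split(list(range(1000)), [0.6, 0.2], seed=1234)
sizes = [len(p) for p in parts]
print(sizes)  # roughly 600 / 200 / 200, but not exactly
```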

In [3]:
trainFr, testFr, validFr, predictors, response = data()


Parse Progress: [##################################################] 100%


#### H2O's Random Search

In [4]:
gbm_search_space = {'max_depth': range(2,20),
                    'col_sample_rate': [i * 0.02 for i in range(25, 40)],
                    'learn_rate':[i * 0.01 for i in range(5,20)]}
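
To get a feel for the size of this discrete search space, the short check below rebuilds the three value lists and counts the combinations. It uses only the values defined in the cell above; the variable names are chosen for illustration.

```python
max_depth = list(range(2, 20))
col_sample_rate = [i * 0.02 for i in range(25, 40)]
learn_rate = [i * 0.01 for i in range(5, 20)]

n_combinations = len(max_depth) * len(col_sample_rate) * len(learn_rate)
print(n_combinations)                                                  # 18 * 15 * 15 = 4050
print("%d %d" % (min(max_depth), max(max_depth)))                      # 2 19
print("%.2f %.2f" % (min(col_sample_rate), max(col_sample_rate)))      # 0.50 0.78
print("%.2f %.2f" % (min(learn_rate), max(learn_rate)))                # 0.05 0.19
```

A budget of 100 models therefore covers about 2.5% of the 4050 possible combinations. Note also that the upper ends of the grid (19, 0.78, 0.19) fall just short of the round numbers 20, 0.8 and 0.2, because `range` excludes its stop value.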

In [20]:
search_criteria = {'strategy': 'RandomDiscrete',
                   'max_models': 100,
                   'stopping_rounds':10,
                   'stopping_metric':'AUC'}
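
Conceptually, the `RandomDiscrete` strategy draws hyperparameter combinations at random from the discrete grid, without repeats, until a budget such as `max_models` is reached. The sketch below illustrates that idea with the standard library only. The function `random_discrete` and the seed are hypothetical, it is not H2O's implementation, and it does not model the early stopping requested via `stopping_rounds` and `stopping_metric`.

```python
import itertools
import random


def random_discrete(search_space, max_models, seed):
    """Sample up to max_models distinct combinations from a discrete grid."""
    names = sorted(search_space)
    all_combos = list(itertools.product(*(search_space[n] for n in names)))
    rng = random.Random(seed)
    # random.sample draws without replacement, so no combination repeats
    chosen = rng.sample(all_combos, min(max_models, len(all_combos)))
    return [dict(zip(names, combo)) for combo in chosen]


space = {'max_depth': list(range(2, 20)),
         'col_sample_rate': [i * 0.02 for i in range(25, 40)],
         'learn_rate': [i * 0.01 for i in range(5, 20)]}
candidates = random_discrete(space, max_models=100, seed=42)
print(len(candidates))
print(sorted(candidates[0].keys()))
```

Each candidate is chosen independently of how earlier candidates scored, which is the main conceptual difference from a guided search.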

In [21]:
startTime = time.time()
gbm_grid_rnd = H2OGridSearch(model=H2OGradientBoostingEstimator,
                             grid_id='gbm_grid_rnd2',
                             hyper_params=gbm_search_space,
                             search_criteria=search_criteria)
gbm_grid_rnd.train(x=predictors,
                   y=response,
                   training_frame=trainFr,
                   validation_frame=validFr,
                   nfolds=5, fold_assignment='Random',
                   ntrees=200,
                   seed=0xDECAF)
endTime = time.time()


gbm Grid Build Progress: [##################################################] 100%


In [25]:
print "Time of execution for 100 models with H2O's Random Search : ", endTime - startTime

Time of execution for 100 models with H2O's Random Search :  941.013220072


Best model performance (AUC)

In [26]:
gbm_gridperf_rnd = gbm_grid_rnd.get_grid(sort_by='auc', decreasing=True)

In [27]:
print "Training Score: ", gbm_gridperf_rnd[0].model_performance(trainFr).auc()
print "Validation Score: ", gbm_gridperf_rnd[0].model_performance(validFr).auc()
print "Test Score: ", gbm_gridperf_rnd[0].model_performance(testFr).auc()

Training Score:  0.938090845407
Validation Score:  0.87537993921
Test Score:  0.808678500986


#### H2OHyperopt Smart Search

In [4]:
h2ohyper_gbm = h2ohyperopt.GBMOptimizer(metric='auc')

In [5]:
h2ohyper_gbm.select_optimization_parameters({'col_sample_rate': ('uniform', (0.5, 0.8)),
                                             'ntrees': 200,
                                             'learn_rate': ('uniform',(0.05, 0.2)),
                                             'max_depth': ('randint', (2, 20)),
                                             'nfolds': 5,
                                             'fold_assignment': 'Random',
                                             'seed': 0xDECAF})
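
Unlike the discrete grid used for random search, this specification mixes continuous ranges (`'uniform'`), an integer range (`'randint'`) and fixed values. The sketch below shows one way such a dictionary could be turned into a concrete configuration by plain random draws. It is only an illustration of the search space: the helper `draw` is hypothetical, it does not reflect how h2ohyperopt proposes candidates, and treating the upper bound of `'randint'` as exclusive (mirroring `range(2, 20)`) is an assumption.

```python
import random


def draw(space, rng):
    """Draw one configuration; tuples are (distribution, bounds), other values are fixed."""
    config = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            kind, (low, high) = spec
            if kind == 'uniform':
                config[name] = rng.uniform(low, high)
            elif kind == 'randint':
                # Assumption: upper bound exclusive, mirroring range(2, 20)
                config[name] = rng.randrange(low, high)
            else:
                raise ValueError("unknown distribution: %s" % kind)
        else:
            config[name] = spec  # fixed value, passed through unchanged
    return config


space = {'col_sample_rate': ('uniform', (0.5, 0.8)),
         'ntrees': 200,
         'learn_rate': ('uniform', (0.05, 0.2)),
         'max_depth': ('randint', (2, 20)),
         'nfolds': 5,
         'fold_assignment': 'Random',
         'seed': 0xDECAF}
rng = random.Random(0)
samples = [draw(space, rng) for _ in range(200)]
print(sorted(samples[0].keys()))
```

Because `col_sample_rate` and `learn_rate` are continuous here, this space contains values that the discrete grid cannot reach, which is why the two search spaces are similar rather than identical.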

In [6]:
startTime = time.time()
h2ohyper_gbm.start_optimization(num_evals=100, trainingFr=trainFr,
                         validationFr=validFr, response=response,
                         predictors=predictors)
endTime = time.time()


gbm Model Build Progress: [##################################################] 100%


In [7]:
print "Time of execution for 100 models with H2OHyperopt's Search : ", endTime - startTime

Time of execution for 100 models with H2OHyperopt's Search :  983.272632837
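
The two wall-clock times reported in this notebook can be put side by side with a little arithmetic. The sketch below assumes that both runs actually built 100 models, which the random search does not guarantee when early stopping is enabled, and it is based on a single run of each method.

```python
n_models = 100
random_search_s = 941.013220072   # reported for H2O's random search
hyperopt_s = 983.272632837        # reported for h2ohyperopt

per_model_random = random_search_s / n_models
per_model_hyperopt = hyperopt_s / n_models
overhead_pct = 100.0 * (hyperopt_s - random_search_s) / random_search_s

print("%.2f s/model vs %.2f s/model (+%.1f%%)"
      % (per_model_random, per_model_hyperopt, overhead_pct))
# 9.41 s/model vs 9.83 s/model (+4.5%)
```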


Best model performance (AUC)

In [8]:
best_model = h2ohyper_gbm.best_model
print "Training Score: ", best_model.model_performance(trainFr).auc()
print "Validation Score: ", best_model.model_performance(validFr).auc()
print "Test Score: ", best_model.model_performance(testFr).auc()

Training Score:  0.923317542777
Validation Score:  0.883209990749
Test Score:  0.820202874049
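
To compare the two best models, the differences between the reported AUC values can be computed directly. The numbers below are copied from the outputs in this notebook. Since they come from a single run with one seed and one data split, the sketch makes no claim about whether the differences are statistically meaningful.

```python
random_search = {'train': 0.938090845407,
                 'valid': 0.87537993921,
                 'test': 0.808678500986}
hyperopt = {'train': 0.923317542777,
            'valid': 0.883209990749,
            'test': 0.820202874049}

# Difference per split: positive means h2ohyperopt's best model scored higher
deltas = {}
for split in ('train', 'valid', 'test'):
    deltas[split] = hyperopt[split] - random_search[split]
    print("%-5s %+.4f" % (split, deltas[split]))
# train -0.0148
# valid +0.0078
# test  +0.0115

# Gap between training and test AUC, a rough indicator of overfitting
gap_random = random_search['train'] - random_search['test']
gap_hyperopt = hyperopt['train'] - hyperopt['test']
print("%.4f %.4f" % (gap_random, gap_hyperopt))  # 0.1294 0.1031
```

In this run, the h2ohyperopt model scores slightly lower on the training data and slightly higher on the validation and test data, with a smaller gap between training and test AUC.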
