This notebook introduces the h2ohyperopt module for Python. We begin with multi-model optimization using the ModelDocker and introduce the functions available to a user.

In [1]:
#Imports
import h2o
import h2ohyperopt

In [2]:
#Initializing h2o
h2o.init()

H2O cluster uptime: 4 hours 30 minutes 55 seconds 282 milliseconds
H2O cluster version: 3.8.3.3
H2O cluster name: H2O_started_from_python_abhishek_bdx287
H2O cluster total nodes: 1
H2O cluster total free memory: 1.11 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321


### Data Processing
The dataset used to demonstrate the capabilities of H2OHyperopt is the Titanic dataset. The function ```data()``` preprocesses it and splits it into training, test and validation frames.

In [3]:
def data():
    """
    Function to process the example titanic dataset.
    Train-Valid-Test split is 60%, 20% and 20% respectively.
    Output
    ---------------------
    trainFr: Training H2OFrame.
    testFr: Test H2OFrame.
    validFr: Validation H2OFrame.
    predictors: List of predictor columns for the Training frame.
    response: String defining the response column for Training frame.
    """
    titanic_df = h2o.import_file(path="https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")

    # Basic preprocessing
    # columns_to_be_used - List of columns which are used in the training/test
    # data
    # columns_to_factorize - List of columns with categorical variables
    columns_to_be_used = ['pclass', 'age', 'sex', 'sibsp', 'parch', 'ticket',
                          'embarked', 'fare', 'survived']
    columns_to_factorize = ['pclass', 'sex', 'sibsp', 'embarked', 'survived']
    # Factorizing the columns in the columns_to_factorize list
    for col in columns_to_factorize:
        titanic_df[col] = titanic_df[col].asfactor()
    # Selecting only the columns we need
    titanic_frame = titanic_df[columns_to_be_used]
    trainFr, testFr, validFr = titanic_frame.split_frame([0.6, 0.2],
                                                         seed=1234)
    predictors = trainFr.names[:]
    # Removing the response column from the list of predictors
    predictors.remove('survived')
    response = 'survived'
    return trainFr, testFr, validFr, predictors, response

In [4]:
trainFr, testFr, validFr, predictors, response = data()


Parse Progress: [##################################################] 100%
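To make the ratio argument `[0.6, 0.2]` concrete, here is a minimal pure-Python sketch of a ratio-based random split. It is not H2O's implementation; the helper `split_rows` and the row count of 1000 are hypothetical and used only for illustration. It assumes each row is assigned independently by a random draw, so the partition sizes are approximate, and the unlisted remainder (here 20%) forms the last partition.

```python
import random

def split_rows(n_rows, ratios, seed=1234):
    """Assign row indices to len(ratios) + 1 partitions by random draw.

    Hypothetical helper for illustration; not part of h2o or h2ohyperopt.
    """
    rng = random.Random(seed)
    # Cumulative thresholds: [0.6, 0.2] becomes [0.6, 0.8].
    thresholds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        thresholds.append(total)
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        draw = rng.random()
        # Default to the last partition, which receives the remainder.
        index = len(ratios)
        for i, threshold in enumerate(thresholds):
            if draw < threshold:
                index = i
                break
        parts[index].append(row)
    return parts

first, second, third = split_rows(1000, [0.6, 0.2])
print(len(first), len(second), len(third))
```

Because the assignment is random, the three sizes land near 600, 200 and 200 rather than exactly on them; fixing the seed only makes the split reproducible.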


### Multiple Model Type Based Optimization

Let us demonstrate the ModelDocker. Since this is a binary classification problem, we set the metric to AUC.

We dock three types of models: GBMs, GLMs and DLEs.

In [5]:
model_gbm = h2ohyperopt.GBMOptimizer(metric='auc')
# To use the default search space
# model_gbm.select_optimization_parameters("Default")
# To use a combination of Default parameters and the customized parameters.
model_gbm.select_optimization_parameters({'col_sample_rate': 'Default',
                                          'ntrees': 200,
                                          'learn_rate': ('uniform',(0.05, 0.2)),
                                          'nfolds': 7})
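The dictionary above mixes fixed values (`'ntrees': 200`) with distributions written as `(kind, arguments)` tuples. The sketch below shows one way such a specification could be turned into a concrete configuration for a single evaluation. The helper `sample_parameters` is hypothetical and is not the h2ohyperopt API; the `'hidden'` entry is borrowed from the deep learning example only to exercise the `'choice'` case, and `'Default'` entries are left out because the default ranges are internal to the library.

```python
import random

def sample_parameters(space, rng):
    """Draw one concrete configuration from a search-space dictionary.

    Hypothetical helper for illustration; not part of h2ohyperopt.
    """
    config = {}
    for name, spec in space.items():
        if isinstance(spec, tuple) and spec[0] == 'uniform':
            # ('uniform', (low, high)): sample a float in [low, high].
            low, high = spec[1]
            config[name] = rng.uniform(low, high)
        elif isinstance(spec, tuple) and spec[0] == 'choice':
            # ('choice', [options]): pick one option at random.
            config[name] = rng.choice(spec[1])
        else:
            # Anything else is treated as a fixed value.
            config[name] = spec
    return config

space = {'ntrees': 200,
         'learn_rate': ('uniform', (0.05, 0.2)),
         'hidden': ('choice', [[10, 20], [30, 40]]),
         'nfolds': 7}
config = sample_parameters(space, random.Random(0))
print(config)
```

An optimizer would typically replace the plain random draws with a guided search, but the shape of the input and output stays the same: a search space goes in and one candidate configuration comes out.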

In [6]:
model_dle = h2ohyperopt.DLEOptimizer(metric='auc')
# Selecting parameters to optimize on
model_dle.select_optimization_parameters({'epsilon': 'Default',
                                          'adaptive_rate': True,
                                          'hidden': ('choice', [[10, 20], [30, 40]]),
                                          'nfolds': 7})

In [7]:
model_glm = h2ohyperopt.GLMOptimizer(metric='auc', problemType='Classification')
# Selecting default parameters to optimize on
model_glm.select_optimization_parameters('Default')

Once the individual model optimizers are created, they are passed as a list to the ModelDocker to run multi-model optimization.

#### Initializing a docker and optimizing

We initialize the docker with three different types of models, each optimized over its own search space.

In [8]:
docker = h2ohyperopt.ModelDocker([model_dle, model_gbm, model_glm], 'auc')
docker.start_optimization(num_evals=20, trainingFr=trainFr,
                          validationFr=validFr, response=response,
                          predictors=predictors)


glm Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

gbm Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

gbm Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%

gbm Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

gbm Model Build Progress: [##################################################] 100%

gbm Model Build Progress: [##################################################] 100%

glm Model Build Progress: [############################

The function ```best_in_class_ensembles``` builds an ensemble from the best models of each type. The ```numModels``` argument specifies how many models of each type go into the ensemble. Here we build a best-in-class ensemble using 2 models from each class.

In [9]:
docker.best_in_class_ensembles(numModels=2)


deeplearning prediction Progress: [##################################################] 100%

deeplearning prediction Progress: [##################################################] 100%

glm prediction Progress: [##################################################] 100%

glm prediction Progress: [##################################################] 100%

gbm prediction Progress: [##################################################] 100%

gbm prediction Progress: [##################################################] 100%

gbm Model Build Progress: [##################################################] 100%
Model Trained
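The selection step behind a best-in-class ensemble can be sketched as picking the top-scoring models within each model type. The code below is an illustration only, not the h2ohyperopt implementation: the helper `best_in_class`, the model identifiers and the scores are all made up, and higher scores are assumed to be better, as with AUC.

```python
from collections import defaultdict

def best_in_class(results, num_models):
    """Return the ids of the top `num_models` results per model type.

    `results` holds (model_type, score, model_id) tuples.
    Hypothetical helper for illustration; not part of h2ohyperopt.
    """
    by_type = defaultdict(list)
    for model_type, score, model_id in results:
        by_type[model_type].append((score, model_id))
    selected = {}
    for model_type, scored in by_type.items():
        # Sort descending so the highest score comes first.
        scored.sort(reverse=True)
        selected[model_type] = [model_id for _, model_id in scored[:num_models]]
    return selected

# Made-up evaluation results for illustration.
results = [('gbm', 0.81, 'gbm_1'), ('gbm', 0.79, 'gbm_2'), ('gbm', 0.84, 'gbm_3'),
           ('glm', 0.75, 'glm_1'), ('glm', 0.77, 'glm_2'),
           ('deeplearning', 0.80, 'dl_1')]
selected = best_in_class(results, num_models=2)
print(selected)
```

A type with fewer evaluated models than `num_models` simply contributes all of its models, as the single deep learning entry does here.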


To generate the test score for the ```ensemble_model```, we use the function ```score_ensemble``` on the test frame:

In [10]:
docker.score_ensemble(testFr)


deeplearning prediction Progress: [##################################################] 100%

deeplearning prediction Progress: [##################################################] 100%

glm prediction Progress: [##################################################] 100%

glm prediction Progress: [##################################################] 100%

gbm prediction Progress: [##################################################] 100%

gbm prediction Progress: [##################################################] 100%


0.7771484925331079
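Assuming the value returned above is the AUC configured for the docker, it can be read as the probability that a randomly chosen positive case (a survivor) receives a higher predicted score than a randomly chosen negative case. The sketch below computes AUC directly from that definition on a tiny made-up example; the function `auc` is written here for illustration and is not how H2O computes the metric internally.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Illustrative O(n^2) version.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted scores for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

On this scale 0.5 corresponds to random ranking and 1.0 to a perfect ranking.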