<div class="alert alert-warning">

<b>Warning</b>
    
By design, asyncio does not allow nested event loops. Jupyter uses Tornado, which already starts an event loop. Therefore, the following patch is required to run this tutorial.
    
</div>

In [1]:
!pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()
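
To see the restriction that `nest_asyncio` lifts, here is a minimal sketch that uses only the standard library. It is meant to be run as a plain Python script, outside Jupyter, and the coroutine names `inner` and `outer` are illustrative. It shows that calling `asyncio.run` while an event loop is already running raises a `RuntimeError`:

```python
import asyncio


async def inner():
    return "inner result"


async def outer():
    # We are already inside a running event loop here, so asking asyncio
    # to start a second one is rejected.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Inside Jupyter the notebook kernel plays the role of `outer`, which is why the patch above is applied before anything else.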



# Hyperparameter Search for Machine Learning (Advanced)

In this tutorial, we will show how to treat a learning method as a hyperparameter in the hyperparameter search. We will consider the [Random Forest (RF)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) classifier and the [Gradient Boosting (GB)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) classifier in [Scikit-Learn](https://scikit-learn.org/stable/) for the Airlines data set. Each of these methods has its own set of hyperparameters, as well as some hyperparameters in common. We model them using ConfigSpace, a Python package to express conditional hyperparameters and more.

Create a mapping to record the classification algorithms of interest:

In [2]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


CLASSIFIERS = {
    "RandomForest": RandomForestClassifier,
    "GradientBoosting": GradientBoostingClassifier,
}

Create baseline code to test the accuracy of the default configuration for both models:

In [3]:
from deephyper.benchmark.datasets import airlines as dataset
from sklearn.utils import check_random_state

rs_clf = check_random_state(42)
rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test, _ = dataset.load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
    categoricals_to_integers=True,
)

for clf_name, clf_class in CLASSIFIERS.items():
    print(clf_name)

    clf = clf_class(random_state=rs_clf)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}\n")

RandomForest
Accuracy on Training: 0.879
Accuracy on Validation: 0.620
Accuracy on Testing: 0.620

GradientBoosting
Accuracy on Training: 0.649
Accuracy on Validation: 0.648
Accuracy on Testing: 0.649



The accuracy values show that the RandomForest classifier with default hyperparameters overfits and therefore generalizes poorly: its accuracy is high on the training data (0.879) but much lower on the validation and test data (0.620). In contrast, GradientBoosting shows no sign of overfitting and reaches a higher accuracy on the validation and test sets, i.e., it generalizes better than RandomForest.

Next, we optimize the hyperparameters: we seek the classifier and its corresponding hyperparameters that improve the accuracy on the validation and test data. Create a `load_data` function to load and return training and validation data:

In [4]:
import numpy as np
from sklearn.utils import resample

def load_data(verbose=0, subsample=True):

    # In this case passing a random state is critical to make sure
    # that the same data are loaded all the time and that the test set
    # is not mixed with either the training or validation set.
    # It is important to avoid setting a global seed for safety reasons.
    random_state = np.random.RandomState(seed=42)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the valid set, expressed on the full dataset:
    # the validation set is 33% of "dataset \ test set"
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _, _ = dataset.load_data(
        random_state=random_state,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    # Optionally sub-sample the training data to speed up the search;
    # "n_samples" controls the size of the new training data
    if subsample:
        X_train, y_train = resample(
            X_train, y_train, n_samples=int(1e4), random_state=random_state
        )
        
    if verbose:
        print(f"X_train shape: {np.shape(X_train)}")
        print(f"y_train shape: {np.shape(y_train)}")
        print(f"X_valid shape: {np.shape(X_valid)}")
        print(f"y_valid shape: {np.shape(y_valid)}")
    return (X_train, y_train), (X_valid, y_valid)

print("With subsampling")
_ = load_data(verbose=1)
print("\nWithout subsampling")
_ = load_data(verbose=1, subsample=False)

With subsampling
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)

Without subsampling
X_train shape: (242128, 7)
y_train shape: (242128,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)
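
The shapes above follow from the two ratios. As a small worked check (plain arithmetic, no DeepHyper call), the sketch below reads `valid_size` as a fraction of the full data set. That reading is an assumption, although the printed shapes are consistent with it:

```python
ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33  # assumed: fraction of the FULL data set
ratio_train = 1 - ratio_test - ratio_valid

# Sizes printed above (without subsampling).
n_train, n_valid = 242128, 119258

# Share of the validation set within the data left after removing the test set.
valid_share_of_remainder = n_valid / (n_train + n_valid)

print(f"valid fraction of full data:  {ratio_valid:.4f}")
print(f"train fraction of full data:  {ratio_train:.4f}")
print(f"valid share of non-test data: {valid_share_of_remainder:.4f}")
```

Under this reading, the validation set takes 33% of the data that remains once the test set has been set aside, and the training set takes the other 67%.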


<div class="alert alert-info">
    
<b>Tip</b> 
    
Subsampling the training data with `resample` (here with `n_samples=int(1e4)`) can be useful if you want to speed up your search: each model is trained on fewer samples, so the training time is reduced.
    
</div>
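
As a rough illustration of what this subsampling does, the sketch below draws row indices with replacement and applies the same indices to the features and the labels so that they stay aligned. It uses only the standard library. `subsample_pairs` is a hypothetical helper, not scikit-learn's `resample`, whose default behaviour (sampling with replacement) it imitates:

```python
import random


def subsample_pairs(X, y, n_samples, seed=42):
    """Draw n_samples (features, label) pairs with replacement."""
    rng = random.Random(seed)
    indices = [rng.randrange(len(X)) for _ in range(n_samples)]
    # The same indices are used for X and y, so each row keeps its label.
    return [X[i] for i in indices], [y[i] for i in indices]


# Toy data: 100 rows, label is the parity of the first feature.
X = [[i, i * 2] for i in range(100)]
y = [i % 2 for i in range(100)]

X_small, y_small = subsample_pairs(X, y, n_samples=10)
print(len(X_small), len(y_small))
```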

Create a `run` function to train and evaluate a given hyperparameter configuration. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm.

In [5]:
from deephyper.problem import filter_parameters
from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state


def run(config: dict) -> float:

    config["random_state"] = check_random_state(42)

    (X_train, y_train), (X_valid, y_valid) = load_data()

    clf_class = CLASSIFIERS[config["classifier"]]

    # n_jobs is set for every configuration; filter_parameters then keeps
    # only the parameters possible for the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except Exception:  # do not swallow KeyboardInterrupt or SystemExit
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        acc = -1.0

    return acc
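
The filtering step matters because a single `config` dictionary carries entries for both classifiers, plus bookkeeping keys such as `"classifier"`. The sketch below shows one way such a filter could work, by keeping only the keys that a constructor accepts. `keep_accepted_kwargs`, `ToyForest` and `ToyBoosting` are hypothetical names used for illustration. This is not DeepHyper's implementation of `filter_parameters`:

```python
import inspect


def keep_accepted_kwargs(cls, config):
    """Keep only the entries of config that the constructor of cls accepts."""
    accepted = inspect.signature(cls).parameters
    return {key: value for key, value in config.items() if key in accepted}


class ToyForest:
    def __init__(self, n_estimators=100, max_depth=None, n_jobs=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.n_jobs = n_jobs


class ToyBoosting:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate


config = {
    "classifier": "ToyForest",
    "n_estimators": 12,
    "max_depth": 16,
    "learning_rate": 0.1,
    "n_jobs": 4,
}

forest_params = keep_accepted_kwargs(ToyForest, config)
boosting_params = keep_accepted_kwargs(ToyBoosting, config)
print(forest_params)
print(boosting_params)
```

With a filter of this kind, `n_jobs` can be set unconditionally: it reaches the constructors that accept it and is dropped for the others.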

Create the `HpProblem` to define the search space of hyperparameters for each model:

In [6]:
import ConfigSpace as cs
from deephyper.problem import HpProblem


problem = HpProblem(seed=42)

#! Default values are very important when adding conditional and forbidden clauses.
#! Otherwise, the creation of the problem can fail if the default configuration is
#! not acceptable.
classifier = problem.add_hyperparameter(
    name="classifier",
    value=["RandomForest", "GradientBoosting"],
    default_value="RandomForest",
)

# For both
problem.add_hyperparameter(name="n_estimators", value=(1, 1000, "log-uniform"))
problem.add_hyperparameter(name="max_depth", value=(1, 50))
problem.add_hyperparameter(
    name="min_samples_split", value=(2, 10),
)
problem.add_hyperparameter(name="min_samples_leaf", value=(1, 10))
criterion = problem.add_hyperparameter(
    name="criterion",
    value=["friedman_mse", "mse", "gini", "entropy"],
    default_value="gini",
)

# GradientBoosting
loss = problem.add_hyperparameter(name="loss", value=["deviance", "exponential"])
learning_rate = problem.add_hyperparameter(name="learning_rate", value=(0.01, 1.0))
subsample = problem.add_hyperparameter(name="subsample", value=(0.01, 1.0))

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "mse"]),
)
problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
problem.add_forbidden_clause(forbidden_criterion_gb)

if __name__ == "__main__":
    print(problem)

Configuration space object:
  Hyperparameters:
    classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
    criterion, Type: Categorical, Choices: {friedman_mse, mse, gini, entropy}, Default: gini
    learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
    loss, Type: Categorical, Choices: {deviance, exponential}, Default: deviance
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
    subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
  Conditions:
    learning_rate | classifier == 'GradientBoosting'
    loss | classifier == 'GradientBoosting'
    subsample | classifier == 'GradientBoosting'
  Forbidden Clauses:
    (Forbidden: classifier == 'RandomForest' && Forbidden: cri
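
The conditions and forbidden clauses can be read as rules that every sampled configuration must satisfy. The following standard-library sketch imitates them with simple rejection sampling: forbidden classifier/criterion pairs are redrawn, and the GradientBoosting-only hyperparameters are added only when that classifier is selected. The helper names are hypothetical, and the sampling scheme (including the rounded log-uniform draw) is an assumption made for illustration, not the way ConfigSpace works internally:

```python
import math
import random

CLASSIFIER_NAMES = ["RandomForest", "GradientBoosting"]
CRITERIA = ["friedman_mse", "mse", "gini", "entropy"]
# Criterion values forbidden for each classifier (mirrors the two clauses).
FORBIDDEN_CRITERIA = {
    "RandomForest": {"friedman_mse", "mse"},
    "GradientBoosting": {"gini", "entropy"},
}


def sample_log_uniform_int(rng, low, high):
    """Draw an integer whose logarithm is uniform between log(low) and log(high)."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(high, max(low, round(value)))


def sample_config(rng):
    while True:
        config = {
            "classifier": rng.choice(CLASSIFIER_NAMES),
            "criterion": rng.choice(CRITERIA),
            "n_estimators": sample_log_uniform_int(rng, 1, 1000),
            "max_depth": rng.randint(1, 50),
            "min_samples_split": rng.randint(2, 10),
            "min_samples_leaf": rng.randint(1, 10),
        }
        if config["criterion"] in FORBIDDEN_CRITERIA[config["classifier"]]:
            continue  # forbidden combination: draw again
        if config["classifier"] == "GradientBoosting":
            # Conditional hyperparameters: active only for GradientBoosting.
            config["loss"] = rng.choice(["deviance", "exponential"])
            config["learning_rate"] = rng.uniform(0.01, 1.0)
            config["subsample"] = rng.uniform(0.01, 1.0)
        return config


rng = random.Random(42)
samples = [sample_config(rng) for _ in range(200)]
print(samples[0])
```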

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [7]:
from deephyper.evaluator.evaluate import Evaluator

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1
                 })

print("Number of workers: ", evaluator.num_workers)

2021-09-16 10:53:21,544	INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265


Number of workers:  2


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>
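
To make the idea of distributing evaluations concrete, here is a small stand-in that uses the standard library's `concurrent.futures` instead of Ray. It is only an analogy for what the evaluator does with the run-function. `toy_run` is a hypothetical objective, and the pool of two workers mirrors the `"num_cpus": 2` setting above:

```python
from concurrent.futures import ThreadPoolExecutor


def toy_run(config):
    """Hypothetical objective: highest (zero) when max_depth equals 10."""
    return -abs(config["max_depth"] - 10)


configs = [{"max_depth": depth} for depth in (2, 8, 10, 25)]

# Two workers evaluate the configurations concurrently; map() returns the
# objectives in the same order as the submitted configurations.
with ThreadPoolExecutor(max_workers=2) as pool:
    objectives = list(pool.map(toy_run, configs))

print(objectives)
```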

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link to it the defined `problem` and `evaluator`.

In [8]:
from deephyper.search.hps import AMBS

search = AMBS(problem, evaluator)

In [9]:
results = search.search(30)

Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search, their corresponding `objective` value (i.e., validation accuracy), the `duration` of each evaluation, and the time elapsed when the result was recorded (`elapsed_sec`).

In [10]:
results

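| | classifier | criterion | max_depth | min_samples_leaf | min_samples_split | n_estimators | learning_rate | loss | subsample | id | objective | elapsed_sec | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GradientBoosting | mse | 45 | 8 | 7 | 1 | 0.09864 | exponential | 0.992143 | 2 | 0.556206 | 6.641314 | 3.54416 |
| 1 | RandomForest | gini | 16 | 9 | 5 | 12 | NaN | NaN | NaN | 1 | 0.633869 | 6.782724 | 3.685584 |
| 2 | RandomForest | entropy | 44 | 8 | 6 | 4 | NaN | NaN | NaN | 4 | 0.601637 | 7.824726 | 0.797032 |
| 3 | RandomForest | entropy | 22 | 10 | 2 | 7 | NaN | NaN | NaN | 5 | 0.62322 | 8.840261 | 0.773575 |
| 4 | GradientBoosting | mse | 6 | 9 | 3 | 129 | 0.650378 | deviance | 0.483268 | 3 | 0.583374 | 9.364809 | 2.582544 |
| 5 | RandomForest | gini | 3 | 5 | 3 | 2 | NaN | NaN | NaN | 7 | 0.61964 | 10.388878 | 0.753619 |
| 6 | RandomForest | gini | 31 | 9 | 4 | 1 | NaN | NaN | NaN | 8 | 0.584791 | 11.444127 | 0.783659 |
| 7 | RandomForest | gini | 32 | 1 | 4 | 355 | NaN | NaN | NaN | 6 | 0.629241 | 12.76311 | 3.59123 |
| 8 | RandomForest | gini | 14 | 8 | 5 | 317 | NaN | NaN | NaN | 9 | 0.641232 | 14.624261 | 2.900479 |
| 9 | RandomForest | entropy | 3 | 5 | 5 | 92 | NaN | NaN | NaN | 11 | 0.624369 | 16.076699 | 1.172646 |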


The `deephyper-analytics` command line is a way of analyzing this type of file. For example, to output the best configuration, we can use the `topk` functionality.

In [11]:
!deephyper-analytics topk results.csv

'0':
  classifier: RandomForest
  criterion: gini
  duration: 1.4827599525
  elapsed_sec: 47.402384758
  id: 24
  learning_rate: null
  loss: null
  max_depth: 48
  min_samples_leaf: 9
  min_samples_split: 2
  n_estimators: 99
  objective: 0.6428751111
  subsample: null



Let us define a test to evaluate the best configuration on the training, validation and test data sets.

In [12]:
from pprint import pprint
import pandas as pd


config = results.iloc[results.objective.argmax()][:-2].to_dict()
print("Best config is:")
pprint(config)

config["random_state"] = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test, _ = dataset.load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
    categoricals_to_integers=True,
)

clf_class = CLASSIFIERS[config["classifier"]]
config["n_jobs"] = 4
clf_params = filter_parameters(clf_class, config)

clf = clf_class(**clf_params)

clf.fit(*train)

acc_train = clf.score(*train)
acc_valid = clf.score(*valid)
acc_test = clf.score(*test)

print(f"Accuracy on Training: {acc_train:.3f}")
print(f"Accuracy on Validation: {acc_valid:.3f}")
print(f"Accuracy on Testing: {acc_test:.3f}")

Best config is:
{'classifier': 'RandomForest',
 'criterion': 'gini',
 'id': 24,
 'learning_rate': nan,
 'loss': nan,
 'max_depth': 48,
 'min_samples_leaf': 9,
 'min_samples_split': 2,
 'n_estimators': 99,
 'objective': 0.6428751111036576,
 'subsample': nan}


DEBUG:openml.datasets.dataset:Data pickle file already exists and is up to date.


Accuracy on Training: 0.755
Accuracy on Validation: 0.665
Accuracy on Testing: 0.664


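Compared to the default RandomForest configuration, the validation and test accuracy improve (from 0.620 to about 0.665) and the gap between the training and the validation/test accuracy shrinks, which indicates less overfitting. The tuned model also scores above the default GradientBoosting classifier (0.648 on validation, 0.649 on testing).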
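
The comparison can be made explicit with a few lines of arithmetic on the accuracies reported in this tutorial. The sketch below only restates those printed numbers; the dictionary labels are illustrative:

```python
# Accuracies reported in this tutorial: (training, validation, testing).
reported = {
    "RandomForest (default)": (0.879, 0.620, 0.620),
    "GradientBoosting (default)": (0.649, 0.648, 0.649),
    "RandomForest (tuned)": (0.755, 0.665, 0.664),
}

# Generalization gap: training accuracy minus validation accuracy.
gaps = {
    name: round(train - valid, 3)
    for name, (train, valid, _test) in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: train/valid gap = {gap:.3f}")

# Test-accuracy gain of the tuned RandomForest over its default configuration.
test_gain = round(
    reported["RandomForest (tuned)"][2] - reported["RandomForest (default)"][2], 3
)
print(f"test accuracy gain: {test_gain:.3f}")
```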