<div class="alert alert-warning">

<b>Warning</b>
    
By design, asyncio does not allow nested event loops. Jupyter runs on Tornado, which already starts an event loop, so the following patch is required to run this tutorial.
    
</div>

In [1]:
!pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()
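
The restriction can be reproduced outside Jupyter. The following minimal sketch, meant to be run as a plain Python script without `nest_asyncio` applied, calls `asyncio.run` from inside a coroutine that is already running on an event loop and captures the resulting `RuntimeError`. The coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # An event loop is already running here, so asking asyncio to
    # start a second one is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


message = asyncio.run(outer())
print(message)
```

Inside Jupyter the outer `asyncio.run` call would itself hit the same error, which is the situation the `nest_asyncio` patch is meant to address.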



# Hyperparameter Search for Machine Learning (Basic)

In this tutorial, we show how to tune the hyperparameters of the [Random Forest (RF) classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
in [scikit-learn](https://scikit-learn.org/stable/) for the Airlines data set.


Let us start by creating a function `run_baseline` to test the accuracy of the baseline model.

In [2]:
import numpy as np
from sklearn.utils import check_random_state
from sklearn.ensemble import RandomForestClassifier
from deephyper.benchmark.datasets import airlines as dataset
    
def run_baseline():

    rs_data = np.random.RandomState(seed=42)

    ratio_test = 0.33
    ratio_valid = (1 - ratio_test) * 0.33

    train, valid, test, _ = dataset.load_data(
        random_state=rs_data,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    rs_classifier = check_random_state(42)

    classifier = RandomForestClassifier(n_jobs=4, random_state=rs_classifier)
    classifier.fit(*train)

    acc_train = classifier.score(*train)
    acc_valid = classifier.score(*valid)
    acc_test = classifier.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}")


run_baseline()

Accuracy on Training: 0.879
Accuracy on Validation: 0.620
Accuracy on Testing: 0.620


The accuracy values show that the RandomForest classifier with default hyperparameters overfits and therefore generalizes poorly: accuracy is high on the training data (0.879) but much lower on the validation and test data (0.620).

Next, we optimize the hyperparameters of the RandomForest classifier to address the overfitting problem and improve the accuracy on the validation and test data. Create the `load_data` function to load and return training and validation data.

<div class="alert alert-info">
    
<b>Tip</b>
    
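Subsampling with <code>X_train, y_train = resample(X_train, y_train, n_samples=int(1e4))</code> can be useful if you want to speed up your search, since training on fewer samples takes less time. Note that <code>resample</code> must first be imported with <code>from sklearn.utils import resample</code>.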
    
</div>
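
As a minimal sketch of the subsampling mentioned in the tip, the snippet below applies `sklearn.utils.resample` to small synthetic arrays. The array sizes and the `random_state` value are assumptions chosen for illustration, not values from this tutorial. By default `resample` draws with replacement; passing `replace=False` draws distinct rows instead.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic stand-in for the training data (illustrative only).
rng = np.random.RandomState(0)
X_train = rng.rand(1000, 7)
y_train = rng.randint(0, 2, size=1000)

# Keep 100 rows; X and y are sampled with the same indices,
# so features and labels stay aligned.
X_small, y_small = resample(
    X_train, y_train, n_samples=100, replace=False, random_state=0
)

print(X_small.shape, y_small.shape)
```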

In [3]:
def load_data(verbose=0):

    # In this case passing a random state is critical to make sure
    # that the same data are loaded every time and that the test set
    # is not mixed with either the training or validation set.
    # It is safer to pass a local random state than to set a global seed.
    random_state = np.random.RandomState(seed=42)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the validation set: 33% of the data that remains
    # once the test set is removed (i.e. 0.67 * 0.33 of the full dataset)
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _, _ = dataset.load_data(
        random_state=random_state,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    # Uncomment the next line to subsample the training data and speed up
    # the search; "n_samples" controls the size of the new training data
    # (requires "from sklearn.utils import resample").
    # X_train, y_train = resample(X_train, y_train, n_samples=int(1e4))
    if verbose:
        print(f"X_train shape: {np.shape(X_train)}")
        print(f"y_train shape: {np.shape(y_train)}")
        print(f"X_valid shape: {np.shape(X_valid)}")
        print(f"y_valid shape: {np.shape(y_valid)}")
    return (X_train, y_train), (X_valid, y_valid)

_ = load_data(verbose=1)

X_train shape: (242128, 7)
y_train shape: (242128,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)
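
The printed shapes can be checked against the ratios used in `load_data`. The short calculation below is a sketch under one stated assumption: the test set holds exactly `ratio_test` of the full data, so the full size can be estimated from the training and validation counts. Under that assumption the numbers suggest that `ratio_valid` is expressed relative to the full dataset, which corresponds to 33% of the non-test data.

```python
# Shapes printed by load_data(verbose=1) above.
n_train, n_valid = 242128, 119258

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33  # 0.2211

# Share of the validation set within the non-test data.
share_of_remainder = n_valid / (n_train + n_valid)

# Assumption: the test set holds exactly ratio_test of the full data,
# so train + valid make up the remaining (1 - ratio_test).
n_total_estimate = (n_train + n_valid) / (1 - ratio_test)
share_of_total = n_valid / n_total_estimate

print(f"valid / (train + valid): {share_of_remainder:.3f}")
print(f"valid / estimated total: {share_of_total:.3f}")
```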


Create a function `run` to train and evaluate the RF model given a specific configuration `config`. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm.

In [4]:
def run(config: dict):

    rs = check_random_state(42)

    (X, y), (vX, vy) = load_data()

    classifier = RandomForestClassifier(
        n_jobs=4,
        random_state=rs,
        n_estimators=config["n_estimators"],
        criterion=config["criterion"],
        max_depth=config["max_depth"],
        min_samples_split=config["min_samples_split"]
    )
    classifier.fit(X, y)

    mean_accuracy = classifier.score(vX, vy)

    return mean_accuracy
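
Because the returned value is maximized, a quantity that should be minimized (an error rate or a loss, for instance) can be returned with its sign flipped. The sketch below uses a toy, hypothetical `validation_error` function in place of a trained model; only the sign convention is the point.

```python
def validation_error(config: dict) -> float:
    # Hypothetical stand-in for "train a model, measure its error".
    # The error is lowest when max_depth is 15.
    return 0.30 + 0.001 * abs(config["max_depth"] - 15)


def run_minimizing_error(config: dict) -> float:
    # The search maximizes the returned scalar, so return the
    # negative error: a lower error gives a higher objective.
    return -validation_error(config)


shallow = run_minimizing_error({"max_depth": 15})
deep = run_minimizing_error({"max_depth": 50})
print(shallow > deep)
```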

Create an `HpProblem` instance named `problem` to define the search space of hyperparameters for the RF model.

In [5]:
from deephyper.problem import HpProblem

problem = HpProblem()

problem.add_hyperparameter((10, 300), "n_estimators")
problem.add_hyperparameter(["gini", "entropy"], "criterion")
problem.add_hyperparameter((1, 50), "max_depth")
problem.add_hyperparameter((2, 10), "min_samples_split")

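# We define a starting point with the default hyperparameters from scikit-learn,
# which we consider good on average (max_depth is set to the upper bound, 50,
# because the scikit-learn default of None is not part of the integer range).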
problem.add_starting_point(
    n_estimators=100, criterion="gini", max_depth=50, min_samples_split=2
)

problem

Configuration space object:
  Hyperparameters:
    criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [10, 300], Default: 155


  Starting Point:
{0: {'criterion': 'gini',
     'max_depth': 50,
     'min_samples_split': 2,
     'n_estimators': 100}}
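
To make the search space concrete, the sketch below draws random configurations from the same ranges using only the standard library. It illustrates what a configuration looks like and is not how `HpProblem` samples internally; the `SPACE` dictionary and the `sample_config` helper are hypothetical names.

```python
import random

# Same ranges as the HpProblem above: tuples are inclusive integer
# ranges, lists are categorical choices.
SPACE = {
    "n_estimators": (10, 300),
    "criterion": ["gini", "entropy"],
    "max_depth": (1, 50),
    "min_samples_split": (2, 10),
}


def sample_config(rng: random.Random) -> dict:
    config = {}
    for name, domain in SPACE.items():
        if isinstance(domain, tuple):
            low, high = domain
            config[name] = rng.randint(low, high)  # both ends included
        else:
            config[name] = rng.choice(domain)
    return config


rng = random.Random(42)
configs = [sample_config(rng) for _ in range(3)]
for config in configs:
    print(config)
```

Each sampled dictionary has the same keys that the `run` function reads from `config`.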

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [6]:
from deephyper.evaluator.evaluate import Evaluator

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1
                 })

print("Number of workers: ", evaluator.num_workers)

2021-08-31 16:33:42,099 INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265


Number of workers:  2


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the Ray dashboard in a browser, at the address printed in the log above (for example <a>http://127.0.0.1:8265</a>), to monitor the CPU usage of the execution.
    
</div>
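
The role of an evaluator can be sketched with the standard library: several configurations are submitted to a pool of workers and the results are collected as they finish. This uses `concurrent.futures` as a stand-in and does not reflect how DeepHyper or Ray are implemented; `toy_run` is a hypothetical run-function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(config: dict) -> float:
    # Hypothetical objective: prefers max_depth close to 15.
    return 1.0 / (1 + abs(config["max_depth"] - 15))


configs = [{"max_depth": depth} for depth in (5, 15, 40)]

# Two workers, mirroring "num_cpus": 2 with one CPU per task.
collected = []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(toy_run, c): c for c in configs}
    for future in as_completed(futures):
        # Results arrive in completion order, not submission order.
        collected.append((futures[future], future.result()))

best_config, best_objective = max(collected, key=lambda item: item[1])
print(best_config, best_objective)
```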

Finally, define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and pass it the `problem` and `evaluator` defined above.

In [7]:
from deephyper.search.hps import AMBS

search = AMBS(problem, evaluator)

In [8]:
results = search.search(20)

Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search together with their `objective` value (i.e., validation accuracy), the `duration` of each evaluation, and `elapsed_sec`, the time elapsed since the start of the search when the result was recorded.

In [9]:
results

| Unnamed: 0 | criterion | max_depth | min_samples_split | n_estimators | id | objective | elapsed_sec | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | gini | 35 | 5 | 68 | 2 | 0.635966 | 15.584538 | 12.659195 |
| 1 | gini | 50 | 2 | 100 | 1 | 0.620369 | 21.130914 | 18.205591 |
| 2 | gini | 38 | 4 | 80 | 4 | 0.630356 | 35.085798 | 13.778017 |
| 3 | entropy | 33 | 2 | 125 | 3 | 0.620965 | 39.129643 | 23.444741 |
| 4 | gini | 9 | 2 | 37 | 5 | 0.649013 | 39.753345 | 4.484855 |
| 5 | gini | 49 | 3 | 10 | 7 | 0.621434 | 42.905477 | 2.999732 |
| 6 | entropy | 17 | 2 | 10 | 8 | 0.653985 | 45.822875 | 2.722389 |
| 7 | gini | 45 | 10 | 56 | 6 | 0.650204 | 47.139352 | 7.735802 |
| 8 | entropy | 18 | 2 | 21 | 9 | 0.655243 | 49.793846 | 3.775522 |
| 9 | entropy | 23 | 2 | 22 | 11 | 0.6321 | 54.876923 | 4.903988 |


The `deephyper-analytics` command-line tool is one way of analyzing this type of file. For example, to output the best configuration, we can use the `topk` functionality.

In [11]:
!deephyper-analytics topk results.csv

'0':
  criterion: gini
  duration: 13.887526989
  elapsed_sec: 74.4813959599
  id: 15
  max_depth: 15
  min_samples_split: 2
  n_estimators: 103
  objective: 0.6638296802



Let us define a function `test_config` to evaluate the best configuration on the training, validation and test data sets.

In [12]:
def test_config(config):

    rs_data = np.random.RandomState(seed=42)

    ratio_test = 0.33
    ratio_valid = (1 - ratio_test) * 0.33

    train, valid, test, _ = dataset.load_data(
        random_state=rs_data,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    rs_classifier = check_random_state(42)

    classifier = RandomForestClassifier(
        n_jobs=4,
        random_state=rs_classifier,
        n_estimators=config["n_estimators"],
        criterion=config["criterion"],
        max_depth=config["max_depth"],
        min_samples_split=config["min_samples_split"],
    )
    classifier.fit(*train)

    acc_train = classifier.score(*train)
    acc_valid = classifier.score(*valid)
    acc_test = classifier.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}")

In [14]:
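# Take the row with the highest objective and drop the last two columns
# (elapsed_sec, duration); test_config only reads the hyperparameter keys.
best_config = results.iloc[results.objective.argmax()][:-2].to_dict()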
test_config(best_config)

DEBUG:openml.datasets.dataset:Data pickle file already exists and is up to date.


Accuracy on Training: 0.739
Accuracy on Validation: 0.664
Accuracy on Testing: 0.663


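Compared to the default configuration, the tuned configuration improves accuracy on the validation set (0.620 to 0.664) and on the test set (0.620 to 0.663) while reducing overfitting: training accuracy drops from 0.879 to 0.739, which is much closer to the validation and test accuracy.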
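
The change can be quantified from the accuracies reported above. The short calculation below takes the gap between training and validation accuracy as a rough indicator of overfitting, which is one of several possible measures.

```python
# Accuracies printed by run_baseline() and test_config(best_config).
baseline = {"train": 0.879, "valid": 0.620, "test": 0.620}
tuned = {"train": 0.739, "valid": 0.664, "test": 0.663}

gap_baseline = baseline["train"] - baseline["valid"]
gap_tuned = tuned["train"] - tuned["valid"]
test_gain = tuned["test"] - baseline["test"]

print(f"Train/valid gap, baseline: {gap_baseline:.3f}")
print(f"Train/valid gap, tuned:    {gap_tuned:.3f}")
print(f"Test accuracy gain:        {test_gain:.3f}")
```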