<div class="alert alert-warning">

<b>Warning</b>
    
By design, asyncio does not allow nested event loops. Jupyter uses Tornado, which already starts an event loop, so the following patch is required to run this tutorial.
    
</div>

In [1]:
!pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()
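
The restriction described in the warning can be reproduced outside Jupyter. The following is a minimal sketch, meant to be run as a standalone script with no patch applied; `inner` and `outer` are illustrative names and are not part of asyncio or DeepHyper.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # Code in a Jupyter cell runs inside an already-running event loop,
    # which is the situation this coroutine reproduces.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine was never started; close it cleanly
        return f"RuntimeError: {exc}"
    return "nested call succeeded"


message = asyncio.run(outer())
print(message)
```

Run unpatched, this should report a `RuntimeError`. With `nest_asyncio.apply()` called first, the nested call is expected to succeed instead.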



# Automated Machine Learning with Scikit-Learn

In this tutorial, we show how to automatically search among different machine learning algorithms from [Scikit-Learn](https://scikit-learn.org/stable/). Automated machine learning only requires the user to link the data with a predefined problem and a run function that we provide.

## Classification

This part of the tutorial focuses on the classification case.

Create a `run` function that trains and evaluates the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm. For *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.classifier.run` and wrap it with our data as follows:

In [2]:
from deephyper.sklearn.classifier import run as sklearn_run


def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y


def run(config):
    return sklearn_run(config, load_data)
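
The contract between the search and the run-function can be illustrated without any machine learning. This is only a sketch: `toy_run` is a hypothetical stand-in that maps a configuration dictionary to a score, and the key `"x"` is invented for the example.

```python
def toy_run(config):
    """Hypothetical run-function: maps a configuration dict to a score."""
    # Stand-in for "train a model and return validation accuracy":
    # the score peaks when x == 3 and decreases away from it.
    return 1.0 - abs(config["x"] - 3) / 10


# Candidate configurations, as a search algorithm might propose them.
candidates = [{"x": x} for x in range(7)]

# The search keeps the configuration with the highest returned value.
best = max(candidates, key=toy_run)
print(best, toy_run(best))
```

The only requirements visible here are the ones stated above: the function receives one configuration and returns one scalar, where larger is better.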

We are ready to go! But first, let us look at the problem provided by DeepHyper to better understand what is happening under the hood. The search space is shown below; to display it, just import `Problem` from the problem module and evaluate it:

In [3]:
from deephyper.sklearn.classifier.autosklearn1.problem import Problem

Problem

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
  Conditions:
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    a
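
To make the printed space more concrete, here is a rough sketch of how a conditional, log-scaled space can be sampled. It covers only `classifier`, `C` and `n_estimators`, uses the ranges and two of the conditions printed above, and is not how DeepHyper or ConfigSpace sample internally; `log_uniform` and `sample_config` are hypothetical helpers, and rounding a log-uniform float to get an integer is a simplification.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a float whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Sample a partial configuration honouring two of the printed conditions."""
    config = {
        "classifier": rng.choice(
            ["RandomForest", "Logistic", "AdaBoost", "KNeighbors", "MLP", "SVC", "XGBoost"]
        )
    }
    # C is only active when the classifier is 'Logistic' or 'SVC'.
    if config["classifier"] in ("Logistic", "SVC"):
        config["C"] = log_uniform(rng, 1e-5, 10.0)
    # n_estimators is only active for 'RandomForest' or 'AdaBoost'.
    if config["classifier"] in ("RandomForest", "AdaBoost"):
        config["n_estimators"] = round(log_uniform(rng, 1, 2000))
    return config


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(200)]
print(samples[:3])
```

Inactive hyperparameters are simply absent from a sampled configuration, which matches the empty cells that appear in the results table later in this tutorial.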

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [4]:
from deephyper.evaluator.evaluate import Evaluator

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1
                 })

print("Number of workers: ", evaluator.num_workers)

2021-09-16 16:01:59,531	INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265


Number of workers:  2


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>
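
The idea of distributing evaluations over a fixed number of workers can be sketched with the standard library alone. This is an analogy, not DeepHyper's implementation: it uses threads from `concurrent.futures` rather than Ray, and `toy_run` is a hypothetical run-function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def toy_run(config):
    """Hypothetical run-function; sleeps briefly to imitate training time."""
    time.sleep(0.05)
    return -abs(config["x"] - 3)


configs = [{"x": x} for x in range(6)]

# Two workers, mirroring "num_cpus": 2 with one CPU per task.
with ThreadPoolExecutor(max_workers=2) as pool:
    # map() returns results in the same order as the submitted configurations.
    objectives = list(pool.map(toy_run, configs))

print(objectives)
```

With two workers, at most two configurations are evaluated at any moment, and the remaining ones wait for a free worker.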

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link it to the `Problem` and `evaluator` defined above.

In [5]:
from deephyper.search.hps import AMBS

search = AMBS(Problem, evaluator)

In [6]:
results = search.search(10)
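
The "asynchronous" part of such a search can be sketched as a loop that resubmits work as soon as any single evaluation finishes. The sketch below is an assumption-laden illustration, not the AMBS algorithm: `toy_run`, `suggest` and `async_search` are hypothetical names, threads replace Ray, and `suggest` samples at random where a model-based search would use a surrogate fitted on the history.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def toy_run(config):
    """Hypothetical run-function with a variable evaluation time."""
    time.sleep(0.01 * (1 + config["x"] % 3))
    return -abs(config["x"] - 3)


def suggest(rng, history):
    """Placeholder for the model-based step: ignores history, samples at random."""
    return {"x": rng.randint(0, 10)}


def async_search(max_evals, num_workers=2, seed=0):
    rng = random.Random(seed)
    history = []  # (config, objective) pairs, in completion order
    pending = {}  # future -> config
    submitted = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while len(history) < max_evals:
            # Keep every worker busy without exceeding the evaluation budget.
            while len(pending) < num_workers and submitted < max_evals:
                config = suggest(rng, history)
                pending[pool.submit(toy_run, config)] = config
                submitted += 1
            # Resume as soon as any one evaluation finishes, not the whole batch.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                history.append((pending.pop(future), future.result()))
    return history


history = async_search(max_evals=10)
best_config, best_objective = max(history, key=lambda item: item[1])
print(len(history), best_config, best_objective)
```

Because results are recorded in completion order, a fast evaluation submitted later can appear before a slow one submitted earlier.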



Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search, their corresponding `objective` value (i.e., validation accuracy), the `duration` of each evaluation, and the time elapsed since the start of the search (`elapsed_sec`).

In [7]:
results

| Unnamed: 0 | classifier | C | alpha | kernel | max_depth | n_estimators | n_neighbors | gamma | id | objective | elapsed_sec | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost | | | | | 28.0 | | | 1 | 0.952128 | 7.018792 | 3.644613 |
| 1 | RandomForest | | | | 7.0 | 22.0 | | | 2 | 0.957447 | 7.18974 | 3.815548 |
| 2 | MLP | | 0.000159 | | | | | | 4 | 0.978723 | 7.685323 | 0.317695 |
| 3 | RandomForest | | | | 3.0 | 407.0 | | | 3 | 0.946809 | 7.850828 | 0.661686 |
| 4 | MLP | | 0.001463 | | | | | | 5 | 0.978723 | 8.097145 | 0.246882 |
| 5 | MLP | | 0.048322 | | | | | | 6 | 0.978723 | 8.395381 | 0.386671 |
| 6 | SVC | 8.478327 | | poly | | | | 0.00316 | 8 | 0.718085 | 8.581014 | 0.017535 |
| 7 | MLP | | 0.002032 | | | | | | 7 | 0.978723 | 8.746737 | 0.352 |
| 8 | SVC | 1.6e-05 | | linear | | | | | 9 | 0.643617 | 8.902902 | 0.156634 |
| 9 | SVC | 2.363564 | | linear | | | | | 10 | 0.973404 | 9.081975 | 0.179543 |


The `deephyper-analytics` command line is a way of analyzing this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [8]:
!deephyper-analytics topk results.csv -k 3

'0':
  C: null
  alpha: 0.0001591634
  classifier: MLP
  duration: 0.3176951408
  elapsed_sec: 7.6853232384
  gamma: null
  id: 4
  kernel: null
  max_depth: null
  n_estimators: null
  n_neighbors: null
  objective: 0.9787234043
'1':
  C: null
  alpha: 0.0014633497
  classifier: MLP
  duration: 0.2468817234
  elapsed_sec: 8.0971448421
  gamma: null
  id: 5
  kernel: null
  max_depth: null
  n_estimators: null
  n_neighbors: null
  objective: 0.9787234043
'2':
  C: null
  alpha: 0.0483216274
  classifier: MLP
  duration: 0.3866710663
  elapsed_sec: 8.3953812122
  gamma: null
  id: 6
  kernel: null
  max_depth: null
  n_estimators: null
  n_neighbors: null
  objective: 0.9787234043
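
The same selection can be approximated in plain Python. This is a sketch of the idea, not the implementation behind `deephyper-analytics`: `topk` is a hypothetical helper, and the CSV text is copied from the ten result rows shown earlier rather than read from `results.csv`.

```python
import csv
import io

# The ten result rows shown earlier; normally these would be read from results.csv.
RESULTS_CSV = """\
Unnamed: 0,classifier,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
0,AdaBoost,,,,,28.0,,,1,0.952128,7.018792,3.644613
1,RandomForest,,,,7.0,22.0,,,2,0.957447,7.18974,3.815548
2,MLP,,0.000159,,,,,,4,0.978723,7.685323,0.317695
3,RandomForest,,,,3.0,407.0,,,3,0.946809,7.850828,0.661686
4,MLP,,0.001463,,,,,,5,0.978723,8.097145,0.246882
5,MLP,,0.048322,,,,,,6,0.978723,8.395381,0.386671
6,SVC,8.478327,,poly,,,,0.00316,8,0.718085,8.581014,0.017535
7,MLP,,0.002032,,,,,,7,0.978723,8.746737,0.352
8,SVC,1.6e-05,,linear,,,,,9,0.643617,8.902902,0.156634
9,SVC,2.363564,,linear,,,,,10,0.973404,9.081975,0.179543
"""


def topk(rows, k):
    """Return the k rows with the highest objective (ties keep file order)."""
    return sorted(rows, key=lambda row: float(row["objective"]), reverse=True)[:k]


rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))
best = topk(rows, 3)
for row in best:
    # Drop empty cells, which correspond to inactive hyperparameters.
    print({key: value for key, value in row.items() if value != ""})
```

Four MLP configurations share the same rounded objective here, so which three are reported depends on how ties are broken; this sketch keeps the order in which they appear in the file.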



## Regression

This part of the tutorial focuses on the regression case.

Create a `run` function that trains and evaluates the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation $R^2$), which will be maximized by the search algorithm. For *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.regressor.run` and wrap it with our data as follows:

In [9]:
from deephyper.sklearn.regressor import run as sklearn_run


def load_data():
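    # NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so this cell requires scikit-learn < 1.2.
    from sklearn.datasets import load_boston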

    X, y = load_boston(return_X_y=True)
    return X, y


def run(config):
    return sklearn_run(config, load_data)
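
For reference, the $R^2$ score mentioned above is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. A minimal pure-Python sketch follows; `r_squared` is an illustrative helper with made-up data, not the function DeepHyper calls internally.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)             # predictions equal the targets
baseline = r_squared(y_true, [6.0] * 4)         # always predict the mean
partial = r_squared(y_true, [4.0, 4.0, 8.0, 8.0])
print(perfect, baseline, partial)
```

A perfect model scores 1, a model no better than predicting the mean scores 0, and worse models can score below 0, which is why maximizing this value is a sensible search objective.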

We are ready to go! But first, let us look at the problem provided by DeepHyper to better understand what is happening under the hood. The search space is shown below; to display it, just import `Problem` from the problem module and evaluate it:

In [10]:
from deephyper.sklearn.regressor.autosklearn1.problem import Problem

Problem

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
  Conditions:
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | r

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [11]:
from deephyper.evaluator.evaluate import Evaluator

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  2


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link it to the `Problem` and `evaluator` defined above.

In [12]:
from deephyper.search.hps import AMBS

search = AMBS(Problem, evaluator)

In [13]:
results = search.search(10)



Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search, their corresponding `objective` value (i.e., validation $R^2$), the `duration` of each evaluation, and the time elapsed since the start of the search (`elapsed_sec`).

In [14]:
results

| Unnamed: 0 | regressor | C | alpha | kernel | max_depth | n_estimators | n_neighbors | gamma | id | objective | elapsed_sec | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MLP | | 0.000274 | | | | | | 1 | 0.675673 | 0.480368 | 0.24246 |
| 1 | MLP | | 0.00045 | | | | | | 3 | 0.675673 | 0.932815 | 0.326241 |
| 2 | AdaBoost | | | | | 491.0 | | | 2 | 0.808517 | 1.150966 | 0.913018 |
| 3 | AdaBoost | | | | | 14.0 | | | 5 | 0.777684 | 1.389541 | 0.03977 |
| 4 | MLP | | 0.001947 | | | | | | 4 | 0.675674 | 1.610403 | 0.459993 |
| 5 | RandomForest | | | | 24.0 | 145.0 | | | 6 | 0.864622 | 1.848695 | 0.239134 |
| 6 | AdaBoost | | | | | 656.0 | | | 7 | 0.812025 | 2.709419 | 0.889866 |
| 7 | RandomForest | | | | 4.0 | 1.0 | | | 9 | 0.734077 | 2.887269 | 0.011393 |
| 8 | RandomForest | | | | 16.0 | 8.0 | | | 10 | 0.802004 | 3.078153 | 0.027789 |
| 9 | RandomForest | | | | 82.0 | 6.0 | | | 11 | 0.754988 | 3.377223 | 0.017984 |
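
One quick way to read such results is to keep the best objective reached by each regressor. The sketch below does this with the standard library; the CSV text is copied from the ten rows above rather than read from `results.csv`, and `best_by_regressor` is just an illustrative variable name.

```python
import csv
import io

# The ten regression result rows shown above; normally read from results.csv.
RESULTS_CSV = """\
Unnamed: 0,regressor,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
0,MLP,,0.000274,,,,,,1,0.675673,0.480368,0.24246
1,MLP,,0.00045,,,,,,3,0.675673,0.932815,0.326241
2,AdaBoost,,,,,491.0,,,2,0.808517,1.150966,0.913018
3,AdaBoost,,,,,14.0,,,5,0.777684,1.389541,0.03977
4,MLP,,0.001947,,,,,,4,0.675674,1.610403,0.459993
5,RandomForest,,,,24.0,145.0,,,6,0.864622,1.848695,0.239134
6,AdaBoost,,,,,656.0,,,7,0.812025,2.709419,0.889866
7,RandomForest,,,,4.0,1.0,,,9,0.734077,2.887269,0.011393
8,RandomForest,,,,16.0,8.0,,,10,0.802004,3.078153,0.027789
9,RandomForest,,,,82.0,6.0,,,11,0.754988,3.377223,0.017984
"""

best_by_regressor = {}
for row in csv.DictReader(io.StringIO(RESULTS_CSV)):
    name = row["regressor"]
    objective = float(row["objective"])
    # Keep only the highest objective seen so far for this regressor.
    if name not in best_by_regressor or objective > best_by_regressor[name]:
        best_by_regressor[name] = objective

for name, objective in sorted(best_by_regressor.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {objective}")
```

With only ten evaluations, such a summary mostly indicates which model families the search has explored so far rather than a definitive ranking.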


The `deephyper-analytics` command line is a way of analyzing this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [15]:
!deephyper-analytics topk results.csv -k 3

'0':
  C: null
  alpha: null
  duration: 0.2391338348
  elapsed_sec: 1.8486950397
  gamma: null
  id: 6
  kernel: null
  max_depth: 24.0
  n_estimators: 145.0
  n_neighbors: null
  objective: 0.8646224101
  regressor: RandomForest
'1':
  C: null
  alpha: null
  duration: 0.8898658752
  elapsed_sec: 2.7094190121
  gamma: null
  id: 7
  kernel: null
  max_depth: null
  n_estimators: 656.0
  n_neighbors: null
  objective: 0.8120254829
  regressor: AdaBoost
'2':
  C: null
  alpha: null
  duration: 0.9130177498
  elapsed_sec: 1.1509661674
  gamma: null
  id: 2
  kernel: null
  max_depth: null
  n_estimators: 491.0
  n_neighbors: null
  objective: 0.8085167869
  regressor: AdaBoost

