# Automated Machine Learning with Scikit-Learn

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deephyper/tutorials/blob/main/tutorials/colab/AutoML_with_Sklearn.ipynb)

In this tutorial, we show how to automatically search among different machine learning algorithms from [Scikit-Learn](https://scikit-learn.org/stable/). Automated machine learning only requires the user to link the data with a predefined problem and a `run` function that we provide.

Let us start by installing DeepHyper.

In [1]:
!pip install deephyper



<div class="alert alert-warning">

<b>Warning</b>
    
By design, asyncio does not allow nested event loops. Jupyter uses Tornado, which already starts an event loop. Therefore, the following patch is required to run this tutorial.
    
</div>

In [2]:
!pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()
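
To see the restriction that the patch works around, the following minimal sketch uses only the standard library and is meant to be run as a plain Python script, outside Jupyter and without `nest_asyncio` applied. The names `inner` and `outer` are hypothetical. It tries to start a second event loop from inside a running one and captures the resulting error.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are already inside a running event loop here, so asking
    # asyncio to start another one is refused with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Inside a notebook the outer `asyncio.run` call is itself the nested one, which is why the patch is applied before anything else.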



## Classification

In this part of the tutorial, we focus on the classification case.

Create a `run` function to train and evaluate the model corresponding to the configuration generated by the search. This function must return a scalar value (typically, validation accuracy), which the search algorithm maximizes. For *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.classifier.run_autosklearn1` and wrap it with our data as follows:

In [3]:
from deephyper.sklearn.classifier import run_autosklearn1


def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
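
For intuition about what a run function of this kind has to do, here is a minimal sketch written with scikit-learn only. It is not DeepHyper's implementation of `run_autosklearn1`: the name `sketch_run`, the 67/33 train/validation split, the fixed random seed, and the restriction to two classifiers are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def load_data():
    return load_breast_cancer(return_X_y=True)


def sketch_run(config):
    """Hypothetical run function: train one model, return validation accuracy."""
    X, y = load_data()
    # Assumed split ratio and seed, chosen only for this sketch.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    if config["classifier"] == "RandomForest":
        model = RandomForestClassifier(
            n_estimators=int(config["n_estimators"]),
            max_depth=int(config["max_depth"]),
            random_state=42,
        )
    elif config["classifier"] == "KNeighbors":
        model = KNeighborsClassifier(n_neighbors=int(config["n_neighbors"]))
    else:
        raise ValueError(f"classifier not covered by this sketch: {config['classifier']}")
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the validation split.
    return model.score(X_valid, y_valid)


accuracy = sketch_run({"classifier": "KNeighbors", "n_neighbors": 12})
print(f"validation accuracy: {accuracy:.5f}")
```

The important part is the contract: a dictionary of hyperparameters goes in, and a single number to maximize comes out.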

We are ready to go! But first, let us look at the problem provided by DeepHyper in `deephyper.sklearn.classifier.problem_autosklearn1` to better understand what is happening under the hood.

In [4]:
from deephyper.sklearn.classifier import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
  Conditions:
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    a
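
The `Conditions` entries mean that a hyperparameter is only active when its parent takes one of the listed values. The sketch below illustrates that idea with the standard library only, using the ranges for `C` and `n_estimators` and the two corresponding conditions from the printout above. It is not how DeepHyper samples configurations; `log_uniform` and `sample_config` are hypothetical helpers, and the remaining hyperparameters are left out.

```python
import math
import random

CLASSIFIERS = ["RandomForest", "Logistic", "AdaBoost", "KNeighbors", "MLP", "SVC", "XGBoost"]


def log_uniform(rng, low, high):
    """Sample a float whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(max(value, low), high)  # guard against floating-point overshoot


def sample_config(rng):
    """Draw one configuration; inactive hyperparameters stay None."""
    config = {"classifier": rng.choice(CLASSIFIERS), "C": None, "n_estimators": None}
    # (C | classifier == 'Logistic' || C | classifier == 'SVC')
    if config["classifier"] in ("Logistic", "SVC"):
        config["C"] = log_uniform(rng, 1e-5, 10.0)
    # (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    if config["classifier"] in ("RandomForest", "AdaBoost"):
        config["n_estimators"] = round(log_uniform(rng, 1, 2000))
    return config


rng = random.Random(0)
for _ in range(5):
    print(sample_config(rng))
```

Inactive hyperparameters show up as empty cells in the results table produced by the search.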

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [5]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import LoggerCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [LoggerCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1
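
Conceptually, an evaluator submits calls to the run function to a pool of workers and collects the results as they finish. The sketch below shows that pattern with `concurrent.futures` from the standard library as a stand-in. It does not use Ray or the DeepHyper `Evaluator` API, and `toy_run` is a hypothetical objective.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(config):
    """Hypothetical run function: the objective peaks at x == 3."""
    return -(config["x"] - 3) ** 2


configs = [{"x": x} for x in range(6)]
results = []

with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit every configuration, remembering which future belongs to which config.
    futures = {pool.submit(toy_run, config): config for config in configs}
    # Gather results in completion order, not submission order.
    for future in as_completed(futures):
        results.append((futures[future], future.result()))

best_config, best_objective = max(results, key=lambda item: item[1])
print(best_config, best_objective)
```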


<div class="alert alert-info">
    
<b>Tip</b> 
    
If you execute this tutorial locally, you can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link to it the defined `problem_autosklearn1` and `evaluator`.

In [6]:
from deephyper.search.hps import AMBS

search = AMBS(problem_autosklearn1, evaluator)

In [7]:
results = search.search(10)

[00001] -- best objective: -1.00000 -- received objective: -1.00000
[00002] -- best objective: 0.96277 -- received objective: 0.96277
[00003] -- best objective: 0.96277 -- received objective: 0.94149
[00004] -- best objective: 0.97872 -- received objective: 0.97872
[00005] -- best objective: 0.97872 -- received objective: 0.97872
[00006] -- best objective: 0.97872 -- received objective: 0.95745
[00007] -- best objective: 0.97872 -- received objective: 0.96809
[00008] -- best objective: 0.97872 -- received objective: 0.97340
[00009] -- best objective: 0.97872 -- received objective: 0.96277
[00010] -- best objective: 0.97872 -- received objective: 0.96809
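
In each log line, the best objective is the running maximum of the objectives received so far. The short sketch below reproduces that bookkeeping from the received values printed above. The format string imitates the log output and is not taken from the `LoggerCallback` source.

```python
# Received objectives copied from the log above.
received = [-1.0, 0.96277, 0.94149, 0.97872, 0.97872,
            0.95745, 0.96809, 0.97340, 0.96277, 0.96809]

best = float("-inf")
lines = []
for job_id, objective in enumerate(received, start=1):
    best = max(best, objective)  # running maximum
    lines.append(
        f"[{job_id:05d}] -- best objective: {best:.5f} "
        f"-- received objective: {objective:.5f}"
    )

print("\n".join(lines))
```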


Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search, their corresponding `objective` value (i.e., validation accuracy), the `duration` of each evaluation, and `elapsed_sec`, the elapsed time at which each result was recorded.

In [8]:
results

Unnamed: 0,classifier,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
0,AdaBoost,,,,,1.0,,,1,-1.0,13.452292,3.579814
1,KNeighbors,,,,,,12.0,,2,0.962766,15.11949,0.128026
2,KNeighbors,,,,,,26.0,,3,0.941489,16.815022,0.126511
3,MLP,,0.001721,,,,,,4,0.978723,18.776632,0.450363
4,MLP,,0.719261,,,,,,5,0.978723,20.71752,0.451758
5,RandomForest,,,,7.0,1284.0,,,6,0.957447,25.709774,3.390466
6,RandomForest,,,,18.0,140.0,,,7,0.968085,27.733081,0.481195
7,AdaBoost,,,,,256.0,,,8,0.973404,30.094701,0.792026
8,AdaBoost,,,,,13.0,,,9,0.962766,31.660862,0.056839
9,AdaBoost,,,,,7.0,,,10,0.968085,33.250499,0.037967


The `deephyper-analytics` command-line tool is one way to analyze this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [9]:
!deephyper-analytics topk results.csv -k 3

'0': {C: null, alpha: 0.0017211808, classifier: MLP, duration: 0.4503633976, elapsed_sec: 18.7766315937,
  gamma: null, id: 4, kernel: null, max_depth: null, n_estimators: null, n_neighbors: null,
  objective: 0.9787234043}
'1': {C: null, alpha: 0.7192606656, classifier: MLP, duration: 0.4517583847, elapsed_sec: 20.7175195217,
  gamma: null, id: 5, kernel: null, max_depth: null, n_estimators: null, n_neighbors: null,
  objective: 0.9787234043}
'2': {C: null, alpha: null, classifier: AdaBoost, duration: 0.7920260429, elapsed_sec: 30.0947012901,
  gamma: null, id: 8, kernel: null, max_depth: null, n_estimators: 256.0, n_neighbors: null,
  objective: 0.9734042553}
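
The same ranking can also be obtained directly in Python. The sketch below assumes pandas is available and uses an inline excerpt (`id`, `classifier`, `objective`) of the results table shown earlier. With the real file, `pd.read_csv("results.csv")` would take the place of the inline text.

```python
import io

import pandas as pd

# Excerpt of the results table above, limited to three columns.
csv_text = """id,classifier,objective
1,AdaBoost,-1.0
2,KNeighbors,0.962766
3,KNeighbors,0.941489
4,MLP,0.978723
5,MLP,0.978723
6,RandomForest,0.957447
7,RandomForest,0.968085
8,AdaBoost,0.973404
9,AdaBoost,0.962766
10,AdaBoost,0.968085
"""

results = pd.read_csv(io.StringIO(csv_text))
# Sort by objective, highest first, and keep the three best rows.
top3 = results.sort_values("objective", ascending=False, kind="stable").head(3)
print(top3.to_string(index=False))
```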



## Regression

In this part of the tutorial, we focus on the regression case.

Create a `run` function to train and evaluate the model corresponding to the configuration generated by the search. This function must return a scalar value (typically, validation $R^2$), which the search algorithm maximizes. For *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.regressor.run_autosklearn1` and wrap it with our data as follows:

In [10]:
from deephyper.sklearn.regressor import run_autosklearn1


def load_data():
    from sklearn.datasets import fetch_california_housing

    X, y = fetch_california_housing(return_X_y=True)
    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
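
As a reminder of what the $R^2$ score measures, it is defined as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The following pure-Python sketch (the helper name `r2_score_sketch` is hypothetical) shows that a perfect model scores 1, always predicting the mean scores 0, and a model worse than that baseline scores below 0.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_sketch(y_true, y_true)                 # 1.0
baseline = r2_score_sketch(y_true, [2.5, 2.5, 2.5, 2.5])  # 0.0
worse = r2_score_sketch(y_true, [4.0, 3.0, 2.0, 1.0])     # -3.0

print(perfect, baseline, worse)
```

Assuming the regression objective is $R^2$, this is why a search over regressors can report negative objective values.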

We are ready to go! But first, let us look at the problem provided by DeepHyper to better understand what is happening under the hood.

In [11]:
from deephyper.sklearn.regressor import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
  Conditions:
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | r

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [12]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import LoggerCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1,
                     "callbacks": [LoggerCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link to it the defined `problem_autosklearn1` and `evaluator`.

In [13]:
from deephyper.search.hps import AMBS

search = AMBS(problem_autosklearn1, evaluator)

In [14]:
results = search.search(10)

[00001] -- best objective: -0.05886 -- received objective: -0.05886
[00002] -- best objective: 0.56451 -- received objective: 0.56451
[00003] -- best objective: 0.56451 -- received objective: 0.44250
[00004] -- best objective: 0.78391 -- received objective: 0.78391
[00005] -- best objective: 0.78391 -- received objective: 0.76766
[00006] -- best objective: 0.78391 -- received objective: 0.57538
[00007] -- best objective: 0.78391 -- received objective: 0.47485
[00008] -- best objective: 0.78391 -- received objective: 0.77684
[00009] -- best objective: 0.80735 -- received objective: 0.80735
[00010] -- best objective: 0.80735 -- received objective: 0.71937


Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search, their corresponding `objective` value (i.e., validation $R^2$), the `duration` of each evaluation, and `elapsed_sec`, the elapsed time at which each result was recorded.

In [16]:
results

Unnamed: 0,regressor,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
0,SVR,0.000317,,sigmoid,,,,0.000371,1,-0.058856,26.522054,22.169491
1,AdaBoost,,,,,13.0,,,2,0.564506,28.500448,0.489093
2,AdaBoost,,,,,322.0,,,3,0.442501,31.241331,1.240041
3,XGBoost,,,,,,,,4,0.783908,33.657469,1.129844
4,MLP,,0.074061,,,,,,5,0.767663,46.345952,11.166475
5,SVR,1.562227,,linear,,,,,6,0.575381,72.394516,24.590795
6,SVR,0.008012,,rbf,,,,0.012148,7,0.474847,91.398495,17.416007
7,MLP,,0.000198,,,,,,8,0.776843,104.560521,11.689193
8,RandomForest,,,,29.0,988.0,,,9,0.807347,182.046024,75.990584
9,RandomForest,,,,7.0,70.0,,,10,0.719367,186.149946,2.577926


The `deephyper-analytics` command-line tool is one way to analyze this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [17]:
!deephyper-analytics topk results.csv -k 3

'0': {C: null, alpha: null, duration: 75.9905841351, elapsed_sec: 182.0460243225,
  gamma: null, id: 9, kernel: null, max_depth: 29.0, n_estimators: 988.0, n_neighbors: null,
  objective: 0.8073473164, regressor: RandomForest}
'1': {C: null, alpha: null, duration: 1.1298441887, elapsed_sec: 33.6574685574, gamma: null,
  id: 4, kernel: null, max_depth: null, n_estimators: null, n_neighbors: null, objective: 0.7839078942,
  regressor: XGBoost}
'2': {C: null, alpha: 0.0001976475, duration: 11.6891927719, elapsed_sec: 104.560520649,
  gamma: null, id: 8, kernel: null, max_depth: null, n_estimators: null, n_neighbors: null,
  objective: 0.7768434694, regressor: MLP}

