# Automated Machine Learning with Scikit-Learn

In this tutorial, we show how to automatically search among different machine learning algorithms from [Scikit-Learn](https://scikit-learn.org/stable/). Automated machine learning only requires the user to link the data with a predefined problem and a run function that we provide.

<div class="alert alert-warning">

<b>Warning</b>
    
By design, asyncio does not allow nested event loops. Jupyter uses Tornado, which already starts an event loop, so the following patch is required to run this tutorial.
    
</div>
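
The restriction described in the warning can be reproduced with plain `asyncio`, independently of Jupyter or DeepHyper. The sketch below starts one event loop and then tries to start a second one from inside it. The helper names `inner` and `outer` are purely illustrative.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # At this point we are already inside a running event loop.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is refused by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted, which is why it is applied in the next cell.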



In [23]:
!pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()



## Classification

In this part of the tutorial, we focus on the classification case.

Create a `run` function to train and evaluate the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm. In the case of *automated machine learning*, we use the run function provided at `deephyper.sklearn.classifier.run_autosklearn1` and wrap it with our data as follows:

In [24]:
from deephyper.sklearn.classifier import run_autosklearn1


def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
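
To make the contract of a run function concrete, here is a minimal sketch of what such a function might do: receive a configuration dictionary, train the selected model, and return the validation accuracy. This is not the implementation of `run_autosklearn1`. The name `toy_run`, the split ratio, the random seed and the two supported classifiers are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def toy_run(config):
    """Hypothetical run function: returns validation accuracy for one config."""
    X, y = load_breast_cancer(return_X_y=True)
    # Assumed hold-out split; the real run function may split differently.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )

    if config["classifier"] == "RandomForest":
        model = RandomForestClassifier(
            n_estimators=int(config["n_estimators"]),
            max_depth=int(config["max_depth"]),
            random_state=42,
        )
    elif config["classifier"] == "Logistic":
        model = LogisticRegression(C=config["C"], max_iter=5000)
    else:
        raise ValueError(f"Unsupported classifier: {config['classifier']}")

    model.fit(X_train, y_train)
    # Scalar objective that a search would try to maximize.
    return model.score(X_valid, y_valid)


score = toy_run({"classifier": "RandomForest", "n_estimators": 12, "max_depth": 49})
print(f"validation accuracy: {score:.3f}")
```

Any function with this shape (a dictionary in, a scalar out) can serve as the objective of the search.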

We are ready to go! But first, let us look at the problem provided by DeepHyper in `deephyper.sklearn.classifier.problem_autosklearn1` to better understand what is happening under the hood.

In [25]:
from deephyper.sklearn.classifier import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
  Conditions:
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    a
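
In the output above, a condition such as `C | classifier == 'Logistic'` reads "`C` is only active when `classifier` equals `Logistic`". The sketch below applies this rule in plain Python to filter a configuration down to its active hyperparameters. It encodes only the three conditions visible in the truncated output. The names `CONDITIONS` and `active_hyperparameters` are hypothetical and not part of DeepHyper.

```python
# Only the conditions visible in the (truncated) output are encoded here.
CONDITIONS = {
    "C": lambda cfg: cfg.get("classifier") in ("Logistic", "SVC"),
    "gamma": lambda cfg: cfg.get("kernel") in ("rbf", "poly", "sigmoid"),
    "n_estimators": lambda cfg: cfg.get("classifier") in ("RandomForest", "AdaBoost"),
}


def active_hyperparameters(config):
    """Keep unconditional hyperparameters and those whose condition holds."""
    return {
        name: value
        for name, value in config.items()
        if name not in CONDITIONS or CONDITIONS[name](config)
    }


rbf_cfg = {"classifier": "SVC", "C": 0.001884, "kernel": "rbf",
           "gamma": 0.052997, "n_estimators": 45}
linear_cfg = {"classifier": "SVC", "C": 3.8e-05, "kernel": "linear",
              "gamma": 0.01, "n_estimators": 45}

print(active_hyperparameters(rbf_cfg))     # gamma kept, n_estimators dropped
print(active_hyperparameters(linear_cfg))  # gamma and n_estimators dropped
```

Conditions of this kind keep the search from spending evaluations on hyperparameters that the selected model ignores.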

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [26]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import LoggerCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1,
                     "callbacks": [LoggerCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  2
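
The idea of distributing evaluations over two workers can be illustrated without Ray. The following sketch is a stand-in built on the standard library's `concurrent.futures`, not DeepHyper's evaluator. The toy objective `toy_run` and its single hyperparameter `x` are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(config):
    # Hypothetical objective with its maximum (0) at x = 3.
    return -(config["x"] - 3) ** 2


configs = [{"x": x} for x in range(6)]
objectives = {}

# Two workers evaluate configurations concurrently; results are gathered
# in completion order, which may differ from submission order.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(toy_run, cfg): cfg["x"] for cfg in configs}
    for future in as_completed(futures):
        objectives[futures[future]] = future.result()

best_x = max(objectives, key=objectives.get)
print("best x:", best_x, "objective:", objectives[best_x])
```

Because results arrive in completion order, the order of evaluated configurations is not guaranteed to match the order in which they were submitted.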


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link to it the defined `problem_autosklearn1` and `evaluator`.

In [27]:
from deephyper.search.hps import AMBS

search = AMBS(problem_autosklearn1, evaluator)

In [28]:
results = search.search(10)

[00001] -- best objective: 0.64362 -- received objective: 0.64362
[00002] -- best objective: 0.64362 -- received objective: 0.64362
[00003] -- best objective: 0.64362 -- received objective: 0.64362
[00004] -- best objective: 0.93617 -- received objective: 0.93617
[00005] -- best objective: 0.93617 -- received objective: 0.87234
[00006] -- best objective: 0.93617 -- received objective: 0.79255
[00007] -- best objective: 0.97872 -- received objective: 0.97872
[00008] -- best objective: 0.98404 -- received objective: 0.98404
[00009] -- best objective: 0.98404 -- received objective: 0.94149
[00010] -- best objective: 0.98404 -- received objective: 0.97340
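
Each log line reports the objective just received together with the best objective seen so far, which is simply a running maximum. The short sketch below recomputes that column from the received values copied from the log above.

```python
from itertools import accumulate

# Received objectives copied from the log above.
received = [0.64362, 0.64362, 0.64362, 0.93617, 0.87234,
            0.79255, 0.97872, 0.98404, 0.94149, 0.97340]

# Running maximum: the best objective after each evaluation.
best_so_far = list(accumulate(received, max))

for i, (best, obj) in enumerate(zip(best_so_far, received), start=1):
    print(f"[{i:05d}] -- best objective: {best:.5f} -- received objective: {obj:.5f}")
```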


Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search and, for each of them, the `objective` value (i.e., validation accuracy), the `duration` of the evaluation and the time elapsed since the start of the search (`elapsed_sec`).

In [29]:
results

Unnamed: 0,classifier,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
0,SVC,3.8e-05,,linear,,,,,1,0.643617,0.205998,0.020271
1,SVC,0.001884,,rbf,,,,0.052997,2,0.643617,0.304545,0.118803
2,SVC,0.000103,,sigmoid,,,,1.083386,3,0.643617,0.485166,0.181115
3,RandomForest,,,,49.0,12.0,,,4,0.93617,0.660375,0.175918
4,RandomForest,,,,81.0,1.0,,,6,0.87234,0.976429,0.025854
5,Logistic,0.000283,,,,,,,5,0.792553,1.667765,1.008016
6,AdaBoost,,,,,416.0,,,7,0.978723,2.136055,0.870724
7,AdaBoost,,,,,401.0,,,8,0.984043,2.629015,0.78198
8,RandomForest,,,,2.0,300.0,,,10,0.941489,3.190827,0.392217
9,AdaBoost,,,,,580.0,,,9,0.973404,3.451035,1.148255


The `deephyper-analytics` command-line tool is one way of analyzing this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [30]:
!deephyper-analytics topk results.csv -k 3

'0':
  C: null
  alpha: null
  classifier: AdaBoost
  duration: 0.7819800377
  elapsed_sec: 2.6290147305
  gamma: null
  id: 8
  kernel: null
  max_depth: null
  n_estimators: 401.0
  n_neighbors: null
  objective: 0.9840425532
'1':
  C: null
  alpha: null
  classifier: AdaBoost
  duration: 0.8707237244
  elapsed_sec: 2.1360547543
  gamma: null
  id: 7
  kernel: null
  max_depth: null
  n_estimators: 416.0
  n_neighbors: null
  objective: 0.9787234043
'2':
  C: null
  alpha: null
  classifier: AdaBoost
  duration: 1.1482548714
  elapsed_sec: 3.4510347843
  gamma: null
  id: 9
  kernel: null
  max_depth: null
  n_estimators: 580.0
  n_neighbors: null
  objective: 0.9734042553
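
The same ranking can be obtained by hand. The sketch below uses only the standard library and a few rows copied from the results table above. In practice one would read `results.csv` directly, and the choice to hide empty fields is just for readability.

```python
import csv
import io

# Header and five rows copied from the results shown earlier.
csv_text = """\
Unnamed: 0,classifier,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
3,RandomForest,,,,49.0,12.0,,,4,0.93617,0.660375,0.175918
6,AdaBoost,,,,,416.0,,,7,0.978723,2.136055,0.870724
7,AdaBoost,,,,,401.0,,,8,0.984043,2.629015,0.78198
8,RandomForest,,,,2.0,300.0,,,10,0.941489,3.190827,0.392217
9,AdaBoost,,,,,580.0,,,9,0.973404,3.451035,1.148255
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Sort by objective, highest first, and keep the three best rows.
top3 = sorted(rows, key=lambda row: float(row["objective"]), reverse=True)[:3]
top_ids = [int(row["id"]) for row in top3]

for rank, row in enumerate(top3):
    # Hide inactive (empty) hyperparameters for readability.
    print(rank, {key: value for key, value in row.items() if value != ""})
```

On the dataframe returned by `search.search(...)`, the pandas call `results.nlargest(3, "objective")` should give the same ranking.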



## Regression

In this part of the tutorial, we focus on the regression case.

Create a `run` function to train and evaluate the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation $R^2$), which will be maximized by the search algorithm. In the case of *automated machine learning*, we use the run function provided at `deephyper.sklearn.regressor.run_autosklearn1` and wrap it with our data as follows:

In [31]:
from deephyper.sklearn.regressor import run_autosklearn1


def load_data():
    from sklearn.datasets import fetch_california_housing

    X, y = fetch_california_housing(return_X_y=True)
    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
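
Assuming the objective is the usual coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, the following sketch computes it in plain Python. The function name `r_squared` and the toy data are illustrative assumptions; `sklearn.metrics.r2_score` implements the same definition.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                 # exact predictions
constant = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
poor = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than the mean

print(perfect, constant, poor)  # 1.0 0.0 -3.0
```

Unlike accuracy, $R^2$ is not bounded below by 0: a model that predicts worse than the mean of the targets gets a negative score, so negative objectives can appear in search logs.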

We are ready to go! But first, let us look at the problem provided by DeepHyper to better understand what is happening under the hood.

In [32]:
from deephyper.sklearn.regressor import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
  Conditions:
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | r
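
Several hyperparameters above are marked "on log-scale". One common way to sample on a log scale is to draw uniformly in log-space and exponentiate, so that each order of magnitude is equally likely. The sketch below illustrates this for the range `[1e-05, 10.0]`. It is an assumption about the general technique, not a description of the sampler DeepHyper actually uses, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw uniformly in log-space, then map back to the original scale."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Clamp to guard against floating-point rounding at the boundaries.
    return min(max(value, low), high)


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 10.0, rng) for _ in range(10_000)]

# 0.01 is the geometric midpoint of [1e-5, 10], so roughly half of the
# samples are expected to fall below it.
fraction_small = sum(s < 1e-2 for s in samples) / len(samples)
print(f"fraction below 0.01: {fraction_small:.2f}")
```

With plain uniform sampling on the same range, values below 0.01 would almost never be drawn, which is why a log scale is preferred for parameters such as `C`, `alpha` and `gamma`.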

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [33]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import LoggerCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 2,
                     "num_cpus_per_task": 1,
                     "callbacks": [LoggerCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  2


<div class="alert alert-info">
    
<b>Tip</b> 
    
You can open the ray-dashboard at an address like <a>http://127.0.0.1:port</a> in a browser to monitor the CPU usage of the execution.
    
</div>

Finally, you can define a Bayesian optimization search called `AMBS` (for Asynchronous Model-Based Search) and link to it the defined `problem_autosklearn1` and `evaluator`.

In [34]:
from deephyper.search.hps import AMBS

search = AMBS(problem_autosklearn1, evaluator)

In [35]:
results = search.search(10)



[00001] -- best objective: -0.05987 -- received objective: -0.05987
[00002] -- best objective: -0.05987 -- received objective: -1.00000
[00003] -- best objective: -0.05958 -- received objective: -0.05958
[00004] -- best objective: 0.64601 -- received objective: 0.64601
[00005] -- best objective: 0.79948 -- received objective: 0.79948
[00006] -- best objective: 0.80628 -- received objective: 0.80628
[00007] -- best objective: 0.80628 -- received objective: 0.77807
[00008] -- best objective: 0.80628 -- received objective: 0.77237
[00009] -- best objective: 0.80628 -- received objective: 0.57510
[00010] -- best objective: 0.80628 -- received objective: 0.44250


Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search and, for each of them, the `objective` value (i.e., validation $R^2$), the `duration` of the evaluation and the time elapsed since the start of the search (`elapsed_sec`).

In [36]:
results

Unnamed: 0,regressor,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,id,objective,elapsed_sec,duration
0,SVR,0.000641,,poly,,,,9.7e-05,2,-0.059871,8.143802,7.984654
1,AdaBoost,,,,,66.0,,,3,-1.0,8.256956,0.014779
2,SVR,2.4e-05,,sigmoid,,,,0.00115,1,-0.059577,9.388179,9.229047
3,KNeighbors,,,,,,100.0,,5,0.646005,10.084049,0.52983
4,RandomForest,,,,21.0,22.0,,,6,0.799479,10.593793,0.332195
5,RandomForest,,,,17.0,1985.0,,,7,0.806276,39.086356,28.174333
6,RandomForest,,,,10.0,1458.0,,,8,0.778068,52.83801,13.529363
7,MLP,,0.026939,,,,,,9,0.772373,58.923216,5.867041
8,SVR,0.205418,,linear,,,,,10,0.575102,67.039218,7.946631
9,AdaBoost,,,,,114.0,,,11,0.442501,68.080724,0.85179


The `deephyper-analytics` command-line tool is one way of analyzing this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [37]:
!deephyper-analytics topk results.csv -k 3

'0':
  C: null
  alpha: null
  duration: 28.1743330956
  elapsed_sec: 39.0863559246
  gamma: null
  id: 7
  kernel: null
  max_depth: 17.0
  n_estimators: 1985.0
  n_neighbors: null
  objective: 0.8062755105
  regressor: RandomForest
'1':
  C: null
  alpha: null
  duration: 0.3321950436
  elapsed_sec: 10.5937926769
  gamma: null
  id: 6
  kernel: null
  max_depth: 21.0
  n_estimators: 22.0
  n_neighbors: null
  objective: 0.799478785
  regressor: RandomForest
'2':
  C: null
  alpha: null
  duration: 13.5293631554
  elapsed_sec: 52.8380098343
  gamma: null
  id: 8
  kernel: null
  max_depth: 10.0
  n_estimators: 1458.0
  n_neighbors: null
  objective: 0.778067878
  regressor: RandomForest

