# Automated Machine Learning with Scikit-Learn

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deephyper/tutorials/blob/main/tutorials/colab/AutoML_with_Sklearn.ipynb)

In this tutorial, we show how to automatically search among different machine learning algorithms from [Scikit-Learn](https://scikit-learn.org/stable/). Automated machine learning only requires the user to link their data to a predefined problem and `run` function that DeepHyper provides.

Let us start by installing DeepHyper.

In [1]:
!pip install deephyper["popt"]
!pip install ray



## Classification

This part of the tutorial focuses on classification.

Create a `run` function that trains and evaluates the model corresponding to the configuration generated by the search. This function must return a scalar value (typically the validation accuracy), which the search algorithm maximizes. For *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.classifier.run_autosklearn1` and wrap it with our data as follows:

In [2]:
from deephyper.sklearn.classifier import run_autosklearn1


def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
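
To make the contract of a `run` function concrete, here is a minimal sketch that uses only the standard library. It is not how `run_autosklearn1` is implemented: the toy dataset, `knn_predict`, and `toy_run` are hypothetical names introduced for illustration. The only point is the shape of the function: a configuration dictionary goes in, a scalar score to maximize comes out.

```python
# Hypothetical stand-in for a run function; NOT DeepHyper's implementation.
from collections import Counter

# Toy 1-D dataset of (feature, label) pairs, split into train and validation.
TRAIN = [(0.0, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (1.0, 1)]
VALID = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]


def knn_predict(x, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(TRAIN, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def toy_run(config):
    """Map a configuration (a dict) to a scalar score to be maximized."""
    k = config["n_neighbors"]
    correct = sum(knn_predict(x, k) == y for x, y in VALID)
    return correct / len(VALID)  # validation accuracy in [0, 1]


print(toy_run({"n_neighbors": 1}))
```

A search algorithm only ever sees the returned number, which is why the same search code can be reused for very different models.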


We are ready to go! But first, let us look at the problem provided by DeepHyper in `deephyper.sklearn.classifier.problem_autosklearn1` to better understand what is happening under the hood.

In [3]:
from deephyper.sklearn.classifier import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
  Conditions:
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    a
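
The `Conditions` entries say when a hyperparameter is active: for example, `C` only matters when the classifier is `Logistic` or `SVC`. The sketch below encodes the three conditions that are visible in the output above as plain Python; the printed output is truncated, so the remaining conditions are deliberately left out. `active_hyperparameters` is a hypothetical helper, not part of DeepHyper.

```python
def active_hyperparameters(config):
    """Return the classifier plus whichever of C, n_estimators and gamma
    are active under the three conditions visible in the printed problem."""
    classifier = config["classifier"]
    kernel = config.get("kernel")
    active = {"classifier": classifier}

    # C | classifier == 'Logistic' || C | classifier == 'SVC'
    if classifier in ("Logistic", "SVC") and "C" in config:
        active["C"] = config["C"]
    # n_estimators | classifier == 'RandomForest' || ... == 'AdaBoost'
    if classifier in ("RandomForest", "AdaBoost") and "n_estimators" in config:
        active["n_estimators"] = config["n_estimators"]
    # gamma | kernel == 'rbf' || gamma | kernel == 'poly' || ... == 'sigmoid'
    if kernel in ("rbf", "poly", "sigmoid") and "gamma" in config:
        active["gamma"] = config["gamma"]
    return active


example = {"classifier": "SVC", "C": 0.4, "kernel": "linear",
           "gamma": 2.7, "n_estimators": 10}
print(active_hyperparameters(example))
```

Here `gamma` is dropped because the kernel is `linear`, and `n_estimators` is dropped because the classifier is not a tree ensemble.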

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [5]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1
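
As a rough mental model of what an evaluator does, the following sketch submits configurations to a worker pool and gathers the scores as they finish. It uses `concurrent.futures` from the standard library rather than Ray, and `toy_run` and `evaluate_all` are hypothetical names; it is an analogy, not a description of DeepHyper internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(config):
    """Hypothetical run function: pretend to train, then return a score."""
    time.sleep(0.01)
    return 1.0 - abs(config["x"] - 0.5)


def evaluate_all(configs, num_workers=1):
    """Submit every configuration to a pool and gather the scores."""
    start = time.monotonic()  # timestamps are relative to this instant
    records = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {}
        for job_id, config in enumerate(configs, start=1):
            submitted = time.monotonic() - start
            futures[pool.submit(toy_run, config)] = (job_id, config, submitted)
        for future in as_completed(futures):
            job_id, config, submitted = futures[future]
            records.append({
                "job_id": job_id,
                **config,
                "objective": future.result(),
                "timestamp_submit": submitted,
                "timestamp_gather": time.monotonic() - start,
            })
    return sorted(records, key=lambda record: record["job_id"])


records = evaluate_all([{"x": 0.1}, {"x": 0.5}, {"x": 0.9}], num_workers=2)
for record in records:
    print(record)
```

With more workers, several configurations are in flight at once, which is the reason for distributing the evaluation in the first place.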




Finally, you can define a Bayesian optimization search called `CBO` (for Centralized Bayesian Optimization) and link to it the defined `problem_autosklearn1` and `evaluator`.

In [6]:
from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator)

In [7]:
results = search.search(10)

[2m[36m(pid=3969)[0m   from pandas import MultiIndex, Int64Index
100%|██████████| 10/10 [00:01<00:00,  4.10it/s, objective=0.984] 
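
The call `search.search(10)` evaluates ten configurations. The loop below sketches that outer structure with plain random search swapped in for Bayesian optimization, so it shows only the "propose, evaluate, record" cycle and none of the surrogate modeling that `CBO` performs. The search space, `sample_config`, `toy_run`, and `random_search` are all hypothetical.

```python
import random


def sample_config(rng):
    """Draw one configuration from a tiny, hypothetical search space."""
    return {"n_neighbors": rng.randint(1, 100)}


def toy_run(config):
    """Hypothetical objective; a real run function would train a model."""
    return 1.0 - abs(config["n_neighbors"] - 20) / 100


def random_search(run, max_evals, seed=0):
    """Evaluate `max_evals` random configurations and record each result."""
    rng = random.Random(seed)
    results = []
    for job_id in range(1, max_evals + 1):
        config = sample_config(rng)
        results.append({**config, "job_id": job_id, "objective": run(config)})
    return results


results = random_search(toy_run, max_evals=10)
best = max(results, key=lambda row: row["objective"])
print(best)
```

A Bayesian optimizer differs in how it proposes the next configuration: it uses the objectives recorded so far instead of sampling blindly.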

Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search and their corresponding `objective` value (i.e., validation accuracy), along with `timestamp_submit`, the time when the evaluator submitted the configuration for evaluation, and `timestamp_gather`, the time when the evaluator received the result of that evaluation (both are measured relative to the creation of the `Evaluator` instance).

In [8]:
results

Unnamed: 0,classifier,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,job_id,objective,timestamp_submit,timestamp_gather
0,Logistic,0.000986,,,,,,,1,0.893617,47.562966,49.772636
1,KNeighbors,,,,,,41.0,,2,0.946809,49.936445,49.963114
2,RandomForest,,,,48.0,51.0,,,3,0.957447,50.183954,50.229447
3,Logistic,0.000341,,,,,,,4,0.819149,50.377067,50.383312
4,SVC,6.3e-05,,linear,,,,,5,0.643617,50.594783,50.607125
5,SVC,1.6e-05,,sigmoid,,,,0.00418,6,0.643617,50.754105,50.765272
6,SVC,0.422234,,sigmoid,,,,2.779419,7,0.893617,50.913402,50.921376
7,RandomForest,,,,91.0,15.0,,,8,0.952128,51.130214,51.146671
8,MLP,,1.350762,,,,,,9,0.984043,51.292809,51.40871
9,MLP,,0.033863,,,,,,10,0.978723,51.618489,51.735319
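
Since both timestamps share the same origin, their difference gives the time each job spent between submission and gathering. The sketch below computes it with the standard `csv` module on the first three rows of the table above; `job_durations` is a hypothetical helper, and in practice the same subtraction could be done directly on the returned dataframe.

```python
import csv
import io

# Header and first three rows of the table above, copied as-is.
CSV_TEXT = """\
Unnamed: 0,classifier,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,job_id,objective,timestamp_submit,timestamp_gather
0,Logistic,0.000986,,,,,,,1,0.893617,47.562966,49.772636
1,KNeighbors,,,,,,41.0,,2,0.946809,49.936445,49.963114
2,RandomForest,,,,48.0,51.0,,,3,0.957447,50.183954,50.229447
"""


def job_durations(csv_text):
    """Return {job_id: seconds between submission and gathering}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        int(row["job_id"]):
            float(row["timestamp_gather"]) - float(row["timestamp_submit"])
        for row in reader
    }


durations = job_durations(CSV_TEXT)
for job_id, seconds in durations.items():
    print(f"job {job_id}: {seconds:.3f} s")
```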


The `deephyper-analytics` command-line tool is one way of analyzing this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [9]:
!deephyper-analytics topk results.csv -k 3

'0':
  C: null
  alpha: 1.3507621846
  classifier: MLP
  gamma: null
  job_id: 9
  kernel: null
  max_depth: null
  n_estimators: null
  n_neighbors: null
  objective: 0.9840425532
  timestamp_gather: 51.4087100029
  timestamp_submit: 51.2928090096
'1':
  C: null
  alpha: 0.0338633848
  classifier: MLP
  gamma: null
  job_id: 10
  kernel: null
  max_depth: null
  n_estimators: null
  n_neighbors: null
  objective: 0.9787234043
  timestamp_gather: 51.7353191376
  timestamp_submit: 51.618489027
'2':
  C: null
  alpha: null
  classifier: RandomForest
  gamma: null
  job_id: 3
  kernel: null
  max_depth: 48.0
  n_estimators: 51.0
  n_neighbors: null
  objective: 0.9574468085
  timestamp_gather: 50.229446888
  timestamp_submit: 50.1839540005
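
Conceptually, a top-k selection is a sort on the `objective` column followed by a cut. The sketch below reproduces the ranking on the ten objectives from the results table; `top_k` is a hypothetical helper and makes no claim about how `deephyper-analytics` is implemented.

```python
# (job_id, classifier, objective) triples copied from the results table above.
ROWS = [
    (1, "Logistic", 0.893617),
    (2, "KNeighbors", 0.946809),
    (3, "RandomForest", 0.957447),
    (4, "Logistic", 0.819149),
    (5, "SVC", 0.643617),
    (6, "SVC", 0.643617),
    (7, "SVC", 0.893617),
    (8, "RandomForest", 0.952128),
    (9, "MLP", 0.984043),
    (10, "MLP", 0.978723),
]


def top_k(rows, k):
    """Return the k rows with the highest objective, best first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:k]


for job_id, classifier, objective in top_k(ROWS, 3):
    print(job_id, classifier, objective)
```

The selected jobs (9, 10, and 3) match the three entries printed by the command above.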



## Regression

This part of the tutorial focuses on regression.

Create a `run` function that trains and evaluates the model corresponding to the configuration generated by the search. This function must return a scalar value (typically the validation $R^2$), which the search algorithm maximizes. For *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.regressor.run_autosklearn1` and wrap it with our data as follows:

In [10]:
from deephyper.sklearn.regressor import run_autosklearn1


def load_data():
    from sklearn.datasets import fetch_california_housing

    X, y = fetch_california_housing(return_X_y=True)
    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
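
For reference, the coefficient of determination $R^2$ compares the squared error of the model with the squared error of always predicting the mean. The function below is a plain-Python sketch of the usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$; `r_squared` is a hypothetical name, and we assume, without having checked the implementation, that the provided `run` function reports this quantity on a validation split.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
print(r_squared(y_true, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # worse than the mean
```

Unlike accuracy, $R^2$ has no lower bound: a model that does worse than the mean predictor gets a negative score, which is why negative objectives can show up in a regression search.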

We are ready to go! But first, let us look at the problem provided by DeepHyper to better understand what is happening under the hood.

In [11]:
from deephyper.sklearn.regressor import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
  Conditions:
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | r
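
Several hyperparameters above are marked "on log-scale". One common way to sample on a log scale is to draw uniformly between the logarithms of the bounds and exponentiate, so that every order of magnitude is equally likely. The sketch below illustrates this for the range `[1e-05, 10.0]`; `sample_log_uniform` is a hypothetical helper, and we do not claim the library samples in exactly this way.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 10.0, rng) for _ in range(1000)]

# 0.01 is the geometric midpoint of [1e-5, 10], so roughly half of the
# samples should fall below it.
fraction_below = sum(s < 0.01 for s in samples) / len(samples)
print(f"fraction below 0.01: {fraction_below:.2f}")
```

This also explains the default of `0.01`: it sits in the middle of the range when the range is viewed on a log scale.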

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [13]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1




Finally, you can define a Bayesian optimization search called `CBO` (for Centralized Bayesian Optimization) and link to it the defined `Problem` and `evaluator`.

In [14]:
from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator)

In [15]:
results = search.search(10)

100%|██████████| 10/10 [00:35<00:00,  3.56s/it, objective=0.803]  

Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search and their corresponding `objective` value (i.e., validation $R^2$), along with `timestamp_submit`, the time when the evaluator submitted the configuration for evaluation, and `timestamp_gather`, the time when the evaluator received the result of that evaluation (both are measured relative to the creation of the `Evaluator` instance).

In [16]:
results

Unnamed: 0,regressor,C,alpha,kernel,max_depth,n_estimators,n_neighbors,gamma,job_id,objective,timestamp_submit,timestamp_gather
0,Linear,,,,,,,,1,0.597049,43.84133,44.715718
1,KNeighbors,,,,,,41.0,,2,0.666496,44.862824,45.171929
2,RandomForest,,,,48.0,51.0,,,3,0.80251,45.384461,48.056313
3,RandomForest,,,,7.0,245.0,,,4,0.719056,48.200928,54.745264
4,SVR,6.3e-05,,linear,,,,,5,0.322115,54.949263,59.10784
5,SVR,1.6e-05,,sigmoid,,,,0.00418,6,-0.059354,59.249552,64.55257
6,SVR,0.422234,,sigmoid,,,,2.779419,7,-321050.500503,64.697452,73.575086
7,RandomForest,,,,91.0,15.0,,,8,0.796552,73.777804,74.575513
8,MLP,,1.350762,,,,,,9,0.708333,74.717424,76.848931
9,MLP,,0.033863,,,,,,10,0.771833,77.049758,80.177991


The `deephyper-analytics` command-line tool is one way of analyzing this type of file. For example, to output the best configurations, we can use the `topk` functionality.

In [17]:
!deephyper-analytics topk results.csv -k 3

'0':
  C: null
  alpha: null
  gamma: null
  job_id: 3
  kernel: null
  max_depth: 48.0
  n_estimators: 51.0
  n_neighbors: null
  objective: 0.8025103513
  regressor: RandomForest
  timestamp_gather: 48.0563130379
  timestamp_submit: 45.3844611645
'1':
  C: null
  alpha: null
  gamma: null
  job_id: 8
  kernel: null
  max_depth: 91.0
  n_estimators: 15.0
  n_neighbors: null
  objective: 0.7965520433
  regressor: RandomForest
  timestamp_gather: 74.5755131245
  timestamp_submit: 73.7778041363
'2':
  C: null
  alpha: 0.0338633848
  gamma: null
  job_id: 10
  kernel: null
  max_depth: null
  n_estimators: null
  n_neighbors: null
  objective: 0.7718326426
  regressor: MLP
  timestamp_gather: 80.1779909134
  timestamp_submit: 77.0497579575
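
Each entry mixes hyperparameters with bookkeeping fields, and inactive hyperparameters appear as `null`. The sketch below turns the best entry into a compact configuration dictionary by dropping both. `to_config`, `BOOKKEEPING`, and `INTEGER_KEYS` are hypothetical names; the integer cast relies on the `UniformInteger` types listed in the problem above.

```python
# Best entry from the output above, with YAML null written as None.
best = {
    "C": None, "alpha": None, "gamma": None, "job_id": 3, "kernel": None,
    "max_depth": 48.0, "n_estimators": 51.0, "n_neighbors": None,
    "objective": 0.8025103513, "regressor": "RandomForest",
    "timestamp_gather": 48.0563130379, "timestamp_submit": 45.3844611645,
}

# Fields that describe the job rather than the model.
BOOKKEEPING = {"job_id", "objective", "timestamp_gather", "timestamp_submit"}
# Hyperparameters declared as UniformInteger in the problem.
INTEGER_KEYS = {"max_depth", "n_estimators", "n_neighbors"}


def to_config(entry):
    """Keep only the active hyperparameters, restoring integer types."""
    config = {}
    for key, value in entry.items():
        if key in BOOKKEEPING or value is None:
            continue
        config[key] = int(value) if key in INTEGER_KEYS else value
    return config


print(to_config(best))
```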

