In [1]:
!pip install auto-sklearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting auto-sklearn
  Downloading auto-sklearn-0.15.0.tar.gz (6.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting ConfigSpace<0.5,>=0.4.21
  Downloading ConfigSpace-0.4.21-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
Collecting scikit-learn<0.25.0,>=0.24.0
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting distro
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Collecting smac<1.3,>=1.2
  Downloading smac-1.2.tar.gz (260 kB)
Collecting pynisher<0.7,

In [1]:
%matplotlib inline


# Parallel Usage: Spawning workers from the command line

*Auto-sklearn* uses
[dask.distributed](https://distributed.dask.org/en/latest/index.html)
for parallel optimization.

This example shows how to start the dask scheduler and spawn
workers for *Auto-sklearn* manually from the command line. Use this example
as a starting point to parallelize *Auto-sklearn* across multiple
machines.

To run *Auto-sklearn* in parallel on a single machine check out the example
`sphx_glr_examples_60_search_example_parallel_n_jobs.py`.

If you want to start everything manually from within Python,
please see `sphx_glr_examples_60_search_example_parallel_manual_spawning_python.py`.

**NOTE:** The above example is disabled due to issue https://github.com/dask/distributed/issues/5627


You can learn more about the dask command line interface from
https://docs.dask.org/en/latest/setup/cli.html.

When manually passing a dask client to Auto-sklearn, all logic
must be guarded by ``if __name__ == "__main__":`` statements! We use
multiple such statements to properly render this example as a notebook
and also allow execution via the command line.

## Background

To run Auto-sklearn distributed on multiple machines we need to set
up three components:

1. **Auto-sklearn and a dask client**. This will manage all workload, find new
   configurations to evaluate and submit jobs via a dask client. As this
   runs Bayesian optimization it should be executed on its own CPU.
2. **The dask workers**. They will do the actual work of running machine
   learning algorithms and require their own CPU each.
3. **The scheduler**. It manages the communication between the dask client
   and the different dask workers. As the client and all workers connect
   to the scheduler it must be started first. This is a light-weight job
   and does not require its own CPU.

We will now start these three components in reverse order: scheduler,
workers and client. Also, in a real setup, the scheduler and the workers should
be started from the command line and not from within a Python file via
the ``subprocess`` module as done here (for the sake of having a self-contained
example).


## Import statements



In [7]:
import multiprocessing
import subprocess
import time

import dask.distributed
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

from autosklearn.classification import AutoSklearnClassifier
from autosklearn.constants import MULTICLASS_CLASSIFICATION

tmp_folder = "/tmp/autosklearn_parallel_3_example_tmp"

worker_processes = []

## 0. Setup client-scheduler communication

In this example the dask scheduler is started without an explicit
address and port. Instead, the scheduler takes a free port and stores
relevant information in a file for which we provide the name and
location. This filename is also given to the workers so they can find all
relevant information to connect to the scheduler.
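
The fixed one-second sleeps used later in this example can be too short on a slow machine. A minimal sketch of a more defensive alternative is to poll for the scheduler file before connecting. The helper name `wait_for_scheduler_file` and its parameters are hypothetical, and the sketch assumes the file is JSON with an `address` entry; check the file written by your dask version before relying on this.

```python
import json
import os
import time


def wait_for_scheduler_file(path, timeout=30.0, poll_interval=0.1):
    """Block until `path` holds JSON with an "address" entry and return it.

    Hypothetical helper: the "address" key is an assumption about the
    layout of the scheduler file.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            try:
                with open(path) as fh:
                    info = json.load(fh)
            except json.JSONDecodeError:
                # The file exists but is empty or still being written.
                info = None
            if isinstance(info, dict) and "address" in info:
                return info["address"]
        time.sleep(poll_interval)
    raise TimeoutError(f"No scheduler information found in {path!r}")
```

With such a helper, a call like `wait_for_scheduler_file(scheduler_file_name)` could replace the fixed sleep before the workers and the client are started.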



In [8]:
scheduler_file_name = "scheduler-file.json"

## 1. Start scheduler

Starting the scheduler is done with the following bash command:

    dask-scheduler --scheduler-file scheduler-file.json --idle-timeout 10

We will now execute this bash command from within Python to have a
self-contained example:



In [9]:
def cli_start_scheduler(scheduler_file_name):
    command = f"dask-scheduler --scheduler-file {scheduler_file_name} --idle-timeout 10"
    # subprocess.run() blocks until the scheduler process exits, so no
    # polling loop on the return code is needed afterwards.
    subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        shell=True,
        check=True,
    )


if __name__ == "__main__":
    process_python_worker = multiprocessing.Process(
        target=cli_start_scheduler,
        args=(scheduler_file_name,),
    )
    process_python_worker.start()
    worker_processes.append(process_python_worker)

    # Wait a second for the scheduler to become available
    time.sleep(1)

## 2. Start two workers

Starting a worker is done with the following bash command:

    DASK_DISTRIBUTED__WORKER__DAEMON=False \
        dask-worker --nthreads 1 --lifetime 35 --memory-limit 0 \
        --scheduler-file scheduler-file.json

We will now execute this bash command from within Python to have a
self-contained example. Please note that
``DASK_DISTRIBUTED__WORKER__DAEMON=False`` is required in this
case as dask-worker creates a new process, which by default is not
compatible with Auto-sklearn creating new processes in the workers themselves.
We disable dask's memory management by passing ``--memory-limit 0`` as
Auto-sklearn does the memory management itself.
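
As a hedged alternative to prefixing the environment variable in a shell string, the same invocation can be described as an argument list plus an environment mapping. The function name `build_worker_invocation` is hypothetical; the flags and the variable are the ones shown in the bash command above.

```python
import os
import shlex


def build_worker_invocation(scheduler_file_name, nthreads=1, lifetime=35):
    """Return (argv, env) for one dask-worker process (illustrative helper)."""
    argv = [
        "dask-worker",
        "--nthreads", str(nthreads),
        "--lifetime", str(lifetime),
        # 0 disables dask's own memory management, as described above.
        "--memory-limit", "0",
        "--scheduler-file", scheduler_file_name,
    ]
    # Copy the current environment and add the daemon setting on top.
    env = dict(os.environ, DASK_DISTRIBUTED__WORKER__DAEMON="False")
    return argv, env


argv, env = build_worker_invocation("scheduler-file.json")
print(shlex.join(argv))
```

The pair could then be handed to `subprocess.run(argv, env=env)`, which avoids `shell=True` and any quoting issues in the scheduler file name.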



In [10]:
def cli_start_worker(scheduler_file_name):
    command = (
        "DASK_DISTRIBUTED__WORKER__DAEMON=False "
        "dask-worker --nthreads 1 --lifetime 35 --memory-limit 0 "
        f"--scheduler-file {scheduler_file_name}"
    )
    # subprocess.run() blocks until the worker process exits, so no
    # polling loop on the return code is needed afterwards.
    subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True
    )


if __name__ == "__main__":
    for _ in range(2):
        process_cli_worker = multiprocessing.Process(
            target=cli_start_worker,
            args=(scheduler_file_name,),
        )
        process_cli_worker.start()
        worker_processes.append(process_cli_worker)

    # Wait a second for workers to become available
    time.sleep(1)

## 3. Creating a client in Python

Finally, we create a dask client which connects to the scheduler via
the information in the file created by the scheduler.



In [11]:
client = dask.distributed.Client(scheduler_file=scheduler_file_name)

### Start Auto-sklearn



In [12]:
if __name__ == "__main__":
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1
    )

    automl = AutoSklearnClassifier(
        delete_tmp_folder_after_terminate=False,
        time_left_for_this_task=30,
        per_run_time_limit=10,
        memory_limit=2048,
        tmp_folder=tmp_folder,
        seed=777,
        # n_jobs is ignored internally as we pass a dask client.
        n_jobs=1,
        # Pass a dask client which connects to the previously constructed cluster.
        dask_client=client,
    )
    automl.fit(X_train, y_train)

    automl.fit_ensemble(
        y_train,
        task=MULTICLASS_CLASSIFICATION,
        dataset_name="breast_cancer",
        ensemble_kwargs={"ensemble_size": 20},
        ensemble_nbest=50,
    )

    predictions = automl.predict(X_test)
    print(automl.sprint_statistics())
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

auto-sklearn results:
  Dataset name: 3ea56c88-58b8-11ed-80ab-0242ac1c0002
  Metric: accuracy
  Best validation score: 0.992908
  Number of target algorithm runs: 4
  Number of successful target algorithm runs: 4
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

Accuracy score 0.965034965034965


## Wait until all workers are closed

This is only necessary if the workers are started from within this Python
script. In a real application one would start them directly from the command
line.
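
A plain `join()` waits indefinitely if a worker never exits. The sketch below shows one possible way to bound the wait; `join_all` is a hypothetical helper, and threads stand in for the worker processes here only because both expose `join(timeout)` and `is_alive()`.

```python
import threading


def join_all(workers, timeout_per_worker=5.0):
    """Join each worker and return the ones still alive after the timeout."""
    stragglers = []
    for worker in workers:
        worker.join(timeout_per_worker)
        if worker.is_alive():
            stragglers.append(worker)
    return stragglers


# Demo with threads as stand-ins: one finishes at once, one blocks on an event.
release = threading.Event()
fast = threading.Thread(target=lambda: None)
slow = threading.Thread(target=release.wait, daemon=True)
for thread in (fast, slow):
    thread.start()

left_over = join_all([fast, slow], timeout_per_worker=0.2)
release.set()  # let the blocked stand-in finish
```

With `multiprocessing.Process` objects, the returned stragglers could be stopped with `terminate()` if that is acceptable for the application.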



In [13]:
if __name__ == "__main__":
    process_python_worker.join()
    for process in worker_processes:
        process.join()

In [13]:
%matplotlib inline


# Parallel Usage on a single machine

*Auto-sklearn* uses
[dask.distributed](https://distributed.dask.org/en/latest/index.html)
for parallel optimization.

This example shows how to start *Auto-sklearn* to use multiple cores on a
single machine. Using this mode, *Auto-sklearn* starts a dask cluster,
manages the workers and takes care of shutting down the cluster once the
computation is done.
To run *Auto-sklearn* on multiple machines check the example
`sphx_glr_examples_60_search_example_parallel_manual_spawning_cli.py`.


In [15]:
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

## Data Loading



In [16]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

## Build and fit a classifier

To use ``n_jobs`` we must guard the code with ``if __name__ == "__main__":``.



In [18]:
if __name__ == "__main__":

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder="/tmp/autosklearn_parallel_1_example_tmp",
        n_jobs=4,
        # Each one of the 4 jobs is allocated 3GB
        memory_limit=3072,
        seed=5,
    )
    automl.fit(X_train, y_train, dataset_name="breast_cancer")

    # Print statistics about the auto-sklearn run such as number of
    # iterations, number of models failed with a time out.
    print(automl.sprint_statistics())

auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.985816
  Number of target algorithm runs: 35
  Number of successful target algorithm runs: 35
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
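
Because the memory limit applies to each job, the memory a run may claim grows with `n_jobs`: in the cell above, 4 jobs at 3072 MB each add up to 12288 MB. The following sketch picks a job count that fits both the CPU count and a memory budget. `plan_jobs` and `available_mb` are hypothetical names, and the budget has to be supplied by the caller.

```python
import os


def plan_jobs(available_mb, memory_limit_mb, cpu_count=None):
    """Largest job count that fits the CPU count and the memory budget."""
    cpu_count = cpu_count or os.cpu_count() or 1
    jobs_by_memory = available_mb // memory_limit_mb
    # Always run at least one job, even if the budget is tight.
    return max(1, min(cpu_count, jobs_by_memory))


print(plan_jobs(available_mb=16384, memory_limit_mb=3072, cpu_count=8))  # 5
print(plan_jobs(available_mb=12288, memory_limit_mb=3072, cpu_count=4))  # 4
```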



In [19]:
%matplotlib inline


# Random Search

A crucial feature of *auto-sklearn* is automatically optimizing the hyperparameters through SMAC,
introduced [here](https://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf).
Additionally, it is possible to use
[random search](https://www.jmlr.org/papers/v13/bergstra12a.html) instead of
SMAC, as demonstrated in the example below. Furthermore, the example also demonstrates how to use
[Random Online Aggressive Racing (ROAR)](https://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf)
as yet another alternative optimization strategy.
Both examples are intended to show how the optimization strategy in *auto-sklearn* can be adapted.
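
For readers new to the idea, random search itself fits in a few lines. The sketch below is a generic, library-free illustration on a toy objective; it is not how *auto-sklearn* or SMAC implement it, and `random_search`, `toy_objective` and the search space are made up for this example.

```python
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample configurations uniformly at random and keep the cheapest one."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(n_trials):
        # Draw every hyperparameter independently from its (low, high) range.
        config = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        cost = objective(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def toy_objective(config):
    # Toy cost with its minimum at x=1, y=-2.
    return (config["x"] - 1.0) ** 2 + (config["y"] + 2.0) ** 2


space = {"x": (-5.0, 5.0), "y": (-5.0, 5.0)}
best_config, best_cost = random_search(toy_objective, space, n_trials=200)
print(best_config, best_cost)
```

Unlike a model-based optimizer, each draw here ignores all previous results; only the bookkeeping of the best configuration carries over between trials.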


In [20]:
from pprint import pprint

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

from smac.facade.roar_facade import ROAR
from smac.scenario.scenario import Scenario

import autosklearn.classification

## Data Loading



In [21]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

## Fit a classifier using ROAR



In [22]:
def get_roar_object_callback(
    scenario_dict,
    seed,
    ta,
    ta_kwargs,
    metalearning_configurations,
    n_jobs,
    dask_client,
    multi_objective_algorithm,  # This argument will be ignored as ROAR does not yet support multi-objective optimization
    multi_objective_kwargs,
):
    """Random online aggressive racing."""

    if n_jobs > 1 or (dask_client and len(dask_client.nthreads()) > 1):
        raise ValueError(
            "Please make sure to guard the code invoking Auto-sklearn by "
            "`if __name__ == '__main__'` and remove this exception."
        )

    scenario = Scenario(scenario_dict)
    return ROAR(
        scenario=scenario,
        rng=seed,
        tae_runner=ta,
        tae_runner_kwargs=ta_kwargs,
        run_id=seed,
        dask_client=dask_client,
        n_jobs=n_jobs,
    )


automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=15,
    tmp_folder="/tmp/autosklearn_random_search_example_tmp",
    initial_configurations_via_metalearning=0,
    # The callback to get the SMAC object
    get_smac_object_callback=get_roar_object_callback,
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

print("#" * 80)
print("Results for ROAR.")
# Print the final ensemble constructed by auto-sklearn via ROAR.
pprint(automl.show_models(), indent=4)
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

################################################################################
Results for ROAR.
{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f114e313150>,
           'cost': 0.028368794326241176,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f11445b8750>,
           'ensemble_weight': 0.06,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f11456bcf90>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f114099e990>,
     

## Fit a classifier using Random Search



In [23]:
def get_random_search_object_callback(
    scenario_dict,
    seed,
    ta,
    ta_kwargs,
    metalearning_configurations,
    n_jobs,
    dask_client,
    multi_objective_algorithm,  # This argument will be ignored as ROAR does not yet support multi-objective optimization
    multi_objective_kwargs,
):
    """Random search"""

    if n_jobs > 1 or (dask_client and len(dask_client.nthreads()) > 1):
        raise ValueError(
            "Please make sure to guard the code invoking Auto-sklearn by "
            "`if __name__ == '__main__'` and remove this exception."
        )

    scenario_dict["minR"] = len(scenario_dict["instances"])
    scenario_dict["initial_incumbent"] = "RANDOM"
    scenario = Scenario(scenario_dict)
    return ROAR(
        scenario=scenario,
        rng=seed,
        tae_runner=ta,
        tae_runner_kwargs=ta_kwargs,
        run_id=seed,
        dask_client=dask_client,
        n_jobs=n_jobs,
    )


automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=15,
    tmp_folder="/tmp/autosklearn_random_search_example_tmp",
    initial_configurations_via_metalearning=0,
    # Passing the callback to get the SMAC object
    get_smac_object_callback=get_random_search_object_callback,
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

print("#" * 80)
print("Results for random search.")

# Print the final ensemble constructed by auto-sklearn via random search.
pprint(automl.show_models(), indent=4)

# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())

predictions = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

################################################################################
Results for random search.
{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f1141101850>,
           'cost': 0.07092198581560283,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f11419f7250>,
           'ensemble_weight': 0.06,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f1140ae9850>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': PassiveAggressiveClassifier(C=0.10318256510142626, average=True, max_iter=32,
                            random_state=1, tol=0.0013607858153657413,
                            warm_start=True)},
    3: {   'balancing': Balancing(random_state=1, strategy='weighting'),
           'classifie

In [24]:
%matplotlib inline


# Sequential Usage

By default, *auto-sklearn* fits the machine learning models and builds their
ensembles in parallel. However, it is also possible to run the two processes
sequentially. The example below shows how to first fit the models and build the
ensembles afterwards.
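
To make the ensemble-building step more concrete, here is a toy sketch of greedy ensemble selection with replacement, one common way to build a weighted ensemble from already fitted models. It is an illustration under stated assumptions (binary labels, one predicted probability per sample, error rate as the metric) and not *auto-sklearn*'s `EnsembleSelection` implementation; `greedy_ensemble_selection` and the sample predictions are made up.

```python
from collections import Counter


def greedy_ensemble_selection(predictions, y_true, ensemble_size):
    """Pick models one at a time (repeats allowed); return weights per model.

    predictions: dict mapping model id -> list of P(class 1), one per sample.
    """

    def error(avg_probs):
        # Fraction of misclassified samples at a 0.5 threshold.
        wrong = sum((p >= 0.5) != bool(t) for p, t in zip(avg_probs, y_true))
        return wrong / len(y_true)

    chosen = []
    running_sum = [0.0] * len(y_true)
    for _ in range(ensemble_size):
        best_id, best_err = None, None
        k = len(chosen) + 1
        for model_id, probs in sorted(predictions.items()):
            # Error of the ensemble if this model were added next.
            avg = [(s + p) / k for s, p in zip(running_sum, probs)]
            err = error(avg)
            if best_err is None or err < best_err:
                best_id, best_err = model_id, err
        chosen.append(best_id)
        running_sum = [s + p for s, p in zip(running_sum, predictions[best_id])]

    counts = Counter(chosen)
    return {model_id: n / ensemble_size for model_id, n in counts.items()}


val_predictions = {2: [0.9, 0.8, 0.2, 0.4], 3: [0.6, 0.4, 0.1, 0.9]}
y_val = [1, 1, 0, 1]
weights = greedy_ensemble_selection(val_predictions, y_val, ensemble_size=4)
print(weights)  # {2: 0.75, 3: 0.25}
```

Each model's weight is simply the share of selection rounds in which it was picked, which is why the weights sum to one.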


In [25]:
from pprint import pprint

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

## Data Loading



In [26]:
from autosklearn.ensembles.ensemble_selection import EnsembleSelection

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

## Build and fit the classifier



In [27]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    tmp_folder="/tmp/autosklearn_sequential_example_tmp",
    # Do not construct ensembles in parallel to avoid using more than one
    # core at a time. The ensemble will be constructed after auto-sklearn
    # finished fitting all machine learning models.
    ensemble_class=None,
    delete_tmp_folder_after_terminate=False,
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# This call to fit_ensemble uses all models trained in the previous call
# to fit to build an ensemble which can be used with automl.predict()
automl.fit_ensemble(y_train, ensemble_class=EnsembleSelection)

RunKey(config_id=1, instance_id='{"task_id": "breast_cancer"}', seed=0, budget=0.0) RunValue(cost=0.028368794326241176, time=2.104825258255005, status=<StatusType.SUCCESS: 1>, starttime=1667178738.5084352, endtime=1667178740.6353476, additional_info={'duration': 2.008713722229004, 'num_run': 2, 'train_loss': 0.0, 'configuration_origin': 'Initial design'})
RunKey(config_id=2, instance_id='{"task_id": "breast_cancer"}', seed=0, budget=0.0) RunValue(cost=0.028368794326241176, time=1.2050554752349854, status=<StatusType.SUCCESS: 1>, starttime=1667178740.6405835, endtime=1667178741.8644972, additional_info={'duration': 1.1308479309082031, 'num_run': 3, 'train_loss': 0.01754385964912286, 'configuration_origin': 'Initial design'})
RunKey(config_id=3, instance_id='{"task_id": "breast_cancer"}', seed=0, budget=0.0) RunValue(cost=0.05673758865248224, time=1.904345989227295, status=<StatusType.SUCCESS: 1>, starttime=1667178741.8683646, endtime=1667178743.790697, additional_info={'duration': 1.832

AutoSklearnClassifier(delete_tmp_folder_after_terminate=False,
                      ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=6, time_left_for_this_task=60,
                      tmp_folder='/tmp/autosklearn_sequential_example_tmp')

## Print the final ensemble constructed by auto-sklearn



In [28]:
pprint(automl.show_models(), indent=4)

{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f11442e8910>,
           'cost': 0.028368794326241176,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f114c2e4050>,
           'ensemble_weight': 0.1,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f1143b988d0>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f11486689d0>,
           'cost': 0.028368794326241176,
           'data_preprocessor': <autosklearn.pipeline.components

## Get the Score of the final ensemble



In [29]:
predictions = automl.predict(X_test)
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.985816
  Number of target algorithm runs: 22
  Number of successful target algorithm runs: 21
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 1
  Number of target algorithms that exceeded the memory limit: 0

Accuracy score 0.9440559440559441


In [30]:
%matplotlib inline


# Successive Halving

This advanced example illustrates how to interact with
the SMAC callback and get relevant information from the run, like
the number of iterations. In particular, it shows how to select
the intensification strategy to use in SMAC, in this case
[SuccessiveHalving](http://proceedings.mlr.press/v80/falkner18a/falkner18a-supp.pdf).

This results in an adaptation of the [BOHB algorithm](http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf).
It uses Successive Halving instead of [Hyperband](https://jmlr.org/papers/volume18/16-558/16-558.pdf), and could be abbreviated as BOSH.
To get the BOHB algorithm, simply import Hyperband and use it as the intensification strategy.
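
The core loop of successive halving can be sketched without any library: evaluate all candidates on a small budget, keep the best `1/eta` fraction, multiply the budget by `eta`, and repeat. The code below is a generic textbook-style sketch, not SMAC's implementation, and SMAC's exact budget schedule may differ; `successive_halving` and `toy_loss` are hypothetical names.

```python
def successive_halving(configs, evaluate, initial_budget, max_budget, eta=2):
    """Return the surviving config and the (budget, n_configs) of each round."""
    survivors = list(configs)
    budget = float(initial_budget)
    rounds = []
    while True:
        rounds.append((budget, len(survivors)))
        # Rank all surviving configurations by their loss at this budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Keep the best 1/eta fraction, but never fewer than one.
        survivors = ranked[: max(1, len(ranked) // eta)]
        if len(survivors) == 1 or budget * eta > max_budget:
            break
        budget *= eta
    return survivors[0], rounds


def toy_loss(config, budget):
    # Toy loss: best at config == 3, and slightly lower with more budget.
    return abs(config - 3) + 1.0 / budget


winner, rounds = successive_halving(
    range(8), toy_loss, initial_budget=10, max_budget=100, eta=2
)
print(winner)  # 3
print(rounds)  # [(10.0, 8), (20.0, 4), (40.0, 2)]
```

Most of the evaluations happen at the cheapest budget, and only the candidates that keep ranking well are ever run at the larger ones.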


In [31]:
from pprint import pprint

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

## Define a callback that instantiates SuccessiveHalving



In [32]:
def get_smac_object_callback(budget_type):
    def get_smac_object(
        scenario_dict,
        seed,
        ta,
        ta_kwargs,
        metalearning_configurations,
        n_jobs,
        dask_client,
        multi_objective_algorithm,  # This argument will be ignored as SH does not yet support multi-objective optimization
        multi_objective_kwargs,
    ):
        from smac.facade.smac_ac_facade import SMAC4AC
        from smac.intensification.successive_halving import SuccessiveHalving
        from smac.runhistory.runhistory2epm import RunHistory2EPM4LogCost
        from smac.scenario.scenario import Scenario

        if n_jobs > 1 or (dask_client and len(dask_client.nthreads()) > 1):
            raise ValueError(
                "Please make sure to guard the code invoking Auto-sklearn by "
                "`if __name__ == '__main__'` and remove this exception."
            )

        scenario = Scenario(scenario_dict)
        if len(metalearning_configurations) > 0:
            default_config = scenario.cs.get_default_configuration()
            initial_configurations = [default_config] + metalearning_configurations
        else:
            initial_configurations = None
        rh2EPM = RunHistory2EPM4LogCost

        ta_kwargs["budget_type"] = budget_type

        return SMAC4AC(
            scenario=scenario,
            rng=seed,
            runhistory2epm=rh2EPM,
            tae_runner=ta,
            tae_runner_kwargs=ta_kwargs,
            initial_configurations=initial_configurations,
            run_id=seed,
            intensifier=SuccessiveHalving,
            intensifier_kwargs={
                "initial_budget": 10.0,
                "max_budget": 100,
                "eta": 2,
                "min_chall": 1,
            },
            n_jobs=n_jobs,
            dask_client=dask_client,
        )

    return get_smac_object

## Data Loading



In [33]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, shuffle=True
)

## Build and fit a classifier



In [34]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=40,
    per_run_time_limit=10,
    tmp_folder="/tmp/autosklearn_sh_example_tmp",
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default argument setting
    # for AutoSklearnClassifier. It is explicitly specified in this example
    # for demonstration purposes.
    resampling_strategy="holdout",
    resampling_strategy_arguments={"train_size": 0.67},
    include={
        "classifier": [
            "extra_trees",
            "gradient_boosting",
            "random_forest",
            "sgd",
            "passive_aggressive",
        ],
        "feature_preprocessor": ["no_preprocessing"],
    },
    get_smac_object_callback=get_smac_object_callback("iterations"),
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

pprint(automl.show_models(), indent=4)
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))





{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f11419474d0>,
           'cost': 0.021276595744680882,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f114192b550>,
           'ensemble_weight': 0.04,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f1141947710>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=64, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f114192d210>,
           'cost': 0.028368794326241176,
           'data_preprocessor': <autosklearn.pipeline.components

## We can also use cross-validation with successive halving



In [35]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, shuffle=True
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=40,
    per_run_time_limit=10,
    tmp_folder="/tmp/autosklearn_sh_example_tmp_01",
    disable_evaluator_output=False,
    resampling_strategy="cv",
    include={
        "classifier": [
            "extra_trees",
            "gradient_boosting",
            "random_forest",
            "sgd",
            "passive_aggressive",
        ],
        "feature_preprocessor": ["no_preprocessing"],
    },
    get_smac_object_callback=get_smac_object_callback("iterations"),
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# Print the final ensemble constructed by auto-sklearn.
pprint(automl.show_models(), indent=4)
automl.refit(X_train, y_train)
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))





{   2: {   'cost': 0.046948356807511755,
           'ensemble_weight': 0.04,
           'estimators': [   {   'balancing': Balancing(random_state=1),
                                 'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f113f8ac550>,
                                 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f1140bba210>,
                                 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f113f8ac410>,
                                 'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=64, n_jobs=1,
                       random_state=1, warm_start=True)},
                             {   'balancing': Balancing(random_state=1),
                                 'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f113fa17810>,

## Use an iterative fit cross-validation with successive halving



In [36]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, shuffle=True
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=40,
    per_run_time_limit=10,
    tmp_folder="/tmp/autosklearn_sh_example_tmp_cv_02",
    disable_evaluator_output=False,
    resampling_strategy="cv-iterative-fit",
    include={
        "classifier": [
            "extra_trees",
            "gradient_boosting",
            "random_forest",
            "sgd",
            "passive_aggressive",
        ],
        "feature_preprocessor": ["no_preprocessing"],
    },
    get_smac_object_callback=get_smac_object_callback("iterations"),
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# Print the final ensemble constructed by auto-sklearn.
pprint(automl.show_models(), indent=4)
automl.refit(X_train, y_train)
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))





{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f1140bada10>,
           'cost': 0.046948356807511755,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f11406ffa10>,
           'ensemble_weight': 0.32,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f1140bad890>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': None},
    3: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f1148080990>,
           'cost': 0.05164319248826292,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f114809ced0>,
           'ensemble_weight': 0.1,
           'f

## Use subsampling as a budget in Auto-sklearn
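
To make the idea of a subsample budget concrete, the sketch below assumes the budget is read as the percentage of training rows a model is allowed to see. It is not Auto-sklearn's internal code: `subsample`, `X_toy` and `y_toy` are hypothetical names, and the sampling is plain random sampling without stratification.

```python
import random


def subsample(X, y, budget, seed=1):
    """Return a random subset holding `budget` percent of the rows."""
    if not 0 < budget <= 100:
        raise ValueError("budget must be in (0, 100]")
    # Always keep at least one row, even for very small budgets.
    n_keep = max(1, round(len(X) * budget / 100))
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(X)), n_keep))
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: 20 rows with alternating labels.
X_toy = [[i] for i in range(20)]
y_toy = [i % 2 for i in range(20)]

for budget in (25, 50, 100):
    X_sub, y_sub = subsample(X_toy, y_toy, budget)
    print(budget, len(X_sub))
```

Low budgets give cheap but noisy estimates, which is why successive halving only promotes the most promising candidates to larger subsamples.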



In [37]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, shuffle=True
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=40,
    per_run_time_limit=10,
    tmp_folder="/tmp/autosklearn_sh_example_tmp_03",
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default setting for
    # AutoSklearnClassifier. It is specified explicitly in this example
    # for demonstration purposes.
    resampling_strategy="holdout",
    resampling_strategy_arguments={"train_size": 0.67},
    get_smac_object_callback=get_smac_object_callback("subsample"),
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# Print the final ensemble constructed by auto-sklearn.
pprint(automl.show_models(), indent=4)
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run, such as the number of
# iterations and the number of models that failed with a timeout.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))



{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f113f582310>,
           'cost': 0.028368794326241176,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f113f574dd0>,
           'ensemble_weight': 0.1,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f114c30b150>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f113ff3ab50>,
           'cost': 0.021276595744680882,
           'data_preprocessor': <autosklearn.pipeline.components

## Mixed budget approach
Finally, there is a mixed budget type, which uses iterations where possible and
subsamples otherwise.
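
The fallback rule can be sketched as a small dispatch function. This is an illustration only, not Auto-sklearn's implementation: `resolve_budget_type` and `SUPPORTS_ITERATIVE_FIT` are hypothetical names, and the set below is an assumption taken from the classifiers included in the iterative-fit example, not an authoritative list.

```python
# Assumption: illustrative set of classifiers that can be fit iteratively.
SUPPORTS_ITERATIVE_FIT = {
    "extra_trees",
    "gradient_boosting",
    "random_forest",
    "sgd",
    "passive_aggressive",
}


def resolve_budget_type(classifier_name, budget_type):
    """Pick the budget actually applied to one classifier."""
    if budget_type in ("iterations", "subsample"):
        return budget_type
    if budget_type == "mixed":
        # Prefer iterations; fall back to subsampling the training data.
        if classifier_name in SUPPORTS_ITERATIVE_FIT:
            return "iterations"
        return "subsample"
    raise ValueError(f"unknown budget type: {budget_type!r}")


print(resolve_budget_type("random_forest", "mixed"))
print(resolve_budget_type("libsvm_svc", "mixed"))
```

With a rule like this, the classifier list does not have to be restricted to iterative models for budgets to apply to every candidate.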



In [38]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, shuffle=True
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=40,
    per_run_time_limit=10,
    tmp_folder="/tmp/autosklearn_sh_example_tmp_04",
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default setting for
    # AutoSklearnClassifier. It is specified explicitly in this example
    # for demonstration purposes.
    resampling_strategy="holdout",
    resampling_strategy_arguments={"train_size": 0.67},
    include={
        "classifier": ["extra_trees", "gradient_boosting", "random_forest", "sgd"]
    },
    get_smac_object_callback=get_smac_object_callback("mixed"),
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# Print the final ensemble constructed by auto-sklearn.
pprint(automl.show_models(), indent=4)
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run, such as the number of
# iterations and the number of models that failed with a timeout.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))





{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f114c2bbed0>,
           'cost': 0.021276595744680882,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f113dd85850>,
           'ensemble_weight': 0.02,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f114e320610>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=64, n_jobs=1,
                       random_state=1, warm_start=True)},
    4: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f113e38c650>,
           'cost': 0.014184397163120588,
           'data_preprocessor': <autosklearn.pipeline.components