# Ray Tune - Search Algorithms and Schedulers

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This notebook introduces the concepts of search algorithms and schedulers, which help make hyperparameter optimization (HPO) more efficient. We'll see an example that combines one search algorithm with one scheduler.

The full set of search algorithms provided by Tune is documented [here](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html), along with information about implementing your own. The full set of schedulers provided is documented [here](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html).

We need to install a few libraries. We'll explain what they are below.

In [None]:
!pip install hpbandster ConfigSpace

In [1]:
!python --version

Python 3.7.6


> **NOTE:** If you see **Python 3.6** in the output from the previous cell, remove the `#` in the following cell and run it. This will fix a dependency bug that affects this notebook.
> 
> Afterwards, **restart the kernel for this notebook**, using the circular arrow in the tool bar. After that, proceed with the rest of the notebook. 
> 
> If you have **Python 3.7** or later, skip these steps.

In [2]:
#!pip install statsmodels -U --pre

## About Search Algorithms

Tune integrates many [open source optimization libraries](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html), each of which defines the parameter search space in its own way. Hence, you should read the corresponding documentation for an algorithm to understand the particular details of using it.

Some of the search algorithms supported include the following:

* [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization): This constrained global optimization process builds upon Bayesian inference and Gaussian processes. It attempts to find the maximum value of an unknown function in as few iterations as possible. This is a good technique for optimizing functions that are expensive to evaluate.
* [BOHB (Bayesian Optimization HyperBand)](https://github.com/automl/HpBandSter): An algorithm that both terminates bad trials and uses Bayesian Optimization to improve the hyperparameter search. It is backed by the [HpBandSter](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-bohb) library. BOHB is intended to be paired with a specific scheduler class: [HyperBandForBOHB](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-bohb).
* [HyperOpt](http://hyperopt.github.io/hyperopt): A Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.
* [Nevergrad](https://github.com/facebookresearch/nevergrad): HPO without computing gradients.

These and other algorithms are described in the [documentation](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html).

A limitation of search algorithms used by themselves is that they can't affect or stop training processes, for example by stopping trials early when they are performing poorly. Schedulers can do this, so it's common to pair a search algorithm with a compatible scheduler, as we'll show in the first example.

## About Schedulers

Tune includes distributed implementations of several early-stopping algorithms, including the following:

* [Median Stopping Rule](https://research.google.com/pubs/pub46180.html): It applies the simple rule that a trial is aborted if the results are trending below the median of the previous trials.
* [HyperBand](https://arxiv.org/abs/1603.06560): It structures search as an _infinite-armed, stochastic, exploration-only, multi-armed bandit_. See the [Multi-Armed Bandits lessons](../ray-rllib/multi-armed-bandits/00-Multi-Armed-Bandits-Overview.ipynb) for information on these concepts. The infinite arms correspond to the tunable parameters. Trying values stochastically ensures quick exploration of the parameter space. Exploration-only is desirable because for HPO, we aren't interested in _exploiting_ parameter combinations we've already tried (the usual case when using MABs where rewards are the goal). Instead, we need to explore as many new parameter combinations as possible.
* [ASHA](https://openreview.net/forum?id=S1Y7OOlRZ): This is an asynchronous version of HyperBand that improves on the latter. Hence, it is recommended over the original HyperBand implementation.
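
To make the first rule in the list above more concrete, here is a minimal pure-Python sketch of a median stopping rule. It is not Tune's implementation. The function names `running_average` and `should_stop`, the `grace_period` parameter, and the score histories are made up for illustration, and we assume that higher scores are better. A trial is stopped when its best score so far falls below the median of the other trials' running averages at the same step.

```python
from statistics import median


def running_average(history, step):
    """Mean of a trial's reported scores over its first `step` results."""
    window = history[:step]
    return sum(window) / len(window)


def should_stop(trial_history, other_histories, step, grace_period=1):
    """Decide whether to stop a trial at `step` (hypothetical helper).

    Returns True when the trial's best score so far is below the median of
    the other trials' running averages at the same step.
    """
    if step < grace_period:
        return False
    # Only compare against trials that have reported at least `step` results.
    comparable = [h for h in other_histories if len(h) >= step]
    if not comparable:
        return False
    median_of_averages = median(running_average(h, step) for h in comparable)
    best_so_far = max(trial_history[:step])
    return best_so_far < median_of_averages


# Made-up accuracy histories for three earlier trials.
finished = [
    [0.50, 0.70, 0.80],
    [0.40, 0.60, 0.75],
    [0.20, 0.30, 0.35],
]

print(should_stop([0.10, 0.15], finished, step=2))  # True: well below the median
print(should_stop([0.55, 0.72], finished, step=2))  # False: above the median
```

Comparing the best score against running averages makes the rule fairly lenient, so only trials that are clearly lagging are cut.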

Tune also includes a distributed implementation of [Population Based Training (PBT)](https://deepmind.com/blog/population-based-training-neural-networks). When the PBT scheduler is enabled, each trial variant is treated as a member of the _population_. Periodically, top-performing trials are checkpointed, which means your [`tune.Trainable`](https://docs.ray.io/en/latest/tune/api_docs/trainable.html#tune-trainable) object (e.g., the `TrainMNIST` class we used in the previous exercise) has to support save and restore. 

Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation. PBT trains a group of models (or RLlib agents) in parallel. So, unlike other hyperparameter search algorithms, PBT mutates hyperparameters during training time. This enables very fast hyperparameter discovery and also automatically discovers good [annealing](https://en.wikipedia.org/wiki/Simulated_annealing) schedules.
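
The exploit-and-explore step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Tune's `PopulationBasedTraining` code. The `pbt_step` function, the `quantile` parameter, the dictionary layout, and the scores are all hypothetical, and strings stand in for real checkpoints. The bottom quarter of the population copies the checkpoint and configuration of a randomly chosen top performer, then resamples one hyperparameter.

```python
import random


def pbt_step(population, mutations, rng, quantile=0.25):
    """Apply one exploit-and-explore step in place (hypothetical helper)."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * quantile))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: clone the checkpoint and configuration of a top performer.
        member["checkpoint"] = donor["checkpoint"]
        member["config"] = dict(donor["config"])
        # Explore: perturb the cloned configuration by resampling one value.
        key = rng.choice(sorted(mutations))
        member["config"][key] = rng.choice(mutations[key])
    return population


rng = random.Random(0)
mutations = {"lr": [0.001, 0.01, 0.1], "momentum": [0.001, 0.01, 0.1, 0.9]}

# Eight population members with made-up scores.
scores = [0.60, 0.72, 0.81, 0.55, 0.90, 0.94, 0.88, 0.79]
population = [
    {
        "name": f"trial_{i}",
        "config": {"lr": 0.001, "momentum": 0.001},
        "score": score,
        "checkpoint": f"checkpoint_{i}",
    }
    for i, score in enumerate(scores)
]

pbt_step(population, mutations, rng)
for member in population:
    print(member["name"], member["checkpoint"], member["config"])
```

After the step, the two weakest members (`trial_0` and `trial_3`) continue training from the checkpoint of `trial_4` or `trial_5`, which is why the trainable must support save and restore.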

See the [Tune schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html) for a complete list and descriptions.

## Examples

Let's initialize Ray as before:

In [3]:
!../tools/start-ray.sh --check --verbose

../tools/start-ray.sh: line 118: 19661 Abort trap: 6           $NOOP ray stat > /dev/null 2>&1

INFO: Ray is not running. Run ../tools/start-ray.sh with no options in a terminal window to start Ray.
INFO: (You can start a terminal in Jupyter. Click the + under the Edit menu.)



In [1]:
import ray
from ray import tune

In [4]:
ray.init(address='auto', ignore_reinit_error=True)

{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:6379',
 'object_store_address': '/tmp/ray/session_2020-07-22_11-06-55_105752_20044/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-07-22_11-06-55_105752_20044/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-07-22_11-06-55_105752_20044'}

### BOHB

BOHB (Bayesian Optimization HyperBand) is an algorithm that both terminates bad trials and uses Bayesian Optimization to improve the hyperparameter search. The [Tune implementation](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#bohb-tune-suggest-bohb-tunebohb) is backed by the [HpBandSter library](https://github.com/automl/HpBandSter), which we installed at the top of this notebook, along with [ConfigSpace](https://automl.github.io/HpBandSter/build/html/quickstart.html#searchspace), which is used to define the search space specification.



We use BOHB with the scheduler [HyperBandForBOHB](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#bohb-tune-schedulers-hyperbandforbohb).

Let's try it. We'll use the same MNIST example from the previous lesson, but this time we'll import the code from a file in this directory, `mnist.py`. Note that the implementation of `TrainMNIST` in the file has enhancements not present in the previous lesson, such as methods for saving and restoring checkpoints, which are required here. See the code comments for details.

In [6]:
from mnist import ConvNet, TrainMNIST, EPOCH_SIZE, TEST_SIZE, DATA_ROOT

Import and configure the `ConfigSpace` object we need for the search algorithm.

In [7]:
import ConfigSpace as CS
from ray.tune.schedulers.hb_bohb import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

In [8]:
config_space = CS.ConfigurationSpace()

# There are also UniformIntegerHyperparameter and UniformFloatHyperparameter
# objects for defining integer and float ranges, respectively. For example:
# config_space.add_hyperparameter(
#     CS.UniformIntegerHyperparameter('foo', lower=0, upper=100))

config_space.add_hyperparameter(
    CS.CategoricalHyperparameter('lr', choices=[0.001, 0.01, 0.1]))
config_space.add_hyperparameter(
    CS.CategoricalHyperparameter('momentum', choices=[0.001, 0.01, 0.1, 0.9]))

config_space

Configuration space object:
  Hyperparameters:
    lr, Type: Categorical, Choices: {0.001, 0.01, 0.1}, Default: 0.001
    momentum, Type: Categorical, Choices: {0.001, 0.01, 0.1, 0.9}, Default: 0.001
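
The two categorical hyperparameters above define 3 × 4 = 12 possible configurations, which matches the `num_samples=12` used in the `tune.run` call later. As a quick check, this sketch enumerates them with plain Python rather than the `ConfigSpace` API. The variable names are ours, not part of any library.

```python
from itertools import product

# The same choices that were added to the configuration space.
lr_choices = [0.001, 0.01, 0.1]
momentum_choices = [0.001, 0.01, 0.1, 0.9]

# Every pairing of one learning rate with one momentum value.
combinations = [
    {"lr": lr, "momentum": momentum}
    for lr, momentum in product(lr_choices, momentum_choices)
]

print(len(combinations))  # 12
for combination in combinations:
    print(combination)
```

Keep in mind that BOHB samples from the space rather than enumerating it, so some combinations can be tried more than once, as the trial table further down shows for `lr=0.1, momentum=0.01`.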

In [9]:
experiment_metrics = dict(metric="mean_accuracy", mode="max")

search_algorithm = TuneBOHB(config_space, max_concurrent=4, **experiment_metrics)

scheduler = HyperBandForBOHB(
    time_attr='training_iteration',
    reduction_factor=4,
    max_t=200,
    **experiment_metrics)

Through experimentation, we determined that `max_t=200` is necessary to get good results. For the smallest learning rate and momentum values, it takes longer for training to converge.
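
To give some intuition for how `reduction_factor` and `max_t` interact, here is a simplified sketch of successive halving, the idea that HyperBand-style schedulers build on. It is not Tune's bracket logic, and the `successive_halving_rungs` function is hypothetical. We assume each rung keeps roughly the top `1/reduction_factor` of the trials and that only the last survivor is trained up to `max_t` iterations.

```python
def successive_halving_rungs(num_trials, max_t, reduction_factor):
    """Return (trials_running, train_up_to_iteration) pairs, first rung first."""
    # Shrink the population by `reduction_factor` per rung until one trial is left.
    counts = [num_trials]
    while counts[-1] > 1:
        counts.append(max(1, counts[-1] // reduction_factor))

    # The last rung gets the full budget; each earlier rung gets a fraction of the next.
    budgets = [max_t]
    for _ in counts[1:]:
        budgets.append(max(1, budgets[-1] // reduction_factor))
    budgets.reverse()

    return list(zip(counts, budgets))


rungs = successive_halving_rungs(num_trials=12, max_t=200, reduction_factor=4)
print(rungs)  # [(12, 12), (3, 50), (1, 200)]
```

In this simplified schedule, all 12 trials get a short run, the best 3 continue to 50 iterations, and only one trial uses the full budget of 200. A larger `max_t` therefore mainly benefits the slow-converging configurations that survive the early rungs.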

In [10]:
analysis = tune.run(TrainMNIST, 
    scheduler=scheduler, 
    search_alg=search_algorithm, 
    num_samples=12,                           # Run 12 trials, the number of possible combinations
    verbose=1
)

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_2d839100,TERMINATED,,0.1,0.001,0.940625,48,22.5779
TrainMNIST_2d8413a0,TERMINATED,,0.1,0.01,0.928125,48,23.0732
TrainMNIST_2d84ec3a,TERMINATED,,0.001,0.1,0.221875,12,9.01179
TrainMNIST_2d85964e,TERMINATED,,0.1,0.01,0.809375,12,8.28744
TrainMNIST_312a3282,TERMINATED,,0.01,0.1,0.24375,12,7.43025
TrainMNIST_31330bfa,TERMINATED,,0.1,0.1,0.8125,12,7.67244
TrainMNIST_314c11b8,TERMINATED,,0.1,0.9,0.790625,12,7.51811
TrainMNIST_31726f84,TERMINATED,,0.1,0.001,0.90625,48,22.5777
TrainMNIST_34e85020,TERMINATED,,0.001,0.01,0.1375,12,4.94504
TrainMNIST_34f645c2,TERMINATED,,0.1,0.01,0.846875,12,4.9538


In [11]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

  85.34 seconds,    1.42 minutes


In [12]:
print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

Best config:  {'lr': 0.1, 'momentum': 0.01}


In [13]:
analysis.dataframe().sort_values('mean_accuracy', ascending=False).head()

Unnamed: 0,mean_accuracy,done,timesteps_total,episodes_total,training_iteration,experiment_id,date,timestamp,time_this_iter_s,time_total_s,...,hostname,node_ip,time_since_restore,timesteps_since_restore,iterations_since_restore,trial_id,experiment_tag,config/lr,config/momentum,logdir
10,0.953125,False,,,152,59879bcc54d649f19dc37e3c6d2c5013,2020-07-22_11-08-42,1595441322,0.28231,47.707444,...,DWAnyscaleMBP.local,192.168.1.149,28.228995,0,104,350d0a32,"11_lr=0.1,momentum=0.01",0.1,0.01,/Users/deanwampler/ray_results/TrainMNIST/Trai...
0,0.940625,False,,,48,1fe2cc811b634aa1bb41443e797e667b,2020-07-22_11-08-11,1595441291,0.39068,22.577943,...,DWAnyscaleMBP.local,192.168.1.149,14.28007,0,36,2d839100,"1_lr=0.1,momentum=0.001",0.1,0.001,/Users/deanwampler/ray_results/TrainMNIST/Trai...
11,0.934375,True,,,152,c3c6037f2dda486baffea1bfd824ccab,2020-07-22_11-08-42,1595441322,0.287843,48.056502,...,DWAnyscaleMBP.local,192.168.1.149,28.54946,0,104,352d04ea,"12_lr=0.1,momentum=0.001",0.1,0.001,/Users/deanwampler/ray_results/TrainMNIST/Trai...
1,0.928125,True,,,48,36dd716e34a34fdca0a5d53eeba3b774,2020-07-22_11-08-11,1595441291,0.368011,23.073215,...,DWAnyscaleMBP.local,192.168.1.149,14.546131,0,36,2d8413a0,"2_lr=0.1,momentum=0.01",0.1,0.01,/Users/deanwampler/ray_results/TrainMNIST/Trai...
7,0.90625,False,,,48,925f377c7a7c44728c4057c05a7bd61b,2020-07-22_11-08-11,1595441291,0.377046,22.577711,...,DWAnyscaleMBP.local,192.168.1.149,14.617559,0,36,31726f84,"8_lr=0.1,momentum=0.001",0.1,0.001,/Users/deanwampler/ray_results/TrainMNIST/Trai...


In [14]:
analysis.dataframe()[['mean_accuracy', 'config/lr', 'config/momentum']].sort_values('mean_accuracy', ascending=False)

Unnamed: 0,mean_accuracy,config/lr,config/momentum
10,0.953125,0.1,0.01
0,0.940625,0.1,0.001
11,0.934375,0.1,0.001
1,0.928125,0.1,0.01
7,0.90625,0.1,0.001
9,0.846875,0.1,0.01
5,0.8125,0.1,0.1
3,0.809375,0.1,0.01
6,0.790625,0.1,0.9
4,0.24375,0.01,0.1


The runs in the previous lesson, for the class-based and the function-based Tune APIs, took between 12 and 20 seconds (on my machine), but there we only trained for 20 iterations, whereas here the scheduler allowed up to 200 iterations (`max_t=200`) and the best trials ran for 152. That also accounts for the different results, notably that the much smaller momentum values `0.001` and `0.01` perform best here, while in the previous lesson `0.9` performed best. A smaller momentum value requires more training time, but it takes finer-grained steps toward the optimal result, so more training iterations favor a smaller momentum value. Still, the mean accuracies among the top three or four combinations are quite close.

## Exercise - Population Based Training

Read the [documentation]() on _population based training_ to understand what it is doing. The next cell configures a PBT scheduler and defines other things you'll need. 

See also the discussion for the results [here](solutions/03/Search-Algos-and-Schedulers-Solutions.ipynb).

> **NOTE:** For a more complete example using MNIST and PyTorch, see [this example code](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch_lightning.py).

In [15]:
from ray.tune.schedulers import PopulationBasedTraining

pbt_scheduler = PopulationBasedTraining(
        time_attr='training_iteration',
        perturbation_interval=10,  # Every N time_attr units, "perturb" the parameters.
        hyperparam_mutations={
            "lr": [0.001, 0.01, 0.1],
            "momentum": [0.001, 0.01, 0.1, 0.9]
        },
        **experiment_metrics)

# Note: This appears to be needed to avoid a "key error", but in fact these values won't change
# in the analysis.dataframe() object, even though they will be tuned by the PBT scheduler.
# So when you look at the analysis.dataframe(), look at the `experiment_tag` to see the actual values!
config = {
    "lr": 0.001,            # Use the lowest values from the previous definition
    "momentum": 0.001
}

Now run the following cell, modified from the one above, which makes these changes:

1. Uses the new scheduler.
2. Removes the `search_alg` argument.
3. Adds the `config` argument.
4. Stops a trial once it reaches `0.97` accuracy or `600` iterations, whichever comes first.
5. Uses `1` for the `verbose` argument to reduce the "noise".
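
The stopping criteria in the list above can be read as "stop as soon as any one threshold is reached". The following sketch illustrates that rule in plain Python. It assumes that a dictionary-valued `stop` argument is checked with a "greater than or equal" comparison on each reported metric, which is our understanding of Tune's documented behavior. The `should_stop` function is a hypothetical helper, not part of Tune.

```python
def should_stop(result, stop):
    """Return True as soon as any reported metric reaches its threshold."""
    return any(
        name in result and result[name] >= threshold
        for name, threshold in stop.items()
    )


stop = {"mean_accuracy": 0.97, "training_iteration": 600}

# Neither threshold reached yet.
print(should_stop({"mean_accuracy": 0.95, "training_iteration": 120}, stop))   # False
# Accuracy threshold reached early.
print(should_stop({"mean_accuracy": 0.975, "training_iteration": 120}, stop))  # True
# Iteration limit reached without hitting the accuracy target.
print(should_stop({"mean_accuracy": 0.90, "training_iteration": 600}, stop))   # True
```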

> **WARNING:** This will run for a few minutes.

In [None]:
analysis = tune.run(TrainMNIST, 
    scheduler=pbt_scheduler, 
    config=config,
    stop={"mean_accuracy": 0.97, "training_iteration": 600},
    num_samples=8,
    verbose=1
)

stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

Trial name,status,loc,acc,iter,total time (s)
TrainMNIST_9188f_00000,RUNNING,192.168.1.149:20283,0.784375,120,60.8609
TrainMNIST_9188f_00001,RUNNING,192.168.1.149:20346,0.909375,91,46.5615
TrainMNIST_9188f_00002,RUNNING,192.168.1.149:20287,0.909375,109,55.1125
TrainMNIST_9188f_00003,RUNNING,192.168.1.149:20353,0.784375,76,40.3442
TrainMNIST_9188f_00004,RUNNING,192.168.1.149:20288,0.909375,115,56.3124
TrainMNIST_9188f_00005,RUNNING,192.168.1.149:20284,0.94375,100,51.3084
TrainMNIST_9188f_00006,RUNNING,192.168.1.149:20343,0.921875,87,45.4162
TrainMNIST_9188f_00007,RUNNING,192.168.1.149:20347,0.896875,84,43.9667


Look at the `analysis` data of interest, as done previously. (You might want to focus on other columns in the dataframe.) How well does PBT work?

The final lesson in this tutorial discusses the new Ray SGD library.