# Notebook 03. Introduction to Ray Tune and hyperparameter optimization (HPO)

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>

➡️ [Next notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>
⬅️ [Previous notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

### Learning objectives
In this notebook, you will learn:
 * [How to configure Ray Tune to find solid hyperparameters more easily](#configure_ray_tune)
 * [The details behind Ray RLlib resource allocation](#resource_allocation)
 

In [1]:
# Import required packages.

import gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune
from ray.tune import CLIReporter

# Importing the very same environment class that we have coded together in
# the previous notebook.
from multi_agent_arena.multi_agent_arena import MultiAgentArena, play_one_episode


print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


### How to configure Ray Tune to find solid hyperparameters more easily <a class="anchor" id="configure_ray_tune"></a>

In the previous experiments, we used a single algorithm's (PPO) configuration to create
exactly one Algorithm object and call its `train()` method manually a couple of times.

A common thing to try when doing ML or RL is to look for better choices of hyperparameters, neural network architectures, or algorithm settings. This hyperparameter optimization
problem can be tackled in a scalable fashion using Ray Tune (in combination with RLlib!).

<img src="images/rllib_and_tune.png" width=800>


The following cell demonstrates how you can set up a simple grid search for one very important hyperparameter (the learning rate), using our existing PPO config object and Ray Tune:

In [2]:
# Create a PPOConfig object (same as we did in the previous notebook):
config = PPOConfig()

# Setup our config object the exact same way as before:
# Point to our MultiAgentArena env:
config.environment(env=MultiAgentArena)

# Setup multi-agent mapping:

# Environment provides M agent IDs.
# RLlib has N policies (neural networks).
# The `policy_mapping_fn` maps each of the M agent IDs to one of the N policies (several agents may share a policy).

# If you don't provide a policy_mapping_fn, all agent IDs will map to "default_policy".
config.multi_agent(
    # Tell RLlib to create 2 policies with these IDs here:
    policies=["policy1", "policy2"],
    # Tell RLlib to map agent1 to policy1 and agent2 to policy2.
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

# Reduce the number of workers from 2 (default) to 1 to save some resources on the expensive hyperparameter sweep.
# IMPORTANT: More information on resource requirements for tune hyperparameter sweeps and different RLlib algorithm setups
# below.
config.rollouts(num_rollout_workers=1)

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x2b9d6e06970>
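To see what the `policy_mapping_fn` above does, it can help to call an equivalent function by hand. The following is a minimal stand-alone sketch that does not need RLlib. It assumes the environment's agent IDs are `"agent1"` and `"agent2"`, as the comments in the config cell suggest, and passes `None` for the `episode` and `worker` arguments, which the mapping ignores.

```python
# Stand-alone version of the mapping used in `config.multi_agent(...)` above.
# `episode` and `worker` are unused by this mapping, so they default to None here.
def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
    return "policy1" if agent_id == "agent1" else "policy2"


# Assumed agent IDs of the MultiAgentArena environment.
agent_ids = ["agent1", "agent2"]

# Build the resulting agent-to-policy assignment.
mapping = {agent_id: policy_mapping_fn(agent_id) for agent_id in agent_ids}
print(mapping)
```

Note that any agent ID other than `"agent1"` falls through to `"policy2"` with this mapping.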

Now, let's explore how a very simple hyperparameter search should be configured with RLlib and Tune:

In [3]:
# Before setting up the hyperparameter sweep,
# let's see what the default learning rate and train batch size are for PPO:
print(f"Default learning rate for PPO is: {config.lr}")
print(f"Default train batch size for PPO is: {config.train_batch_size}")

Default learning rate for PPO is: 5e-05
Default train batch size for PPO is: 4000


In [4]:
# Now let's change our existing config object and add a simple
# grid search over two learning rates and two train batch sizes to it:
config.training(
    lr=tune.grid_search([5e-5, 1e-4]),
    train_batch_size=tune.grid_search([3000, 4000]),
)


<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x2b9d6e06970>
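Two `grid_search` entries in the same config are combined as a Cartesian product, so the config above describes 2 x 2 = 4 trials. The sketch below reproduces that expansion with plain Python and `itertools.product`. The `search_space` dict is only an illustrative stand-in for the two `tune.grid_search` entries, and the order in which the combinations are listed may differ from Tune's own trial numbering.

```python
import itertools

# Illustrative stand-in for the two `tune.grid_search([...])` entries above.
search_space = {
    "lr": [5e-5, 1e-4],
    "train_batch_size": [3000, 4000],
}

# Every combination of one value per hyperparameter becomes one trial config.
keys = list(search_space)
trial_configs = [
    dict(zip(keys, values))
    for values in itertools.product(*(search_space[key] for key in keys))
]

for index, trial_config in enumerate(trial_configs):
    print(index, trial_config)

print(f"Number of trials: {len(trial_configs)}")
```

Adding a third grid-searched hyperparameter with two values would double the number of trials again, which is why grid search becomes expensive quickly.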

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [5]:
# Example using the Ray Tune API (`tune.run()`), which trains until some stopping condition is met.
# This creates one or more Algorithms under the hood automatically, without us having to
# build them from the config ourselves.

# Use a custom "reporter" that adds the individual policies' rewards to the output.
reporter = CLIReporter()
reporter.add_metric_column("sampler_results/policy_reward_mean/policy1", "agent1 return")
reporter.add_metric_column("sampler_results/policy_reward_mean/policy2", "agent2 return")


experiment_results = tune.run(
    "PPO",

    # training config params (translated into a python dict!)
    config=config.to_dict(),

    # Stopping criteria: a trial stops as soon as any one of the active criteria below is met.
    stop={
        "timesteps_total": 40000,  # stop if reached n sampling timesteps
        #"training_iteration": 5,     # stop after n training iterations (calls to `Algorithm.train()`)
        #"episode_reward_mean": 400, # stop if average (sum of) rewards in an episode is n or more
    },  
    progress_reporter=reporter,

    # redirect logs instead of default ~/ray_results/
    local_dir="results",
         
    # Every how many train() calls do we create a checkpoint?
    checkpoint_freq=1,
    # Always save last checkpoint (no matter the frequency).
    checkpoint_at_end=True,

    ###############
    # Note about Ray Tune verbosity.
    # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
    # 0 = silent
    # 1 = only status updates, no logging messages
    # 2 = status and brief trial results, includes logging messages
    # 3 = status and detailed trial results, includes logging messages
    # Defaults to 3.
    ###############
    verbose=3,
                   
    # Define what we are comparing for, when we search for the
    # "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
)

print("Training completed.")


Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.


2022-08-15 13:12:22,430 INFO worker.py:1481 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265.
(PPO pid=16724) 2022-08-15 13:12:32,012 INFO algorithm.py:1871 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(PPO pid=16724) 2022-08-15 13:12:32,013 INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


== Status ==
Current time: 2022-08-15 13:12:25 (running for 00:00:00.29)
Memory usage on this node: 9.9/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/20 CPUs, 0/1 GPUs, 0.0/14.5 GiB heap, 0.0/7.25 GiB objects
Result logdir: C:\Dropbox\Projects\ray-summit-2022-training\ray-rllib\results\PPO
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+---------------------------------+----------+-----------------+--------+--------------------+
| Trial name                      | status   | loc             |     lr |   train_batch_size |
|---------------------------------+----------+-----------------+--------+--------------------|
| PPO_MultiAgentArena_24664_00000 | RUNNING  | 127.0.0.1:16724 | 5e-05  |               3000 |
| PPO_MultiAgentArena_24664_00001 | PENDING  |                 | 0.0001 |               3000 |
| PPO_MultiAgentArena_24664_00002 | PENDING  |                 | 5e-05  |               4000 |
| PPO_MultiAgentArena_24664_00003 | PENDING  |                 | 0.0001 |

(PPO pid=16724) 2022-08-15 13:12:42,693 INFO trainable.py:160 -- Trainable.setup took 10.681 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(PPO pid=5700) 2022-08-15 13:12:48,704 INFO algorithm.py:1871 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(PPO pid=5700) 2022-08-15 13:12:48,705 INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


== Status ==
Current time: 2022-08-15 13:12:42 (running for 00:00:17.16)
Memory usage on this node: 12.7/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 4.0/20 CPUs, 0/1 GPUs, 0.0/14.5 GiB heap, 0.0/7.25 GiB objects
Result logdir: C:\Dropbox\Projects\ray-summit-2022-training\ray-rllib\results\PPO
Number of trials: 4/4 (2 PENDING, 2 RUNNING)
+---------------------------------+----------+-----------------+--------+--------------------+
| Trial name                      | status   | loc             |     lr |   train_batch_size |
|---------------------------------+----------+-----------------+--------+--------------------|
| PPO_MultiAgentArena_24664_00000 | RUNNING  | 127.0.0.1:16724 | 5e-05  |               3000 |
| PPO_MultiAgentArena_24664_00001 | RUNNING  | 127.0.0.1:5700  | 0.0001 |               3000 |
| PPO_MultiAgentArena_24664_00002 | PENDING  |                 | 5e-05  |               4000 |
| PPO_MultiAgentArena_24664_00003 | PENDING  |                 | 0.0001 

(PPO pid=5700) 2022-08-15 13:12:58,901 INFO trainable.py:160 -- Trainable.setup took 10.198 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(PPO pid=16440) 2022-08-15 13:13:05,024 INFO algorithm.py:1871 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(PPO pid=16440) 2022-08-15 13:13:05,025 INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


== Status ==
Current time: 2022-08-15 13:12:58 (running for 00:00:33.37)
Memory usage on this node: 15.2/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 6.0/20 CPUs, 0/1 GPUs, 0.0/14.5 GiB heap, 0.0/7.25 GiB objects
Result logdir: C:\Dropbox\Projects\ray-summit-2022-training\ray-rllib\results\PPO
Number of trials: 4/4 (1 PENDING, 3 RUNNING)
+---------------------------------+----------+-----------------+--------+--------------------+
| Trial name                      | status   | loc             |     lr |   train_batch_size |
|---------------------------------+----------+-----------------+--------+--------------------|
| PPO_MultiAgentArena_24664_00000 | RUNNING  | 127.0.0.1:16724 | 5e-05  |               3000 |
| PPO_MultiAgentArena_24664_00001 | RUNNING  | 127.0.0.1:5700  | 0.0001 |               3000 |
| PPO_MultiAgentArena_24664_00002 | RUNNING  | 127.0.0.1:16440 | 5e-05  |               4000 |
| PPO_MultiAgentArena_24664_00003 | PENDING  |                 | 0.0001 

(PPO pid=7072) 2022-08-15 13:13:21,064 INFO algorithm.py:1871 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(PPO pid=7072) 2022-08-15 13:13:21,065 INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


== Status ==
Current time: 2022-08-15 13:13:15 (running for 00:00:49.47)
Memory usage on this node: 17.8/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/20 CPUs, 0/1 GPUs, 0.0/14.5 GiB heap, 0.0/7.25 GiB objects
Result logdir: C:\Dropbox\Projects\ray-summit-2022-training\ray-rllib\results\PPO
Number of trials: 4/4 (4 RUNNING)
+---------------------------------+----------+-----------------+--------+--------------------+
| Trial name                      | status   | loc             |     lr |   train_batch_size |
|---------------------------------+----------+-----------------+--------+--------------------|
| PPO_MultiAgentArena_24664_00000 | RUNNING  | 127.0.0.1:16724 | 5e-05  |               3000 |
| PPO_MultiAgentArena_24664_00001 | RUNNING  | 127.0.0.1:5700  | 0.0001 |               3000 |
| PPO_MultiAgentArena_24664_00002 | RUNNING  | 127.0.0.1:16440 | 5e-05  |               4000 |
| PPO_MultiAgentArena_24664_00003 | RUNNING  | 127.0.0.1:7072  | 0.0001 |          



== Status ==
Current time: 2022-08-15 13:14:01 (running for 00:01:35.86)
Memory usage on this node: 20.4/31.9 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/20 CPUs, 0/1 GPUs, 0.0/14.5 GiB heap, 0.0/7.25 GiB objects
Current best trial: 24664_00000 with episode_reward_mean=-6.100000000000002 and parameters={'extra_python_environs_for_driver': {}, 'extra_python_environs_for_worker': {}, 'num_gpus': 0, 'num_cpus_per_worker': 1, 'num_gpus_per_worker': 0, '_fake_gpus': False, 'custom_resources_per_worker': {}, 'placement_strategy': 'PACK', 'eager_tracing': False, 'eager_max_retraces': 20, 'tf_session_args': {'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2, 'gpu_options': {'allow_growth': True}, 'log_device_placement': False, 'device_count': {'CPU': 1}, 'allow_soft_placement': True}, 'local_tf_session_args': {'intra_op_parallelism_threads': 8, 'inter_op_parallelism_threads': 8}, 'env': <class 'multi_agent_arena.multi_agent_arena.MultiAgentArena'>, 'env_con

2022-08-15 13:24:21,591	INFO tune.py:758 -- Total run time: 716.33 seconds (715.80 seconds for the tuning loop).


Training completed.
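The `stop` dictionary passed to `tune.run()` above ends a trial as soon as any listed metric reaches its threshold. The sketch below imitates that check in plain Python. It is a simplification rather than Tune's actual implementation, and it assumes that every training iteration samples exactly `train_batch_size` timesteps. The helper names `should_stop` and `iterations_until_stop` are made up for this illustration.

```python
def should_stop(result, stop):
    """Return True if any metric listed in `stop` has reached its threshold."""
    return any(result[key] >= threshold for key, threshold in stop.items())


def iterations_until_stop(timesteps_per_iteration, stop, max_iterations=1000):
    """Count training iterations until the stopping criteria are met."""
    for iteration in range(1, max_iterations + 1):
        # Simplified result dict: only the two metrics used in this notebook.
        result = {
            "training_iteration": iteration,
            "timesteps_total": iteration * timesteps_per_iteration,
        }
        if should_stop(result, stop):
            return iteration
    return None


stop = {"timesteps_total": 40000}

# Trials with train_batch_size=3000 need more iterations than those with 4000.
print(iterations_until_stop(3000, stop))  # 14
print(iterations_until_stop(4000, stop))  # 10
```

Under these assumptions, a trial with `train_batch_size=3000` runs for 14 iterations, which is consistent with the `checkpoint_000014` directory reported for the best trial further down.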


In [6]:
# Using the returned `experiment_results` object,
# we can extract the best checkpoint according to some criterion, e.g. `episode_reward_mean`.

# We had 4 trials (4 Algorithm instances); return the one that performed best here.
best_trial = experiment_results.get_best_trial()
print("Best trial: ", best_trial)

# From that trial, extract the best checkpoint (max `episode_reward_mean` value).
best_checkpoint = experiment_results.get_best_checkpoint(trial=best_trial, metric="episode_reward_mean", mode="max")

# We would expect this to be either the very last checkpoint or one close to it:
print(f"Best checkpoint from training: {best_checkpoint}")

Best trial:  PPO_MultiAgentArena_24664_00001
Best checkpoint from training: Checkpoint(local_path=C:\Dropbox\Projects\ray-summit-2022-training\ray-rllib\results\PPO\PPO_MultiAgentArena_24664_00001_1_lr=0.0001,train_batch_size=3000_2022-08-15_13-12-42\checkpoint_000014)


### The details behind Ray RLlib resource allocation <a class="anchor" id="resource_allocation"></a>

#### Why did we use 8 CPUs in the tune run above (2 CPUs per trial)?

```
== Status ==
Current time: 2022-07-24 18:18:28 (running for 00:02:09.35)
Memory usage on this node: 9.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8/16 CPUs, 0/0 GPUs, 0.5/3.97 GiB heap, 0.5/1.98 GiB objects
```

<img src="images/closer_look_at_rllib.png" width=700 />

By default, the PPO Algorithm uses 2 so-called `RolloutWorkers` (you can change this via `config.rollouts(num_rollout_workers=...)`) to collect samples from
environments in parallel.
We changed this setting to only 1 worker via the `config.rollouts(num_rollout_workers=1)` call in the config cell above.

`RolloutWorkers` are Ray Actors that have their own copies of the environment and step through episodes in parallel. Each Actor in Ray normally uses a single CPU. Besides its `RolloutWorker`s, an RLlib Algorithm also always has one local process (also known as the "driver" process or the "local worker"), which, in the case of PPO,
handles the model/policy learning updates.

For our experiment above, this gives us 2 CPUs (1 rollout worker + 1 local learner) per Algorithm instance.

Since our config specifies two `grid_search` entries, one over 2 learning rates and one over 2 train batch sizes, we ran 4 Algorithms in parallel above (2 learning rates x 2 batch sizes = 4 trials). Hence 8 CPUs were required (4 Algorithms x 2 CPUs each = 8).
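The CPU arithmetic above can be written down as a small helper. This is only a back-of-the-envelope sketch under the assumptions stated in this section: one CPU for the local worker and one CPU per rollout worker. The function and parameter names are made up for this illustration and are not part of the RLlib or Tune API.

```python
def cpus_per_trial(num_rollout_workers, cpus_per_rollout_worker=1, cpus_for_local_worker=1):
    """CPUs needed by one Algorithm instance (one Tune trial)."""
    return cpus_for_local_worker + num_rollout_workers * cpus_per_rollout_worker


def total_cpus(num_trials, num_rollout_workers):
    """CPUs needed when all trials of a sweep run in parallel."""
    return num_trials * cpus_per_trial(num_rollout_workers)


# Our sweep: 2 learning rates x 2 batch sizes, 1 rollout worker per trial.
print(cpus_per_trial(num_rollout_workers=1))               # 2
print(total_cpus(num_trials=4, num_rollout_workers=1))     # 8

# With PPO's default of 2 rollout workers, the same sweep would need more CPUs.
print(total_cpus(num_trials=4, num_rollout_workers=2))     # 12
```

If a machine has fewer CPUs than the sweep requests, Tune keeps the remaining trials in the `PENDING` state until resources are freed.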


### Summary

In this notebook, we have learned:

* How to use Ray Tune in combination with RLlib for hyperparameter tuning
* How an RLlib Algorithm configuration and the Tune hyperparameter search setup determine the computational resources required for a given `tune.run()` experiment

### Exercise 03

#### Using the `config` that we have built so far, let's run another `tune.run()`.

But this time, apply the following changes to our setup:

- Set up only 1 learning rate using the `config.training(lr=...)` method call. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set up only 1 train batch size using the `config.training(train_batch_size=...)` method call. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set the number of RolloutWorkers to 5 using the `config.rollouts(num_rollout_workers=5)` method call, which will allow us to collect more environment samples in parallel.
- Set the `num_envs_per_worker` config parameter to 5 using the `config.rollouts(num_envs_per_worker=...)` method call. This vectorizes the environment on each rollout worker, so that action-computing forward passes through our neural networks are batched across the environment copies.
- Set the stop criterion to "training_iteration=180".

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.
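Before running the exercise, it can be useful to estimate what the new rollout settings imply. The sketch below is plain arithmetic under the same assumptions as the resource section above (one CPU per rollout worker plus one for the local worker, and environment copies on a worker sharing that worker's CPU). The variable names are chosen for this illustration only.

```python
# Rollout settings requested in the exercise.
num_rollout_workers = 5
num_envs_per_worker = 5

# Each rollout worker steps through its own set of environment copies.
total_env_copies = num_rollout_workers * num_envs_per_worker

# Assumption: 1 CPU for the local worker plus 1 CPU per rollout worker.
# The extra environment copies live inside the workers and need no extra CPUs.
cpus_for_single_trial = 1 + num_rollout_workers

print(f"Environment copies sampled in parallel: {total_env_copies}")
print(f"CPUs requested by the single trial: {cpus_for_single_trial}")
```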

**Good luck! :)**

### Solution

In [None]:
# Undo our tune hyperparameter search:
config.training(
    # ...
)

# Change the config as stated in the exercise task:
config.rollouts(
    # ...
)

# Run the experiment for 180 iterations:
experiment_results = tune.run(
    "PPO",
    config=config.to_dict(),
    stop={
        # ...
    },
    # redirect logs instead of default ~/ray_results/
    local_dir="results",
    checkpoint_freq=10,
    checkpoint_at_end=True,
    verbose=1,
    metric="episode_reward_mean",
    mode="max",
)

In [None]:
# We only had a single trial (one Algorithm instance), so this should be returned here.
best_trial = experiment_results.get_best_trial()
print("Best trial: ", best_trial)

# From that trial, extract the best checkpoint (max `episode_reward_mean` value).
best_checkpoint = experiment_results.get_best_checkpoint(trial=best_trial, metric="episode_reward_mean", mode="max")

# We would expect this to be either the very last checkpoint or one close to it:
print(f"Best checkpoint from training: {best_checkpoint}")

# Create a fresh algorithm and restore its state, using our best checkpoint from the experiment above.
new_ppo = config.build()
new_ppo.restore(best_checkpoint)

# Let's see how we are doing now.
play_one_episode(env=None, algo=new_ppo)


In [None]:
# Clean up (release resources for other notebooks to come).
new_ppo.stop()

## References
 * [Tune, Scalable Hyperparameter Tuning](https://docs.ray.io/en/latest/tune/index.html)
