# Notebook 03. Introduction to Ray Tune and hyperparameter optimization (HPO)

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>

➡️ [Next notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>
⬅️ [Previous notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

### Learning objectives
In this notebook, you will learn:
 * [How to configure Ray Tune to find solid hyperparameters more easily](#configure_ray_tune)
 * [The details behind Ray RLlib resource allocation](#resource_allocation)
 

In [7]:
# Import required packages.

import gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

# Importing the very same environment class that we have coded together in
# the previous notebook.
from multi_agent_arena.multi_agent_arena import MultiAgentArena


print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


### How to configure Ray Tune to find solid hyperparameters more easily <a class="anchor" id="configure_ray_tune"></a>

In the previous experiments, we used a single algorithm's (PPO) configuration to create
exactly one Algorithm object and call its `train()` method manually a couple of times.

A common thing to try when doing ML or RL is to look for better choices of hyperparameters, neural network architectures, or algorithm settings. This hyperparameter optimization
problem can be tackled in a scalable fashion using Ray Tune (in combination with RLlib!).

<img src="images/rllib_and_tune.png" width=800>


The following cells demonstrate how to set up a simple grid search over the learning rate, a very important hyperparameter, using our existing PPO config object and Ray Tune:

In [12]:
# Create a PPOConfig object (same as we did in the previous notebook):
config = PPOConfig()
print(f"PPO default learning rate is {config.lr} and the train batch size is {config.train_batch_size}")

# Set up our config object the exact same way as before:
# Point to our MultiAgentArena env:
config.environment(env=MultiAgentArena)

# Setup multi-agent mapping:

# Environment provides M agent IDs.
# RLlib has N policies (neural networks).
# The `policy_mapping_fn` maps M agent IDs to N policies (N <= M).

# If you don't provide a policy_mapping_fn, all agent IDs will map to "default_policy".
config.multi_agent(
    # Tell RLlib to create 2 policies with these IDs here:
    policies=["policy1", "policy2"],
    # Tell RLlib to map agent1 to policy1 and agent2 to policy2.
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

# Reduce the number of workers from 2 (default) to 1 to save some resources on the expensive hyperparameter sweep.
# IMPORTANT: More information on resource requirements for tune hyperparameter sweeps and different RLlib algorithm setups
# below.
config.rollouts(num_rollout_workers=1)

PPO default learning rate is 5e-05 and the train batch size is 4000


<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x1390f880a60>
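As a quick sanity check, the mapping function passed to `config.multi_agent()` can be exercised on its own, without RLlib. The sketch below restates the lambda as a named function (the default values for `episode` and `worker` are added here only so that it can be called standalone) and assumes the environment's agent IDs are `agent1` and `agent2`, as the comments in the cell above suggest.

```python
# Standalone restatement of the lambda used in `config.multi_agent()` above.
# The `None` defaults are an addition for illustration; RLlib passes real objects.
def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
    return "policy1" if agent_id == "agent1" else "policy2"


# Assumed agent IDs (taken from the comments in the config cell).
agent_ids = ["agent1", "agent2"]

# Build the agent-to-policy table that the function implies.
mapping = {agent_id: policy_mapping_fn(agent_id) for agent_id in agent_ids}
print(mapping)
```

Note that any agent ID other than `agent1` falls through to `policy2`, so a typo in an agent ID would not raise an error here.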

Now, let's explore how a very simple hyperparameter search should be configured with RLlib and Tune:

In [3]:
# Before setting up the hyperparameter sweep,
# let's see what the default learning rate and train batch size are for PPO:
print(f"Default learning rate for PPO is: {config.lr}")
print(f"Default train batch size for PPO is: {config.train_batch_size}")

Default learning rate for PPO is: 5e-05
Default train batch size for PPO is: 4000


In [4]:
# Now let's change our existing config object and add a simple
# grid search over two learning rates and two train batch sizes to it:
config.training(
    lr=tune.grid_search([5e-5, 1e-4]),
    train_batch_size=tune.grid_search([3000, 4000]),
)

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x13947653b80>
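A grid search over several parameters yields one trial per combination of the listed values. The following sketch reproduces that expansion in plain Python with `itertools.product`; it does not call Tune, the helper name `expand_grid` is made up for illustration, and the order in which Tune actually schedules the combinations may differ.

```python
from itertools import product

# Candidate values, copied from the `config.training(...)` call above.
search_space = {
    "lr": [5e-5, 1e-4],
    "train_batch_size": [3000, 4000],
}


def expand_grid(space):
    """Return one dict per combination of the listed values (hypothetical helper)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


trials = expand_grid(search_space)
for index, trial in enumerate(trials):
    print(index, trial)
```

Two values for each of two parameters give 2 x 2 = 4 combinations, so four trials are expected in the sweep.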

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [5]:
# Example using the Ray Tune API (`tune.run()`), which trains until some stopping condition is met.
# This will create one (or more) Algorithms under the hood automatically w/o us having to
# build these algos from the config.

experiment_results = tune.run(
    "PPO",

    # training config params (translated into a python dict!)
    config=config.to_dict(),

    # Stopping criteria; a trial stops as soon as any one of them is met:
    stop={
        "training_iteration": 6,     # stop after n training iterations (calls to `Algorithm.train()`)
        #"episode_reward_mean": 400, # stop if average (sum of) rewards in an episode is 400 or more
        #"timesteps_total": 100000,  # stop if reached 100,000 sampling timesteps
    },  

    # redirect logs instead of default ~/ray_results/
    local_dir="results",
         
    # Every how many train() calls do we create a checkpoint?
    checkpoint_freq=1,
    # Always save last checkpoint (no matter the frequency).
    checkpoint_at_end=True,

    ###############
    # Note about Ray Tune verbosity.
    # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
    # 0 = silent
    # 1 = only status updates, no logging messages
    # 2 = status and brief trial results, includes logging messages
    # 3 = status and detailed trial results, includes logging messages
    # Defaults to 3.
    ###############
    verbose=3,
                   
    # Define the metric (and direction) used to compare trials when we
    # search for the "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
)

print("Training completed.")


Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.


2022-08-09 11:15:54,749 INFO worker.py:1481 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265.


| Trial name | status | loc | lr | train_batch_size | iter | total time (s) | ts | reward | num_recreated_wor... | episode_reward_max | episode_reward_min |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_e0758_00000 | TERMINATED | 127.0.0.1:10368 | 5e-05 | 3000 | 6 | 268.406 | 18000 | 1.665 | 0 | 25.2 | -13.5 |
| PPO_MultiAgentArena_e0758_00001 | TERMINATED | 127.0.0.1:11124 | 0.0001 | 3000 | 6 | 268.993 | 18000 | 1.83 | 0 | 30.3 | -21.3 |
| PPO_MultiAgentArena_e0758_00002 | TERMINATED | 127.0.0.1:17352 | 5e-05 | 4000 | 6 | 320.524 | 24000 | 2.292 | 0 | 19.5 | -24.3 |
| PPO_MultiAgentArena_e0758_00003 | TERMINATED | 127.0.0.1:11556 | 0.0001 | 4000 | 6 | 316.389 | 24000 | 3.153 | 0 | 22.5 | -18.9 |


[2m[36m(PPO pid=10368)[0m 2022-08-09 11:16:02,693	INFO algorithm.py:1871 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
[2m[36m(PPO pid=10368)[0m 2022-08-09 11:16:02,694	INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(PPO pid=10368)[0m 2022-08-09 11:16:12,913	INFO trainable.py:160 -- Trainable.setup took 10.221 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(PPO pid=11124)[0m 2022-08-09 11:16:18,359	INFO algorithm.py:1871 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=T

Result for PPO_MultiAgentArena_e0758_00000:
  agent_timesteps_total: 6000
  counters:
    num_agent_steps_sampled: 6000
    num_agent_steps_trained: 6000
    num_env_steps_sampled: 3000
    num_env_steps_trained: 3000
  custom_metrics: {}
  date: 2022-08-09_11-16-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.499999999999915
  episode_reward_mean: -7.299999999999998
  episode_reward_min: -36.30000000000004
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: d3dc894fb4884b018f96995278e9ac97
  hostname: DESKTOP-0LQ89AE
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3774759769439697
          entropy_coeff: 0.0
          kl: 0.008940611034631729
          model: {}
          policy_loss: -0.025092337280511856
          total_loss: 7.39013671875
          vf_explained_var: -0.00045563263120129704
          vf_loss: 7.413441



Result for PPO_MultiAgentArena_e0758_00001:
  agent_timesteps_total: 6000
  counters:
    num_agent_steps_sampled: 6000
    num_agent_steps_trained: 6000
    num_env_steps_sampled: 3000
    num_env_steps_trained: 3000
  custom_metrics: {}
  date: 2022-08-09_11-16-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.00000000000001
  episode_reward_mean: -8.799999999999994
  episode_reward_min: -39.90000000000008
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: 3b43653afbe04acfbe6caf1cbf1e78d6
  hostname: DESKTOP-0LQ89AE
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.3681732416152954
          entropy_coeff: 0.0
          kl: 0.018601803109049797
          model: {}
          policy_loss: -0.04214094206690788
          total_loss: 6.6384358406066895
          vf_explained_var: 0.008029840886592865
          vf_loss: 6.676856



Result for PPO_MultiAgentArena_e0758_00002:
  agent_timesteps_total: 8000
  counters:
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_env_steps_sampled: 4000
    num_env_steps_trained: 4000
  custom_metrics: {}
  date: 2022-08-09_11-17-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.499999999999915
  episode_reward_mean: -6.9
  episode_reward_min: -36.00000000000007
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 3bb20841742c49d7bcfe50360fad8c16
  hostname: DESKTOP-0LQ89AE
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3725521564483643
          entropy_coeff: 0.0
          kl: 0.01393173635005951
          model: {}
          policy_loss: -0.027482902631163597
          total_loss: 6.912868022918701
          vf_explained_var: 0.004622733220458031
          vf_loss: 6.937565326690674
    

[2m[36m(RolloutWorker pid=3252)[0m E0809 11:21:05.641000000 17564 src/core/ext/transport/chttp2/transport/chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"


Result for PPO_MultiAgentArena_e0758_00000:
  agent_timesteps_total: 36000
  counters:
    num_agent_steps_sampled: 36000
    num_agent_steps_trained: 36000
    num_env_steps_sampled: 18000
    num_env_steps_trained: 18000
  custom_metrics: {}
  date: 2022-08-09_11-21-07
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 25.199999999999985
  episode_reward_mean: 1.6650000000000107
  episode_reward_min: -13.499999999999995
  episodes_this_iter: 30
  episodes_total: 180
  experiment_id: d3dc894fb4884b018f96995278e9ac97
  hostname: DESKTOP-0LQ89AE
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.2045027017593384
          entropy_coeff: 0.0
          kl: 0.011734535917639732
          model: {}
          policy_loss: -0.03639598190784454
          total_loss: 6.61864709854126
          vf_explained_var: 0.1052233874797821
          vf_loss: 6.652

[2m[36m(RolloutWorker pid=15000)[0m E0809 11:21:07.628000000 10244 src/core/ext/transport/chttp2/transport/chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug Windows fatal exception: access violation
[2m[36m(RolloutWorker pid=15000)[0m 


Result for PPO_MultiAgentArena_e0758_00003:
  agent_timesteps_total: 32000
  counters:
    num_agent_steps_sampled: 32000
    num_agent_steps_trained: 32000
    num_env_steps_sampled: 16000
    num_env_steps_trained: 16000
  custom_metrics: {}
  date: 2022-08-09_11-21-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.99999999999997
  episode_reward_mean: 2.4570000000000065
  episode_reward_min: -21.000000000000007
  episodes_this_iter: 40
  episodes_total: 160
  experiment_id: 1cb98bfd22c24e0b8d41d2e27afd2ddf
  hostname: DESKTOP-0LQ89AE
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 9.999999747378752e-05
          entropy: 1.2412878274917603
          entropy_coeff: 0.0
          kl: 0.018541984260082245
          model: {}
          policy_loss: -0.05362188071012497
          total_loss: 6.571496486663818
          vf_explained_var: 0.14178690314292908
          vf_loss: 6.6

[2m[36m(RolloutWorker pid=6648)[0m E0809 11:22:04.424000000  7652 src/core/ext/transport/chttp2/transport/chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug 


Result for PPO_MultiAgentArena_e0758_00003:
  agent_timesteps_total: 48000
  counters:
    num_agent_steps_sampled: 48000
    num_agent_steps_trained: 48000
    num_env_steps_sampled: 24000
    num_env_steps_trained: 24000
  custom_metrics: {}
  date: 2022-08-09_11-22-14
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.499999999999932
  episode_reward_mean: 3.1530000000000094
  episode_reward_min: -18.89999999999999
  episodes_this_iter: 40
  episodes_total: 240
  experiment_id: 1cb98bfd22c24e0b8d41d2e27afd2ddf
  hostname: DESKTOP-0LQ89AE
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 9.999999747378752e-05
          entropy: 1.180142879486084
          entropy_coeff: 0.0
          kl: 0.016136420890688896
          model: {}
          policy_loss: -0.048661764711141586
          total_loss: 6.61419677734375
          vf_explained_var: 0.09420862793922424
          vf_loss: 6.655

[2m[36m(RolloutWorker pid=17196)[0m E0809 11:22:14.960000000 17820 src/core/ext/transport/chttp2/transport/chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
2022-08-09 11:22:15,071	INFO tune.py:758 -- Total run time: 378.22 seconds (377.59 seconds for the tuning loop).


Training completed.
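The `stop` dict passed to `tune.run()` ends a trial as soon as any listed metric reaches its threshold. The sketch below imitates that "whichever occurs first" rule in plain Python; `should_stop` is a hypothetical helper rather than Tune's own code, and the two result dicts reuse reward values from the output above, with the iteration numbers inferred from the sampled timesteps.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` has reached its threshold."""
    return any(
        result.get(metric, float("-inf")) >= threshold
        for metric, threshold in stop.items()
    )


# Same shape as the `stop` argument above, with the reward criterion enabled.
stop = {"training_iteration": 6, "episode_reward_mean": 400}

# First and last reported results of trial 00000 (only the relevant keys).
early = {"training_iteration": 1, "episode_reward_mean": -7.3}
final = {"training_iteration": 6, "episode_reward_mean": 1.665}

print(should_stop(early, stop))
print(should_stop(final, stop))
```

The final result never comes close to a mean reward of 400; the trial ends only because the iteration limit is reached.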


In [None]:
# Using the returned `experiment_results` object,
# we can extract the best checkpoint according to some criterion, e.g. `episode_reward_mean`.

print("Best checkpoint: ", experiment_results.best_checkpoint)

print("To directory: ", experiment_results.best_checkpoint.to_directory())

'/var/folders/j4/brrn254576lgnbqqtp5p1z280000gn/T/checkpoint_tmp_6vs48i96'
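To make "best according to some criterion" concrete, here is a minimal sketch that applies the same `metric="episode_reward_mean"`, `mode="max"` rule by hand to the final rewards reported in the trial table above. It is plain Python, not Tune's own selection code, and the list name `final_results` is made up for illustration.

```python
# Final `episode_reward_mean` per trial, copied from the trial table above.
final_results = [
    {"trial": "PPO_MultiAgentArena_e0758_00000", "lr": 5e-5, "train_batch_size": 3000, "episode_reward_mean": 1.665},
    {"trial": "PPO_MultiAgentArena_e0758_00001", "lr": 1e-4, "train_batch_size": 3000, "episode_reward_mean": 1.83},
    {"trial": "PPO_MultiAgentArena_e0758_00002", "lr": 5e-5, "train_batch_size": 4000, "episode_reward_mean": 2.292},
    {"trial": "PPO_MultiAgentArena_e0758_00003", "lr": 1e-4, "train_batch_size": 4000, "episode_reward_mean": 3.153},
]

# mode="max": pick the trial with the highest value of the chosen metric.
best = max(final_results, key=lambda result: result["episode_reward_mean"])
print(best["trial"], best["lr"], best["train_batch_size"])
```

With only 6 iterations per trial and one run per setting, the gaps between trials are small, so treat the ranking as a hint rather than a firm conclusion.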

### The details behind Ray RLlib resource allocation <a class="anchor" id="resource_allocation"></a>

#### Why did we use 8 CPUs in the tune run above (2 CPUs per trial)?

```
== Status ==
Current time: 2022-07-24 18:18:28 (running for 00:02:09.35)
Memory usage on this node: 9.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8/16 CPUs, 0/0 GPUs, 0.5/3.97 GiB heap, 0.5/1.98 GiB objects
```

By default, the PPO Algorithm uses 2 so-called `RolloutWorkers` to collect samples from environments in parallel (you can change this via `config.rollouts(num_rollout_workers=...)`).
We reduced this setting to 1 worker via the `config.rollouts(num_rollout_workers=1)` call in an earlier cell.

`RolloutWorkers` are Ray Actors that have their own copies of the environment and step through episodes in parallel. Each Actor in Ray normally uses a single CPU. Besides its `RolloutWorkers`, an Algorithm in RLlib also always has one local process (a.k.a. the "driver" process or the "local worker"), which, in the case of PPO, handles the model/policy learning updates.

For our experiment above, this gives us 2 CPUs (1 rollout worker + 1 local learner) per Algorithm instance.

Since our config specifies two `grid_search` entries, one over 2 learning rates and one over 2 batch sizes, we were running 4 Algorithms in parallel above (2 learning rates x 2 batch sizes = 4 trials). Hence 8 CPUs were required (4 algos x 2 CPUs each = 8).
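The CPU arithmetic above can be written down as a small calculation. This is a sketch under the assumptions stated in this section (one CPU per rollout worker, one CPU for the local learner process, all trials running at the same time); the function names are hypothetical and not part of RLlib or Tune.

```python
from math import prod


def cpus_per_trial(num_rollout_workers, cpus_per_worker=1, cpus_for_driver=1):
    """CPUs for one Algorithm: the local learner plus its rollout workers."""
    return cpus_for_driver + num_rollout_workers * cpus_per_worker


def total_cpus(grid_sizes, num_rollout_workers):
    """CPUs needed when every grid-search combination runs in parallel."""
    num_trials = prod(grid_sizes)
    return num_trials * cpus_per_trial(num_rollout_workers)


print(cpus_per_trial(num_rollout_workers=1))                 # 2
print(total_cpus(grid_sizes=[2, 2], num_rollout_workers=1))  # 8
```

Under the same assumptions, a single trial with 5 rollout workers would need `cpus_per_trial(5)`, i.e. 6 CPUs.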


### Summary

In this notebook, we have learned:

* How to use Ray Tune in combination with RLlib for hyperparameter tuning
* How RLlib and Tune determine the computational resources required for a `tune.run()` experiment

### Exercise 03

#### Using the `config` that we have built so far, let's run another `tune.run()`.

But this time, apply the following changes to our setup:

- Set only 1 learning rate using the `config.training(lr=...)` method call. Choose the (seemingly) best value from the run above (the one that yielded the highest avg. reward).
- Set only 1 train batch size using the `config.training(train_batch_size=...)` method call. Choose the (seemingly) best value from the run above (the one that yielded the highest avg. reward).
- Set the number of RolloutWorkers to 5 using the `config.rollouts(num_rollout_workers=5)` method call, which will allow us to collect more environment samples in parallel.
- Set the `num_envs_per_worker` config parameter to 5 using the `config.rollouts(num_envs_per_worker=...)` method call. This will batch our environment on each rollout worker, and thus parallelize the action-computing forward passes through our neural networks.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.

**Good luck! :)**

 ## References
 * [Tune, Scalable Hyperparameter Tuning](https://docs.ray.io/en/latest/tune/index.html)

⬅️ [Previous notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>
➡️ [Next notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>

📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)<br>