# Notebook 03. Introduction to Ray Tune and hyperparameter optimization (HPO)

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>
➡️ [Next notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>
⬅️ [Previous notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

### Learning objectives
In this notebook, you will learn:
 * [How to configure Ray Tune to find solid hyperparameters more easily](#configure_ray_tune)
 * [The details behind Ray RLlib resource allocation](#resource_allocation)
 

In [6]:
# Import required packages.

import gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

# Importing the very same environment class that we have coded together in
# the previous notebook.
from multi_agent_arena.multi_agent_arena import MultiAgentArena


print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


### How to configure Ray Tune to find solid hyperparameters more easily <a class="anchor" id="configure_ray_tune"></a>

In the previous experiments, we used a single algorithm's (PPO) configuration to create
exactly one Algorithm object and call its `train()` method manually a couple of times.

A common thing to try when doing ML or RL is to look for better choices of hyperparameters, neural network architectures, or algorithm settings. This hyperparameter optimization
problem can be tackled in a scalable fashion using Ray Tune (in combination with RLlib!).

<img src="images/rllib_and_tune.png" width="70%">


The following cells demonstrate how you can set up a simple grid search over two important hyperparameters (the learning rate and the train batch size), using our already existing PPO config object and Ray Tune:

In [10]:
# Create a PPOConfig object (same as we did in the previous notebook):
config = PPOConfig()

# Setup our config object the exact same way as before:
# Point to our MultiAgentArena env:
config.environment(env=MultiAgentArena)

# Setup multi-agent mapping:

# Environment provides M agent IDs.
# RLlib has N policies (neural networks).
# The `policy_mapping_fn` maps M agent IDs to N policies (typically N <= M).

# If you don't provide a policy_mapping_fn, all agent IDs will map to "default_policy".
config.multi_agent(
    # Tell RLlib to create 2 policies with these IDs here:
    policies=["policy1", "policy2"],
    # Tell RLlib to map agent1 to policy1 and agent2 to policy2.
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

# Reduce the number of rollout workers from 2 (the default) to 1 to save resources during the expensive hyperparameter sweep.
# IMPORTANT: See the resource allocation section below for more information on the resource requirements
# of Tune hyperparameter sweeps and different RLlib algorithm setups.
config.rollouts(num_rollout_workers=1)

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7fc075b352e0>
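
To see what the mapping function passed to `config.multi_agent()` does in isolation, the following minimal sketch calls an equivalent plain function for both agent IDs. Writing it as a named function and defaulting `episode` and `worker` to `None` are illustrative assumptions; RLlib passes real objects for these arguments at runtime.

```python
# Stand-alone version of the lambda passed to `config.multi_agent()`.
# `episode` and `worker` default to None here only so that we can call
# the function without a running RLlib Algorithm (an assumption for this sketch).
def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
    # "agent1" is controlled by "policy1"; any other agent ID by "policy2".
    return "policy1" if agent_id == "agent1" else "policy2"


# Apply the mapping to the two agent IDs used in this notebook.
mapping = {agent_id: policy_mapping_fn(agent_id) for agent_id in ["agent1", "agent2"]}
print(mapping)
```

Because the function falls back to `"policy2"` for every ID other than `"agent1"`, any additional agent would share `policy2` with `agent2`.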

Now, let's explore how a very simple hyperparameter search should be configured with RLlib and Tune:

In [11]:
# Before setting up the hyperparameter sweep,
# let's see what the default learning rate and train batch size are for PPO:
print(f"Default learning rate for PPO is: {config.lr}")
print(f"Default train batch size for PPO is: {config.train_batch_size}")

Default learning rate for PPO is: 5e-05
Default train batch size for PPO is: 4000


In [12]:
# Now let's change our existing config object and add a simple
# grid search over two learning rates and two train batch sizes to it:
config.training(
    lr=tune.grid_search([5e-5, 1e-4]),
    train_batch_size=tune.grid_search([3000, 4000]),
)

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7fc075b352e0>
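
A grid search runs one trial for every combination of the listed values. The sketch below reproduces that expansion with plain Python (no Ray required), using the same values as the `config.training()` call. The `search_space` dict and the `trials` list are illustrative names, and the order in which Tune itself enumerates the combinations may differ.

```python
import itertools

# The two grid-search axes from the `config.training()` call.
search_space = {
    "lr": [5e-5, 1e-4],
    "train_batch_size": [3000, 4000],
}

# A grid search tries every combination (the Cartesian product) of the values.
keys = list(search_space)
trials = [
    dict(zip(keys, values))
    for values in itertools.product(*(search_space[key] for key in keys))
]

for index, trial in enumerate(trials):
    print(index, trial)
print(f"Number of trials: {len(trials)}")
```

Adding a third grid-searched hyperparameter with, say, 3 values would triple the number of trials, which is why grid searches get expensive quickly.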

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [13]:
# Example using the Ray Tune API (`tune.run()`) to train until some stopping condition is met.
# This will create one (or more) Algorithms under the hood automatically, without us having to
# build these Algorithms from the config ourselves.

experiment_results = tune.run(
    "PPO",

    # training config params (translated into a python dict!)
    config=config.to_dict(),

    # Stopping criteria; a trial stops as soon as any one of them is met:
    stop={
        "training_iteration": 6,     # stop after n training iterations (calls to `Algorithm.train()`)
        #"episode_reward_mean": 400, # stop if average (sum of) rewards in an episode is 400 or more
        #"timesteps_total": 100000,  # stop if reached 100,000 sampling timesteps
    },  

    # Write logs to ./results instead of the default ~/ray_results/
    local_dir="results",
         
    # Every how many train() calls do we create a checkpoint?
    checkpoint_freq=1,
    # Always save last checkpoint (no matter the frequency).
    checkpoint_at_end=True,

    ###############
    # Note about Ray Tune verbosity.
    # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
    # 0 = silent
    # 1 = only status updates, no logging messages
    # 2 = status and brief trial results, includes logging messages
    # 3 = status and detailed trial results, includes logging messages
    # Defaults to 3.
    ###############
    verbose=3,
                   
    # Define the metric (and whether to maximize or minimize it) that is used
    # to compare trials when we search for the "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
)

print("Training completed.")
print("Best checkpoint: ", experiment_results.best_checkpoint)


2022-07-27 12:15:59,207	INFO services.py:1477 -- View the Ray dashboard at http://127.0.0.1:8267


| Trial name | status | loc | lr | train_batch_size | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_1dedd_00000 | TERMINATED | 127.0.0.1:48430 | 5e-05 | 3000 | 20 | 299.73 | 60000 | 20.142 | 37.8 | 1.2 | 100 |
| PPO_MultiAgentArena_1dedd_00001 | TERMINATED | 127.0.0.1:48438 | 0.0001 | 3000 | 20 | 301.605 | 60000 | 12.525 | 39.3 | -13.2 | 100 |
| PPO_MultiAgentArena_1dedd_00002 | TERMINATED | 127.0.0.1:48452 | 5e-05 | 4000 | 20 | 380.07 | 80000 | 19.011 | 36.6 | -10.8 | 100 |
| PPO_MultiAgentArena_1dedd_00003 | TERMINATED | 127.0.0.1:48457 | 0.0001 | 4000 | 20 | 381.545 | 80000 | 18.681 | 39.9 | -20.1 | 100 |


2022-07-27 12:16:03,082	INFO plugin_schema_manager.py:51 -- Loading the default runtime env schemas: ['/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.9/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/working_dir_schema.json', '/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.9/site-packages/ray/_private/runtime_env/../../runtime_env/schemas/pip_schema.json'].
[2m[36m(PPO pid=48430)[0m 2022-07-27 12:16:12,997	INFO algorithm.py:1774 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
[2m[36m(PPO pid=48430)[0m 2022-07-27 12:16:12,997	INFO algorithm.py:332 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(PPO pid=48430)[0m 2022-07-27 12:16:25,626	INFO trainable.py:160 -- Train

Result for PPO_MultiAgentArena_1dedd_00001:
  agent_timesteps_total: 6000
  counters:
    num_agent_steps_sampled: 6000
    num_agent_steps_trained: 6000
    num_env_steps_sampled: 3000
    num_env_steps_trained: 3000
  custom_metrics: {}
  date: 2022-07-27_12-16-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 4.500000000000011
  episode_reward_mean: -11.609999999999996
  episode_reward_min: -39.00000000000004
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: 5ce86766e6df4bbc883c7489a2650206
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.3687132596969604
          entropy_coeff: 0.0
          kl: 0.01779901050031185
          model: {}
          policy_loss: -0.046466317027807236
          total_loss: 6.718165397644043
          vf_explained_var: 0.0067339022643864155
          vf_loss:

[2m[36m(PPO pid=48457)[0m 2022-07-27 12:17:35,931	INFO trainable.py:160 -- Trainable.setup took 12.696 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_1dedd_00000:
  agent_timesteps_total: 6000
  counters:
    num_agent_steps_sampled: 6000
    num_agent_steps_trained: 6000
    num_env_steps_sampled: 3000
    num_env_steps_trained: 3000
  custom_metrics: {}
  date: 2022-07-27_12-16-33
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 12.900000000000032
  episode_reward_mean: -10.750000000000004
  episode_reward_min: -36.00000000000005
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: dddc0988b52440af8e96f1d839fcaa67
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3750505447387695
          entropy_coeff: 0.0
          kl: 0.011448928155004978
          model: {}
          policy_loss: -0.029521968215703964
          total_loss: 7.369980335235596
          vf_explained_var: -0.0010922867804765701
          vf_lo



Result for PPO_MultiAgentArena_1dedd_00000:
  agent_timesteps_total: 12000
  counters:
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_env_steps_sampled: 6000
    num_env_steps_trained: 6000
  custom_metrics: {}
  date: 2022-07-27_12-17-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 18.000000000000007
  episode_reward_mean: -6.989999999999994
  episode_reward_min: -36.00000000000005
  episodes_this_iter: 30
  episodes_total: 60
  experiment_id: dddc0988b52440af8e96f1d839fcaa67
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.353236198425293
          entropy_coeff: 0.0
          kl: 0.009822160936892033
          model: {}
          policy_loss: -0.033334940671920776
          total_loss: 6.9307169914245605
          vf_explained_var: -0.0036104300525039434
          vf_

[2m[36m(RolloutWorker pid=48455)[0m E0727 12:23:52.105539000 123145558564864 chttp2_transport.cc:1103]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(RolloutWorker pid=48455)[0m 2022-07-27 12:23:52,124	ERROR worker.py:754 -- Worker exits with an exit code 1.
[2m[36m(RolloutWorker pid=48455)[0m Traceback (most recent call last):
[2m[36m(RolloutWorker pid=48455)[0m   File "python/ray/_raylet.pyx", line 812, in ray._raylet.task_execution_handler
[2m[36m(RolloutWorker pid=48455)[0m   File "python/ray/_raylet.pyx", line 623, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48455)[0m   File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48455)[0m   File "python/ray/_raylet.pyx", line 670, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48455)[0m   File "python/ray/_raylet.pyx", line 674, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48455)[0m   Fi

Result for PPO_MultiAgentArena_1dedd_00003:
  agent_timesteps_total: 160000
  counters:
    num_agent_steps_sampled: 160000
    num_agent_steps_trained: 160000
    num_env_steps_sampled: 80000
    num_env_steps_trained: 80000
  custom_metrics: {}
  date: 2022-07-27_12-24-03
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.8999999999999
  episode_reward_mean: 18.680999999999948
  episode_reward_min: -20.099999999999994
  episodes_this_iter: 40
  episodes_total: 800
  experiment_id: fbe69d4d3a824fb887bd9c2a22ba3d7e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 9.999999747378752e-05
          entropy: 0.7442521452903748
          entropy_coeff: 0.0
          kl: 0.016393013298511505
          model: {}
          policy_loss: -0.045085858553647995
          total_loss: 7.268309593200684
          vf_explained_var: 0.10392039269208908
          vf

[2m[36m(RolloutWorker pid=48459)[0m E0727 12:24:03.854680000 123145604460544 chttp2_transport.cc:1103]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
[2m[36m(RolloutWorker pid=48459)[0m 2022-07-27 12:24:03,863	ERROR worker.py:754 -- Worker exits with an exit code 1.
[2m[36m(RolloutWorker pid=48459)[0m Traceback (most recent call last):
[2m[36m(RolloutWorker pid=48459)[0m   File "python/ray/_raylet.pyx", line 812, in ray._raylet.task_execution_handler
[2m[36m(RolloutWorker pid=48459)[0m   File "python/ray/_raylet.pyx", line 623, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48459)[0m   File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48459)[0m   File "python/ray/_raylet.pyx", line 670, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48459)[0m   File "python/ray/_raylet.pyx", line 674, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=48459)[0m   Fi

Training completed.
Best checkpoint:  <ray.air.checkpoint.Checkpoint object at 0x7fc076210100>
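
The `stop` dict passed to `tune.run()` can hold several criteria, and a trial ends as soon as any one of them is reached. The following sketch is a simplified model of that check in plain Python. `should_stop` is a hypothetical helper (not a Tune function), and the per-iteration results are made up for illustration, assuming 4000 timesteps per training iteration.

```python
def should_stop(result, stop):
    """Return True as soon as any metric in `stop` has reached its threshold."""
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop.items()
    )


# Two stopping criteria; whichever is reached first ends the trial.
stop = {"training_iteration": 6, "timesteps_total": 100000}

stopped_at = None
for iteration in range(1, 11):
    # Hypothetical result: each iteration samples 4000 timesteps.
    result = {"training_iteration": iteration, "timesteps_total": 4000 * iteration}
    if should_stop(result, stop):
        stopped_at = iteration
        break

print(f"Trial stopped after iteration {stopped_at}")
```

In this sketch, the iteration limit triggers first, because 6 iterations of 4000 timesteps each stay well below the 100,000-timestep threshold.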


In [None]:
experiment_results.best_checkpoint.to_directory()

'/var/folders/j4/brrn254576lgnbqqtp5p1z280000gn/T/checkpoint_tmp_6vs48i96'
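
The `metric="episode_reward_mean"` and `mode="max"` arguments tell Tune how to rank trials. As a rough illustration, the sketch below applies the same ranking by hand to the final rewards reported in the trial table of the run above. It is a simplification under the assumption that only the last reported value of each trial is compared; the `final_results` list is an illustrative name.

```python
# Final `episode_reward_mean` values ("reward" column) copied from the trial table.
final_results = [
    {"trial": "PPO_MultiAgentArena_1dedd_00000", "lr": 5e-05, "train_batch_size": 3000, "episode_reward_mean": 20.142},
    {"trial": "PPO_MultiAgentArena_1dedd_00001", "lr": 0.0001, "train_batch_size": 3000, "episode_reward_mean": 12.525},
    {"trial": "PPO_MultiAgentArena_1dedd_00002", "lr": 5e-05, "train_batch_size": 4000, "episode_reward_mean": 19.011},
    {"trial": "PPO_MultiAgentArena_1dedd_00003", "lr": 0.0001, "train_batch_size": 4000, "episode_reward_mean": 18.681},
]

metric, mode = "episode_reward_mean", "max"

# mode="max" picks the trial with the highest metric value, mode="min" the lowest.
pick = max if mode == "max" else min
best = pick(final_results, key=lambda result: result[metric])

print(f"Best trial: {best['trial']}")
print(f"lr={best['lr']}, train_batch_size={best['train_batch_size']}, {metric}={best[metric]}")
```

With only 20 training iterations per trial, the differences between trials are small, so the ranking should be read as a hint rather than a definitive result.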

### The details behind Ray RLlib resource allocation <a class="anchor" id="resource_allocation"></a>

#### Why did we use 8 CPUs in the tune run above (2 CPUs per trial)?

```
== Status ==
Current time: 2022-07-24 18:18:28 (running for 00:02:09.35)
Memory usage on this node: 9.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8/16 CPUs, 0/0 GPUs, 0.5/3.97 GiB heap, 0.5/1.98 GiB objects
```

By default, the PPO Algorithm uses 2 so-called `RolloutWorkers` (configurable via `config.rollouts(num_rollout_workers=...)`) to collect samples from
environments in parallel.
We reduced this to a single worker via the `config.rollouts(num_rollout_workers=1)` call in one of the cells above.

`RolloutWorkers` are Ray actors that each hold their own copy of the environment and step through episodes in parallel. Each actor in Ray normally uses a single CPU. Besides its `RolloutWorkers`, an RLlib Algorithm also always has one local process (a.k.a. the "driver" process or the "local worker"), which, in the case of PPO,
handles the model/policy learning updates.

For our experiment above, this gives us 2 CPUs (1 rollout worker + 1 local learner) per Algorithm instance.

Since our config specifies two `grid_search` entries, one with 2 learning rates and one with 2 batch sizes, we ran 4 Algorithms in parallel above (2 learning rates x 2 batch sizes = 4 trials). Hence, 8 CPUs were required (4 Algorithms x 2 CPUs each = 8).
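
The CPU arithmetic from this section can be written down as a small helper. This is a back-of-the-envelope sketch that assumes 1 CPU per rollout worker and 1 CPU for the local worker, as described in this section; `cpus_per_trial` and its parameter names are hypothetical and not part of RLlib.

```python
def cpus_per_trial(num_rollout_workers, cpus_per_worker=1, cpus_for_local_worker=1):
    """CPUs needed by one Algorithm: the local worker plus all rollout workers."""
    return cpus_for_local_worker + num_rollout_workers * cpus_per_worker


# 2 learning rates x 2 train batch sizes = 4 trials running in parallel.
num_trials = 2 * 2

per_trial = cpus_per_trial(num_rollout_workers=1)
total_cpus = num_trials * per_trial

print(f"CPUs per trial: {per_trial}")
print(f"CPUs for the whole sweep: {total_cpus}")
```

With PPO's default of 2 rollout workers, the same helper gives 3 CPUs per trial, so the 4-trial sweep would have requested 12 CPUs instead of 8.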


### Summary

In this notebook, we have learned:

* How to use Ray Tune in combination with RLlib for hyperparameter tuning
* How RLlib and Tune determine the required computational resources for a `tune.run()` experiment

### Exercises

#### 1. Using the `config` that we have built so far, let's run another `tune.run()`.

But this time, apply the following changes to our setup:

- Set up only 1 learning rate using the `config.training(lr=...)` method call. Choose the (seemingly) best value from the `tune.run()` experiment above (the one that yielded the highest avg. reward).
- Set up only 1 train batch size using the `config.training(train_batch_size=...)` method call. Choose the (seemingly) best value from the `tune.run()` experiment above (the one that yielded the highest avg. reward).
- Set the number of RolloutWorkers to 5 using the `config.rollouts(num_rollout_workers=5)` method call, which will allow us to collect more environment samples in parallel.
- Set the `num_envs_per_worker` config parameter to 5 using the `config.rollouts(num_envs_per_worker=...)` method call. This will vectorize the environment on each rollout worker (each worker steps through 5 environment copies), and thus batch the action-computing forward passes through our neural networks.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.

**Good luck! :)**

## References
* [Tune, Scalable Hyperparameter Tuning](https://docs.ray.io/en/latest/tune/index.html)
