# Notebook 03. Introduction to Ray Tune and hyperparameter optimization (HPO)

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>

➡️ [Next notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>
⬅️ [Previous notebook](./ex_02_create_multiagent_rllib_env.ipynb) <br>

### Learning objectives
In this notebook, you will learn:
 * [How to configure Ray Tune to find solid hyperparameters more easily](#configure_ray_tune)
 * [The details behind Ray RLlib resource allocation](#resource_allocation)
 

In [None]:
# Import required packages.

import gym
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

# Importing the very same environment class that we have coded together in
# the previous notebook.
from multi_agent_arena.multi_agent_arena import MultiAgentArena


print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

### How to configure Ray Tune to find solid hyperparameters more easily <a class="anchor" id="configure_ray_tune"></a>

In the previous experiments, we used a single algorithm's (PPO) configuration to create
exactly one Algorithm object and call its `train()` method manually a couple of times.

A common thing to try when doing ML or RL is to look for better choices of hyperparameters, neural network architectures, or algorithm settings. This hyperparameter optimization
problem can be tackled in a scalable fashion using Ray Tune (in combination with RLlib!).

<img src="images/rllib_and_tune.png" width=800>


The following cells demonstrate how you can set up a simple grid search over one very important hyperparameter (the learning rate), together with the train batch size, using a PPO config object and Ray Tune:

In [None]:
# Create a PPOConfig object (same as we did in the previous notebook):
config = PPOConfig()

# Set up our config object the exact same way as before:
# Point to our MultiAgentArena env:
config.environment(env=MultiAgentArena)

# Set up the multi-agent mapping:

# The environment provides M agent IDs.
# RLlib has N policies (neural networks).
# The `policy_mapping_fn` maps the M agent IDs to the N policies (N <= M).

# If you don't provide a policy_mapping_fn, all agent IDs will map to "default_policy".
config.multi_agent(
    # Tell RLlib to create 2 policies with these IDs here:
    policies=["policy1", "policy2"],
    # Tell RLlib to map agent1 to policy1 and agent2 to policy2.
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

# Reduce the number of rollout workers from 2 (default) to 1 to save resources during the expensive hyperparameter sweep.
# IMPORTANT: Resource requirements for Tune hyperparameter sweeps and different RLlib algorithm setups
# are explained further below.
config.rollouts(num_rollout_workers=1)
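
The policy mapping used above can be checked in isolation, without starting RLlib. The following is a minimal sketch that rewrites the lambda as a named function with the same logic; the agent IDs `agent1` and `agent2` are taken from the comment in the config cell, and the `None` defaults for `episode` and `worker` are an assumption made only so the function can be called by hand.

```python
def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
    """Same logic as the lambda passed to `config.multi_agent()`."""
    return "policy1" if agent_id == "agent1" else "policy2"


# Apply the mapping to the two agent IDs named in the config comments.
mapping = {agent_id: policy_mapping_fn(agent_id) for agent_id in ["agent1", "agent2"]}
print(mapping)
```

Note that any agent ID other than `agent1` falls through to `policy2`, so the function is only a one-to-one mapping as long as the environment emits exactly these two IDs.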

Now, let's explore how a very simple hyperparameter search should be configured with RLlib and Tune:

In [None]:
# Before setting up the hyperparameter sweep,
# let's see what the default learning rate and train batch size are for PPO:
print(f"Default learning rate for PPO is: {config.lr}")
print(f"Default train batch size for PPO is: {config.train_batch_size}")

In [None]:
# Now let's change our existing config object and add a simple
# grid search over two learning rates and two train batch sizes to it:
config.training(
    lr=tune.grid_search([5e-5, 1e-4]),
    train_batch_size=tune.grid_search([3000, 4000]),
)
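
To make the size of this sweep concrete, here is a minimal plain-Python sketch of how two grid-searched parameters expand into trials. It does not call Ray Tune. `expand_grid` is a hypothetical helper written for illustration, and it only mirrors the idea that a grid search tries every combination of the listed values.

```python
from itertools import product


def expand_grid(search_space):
    """Return one config dict per combination of the listed values."""
    names = list(search_space)
    return [dict(zip(names, values)) for values in product(*search_space.values())]


# The same values that are passed to `tune.grid_search()` above.
trial_configs = expand_grid({
    "lr": [5e-5, 1e-4],
    "train_batch_size": [3000, 4000],
})

for trial_config in trial_configs:
    print(trial_config)
print(f"Number of trials: {len(trial_configs)}")
```

Two values for each of the two parameters yield 2 x 2 = 4 combinations, which is why the sweep launches four trials.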


💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [None]:
# Example using the Ray Tune API (`tune.run()`), which trains until some stopping condition is met.
# This creates one (or more) Algorithms under the hood automatically, without us having to
# build them from the config ourselves.

experiment_results = tune.run(
    "PPO",

    # training config params (translated into a python dict!)
    config=config.to_dict(),

    # Stopping criteria; a trial stops as soon as any one of them is met:
    stop={
        "training_iteration": 6,     # stop after n training iterations (calls to `Algorithm.train()`)
        #"episode_reward_mean": 400, # stop if average (sum of) rewards in an episode is 400 or more
        #"timesteps_total": 100000,  # stop if reached 100,000 sampling timesteps
    },  

    # Write logs to ./results instead of the default ~/ray_results/
    local_dir="results",
         
    # Every how many train() calls do we create a checkpoint?
    checkpoint_freq=1,
    # Always save last checkpoint (no matter the frequency).
    checkpoint_at_end=True,

    ###############
    # Note about Ray Tune verbosity.
    # Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
    # 0 = silent
    # 1 = only status updates, no logging messages
    # 2 = status and brief trial results, includes logging messages
    # 3 = status and detailed trial results, includes logging messages
    # Defaults to 3.
    ###############
    verbose=3,
                   
    # Define the metric and mode used to compare trials when we search for the
    # "best" trial or checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
)

print("Training completed.")
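
The "whichever occurs first" behavior of the `stop` dict can be illustrated with a small stand-alone sketch. This is not Tune's implementation; `should_stop` is a hypothetical helper, and the result values below are made up for illustration. It assumes that a trial stops once any listed metric reaches or exceeds its threshold.

```python
def should_stop(result, stop):
    """Return True if any metric listed in `stop` has reached its threshold."""
    return any(result[name] >= threshold for name, threshold in stop.items())


# Thresholds as in the `tune.run()` call above, with the reward criterion enabled.
stop = {"training_iteration": 6, "episode_reward_mean": 400}

# Made-up training results, for illustration only.
print(should_stop({"training_iteration": 3, "episode_reward_mean": 120.0}, stop))
print(should_stop({"training_iteration": 6, "episode_reward_mean": 150.0}, stop))
print(should_stop({"training_iteration": 4, "episode_reward_mean": 410.0}, stop))
```

Only the first result keeps training: the second reaches the iteration limit and the third reaches the reward threshold.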


In [None]:
# Using the returned `experiment_results` object,
# we can extract the best checkpoint according to some criterion, e.g. `episode_reward_mean`.

# We ran four trials (one Algorithm instance per grid-search combination), so this returns
# the trial that scored best on the `metric` and `mode` passed to `tune.run()` above.
best_trial = experiment_results.get_best_trial()
print("Best trial: ", best_trial)


# From that trial, extract the best checkpoint (max `episode_reward_mean` value).
best_checkpoint = experiment_results.get_best_checkpoint(trial=best_trial, metric="episode_reward_mean", mode="max")

# We would expect this to be either the very last checkpoint or one close to it:
print(f"Best checkpoint from training: {best_checkpoint}")
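
What `metric` and `mode` mean for picking a "best" trial can be sketched in a few lines of plain Python. `pick_best` is a hypothetical helper, not part of Ray Tune, and the trial names and reward values are made up for illustration. The sketch assumes trials are compared on their last reported result.

```python
def pick_best(last_results, metric, mode="max"):
    """Return the trial name whose last reported `metric` is best under `mode`."""
    chooser = max if mode == "max" else min
    return chooser(last_results, key=lambda name: last_results[name][metric])


# Made-up numbers for illustration only; they are not results from this notebook.
last_results = {
    "trial_a": {"episode_reward_mean": -3.2},
    "trial_b": {"episode_reward_mean": 1.7},
    "trial_c": {"episode_reward_mean": 0.4},
}

print(pick_best(last_results, metric="episode_reward_mean", mode="max"))
```

With `mode="max"` the trial with the highest mean episode reward wins; `mode="min"` would be the right choice for a metric such as a loss.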

### The details behind Ray RLlib resource allocation <a class="anchor" id="resource_allocation"></a>

#### Why did we use 8 CPUs in the tune run above (2 CPUs per trial)?

```
== Status ==
Current time: 2022-07-24 18:18:28 (running for 00:02:09.35)
Memory usage on this node: 9.9/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8/16 CPUs, 0/0 GPUs, 0.5/3.97 GiB heap, 0.5/1.98 GiB objects
```

<img src="images/closer_look_at_rllib.png" width=700 />

By default, the PPO Algorithm uses 2 so-called `RolloutWorkers` (you can change this via `config.rollouts(num_rollout_workers=...)`) for collecting samples from
environments in parallel.
We reduced this to a single worker via the `config.rollouts(num_rollout_workers=1)` call in the configuration cell above.

`RolloutWorkers` are Ray Actors that each hold their own copy of the environment and step through episodes in parallel. Each Actor in Ray normally uses a single CPU. Besides its `RolloutWorkers`, an RLlib Algorithm also always has one local process (a.k.a. the "driver" process or the "local worker"), which, in the case of PPO,
handles the model/policy learning updates.

For our experiment above, this gives us 2 CPUs (1 rollout worker + 1 local learner) per Algorithm instance.

Since our config specifies two `grid_search` calls, one over 2 learning rates and one over 2 train batch sizes, we ran 4 Algorithms in parallel above (2 learning rates x 2 batch sizes = 4 trials). Hence 8 CPUs were required (4 Algorithms x 2 CPUs each = 8).
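
The arithmetic above can be written down as a small calculator. This is a sketch under the assumptions stated in this section (one CPU per rollout worker plus one CPU for the local worker, no GPUs); the function names are hypothetical and it does not query Ray for the actual resource request.

```python
def cpus_per_trial(num_rollout_workers, cpus_per_worker=1, cpus_for_local_worker=1):
    """CPUs requested by a single Algorithm instance (one Tune trial)."""
    return cpus_for_local_worker + num_rollout_workers * cpus_per_worker


def total_cpus(num_trials, num_rollout_workers):
    """CPUs requested when all trials of a sweep run in parallel."""
    return num_trials * cpus_per_trial(num_rollout_workers)


print(cpus_per_trial(num_rollout_workers=1))            # 2
print(total_cpus(num_trials=4, num_rollout_workers=1))  # 8
```

Under the same assumptions, keeping the default of 2 rollout workers would have raised the request to 3 CPUs per trial, or 12 CPUs for the four parallel trials.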


### Summary

In this notebook, we have learned how to:

* Use Ray Tune in combination with RLlib for hyperparameter tuning
* Work out the computational resources a given `tune.run()` experiment requires from the RLlib Algorithm configuration and the Tune hyperparameter search setup

### Exercise 03

#### Using the `config` that we have built so far, let's run another `tune.run()`.

But this time, apply the following changes to our setup:

- Set up only 1 learning rate using the `config.training(lr=...)` method call. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set up only 1 train batch size using the `config.training(train_batch_size=...)` method call. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set the number of RolloutWorkers to 5 using the `config.rollouts(num_rollout_workers=5)` method call, which will allow us to collect more environment samples in parallel.
- Set the `num_envs_per_worker` config parameter to 5 using the `config.rollouts(num_envs_per_worker=...)` method call. This will run several environment copies on each rollout worker and thus batch the action-computing forward passes through our neural networks.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.

**Good luck! :)**

 ## References
 * [Tune, Scalable Hyperparameter Tuning](https://docs.ray.io/en/latest/tune/index.html)
