# Exercise 03. Tune the hyperparameters of a RLlib Multi-Agent Model using Ray Tune

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this tutorial, you will learn:
 * [How to configure Ray Tune to find solid hyperparameters more easily](#configure_ray_tune)
 * [The details behind Ray RLlib resource allocation](#resource_allocation)
 

### How to configure Ray Tune to find solid hyperparameters more easily <a class="anchor" id="configure_ray_tune"></a>

In the previous experiments, we used a single algorithm's (PPO) configuration to create
exactly one Algorithm object and call its `train()` method manually a couple of times.

A common next step in ML or RL is to search for better hyperparameters, neural network architectures, or algorithm settings. This hyperparameter optimization
problem can be tackled in a scalable fashion using Ray Tune (in combination with RLlib!).

<img src="images/rllib_and_tune.png" width="70%">


The following cell demonstrates how you can set up a simple grid search over one very important hyperparameter (the learning rate), using our existing PPO config object and Ray Tune:

In [9]:
# Create a PPOConfig object (same as we did in the previous notebook):
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

from multi_agent_arena.multi_agent_arena import MultiAgentArena

config = PPOConfig()

# Set up our config object the exact same way as before:
# Point to our MultiAgentArena env:
config.environment(env=MultiAgentArena)
# Multi-agent settings:
config.multi_agent(
    policies=["policy1", "policy2"],
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

# Before setting up the learning rate hyperparam sweep,
# let's see what the default learning rate for PPO actually is:
print(f"Default learning rate for PPO is: {config.lr}")

# Now let's change our existing config object and add a simple
# grid-search over two different learning rates to it:
config.training(
    lr=tune.grid_search([0.005, 0.0003]),
)


Default learning rate for PPO is: 5e-05


<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7f9daa259190>
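
To make the grid search above more concrete, here is a minimal pure-Python sketch of how a grid-search entry can be expanded into one concrete config per trial. It assumes that `tune.grid_search(values)` produces a marker of the form `{"grid_search": values}`; the `expand_grid` helper and the flat `sweep` dict are hypothetical and only illustrate the idea, they are not Ray Tune's actual implementation.

```python
from itertools import product


def expand_grid(config):
    """Expand every {"grid_search": [...]} entry into concrete configs.

    Hypothetical helper for illustration; handles flat dicts only.
    """
    # Keys whose value is a grid-search marker.
    grid_keys = [
        key for key, value in config.items()
        if isinstance(value, dict) and "grid_search" in value
    ]
    grid_values = [config[key]["grid_search"] for key in grid_keys]

    trials = []
    # One trial per element of the Cartesian product of all grids.
    for combo in product(*grid_values):
        trial = dict(config)
        trial.update(zip(grid_keys, combo))
        trials.append(trial)
    return trials


# Illustrative config: the learning rates match the cell above,
# the "train_batch_size" value is an assumption made for this sketch.
sweep = {
    "lr": {"grid_search": [0.005, 0.0003]},
    "train_batch_size": 4000,
}

trial_configs = expand_grid(sweep)
for trial in trial_configs:
    print(trial)
```

With two learning rates and no other grid, the sweep yields two trials. Adding a second grid-searched key multiplies the number of trials, which is why grid searches grow quickly.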

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [None]:
###############
# EXAMPLE USING RAY TUNE API .run() UNTIL STOP CONDITION
#
# Note about Ray Tune verbosity.
# Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
# 0 = silent
# 1 = only status updates, no logging messages
# 2 = status and brief trial results, includes logging messages
# 3 = status and detailed trial results, includes logging messages
# Defaults to 3.
###############

verbosity = 3 # Tune logging verbosity


# Define trainer runtime config values
checkpoint_frequency = 1          # every how many train() calls do we create a checkpoint?
checkpoint_at_end = True          # always save last checkpoint (no matter the frequency)
relative_checkpoint_dir = "multiagent_PPO_logs" # redirect logs instead of ~/ray_results/


experiment_results = tune.run("PPO", 
    # Stopping criteria whichever occurs first: average reward over training episodes, or ...
    stop={
        #"episode_reward_mean": 400, # stop if average (sum of) rewards in an episode is 400 or more
        "training_iteration": 5,  # stop after 5 training iterations (calls to `Algorithm.train()`)
        # "timesteps_total": 100000,  # stop if reached 100,000 sampling timesteps
    },  
          
    # training config params
    config=config.to_dict(),
                    
    # redirect logs instead of default ~/ray_results/
    local_dir=relative_checkpoint_dir,
         
    # set the checkpoint-saving frequency >= evaluation_interval
    checkpoint_freq=checkpoint_frequency,
    checkpoint_at_end=checkpoint_at_end,
         
    # Tune logging verbosity (see the note at the top of this cell)
    verbose=verbosity,
                   
    # Define what we are comparing for, when we search for the
    # "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
    )

print("Training completed.")
print("Best checkpoint: ", experiment_results.best_checkpoint)


[2m[36m(PPO pid=26939)[0m 2022-07-24 18:16:29,265	INFO algorithm.py:1774 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
[2m[36m(PPO pid=26939)[0m 2022-07-24 18:16:29,265	INFO algorithm.py:332 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


### The details behind Ray RLlib resource allocation <a class="anchor" id="resource_allocation"></a>

When running

## Homework



### Exercises

1. 
