# Exercise 03. Tune the hyperparameters of an RLlib Multi-Agent Model using Ray Tune

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives
In this tutorial, you will learn:
 * [How to configure Ray Tune to find solid hyperparameters more easily](#configure_ray_tune)
 * [The details behind Ray RLlib resource allocation](#resource_allocation)
 

### How to configure Ray Tune to find solid hyperparameters more easily <a class="anchor" id="configure_ray_tune"></a>

In the previous experiments, we used a single algorithm's (PPO) configuration to create
exactly one Algorithm object and called its `train()` method manually a couple of times.

A common thing to try when doing ML or RL is to look for better choices of hyperparameters, neural network architectures, or algorithm settings. This hyperparameter optimization
problem can be tackled in a scalable fashion using Ray Tune (in combination with RLlib!).

<img src="images/rllib_and_tune.png" width="70%">


The following cell demonstrates how you can set up a simple grid search over one very important hyperparameter (the learning rate), using our already existing PPO config object and Ray Tune:

In [2]:
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune
from multi_agent_arena.multi_agent_arena import MultiAgentArena

# Create a PPOConfig object (same as we did in the previous notebook):
config = PPOConfig()

# Set up our config object the exact same way as before:
# Point to our MultiAgentArena env:
config.environment(
    env=MultiAgentArena,
    env_config={
    # If you'd like, feel free to set the size of our world differently.
    #    "width": 12,
    #    "height": 12,
    },
)
# Multi-agent settings (same as before):
config.multi_agent(
    policies=["policy1", "policy2"],
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

# Before setting up the learning rate hyperparam sweep,
# let's see what the default learning rate for PPO actually is:
print(f"Default learning rate for PPO is: {config.lr}")

# Now let's change our existing config object and add a simple
# grid-search over two different learning rates to it:
config.training(
    lr=tune.grid_search([0.00005, 0.0003]),
)


Default learning rate for PPO is: 5e-05


<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7fd722553940>
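
To make the effect of the grid search more concrete, here is a minimal, pure-Python sketch of how a grid expands into individual trial configs. It is only an illustration of the idea and not Ray Tune's actual implementation; the helper `expand_grid` and the second hyperparameter values are assumptions made up for this example.

```python
from itertools import product


def expand_grid(search_space):
    """Expand {name: [values, ...]} into one config dict per combination.

    Hypothetical helper for illustration; Ray Tune does this internally.
    """
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# The sweep from the cell above: one hyperparameter with two values,
# so we expect two trials (one per learning rate).
lr_only = expand_grid({"lr": [0.00005, 0.0003]})
print(lr_only)

# Adding a second grid (values chosen only for illustration) multiplies
# the number of trials: 2 learning rates x 2 batch sizes = 4 trials.
both = expand_grid({"lr": [0.00005, 0.0003], "train_batch_size": [2000, 4000]})
print(len(both))
```

Because the number of trials grows multiplicatively with every additional grid, it usually pays to keep each grid short.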

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [11]:
###############
# EXAMPLE USING RAY TUNE API .run() UNTIL STOP CONDITION
#
# Note about Ray Tune verbosity.
# Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
# 0 = silent
# 1 = only status updates, no logging messages
# 2 = status and brief trial results, includes logging messages
# 3 = status and detailed trial results, includes logging messages
# Defaults to 3.
###############


experiment_results = tune.run(
    # Registered algorithm abbreviation.
    "PPO",
    # Stopping criteria; a trial stops as soon as any one of them is met.
    stop={
        "training_iteration": 3,     # stop after 3 training iterations (calls to `Algorithm.train()`)
        #"episode_reward_mean": 400, # stop if average (sum of) rewards in an episode is 400 or more
        #"timesteps_total": 100000,  # stop if reached 100,000 sampling timesteps
    },
    # Training config params (translated into a Python dict).
    config=config.to_dict(),
    # redirect logs instead of default ~/ray_results/
    local_dir="results",
    # Checkpoint frequency in training iterations (set it >= evaluation_interval).
    checkpoint_freq=1,
    checkpoint_at_end=True,
    # Verbosity level (3 = most detailed); lower it to reduce logging messages.
    verbose=3,
    # Define what we are comparing for, when we search for the
    # "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
)

print("Training completed.")
print("Best checkpoint: ", experiment_results.best_checkpoint)


| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_f370e_00000 | TERMINATED | 127.0.0.1:26939 | 0.005 | 5 | 66.1827 | 20000 | -0.471 | 25.5 | -25.5 | 100 |
| PPO_MultiAgentArena_f370e_00001 | TERMINATED | 127.0.0.1:26954 | 0.0003 | 5 | 60.1578 | 20000 | 1.659 | 19.5 | -23.7 | 100 |


[2m[36m(PPO pid=26939)[0m 2022-07-24 18:16:29,265	INFO algorithm.py:1774 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
[2m[36m(PPO pid=26939)[0m 2022-07-24 18:16:29,265	INFO algorithm.py:332 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(PPO pid=26939)[0m 2022-07-24 18:16:47,206	INFO trainable.py:160 -- Trainable.setup took 17.942 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(PPO pid=26954)[0m 2022-07-24 18:17:06,773	INFO algorithm.py:1774 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=T

Result for PPO_MultiAgentArena_f370e_00000:
  agent_timesteps_total: 8000
  counters:
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_env_steps_sampled: 4000
    num_env_steps_trained: 4000
  custom_metrics: {}
  date: 2022-07-24_18-17-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.300000000000008
  episode_reward_mean: -8.129999999999997
  episode_reward_min: -33.00000000000006
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: b06a13ec4c9d44f2866d39064923d025
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.004999999888241291
          entropy: 1.3389638662338257
          entropy_coeff: 0.0
          kl: 0.048597175627946854
          model: {}
          policy_loss: -0.08453787863254547
          total_loss: 6.081151485443115
          vf_explained_var: -0.03356152027845383
          vf_loss: 6

[2m[36m(PPO pid=26954)[0m 2022-07-24 18:17:27,349	INFO trainable.py:160 -- Trainable.setup took 20.580 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_f370e_00001:
  agent_timesteps_total: 8000
  counters:
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_env_steps_sampled: 4000
    num_env_steps_trained: 4000
  custom_metrics: {}
  date: 2022-07-24_18-17-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 8.100000000000014
  episode_reward_mean: -11.0925
  episode_reward_min: -45.000000000000064
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 7930c8da23ce49829b3cecebb3c03c5f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3541070222854614
          entropy_coeff: 0.0
          kl: 0.03301246091723442
          model: {}
          policy_loss: -0.06401419639587402
          total_loss: 6.8029632568359375
          vf_explained_var: -0.013164803385734558
          vf_loss: 6.8603749

2022-07-24 18:18:28,810	INFO tune.py:737 -- Total run time: 129.61 seconds (129.23 seconds for the tuning loop).


Training completed.
Best checkpoint:  <ray.air.checkpoint.Checkpoint object at 0x7f9daa29adc0>
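
The `stop` dict passed to `tune.run()` above can hold several criteria at once, and a trial ends as soon as any one of them is reached. The following pure-Python sketch illustrates that "whichever occurs first" logic under the assumption that a criterion counts as met once the reported value is greater than or equal to its threshold. The function `should_stop` and the per-iteration numbers are made up for illustration and are not part of the Ray Tune API.

```python
def should_stop(result, stop):
    """Return True once any criterion in `stop` is met by `result`.

    Hypothetical helper: a criterion counts as met when the reported
    value is >= its threshold. Keys missing from `result` are ignored.
    """
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 3, "episode_reward_mean": 400}

# Made-up results of successive training iterations.
results = [
    {"training_iteration": 1, "episode_reward_mean": -8.1},
    {"training_iteration": 2, "episode_reward_mean": -3.0},
    {"training_iteration": 3, "episode_reward_mean": 1.2},
    {"training_iteration": 4, "episode_reward_mean": 2.5},
]

# The first iteration at which at least one criterion is satisfied.
stopped_at = next(r["training_iteration"] for r in results if should_stop(r, stop))
print(f"Trial stops after iteration {stopped_at}")
```

In this made-up run the reward threshold is never reached, so the iteration limit is what ends the trial.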

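The `metric` and `mode` arguments tell Tune how to rank trials when it reports the "best" result. As a rough sketch of that idea (not Tune's implementation), the snippet below picks the best entry from a list of plain dicts; the helper `best_trial`, the trial names, and the reward values are hypothetical.

```python
def best_trial(trials, metric, mode):
    """Pick the trial with the highest (mode="max") or lowest (mode="min") metric.

    Hypothetical helper operating on plain dicts, for illustration only.
    """
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(trials, key=lambda trial: trial[metric])


# Made-up final results for the two trials of a learning-rate sweep.
trials = [
    {"name": "trial_a", "lr": 0.00005, "episode_reward_mean": -0.5},
    {"name": "trial_b", "lr": 0.0003, "episode_reward_mean": 1.7},
]

best = best_trial(trials, metric="episode_reward_mean", mode="max")
print(f"Best trial: {best['name']} (lr={best['lr']})")
```

For a reward-like metric, `mode="max"` is the natural choice; for a loss-like metric you would typically use `mode="min"` instead.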

### The details behind Ray RLlib resource allocation <a class="anchor" id="resource_allocation"></a>

When running

## Homework



### Exercises

1. 
