# Exercise 02. Create a Custom Multi-Agent RLlib Environment

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives

In this tutorial, you will learn how to:
 * [Code a custom RLlib Multi-agent environment](#multi_agent_env)
 * [Select an algorithm and instantiate a config object using that algorithm's config class](#rllib_algo)
 * [Train an RL model using a multi-agent algorithm from RLlib](#tune_run)



In [8]:
# import libraries

import time
import ray, gym
from gym.spaces import Discrete, MultiDiscrete
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray import tune
from ray.tune.logger import pretty_print
import numpy as np
from ipywidgets import Output
from IPython import display

print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


## Code a custom RLlib Multi-agent environment <a class="anchor" id="multi_agent_env"></a>

We will create the following (adversarial) multi-agent environment.  We will use this custom environment for the rest of this tutorial.

<img src="images/custom_environment.png" width="98%">

### Review OpenAI Gym Environments

In the last lesson we learned about OpenAI Gym environments.  Specifically, we covered the following Gym concepts:
<ul>
    <li>Action Space</li>
    <li>Observation Space</li>
    <li>Rewards</li>
    <li><i>`done`</i> signal</li>
    <li>Gym Environment API methods:</li>
    <ul>
        <li>reset(self)</li>
        <li>step(self, action: dict)</li>
        <li>render(self, mode=None)</li>
    </ul>
    </ul>
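
As a quick refresher, the pieces listed above fit together roughly as in the sketch below. It is a toy, dependency-free class that only mimics the Gym 0.21-style API (`reset` returns an observation, `step` returns a 4-tuple). The class name `CountdownEnv` and its reward values are made up for illustration; a real environment would subclass `gym.Env` and declare `action_space` and `observation_space`.

```python
class CountdownEnv:
    """Toy environment that mimics the Gym API without importing gym."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        # Start a new episode and return the initial observation.
        self.state = self.start
        return self.state

    def step(self, action):
        # Action 1 decrements the counter; any other action does nothing.
        if action == 1:
            self.state -= 1
        done = self.state <= 0
        # Reward 1.0 on reaching zero, small penalty otherwise.
        reward = 1.0 if done else -0.1
        return self.state, reward, done, {}

    def render(self, mode=None):
        print(f"state={self.state}")


toy_env = CountdownEnv()
obs = toy_env.reset()
done = False
total_reward = 0.0
while not done:
    # Always pick action 1 so the episode ends after `start` steps.
    obs, reward, done, info = toy_env.step(1)
    total_reward += reward
toy_env.render()
```

The loop is the standard interaction pattern: call `reset` once, then call `step` until `done` is true.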

### RLlib MultiAgentEnv API

RLlib supports environments created using the OpenAI Gym API. This means that, to create an RLlib environment from scratch, you need to implement at least the methods [required by Gym](https://www.gymlibrary.ml/content/api/#standard-methods).  

We want a multi-agent environment, so we will subclass the [RLlib MultiAgentEnv base class](https://github.com/ray-project/ray/blob/master/rllib/env/multi_agent_env.py). In addition to the minimal Gym methods, our environment implements a constructor and two helper methods:
    
<ul>
    <li>__init__(self, config=None)</li>
    <li>_get_obs(self)</li>
    <li>_move(self, coords, action, is_agent1)</li>
    </ul>

In [3]:
# Let's code our multi-agent environment

class MultiAgentArena(MultiAgentEnv):
    
    def __init__(self, config=None):
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()

        # For rendering.
        self.out = None
        if config.get("render"):
            self.out = Output()
            display.display(self.out)

    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords.
        self.agent1_pos = [0, 0]  # upper left corner
        self.agent2_pos = [self.height - 1, self.width - 1]  # lower right corner

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

        # How many timesteps have we done in this episode.
        self.timesteps = 0

        # Did we have a collision in recent step?
        self.collision = False
        # How many collisions in total have we had in this episode?
        self.num_collisions = 0

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.
        
        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """
        
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Agent2 always moves first.
        # events = [collision|agent1_new_field]
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Useful for rendering.
        self.collision = "collision" in events
        if self.collision is True:
            self.num_collisions += 1
            
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2
        
        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of [own discrete pos, other agent's
        discrete pos]) using each agent's current row/col position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/col) using the
        given action (0=up, 1=right, 2=down, 3=left) and returns the resulting set of events:
        {"collision"} if the move was blocked by the other agent (the agent stays in place),
        {"agent1_new_field"} if agent1 entered a field it had not visited before,
        or an empty set otherwise.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "agent1_new_field" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):

        if self.out is not None:
            self.out.clear_output(wait=True)

        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f} ({} collisions)".format(self.agent2_R, self.num_collisions))
        print()
        time.sleep(0.25)
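
The observation encoding in `_get_obs` flattens each row-major `[row, col]` position into a single integer, `row * width + col`. The standalone sketch below shows that mapping and its inverse. The helper names `encode_pos` and `decode_pos` are hypothetical and are not part of the environment class.

```python
def encode_pos(row, col, width=10):
    """Flatten a row-major (row, col) grid position into one integer."""
    return row * width + col


def decode_pos(index, width=10):
    """Recover (row, col) from a flattened index."""
    return divmod(index, width)


# Agent1 at [2, 2] and agent2 at [7, 7] on the default 10x10 grid.
obs_agent1 = [encode_pos(2, 2), encode_pos(7, 7)]
print(obs_agent1)
print(decode_pos(obs_agent1[1]))
```

On the default 10x10 grid every encoded position lies between 0 and 99, which is why the observation space is `MultiDiscrete([100, 100])`.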


<br>
In the cell below:
<ul>
    <li>Initialize the environment</li>
    <li>Make both agents take a few steps</li>
    <li>Render the environment after each agent takes a step.</li>
    </ul>


In [4]:
env = MultiAgentArena(config={"render": True})
obs = env.reset()

with env.out:
    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()


print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))

Output()

Agent1's x/y position=[2, 2]
Agent2's x/y position=[7, 7]
Env timesteps=4
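
The final positions printed above can be checked by hand. The sketch below replays the same four actions with a standalone copy of the movement rule (row-major coordinates, clamped at the walls). `apply_action` is a hypothetical helper that ignores collisions, which is safe here because the agents never meet during these four steps.

```python
def apply_action(pos, action, height=10, width=10):
    """Return the new [row, col] after one move: 0=up, 1=right, 2=down, 3=left."""
    row, col = pos
    row += -1 if action == 0 else 1 if action == 2 else 0
    col += 1 if action == 1 else -1 if action == 3 else 0
    # Clamp to the grid so agents cannot walk through walls.
    row = min(max(row, 0), height - 1)
    col = min(max(col, 0), width - 1)
    return [row, col]


agent1, agent2 = [0, 0], [9, 9]
for a1, a2 in [(2, 0), (1, 3), (1, 3), (2, 0)]:
    agent2 = apply_action(agent2, a2)  # agent2 moves first, as in step()
    agent1 = apply_action(agent1, a1)

print(agent1, agent2)
```

Note that the positions are `[row, col]` pairs, even though the print statements in the cell above label them as x/y.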


## Select an algorithm and instantiate a config object using that algorithm's config class <a class="anchor" id="rllib_algo"></a>

<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">Open RLlib docs</a></li>
    <li>Scroll down and click the link for the algorithm you are looking for, e.g. <i><b>PPO</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algo docs page</a>, click on the link <i><b>Implementation</b></i>.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algo code file on GitHub</a>.</li>
    <li>Search the github code file for the word <i><b>config</b></i></li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, then </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    <li>Scroll down to the config <b>__init__()</b> method</li>
    <ol>
            <li>Algorithm default hyperparameter values are here.</li>
    </ol>
    </ol>

In [None]:
# Sven's trainer config

# TRAINER_CFG = {
#     # Using our environment class defined above.
#     "env": MultiAgentArena,
#     # Use `framework=torch` here for PyTorch.
#     "framework": "tf",

#     # Run on 1 GPU on the "learner".
#     "num_gpus": 1,
#     # Use 15 ray-parallelized environment workers,
#     # which collect samples to learn from. Each worker gets assigned
#     # 1 CPU.
#     "num_workers": 15,
#     # Each of the 15 workers has 10 environment copies ("vectorization")
#     # for faster (batched) forward passes.
#     "num_envs_per_worker": 10,

#     # Multi-agent setup: 2 policies.
#     "multiagent": {
#         "policies": {"policy1", "policy2"},
#         "policy_mapping_fn": lambda agent_id: "policy1" if agent_id == "agent1" else "policy2"
#     },
# }

In [10]:
# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.ppo import PPOConfig

# uncomment below to see the long list of PPO-specific default config values
# print(pretty_print(PPOConfig().to_dict()))

# Define algorithm config values
env_name = MultiAgentArena
evaluation_interval = 2   # run an evaluation step every 2 training iterations
evaluation_duration = 20  # number of episodes to run per evaluation step
num_workers = 4           # number of parallel rollout workers (actors), in addition to the head
num_gpus = 0              # number of GPUs to use in the cluster
num_envs_per_worker = 1   # 1 = no vectorization of environments per worker

# Define trainer runtime config values
checkpoint_freq = evaluation_interval # checkpoint frequency, should be >= evaluation_interval
checkpoint_at_end = True                # always save last checkpoint
relative_checkpoint_dir = "multiagent_PPO_logs" # redirect logs instead of ~/ray_results/
random_seed = 415
# Set the log level to DEBUG, INFO, WARN, or ERROR 
log_level = "ERROR"

# Create a new training config
# override certain default algorithm config values
config_train = (
    PPOConfig()
    .framework(framework='torch')
    .environment(env=env_name, disable_env_checking=False)
    .rollouts(num_rollout_workers=num_workers, num_envs_per_worker=num_envs_per_worker)
    .resources(num_gpus=num_gpus, )
#     .training(gamma=0.9, lr=0.01, kl_coeff=0.3)
    .evaluation(evaluation_interval=evaluation_interval, 
                evaluation_duration=evaluation_duration)
    .debugging(seed=random_seed, log_level=log_level)
    .multi_agent(policies=["policy1", "policy2"], 
                 policy_mapping_fn=lambda agent_id: "policy1" if agent_id == "agent1" else "policy2")
)

print(type(config_train))


<class 'ray.rllib.algorithms.ppo.ppo.PPOConfig'>
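
The `policy_mapping_fn` passed to `.multi_agent()` decides which policy acts for which agent. The sketch below writes the same mapping as a named function. Accepting extra positional and keyword arguments is an assumption, meant to tolerate RLlib versions that also pass objects such as the episode or the worker to this callback.

```python
def policy_mapping_fn(agent_id, *args, **kwargs):
    """Map agent1 to policy1 and every other agent to policy2."""
    return "policy1" if agent_id == "agent1" else "policy2"


for agent_id in ["agent1", "agent2"]:
    print(agent_id, "->", policy_mapping_fn(agent_id))
```

Because each agent gets its own policy, the two agents are trained with separate networks and can learn opposing behaviors.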


## Train an RL model using a multi-agent algorithm from RLlib <a class="anchor" id="tune_run"></a>

💡 <b>Right-click on the cell below and choose "Enable Scrolling for Outputs"!</b>  This will make it easier to view, since model training output can be very long!

In [None]:
# Sven's original tuning run

# results = tune.run(
#     # RLlib Trainer class (we use the "PPO" algorithm today).
#     PPOTrainer,
#     # Give our experiment a name (we will find results/checkpoints
#     # under this name on the server's `~/ray_results/` dir).
#     name=f"CUJ-RL",
#     # The RLlib config (defined in a cell above).
#     config=TRAINER_CFG,
#     # Take a snapshot every 2 iterations.
#     checkpoint_freq=2,
#     # Plus one at the very end of training.
#     checkpoint_at_end=True,
#     # Run for exactly 20 training iterations.
#     stop={"training_iteration": 20},
#     # Define what we are comparing for, when we search for the
#     # "best" checkpoint at the end.
#     metric="episode_reward_mean",
#     mode="max")

# print("Best checkpoint: ", results.best_checkpoint)


In [11]:
###############
# EXAMPLE USING RAY TUNE API .run() IN A LOOP UNTIL STOP CONDITION
# Note about Ray Tune verbosity.
# Screen verbosity in Ray Tune is defined as verbose = 0, 1, 2, or 3, where:
# 0 = silent
# 1 = only status updates, no logging messages
# 2 = status and brief trial results, includes logging messages
# 3 = status and detailed trial results, includes logging messages
# Defaults to 3.
###############

# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

evaluation_interval = 100   # training iterations between eval steps (note: config_train was already built above with the earlier value)
verbosity = 2 # Tune screen verbosity

trainer = tune.run("PPO", 
    # Stopping criteria whichever occurs first: average reward over training episodes, or ...
    stop={#"episode_reward_mean": 400, # stop if achieve 400 out of max 500
          "training_iteration": 20,  # stop after 20 training iterations
          # "timesteps_total": 100000,  # stop if achieved 100,000 timesteps
         },  
              
    # training config params
    config = config_train.to_dict(),
                    
    #redirect logs instead of default ~/ray_results/
    local_dir = relative_checkpoint_dir, #relative path
         
    # set checkpoint frequency >= evaluation_interval
    checkpoint_freq = checkpoint_freq,
    checkpoint_at_end=True,
         
    # Reduce logging messages
    verbose = verbosity,
                   
    # Define what we are comparing for, when we search for the
    # "best" checkpoint at the end.
    metric="episode_reward_mean",
    mode="max",
    )

print("Training completed.")
print("Best checkpoint: ", trainer.best_checkpoint)


2022-07-10 21:59:51,747	INFO tune.py:862 -- Initializing Ray automatically.For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run`.
2022-07-10 21:59:53,801	ERROR services.py:1494 -- Failed to start the dashboard: Failed to start the dashboard, return code 0
 The last 10 lines of /tmp/ray/session_2022-07-10_21-59-51_749470_76756/logs/dashboard.log:
  File "/Users/christy/Documents/ray/python/ray/dashboard/head.py", line 105, in _configure_http_server
    http_server = HttpServerDashboardHead(
  File "/Users/christy/Documents/ray/python/ray/dashboard/http_server_head.py", line 69, in __init__
    raise ex
  File "/Users/christy/Documents/ray/python/ray/dashboard/http_server_head.py", line 60, in __init__
    build_dir = setup_static_dir()
  File "/Users/christy/Documents/ray/python/ray/dashboard/http_server_head.py", line 31, in setup_static_dir
    raise dashboard_utils.FrontendNotFoundError(
ray.dashboard.utils.FrontendNotFoundError: [Errno 2] Dashboard b

Trial name,status,loc
PPO_MultiAgentArena_4de8f_00000,ERROR,

Trial name,# failures,error file
PPO_MultiAgentArena_4de8f_00000,1,/Users/christy/Documents/github_ray_summit_2022/ray-summit-2022-training/ray-rllib/multiagent_PPO_logs/PPO/PPO_MultiAgentArena_4de8f_00000_0_2022-07-10_21-59-54/error.txt


(PPO pid=84310) 2022-07-10 21:59:58,298	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2022-07-10 22:00:01,980	ERROR trial_runner.py:920 -- Trial PPO_MultiAgentArena_4de8f_00000: Error processing event.
NoneType: None


The trial PPO_MultiAgentArena_4de8f_00000 errored with parameters={'extra_python_environs_for_driver': {}, 'extra_python_environs_for_worker': {}, 'num_gpus': 0, 'num_cpus_per_worker': 1, 'num_gpus_per_worker': 0, '_fake_gpus': False, 'custom_resources_per_worker': {}, 'placement_strategy': 'PACK', 'eager_tracing': False, 'eager_max_retraces': 20, 'tf_session_args': {'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2, 'gpu_options': {'allow_growth': True}, 'log_device_placement': False, 'device_count': {'CPU': 1}, 'allow_soft_placement': True}, 'local_tf_session_args': {'intra_op_parallelism_threads': 8, 'inter_op_parallelism_threads': 8}, 'env': <class '__main__.MultiAgentArena'>, 'env_config': {}, 'observation_space': None, 'action_space': None, 'env_task_fn': None, 'render_env': False, 'clip_rewards': None, 'normalize_actions': True, 'clip_actions': False, 'disable_env_checking': False, 'num_workers': 4, 'num_envs_per_worker': 1, 'sample_collector': <class 'ray.rll

(PPO pid=84310) 2022-07-10 22:00:01,971	ERROR worker.py:749 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=84310, ip=127.0.0.1, repr=PPO)
(PPO pid=84310)   File "/Users/christy/Documents/ray/python/ray/rllib/evaluation/worker_set.py", line 127, in __init__
(PPO pid=84310)     self.add_workers(
(PPO pid=84310)   File "/Users/christy/Documents/ray/python/ray/rllib/evaluation/worker_set.py", line 270, in add_workers
(PPO pid=84310)     self.foreach_worker_with_index(
(PPO pid=84310)   File "/Users/christy/Documents/ray/python/ray/rllib/evaluation/worker_set.py", line 405, in foreach_worker_with_index
(PPO pid=84310)     remote_results = ray.get(
(PPO pid=84310) ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=

TuneError: ('Trials did not complete', [PPO_MultiAgentArena_4de8f_00000])
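
The `stop` dictionary passed to `tune.run` above ends a trial as soon as any listed metric reaches its threshold. As a rough mental model (not Tune's actual implementation), that check can be sketched as follows. `should_stop` and the sample result dicts are illustrative only.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` has reached its threshold."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 20, "timesteps_total": 100000}

# Neither threshold reached yet.
print(should_stop({"training_iteration": 5, "timesteps_total": 20000}, stop))
# The iteration limit is reached, so the trial would stop.
print(should_stop({"training_iteration": 20, "timesteps_total": 80000}, stop))
```

Listing several metrics therefore means "whichever occurs first", as the comment in the training cell says.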

### Exercises

1. 
2. 

### Homework

1. 

 