# Exercise 02. Create a Custom Multi-Agent RLlib Environment

© 2019-2022, Anyscale. All Rights Reserved

### Learning objectives

In this tutorial, you will learn how to:
 * [Code a custom RLlib Multi-agent environment](#multi_agent_env)
 * [Select an algorithm and instantiate a config object using that algorithm's config class](#rllib_algo)
 * [Train an RL model using a multi-agent algorithm from RLlib](#rllib_run)



In [1]:
# import required packages

import time
import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
from ipywidgets import Output
from IPython import display

import ray
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray import tune
from ray.tune.logger import pretty_print

print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


## Code a custom RLlib Multi-agent environment <a class="anchor" id="multi_agent_env"></a>

We will create the following (adversarial) multi-agent environment.  We will use this custom environment for the rest of this tutorial.

<img src="images/custom_environment.png" width="65%">

### Review OpenAI Gym Environments

In the last lesson we covered OpenAI Gym environments, specifically:
<ul>
    <li>Action Space</li>
    <li>Observation Space</li>
    <li>Rewards</li>
    <li><i>`done`</i> signal</li>
    <li>Gym Environment API methods:</li>
    <ul>
        <li>reset(self)</li>
        <li>step(self, action: dict)</li>
        <li>render(self, mode=None)</li>
    </ul>
    </ul>

### RLlib MultiAgentEnv API

RLlib supports environments created using the OpenAI Gym API. This means that, to create an RLlib environment from scratch, you need to implement at least the same methods [required by Gym](https://www.gymlibrary.ml/content/api/#standard-methods).

We want a multi-agent environment, so we will subclass the [RLlib MultiAgentEnv base class](https://github.com/ray-project/ray/blob/master/rllib/env/multi_agent_env.py). In addition to the minimal Gym methods, our class defines a constructor and two private helper methods:
    
<ul>
    <li>__init__(self)</li>
    <li> _get_obs(self)</li>
    <li>_move(self, coords, action, which_agent)</li>
    </ul>
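
Before the full implementation, it may help to see the dict-based convention a multi-agent `reset()` and `step()` follow: every returned value is keyed by agent ID, and the `dones` dict carries an extra `"__all__"` key. The sketch below is a hypothetical plain-Python stub (`TinyTwoAgentEnv` is not part of RLlib and does not inherit from `MultiAgentEnv`); its observations and rewards are placeholders chosen only for illustration.

```python
class TinyTwoAgentEnv:
    """Hypothetical stand-in that only mimics the multi-agent return format."""

    def __init__(self, timestep_limit=3):
        self.timestep_limit = timestep_limit
        self.timesteps = 0

    def reset(self):
        self.timesteps = 0
        # One observation per agent, keyed by agent ID.
        return {"agent1": 0, "agent2": 0}

    def step(self, action):
        # `action` is a dict such as {"agent1": 2, "agent2": 0}.
        self.timesteps += 1
        is_done = self.timesteps >= self.timestep_limit
        # Placeholder values: the timestep as observation, zero reward.
        obs = {agent_id: self.timesteps for agent_id in action}
        rewards = {agent_id: 0.0 for agent_id in action}
        dones = {agent_id: is_done for agent_id in action}
        # Special key: True once the episode is over for all agents.
        dones["__all__"] = is_done
        return obs, rewards, dones, {}


env_stub = TinyTwoAgentEnv()
first_obs = env_stub.reset()
stub_obs, stub_rewards, stub_dones, stub_infos = env_stub.step({"agent1": 2, "agent2": 0})
print(stub_obs, stub_rewards, stub_dones)
```

The `MultiAgentArena` class below returns the same four-part structure from `step()`, with real positions and rewards in place of the placeholders.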

In [2]:
# Let's code our multi-agent environment

class MultiAgentArena(MultiAgentEnv):
    
    def __init__(self, config=None):
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()
        
        # For rendering.
        self.out = None
        if config.get("render"):
            self.out = Output()
            display.display(self.out)

        self._spaces_in_preferred_format = False

    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords.
        self.agent1_pos = [0, 0]  # upper left corner
        self.agent2_pos = [self.height - 1, self.width - 1]  # lower right corner

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

        # How many timesteps have we done in this episode.
        self.timesteps = 0

        # Did we have a collision in recent step?
        self.collision = False
        # How many collisions in total have we had in this episode?
        self.num_collisions = 0

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.
        
        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """
        
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Agent2 always moves first.
        # events = [collision|agent1_new_field]
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Useful for rendering.
        self.collision = "collision" in events
        if self.collision is True:
            self.num_collisions += 1
            
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2
        
        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of [own discrete pos, other agent's
        discrete pos]) using each agent's current row/col position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/col) using the
        given action (0=up, 1=right, 2=down, 3=left) and returns the resulting set of events:
        {"collision"} when the move was blocked by the other agent (agent1 then gets -1.0),
        {"agent1_new_field"} when agent1 enters a field it has not visited yet,
        or an empty set otherwise.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "new" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):

        if self.out is not None:
            self.out.clear_output(wait=True)

        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f} ({} collisions)".format(self.agent2_R, self.num_collisions))
        print()
        time.sleep(0.25)
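
The observation built in `_get_obs()` flattens each `[row, col]` position into a single integer in row-major order. The following is a minimal sketch of that encoding and its inverse; the helper names `encode_pos` and `decode_pos` are introduced here for illustration and are not part of the class above. For in-range columns, the `% self.width` in the original has no effect, so it is omitted.

```python
def encode_pos(pos, width):
    """Map a [row, col] pair to a single integer (row-major order)."""
    row, col = pos
    return row * width + col


def decode_pos(index, width):
    """Inverse of encode_pos: recover [row, col] from the flat index."""
    row, col = divmod(index, width)
    return [row, col]


grid_width = 10  # default grid width used by MultiAgentArena

print(encode_pos([0, 0], grid_width))  # 0  (agent1's start, upper left)
print(encode_pos([9, 9], grid_width))  # 99 (agent2's start, lower right)
print(decode_pos(99, grid_width))      # [9, 9]
```

With the default 10x10 grid, each entry of an observation therefore lies in the range 0 to 99, which matches the `MultiDiscrete([100, 100])` observation space.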


<br>
In the cell below:
<ul>
    <li>Initialize the environment</li>
    <li>Make both agents take a few steps</li>
    <li>Render the environment after each agent takes a step.</li>
    </ul>


In [5]:
env = MultiAgentArena(config={"render": True})
obs = env.reset()

with env.out:
    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()


print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))

Output()

Agent1's x/y position=[2, 2]
Agent2's x/y position=[7, 7]
Env timesteps=4
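
To see which rewards these four steps produce, the reward rule from `step()` can be isolated into a small function. This is a sketch for illustration only: `rewards_for` is a hypothetical helper that copies the two conditional expressions from the class above. In the four steps shown, agent1 enters a new field each time and the agents never meet.

```python
def rewards_for(events):
    """Return (agent1 reward, agent2 reward) for one step's set of events."""
    r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
    r2 = 1.0 if "collision" in events else -0.1
    return r1, r2


print(rewards_for({"agent1_new_field"}))  # (1.0, -0.1)
print(rewards_for({"collision"}))         # (-1.0, 1.0)
print(rewards_for(set()))                 # (-0.5, -0.1)

# Replay the four steps above: a new field for agent1 each time, no collisions.
step_events = [{"agent1_new_field"}] * 4
total_r1 = sum(rewards_for(events)[0] for events in step_events)
total_r2 = sum(rewards_for(events)[1] for events in step_events)
print(total_r1, round(total_r2, 1))  # 4.0 -0.4
```

These totals correspond to the `R1` and `R2` values that `render()` prints after the fourth step.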


## Select an algorithm and instantiate a config object using that algorithm's config class <a class="anchor" id="rllib_algo"></a>

<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">Open RLlib docs</a></li>
    <li>Scroll down and click the link for the algorithm you are looking for, e.g. <i><b>PPO</b></i></li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algo docs page</a>, click on the link <i><b>Implementation</b></i>.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algo code file on GitHub</a>.</li>
    <li>Search the github code file for the word <i><b>config</b></i></li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, then </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    <li>Scroll down to the config <b>__init__()</b> method</li>
    <ol>
            <li>Algorithm default hyperparameter values are here.</li>
    </ol>
    </ol>

In [6]:
# Config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.ppo import PPOConfig

# Uncomment the line below to see the long list of PPO-specific default config values
# print(pretty_print(PPOConfig().to_dict()))

config = PPOConfig()
config.environment(env=MultiAgentArena)
config.rollouts(num_rollout_workers=4, num_envs_per_worker=1)
config.training(lr=0.00005, train_batch_size=4000)  # default values for this algorithm
config.multi_agent(
    policies=["policy1", "policy2"],
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)
# Set the log level to DEBUG, INFO, WARN, or ERROR 
config.debugging(log_level="ERROR")


<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7fee73910460>
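
The lambda passed as `policy_mapping_fn` decides which policy controls which agent. A minimal sketch of the same mapping as a named function follows, so it can be called and inspected on its own. The name `map_agent_to_policy` and the `None` defaults for `episode` and `worker` are assumptions made here so the function can be called standalone; RLlib supplies those arguments itself during rollouts.

```python
def map_agent_to_policy(agent_id, episode=None, worker=None, **kwargs):
    """Same rule as the lambda above: agent1 -> policy1, anything else -> policy2."""
    return "policy1" if agent_id == "agent1" else "policy2"


for agent_id in ("agent1", "agent2"):
    print(agent_id, "->", map_agent_to_policy(agent_id))
```

A named function like this could be passed as `policy_mapping_fn=map_agent_to_policy` in place of the lambda; with two policies, each agent is then trained by its own policy.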

## Train an RL model using a multi-agent algorithm from RLlib <a class="anchor" id="rllib_run"></a>

In [7]:
# To start fresh, restart Ray in case it is already running
if ray.is_initialized():
    ray.shutdown()

ppo = config.build()

for _ in range(10):
    result = ppo.train()
    print(result["episode_reward_mean"])

print("Training completed.")


2022-07-15 11:27:31,373 INFO services.py:1477 -- View the Ray dashboard at http://127.0.0.1:8265
2022-07-15 11:27:44,938	INFO trainable.py:160 -- Trainable.setup took 15.805 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


-9.187500000000005
-4.755000000000001
-2.234999999999991
-1.0949999999999867
0.13500000000001325
0.5760000000000118
0.936000000000009
1.7310000000000076
3.1410000000000036
5.7899999999999965
Training completed.


In [8]:
# To stop the Algorithm and release its blocked resources, use:
ppo.stop()


### Homework

 <img src="images/exercise_env_loop.png" width=500>
 
In an earlier cell, we called `reset()` once and `step()` a few times. To walk through an entire episode, one would normally call `step()` repeatedly (with different actions) until the returned `dones` dict has the `"agent1"` or `"agent2"` (or `"__all__"`) key set to True.
Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow these steps:

1. `reset` the already created (variable `env`) environment to get the first (initial) observation.
1. Enter an infinite while loop.
1. Compute the actions for "agent1" and "agent2" by calling `DummyTrainer.compute_action([obs])` twice (once for each agent).
1. Put the results of the action computations into an action dict (`{"agent1": ..., "agent2": ...}`).
1. Pass this action dict into the env's `step()` method, just as in the earlier cell where we called `step()` directly.
1. Check the returned `dones` dict for True (yes, episode is terminated) and if True, break out of the loop.

**Good luck! :)**
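
The third step refers to a `DummyTrainer`, which this notebook does not define. If you do not already have one from an earlier lesson, the following is a minimal stand-in, assuming it simply ignores the observation and returns a random action. The class body and the parameter name `single_agent_obs` are assumptions made here, not the original implementation.

```python
import random


class DummyTrainer:
    """Hypothetical stand-in: ignores the observation and picks a random move."""

    def compute_action(self, single_agent_obs=None):
        # 0=up, 1=right, 2=down, 3=left (matches Discrete(4) in MultiAgentArena).
        return random.randrange(4)


dummy_trainer = DummyTrainer()
sample_action = dummy_trainer.compute_action([0, 99])
print(sample_action)
```

With `dummy_trainer` defined this way, each call returns an integer that can go straight into the action dict for one agent.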


Write your solution code into this following python cell here:

In [None]:
import time
from ipywidgets import Output
from IPython import display


# Leave the following as-is. It'll help us with rendering the env in this very cell's output.
out = Output()
display.display(out)

with out:

    # Start coding here inside this `with`-block:
    # 1) Reset the env ...

    # 2) Enter an infinite while loop (to step through the episode) ...

        # 3) Calculate both agents' actions individually, using dummy_trainer.compute_action([individual agent's obs])...

        # 4) Compile the actions dict from both individual agents' actions ...

        # 5) Send the actions dict to the env's `step()` method to receive: obs, rewards, dones, info dicts ...

        # 6) We'll do this together: Render the env.
        # Don't write any code here (skip directly to 7).
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()

        # 7) Check whether the episode is done (take a look at the
        # `dones` dict returned from `step()`)
        # If yes, break out of the while loop we entered in step 2).


# 8) Run it! :)

