# Notebook 02. Create a custom multi-agent RLlib environment

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb) <br>
➡️ [Next notebook](./ex_03_train_tune_rllib_model.ipynb) <br>
⬅️ [Previous notebook](./ex_01_intro_gym_and_rllib.ipynb) <br>

### Learning objectives

In this notebook, you will learn how to:
 * [Code a custom RLlib multi-agent environment](#multi_agent_env)
 * [Select an algorithm and instantiate a config object using that algorithm's config class](#rllib_algo)
 * [Train an RL model using a multi-agent capable algorithm from RLlib](#rllib_run)



In [None]:
# import required packages

import time
import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
from ipywidgets import Output
from IPython import display

import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray import tune
from ray.tune.logger import pretty_print

print(f"gym: {gym.__version__}")
print(f"ray: {ray.__version__}")

# !ale-import-roms --import-from-pkg atari_py.atari_roms

gym: 0.21.0
ray: 3.0.0.dev0


## Code a custom RLlib multi-agent environment <a class="anchor" id="multi_agent_env"></a>

### Review OpenAI Gym Environments

We learned in the last lesson about OpenAI Gym Environments.  Specifically we covered:
<ul>
    <li>Action Spaces</li>
    <li>Observation Spaces</li>
    <li>Rewards</li>
    <li><i>`done`</i> signal</li>
    <li>Important gym.Env API methods:
    <ul>
        <li>reset(self)</li>
        <li>step(self, action)</li>
    </ul>
    </li>
</ul>
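
As a quick refresher, the `reset()`/`step()` contract can be sketched with a toy environment. The `CountdownEnv` class and the `run_episode` helper below are hypothetical and deliberately avoid importing gym; they only mimic the `(obs, reward, done, info)` tuple that `step()` returns in the gym version used here.

```python
import random


class CountdownEnv:
    """Hypothetical toy env mimicking the gym.Env reset()/step() contract."""

    def __init__(self, start=5):
        self.start = start
        self.actions = [0, 1]  # Stand-in for a Discrete(2) action space.
        self.remaining = start

    def reset(self):
        # Begin a new episode and return the initial observation.
        self.remaining = self.start
        return self.remaining

    def step(self, action):
        # Action 1 counts down by one; action 0 does nothing.
        if action == 1:
            self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.remaining == 0
        return self.remaining, reward, done, {}  # obs, reward, done, info


def run_episode(env, seed=0):
    """Steps through one episode with random actions; returns the total reward."""
    rng = random.Random(seed)
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = rng.choice(env.actions)
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


print(run_episode(CountdownEnv(start=5)))  # prints 5.0
```

The loop calls `reset()` once and then keeps calling `step()` until `done` is True, which is the same pattern we will extend to per-agent dictionaries below.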

### Our example multi-agent Env

We will create the following (adversarial) multi-agent environment and will run RLlib experiments on this environment in the notebooks to come.

<img src="images/multi_agent_arena_0.png" width=800 />
<hr />
<img src="images/multi_agent_arena_1.png" width=800 />
<hr />
<img src="images/multi_agent_arena_2.png" width=800 />
<hr />
<img src="images/multi_agent_arena_3.png" width=800 />



### RLlib MultiAgentEnv API

For the single-agent case, RLlib supports the gym.Env API. Thus, if you would like to provide a custom environment, you only have to hand RLlib your own gym.Env sub-class. Override the [standard methods required by Gym](https://www.gymlibrary.ml/content/api/#standard-methods) and all is well.

However, gym.Env does not support multi-agent scenarios. So if your problem requires several agents interacting with each other in the environment, you will have to use RLlib's gym.Env sub-class, `MultiAgentEnv`, which allows you to
code custom multi-agent problems.

We will now implement a child class of `MultiAgentEnv` and code our game logic into this new class.

In the following code, we will override the methods:
<ul>
<li>__init__(self, config=None)</li>
<li>reset(self)</li>
<li>step(self, action)</li>
</ul>


Our `MultiAgentArena` class will also come with some more utility methods, which we will not discuss here (out of scope).

In [None]:
# Let's code our multi-agent environment:

# We will write the code together for all segments tagged with [!LIVE CODING!].
# These will be - in particular - the __init__(), reset(), and step() methods.


class MultiAgentArena(MultiAgentEnv):  # MultiAgentEnv is a gym.Env sub-class
    
    def __init__(self, config=None):

        # [!LIVE CODING!]
        
        # Create a default env config.        
        
        # Store the dimensions of the grid.
        
        # End an episode after this many timesteps (return done=True).
        
        # Define our observation space (per-agent!).
        
        # Define our action space (per-agent!).
        # 0=up, 1=right, 2=down, 3=left.

        # Reset env.
        
        # END: [!LIVE CODING!]


        # For rendering.
        self.out = None
        if config.get("render"):
            self.out = Output()
            display.display(self.out)
        self._spaces_in_preferred_format = False

    def reset(self):
        """Resets all state and returns the initial observation of the new episode."""


        # [!LIVE CODING!]

        # Store each agent's current position as row-major coords.
        
        # Each agent's accumulated rewards in this episode.
        
        # Reset agent1's visited fields.
        
        # How many timesteps have we done in this episode.
                
        # END: [!LIVE CODING!]


        # Did we have a collision in the most recent step?
        self.collision = False

        # How many collisions in total have we had in this episode?
        self.num_collisions = 0

        # Return the initial observation in the new episode.
        return self._get_obs()


    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.
        
        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """

        # [!LIVE CODING!]

        # Increase our time steps counter by 1.
        
        # An episode is "done" when we reach the time step limit.
        
        # Agent2 always moves first.
        # events = [collision|agent1_new_field]
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Determine rewards based on the collected events:

        # Generate a `done` dict (per-agent and total).

        # END: [!LIVE CODING!]


        # Useful for rendering.
        self.collision = "collision" in events
        if self.collision is True:
            self.num_collisions += 1    

        return self._get_obs(), rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to discrete-pos tuple) using each
        agent's current x/y-positions.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/col, modified
        in place) using the given action (0=up, 1=right, 2=down, 3=left) and returns the
        resulting set of events:
        {"collision"} if the move was blocked by the other agent (the agent stays in place;
        when agent2 bumps into agent1, agent1 then gets -1.0),
        {"agent1_new_field"} if agent1 entered a field it had not visited before, else an empty set.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "agent1_new_field" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}

        # No new tile for agent1.
        return set()

    def render(self, mode=None):

        if self.out is not None:
            self.out.clear_output(wait=True)

        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f} ({} collisions)".format(self.agent2_R, self.num_collisions))
        print()
        time.sleep(0.25)
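
For readers who want to check their live-coded version, here is one possible way to fill in the `[!LIVE CODING!]` segments. It is a sketch, not the reference solution: `ArenaSketch` is a hypothetical plain-Python class (no RLlib, gym, or rendering), so the observation and action spaces are left out. The 10x10 grid and the corner start positions are inferred from the positions printed by the test cell further down. The `"width"`, `"height"`, and `"ts"` config keys, the 100-step limit, and all reward values except agent1's -1.0 collision penalty (mentioned in the `_move()` docstring) are assumptions.

```python
class ArenaSketch:
    """Plain-Python sketch of MultiAgentArena's core logic (hypothetical, no RLlib)."""

    def __init__(self, config=None):
        config = config or {}  # Also works when no config is passed.
        self.width = config.get("width", 10)  # assumed key and default
        self.height = config.get("height", 10)  # assumed key and default
        self.timestep_limit = config.get("ts", 100)  # assumed key and default
        self.reset()

    def reset(self):
        # Row-major [row, col] positions: opposite corners of the grid.
        self.agent1_pos = [0, 0]
        self.agent2_pos = [self.height - 1, self.width - 1]
        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0
        # Fields agent1 has already covered (its start field counts).
        self.agent1_visited_fields = {tuple(self.agent1_pos)}
        self.timesteps = 0
        return self._get_obs()

    def step(self, action):
        self.timesteps += 1
        is_done = self.timesteps >= self.timestep_limit
        # Agent2 always moves first.
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)
        # Assumed reward scheme: agent1 is rewarded for exploring, agent2 for colliding.
        if "collision" in events:
            r1, r2 = -1.0, 1.0
        elif "agent1_new_field" in events:
            r1, r2 = 1.0, -0.1
        else:
            r1, r2 = -0.5, -0.1
        self.agent1_R += r1
        self.agent2_R += r2
        rewards = {"agent1": r1, "agent2": r2}
        dones = {"agent1": is_done, "agent2": is_done, "__all__": is_done}
        return self._get_obs(), rewards, dones, {}

    def _get_obs(self):
        # Each agent observes (own position, other's position) as flat indices.
        p1 = self.agent1_pos[0] * self.width + self.agent1_pos[1]
        p2 = self.agent2_pos[0] * self.width + self.agent2_pos[1]
        return {"agent1": (p1, p2), "agent2": (p2, p1)}

    def _move(self, coords, action, is_agent1):
        # Same logic as MultiAgentArena._move(), condensed.
        orig = coords[:]
        coords[0] += {0: -1, 2: 1}.get(action, 0)
        coords[1] += {1: 1, 3: -1}.get(action, 0)
        other = self.agent2_pos if is_agent1 else self.agent1_pos
        if coords == other:
            coords[:] = orig  # Blocked by the other agent: stay in place.
            return {"collision"}
        coords[0] = min(max(coords[0], 0), self.height - 1)
        coords[1] = min(max(coords[1], 0), self.width - 1)
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        return set()


sketch_env = ArenaSketch()
# Replay the six joint actions used in the notebook's test cell.
for a1, a2 in [(2, 0), (1, 3), (1, 3), (2, 0), (3, 0), (2, 0)]:
    obs, rewards, dones, infos = sketch_env.step({"agent1": a1, "agent2": a2})
print(sketch_env.agent1_pos, sketch_env.agent2_pos, sketch_env.timesteps)
# prints: [3, 1] [5, 7] 6
```

Note that observations, rewards, and dones are all dictionaries keyed by agent ID, and that `dones` carries the additional `"__all__"` key to signal the end of the episode for every agent.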


<br>
In the cell below:
<ul>
    <li>Initialize the environment</li>
    <li>Make both agents take a few steps</li>
    <li>Render the environment after each agent takes a step.</li>
    </ul>


In [7]:
env = MultiAgentArena(config={"render": True})
obs = env.reset()

with env.out:
    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()

    # Agent1 moves left, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 3, "agent2": 0})
    env.render()

    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()


print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))

Output()

Agent1's x/y position=[3, 1]
Agent2's x/y position=[5, 7]
Env timesteps=6


## Select an algorithm and instantiate a config object using that algorithm's config class <a class="anchor" id="rllib_algo"></a>

<ol>
    <li><a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">Open the RLlib docs</a>.</li>
    <li>Scroll down and click the link for the algorithm you are looking for, e.g. <i><b>PPO</b></i>.</li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#ppo">algorithm's docs page</a>, click the <i><b>Implementation</b></i> link. This opens the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py">algorithm's code file on GitHub</a>.</li>
    <li>Search the code file for the word <i><b>config</b></i>.</li>
    <li>Typically, the docstring example shows:
    <ol>
        <li>Example code using the RLlib API, then</li>
        <li>Example code using the Ray Tune API.</li>
    </ol>
    </li>
    <li>Scroll down to the config class's <b>__init__()</b> method.
    <ol>
        <li>The algorithm's default hyperparameter values are set here.</li>
    </ol>
    </li>
</ol>

In [8]:
# Since Ray 1.13, the config is an object rather than a dictionary.
config = PPOConfig()

# uncomment below to see the long list of specifically PPO default config values
# print(pretty_print(PPOConfig().to_dict()))

# Point the PPO to our new environment class.
config.environment(env=MultiAgentArena)
# Specify sampling behavior (use 4 workers to collect data in parallel, each one running through 1 environment copy).
config.rollouts(num_rollout_workers=4, num_envs_per_worker=1)
# Specify some training-related parameters.
config.training(lr=0.00005, train_batch_size=4000)  # Default values for this algorithm.


<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7fc0d6fed910>

#### How do we tell RLlib that we would like to train different behaviors for our 2 agents?
<img src="images/multi_agent_setup.png" width="70%"></img>

In [9]:
# Setup multi-agent mapping:

# The environment provides M agent IDs.
# RLlib has N policies (neural networks).
# The `policy_mapping_fn` maps each of the M agent IDs to one of the N policies
# (several agent IDs may share the same policy).

# If you don't provide a policy_mapping_fn, all agent IDs will map to "default_policy".
config.multi_agent(
    # Tell RLlib to create 2 policies with these IDs here:
    policies=["policy1", "policy2"],
    # Tell RLlib to map agent1 to policy1 and agent2 to policy2.
    policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "policy1" if agent_id == "agent1" else "policy2",
)

<ray.rllib.algorithms.ppo.ppo.PPOConfig at 0x7fc0d6fed910>
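
To make the mapping concrete, here is a plain-Python sketch of what the `policy_mapping_fn` above returns for each agent ID. It does not use RLlib; `map_agent_to_policy` is a hypothetical named equivalent of the lambda, and the `None` defaults for `episode` and `worker` are an assumption made only so the function can be called in isolation.

```python
def map_agent_to_policy(agent_id, episode=None, worker=None, **kwargs):
    """Named equivalent of the lambda: agent1 -> policy1, any other agent -> policy2."""
    return "policy1" if agent_id == "agent1" else "policy2"


# The two agent IDs our environment uses as dict keys.
agent_ids = ["agent1", "agent2"]
mapping = {agent_id: map_agent_to_policy(agent_id) for agent_id in agent_ids}
print(mapping)  # {'agent1': 'policy1', 'agent2': 'policy2'}
```

Because each agent gets its own policy, the two agents can learn different (here: opposing) behaviors. Returning the same policy ID for both agent IDs would instead make them share one network.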

## Train an RL model using a multi-agent algorithm from RLlib <a class="anchor" id="rllib_run"></a>

In [10]:
# Use the config object's `build()` method for generating
# an RLlib Algorithm instance that we can then train.
ppo = config.build()

# Train the PPO Algorithm instance.
for i in range(10):
    # Call its `train()` method.
    result = ppo.train()
    
    print(f"Sum of rewards for both agents R={result['episode_reward_mean']}")
    print(f"Agent1 R={result['policy_reward_mean']['policy1']}")
    print(f"Agent2 R={result['policy_reward_mean']['policy2']}")
    print()

print(f"Training over {i + 1} iterations completed.")


2022-07-28 16:18:32,349	INFO services.py:1477 -- View the Ray dashboard at http://127.0.0.1:8269
2022-07-28 16:18:46,411	INFO trainable.py:160 -- Trainable.setup took 16.499 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Sum of rewards for both agents R=-13.65750000000001
Agent1 R=-4.125
Agent2 R=-9.532499999999981

Sum of rewards for both agents R=-7.7475
Agent1 R=1.125
Agent2 R=-8.872499999999983

Sum of rewards for both agents R=-4.001999999999992
Agent1 R=4.535
Agent2 R=-8.536999999999985

Sum of rewards for both agents R=-0.2579999999999895
Agent1 R=7.74
Agent2 R=-7.997999999999985

Sum of rewards for both agents R=0.5130000000000075
Agent1 R=8.445
Agent2 R=-7.931999999999984

Sum of rewards for both agents R=1.1460000000000072
Agent1 R=8.825
Agent2 R=-7.678999999999985

Sum of rewards for both agents R=0.4140000000000116
Agent1 R=7.895
Agent2 R=-7.480999999999986

Sum of rewards for both agents R=1.5750000000000077
Agent1 R=8.88
Agent2 R=-7.304999999999987

Sum of rewards for both agents R=1.8690000000000087
Agent1 R=9.185
Agent2 R=-7.3159999999999865

Sum of rewards for both agents R=4.185000000000005
Agent1 R=11.49
Agent2 R=-7.304999999999986

Training over 10 iterations completed.


<div class="alert alert-block alert-success">
<b>Note</b> that in an adversarial multi-agent setup, each agent benefits from the other agent's failures and vice versa: the better one agent does, the more the other agent is harmed (receives negative rewards).
    <br/>
This highlights some important aspects of multi-agent training:
    <br/>
<ul>
<li>From each agent's perspective, the environment is not as static as in the corresponding single-agent scenario (the other agent's behavior is probably harder to predict than the environment's own inherent dynamics/physics).</li>
<li>As one agent learns how to behave more intelligently, the other agent has to counter this new behavior of its opponent and become smarter as well, and so on and so forth.</li>
    </ul>
    </div>


In [8]:
# To stop the Algorithm and release its blocked resources, use:
ppo.stop()


### Summary

In this notebook, we have learned how to:

* Write our own custom multi-agent environment using RLlib's MultiAgentEnv superclass
* Quickly test the environment using observation, reward, and action dictionaries
* Plug the new environment into an RLlib Algorithm config and set up a proper agent-ID-to-policy-ID mapping
* Run a quick (manual) training loop without using Ray Tune

### Homework

#### Run one episode of our multi-agent arena and render the episode while it plays out

 <img src="images/exercise_env_loop.png" width=500>
 
We already learned how to use an environment's `reset()` and `step()` calls to walk through a single-agent environment: call `reset()` once, keep using the returned observations to compute actions, pass these actions into consecutive calls to `step()`, and stop when `step()` returns `done=True`.

Let's do the same thing now in the multi-agent setting, using our `MultiAgentArena` class.
Remember that observations, rewards, dones, and actions are now all dictionaries mapping agent IDs
(in our case "agent1" and "agent2") to the individual agent's observation, reward, done, and action information.

Follow these instructions:

1. `reset` the already created (variable `env`) environment to get the first (initial) observations for "agent1" and "agent2".
1. Enter an infinite while loop, in which you ..
1. .. compute the actions for "agent1" and "agent2" (using random sampling).
1. .. put the results of the action computations into an action dict (`{"agent1": action1, "agent2": action2}`).
1. .. pass this action dict into the env's `step()` method.
1. .. check the returned `dones` dict and break out of the loop once the episode has terminated. The special `"__all__"` key tells you whether the episode has ended for all agents: check for `dones["__all__"] is True`.

**Good luck! :)**


Write your solution code into the following Python cell:

In [None]:
import time
from ipywidgets import Output
from IPython import display


# Leave the following as-is. It'll help us with rendering the env in this very cell's output.
out = Output()
display.display(out)

with out:

    # Start coding here inside this `with`-block:
    # 1) Reset the env ...

    # 2) Enter an infinite while loop (in order to step through one episode) ...

        # 3) Calculate both agents' actions individually, using random sampling with the 
        # action space: e.g. `a1 = env.action_space.sample()`.

        # 4) Compile the actions dict from both individual agents' actions ...

        # 5) Send the actions dict to the env's `step()` method to receive: obs, rewards, dones, info dicts ...

        # 6) We'll do this together: Render the env.
        # Don't write any code here (skip directly to 7).
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()

        # 7) Check whether the episode is done (take a look at the
        # `dones` dict returned from `step()`)
        # If dones["__all__"], break out of the while loop we entered in step 2).


# 8) Run it! :)


## References

 * [Hands-on RL with Ray’s RLlib](https://github.com/sven1977/rllib_tutorials/tree/main/ray_summit_2021)
 * [Multi-agent environments in RLlib](https://docs.ray.io/en/latest/rllib/rllib-env.html#multi-agent-and-hierarchical)
