## RLlib for multi-agent RL

In [9]:
# HIDDEN
import gym
import numpy as np
import matplotlib.pyplot as plt
# https://github.com/sven1977/rllib_tutorials/blob/main/ray_summit_2021/tutorial_notebook.ipynb
# https://github.com/anyscale/ray-summit-2022-training/blob/main/ray-rllib/ex_02_create_multiagent_rllib_env.ipynb

#### Multi-agent arena

![](img/multi-agent-arena.png)

Notes:

We have two agents, agent 1 and agent 2. They share the same action space and observation space but, critically, have different reward functions. Agent 1 gets a positive reward when it moves onto a square it has not visited before and a negative reward when it collides with agent 2. Agent 2 gets a positive reward when it collides with agent 1. In a way this is a game of tag: agent 2 tries to catch agent 1, while agent 1 also has the additional goal of exploring new territory rather than just running away. Since the field is 8x8, there are 64 squares. Our observation space is `MultiDiscrete([64, 64])` because it contains the location of agent 1 (discrete 64) and the location of agent 2 (also discrete 64).
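The observation encoding and the split reward functions described above can be sketched in plain Python. This is a minimal illustration, not the code in `envs.MultiAgentArena`: the helper names (`encode_position`, `make_observation`, `step_rewards`) and the reward magnitudes are assumptions chosen for the example.

```python
WIDTH, HEIGHT = 8, 8  # 8x8 field -> 64 squares


def encode_position(row, col, width=WIDTH):
    """Flatten a (row, col) grid position into one index in [0, width * height)."""
    return row * width + col


def make_observation(agent1_pos, agent2_pos):
    """Joint observation: one discrete index per agent, i.e. a point in MultiDiscrete([64, 64])."""
    return [encode_position(*agent1_pos), encode_position(*agent2_pos)]


def step_rewards(collided, agent1_on_new_square):
    """Return per-agent rewards for one timestep, keyed by agent ID.

    The magnitudes below are illustrative assumptions, not the
    environment's actual values.
    """
    reward1 = 0.0
    reward2 = 0.0
    if collided:
        reward1 -= 1.0  # agent 1 is penalized for being caught
        reward2 += 1.0  # agent 2 is rewarded for catching agent 1
    elif agent1_on_new_square:
        reward1 += 1.0  # agent 1 is rewarded for exploring
    return {"agent1": reward1, "agent2": reward2}


obs = make_observation(agent1_pos=(0, 0), agent2_pos=(7, 7))
print(obs)  # [0, 63]
print(step_rewards(collided=False, agent1_on_new_square=True))
```

Both agents see the same joint observation; only the reward dict differs per agent, which is what makes their learned behavior diverge.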

#### Some code

In [5]:
from envs import MultiAgentArena

In [6]:
env = MultiAgentArena(config={"render": True})
obs = env.reset()

with env.out:
    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves right, Agent2 moves left.
    obs, rewards, dones, infos = env.step(action={"agent1": 1, "agent2": 3})
    env.render()

    # Agent1 moves down, Agent2 moves up.
    obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})
    env.render()


print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))

____________
|.         |
|...       |
|  1       |
|          |
|          |
|          |
|          |
|       2  |
|          |
|          |
‾‾‾‾‾‾‾‾‾‾‾‾

R1= 4.0
R2=-0.4 (0 collisions)

Agent1's x/y position=[2, 2]
Agent2's x/y position=[7, 7]
Env timesteps=4


In [8]:
from ray.rllib.agents.ppo import PPOTrainer

config = {
    # RLlib passes `env_config` to the environment's constructor.
    "env_config": {
        "config": {
            "width": 10,
            "height": 10,
            "ts": 100,
        },
    },

    "framework": "torch",
    "create_env_on_driver": True,
    "seed": 0,
}

trainer = PPOTrainer(config=config, env=MultiAgentArena)

AttributeError: 'MultiAgentArena' object has no attribute '_agent_ids'

(RolloutWorker pid=37933) 2022-07-19 14:23:10,606 ERROR worker.py:430 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=37933, ip=127.0.0.1, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x160addc40>)
(RolloutWorker pid=37933)   File "/Users/mike/miniconda3/envs/rl-course-dev-2/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 636, in __init__
(RolloutWorker pid=37933)     self.async_env: BaseEnv = convert_to_base_env(
(RolloutWorker pid=37933)   File "/Users/mike/miniconda3/envs/rl-course-dev-2/lib/python3.8/site-packages/ray/rllib/env/base_env.py", line 732, in convert_to_base_env
(RolloutWorker pid=37933)     return env.to_base_env(
(RolloutWorker pid=37933)   File "/Users/mike/miniconda3/envs/rl-course-dev-2/lib/python3.8/site-packages/ray/rllib/env/multi_agent_env.p

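One plausible cause of this `AttributeError`, offered as an assumption because the source of `envs.MultiAgentArena` is not shown here: the environment defines its own `__init__` without calling the base class constructor, so an attribute the base class would normally create (here `_agent_ids`) never exists. The sketch below reproduces that mechanism with a hypothetical stand-in base class; `BaseMultiAgentEnv`, `BrokenArena`, and `FixedArena` are illustrative names, not RLlib classes.

```python
class BaseMultiAgentEnv:
    """Hypothetical stand-in for a multi-agent base class (not RLlib's code)."""

    def __init__(self):
        # The base constructor creates the attribute if the subclass did not.
        if not hasattr(self, "_agent_ids"):
            self._agent_ids = set()

    def get_agent_ids(self):
        return self._agent_ids


class BrokenArena(BaseMultiAgentEnv):
    def __init__(self, config=None):
        # Never calls super().__init__(), so `_agent_ids` is never set.
        self.config = config or {}


class FixedArena(BaseMultiAgentEnv):
    def __init__(self, config=None):
        self.config = config or {}
        # Declare the agent IDs, then run the base constructor.
        self._agent_ids = {"agent1", "agent2"}
        super().__init__()


try:
    BrokenArena().get_agent_ids()
except AttributeError as err:
    print(err)

print(sorted(FixedArena().get_agent_ids()))  # ['agent1', 'agent2']
```

If the same pattern applies to `MultiAgentArena`, adding a `super().__init__()` call and setting `self._agent_ids` in its constructor would be the first thing to try.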
#### Multi-policy

![](img/from_single_agent_to_multi_agent.png)
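As a rough sketch of what a multi-policy setup could look like for this two-agent arena: each agent ID is mapped to its own policy ID through a mapping function. The `multiagent`, `policies`, and `policy_mapping_fn` keys follow RLlib's dict-based trainer config and should be checked against the installed Ray version; the policy names `policy1` and `policy2` are assumptions for illustration. The snippet only builds the config dict and does not start Ray.

```python
def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
    """Map each agent ID to the ID of the policy that controls it."""
    return "policy1" if agent_id == "agent1" else "policy2"


multiagent_config = {
    "multiagent": {
        # Two separately trained policies (names are illustrative).
        "policies": {"policy1", "policy2"},
        "policy_mapping_fn": policy_mapping_fn,
    },
}

for agent_id in ("agent1", "agent2"):
    print(agent_id, "->", policy_mapping_fn(agent_id))
```

Giving each agent its own policy fits this environment because the two agents optimize different reward functions.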

#### Let's apply what we learned!

## RLlib trainer methods
<!-- multiple choice -->

_Which of the following most accurately describes the role of `trainer.train()` in RLlib?_

- [ ] It neither collects a data set of episodes nor learns a policy. | Are you sure?
- [ ] It learns a policy from a fixed data set of episodes. | Remember, calling train() causes the agent to play through episodes.
- [ ] It creates a data set of episodes but does not learn a policy. | Remember, calling train() learns a policy.
- [x] It simultaneously collects a data set of episodes and also learns a policy. | You got it!