# Hands-on RL with Ray’s RLlib (Simplified Tutorial)
<hr />

## Tutorial for working with multi-agent environments, models, and algorithms

<img src="https://drive.google.com/uc?export=view&id=1s1chO-ET7inBCKDdKgP4hI0UgTI4bLPs" width=250> <img src="https://drive.google.com/uc?export=view&id=1GGD7V_oO1osZqgKF8QzajM3_bs5o9fNw" width=169> <img src="https://drive.google.com/uc?export=view&id=1xJTlXqv182zVvDPeRc2lEg06zU0GbNrK" width=252> <img src="https://drive.google.com/uc?export=view&id=1X3eVsp3hhFzwFaeqOwwZ9DmJ0UiYfu4y" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner's tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial starts with a brief introduction to the core concepts (e.g. why RL) before proceeding to RLlib (multi- and single-agent) environments, neural network models, student exercises, Q&A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies


In [None]:
!pip install ray[rllib]
!pip install tensorflow -U  # <- or `torch`; either framework works!
!pip install matplotlib

Collecting ray[rllib]
  Downloading ray-1.8.0-cp37-cp37m-manylinux2014_x86_64.whl (54.7 MB)
Collecting redis>=3.5.0
  Downloading redis-3.5.3-py2.py3-none-any.whl (72 kB)
Collecting tensorboardX>=1.9
  Downloading tensorboardX-2.4-py2.py3-none-any.whl (124 kB)
Collecting lz4
  Downloading lz4-3.1.3-cp37-cp37m-manylinux2010_x86_64.whl (1.8 MB)
Installing collected packages: redis, tensorboardX, ray, lz4
Successfully installed lz4-3.1.3 ray-1.8.0 redis-3.5.3 tensorboardX-2.4
Collecting tensorflow
  Downloading tensorflow-2.7.0-cp37-cp37m-manylinux2010_x86_64.whl (489.6 MB)
Collecting libclang>=9.0.1
  Downloading libclang-12.0.0-py2.py3-none-manylinux1_x86_64.whl (13.4 MB)

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.

### Tutorial Outline (30-40 min)
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first (multi-agent) environment.
1. **Exercise No.1**: Environment Loop.
1. Picking an algorithm and training our first RLlib Trainer.
1. **Exercise No.2**: Fixing our experiment's config - Going multi-agent.

### Other Recommended Readings
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)

<img src="https://drive.google.com/uc?export=view&id=1mgu5vPHwTB-3uch1d43BICQoK0h9XkbO" width=400>

* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)


## Environment Setup

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate RLlib's
APIs, features, and customization options.

<img src="https://drive.google.com/uc?export=view&id=1GL5LDrrnw0rx-cYK9ucQ4drpaykz1pBd" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe which values are valid for the inputs and outputs of a neural network.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces can also be composed of other spaces (see Tuple or Dict spaces) or can simply be discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space is `Discrete(4)`
(no datatype or shape needs to be defined here). Our observation space is `MultiDiscrete([n, n])`, where n is the number of fields in the grid: the first component is the position of the observing agent and the second is the position of the opposing agent. So if agent1 starts in the upper left corner and agent2 starts in the bottom right corner of an 8 x 8 grid, agent1's observation is `[0, 63]` and agent2's observation is `[63, 0]`.

<img src="https://drive.google.com/uc?export=view&id=1zTklLKfSzK4ia054NNFMq3KLWii2QYa3" width=800>
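
To make the observation encoding concrete, here is a minimal sketch of how a (row, column) grid position can be flattened into a single discrete index and back. The helper names `encode_pos` and `decode_pos` are hypothetical and not part of RLlib or gym; they only mirror the row-major arithmetic that the environment below uses to build its observations.

```python
def encode_pos(row: int, col: int, width: int) -> int:
    """Flatten a (row, col) position into one index, row by row."""
    return row * width + col


def decode_pos(index: int, width: int) -> tuple:
    """Invert `encode_pos`: recover (row, col) from a flat index."""
    return divmod(index, width)


width = height = 8  # the 8 x 8 grid from the example above
agent1 = (0, 0)  # upper left corner
agent2 = (height - 1, width - 1)  # lower right corner

# Each agent observes [own position, opponent's position].
obs_agent1 = [encode_pos(*agent1, width), encode_pos(*agent2, width)]
obs_agent2 = [encode_pos(*agent2, width), encode_pos(*agent1, width)]

print(obs_agent1)  # [0, 63]
print(obs_agent2)  # [63, 0]
print(decode_pos(63, width))  # (7, 7)
```

With the environment's default 10 x 10 grid, the same arithmetic gives `[0, 99]` and `[99, 0]` for the two starting observations.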

In [None]:
# Let's code our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        """ Config takes in width, height, and ts """
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()

    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords.
        self.agent1_pos = [0, 0]  # upper left corner
        self.agent2_pos = [self.height - 1, self.width - 1]  # lower right corner

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

        # How many timesteps have we done in this episode.
        self.timesteps = 0

        # No collision has happened yet in this episode (read by `render()`).
        self.collision = False

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.

        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """

        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

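        # Agent2 always moves first.
        # `events` is a set that may contain "collision" and/or "agent1_new_field".
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        # Merge (rather than overwrite) so a collision caused by agent2 is not lost.
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)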

        # Useful for rendering.
        self.collision = "collision" in events

        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2

        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of discrete positions) using each
        agent's current row/col position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/col) using the
        given action (0=up, 1=right, etc.) and returns the resulting set of events:
        "collision": the move would have ended on the other agent's field, so the agent stays put.
        "agent1_new_field": agent1 entered a field it has not visited before in this episode.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "agent1_new_field" if a new tile is covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):
        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f}".format(self.agent2_R))
        print()


env = MultiAgentArena()

obs = env.reset()

# Agent1 will move down, Agent2 moves up.
obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})

env.render()

print("Agent1's row/col position={}".format(env.agent1_pos))
print("Agent2's row/col position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))


## Exercise No 1: Environment Rollout

<hr />

<img src="https://drive.google.com/uc?export=view&id=1Ta1s0QOfSCtuK0ZbmviwkI_6GcBWmXzY" width=800>

In the cell above, we performed a `reset()` and a single `step()` call. To walk through an entire episode, one would normally call `step()` repeatedly (with different actions) until the returned `dones` dict has the "agent1" or "agent2" (or "__all__") key set to True. Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow these steps:

1. `reset` the already created (variable `env`) environment to get the first (initial) observation.
1. Enter an infinite while loop.
1. Compute the actions for "agent1" and "agent2" by calling `dummy_trainer.compute_action(...)` twice (once per agent, passing in that agent's observation).
1. Put the results of the action computations into an action dict (`{"agent1": ..., "agent2": ...}`).
1. Pass this action dict into the env's `step()` method, just like it's done in the above cell (where we do a single `step()`).
1. Check the returned `dones` dict for True (yes, episode is terminated) and if True, break out of the loop.

**Good luck! :)**


In [None]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action for one of the agents,
    given the agent's observation (an array holding the agent's own discrete
    position and the opposing agent's discrete position).
    """

    def compute_action(self, single_agent_obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check whether it's working.
for _ in range(3):
    # Get action for agent1 (providing agent1's and agent2's positions).
    print("action_agent1={}".format(dummy_trainer.compute_action(np.array([0, 99]))))

    # Get action for agent2 (providing agent2's and agent1's positions).
    print("action_agent2={}".format(dummy_trainer.compute_action(np.array([99, 0]))))

    print()

In [None]:
# Leave the following as-is. It'll help us with rendering the env in this very cell's output.
import time
from ipywidgets import Output
from IPython import display

out = Output()
display.display(out)

with out:

    # Exercise #1:

    # Start coding here inside this `with`-block:
    # 1) Reset the env.

    # 2) Enter an infinite while loop (to step through the episode).

        # 3) Calculate both agents' actions individually, using dummy_trainer.compute_action([individual agent's obs])

        # 4) Compile the actions dict from both individual agents' actions.

        # 5) Send the actions dict to the env's `step()` method to receive: obs, rewards, dones, info dicts

        # 6) We'll do this together: Render the env.
        # Don't write any code here (skip directly to 7).
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()

        # 7) Check whether the episode is done; if yes, break out of the while loop.

# 8) Run it! :)
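
One possible shape for the loop in Exercise 1 is sketched below. So that the block runs on its own without RLlib, it uses a hypothetical stand-in environment (`StubTwoAgentEnv`) and a hypothetical `random_action` helper in place of `MultiAgentArena` and `dummy_trainer.compute_action`. Only the dict-based `reset()`/`step()` interface is taken from the tutorial's environment; everything else is an assumption made for illustration.

```python
import random


class StubTwoAgentEnv:
    """Hypothetical stand-in that mimics MultiAgentArena's dict-based API."""

    def __init__(self, timestep_limit=5):
        self.timestep_limit = timestep_limit
        self.timesteps = 0

    def reset(self):
        self.timesteps = 0
        return {"agent1": [0, 99], "agent2": [99, 0]}

    def step(self, action):
        self.timesteps += 1
        done = self.timesteps >= self.timestep_limit
        # The stub never moves the agents; it only counts timesteps.
        obs = {"agent1": [0, 99], "agent2": [99, 0]}
        rewards = {"agent1": 0.0, "agent2": 0.0}
        dones = {"agent1": done, "agent2": done, "__all__": done}
        return obs, rewards, dones, {}


def random_action(single_agent_obs):
    """Stand-in for `dummy_trainer.compute_action`: ignores the observation."""
    return random.randrange(4)  # 0=up, 1=right, 2=down, 3=left


stub_env = StubTwoAgentEnv(timestep_limit=5)

# 1) Reset the env to get the initial observations.
obs = stub_env.reset()
num_steps = 0
# 2) Loop until the env reports that the episode is over.
while True:
    # 3) + 4) Compute one action per agent and collect them in a dict.
    actions = {
        agent_id: random_action(agent_obs) for agent_id, agent_obs in obs.items()
    }
    # 5) Step the env with the action dict.
    obs, rewards, dones, infos = stub_env.step(actions)
    num_steps += 1
    # 7) Stop once the episode is done for all agents.
    if dones["__all__"]:
        break

print(num_steps)  # 5
```

In the notebook cell, the same loop would go inside the `with out:` block, with `env` and `dummy_trainer.compute_action(obs["agent1"])` substituted for the stand-ins and the provided rendering lines placed right after the `step()` call.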

## Training with RLlib's PPO

We will now train an RL agent with RLlib's PPO. PPO is well known in the RL community as one of the most reliable algorithms, and it works on most classes of environments.

There are many different algorithms in RLlib (over 20!), and you can mix and match whichever algorithm you like to train your RL agent. This is what makes RLlib a versatile library to use!

<img src="https://drive.google.com/uc?export=view&id=11pv431GA0frNFZIRfeSp0mMeJ2coTkPW" width=800>


### Initializing Ray

In [None]:
import numpy as np
import pprint
import ray

# Start a new instance of Ray (when running this tutorial locally) or
# connect to an already running one (when running this tutorial through Anyscale).
# ray.shutdown()
ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial: `RuntimeError: Maybe you called ray.init twice by accident?`
# Try: `ray.shutdown() + ray.init()` or `ray.init(ignore_reinit_error=True)`

### Creating an RLlib Trainer (PPOTrainer)

In [None]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here because it's very flexible wrt its supported
# action spaces and model types and because it learns well on almost any problem.
from ray.rllib.agents.ppo import PPOTrainer

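# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
config = {
    "env": MultiAgentArena,
    # RLlib passes this dict as the single `config` argument to `MultiAgentArena(...)`,
    # so the keys must sit at the top level (not nested under another "config" key).
    "env_config": {
        "width": 10,
        "height": 10,
        "ts": 100,
    },

    # !PyTorch users! Change this to "torch" if you installed torch instead of tf.
    "framework": "tf",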

    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

### Ready to train with RLlib's PPO algorithm

That's it, we are ready to train.
Calling `Trainer.train()` will execute a single "training iteration".

One iteration for most algos involves:

1) sampling from the environment(s)

2) using the sampled data (observations, actions taken, rewards) to update the policy model (neural network), such that it would pick better actions in the future, leading to higher rewards.
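
The two phases can be illustrated with a toy example that does not use RLlib at all. The sketch below runs a plain REINFORCE-style policy-gradient update (not PPO) on a hypothetical two-armed bandit. All names in it (`sample_batch`, `update_policy`, `train_one_iteration`, the reward values) are illustrative assumptions and do not correspond to RLlib internals.

```python
import math
import random

rng = random.Random(0)
ARM_REWARDS = [0.0, 1.0]  # hypothetical environment: arm 1 always pays more
prefs = [0.0, 0.0]  # the "policy model": one preference (logit) per action


def action_probs(prefs):
    """Softmax over the preferences."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(prefs, batch_size):
    """Phase 1: act in the environment and record (action, reward) pairs."""
    probs = action_probs(prefs)
    batch = []
    for _ in range(batch_size):
        action = rng.choices([0, 1], weights=probs)[0]
        batch.append((action, ARM_REWARDS[action]))
    return batch


def update_policy(prefs, batch, lr=0.1):
    """Phase 2: shift preferences toward actions with above-average reward."""
    probs = action_probs(prefs)
    baseline = sum(reward for _, reward in batch) / len(batch)
    new_prefs = list(prefs)
    for action, reward in batch:
        advantage = reward - baseline
        for a in range(len(prefs)):
            # Gradient of log softmax w.r.t. preference a.
            grad = (1.0 if a == action else 0.0) - probs[a]
            new_prefs[a] += lr * advantage * grad / len(batch)
    return new_prefs


def train_one_iteration(prefs):
    """One iteration = sample a batch, then update the policy with it."""
    batch = sample_batch(prefs, batch_size=50)
    mean_reward = sum(reward for _, reward in batch) / len(batch)
    return update_policy(prefs, batch), mean_reward


initial_prob = action_probs(prefs)[1]
for _ in range(200):
    prefs, mean_reward = train_one_iteration(prefs)
final_prob = action_probs(prefs)[1]
print(f"P(better arm): {initial_prob:.2f} -> {final_prob:.2f}")
```

RLlib's `train()` follows the same sample-then-update rhythm, but with parallel rollout workers, a neural network policy, and (here) PPO's loss instead of this toy update.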

Let's try it out!


In [None]:
# Runs 1 Iteration of Training
results = rllib_trainer.train()

# Delete the config from the results for clarity.
# Only the stats will remain, then.
del results["config"]
# Pretty print the stats.
pprint.pprint(results)
del rllib_trainer

## Exercise 2: Training with Multiple Policies

So far, our experiment has been ill-configured, because both
agents, which should behave differently due to their different
tasks and reward functions, learn the same policy: the "default_policy",
which RLlib always provides if you don't configure anything else.

Remember that RLlib does not know at Trainer setup time how many and which agents the environment will "produce". Agent control (adding agents, removing them, terminating episodes for agents) is entirely in the Env's hands.
Let's fix our single policy problem and introduce the "multiagent" API.

<img src="https://drive.google.com/uc?export=view&id=1rsRMLN8KyEHKS4XCcjRmUW19kpRjqB8z" width=800>

To turn on RLlib's multi-agent functionality, follow these steps:

1. Define a policies dict, mapping policy IDs (e.g. "policy1") to 4-tuples consisting of 1) policy class (None for using the default class), 2) observation space, 3) action space, and 4) config overrides (empty dict for no overrides and using the Trainer's main config dict).
1. Define a policy mapping function, mapping agent IDs (e.g. a string like "agent1", produced by the environment in the returned observation/rewards/dones-dicts) to a policy ID (another string, e.g. "policy1").
1. Pass the policies dict and the policy mapping function into the Trainer config.
1. Train!
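
The first three steps above can be sketched as follows. This is one possible layout, not the only valid one. To keep the block runnable without RLlib or gym, `obs_space` and `act_space` are placeholder strings standing in for `env.observation_space` and `env.action_space`, and the `*args, **kwargs` in the mapping function are an assumption that makes it tolerant of RLlib versions that pass extra arguments.

```python
# Placeholders: in the notebook, use `env.observation_space` and
# `env.action_space` from the already created MultiAgentArena instead.
obs_space = "<env.observation_space>"
act_space = "<env.action_space>"

# 1) Policy ID -> (policy class, obs space, action space, config overrides).
policies = {
    "policy1": (None, obs_space, act_space, {}),
    "policy2": (None, obs_space, act_space, {}),
}


# 2) Agent ID -> policy ID.
def policy_mapping_fn(agent_id, *args, **kwargs):
    return {"agent1": "policy1", "agent2": "policy2"}[agent_id]


# 3) The "multiagent" sub-dict that gets merged into the Trainer config.
multiagent_config = {
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
    },
}

print(policy_mapping_fn("agent1"))  # policy1
print(policy_mapping_fn("agent2"))  # policy2
```

In the notebook, `config.update(multiagent_config)` would add these settings before `PPOTrainer(config=config)` is constructed.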

If you get stuck, https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical provides a great example.

**Good luck! :)**

In [None]:
# Run this if necessary
ray.shutdown()
ray.init()

In [None]:
# Exercise 2
# 1) Define the policies definition dict:
# Each policy in there is defined by its ID (key) mapping to a 4-tuple (value):
# - Policy class (None for using the "default" class, e.g. PPOTFPolicy for PPO+tf or PPOTorchPolicy for PPO+torch).
# - obs-space (we get this directly from our already created env object).
# - act-space (we get this directly from our already created env object).
# - config-overrides dict (leave empty for using the Trainer's config as-is)
policies = {
    ### Modify Code here ####
    "policy1": None,
    "policy2": None,
}
# Note that now we won't have a "default_policy" anymore, just "policy1" and "policy2".

# 2) Define an agent->policy mapping function.
# The mapping here is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent_id: str) -> str:
    # Make sure agent ID is valid.
    assert agent_id in ["agent1", "agent2"], f"ERROR: invalid agent ID {agent_id}!"
    ### Modify Code here ####
    return None

config = {
    "env": MultiAgentArena,  # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    # RLlib passes this dict as the single `config` argument to `MultiAgentArena(...)`,
    # so the keys must sit at the top level (not nested under another "config" key).
    "env_config": {
        "width": 10,
        "height": 10,
        "ts": 100,
    },
    # !PyTorch users! Change this to "torch" if you installed torch instead of tf.
    "framework": "tf",
    "create_env_on_driver": True,
}

# 3) Adding the above to our config.
### Modify Code here ####
config.update({
    "multiagent": {
        "policies": None,
        "policy_mapping_fn": None,
    },
})

pprint.pprint(config)
print()
print(f"agent1 is now mapped to {policy_mapping_fn('agent1')}")
print(f"agent2 is now mapped to {policy_mapping_fn('agent2')}")

rllib_trainer = PPOTrainer(config=config)

In [None]:
# 4) Run `train()` n times. Repeatedly call `train()` now to see rewards increase.
# Move on once you see (agent1 + agent2) episode rewards of 10.0 or more.
for _ in range(10):
    ### Modify Code here ####
    results = None
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={results['episode_reward_mean']}")

Now that we are set up correctly with two policies as per our "multiagent" config, let's call `train()` on the new Trainer another 10 times and track each policy's reward separately.

In [None]:
# Do another loop, but this time, we will print out each policy's individual reward.
for _ in range(10):
    results = rllib_trainer.train()
    r1 = results['policy_reward_mean']['policy1']
    r2 = results['policy_reward_mean']['policy2']
    r = r1 + r2
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={r} R1={r1} R2={r2}")

## Evaluating the Multi-Agent PPO Trainer

Now that we are done training with PPO, let's evaluate how the agents behave, using our code in Exercise 1.

In [None]:
out = Output()
display.display(out)

with out:
    env = MultiAgentArena()
    obs = env.reset()
    while True:
        a1 = rllib_trainer.compute_action(obs["agent1"], policy_id="policy1")
        a2 = rllib_trainer.compute_action(obs["agent2"], policy_id="policy2")
        obs, rewards, dones, _ = env.step({"agent1": a1, "agent2": a2})
        out.clear_output(wait=True)
        time.sleep(0.08)
        env.render()
        if dones["agent1"]:
            break

## Time for Q&A