<a href="https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/icra_hands_on_sb3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gym/Stable Baselines3 Getting Started - ICRA 2022

Github repo: https://github.com/araffin/tools-for-robotic-rl-icra2022

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the basics of using the Stable Baselines3 library: how to create an RL model, train it and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.
You will also learn how to define a gym wrapper to customise the training.


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [None]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization

In [None]:
!pip install stable-baselines3[extra]

In [None]:
# Optional: install SB3 contrib to have access to additional algorithms
!pip install sb3-contrib

# Part I: Getting Started

## First steps with the Gym interface

An environment that follows the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html) is quite simple to use.
It mainly provides three methods to the user:
- `reset()`, called at the beginning of an episode; it returns an observation
- `step(action)`, called to take an action in the environment; it returns the next observation, the immediate reward, whether the episode is over and additional information
- (Optional) `render(mode='human')`, which allows visualizing the agent in action. Note that the graphical interface does not work on Google Colab, so we cannot use it directly (we have to rely on `mode='rgb_array'` to retrieve an image of the scene)

Under the hood, it also contains two useful properties:
- `observation_space`, which is one of the gym spaces (`Discrete`, `Box`, ...) and describes the type and shape of the observation
- `action_space` which is also a gym space object that describes the action space, so the type of action that can be taken
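
The interaction loop implied by these methods can be sketched without any dependency. The `CountdownEnv` class and the `run_episode` helper below are hypothetical stand-ins written for this illustration (they are not part of gym); they only mimic the `reset()`/`step()` signatures described above, with the 4-tuple return value used throughout this notebook.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment that mimics the gym interface (not a gym.Env)."""

    def __init__(self, episode_length: int = 5):
        self.episode_length = episode_length
        self.steps_left = episode_length

    def reset(self) -> int:
        # Start a new episode and return the first observation
        self.steps_left = self.episode_length
        return self.steps_left

    def step(self, action: int):
        # Apply the action and return (observation, reward, done, info)
        self.steps_left -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps_left == 0
        return self.steps_left, reward, done, {}


def run_episode(env, seed: int = 0):
    """Run one episode with random actions, return (episode return, episode length)."""
    rng = random.Random(seed)
    obs = env.reset()
    done = False
    episode_return, episode_length = 0.0, 0
    while not done:
        action = rng.choice([0, 1])  # random policy over two discrete actions
        obs, reward, done, info = env.step(action)
        episode_return += reward
        episode_length += 1
    return episode_return, episode_length


episode_return, episode_length = run_episode(CountdownEnv(episode_length=5))
print(f"return={episode_return}, length={episode_length}")
```

A real gym environment is driven in the same way; only the observation and action types change.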

The best way to learn about gym spaces is to look at the [source code](https://github.com/openai/gym/tree/master/gym/spaces), but you need to know at least the main ones:
- `gym.spaces.Box`: A (possibly unbounded) box in $R^n$. Specifically, a Box represents the Cartesian product of n closed intervals. Each interval has the form of one of [a, b], (-oo, b], [a, oo), or (-oo, oo). Example: A 1D-Vector or an image observation can be described with the Box space.
```python
# Example for using image as input:
observation_space = spaces.Box(low=0, high=255, shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
```

- `gym.spaces.Discrete`: A discrete space in $\{ 0, 1, \dots, n-1 \}$
  Example: if you have two actions ("left" and "right") you can represent your action space using `Discrete(2)`, the first action will be 0 and the second 1.
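
To make the two definitions concrete, here is a minimal sketch of what sampling from a space and checking membership mean. `ToyDiscrete` and `ToyBox` are simplified, hypothetical stand-ins written for this illustration: they are not gym's implementation, and `ToyBox` only covers the bounded case with one closed interval per dimension.

```python
import random
from typing import List, Sequence


class ToyDiscrete:
    """Simplified stand-in for the discrete space {0, ..., n-1}."""

    def __init__(self, n: int):
        self.n = n

    def sample(self, rng: random.Random) -> int:
        # Pick one of the n possible values uniformly
        return rng.randrange(self.n)

    def contains(self, x) -> bool:
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Simplified stand-in for a bounded box: the product of intervals [low_i, high_i]."""

    def __init__(self, low: Sequence[float], high: Sequence[float]):
        assert len(low) == len(high), "low and high must have the same length"
        self.low, self.high = list(low), list(high)

    def sample(self, rng: random.Random) -> List[float]:
        # Sample each coordinate independently inside its own interval
        return [rng.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, x: Sequence[float]) -> bool:
        same_shape = len(x) == len(self.low)
        return same_shape and all(lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high))


rng = random.Random(0)
actions = ToyDiscrete(2)  # two actions: 0 ("left") and 1 ("right")
observations = ToyBox(low=[-1.0, -1.0, -8.0], high=[1.0, 1.0, 8.0])

print(actions.contains(actions.sample(rng)))            # a sample always belongs to its space
print(observations.contains(observations.sample(rng)))
print(actions.contains(2))                               # 2 is outside {0, 1}
```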



[Documentation on custom env](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html)

Below you can find an example of a custom environment:

In [5]:
from typing import Any, Callable, Dict, List, NamedTuple, Tuple, Union

import gym
import numpy as np

from stable_baselines3.common.env_checker import check_env

GymObs = Union[Tuple, Dict, np.ndarray, int]

class CustomEnv(gym.Env):
  """
  Minimal custom environment to demonstrate the Gym interface.
  """
  def __init__(self):
    super().__init__()
    self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(14,))
    self.action_space = gym.spaces.Box(low=-1, high=1, shape=(6,))

  def reset(self) -> GymObs:
    """
    Called at the beginning of an episode.
    :return: the first observation of the episode
    """
    return self.observation_space.sample()

  def step(self, action: Union[int, np.ndarray]) -> Tuple[GymObs, float, bool, Dict]:
    """
    Step into the environment.
    :return: A tuple containing the new observation, the reward signal,
      whether the episode is over and additional information.
    """
    obs = self.observation_space.sample()
    reward = 1.0
    done = False
    # Whether the termination was due to timeout or not
    info = {"TimeLimit.truncated": False}
    return obs, reward, done, info

env = CustomEnv()
# Check your custom environment
# this will print warnings and throw errors if needed
check_env(env)

## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environments [here](https://www.gymlibrary.ml/).

It is also recommended to check the [source code](https://github.com/openai/gym) to learn more about the observation and action spaces of each env, as the gym documentation is still a work in progress.
Not all algorithms work with all action spaces; you can find more details in this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html).

In [6]:
import gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to know which algorithm you can use on which problem.

In [7]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN

In [8]:
# Algorithms from the contrib repo
# https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
from sb3_contrib import QRDQN, TQC

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor: 

```PPO("MlpPolicy", env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy`, which is why using a string for the policy is the recommended option.

In [9]:
from stable_baselines3.ppo.policies import MlpPolicy

## Create the Gym env and instantiate the agent

For this example, we will use the Pendulum environment, a classic control problem.

"The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point."

Pendulum environment: [https://www.gymlibrary.ml/environments/classic_control/pendulum/](https://www.gymlibrary.ml/environments/classic_control/pendulum/)


![Pendulum-v1](https://huggingface.co/sb3/ppo-Pendulum-v1/resolve/main/pendulum.gif)

We chose the MlpPolicy because the observation of the Pendulum task is a feature vector, not images.

The type of action to use (discrete/continuous) will be automatically deduced from the environment action space.

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) algorithm, which is an Actor-Critic method: it uses a value function to improve the policy gradient descent (by reducing the variance).
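
The variance reduction mentioned above can be checked on a tiny example. The sketch below is not PPO: it assumes a toy two-action bandit (the rewards, the probability `p` and the helper names are made up for this illustration) and compares the policy-gradient estimator with and without subtracting a value baseline. Here the exact expected reward is used as the baseline, whereas an actor-critic method would use a learned value function.

```python
# Toy two-action bandit (an assumption made for illustration only)
p = 0.5                        # probability of picking action 1 (sigmoid policy with theta = 0)
rewards = {0: 10.0, 1: 11.0}   # deterministic reward of each action
probs = {0: 1.0 - p, 1: p}     # probability of each action under the policy


def score(action: int) -> float:
    # d/dtheta log pi(action) for a Bernoulli policy with p = sigmoid(theta)
    return action - p


def mean_and_variance(baseline: float):
    """Exact mean and variance of the estimator (reward - baseline) * score."""
    samples = {a: (rewards[a] - baseline) * score(a) for a in (0, 1)}
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    variance = sum(probs[a] * (samples[a] - mean) ** 2 for a in (0, 1))
    return mean, variance


# Expected reward under the policy, used here as the baseline
value = sum(probs[a] * rewards[a] for a in (0, 1))

print(mean_and_variance(0.0))    # no baseline:   (0.25, 27.5625)
print(mean_and_variance(value))  # with baseline: (0.25, 0.0)
```

Both estimators have the same mean (the true gradient), but subtracting the baseline removes the variance entirely in this toy case. In realistic problems the reduction is partial rather than total.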

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://sb3-contrib.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy algorithms like [DQN](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but is much faster regarding wall-clock time.


In [None]:
# Create the gym Env
env_id = "Pendulum-v1"
env = gym.make(env_id)

# Create the RL agent
# Here we are using tuned hyperparameters
model = PPO(
    "MlpPolicy",
    env,
    gamma=0.98,
    use_sde=True,
    sde_sample_freq=4,
    learning_rate=1e-3,
    verbose=1,
)

### Using the model to predict actions

In [None]:
print(env.observation_space)
print(env.action_space)

In [13]:
# Retrieve first observation
obs = env.reset()

In [14]:
# Predict the action to take given the observation
action, _ = model.predict(obs, deterministic=True)

In [None]:
# We are using continuous actions, therefore `action` is a numpy array
assert env.action_space.contains(action)

print(action)

Step in the environment

In [None]:
obs, reward, done, info = env.step(action)

In [None]:
print(f"obs_shape={obs.shape}, reward={reward}, done? {done}")

In [None]:
# Reset the env at the end of an episode
if done:
    obs = env.reset()

### Exercise (10 minutes): write the function to evaluate the agent

This function will be used to evaluate the performance of an RL agent.
Thanks to Stable Baselines3 interface, it will work with any SB3 algorithms and any Gym environment.

See docstring of the function for what is expected as input/output.

In [None]:
from stable_baselines3.common.base_class import BaseAlgorithm


def evaluate(
    model: BaseAlgorithm,
    env: gym.Env,
    n_eval_episodes: int = 100,
    deterministic: bool = False,
) -> float:
    """
    Evaluate an RL agent for `n_eval_episodes`.

    :param model: the RL Agent
    :param env: the gym Environment
    :param n_eval_episodes: number of episodes to evaluate it
    :param deterministic: Whether to use deterministic or stochastic actions
    :return: Mean episodic reward for the last `n_eval_episodes`
     (Mean over episodes of the cumulative episodic reward)
    """
    ### YOUR CODE HERE
    # TODO: run `n_eval_episodes` episodes in the Gym env
    # using the RL agent and keep track of the total reward
    # collected for each episode (aka cumulative reward or episode return).
    # Finally, compute the mean and print it
    total_reward_per_episode = []
    for _ in range(n_eval_episodes):
        done = False
        cumulative_reward = 0.0
        obs = env.reset()
        # Loop until the episode terminates
        while not done:
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, reward, done, info = env.step(action)
            # Update cumulative reward
            cumulative_reward += reward
            if done:
                total_reward_per_episode.append(cumulative_reward)
    
    # Print some infos
    mean_episode_reward = np.mean(total_reward_per_episode)
    std_reward = np.std(total_reward_per_episode)
    print(f"Mean episode reward = {mean_episode_reward:.2f} +/- {std_reward:.2f}")

    ### END OF YOUR CODE
    return mean_episode_reward

Let's evaluate the untrained agent. It should behave like a random agent.

In [None]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, env, n_eval_episodes=20, deterministic=False)



Stable-Baselines3 already provides you with that helper (the actual implementation is a little more advanced):

In [None]:
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

In [None]:
# The Monitor wrapper keeps track of the training reward and other information (useful for plotting)
env = Monitor(env)
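
The kind of bookkeeping a monitoring wrapper performs can be sketched as follows. This is not SB3's `Monitor` implementation: `EpisodeStatsWrapper` and `ThreeStepEnv` are hypothetical, dependency-free classes written for this illustration. The `"r"` (episode return) and `"l"` (episode length) keys mirror, in simplified form, the entries that `Monitor` reports in `info["episode"]` at the end of an episode.

```python
class ThreeStepEnv:
    """Hypothetical toy environment: every episode lasts 3 steps with reward 1.0 per step."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, {}


class EpisodeStatsWrapper:
    """Simplified sketch of a monitoring wrapper (duck-typed, not a gym.Wrapper)."""

    def __init__(self, env):
        self.env = env
        self.episode_returns = []   # one entry per finished episode
        self.episode_lengths = []
        self._rewards = []          # rewards of the episode in progress

    def reset(self):
        self._rewards = []
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._rewards.append(reward)
        if done:
            # Summarize the finished episode and expose it through the info dict
            stats = {"r": sum(self._rewards), "l": len(self._rewards)}
            info = dict(info, episode=stats)
            self.episode_returns.append(stats["r"])
            self.episode_lengths.append(stats["l"])
        return obs, reward, done, info


monitored_env = EpisodeStatsWrapper(ThreeStepEnv())
for _ in range(2):
    obs, done = monitored_env.reset(), False
    while not done:
        obs, reward, done, info = monitored_env.step(0)

print(monitored_env.episode_returns, monitored_env.episode_lengths, info["episode"])
```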

In [None]:
# Seed to compare to previous implementation
# env.reset(seed=42) with gym 0.23
env.seed(42)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

## Train the agent and evaluate it

In [None]:
# Train the agent for 50 000 steps
model.learn(total_timesteps=50_000)

In [None]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

Apparently the training went well, the mean reward increased a lot! 

### Prepare video recording

In [None]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper. You can learn more about those wrappers in our documentation.

In [None]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record `video_length` steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs, deterministic=True)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

### Visualize trained agent



In [None]:
record_video('Pendulum-v1', model, video_length=500, prefix='ppo-pendulum')

In [None]:
show_videos('videos', prefix='ppo')

### [Optional] Exercise (5 minutes): Save and Load the Model, and Check that the Loading Was Correct

Save the model and then load it.

Don't forget to check that loading went well: the model must predict the same actions given the same observations.

In [None]:
# Sample observations using the environment observation space
observations = np.array([env.observation_space.sample() for _ in range(10)])

# Predict actions on those observations using trained model
action_before_saving, _ = model.predict(observations, deterministic=True)

In [None]:
# Save the model
model.save("ppo_pendulum")

In [None]:
# Delete the model (to demonstrate loading)
del model

In [None]:
!ls *.zip

ppo_pendulum.zip


In [None]:
# Load the model
model = PPO.load("ppo_pendulum")

In [None]:
# Predict actions on the observations with the loaded model
action_after_loading, _ = model.predict(observations, deterministic=True)

In [None]:
# Check that the predictions are the same
assert np.allclose(action_before_saving, action_after_loading), "Something went wrong in the loading"

## Bonus: Train an RL Model in One Line

The policy class to use will be inferred and the environment will be automatically created. This works because both are [registered](https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html).

In [None]:
model = PPO('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)

# Part II: Gym Wrappers


In this part, you will learn how to use *Gym Wrappers*, which allow you to do monitoring, normalization, limiting the number of steps, feature augmentation, ...
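
As one example of normalization, a wrapper can let the agent act in `[-1, 1]` while the environment expects another range. The sketch below is dependency-free and purely illustrative: `RescaleActionWrapper` and `EchoEnv` are hypothetical names, the wrapper is duck-typed rather than a `gym.Wrapper`, and the `[-2, 2]` bounds are chosen as an assumption to resemble a torque limit.

```python
class EchoEnv:
    """Hypothetical toy environment that returns the received action as its observation."""

    def reset(self):
        return 0.0

    def step(self, action):
        return action, 0.0, False, {}


class RescaleActionWrapper:
    """The agent acts in [-1, 1]; the wrapped env receives actions in [low, high]."""

    def __init__(self, env, low: float, high: float):
        self.env = env
        self.low, self.high = low, high

    def rescale(self, action: float) -> float:
        # Clip to [-1, 1], then map linearly to [low, high]
        clipped = max(-1.0, min(1.0, action))
        return self.low + 0.5 * (clipped + 1.0) * (self.high - self.low)

    def reset(self):
        return self.env.reset()

    def step(self, action: float):
        # Only the action is modified; everything else is passed through unchanged
        return self.env.step(self.rescale(action))


wrapped_env = RescaleActionWrapper(EchoEnv(), low=-2.0, high=2.0)
wrapped_env.reset()
for agent_action in (-1.0, 0.0, 1.0, 3.0):
    obs, _, _, _ = wrapped_env.step(agent_action)
    print(agent_action, "->", obs)
```

The original environment is left untouched: all the rescaling logic lives in the wrapper.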


## Anatomy of a gym wrapper

A gym wrapper follows the [gym](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html) interface: it has `reset()` and `step()` methods.

Because a wrapper is *around* an environment, we can access the wrapped environment with `self.env`. This makes it easy to interact with it without modifying the original env.
Many wrappers are already predefined; for a complete list, refer to the [gym documentation](https://github.com/openai/gym/tree/master/gym/wrappers).

In [None]:
class CustomWrapper(gym.Wrapper):
  """
  :param env:  Gym environment that will be wrapped
  """
  def __init__(self, env: gym.Env):
    # Call the parent constructor, so we can access self.env later
    super().__init__(env)
  
  def reset(self):
    """
    Reset the environment 
    """
    obs = self.env.reset()
    return obs

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional information
    """
    obs, reward, done, infos = self.env.step(action)
    return obs, reward, done, infos


### [Optional] Exercise (7 minutes): limit the episode length

In this exercise, the goal is to create a Gym wrapper that will limit the maximum number of steps per episode (timeout).


It will also pass a timeout signal in the info dict (under the `episode_timeout` key) to tell the agent that the termination was due to reaching the time limit.

Termination due to timeout must be handled separately (see [Time Limit in RL paper](https://arxiv.org/abs/1712.00378)), you can also take a look at [Issue #284](https://github.com/DLR-RM/stable-baselines3/issues/284) and [Issue #633](https://github.com/DLR-RM/stable-baselines3/issues/633).
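
To illustrate why the distinction matters, here is a minimal sketch of a one-step bootstrapped target that treats the two cases differently. The `td_target` function and its arguments are hypothetical names chosen for this illustration, not SB3's API: after a true termination there is no future reward, whereas after a timeout the episode was merely cut short, so the value of the next state should still be taken into account.

```python
def td_target(reward: float, next_value: float, done: bool, timeout: bool, gamma: float = 0.99) -> float:
    """
    One-step target for value learning.

    :param reward: immediate reward
    :param next_value: estimated value of the next state
    :param done: whether the episode ended at this step
    :param timeout: whether the episode ended only because of the time limit
    :param gamma: discount factor
    """
    if done and not timeout:
        # True termination: nothing can be collected afterwards, do not bootstrap
        return reward
    # Episode still running, or stopped by a timeout: bootstrap from the next state
    return reward + gamma * next_value


print(td_target(1.0, 10.0, done=True, timeout=False, gamma=0.5))  # 1.0
print(td_target(1.0, 10.0, done=True, timeout=True, gamma=0.5))   # 6.0
```

Treating a timeout as a true termination would wrongly teach the agent that the last state before the time limit has no future value.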

In [None]:
class TimeLimitWrapper(gym.Wrapper):
  """
  Limit the maximum number of steps per episode.

  :param env: Gym environment that will be wrapped
  :param max_steps: Max number of steps per episode
  """
  def __init__(self, env: gym.Env, max_steps: int = 100):
    # Call the parent constructor, so we can access self.env later
    super().__init__(env)
    self.max_steps = max_steps
    # YOUR CODE HERE
    # Counter of steps per episode
    self.n_steps = 0

    # END OF YOUR CODE
  
  def reset(self) -> GymObs:
    # YOUR CODE HERE
    # TODO: reset the counter and reset the env
    self.n_steps = 0
    obs = self.env.reset()

    # END OF YOUR CODE
    return obs

  def step(self, action: Union[int, np.ndarray]) -> Tuple[GymObs, float, bool, Dict]:
    # YOUR CODE HERE
    # TODO: 
    # 1. Step into the env
    # 2. Increment the episode counter
    # 3. Overwrite the done signal when time limit is reached 
    # (optional) 4. update the info dict (add an "episode_timeout" key)
    # when the episode was stopped due to timelimit
    obs, reward, done, infos = self.env.step(action)
    self.n_steps += 1
    if self.n_steps >= self.max_steps:
      # Note: in gym, the key is called "TimeLimit.truncated"
      # Termination due to this wrapper only (timeout)
      infos["episode_timeout"] = not done
      done = True

    # END OF YOUR CODE
    return obs, reward, done, infos

#### Test the wrapper

In [None]:
from gym.envs.classic_control.pendulum import PendulumEnv

# Here we create the environment directly because gym.make() would otherwise already wrap the environment in a TimeLimit wrapper
env = PendulumEnv()
# Wrap the environment
env = TimeLimitWrapper(env, max_steps=100)

In [None]:
obs = env.reset()
done = False
n_steps = 0
while not done:
  # Take random actions
  random_action = env.action_space.sample()
  obs, reward, done, infos = env.step(random_action)
  n_steps += 1

print(f"Episode length: {n_steps} steps, info dict: {infos}")

Episode length: 100 steps, info dict: {'episode_timeout': True}


In practice, `gym` already has a wrapper for that, named `TimeLimit` (`gym.wrappers.TimeLimit`), which is used by most environments.

# Conclusion

What we have seen in this notebook:
- SB3 101
- Gym wrappers to modify the env
- more complete tutorial: https://github.com/araffin/rl-tutorial-jnrr19
- longer hands-on session: https://www.youtube.com/watch?v=Ikngt0_DXJg

