# Stable Baselines3 Hands-on Session - RLVS

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the basics for using stable baselines3 library: how to create a RL model, train it and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.
You will also learn how to define a gym wrapper and callback to customise the training.
We will finish this session by trying out multiprocessing.


## Install Dependencies and Stable Baselines3 Using Pip

List of full dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [None]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install stable-baselines3[extra]

# Part I: Getting Started

## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environment [here](https://gym.openai.com/envs/#classic_control).

It is also recommended to check the [source code](https://github.com/openai/gym) to learn more about the observation and action space of each env, as gym does not have a proper documentation.
Not all algorithms can work with all action spaces, you can find more in this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html)

In [3]:
import gym
import numpy as np

The first thing you need to import is the RL model, check the documentation to know what you can use on which problem

In [44]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor: 

```PPO("MlpPolicy", env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy`, that's why using string for the policy is the recommened option.

In [5]:
from stable_baselines3.ppo.policies import MlpPolicy

## Create the Gym env and instantiate the agent

For this example, we will use CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole environment: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)


We chose the MlpPolicy because the observation of the CartPole task is a feature vector, not images.

The type of action to use (discrete/continuous) will be automatically deduced from the environment action space

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo2.html) algorithm, which is an Actor-Critic method: it uses a value function to improve the policy gradient descent (by reducing the variance).

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy alorithms like [DQN](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but is much faster regarding wall-clock time.


In [6]:
# Create the gym Env
env = gym.make('CartPole-v1')
# Create the RL agent
model = PPO(MlpPolicy, env, verbose=1)

### Using the model to predict actions

In [7]:
print(env.observation_space)
print(env.action_space)

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
Discrete(2)


In [8]:
# Retrieve first observation
obs = env.reset()

In [11]:
action, _states = model.predict(obs, deterministic=False)

In [12]:
print(action)

1


Step in the environment

In [14]:
obs, reward, done, infos = env.step(action)

In [15]:
print(f"obs_shape={obs.shape}, reward={reward}, done? {done}")

obs_shape=(4,), reward=1.0, done? False


### Exercise (10 minutes): write the function to evaluate the agent

In [17]:
from stable_baselines3.common.base_class import BaseAlgorithm


def evaluate(
    model: BaseAlgorithm,
    env: gym.Env,
    num_episodes: int = 100,
    deterministic: bool = False,
) -> float:
    """
    Evaluate an RL agent for `num_episodes`.

    :param model: the RL Agent
    :param env: the gym Environment
    :param num_episodes: number of episodes to evaluate it
    :param deterministic: Whether to use deterministic or stochastic actions
    :return: Mean reward for the last `num_episodes`
    """
    n_episodes = 0
    episode_reward = 0.0
    episode_rewards = []
    obs = env.reset()

    while n_episodes < num_episodes:
        
        action, _ = model.predict(obs, deterministic=deterministic)
        obs, reward, done, infos = env.step(action)
        episode_reward += reward

        # Not needed when using `VecEnv`
        if done:
            episode_rewards.append(episode_reward)
            episode_reward = 0.0
            n_episodes += 1
            obs = env.reset()

    mean_episode_reward = np.mean(episode_rewards)
    print(f"Mean reward: {mean_episode_reward} Num episodes: {num_episodes}")

    return mean_episode_reward

Let's evaluate the un-trained agent, this should be a random agent.

In [19]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, env, num_episodes=100, deterministic=False)

Mean reward: 23.0 Num episodes: 10


23.0

Stable-Baselines already provides you with that helper:

In [20]:
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

In [21]:
# The Monitor wrapper allows to keep track of the training reward and other infos (useful for plotting)
env = Monitor(env)

In [22]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:150.62 +/- 29.90


## Train the agent and evaluate it

In [None]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

In [None]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

Apparently the training went well, the mean reward increased a lot! 

### Prepare video recording

In [None]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the video, showing only the only starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper, you will learn about those wrapper in the next notebook.

In [None]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record 500 steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

### Visualize trained agent



In [None]:
record_video('CartPole-v1', model, video_length=500, prefix='ppo-cartpole')

In [None]:
show_videos('videos', prefix='ppo')

### Exercise (5 minutes): Save, Load The Model and that the loading was correct

In [34]:
# Sample obvservations
observations = np.array([env.observation_space.sample() for _ in range(10)])
# Predict action using trained model
action_before_saving, _ = model.predict(observations, deterministic=True)

In [35]:
# Save the model
model.save("ppo_cartpole")

In [40]:
!ls *.zip

ppo_cartpole.zip


In [37]:
# Load the model
model = PPO.load("ppo_cartpole")

In [38]:
# Predict with the loaded model
action_after_loading, _ = model.predict(observations, deterministic=True)

In [39]:
# Check that the predictions are the same
assert np.allclose(action_before_saving, action_after_loading), "Somethng went wrong in the loading"

## Bonus: Train a RL Model in One Line

The policy class to use will be inferred and the environment will be automatically created. This works because both are [registered](https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html).

In [None]:
model = PPO('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)

# Part II: Gym Wrappers


In this part, you will learn how to use *Gym Wrappers* which allow to do monitoring, normalization, limit the number of steps, feature augmentation, ...


## Anatomy of a gym wrapper

A gym wrapper follows the [gym](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html) interface: it has a `reset()` and `step()` method.

Because a wrapper is *around* an environment, we can access it with `self.env`, this allow to easily interact with it without modifying the original env.
There are many wrappers that have been predefined, for a complete list refer to [gym documentation](https://github.com/openai/gym/tree/master/gym/wrappers)

In [None]:
class CustomWrapper(gym.Wrapper):
  """
  :param env:  Gym environment that will be wrapped
  """
  def __init__(self, env: gym.Env):
    # Call the parent constructor, so we can access self.env later
    super().__init__(env)
  
  def reset(self):
    """
    Reset the environment 
    """
    obs = self.env.reset()
    return obs

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    obs, reward, done, info = self.env.step(action)
    return obs, reward, done, info


### Exercise (7 minutes): limit the episode length

In [41]:
class TimeLimitWrapper(gym.Wrapper):
  """
  :param env: Gym environment that will be wrapped
  :param max_steps: Max number of steps per episode
  """
  def __init__(self, env: gym.Env, max_steps: int = 100):
    # Call the parent constructor, so we can access self.env later
    super().__init__(env)
    self.max_steps = max_steps
    # Counter of steps per episode
    self.episode_steps = 0
  
  def reset(self):
    # Reset the counter
    self.episode_steps = 0
    return self.env.reset()

  def step(self, action):
    obs, reward, done, infos = self.env.step(action)
    # Increment the counter
    self.episode_steps += 1
    # Overwrite the done signal when time limit is reached 
    if self.episode_steps >= self.max_steps:
      done = True
      # Update the info dict to signal that the time limit was exceeded
      infos["episode_timeout"] = True
    return obs, reward, done, infos

#### Test the wrapper

In [42]:
from gym.envs.classic_control.pendulum import PendulumEnv

# Here we create the environment directly because gym.make() already wrap the environement in a TimeLimit wrapper otherwise
env = PendulumEnv()
# Wrap the environment
env = TimeLimitWrapper(env, max_steps=100)

In [43]:
obs = env.reset()
done = False
n_steps = 0
while not done:
  # Take random actions
  random_action = env.action_space.sample()
  obs, reward, done, info = env.step(random_action)
  n_steps += 1

print(n_steps, info)

100 {'episode_timeout': True}


In practice, `gym` already have a wrapper for that named `TimeLimit` (`gym.wrappers.TimeLimit`) that is used by most environments.

# Part III: Callbacks

In this part, you will learn how to use [Callbacks](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html) which allow to do monitoring, auto saving, model manipulation, progress bars, ...

Please read the [documentation](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html). Although Stable-Baselines3 provides you with a callback collection (e.g. for creating checkpoints or for evaluation), we are going to re-implement some so you can get a good understanding of how they work.

To build a custom callback, you need to create a class that derives from `BaseCallback`. This will give you access to events (`_on_training_start`, `_on_step()`) and useful variables (like `self.model` for the RL model).

`_on_step` returns a boolean value for whether or not the training should continue.

Thanks to the access to the models variables, in particular `self.model`, we are able to even change the parameters of the model without halting the training, or changing the model's code.

In [47]:
from stable_baselines3.common.callbacks import BaseCallback

In [None]:
class CustomCallback(BaseCallback):
    """
    A custom callback that derives from ``BaseCallback``.

    :param verbose: (int) Verbosity level 0: not output 1: info 2: debug
    """
    def __init__(self, verbose=0):
        super(CustomCallback, self).__init__(verbose)
        # Those variables will be accessible in the callback
        # (they are defined in the base class)
        # The RL model
        # self.model = None  # type: BaseRLModel
        # An alias for self.model.get_env(), the environment used for training
        # self.training_env = None  # type: Union[gym.Env, VecEnv, None]
        # Number of time the callback was called
        # self.n_calls = 0  # type: int
        # self.num_timesteps = 0  # type: int
        # local and global variables
        # self.locals = None  # type: Dict[str, Any]
        # self.globals = None  # type: Dict[str, Any]
        # The logger object, used to report things in the terminal
        # self.logger = None  # type: logger.Logger
        # # Sometimes, for event callback, it is useful
        # # to have access to the parent object
        # self.parent = None  # type: Optional[BaseCallback]

    def _on_training_start(self) -> None:
        """
        This method is called before the first rollout starts.
        """
        pass

    def _on_rollout_start(self) -> None:
        """
        A rollout is the collection of environment interaction
        using the current policy.
        This event is triggered before collecting new samples.
        """
        pass

    def _on_step(self) -> bool:
        """
        This method will be called by the model after each call to `env.step()`.

        For child callback (of an `EventCallback`), this will be called
        when the event is triggered.

        :return: If the callback returns False, training is aborted early.
        """
        return True

    def _on_rollout_end(self) -> None:
        """
        This event is triggered before updating the policy.
        """
        pass

    def _on_training_end(self) -> None:
        """
        This event is triggered before exiting the `learn()` method.
        """
        pass

Here we have a simple callback that can only be called twice:

In [48]:
class SimpleCallback(BaseCallback):
    """
    a simple callback that can only be called twice

    :param verbose: (int) Verbosity level 0: not output 1: info 2: debug
    """
    def __init__(self, verbose=0):
        super(SimpleCallback, self).__init__(verbose)
        self._called = False
    
    def _on_step(self):
      if not self._called:
        print("callback - first call")
        self._called = True
        return True # returns True, training continues.
      print("callback - second call")
      return False # returns False, training stops.      

In [49]:
model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1)
model.learn(8000, callback=SimpleCallback())

Using cuda device
Creating environment from the given name 'Pendulum-v0'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
callback - first call
callback - second call


<stable_baselines3.sac.sac.SAC at 0x7f70db60d1d0>

## Exercise (8 minutes): Checkpoint Callback

In RL, it is quite useful to save checkpoints during, as we can end up with burn-in of a bad policy. It also useful if you want to see the progression over time.

This is a typical use case for callback, as they can call the save function of the model, and observe the training over time.

In [None]:
import os

import numpy as np

In [52]:
class CheckpointCallback(BaseCallback):
    """
    Callback for saving a model every ``save_freq`` steps

    :param save_freq:
    :param save_path: Path to the folder where the model will be saved.
    :param name_prefix: Common prefix to the saved models
    :param verbose: Whether to print additional infos or not
    """

    def __init__(self, save_freq: int, save_path: str, name_prefix: str = "rl_model", verbose: int = 0):
        super().__init__(verbose)
        self.save_freq = save_freq
        self.save_path = save_path
        self.name_prefix = name_prefix

    def _init_callback(self) -> None:
        # Create folder if needed
        os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self) -> bool:
        if self.n_calls % self.save_freq == 0:
          # Name of the checkpoint
          checkpoint_path = os.path.join(self.save_path, f"{self.name_prefix}_{self.num_timesteps}_steps")
          
          if self.verbose > 0:
            print(f"Saving checkpoint to {checkpoint_path}.zip")

          self.model.save(checkpoint_path)
        return True

In [None]:
log_dir = "/tmp/gym/"
# Create Callback
callback = CheckpointCallback(save_freq=1000, save_path="/tmp/gym/", verbose=1)

model = A2C("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10000, callback=callback)

In [None]:
!ls "/tmp/gym/"

Note: The `CheckpointCallback` as well as other [common callbacks](https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html), like the `EvalCallback` are already included in Stable-Baselines3.

## Multiprocessing Demo


[Vectorized Environments](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html) are a method for stacking multiple independent environments into a single environment. Instead of training an RL agent on 1 environment per step, it allows us to train it on n environments per step. This provides two benefits:
* Agent experience can be collected more quickly
* The experience will contain a more diverse range of states, it usually improves exploration

Stable-Baselines provides two types of Vectorized Environment:
- SubprocVecEnv which run each environment in a separate process
- DummyVecEnv which run all environment on the same process

In practice, DummyVecEnv is usually faster than SubprocVecEnv because of communication delays that subprocesses have.

In [59]:
import time

from stable_baselines3.common.env_util import make_vec_env

In [75]:
env = gym.make("Pendulum-v0")
n_steps = 1024

In [80]:
start_time_one_env = time.time()
model = PPO("MlpPolicy", env, n_epochs=1, n_steps=n_steps, verbose=1).learn(int(2e4))
time_one_env = time.time() - start_time_one_env

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [81]:
print(f"Took {time_one_env:.2f}s")

Took 22.75s


In [82]:
start_time_vec_env = time.time()
# Create 16 environments
vec_env = make_vec_env("Pendulum-v0", n_envs=16)
# At each call to `env.step()`, 16 transitions will be collected, so we account for that for fair comparison
model = PPO("MlpPolicy", vec_env, n_epochs=1, n_steps=n_steps // 16, verbose=1).learn(int(2e4))

time_vec_env = time.time() - start_time_vec_env

Using cuda device


In [83]:
print(f"Took {time_vec_env:.2f}s")

Took 4.04s


# Part IV: The importance of hyperparameter tuning



When compared with Supervised Learning, Deep Reinforcement Learning is far more sensitive to the choice of hyper-parameters such as learning rate, number of neurons, number of layers, optimizer ... etc. 
Poor choice of hyper-parameters can lead to poor/unstable convergence. This challenge is compounded by the variability in performance across random seeds (used to initialize the network weights and the environment).


### Challenge (15 minutes): "Grad Student Descent" - can you beat automatic tuning?

The challenge is to find the best hyperparameters (max performance) for A2C on CartPole with a limited budget of 20 000 steps.

You will compete against automatic hyperparameter tuning, good luck ;)


Maximum reward: 500 on `CartPole-v1`

The hyperparameters should work for different random seeds.

In [84]:
budget = int(2e4)

#### The baseline: default hyperparameters

In [92]:
model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1).learn(budget)

Using cuda device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [93]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:173.96 +/- 23.43


**Your goal is to beat that baseline and get closer to the optimal 500 episodic reward**

Time to tune!

In [91]:
import torch.nn as nn

In [111]:
policy_kwargs = dict(
    net_arch=[
      dict(vf=[64, 64], pi=[64, 64]), # network architectures for actor/critic
    ],
    ortho_init=True, # Orthogonal initialization,
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=5,
    learning_rate=7e-4,
    gamma=0.99, # discount factor
    gae_lambda=1.0, # Factor for trade-off of bias vs variance for Generalized Advantage Estimator
                    # Equivalent to classic advantage when set to 1.
    max_grad_norm=0.5, # The maximum value for the gradient clipping
    ent_coef=0.0, # Entropy coefficient for the loss calculation
)

model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1, **hyperparams).learn(budget)

Using cuda device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [112]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:171.36 +/- 24.66


### Result from automatic hyperparameter tuning

In [109]:
policy_kwargs = dict(
    net_arch=[
      dict(vf=[64], pi=[64]),
    ],
    ortho_init=False,
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=int(2 ** 6),
    learning_rate=0.002,
    gamma=0.999,
    gae_lambda=0.88,
    max_grad_norm=1.1,
)

model = A2C("MlpPolicy", "CartPole-v1", seed=20, verbose=1, **hyperparams).learn(budget)

Using cuda device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [110]:
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:422.24 +/- 83.70


Simple example of hyperparameter tuning: https://github.com/optuna/optuna/blob/master/examples/rl/sb3_simple.py

Complete example: https://github.com/DLR-RM/rl-baselines3-zoo

# Conclusion

- SB3 101
- Gym wrappers
- SB3 callbacks
- the importance of good hyperparameters
- multiprocessing
- more complete tutorial: https://github.com/araffin/rl-tutorial-jnrr19

