# Stable Baselines3 Tutorial - Getting Started

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the basics of using the Stable-Baselines3 library: how to create an RL model, train it, and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [1]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install stable-baselines3[extra]

Reading package lists... Done
Building dependency tree       
Reading state information... Done
freeglut3-dev is already the newest version (2.8.1-3).
freeglut3-dev set to manually installed.
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
The following NEW packages will be installed:
  xvfb
0 upgraded, 1 newly installed, 0 to remove and 13 not upgraded.
Need to get 784 kB of archives.
After this operation, 2,270 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 xvfb amd64 2:1.19.6-1ubuntu4.8 [784 kB]
Fetched 784 kB in 1s (712 kB/s)
Selecting previously unselected package xvfb.
(Reading database ... 146374 files and directories currently installed.)
Preparing to unpack .../xvfb_2%3a1.19.6-1ubuntu4.8_amd64.deb ...
Unpacking xvfb (2:1.19.6-1ubuntu4.8) ...
Setting up xvfb (2:1.19.6-1ubuntu4.8) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Collecting stable-baselines3[extra]
  Downloading https://file

In [2]:
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic


In [5]:
!ls -ld /usr/local/cuda*

lrwxrwxrwx  1 root root    9 Jan 20 17:22 /usr/local/cuda -> cuda-10.1
drwxr-xr-x 16 root root 4096 Jan 20 17:19 /usr/local/cuda-10.0
drwxr-xr-x  1 root root 4096 Jan 20 17:21 /usr/local/cuda-10.1


In [20]:
!pip uninstall -y rlscope || true
!pip install rlscope==0.0.1+cu101 -f https://uoft-ecosystem.github.io/rlscope/whl

Uninstalling rlscope-0.0.1+cu101:
  Successfully uninstalled rlscope-0.0.1+cu101
Looking in links: https://uoft-ecosystem.github.io/rlscope/whl
Collecting rlscope==0.0.1+cu101
  Downloading https://github.com/UofT-EcoSystem/rlscope/releases/download/v0.0.1/rlscope-0.0.1%2Bcu101-py3-none-manylinux1_x86_64.whl (26.6MB)
     |████████████████████████████████| 26.6MB 123kB/s
Installing collected packages: rlscope
Successfully installed rlscope-0.0.1+cu101


In [11]:
!apt install texlive-extra-utils

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  fonts-droid-fallback fonts-lato fonts-lmodern fonts-noto-mono ghostscript
  gsfonts javascript-common libcupsfilters1 libcupsimage2 libfile-homedir-perl
  libfile-which-perl libgs9 libgs9-common libijs-0.35 libjbig2dec0 libjs-jquery
  libkpathsea6 libmime-charset-perl libpotrace0 libptexenc1 libruby2.5
  libsombok3 libsynctex1 libtexlua52 libtexluajit2 libunicode-linebreak-perl
  libyaml-tiny-perl libzzip-0-13 lmodern poppler-data rake ruby
  ruby-did-you-mean ruby-minitest ruby-net-telnet ruby-power-assert
  ruby-test-unit ruby2.5 rubygems-integration t1utils tex-common texlive-base
  texlive-binaries texlive-latex-base texlive-latex-recommended
Suggested packages:
  fonts-noto ghostscript-x apache2 | lighttpd | httpd libencode-hanextra-perl
  libpod2-base-perl poppler-utils fonts-japanese-mincho | fonts-ipafont-mincho
  fonts-japanese-

## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environments [here](https://gym.openai.com/envs/#classic_control).

It is also recommended to check the [source code](https://github.com/openai/gym) to learn more about the observation and action spaces of each environment, as gym does not have proper documentation.
Not all algorithms work with all action spaces; you can find more details in this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html).
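
To make the gym interface concrete, here is a minimal sketch of an environment exposing `reset()` and `step()` in the 4-tuple style used throughout this notebook. `ToyCorridorEnv` and `run_random_episode` are hypothetical names written for illustration; the class does not inherit from `gym.Env` and defines no `observation_space` or `action_space`, so it would need those additions before it could be passed to Stable-Baselines3.

```python
import random


class ToyCorridorEnv:
    """Hypothetical toy environment exposing gym-style reset() and step()."""

    def __init__(self, length=5, max_steps=20):
        self.length = length        # the goal sits at position `length`
        self.max_steps = max_steps  # the episode is cut off after this many steps
        self.n_actions = 2          # discrete actions: 0 = left, 1 = right
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action and return (obs, reward, done, info)."""
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        self.steps += 1
        reached_goal = self.position >= self.length
        done = reached_goal or self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        return self.position, reward, done, {}


def run_random_episode(env, seed=0):
    """Roll out one episode with uniformly random actions."""
    rng = random.Random(seed)
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = rng.randrange(env.n_actions)
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward, env.steps
```

The loop in `run_random_episode` has the same shape as the evaluation loops used later in this notebook: reset once, then step until `done` is true.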

In [1]:
import gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to see which algorithm you can use for which problem.

In [2]:
from stable_baselines3 import PPO

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor: 

```PPO('MlpPolicy', env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms, like `SAC`, have their own `MlpPolicy`; that is why passing the policy as a string is the recommended option.
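
Why a string is safer can be illustrated with a toy lookup table. This is only a sketch of the idea, not Stable-Baselines3 internals: `ActorCriticMlp`, `SoftActorCriticMlp`, `POLICY_ALIASES` and `resolve_policy` are all hypothetical names. The point is that the same string can map to a different class depending on the algorithm, whereas an imported class is fixed.

```python
class ActorCriticMlp:
    """Hypothetical stand-in for the policy network of an on-policy algorithm."""


class SoftActorCriticMlp:
    """Hypothetical stand-in for the policy network of an off-policy algorithm."""


# Each algorithm keeps its own alias table (toy illustration, not SB3 internals).
POLICY_ALIASES = {
    "PPO": {"MlpPolicy": ActorCriticMlp},
    "SAC": {"MlpPolicy": SoftActorCriticMlp},
}


def resolve_policy(algo_name, policy):
    """Return a policy class from either a string alias or a class."""
    if isinstance(policy, str):
        try:
            return POLICY_ALIASES[algo_name][policy]
        except KeyError:
            raise ValueError(f"Unknown policy {policy!r} for {algo_name}") from None
    # A class is used as given, even if it was written for another algorithm.
    return policy
```

With this table, `resolve_policy("SAC", "MlpPolicy")` picks the SAC-specific class, while passing a class imported for PPO would be accepted unchanged.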

In [3]:
from stable_baselines3.ppo.policies import MlpPolicy

## Create the Gym env and instantiate the agent

For this example, we will use the CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole environment: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)


We chose the MlpPolicy because the observation of the CartPole task is a feature vector, not images.

The type of action to use (discrete/continuous) will be automatically deduced from the environment action space.

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) algorithm, which is an Actor-Critic method: it uses a value function to reduce the variance of the policy gradient estimate.

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy algorithms like [DQN](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but is much faster in terms of wall-clock time.
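
In PPO, the trust-region idea is usually realized by clipping the probability ratio between the new and the old policy. The sketch below computes that clipped surrogate objective for a single sample in plain Python. The function name `clipped_surrogate` is made up for this example, and the default `clip_range=0.2` is a commonly used value assumed here, not something set elsewhere in this notebook.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped objective (a quantity to maximize).

    ratio: pi_new(a|s) / pi_old(a|s) for the sampled action.
    advantage: estimated advantage of that action.
    """
    # Keep the ratio inside [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the advantage is positive, the objective stops growing once the ratio exceeds `1 + clip_range`, so a single update has no incentive to move the policy far from the one that collected the data.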


In [4]:
env = gym.make('CartPole-v1')

model = PPO(MlpPolicy, env, verbose=0)

We create a helper function to evaluate the agent:

In [5]:
def evaluate(model, num_episodes=100):
    """
    Evaluate a RL agent
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward for the last num_episodes
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

Let's evaluate the untrained agent; it should behave like a random agent.

In [6]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_episodes=100)

Mean reward: 24.04 Num episodes: 100


Stable-Baselines3 already provides you with that helper:

In [7]:
from stable_baselines3.common.evaluation import evaluate_policy

In [8]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:476.74 +/- 43.89
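
The two reported numbers summarize the per-episode returns. Here is a minimal standard-library sketch of such a summary, assuming the spread is a population standard deviation (the behaviour of `np.std` with default arguments). `summarize_returns` is a hypothetical helper, and the returns below are made-up values, not results from this notebook.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns):
    """Return (mean, std) over a list of per-episode total rewards."""
    return fmean(episode_returns), pstdev(episode_returns)


# Hypothetical returns from five evaluation episodes (not real results).
example_returns = [500.0, 480.0, 500.0, 350.0, 500.0]
mean_reward, std_reward = summarize_returns(example_returns)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")
```

A single short episode among otherwise full-length ones is enough to produce a large standard deviation, which is why the mean should be read together with the spread.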


## Train the agent and evaluate it

In [9]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

<stable_baselines3.ppo.ppo.PPO at 0x7f4413813be0>

In [10]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:443.32 +/- 73.54


Apparently the training went well: the mean reward increased a lot!

## RL-Scope

Let's annotate the evaluation (inference) loop with RL-Scope annotations to understand where time is spent.
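
The idea of wrapping each phase of the loop in a named operation can be sketched with the standard library alone. The code below is not RL-Scope's API: `operation`, `fake_predict` and `fake_step` are hypothetical names, and the sketch only accumulates Python wall-clock time per label, whereas a profiler such as RL-Scope can also attribute time to lower layers like CUDA calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds and call counts per operation name.
elapsed = defaultdict(float)
calls = defaultdict(int)


@contextmanager
def operation(name):
    """Time the enclosed block and add it to the totals for `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed[name] += time.perf_counter() - start
        calls[name] += 1


def fake_predict(obs):
    """Hypothetical stand-in for model.predict(obs)."""
    return obs % 2


def fake_step(action):
    """Hypothetical stand-in for env.step(action)."""
    return action + 1


obs = 0
for _ in range(100):
    with operation("inference"):
        action = fake_predict(obs)
    with operation("step"):
        obs = fake_step(action)

for name in sorted(elapsed):
    print(f"{name}: {calls[name]} calls, {elapsed[name] * 1e3:.3f} ms total")
```

Using a context manager keeps the annotated loop readable: the code under measurement is unchanged, only indented under a label.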

In [4]:
# !rls-prof --help
!pip freeze | grep torch

torch==1.7.0+cu101
torchsummary==1.5.1
torchtext==0.3.1
torchvision==0.8.1+cu101


In [22]:
%%writefile test_writefile.py

import numpy as np  # needed by np.mean() in evaluate_rlscope

import rlscope.api as rlscope

rlscope.handle_rlscope_args(
    parser=None, args=None, 
    directory="./rlscope_traces",
    # paths['rlscope_directory']
)
rlscope.prof.set_metadata({
    'algo': 'PPO',
    'env': 'CartPole-v1',
})
process_name = 'PPO_CartPole'
phase_name = process_name

def evaluate_rlscope(model, num_episodes=100):
    """
    Evaluate a RL agent
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward for the last num_episodes
    """
    with rlscope.prof.profile(process_name=process_name, phase_name=phase_name):
      # This function will only work for a single Environment
      env = model.get_env()
      all_episode_rewards = []

      with rlscope.prof.operation('training_loop'):
        for i in range(num_episodes):
            episode_rewards = []
            done = False
            obs = env.reset()
            while not done:
                with rlscope.prof.operation('inference'):
                  # _states are only useful when using LSTM policies
                  action, _states = model.predict(obs)
                with rlscope.prof.operation('step'):
                  # here, action, rewards and dones are arrays
                  # because we are using vectorized env
                  obs, reward, done, info = env.step(action)
                  episode_rewards.append(reward)

            all_episode_rewards.append(sum(episode_rewards))

      mean_episode_reward = np.mean(all_episode_rewards)
      print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

      return mean_episode_reward

Overwriting test_writefile.py


In [24]:
!pip freeze

absl-py==0.10.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.1.0
appdirs==1.4.4
argon2-cffi==20.1.0
asgiref==3.3.1
astor==0.8.1
astropy==4.1
astunparse==1.6.3
async-generator==1.10
atari-py==0.2.6
atomicwrites==1.4.0
attrs==20.3.0
audioread==2.1.9
autograd==1.3
Babel==2.9.0
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==3.2.2
blis==0.4.1
bokeh==2.1.1
Bottleneck==1.3.2
branca==0.4.2
bs4==0.0.1
CacheControl==0.12.6
cachetools==4.2.1
catalogue==1.0.0
Cerberus==1.3.2
certifi==2020.12.5
cffi==1.14.4
chainer==7.4.0
chardet==3.0.4
click==7.1.2
cloudpickle==1.3.0
cmake==3.12.0
cmdstanpy==0.9.5
colorlog==4.7.2
colorlover==0.3.0
community==1.0.0b1
contextlib2==0.5.5
convertdate==2.2.0
coverage==3.7.1
coveralls==0.5
crcmod==1.7
cufflinks==0.17.3
cupy-cuda101==7.4.0
cvxopt==1.2.5
cvxpy==1.0.31
cycler==0.10.0
cymem==2.0.5
Cython==0.29.21
daft==0.0.4
dask==2.12.0
dataclasses==0.8
datascience==0.10.6
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.6.0
descartes==1.1.0
dill==0.3.3
distributed==1.25

In [23]:
!rls-prof python test_writefile.py

> CMD:
  $ /usr/local/bin/python test_writefile.py
  PWD=/content
  Environment:
    LD_LIBRARY_PATH=/usr/lib64-nvidia:/usr/local/lib/python3.6/dist-packages/rlscope/cpp/lib
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:librlscope.so
    RLSCOPE_CONFIG=full
    RLSCOPE_CUDA_ACTIVITIES=yes
    RLSCOPE_CUDA_API_CALLS=yes
    RLSCOPE_CUDA_API_EVENTS=yes
    RLSCOPE_FUZZ_CUDA_API=no
    RLSCOPE_GPU_HW=no
    RLSCOPE_PC_SAMPLING=no
    RLSCOPE_STREAM_SAMPLING=no
    RLSCOPE_TRACE_AT_START=no
Stack trace (most recent call last):
#31   Object "/bin/bash", at 0x55cd24281bf1, in execute_command
#30   Object "/bin/bash", at 0x55cd2427ffd6, in execute_command_internal
#29   Object "/bin/bash", at 0x55cd24281bf1, in execute_command
#28   Object "/bin/bash", at 0x55cd2427ffd6, in execute_command_internal
#27   Object "/bin/bash", at 0x55cd24281bf1, in execute_command
#26   Object "/bin/bash", at 0x55cd2427ffd6, in execute_command_internal
#25   Object "/bin/bash", at 0x55cd24281bf1, in 

### Prepare video recording

In [11]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [12]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
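
The embedding trick used by `show_videos` is to inline the file as a base64 data URI so that no separate file has to be served. Here is a minimal sketch of just that step, using placeholder bytes instead of a real recording; `video_tag` and `height_px` are hypothetical names introduced for this example.

```python
import base64


def video_tag(video_bytes, height_px=400):
    """Build an HTML <video> tag with the bytes inlined as a base64 data URI."""
    payload = base64.b64encode(video_bytes).decode("ascii")
    return (
        f'<video autoplay loop controls style="height: {height_px}px;">'
        f'<source src="data:video/mp4;base64,{payload}" type="video/mp4" />'
        "</video>"
    )


# Placeholder bytes standing in for the contents of an .mp4 file.
tag = video_tag(b"not a real video")
```

Base64 encoding grows the payload by roughly a third, so this approach suits short clips like the ones recorded here.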

We will record a video using the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper. You will learn about these wrappers in the next notebook.

In [13]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record video_length steps (500 by default)
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

### Visualize trained agent



In [14]:
record_video('CartPole-v1', model, video_length=500, prefix='ppo2-cartpole')

In [15]:
show_videos('videos', prefix='ppo2')

## Bonus: Train an RL Model in One Line

The policy class to use will be inferred and the environment will be automatically created. This works because both are [registered](https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html).

In [None]:
model = PPO('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)
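
Passing an environment id as a string relies on a registry that maps ids to constructors. The following toy version shows the mechanism only; `ENV_REGISTRY`, `register`, `make_env` and `DummyEnv` are hypothetical names and do not reproduce the gym or Stable-Baselines3 implementation.

```python
# Toy registry mapping environment ids to constructors (illustration only).
ENV_REGISTRY = {}


def register(env_id, constructor):
    """Associate an id string with a callable that builds the environment."""
    ENV_REGISTRY[env_id] = constructor


def make_env(env_or_id):
    """Build an environment from a registered id, or pass an instance through."""
    if isinstance(env_or_id, str):
        if env_or_id not in ENV_REGISTRY:
            raise KeyError(f"No environment registered under {env_or_id!r}")
        return ENV_REGISTRY[env_or_id]()
    return env_or_id


class DummyEnv:
    """Hypothetical placeholder environment."""


register("Dummy-v0", DummyEnv)
env = make_env("Dummy-v0")
```

Accepting either a string or an instance is what lets the same constructor argument serve both the one-line form and the explicit `gym.make` form used earlier.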

## Conclusion

In this notebook we have seen:
- how to define and train an RL model using Stable-Baselines3; it takes only one line of code ;)