<a href="https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/1_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines3 Tutorial - Getting Started

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3-Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.


## Introduction

In this notebook, you will learn the basics of using the Stable Baselines3 library: how to create an RL model, train it, and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.


## Install Dependencies and Stable Baselines3 Using Pip 
**We have already installed the packages**

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
# !apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
# !pip install "stable-baselines3[extra]>=2.0.0a4"

In [3]:
# !pip install gymnasium[other]

## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environments [here](https://gymnasium.farama.org/environments/classic_control/).

Not all algorithms work with all action spaces. You can find more details in this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html).

In [4]:
import gymnasium as gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to know what you can use on which problem.

In [5]:
from stable_baselines3 import PPO

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor: 

```PPO('MlpPolicy', env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy`, which is why using a string for the policy is the recommended option.

In [6]:
from stable_baselines3.ppo.policies import MlpPolicy

## Create the Gym env and instantiate the agent

For this example, we will use the CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole environment: [https://gymnasium.farama.org/environments/classic_control/cart_pole/](https://gymnasium.farama.org/environments/classic_control/cart_pole/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)


We chose the MlpPolicy because the observation of the CartPole task is a feature vector, not images.

The type of action to use (discrete/continuous) is automatically deduced from the environment's action space.

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) algorithm, which is an Actor-Critic method: it uses a value function to improve the policy gradient estimate (by reducing its variance).

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy algorithms like [DQN](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but is often much faster in terms of wall-clock time.
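
As a rough illustration of the variance-reduction idea, the sketch below computes discounted returns for a short trajectory and subtracts a value estimate to obtain advantages. It is a simplification: SB3's PPO uses Generalized Advantage Estimation rather than this plain return-minus-baseline form, and the function names, rewards, and value estimates are made up for the example.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each step can reuse the next step's return
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma):
    """Advantage estimate: discounted return minus the critic's value estimate."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


# Hypothetical 3-step episode with a reward of +1 per step, as in CartPole
rewards = [1.0, 1.0, 1.0]
values = [1.0, 1.0, 1.0]  # made-up critic predictions

print(discounted_returns(rewards, gamma=0.5))
print(advantages(rewards, values, gamma=0.5))
```

The policy gradient is then weighted by the advantages instead of the raw returns, so only the part of the return that the critic did not already predict drives the update.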


In [None]:

import torch as th 

env = gym.make("CartPole-v1")

# Custom neural network of two layers of size 32 each with ReLU activation function
# Note: an extra linear layer will be added on top of the pi and the vf nets, respectively
policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=dict(pi=[32, 32], vf=[32, 32]))
# Create the agent (passing the env id lets SB3 build and wrap its own environment)
model = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1, device="cpu")

# model = PPO(MlpPolicy, env, verbose=0)

## We create a helper function to evaluate the agent:

In [8]:
from stable_baselines3.common.base_class import BaseAlgorithm


def evaluate(
    model: BaseAlgorithm,
    num_episodes: int = 100,
    deterministic: bool = True,
) -> float:
    """
    Evaluate an RL agent for `num_episodes`.

    :param model: the RL Agent
    :param num_episodes: number of episodes to evaluate it
    :param deterministic: Whether to use deterministic or stochastic actions
    :return: Mean reward over `num_episodes`
    """
    # This function will only work for a single environment
    vec_env = model.get_env()
    obs = vec_env.reset()
    all_episode_rewards = []
    for _ in range(num_episodes):
        episode_rewards = []
        done = False
        # Note: SB3 VecEnv resets automatically:
        # https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecenv-api-vs-gym-api
        # obs = vec_env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            # `deterministic` is to use deterministic actions
            action, _states = model.predict(obs, deterministic=deterministic)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, _info = vec_env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print(f"Mean reward: {mean_episode_reward:.2f} - Num episodes: {num_episodes}")

    return mean_episode_reward

Let's evaluate the untrained agent. Its policy network is randomly initialized, so it should perform about as poorly as a random agent.

In [None]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_episodes=100, deterministic=True)

## Stable-Baselines3 already provides you with that helper:
```
evaluate_policy
```

In [10]:
from stable_baselines3.common.evaluation import evaluate_policy

In [None]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

## Train the agent and evaluate it

In [None]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10_000)


### **Training Output Breakdown**

Here's a sample of the training output:

```
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 51.8         |
|    ep_rew_mean          | 51.8         |
| time/                   |              |
|    fps                  | 1499         |
|    iterations           | 5            |
|    time_elapsed         | 6            |
|    total_timesteps      | 10240        |
| train/                  |              |
|    approx_kl            | 0.0072779306 |
|    clip_fraction        | 0.0377       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.639       |
|    explained_variance   | 0.00668      |
|    learning_rate        | 0.0003       |
|    loss                 | 39.5         |
|    n_updates            | 40           |
|    policy_gradient_loss | -0.0129      |
|    value_loss           | 70.8         |
------------------------------------------ 
```

This output is typically displayed by SB3 to provide real-time insights into the training progress.  

---

#### **1. Rollout Metrics (`rollout/`)**

**Purpose:** Monitor the agent's performance during the rollout phase, where it interacts with the environment to collect experiences.

- **`ep_len_mean` (Episode Length Mean):**  
  - **Value:** `51.8`  
  - **Explanation:** Represents the **average length** (in steps) of episodes over recent rollouts. A higher value indicates that the agent is sustaining interactions with the environment longer, which often correlates with better performance, especially in environments where the goal is to maximize episode duration (e.g., balancing tasks).

- **`ep_rew_mean` (Episode Reward Mean):**  
  - **Value:** `51.8`  
  - **Explanation:** Denotes the **average cumulative reward** obtained per episode during recent rollouts. This metric is a direct indicator of the agent's performance—the higher the average reward, the better the agent is performing in achieving its objectives.
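
To make "average over recent rollouts" concrete, here is a minimal sketch of a sliding-window mean. It assumes, to our knowledge, that SB3 averages over a fixed-size buffer of the most recent episodes (100 by default); the helper name and the episode lengths are hypothetical. It also shows why both metrics are identical for CartPole: with a reward of +1 per step, an episode's return equals its length.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean of the last `window` values, reported after each new value."""
    buffer = deque(maxlen=window)  # old entries are dropped automatically
    means = []
    for value in values:
        buffer.append(value)
        means.append(sum(buffer) / len(buffer))
    return means


# Hypothetical episode lengths; in CartPole the return equals the length
episode_lengths = [10, 20, 30, 40]
episode_rewards = [float(length) for length in episode_lengths]

ep_len_mean = rolling_mean(episode_lengths, window=2)
ep_rew_mean = rolling_mean(episode_rewards, window=2)
print(ep_len_mean)
print(ep_rew_mean)
```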

---

#### **2. Time Metrics (`time/`)**

**Purpose:** Provide information about the training process's temporal aspects, including speed and progression.

- **`fps` (Frames Per Second):**  
  - **Value:** `1499`  
  - **Explanation:** Indicates the **number of frames (steps)** processed per second. A higher FPS means faster training, allowing the agent to learn more efficiently. This metric is influenced by factors like environment complexity, hardware capabilities, and whether multiple environments are running in parallel.

- **`iterations`:**  
  - **Value:** `5`  
  - **Explanation:** Represents the **number of training iterations** completed. Each iteration typically involves collecting a batch of experiences and performing updates to the agent's policy and value networks.

- **`time_elapsed`:**  
  - **Value:** `6`  
  - **Explanation:** Shows the **total time elapsed** since the start of training, in **seconds**. Here, `10240` steps at roughly `1499` steps per second is consistent with the reported `6` to `7` seconds.

- **`total_timesteps`:**  
  - **Value:** `10240`  
  - **Explanation:** Denotes the **cumulative number of environment steps** taken by the agent across all parallel environments. This metric is crucial for understanding how much experience the agent has accumulated, which directly impacts learning.
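
The sample output reports `10240` timesteps although training was requested for `10_000`. A small sketch of the arithmetic, assuming PPO's default rollout length of `n_steps=2048` and a single environment, and assuming training stops only after a full rollout has been collected (the helper name is hypothetical):

```python
import math


def rollout_budget(total_timesteps, n_steps=2048, n_envs=1):
    """Return (iterations, timesteps actually collected) for an on-policy run."""
    # Each iteration collects n_steps transitions from every environment
    steps_per_iteration = n_steps * n_envs
    iterations = math.ceil(total_timesteps / steps_per_iteration)
    return iterations, iterations * steps_per_iteration


print(rollout_budget(10_000))  # (5, 10240), matching the sample output
```

Under the same assumptions, running several environments in parallel increases the number of steps per iteration and therefore the granularity at which `total_timesteps` is rounded up.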

---

#### **3. Training Metrics (`train/`)**

**Purpose:** Provide detailed insights into the optimization process, including losses, learning rates, and policy updates.

- **`approx_kl` (Approximate Kullback-Leibler Divergence):**  
  - **Value:** `0.0072779306`  
  - **Explanation:** Measures the **difference between the new and old policy distributions**. It is a diagnostic for how far an update moved the policy: small values indicate minor changes, larger values more significant ones. PPO's stability comes primarily from clipping; in SB3, `approx_kl` additionally stops an update early only when `target_kl` is set.

- **`clip_fraction`:**  
  - **Value:** `0.0377`  
  - **Explanation:** Indicates the **fraction of samples** whose probability ratio between the new and old policy fell outside `[1 - clip_range, 1 + clip_range]` and was therefore **clipped** during the optimization step. PPO uses this clipping mechanism to restrict how much the policy can change in a single update, enhancing training stability. A low clip fraction means most samples stayed within the clipping range, whereas a higher value means the policy is trying to move further than the clip allows.

- **`clip_range`:**  
  - **Value:** `0.2`  
  - **Explanation:** The **clipping parameter** used in PPO. It defines the **maximum allowable change** in the policy ratio between the new and old policies. A standard value is `0.2`, meaning that the policy ratio is clipped to stay within `[1 - 0.2, 1 + 0.2]`.

- **`entropy_loss`:**  
  - **Value:** `-0.639`  
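  - **Explanation:** Represents the **entropy term** of the total loss, computed in SB3 as the negative mean entropy of the policy. It is therefore negative whenever the policy is stochastic: `-0.639` corresponds to an entropy of about `0.639` nats, close to the maximum of `ln 2 ≈ 0.693` for CartPole's two actions. Values moving toward `0` mean the policy is becoming more deterministic. The term only influences training when the entropy coefficient `ent_coef` is non-zero.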

- **`explained_variance`:**  
  - **Value:** `0.00668`  
  - **Explanation:** Measures how well the value function predicts the returns. It ranges from `-∞` to `1`, where `1` means perfect prediction, `0` indicates no predictive power, and negative values suggest worse-than-mean predictions. A value close to `0` (as in this case) implies that the value function is not effectively capturing the variance in returns.

- **`learning_rate`:**  
  - **Value:** `0.0003`  
  - **Explanation:** The **current learning rate** used by the optimizer. This parameter dictates the **step size** during gradient descent. A lower learning rate can lead to more stable but slower convergence, while a higher rate may speed up training but risk overshooting minima.

- **`loss`:**  
  - **Value:** `39.5`  
  - **Explanation:** The **total loss** combining all components (policy loss, value loss, entropy loss). This scalar value guides the optimization process. Monitoring loss helps in diagnosing training progress and convergence behavior.

- **`n_updates`:**  
  - **Value:** `40`  
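  - **Explanation:** The **cumulative number of optimization epochs** performed so far, not the count for the current iteration. PPO runs several epochs (`n_epochs`, 10 by default in SB3) over each batch of collected experiences; a value of `40` at iteration `5` is consistent with four completed update rounds, since the log is written before the fifth round of updates.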

- **`policy_gradient_loss`:**  
  - **Value:** `-0.0129`  
  - **Explanation:** The **policy loss** component: the negative of PPO's clipped surrogate objective, averaged over the samples used in the update. Minimizing it pushes the policy toward actions with a higher estimated advantage. Because advantages are normalized by default, its sign and magnitude say little about performance on their own; `ep_rew_mean` is the more reliable indicator.

- **`value_loss`:**  
  - **Value:** `70.8`  
  - **Explanation:** The **value function loss**, which measures the discrepancy between the predicted value and the actual returns. Minimizing this loss improves the accuracy of the value function, which is crucial for estimating future rewards.
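
The following plain-Python sketch shows how `policy_gradient_loss`, `clip_fraction` and `approx_kl` can be computed from per-sample log-probabilities. It mirrors our understanding of the formulas used by SB3's PPO but is not its implementation; the function name and the toy inputs are hypothetical.

```python
import math


def ppo_diagnostics(old_log_probs, new_log_probs, advantages, clip_range=0.2):
    """Return (policy_gradient_loss, clip_fraction, approx_kl) for one batch."""
    log_ratios = [new - old for new, old in zip(new_log_probs, old_log_probs)]
    ratios = [math.exp(log_ratio) for log_ratio in log_ratios]

    # Clipped surrogate objective: take the more pessimistic of the two terms
    surrogate = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1 - clip_range), 1 + clip_range)
        surrogate.append(min(ratio * advantage, clipped_ratio * advantage))

    n = len(ratios)
    policy_gradient_loss = -sum(surrogate) / n
    # Share of samples whose ratio left the interval [1 - clip, 1 + clip]
    clip_fraction = sum(abs(ratio - 1) > clip_range for ratio in ratios) / n
    # Low-variance KL estimator: mean of (ratio - 1) - log(ratio)
    approx_kl = sum((r - 1) - lr for r, lr in zip(ratios, log_ratios)) / n
    return policy_gradient_loss, clip_fraction, approx_kl


# Toy batch: the old policy gave each action probability 0.5
old_log_probs = [math.log(0.5)] * 4
new_log_probs = [math.log(p) for p in (0.5, 0.7, 0.55, 0.3)]
advantages = [1.0, 1.0, -1.0, -1.0]

loss, clip_fraction, approx_kl = ppo_diagnostics(
    old_log_probs, new_log_probs, advantages
)
print(f"loss={loss:.4f} clip_fraction={clip_fraction:.2f} approx_kl={approx_kl:.4f}")
```

In this toy batch, two of the four ratios (about 1.4 and 0.6) fall outside `[0.8, 1.2]`, so half of the samples count as clipped.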

---

### **Visualizing and Interpreting the Metrics**

Understanding these metrics allows you to monitor the training process effectively:

- **Performance Metrics (`rollout/`):**  
  Track how well the agent is performing in the environment. An increase in `ep_rew_mean` and `ep_len_mean` typically indicates improving performance.

- **Training Progress (`time/`):**  
  Observe the speed (`fps`) and progression (`iterations`, `total_timesteps`) of training. High FPS is beneficial for faster experimentation.

- **Optimization Health (`train/`):**  
  - **Stability:** Low `approx_kl` and `clip_fraction` suggest stable policy updates.
  - **Exploration vs. Exploitation:** An `entropy_loss` far from `0` means the policy is still stochastic and exploring; a value approaching `0` means it is becoming deterministic. Balancing this is key to effective learning.
  - **Learning Efficiency:** Appropriate `learning_rate` and decreasing `value_loss` indicate efficient learning.
  - **Value Function Accuracy:** Higher `explained_variance` signifies better value predictions.
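
Two of these diagnostics are easy to reproduce by hand. The sketch below computes the explained variance of a set of value predictions and the entropy of a discrete action distribution. It assumes the usual definition `1 - Var(returns - predictions) / Var(returns)` and that `entropy_loss` is the negative mean policy entropy; the function names and numbers are hypothetical.

```python
import math
from statistics import pvariance


def explained_variance(predictions, returns):
    """1 is a perfect fit, 0 is no better than a constant, below 0 is worse."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when the returns do not vary
    residuals = [ret - pred for ret, pred in zip(returns, predictions)]
    return 1 - pvariance(residuals) / var_returns


def categorical_entropy(probs):
    """Entropy in nats of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


returns = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(returns, returns))    # perfect predictions
print(explained_variance([2.5] * 4, returns))  # constant predictions

# A uniform policy over two actions has the maximum entropy ln(2)
print(categorical_entropy([0.5, 0.5]))
print(categorical_entropy([0.9, 0.1]))
```

Under these assumptions, the sample `entropy_loss` of `-0.639` sits just below the two-action maximum of about `0.693`, which suggests a policy that is still close to uniform.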

---

### **Practical Tips for Monitoring and Optimization**

1. **Monitor Trends Over Time:**  
   Rather than focusing on individual metric values, observe how they evolve over iterations. For instance, `ep_rew_mean` should generally increase, and `value_loss` should decrease as training progresses.

2. **Adjust Hyperparameters Based on Metrics:**  
   - If `clip_fraction` is too high, the policy is trying to move further than the clipping allows on many samples; consider lowering the learning rate or the number of epochs (`n_epochs`) per update.
   - If `approx_kl` is consistently low, you might increase the `clip_range` to permit more substantial policy changes.
   - A very low or negative `explained_variance` suggests that the value function needs improvement—consider modifying the network architecture or training parameters.

3. **Ensure Balance Between Exploration and Exploitation:**  
   `entropy_loss` is a diagnostic rather than a setting; the corresponding hyperparameter is the entropy coefficient `ent_coef`. Too much exploration can prevent the agent from exploiting learned strategies, while too little can lead to premature convergence to suboptimal policies.

4. **Optimize Learning Rate:**  
   Monitor `learning_rate` and adjust if necessary. Learning rate schedulers can help in reducing the learning rate as training progresses, aiding in fine-tuning the policy.

5. **Use Additional Logging and Visualization Tools:**  
   Integrate tools like **Weights & Biases (wandb)** or **TensorBoard** to visualize these metrics over time, making it easier to diagnose and understand training dynamics.
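
As a sketch of the learning rate scheduling mentioned in tip 4: SB3 accepts a callable for `learning_rate` that receives the remaining training progress, going from `1.0` at the start to `0.0` at the end. The helper name `linear_schedule` is our own choice for this example.

```python
def linear_schedule(initial_value):
    """Build a schedule that decays linearly from `initial_value` to 0."""

    def schedule(progress_remaining):
        # progress_remaining goes from 1.0 (start) down to 0.0 (end)
        return progress_remaining * initial_value

    return schedule


lr_schedule = linear_schedule(3e-4)
for progress_remaining in (1.0, 0.5, 0.0):
    print(progress_remaining, lr_schedule(progress_remaining))
```

Such a callable could then be passed as `learning_rate=linear_schedule(3e-4)` when constructing the model, in place of the constant `0.0003` shown in the sample output.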

---


In [None]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

The training went well: the mean reward increased a lot!

### Prepare video recording

In [14]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [9]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay


def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper. You will learn about these wrappers in the next notebook.

In [10]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv


def record_video(env_id, model, video_length=500, prefix="", video_folder="videos/"):
    """
    :param env_id: (str)
    :param model: (RL model)
    :param video_length: (int)
    :param prefix: (str)
    :param video_folder: (str)
    """
    eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])
    # Start the video at step=0 and record `video_length` steps
    eval_env = VecVideoRecorder(
        eval_env,
        video_folder=video_folder,
        record_video_trigger=lambda step: step == 0,
        video_length=video_length,
        name_prefix=prefix,
    )

    obs = eval_env.reset()
    for _ in range(video_length):
        action, _ = model.predict(obs)
        obs, _, _, _ = eval_env.step(action)

    # Close the video recorder
    eval_env.close()

### Visualize trained agent



In [None]:
record_video("CartPole-v1", model, video_length=500, prefix="ppo-cartpole")

In [None]:
show_videos("videos", prefix="ppo-cartpole")

## Bonus 1: Train an RL Model in One Line

The policy class to use will be inferred and the environment will be automatically created. This works because both are [registered](https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html).

In [None]:
model = PPO('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)

## Whole pipeline of training an RL agent using SB3

In [None]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecVideoRecorder
import torch as th
import numpy as np
import time 

# Environment ID
env_id = "Pendulum-v1"

# Create the Gymnasium environment
env = gym.make(env_id, render_mode="rgb_array")


# Define custom policy network architecture
policy_kwargs = dict(
    activation_fn=th.nn.ReLU,
    net_arch=dict(pi=[32, 32], vf=[32, 32])
)

# Initialize the PPO agent
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=policy_kwargs,
    verbose=0,
    device="cpu"
)

# Evaluate the untrained agent
print("=== Evaluating the untrained agent ===")
avg_reward_before, _ = evaluate_policy(model, env, n_eval_episodes=5, render=False)
print(f"Average Reward before training over 5 episodes: {avg_reward_before:.2f}")

record_video("Pendulum-v1", model, video_length=100, prefix="ppo_untrained_pendulum")

start_time = time.time()
# Train the PPO agent
print("\n=== Training the PPO agent ===")
model.learn(
    total_timesteps=100_000,
    log_interval=20   
)
end_time = time.time()
training_time = end_time - start_time
print(f"\n=== Training finished in {training_time:.2f} seconds ===")

# Evaluate the trained agent
print("\n=== Evaluating the trained agent ===")
avg_reward_after, _ = evaluate_policy(model, env, n_eval_episodes=5, render=False)
print(f"Average Reward after training over 5 episodes: {avg_reward_after:.2f}")

record_video("Pendulum-v1", model, video_length=100, prefix="ppo_trained_pendulum")

# Structured Summary
print("\n=== Training Summary ===")
print(f"Environment: {env_id}")
print(f"Average Reward before Training: {avg_reward_before:.2f}")
print(f"Average Reward after Training: {avg_reward_after:.2f}")






In [None]:
show_videos("videos", prefix="ppo_untrained_pendulum")
show_videos("videos", prefix="ppo_trained_pendulum")

## Bonus 2: Multiple Environments

In [None]:
import gymnasium as gym
import time
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Parallel environments
vec_env = make_vec_env("Pendulum-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=0)

# Time only the training call, as in the single-environment pipeline
print("\n=== Training the PPO agent ===")
start_time = time.time()
model.learn(total_timesteps=100_000, log_interval=10)
training_time = time.time() - start_time
print(f"\n=== Training finished in {training_time:.2f} seconds ===")


avg_reward_after, _ = evaluate_policy(model, vec_env, n_eval_episodes=5, render=False)
print(f"Average Reward after training over 5 episodes: {avg_reward_after:.2f}")

## Conclusion

In this notebook we have seen:
- how to define and train an RL model using Stable Baselines3; it takes only one line of code ;)