# Stable Baselines 3 - Tutorial

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

### Introduction

Learn how to create an RL model, train it, and evaluate it.

In [4]:
# for autoformatting
# %load_ext jupyter_black

In [9]:
# Installing dependencies

!apt-get install ffmpeg freeglut3-dev xvfb
%pip install "stable-baselines3[extra]>=2.0.0a4"

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


# Imports

Stable-Baselines3 environments follow the [Gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).


[List of available environments](https://gymnasium.farama.org/environments/classic_control/)

Not every algorithm supports every action space; see this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html) for details.
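
As a rough illustration of that recap table, the lookup below encodes the action spaces accepted by the core SB3 algorithms, transcribed from the documentation. The `SUPPORTED_ACTION_SPACES` mapping and the `compatible_algorithms` helper are hypothetical names for this sketch, not part of the library, and the linked table remains the authoritative source.

```python
from typing import List

# Action spaces supported by the core Stable-Baselines3 algorithms,
# transcribed from the documentation's recap table (may change between releases).
SUPPORTED_ACTION_SPACES = {
    "A2C": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "PPO": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "DQN": {"Discrete"},
    "DDPG": {"Box"},
    "SAC": {"Box"},
    "TD3": {"Box"},
}


def compatible_algorithms(action_space: str) -> List[str]:
    """Return the algorithms (sorted by name) that accept `action_space`."""
    return sorted(
        name
        for name, spaces in SUPPORTED_ACTION_SPACES.items()
        if action_space in spaces
    )


# CartPole has a Discrete action space.
print(compatible_algorithms("Discrete"))  # ['A2C', 'DQN', 'PPO']
```

PPO appears in the result, which is why it is a valid choice for CartPole.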

In [12]:
import gymnasium as gym
import numpy as np

# First, select and import the RL algorithm.
# Check the Stable-Baselines3 docs to see which algorithm suits which problem.
from stable_baselines3 import PPO

# Import the policy class that will be used to create the networks 
from stable_baselines3.ppo.policies import MlpPolicy

# Create Gym env and instantiate the agent

We will use the [CartPole environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/), a classic inverted-pendulum control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright."


We chose the MlpPolicy because the observation of the CartPole task is a feature vector, not images.

The type of action to use (discrete or continuous) is deduced automatically from the environment's action space.

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) algorithm, which is an actor-critic method: it uses a value function to improve the policy gradient update by reducing its variance.
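
To make the variance-reduction idea concrete, here is a minimal sketch of how a critic's value estimates act as a baseline: the policy gradient is weighted by the advantage (return minus predicted value) instead of the raw return. The function names and the `values` list are assumptions made for illustration. SB3's PPO uses generalized advantage estimation, a refinement of this basic scheme.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Compute the discounted return G_t = r_t + gamma * G_{t+1} for each step."""
    returns = []
    running = 0.0
    # Walk backwards so each return can reuse the one that follows it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns: List[float], values: List[float]) -> List[float]:
    """Subtract the critic's value estimate (the baseline) from each return."""
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]  # CartPole-style reward of +1 per step
returns = discounted_returns(rewards, gamma=1.0)  # [3.0, 2.0, 1.0]
values = [2.5, 2.0, 1.5]  # hypothetical critic predictions
print(advantages(returns, values))  # [0.5, 0.0, -0.5]
```

The advantages are centered near zero, so the gradient signal reflects how much better or worse an action was than expected rather than the absolute size of the return.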

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (multiple workers and an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (a trust region that improves stability and avoids catastrophic drops in performance).
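
PPO approximates the trust region by clipping the probability ratio between the new and the old policy. The sketch below computes the clipped surrogate objective for a single sample. The function name is hypothetical, and the default `clip_range=0.2` mirrors SB3's default for PPO. A real implementation averages this over a batch and maximizes it with gradient ascent.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped objective for one sample.

    `ratio` is pi_new(a|s) / pi_old(a|s). Taking the minimum of the clipped
    and unclipped terms removes any incentive to move the ratio outside
    [1 - clip_range, 1 + clip_range].
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, but the policy already moved far: the gain is capped.
print(clipped_surrogate(1.5, 1.0))   # 1.2
# Bad action whose probability was lowered a lot: the objective is capped too.
print(clipped_surrogate(0.5, -1.0))  # -0.8
# Small policy change: clipping has no effect.
print(clipped_surrogate(1.1, 1.0))   # 1.1
```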

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample-efficient than off-policy algorithms like [DQN](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but much faster in wall-clock time.
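
The difference in sample efficiency comes largely from how collected data is stored. The toy classes below are hypothetical stand-ins, not the SB3 buffer classes, that contrast the two approaches: on-policy data is thrown away after each update, while off-policy data stays available for reuse.

```python
import random
from collections import deque


class ToyRolloutBuffer:
    """On-policy storage: emptied after every policy update."""

    def __init__(self):
        self.transitions = []

    def add(self, transition):
        self.transitions.append(transition)

    def consume(self):
        # Hand over everything collected with the current policy, then forget it.
        batch, self.transitions = self.transitions, []
        return batch


class ToyReplayBuffer:
    """Off-policy storage: keeps old transitions up to a fixed capacity."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size, rng):
        # Old transitions can be drawn again and again.
        return rng.sample(list(self.transitions), batch_size)


rollout, replay = ToyRolloutBuffer(), ToyReplayBuffer(capacity=100)
for step in range(5):
    rollout.add(step)
    replay.add(step)

batch = rollout.consume()
print(len(batch), len(rollout.transitions))  # 5 0
print(len(replay.transitions))  # 5
```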


In [22]:
env = gym.make("CartPole-v1", render_mode="rgb_array")

model = PPO(MlpPolicy, env, verbose=0)

In [25]:
# Create a helper function to evaluate the agent

from stable_baselines3.common.base_class import BaseAlgorithm
import numpy as np

def evaluate(
  model: BaseAlgorithm,
  num_episodes: int = 100,
  deterministic: bool = True,
) -> float:
  """
    Evaluate an RL agent for `num_episodes`.

    :param model: the RL agent
    :param num_episodes: number of episodes to evaluate it for
    :param deterministic: whether to use deterministic or stochastic actions
    :return: mean reward over the `num_episodes` evaluated episodes
  """

  # This function only works for a single environment
  vec_env = model.get_env()

  # Get the observations
  obs = vec_env.reset()
  print("Observations", obs)
  
  all_episode_rewards = []
  
  for i in range(num_episodes):
    episode_rewards = []
    done = False
    
    while not done:
      action, _states = model.predict(obs, deterministic=deterministic)
      
      obs, reward, done, _info = vec_env.step(action)
      episode_rewards.append(reward)
    
    all_episode_rewards.append(sum(episode_rewards))
    
  mean_episode_reward = float(np.mean(all_episode_rewards))
  print(f"Mean episode reward: {mean_episode_reward:.2f} - {num_episodes} episodes")
  return mean_episode_reward
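
The loop above never calls `reset()` between episodes. It relies on the vectorized environment returned by `model.get_env()`, whose `step()` returns a 4-tuple of batched values and resets a sub-environment automatically when its episode ends. The toy class below is a hypothetical stand-in that mimics this behaviour for a single environment, so the pattern can be tried without installing anything.

```python
class ToyAutoResetEnv:
    """Hypothetical stand-in for a single-environment VecEnv: episodes last
    `horizon` steps and the environment resets itself when one ends."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        obs = [float(self.t)]
        if done:
            # Auto-reset: the returned observation already belongs to the next episode.
            obs = self.reset()
        # Batched return values: observations, rewards, dones, infos.
        return obs, [1.0], [done], [{}]


env = ToyAutoResetEnv(horizon=3)
obs = env.reset()
episode_returns = []
for _ in range(2):
    total, done = 0.0, False
    while not done:
        obs, rewards, dones, infos = env.step(action=0)
        total += rewards[0]
        done = dones[0]
    episode_returns.append(total)

print(episode_returns)  # [3.0, 3.0]
```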

In [26]:
# Evaluate the untrained agent, which should behave roughly like a random agent

mean_reward_before_training = evaluate(model, num_episodes=100, deterministic=True)

Observations [[-0.00556318  0.00632022 -0.03088834 -0.03041439]]
Mean episode reward: 9.02 - 100 episodes


In [27]:
# Stable-Baselines3 provides a helper for this

from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

print(f"Mean reward:{mean_reward:.2f} +/- {std_reward:.2f}")

Mean reward:8.94 +/- 0.70
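
The two numbers reported by `evaluate_policy` summarize the per-episode returns. A minimal sketch of that summary, assuming the population standard deviation (which is what `np.std` computes by default) and using made-up episode returns, could look like this. `summarize_returns` is a hypothetical helper name.

```python
import statistics
from typing import List, Tuple


def summarize_returns(episode_returns: List[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of per-episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Made-up returns for four short episodes of an untrained agent.
mean_reward, std_reward = summarize_returns([9.0, 8.0, 10.0, 9.0])
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
# Mean reward: 9.00 +/- 0.71
```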


# Train the agent and evaluate it

In [29]:
# Train the agent for 50,000 steps
model.learn(total_timesteps=50_000)

<stable_baselines3.ppo.ppo.PPO at 0x7fbc5cc03f70>

In [30]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Mean reward: 500.00 +/- 0.00


In [31]:
env.render()

array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ...,

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]