# Stable Baselines3 - Easy Multiprocessing

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install stable-baselines3[extra]
```

In [None]:
# for autoformatting
# %load_ext jupyter_black

In [None]:
!pip install "stable-baselines3[extra]>=2.0.0a4"

Collecting stable-baselines3>=2.0.0a4 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading stable_baselines3-2.6.0a0-py3-none-any.whl.metadata (4.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3.0,>=2.3->stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading nvidia_cudnn_cu12-9

In [None]:
!pip install git+https://github.com/cubecloud/sb3-rllab.git

Collecting git+https://github.com/cubecloud/sb3-rllab.git
  Cloning https://github.com/cubecloud/sb3-rllab.git to /tmp/pip-req-build-a3wgdvi4
  Running command git clone --filter=blob:none --quiet https://github.com/cubecloud/sb3-rllab.git /tmp/pip-req-build-a3wgdvi4
  Resolved https://github.com/cubecloud/sb3-rllab.git to commit fc032635f5c21c6ffb4138429a4efd239f233b6a
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sb3_rllab
  Building wheel for sb3_rllab (setup.py) ... done
  Created wheel for sb3_rllab: filename=sb3_rllab-0.24-py3-none-any.whl size=8650 sha256=f7c9ac90aa988f0b4635076996bf52e47751cd4cd0ba65326da97fe6ff37e03b
  Stored in directory: /tmp/pip-ephem-wheel-cache-an1pjx8r/wheels/27/a7/76/3797116c55f4e02e7fc97f41c96c4d3a1e5eae0533c5f9cfde
Successfully built sb3_rllab
Installing collected packages: sb3_rllab
Successfully installed sb3_rllab-0.24


In [None]:
from sb3_rllab import LabSubprocVecEnv

## Import policy, RL agent, ...

In [None]:
import time

import gymnasium as gym
import numpy as np

from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env




## Multiprocessing RL Training

To multiprocess RL training, we just have to wrap the Gym env in a `SubprocVecEnv` object, which takes care of synchronising the processes. Each process runs an independent instance of the Gym env.

For that, we need an additional utility function, `make_env`, that instantiates the environments and makes sure they differ (by using a different random seed for each).

In [None]:
from typing import Callable


def make_env(env_id: str, rank: int, seed: int = 0) -> Callable:
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    :return: (Callable) a function that creates and seeds the environment
    """

    def _init() -> gym.Env:
        env = gym.make(env_id)
        env.reset(seed=seed + rank)
        return env

    set_random_seed(seed)
    return _init
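
The factory pattern used by `make_env` can be illustrated without Gym at all. The sketch below is a simplified stand-in, not SB3 code: `make_rng_factory` is a hypothetical helper that returns a zero-argument callable building a seeded random generator, in the same way `make_env` returns `_init`. It also shows one reason a factory function is convenient: a bare `lambda` written inside a list comprehension would capture the loop variable by reference, so every callable would see the last rank.

```python
import random
from typing import Callable, List


def make_rng_factory(rank: int, seed: int = 0) -> Callable[[], random.Random]:
    """Hypothetical stand-in for make_env: returns a callable, not the object itself."""

    def _init() -> random.Random:
        # Each rank gets its own seed, mirroring `seed + rank` in make_env.
        return random.Random(seed + rank)

    return _init


# One factory per worker; nothing is instantiated until a factory is called.
factories: List[Callable[[], random.Random]] = [make_rng_factory(rank) for rank in range(4)]
first_draws = [factory().random() for factory in factories]

# Calling the same factory again reproduces the same stream (same seed).
repeat_draw = make_rng_factory(2)().random()

# Pitfall avoided by the factory: these lambdas all share one `rank` variable.
late_bound = [lambda: rank for rank in range(4)]
late_values = [f() for f in late_bound]

print(first_draws)
print(late_values)
```

Because each factory closes over its own `rank` argument, the four generators are seeded differently, whereas the late-bound lambdas all return the final loop value.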

The number of parallel processes is defined by the `num_cpu` variable.

Because we use a vectorized environment (`SubprocVecEnv`), the actions sent to the wrapped env must be an array (one action per process). Likewise, observations, rewards and dones are returned as arrays.
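
To make the "one action in, one result out per environment" contract concrete, here is a minimal pure-Python sketch. `ToyCounterEnv` and `ToyVecEnv` are hypothetical classes written for illustration only; they are not part of Stable Baselines3 and run in a single process. The sketch does mimic one documented behaviour of SB3 vectorized environments: a sub-environment is reset automatically when its episode ends.

```python
from typing import List, Tuple


class ToyCounterEnv:
    """Hypothetical single environment: the observation is the step count."""

    def __init__(self, horizon: int) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.t += 1
        reward = float(action)  # toy reward: simply echo the action
        done = self.t >= self.horizon
        return self.t, reward, done


class ToyVecEnv:
    """Hypothetical vectorized wrapper: steps every environment with its own action."""

    def __init__(self, envs: List[ToyCounterEnv]) -> None:
        self.envs = envs

    def reset(self) -> List[int]:
        return [env.reset() for env in self.envs]

    def step(self, actions: List[int]) -> Tuple[List[int], List[float], List[bool]]:
        if len(actions) != len(self.envs):
            raise ValueError("expected exactly one action per environment")
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                # Auto-reset: return the first observation of the next episode.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env_demo = ToyVecEnv([ToyCounterEnv(horizon=2), ToyCounterEnv(horizon=3)])
first_obs = vec_env_demo.reset()
step1 = vec_env_demo.step([1, 0])
step2 = vec_env_demo.step([1, 1])
print(step1)
print(step2)
```

In the second step, the first environment reaches its horizon, so its `done` flag is `True` and the observation returned is already the reset one.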

In [None]:
env_id = "CartPole-v1"
num_cpu = 4  # Number of processes to use
# Create the vectorized environment
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
# env = LabSubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

model = A2C("MlpPolicy", env, verbose=0)

Stable-Baselines3 provides a `make_vec_env()` helper that performs the previous steps for you:

In [None]:
# By default, make_vec_env uses a DummyVecEnv, as it is usually faster (cf. doc);
# pass vec_env_cls=SubprocVecEnv to use subprocesses instead
vec_env = make_vec_env(env_id, n_envs=num_cpu)

model = A2C("MlpPolicy", vec_env, verbose=0)



Let's evaluate the untrained agent; it should behave like a random agent.

In [None]:
# We create a separate environment for evaluation
eval_env = gym.make(env_id)

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward} +/- {std_reward:.2f}")



Mean reward: 9.6 +/- 0.92
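
The two numbers printed above are the mean and the standard deviation of the per-episode returns. The sketch below reproduces that aggregation with the standard library only. The episode returns are hypothetical values chosen for illustration, not the episodes from this run, and the use of the population standard deviation is an assumption about how the summary is computed.

```python
from statistics import fmean, pstdev

# Hypothetical returns of 10 evaluation episodes (not taken from the run above).
episode_returns = [9, 10, 9, 11, 10, 8, 10, 9, 10, 10]

mean_return = fmean(episode_returns)
# Population standard deviation (divides by N, not N - 1); assumed convention.
std_return = pstdev(episode_returns)

print(f"Mean reward: {mean_return} +/- {std_return:.2f}")
```

With only 10 episodes the estimate is noisy, which is why the evaluations of the trained agents later use `n_eval_episodes=100`.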


In [None]:
env_id = "CartPole-v1"
seed = 42
num_envs = 12  # Number of environments
num_cpu = 4  # Number of CPU cores/threads to use
# This means that we want to run 12 environments across 4 CPU cores,
# with 3 environments per core.

# Create the vectorized environment
lab_vec_env_kwargs = dict(env_id="CartPole-v1",
                          # env_kwargs=env_kwargs,
                          n_envs=num_envs,
                          seed=seed,
                          vec_env_cls=LabSubprocVecEnv,
                          vec_env_kwargs=dict(n_processes=num_cpu,
                                              use_threads=False,
                                              use_period='train')
                          )

lab_vec_env = make_vec_env(**lab_vec_env_kwargs)
lab_model = A2C("MlpPolicy", lab_vec_env, verbose=0)
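
The "12 environments across 4 processes" arithmetic can be sketched as follows. This is only an illustration of one possible assignment (contiguous, as even as possible); how `LabSubprocVecEnv` actually distributes environments is not shown in this notebook, and `split_env_indices` is a hypothetical helper, not part of `sb3_rllab`.

```python
from typing import List


def split_env_indices(n_envs: int, n_processes: int) -> List[List[int]]:
    """Assign environment indices to workers in contiguous, near-equal groups."""
    if n_processes <= 0:
        raise ValueError("n_processes must be positive")
    base, extra = divmod(n_envs, n_processes)
    groups: List[List[int]] = []
    start = 0
    for worker in range(n_processes):
        # The first `extra` workers take one additional environment.
        size = base + (1 if worker < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


groups_12_4 = split_env_indices(12, 4)
groups_10_4 = split_env_indices(10, 4)
print(groups_12_4)
print([len(g) for g in groups_10_4])
```

When the number of environments is not a multiple of the number of processes, some workers carry one more environment than others, so the slowest worker sets the pace of each vectorized step.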

Let's also evaluate the untrained Lab agent; it should behave like a random agent as well.

In [None]:
# We create a separate environment for evaluation
eval_env = gym.make(env_id)

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(lab_model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward} +/- {std_reward:.2f}")

Mean reward: 8.9 +/- 0.70




## Multiprocess VS Single Process Training

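Here, we compare the training time of three setups for the same number of timesteps: a single environment, 4 vectorized environments, and the Lab version with 12 environments on 4 processes. The comparison was originally estimated to take ~30 s in total; the run recorded below took about 87 s.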

In [None]:
n_timesteps = 25000

# Vectorized RL training with 4 environments
# Note: `model` was last created from make_vec_env, which defaults to DummyVecEnv
start_time = time.time()
model.learn(n_timesteps)
total_time_multi = time.time() - start_time

print(
    f"Took {total_time_multi:.2f}s for multiprocessed version - {n_timesteps / total_time_multi:.2f} FPS"
)

# Lab multiprocessed RL training (12 environments on 4 processes)
start_time = time.time()
lab_model.learn(n_timesteps)
total_time_lab_multi = time.time() - start_time

print(f"Took {total_time_lab_multi:.2f}s for Lab multiprocessed version - {n_timesteps / total_time_lab_multi:.2f} FPS")

# Single Process RL Training
single_process_model = A2C("MlpPolicy", env_id, verbose=0)

start_time = time.time()
single_process_model.learn(n_timesteps)
total_time_single = time.time() - start_time

print(f"Took {total_time_single:.2f}s for single process version - {n_timesteps / total_time_single:.2f} FPS")

print(f"Multiprocessed training is {total_time_single / total_time_multi:.2f}x faster!")

print(f"Lab Multiprocessed training is {total_time_single / total_time_lab_multi:.2f}x faster!")

Took 16.82s for multiprocessed version - 1486.32 FPS
Took 7.18s for Lab multiprocessed version - 3480.51 FPS
Took 62.95s for single process version - 397.17 FPS
Multiprocessed training is 3.74x faster!
Lab Multiprocessed training is 8.76x faster!
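
The FPS and speedup figures follow directly from the timings. The sketch below recomputes them from the rounded times printed above; since those times are rounded to two decimals, the recomputed values differ slightly from the recorded ones (for example 8.77x instead of 8.76x). The helper names `fps` and `speedup` are introduced here for illustration.

```python
n_timesteps = 25_000

# Wall-clock times in seconds, copied from the recorded output (rounded).
times = {"multi": 16.82, "lab_multi": 7.18, "single": 62.95}


def fps(steps: int, seconds: float) -> float:
    """Environment steps processed per second of wall-clock time."""
    return steps / seconds


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


fps_by_run = {name: fps(n_timesteps, seconds) for name, seconds in times.items()}
speedups = {
    name: speedup(times["single"], seconds)
    for name, seconds in times.items()
    if name != "single"
}

for name, value in fps_by_run.items():
    print(f"{name}: {value:.2f} FPS")
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x faster than single process")
```

Note that the speedup compares wall-clock time for a fixed number of timesteps; it says nothing about the final reward, which is evaluated separately below.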


In [None]:
# Evaluate the trained single process agent
print(f"Evaluate the trained single agent")
mean_reward, std_reward = evaluate_policy(single_process_model, eval_env, n_eval_episodes=100)
print(f"Mean reward: {mean_reward} +/- {std_reward:.2f}")

Evaluate the trained single agent
Mean reward: 110.61 +/- 6.13


In [None]:
# Evaluate the trained multiprocessing agent
print(f"Evaluate the trained multiprocessing agent")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)
print(f"Mean reward: {mean_reward} +/- {std_reward:.2f}")

Evaluate the trained multiprocessing agent
Mean reward: 163.83 +/- 6.55


In [None]:
# Evaluate the trained Lab multiprocessing agent
print(f"Evaluate the trained Lab multiprocessing agent")
mean_reward, std_reward = evaluate_policy(lab_model, eval_env, n_eval_episodes=100)
print(f"Mean reward: {mean_reward} +/- {std_reward:.2f}")

Evaluate the trained Lab multiprocessing agent
Mean reward: 428.18 +/- 102.24
