## Stable Baselines3 Tutorial

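In this notebook, we cover the basics of using Stable Baselines3: how to create an RL model, train it, and evaluate it. Because all algorithms share the same interface, switching from one algorithm to another is straightforward.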

Install the dependencies and Stable Baselines3 with pip:

pip install 'stable-baselines3[extra]'

### Import modules

In [1]:
import gymnasium as gym
import numpy as np

# Note: not every SB3 algorithm supports every action space.
# Check the table at: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html

In [2]:
from stable_baselines3 import PPO

In [3]:
from stable_baselines3.ppo.policies import MlpPolicy

### Create the Gym environment and instantiate the agent

In this exercise we use the CartPole environment, a classic control problem:

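"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright."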

We choose MlpPolicy because the observations of the CartPole task are feature vectors, not images.
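
As a rough illustration of that choice, the sketch below maps an observation shape to a policy alias. `pick_policy_name` is a hypothetical helper written for this note, not part of Stable Baselines3; only the strings it returns (`"MlpPolicy"`, `"CnnPolicy"`, `"MultiInputPolicy"`) are aliases the library accepts. Treating `None` as "dictionary observation" is an assumption made to keep the example small.

```python
from typing import Optional, Tuple


def pick_policy_name(obs_shape: Optional[Tuple[int, ...]]) -> str:
    """Hypothetical helper: map an observation shape to an SB3 policy alias."""
    if obs_shape is None:
        # Assumption: None stands in for a dictionary observation space.
        return "MultiInputPolicy"
    if len(obs_shape) == 3:
        # Image-like observations (three dimensions) suit a CNN feature extractor.
        return "CnnPolicy"
    # Flat feature vectors, such as CartPole's 4 numbers, suit an MLP.
    return "MlpPolicy"


print(pick_policy_name((4,)))         # CartPole-v1 observation shape
print(pick_policy_name((84, 84, 3)))  # an image-like observation
```

The alias can be passed as a string in place of the imported class, for example `PPO("MlpPolicy", env)`.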

The type of action to use (discrete or continuous) is inferred automatically from the environment's action space.

Here we use the Proximal Policy Optimization (PPO) algorithm, an actor-critic method: it uses a value function to improve the policy gradient update by reducing its variance.
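
The variance argument can be checked on a toy problem. The sketch below is not PPO. It assumes a single-state problem with two actions, a Bernoulli policy with `p = sigmoid(theta)`, and made-up rewards, and it computes the exact mean and variance of the score-function gradient with and without a baseline. The baseline plays the role the value function plays in an actor-critic method.

```python
from typing import Dict, Tuple


def score_function_stats(
    p: float, rewards: Dict[int, float], baseline: float
) -> Tuple[float, float]:
    """Exact mean and variance of grad log pi(a) * (reward - baseline)."""
    probs = {0: 1.0 - p, 1: p}
    # d/dtheta log pi(a) for a Bernoulli policy with p = sigmoid(theta)
    scores = {0: -p, 1: 1.0 - p}
    samples = {a: scores[a] * (rewards[a] - baseline) for a in (0, 1)}
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    variance = sum(probs[a] * (samples[a] - mean) ** 2 for a in (0, 1))
    return mean, variance


rewards = {0: 10.0, 1: 11.0}  # made-up rewards for the two actions
print(score_function_stats(0.5, rewards, baseline=0.0))   # (0.25, 27.5625)
print(score_function_stats(0.5, rewards, baseline=10.5))  # (0.25, 0.0)
```

Both estimators have the same expected gradient, but subtracting the expected reward (the value of the single state) removes the variance entirely in this toy case.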

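It combines ideas from A2C (multiple workers and an entropy bonus for exploration) and TRPO (a trust region to improve stability and avoid catastrophic drops in performance).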
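
The notebook does not spell out how PPO approximates a trust region. The sketch below assumes the clipped surrogate objective from the PPO paper, written for a single sample in plain Python. `clipped_surrogate` is an illustrative name, and `clip_range=0.2` mirrors the default used by SB3's `PPO`.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(a|s) / pi_old(a|s)
print(clipped_surrogate(1.0, 2.0))   # unchanged policy: plain advantage
print(clipped_surrogate(1.5, 1.0))   # gain is capped once the ratio exceeds 1 + eps
print(clipped_surrogate(0.5, -1.0))  # the more pessimistic (clipped) value is kept
```

Because the objective stops improving once the probability ratio leaves `[1 - eps, 1 + eps]`, a single update has little incentive to move the policy far from the one that collected the data.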

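PPO is an on-policy algorithm, which means the trajectories used to update the networks must be collected with the latest policy. It is usually less sample-efficient than off-policy algorithms such as DQN, SAC, or TD3, but much faster in terms of wall-clock time.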
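
To make the on-policy versus off-policy distinction concrete, here is a toy sketch of how the two families might store experience. The classes `ToyOnPolicyBuffer` and `ToyReplayBuffer` are hypothetical and much simpler than the buffers inside SB3; they only illustrate that on-policy data is discarded after each update while a replay buffer keeps transitions from older policies.

```python
import random
from collections import deque


class ToyOnPolicyBuffer:
    """Holds transitions from the current policy only; emptied at each update."""

    def __init__(self) -> None:
        self.transitions = []

    def add(self, transition: dict) -> None:
        self.transitions.append(transition)

    def drain(self) -> list:
        batch, self.transitions = self.transitions, []
        return batch


class ToyReplayBuffer:
    """Keeps up to `capacity` transitions, including ones from older policies."""

    def __init__(self, capacity: int) -> None:
        self.transitions = deque(maxlen=capacity)

    def add(self, transition: dict) -> None:
        self.transitions.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        return rng.sample(list(self.transitions), batch_size)


on_policy = ToyOnPolicyBuffer()
replay = ToyReplayBuffer(capacity=100)
rng = random.Random(0)

for policy_version in (0, 1):
    for step in range(4):
        transition = {"policy_version": policy_version, "step": step}
        on_policy.add(transition)
        replay.add(transition)
    on_batch = on_policy.drain()            # used once, then thrown away
    off_batch = replay.sample(4, rng)       # may mix old and new data
    stored = sorted({t["policy_version"] for t in replay.transitions})
    print(policy_version, len(on_batch), len(off_batch), stored)
```

Reusing old transitions is what makes off-policy methods more sample-efficient; the on-policy buffer trades that away for simpler, cheaper updates.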

In [4]:
env = gym.make("CartPole-v1")

model = PPO(MlpPolicy, env, verbose=0)

In [5]:
# We create a helper function to evaluate the agent

from stable_baselines3.common.base_class import BaseAlgorithm


def evaluate(
    model: BaseAlgorithm,
    num_episodes: int = 100,
    deterministic: bool = True,
) -> float:
    """
    Evaluate an RL agent for `num_episodes`.

    :param model: the RL agent; its environment is retrieved with `model.get_env()`
    :param num_episodes: number of episodes to evaluate it
    :param deterministic: whether to use deterministic or stochastic actions
    :return: mean reward over the `num_episodes` evaluated episodes
    """
    # This function will only work for a single environment
    vec_env = model.get_env()
    obs = vec_env.reset()
    all_episode_rewards = []
    for _ in range(num_episodes):
        episode_rewards = []
        done = False
        # Note: SB3 VecEnv resets automatically:
        # https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecenv-api-vs-gym-api
        # obs = vec_env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            # `deterministic` is to use deterministic actions
            action, _states = model.predict(obs, deterministic=deterministic)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, _info = vec_env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print(f"Mean reward: {mean_episode_reward:.2f} - Num episodes: {num_episodes}")

    return mean_episode_reward

Let's evaluate the untrained agent; it should behave like a random agent.

In [6]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_episodes=100, deterministic=True)

Mean reward: 61.86 - Num episodes: 100


Stable Baselines3 also ships a ready-made helper for this, `evaluate_policy`:

In [7]:
from stable_baselines3.common.evaluation import evaluate_policy

In [8]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100, warn=False)

print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward: 57.45 +/- 18.63
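
The two numbers printed above are summary statistics over per-episode returns. The sketch below reproduces that summary with the standard library, assuming the population standard deviation (what `np.std` computes by default). The episode returns are made up for illustration, and `summarize_returns` is a hypothetical helper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_returns(episode_returns: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


returns = [9.0, 10.0, 8.0, 11.0, 12.0]  # made-up episode returns
mean, std = summarize_returns(returns)
print(f"mean_reward: {mean:.2f} +/- {std:.2f}")  # mean_reward: 10.00 +/- 1.41
```

A large spread relative to the mean is a hint that more evaluation episodes are needed before comparing two agents.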


### Train the agent and evaluate it

In [9]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10_000)

<stable_baselines3.ppo.ppo.PPO at 0x16ba4b2b0>

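
A side note on the step budget: PPO collects experience in fixed-size rollouts, so the number of environment steps actually taken can exceed `total_timesteps`. The arithmetic below is a sketch under the assumption that the model uses PPO's default `n_steps=2048` with a single environment, as in this notebook; `rollout_plan` is a hypothetical helper.

```python
import math
from typing import Tuple


def rollout_plan(total_timesteps: int, n_steps: int = 2048, n_envs: int = 1) -> Tuple[int, int]:
    """Return (number of rollouts, environment steps actually collected)."""
    steps_per_rollout = n_steps * n_envs
    # Rollouts are collected whole, so the budget is rounded up.
    n_rollouts = math.ceil(total_timesteps / steps_per_rollout)
    return n_rollouts, n_rollouts * steps_per_rollout


print(rollout_plan(10_000))  # (5, 10240)
```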
In [10]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")



mean_reward:450.69 +/- 82.56


### Visualization

In [11]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

_XSERVTransmkdir: ERROR: euid != 0,directory /tmp/.X11-unix will not be created.
_XSERVTransSocketUNIXCreateListener: mkdir(/tmp/.X11-unix) failed, errno = 2
_XSERVTransMakeAllCOTSServerListeners: failed to create listener for local
(EE) 
Fatal server error:
(EE) Cannot establish any listening sockets - Make sure an X server isn't already running(EE) 
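
The log above shows the virtual X server failing to start on this machine, yet the recording cell further down still runs. One way to avoid the noisy failure is to launch Xvfb only where it is likely to work. The sketch below assumes that means "Linux with an `Xvfb` binary on `PATH`"; both helper functions are hypothetical, and the Linux-only rule is an assumption rather than a requirement of Stable Baselines3.

```python
import os
import platform
import shutil
import subprocess
from typing import Optional


def can_start_xvfb(system: Optional[str] = None) -> bool:
    """Return True only on Linux with an Xvfb binary on PATH (assumed rule)."""
    system = system or platform.system()
    return system == "Linux" and shutil.which("Xvfb") is not None


def start_virtual_display(display: str = ":1") -> bool:
    """Launch Xvfb when possible; return whether a server was started."""
    if not can_start_xvfb():
        return False
    subprocess.Popen(
        ["Xvfb", display, "-screen", "0", "1024x768x24"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    os.environ["DISPLAY"] = display
    return True


# Only the check is run here; call start_virtual_display() in the notebook itself.
print("Xvfb usable here:", can_start_xvfb())
```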


In [12]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay


def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
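
The core trick in `show_videos` is embedding the file in the page as a base64 data URI, so the notebook does not need to serve the video from disk. Here is that step in isolation, as a minimal sketch; `video_tag` is a hypothetical helper and the byte strings are placeholders, not real videos.

```python
import base64


def video_tag(mp4_bytes: bytes, alt: str = "") -> str:
    """Build a <video> element whose source is an inline base64 data URI."""
    payload = base64.b64encode(mp4_bytes).decode("ascii")
    return (
        f'<video alt="{alt}" autoplay loop controls style="height: 400px;">'
        f'<source src="data:video/mp4;base64,{payload}" type="video/mp4" />'
        "</video>"
    )


tag = video_tag(b"hi", alt="demo")  # placeholder bytes, not a playable video
print(tag)
```

Inline embedding keeps the notebook self-contained, at the cost of growing the saved file by roughly a third more than the video's size.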

In [13]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv


def record_video(env_id, model, video_length=500, prefix="", video_folder="videos/"):
    """
    :param env_id: (str)
    :param model: (RL model)
    :param video_length: (int)
    :param prefix: (str)
    :param video_folder: (str)
    """
    eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])
    # Start the video at step=0 and record `video_length` steps (500 by default)
    eval_env = VecVideoRecorder(
        eval_env,
        video_folder=video_folder,
        record_video_trigger=lambda step: step == 0,
        video_length=video_length,
        name_prefix=prefix,
    )

    obs = eval_env.reset()
    for _ in range(video_length):
        action, _ = model.predict(obs)
        obs, _, _, _ = eval_env.step(action)

    # Close the video recorder
    eval_env.close()

#### Visualize the trained agent

In [14]:
# On macOS, install ffmpeg first: brew install ffmpeg

record_video("CartPole-v1", model, video_length=500, prefix="ppo-cartpole")

Saving video to /Users/zrl_mini/DMU/homepage/MIT/01_TechnicalBasis_CN/DeepRL_CN/sb3_tutorial/videos/ppo-cartpole-step-0-to-step-500.mp4
Moviepy - Building video /Users/zrl_mini/DMU/homepage/MIT/01_TechnicalBasis_CN/DeepRL_CN/sb3_tutorial/videos/ppo-cartpole-step-0-to-step-500.mp4.
Moviepy - Writing video /Users/zrl_mini/DMU/homepage/MIT/01_TechnicalBasis_CN/DeepRL_CN/sb3_tutorial/videos/ppo-cartpole-step-0-to-step-500.mp4
Moviepy - Done !
Moviepy - video ready /Users/zrl_mini/DMU/homepage/MIT/01_TechnicalBasis_CN/DeepRL_CN/sb3_tutorial/videos/ppo-cartpole-step-0-to-step-500.mp4
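
Judging only from the file name in the log above, the recorder appears to name each clip `<prefix>-step-<start>-to-step-<start + video_length>.mp4`. The sketch below rebuilds that name so a later cell can locate the file; the pattern is inferred from this one log line, and `expected_video_name` is a hypothetical helper.

```python
from pathlib import Path


def expected_video_name(prefix: str, start_step: int, video_length: int) -> str:
    """File name pattern inferred from the log line in this notebook."""
    return f"{prefix}-step-{start_step}-to-step-{start_step + video_length}.mp4"


name = expected_video_name("ppo-cartpole", start_step=0, video_length=500)
print(name)  # ppo-cartpole-step-0-to-step-500.mp4
print((Path("videos") / name).exists())
```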


In [15]:
show_videos("videos", prefix="ppo")