Learn how to use callbacks to monitor training, save models automatically, manipulate the model, display progress bars, and more.

In [1]:
import gymnasium as gym
from stable_baselines3 import A2C, SAC, PPO, TD3

## The importance of hyperparameter tuning

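Compared with supervised learning, deep reinforcement learning is far more sensitive to the choice of hyperparameters such as the learning rate, the number of neurons, the number of layers, and the optimizer. A poor choice can lead to poor performance or unstable convergence. The challenge is compounded by the variability in performance across random seeds (used to initialize the network weights and the environment).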

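Here we demonstrate this on a toy example: the Soft Actor Critic (SAC) algorithm applied to the Pendulum environment. Note the change in performance between the default parameters and the "tuned" parameters.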

In [2]:
import numpy as np

from stable_baselines3.common.evaluation import evaluate_policy

In [3]:
eval_env = gym.make("Pendulum-v1")

In [4]:
default_model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    verbose=1,
    seed=0,
    batch_size=64,
    policy_kwargs=dict(net_arch=[64, 64]),
).learn(8000)

Using cpu device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.38e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 454       |
|    time_elapsed    | 1         |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 18.7      |
|    critic_loss     | 1.46      |
|    ent_coef        | 0.811     |
|    ent_coef_loss   | -0.346    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.45e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 435       |
|    time_e

In [5]:
mean_reward, std_reward = evaluate_policy(default_model, eval_env, n_eval_episodes=100)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")



mean_reward:-310.80 +/- 112.18


In [6]:
tuned_model = SAC(
    "MlpPolicy",
    "Pendulum-v1",
    batch_size=256,
    verbose=1,
    policy_kwargs=dict(net_arch=[256, 256]),
    seed=0,
).learn(8000)

Using cpu device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.38e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 179       |
|    time_elapsed    | 4         |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 22.5      |
|    critic_loss     | 0.273     |
|    ent_coef        | 0.812     |
|    ent_coef_loss   | -0.341    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.39e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 159       |
|    time_e

In [7]:
mean_reward, std_reward = evaluate_policy(tuned_model, eval_env, n_eval_episodes=100)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-139.84 +/- 99.76
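
To put the two evaluations side by side, the printed means and standard deviations can be turned into a rough measure of how large the gap is relative to evaluation noise. The following is a minimal sketch, assuming the 100 evaluation episodes of each model are independent; the helper name `compare_runs` is hypothetical, and the numbers are the ones printed above.

```python
import math


def compare_runs(mean_a, std_a, mean_b, std_b, n_episodes):
    """Return (difference of means, standard error of that difference).

    Assumes each run was evaluated on n_episodes independent episodes.
    """
    diff = mean_b - mean_a
    std_err = math.sqrt(std_a**2 / n_episodes + std_b**2 / n_episodes)
    return diff, std_err


# Figures printed by evaluate_policy above (default model first, then tuned model).
diff, std_err = compare_runs(-310.80, 112.18, -139.84, 99.76, n_episodes=100)
print(f"improvement: {diff:.2f} +/- {std_err:.2f} (standard error)")
```

With these figures the gap is roughly eleven standard errors, so for this seed the difference is unlikely to be evaluation noise. Both models were trained with a single seed, so this says nothing about seed-to-seed variability, which would need additional training runs.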


## Callbacks

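Although Stable-Baselines3 provides a collection of callbacks (for example, for creating checkpoints or for evaluation), we will reimplement some of them here to get a good understanding of how they work.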

To build a custom callback, we create a class that derives from BaseCallback. This gives us access to events (_on_training_start, _on_step) and to useful variables (such as self.model, the RL model).

_on_step returns a boolean that indicates whether training should continue.

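Because we have access to the model's variables, in particular self.model, we can even change the model's parameters without stopping training or changing the model's code.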
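
The contract around the boolean return value can be illustrated without any RL library. The sketch below is a toy model of the idea, not the library's implementation: `ToyCallback`, `StopAfter`, and `toy_learn` are hypothetical names, and the loop assumes the callback is invoked once per environment step.

```python
class ToyCallback:
    """Stand-in for a callback; only the hook names echo BaseCallback."""

    def on_training_start(self):
        pass

    def on_step(self) -> bool:
        return True  # True means "keep training"

    def on_training_end(self):
        pass


class StopAfter(ToyCallback):
    """Ask the loop to stop on the max_calls-th call."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.n_calls = 0

    def on_step(self) -> bool:
        self.n_calls += 1
        return self.n_calls < self.max_calls


def toy_learn(total_timesteps, callback):
    """Hypothetical training loop: one callback call per step."""
    callback.on_training_start()
    steps_done = 0
    for _ in range(total_timesteps):
        steps_done += 1  # a real algorithm would step the env and update here
        if not callback.on_step():
            break  # the callback returned False: stop early
    callback.on_training_end()
    return steps_done


steps = toy_learn(8000, StopAfter(max_calls=2))
print(steps)
```

Although 8000 steps were requested, the loop ends as soon as the callback returns False, which is the same mechanism a real callback uses to stop training early.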

In [8]:
from stable_baselines3.common.callbacks import BaseCallback

In [9]:
class SimpleCallback(BaseCallback):
    """
    A simple callback that stops training on its second call.

    :param verbose: (int) Verbosity level: 0 = no output, 1 = info, 2 = debug
    """

    def __init__(self, verbose=0):
        super(SimpleCallback, self).__init__(verbose)
        self._called = False

    def _on_step(self):
        if not self._called:
            print("callback - first call")
            self._called = True
            return True  # returns True, training continues.
        print("callback - second call")
        return False  # returns False, training stops.

In [10]:
model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(8000, callback=SimpleCallback())

Using cpu device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
callback - first call
callback - second call


<stable_baselines3.sac.sac.SAC at 0x169b5cb80>

## First example: automatically saving the best model

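In RL, it is quite useful to keep a clean version of the model while training, because training can end up locking in a bad policy. This is a typical use case for callbacks, since they can call the model's save function and observe training over time.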

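Using the Monitor wrapper, we can save statistics about the environment and use them to compute the mean training reward. This lets us save the best model while training.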
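
To make the idea of "saved statistics" concrete, here is a library-free sketch of reading a Monitor-style log and deriving a mean training reward from it. The sample text is made up, and its layout (a `#`-prefixed JSON metadata line, a header row, then one `r,l,t` row per episode) is an assumption about the log format that may differ between versions; the helper names are hypothetical.

```python
import csv
import io
import json
from itertools import accumulate

# Made-up sample in the assumed layout of a Monitor log file:
# r = episode reward, l = episode length, t = seconds since the start.
SAMPLE_LOG = """#{"t_start": 0.0, "env_id": "CartPole-v1"}
r,l,t
20.0,20,0.05
20.0,20,0.09
92.0,92,0.31
"""


def read_episodes(text):
    """Parse the log text into (metadata dict, list of per-episode dicts)."""
    stream = io.StringIO(text)
    metadata = json.loads(stream.readline().lstrip("#"))
    rows = [
        {"r": float(row["r"]), "l": int(row["l"]), "t": float(row["t"])}
        for row in csv.DictReader(stream)
    ]
    return metadata, rows


def timesteps_and_rewards(rows):
    """x = cumulative timesteps at the end of each episode, y = episode rewards."""
    x = list(accumulate(row["l"] for row in rows))
    y = [row["r"] for row in rows]
    return x, y


metadata, rows = read_episodes(SAMPLE_LOG)
x, y = timesteps_and_rewards(rows)
recent = y[-100:]  # at most the last 100 episodes
mean_reward = sum(recent) / len(recent)
print(x, mean_reward)
```

A callback only has to repeat this computation every so often and compare the result with the best mean seen so far.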

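Note that this is not the proper way to evaluate an RL agent: we should create a test environment and evaluate the agent's performance in the callback (see EvalCallback). For simplicity, we use the training reward as a proxy.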

In [11]:
import os

import numpy as np

from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.results_plotter import load_results, ts2xy

In [12]:
from stable_baselines3.common.callbacks import BaseCallback

In [13]:
class SaveOnBestTrainingRewardCallback(BaseCallback):
    """
    Callback for saving a model (the check is done every ``check_freq`` steps)
    based on the training reward (in practice, we recommend using ``EvalCallback``).

    :param check_freq: (int)
    :param log_dir: (str) Path to the folder where the model will be saved.
      It must contain the file created by the ``Monitor`` wrapper.
    :param verbose: (int)
    """

    def __init__(self, check_freq, log_dir, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir, "best_model")
        self.best_mean_reward = -np.inf

    def _init_callback(self) -> None:
        # Make sure the folder that will hold best_model.zip exists
        # (creating save_path itself would leave a stray "best_model" directory)
        os.makedirs(self.log_dir, exist_ok=True)

    def _on_step(self) -> bool:
        if self.n_calls % self.check_freq == 0:

            # Retrieve training reward
            x, y = ts2xy(load_results(self.log_dir), "timesteps")
            if len(x) > 0:
                # Mean training reward over the last 100 episodes
                mean_reward = np.mean(y[-100:])
                if self.verbose > 0:
                    print("Num timesteps: {}".format(self.num_timesteps))
                    print(
                        "Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(
                            self.best_mean_reward, mean_reward
                        )
                    )

                # New best model, you could save the agent here
                if mean_reward > self.best_mean_reward:
                    self.best_mean_reward = mean_reward
                    # Example for saving best model
                    if self.verbose > 0:
                        print("Saving new best model at {} timesteps".format(x[-1]))
                        print("Saving new best model to {}.zip".format(self.save_path))
                    self.model.save(self.save_path)

        return True

In [14]:
# Create log dir
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = make_vec_env("CartPole-v1", n_envs=1, monitor_dir=log_dir)
# it is equivalent to:
# env = gym.make('CartPole-v1')
# env = Monitor(env, log_dir)
# env = DummyVecEnv([lambda: env])

# Create Callback
callback = SaveOnBestTrainingRewardCallback(check_freq=20, log_dir=log_dir, verbose=1)

model = A2C("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=5000, callback=callback)

Num timesteps: 40
Best mean reward: -inf - Last mean reward per episode: 20.00
Saving new best model at 40 timesteps
Saving new best model to /tmp/gym/best_model.zip
Num timesteps: 60
Best mean reward: 20.00 - Last mean reward per episode: 20.00
Num timesteps: 80
Best mean reward: 20.00 - Last mean reward per episode: 20.00
Num timesteps: 100
Best mean reward: 20.00 - Last mean reward per episode: 20.00
Num timesteps: 120
Best mean reward: 20.00 - Last mean reward per episode: 20.00
Num timesteps: 140
Best mean reward: 20.00 - Last mean reward per episode: 44.00
Saving new best model at 132 timesteps
Saving new best model to /tmp/gym/best_model.zip
Num timesteps: 160
Best mean reward: 44.00 - Last mean reward per episode: 37.75
Num timesteps: 180
Best mean reward: 44.00 - Last mean reward per episode: 34.40
Num timesteps: 200
Best mean reward: 44.00 - Last mean reward per episode: 34.40
Num timesteps: 220
Best mean reward: 44.00 - Last mean reward per episode: 35.33
Num timesteps: 240


Num timesteps: 2440
Best mean reward: 44.00 - Last mean reward per episode: 18.74
Num timesteps: 2460
Best mean reward: 44.00 - Last mean reward per episode: 18.28
Num timesteps: 2480
Best mean reward: 44.00 - Last mean reward per episode: 18.30
Num timesteps: 2500
Best mean reward: 44.00 - Last mean reward per episode: 18.17
Num timesteps: 2520
Best mean reward: 44.00 - Last mean reward per episode: 17.97
Num timesteps: 2540
Best mean reward: 44.00 - Last mean reward per episode: 18.04
Num timesteps: 2560
Best mean reward: 44.00 - Last mean reward per episode: 17.94
Num timesteps: 2580
Best mean reward: 44.00 - Last mean reward per episode: 17.80
Num timesteps: 2600
Best mean reward: 44.00 - Last mean reward per episode: 17.86
Num timesteps: 2620
Best mean reward: 44.00 - Last mean reward per episode: 17.81
Num timesteps: 2640
Best mean reward: 44.00 - Last mean reward per episode: 17.73
Num timesteps: 2660
Best mean reward: 44.00 - Last mean reward per episode: 17.63
Num timesteps: 2

Num timesteps: 4860
Best mean reward: 44.00 - Last mean reward per episode: 30.77
Num timesteps: 4880
Best mean reward: 44.00 - Last mean reward per episode: 30.77
Num timesteps: 4900
Best mean reward: 44.00 - Last mean reward per episode: 30.77
Num timesteps: 4920
Best mean reward: 44.00 - Last mean reward per episode: 30.77
Num timesteps: 4940
Best mean reward: 44.00 - Last mean reward per episode: 34.66
Num timesteps: 4960
Best mean reward: 44.00 - Last mean reward per episode: 34.66
Num timesteps: 4980
Best mean reward: 44.00 - Last mean reward per episode: 34.66
Num timesteps: 5000
Best mean reward: 44.00 - Last mean reward per episode: 34.66


<stable_baselines3.a2c.a2c.A2C at 0x169b5f250>

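## Second example: plotting performance in real time
While training, it is sometimes useful to see how training progresses over time in terms of the episodic reward. Stable-Baselines has TensorBoard support for this, but it can be cumbersome, especially in terms of disk space usage.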

Note: unfortunately, live plotting does not work out of the box on Google Colab.

Here we can again use a callback, together with the Monitor wrapper, to plot the episodic reward in real time:

In [16]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook


class PlottingCallback(BaseCallback):
    """
    Callback for plotting the performance in realtime.

    :param verbose: (int)
    """
    def __init__(self, verbose=1):
        super().__init__(verbose)
        self._plot = None

    def _on_step(self) -> bool:
        # get the monitor's data
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if self._plot is None: # make the plot
            plt.ion()
            fig = plt.figure(figsize=(6,3))
            ax = fig.add_subplot(111)
            line, = ax.plot(x, y)
            self._plot = (line, ax, fig)
            plt.show()
        else: # update and rescale the plot
            self._plot[0].set_data(x, y)
            self._plot[-2].relim()
            self._plot[-2].set_xlim([self.locals["total_timesteps"] * -0.02,
                                    self.locals["total_timesteps"] * 1.02])
            self._plot[-2].autoscale_view(True,True,True)
            self._plot[-1].canvas.draw()
        # _on_step must return a boolean; True keeps training going
        return True
        
# Create log dir
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = make_vec_env('MountainCarContinuous-v0', n_envs=1, monitor_dir=log_dir)

plotting_callback = PlottingCallback()
        
model = PPO('MlpPolicy', env, verbose=0)
model.learn(10000, callback=plotting_callback)

<IPython.core.display.Javascript object>

<stable_baselines3.ppo.ppo.PPO at 0x169b5e860>

## Third example: progress bar

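Quality-of-life improvements are always welcome when developing and using RL. Here we use tqdm to display a progress bar for training, along with the number of timesteps per second and the estimated time remaining until training ends: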

Note that this callback is already included in SB3 and can be used by passing progress_bar=True to the learn() method.
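
The two figures shown next to the bar, timesteps per second and estimated time remaining, follow from simple arithmetic. The sketch below is a simplified illustration using the plain average rate; tqdm itself smooths its rate estimate, so its numbers can differ. The function name `progress_summary` is hypothetical.

```python
def progress_summary(n_done, total, elapsed_seconds):
    """Return (steps per second, estimated seconds remaining)."""
    if n_done <= 0 or elapsed_seconds <= 0:
        return 0.0, float("inf")  # no rate estimate is available yet
    rate = n_done / elapsed_seconds
    remaining = (total - n_done) / rate
    return rate, remaining


# 500 of 2000 timesteps completed after 2 seconds of training.
rate, remaining = progress_summary(n_done=500, total=2000, elapsed_seconds=2.0)
print(f"{rate:.0f} it/s, about {remaining:.0f}s left")
```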

In [17]:
from tqdm.auto import tqdm


class ProgressBarCallback(BaseCallback):
    """
    :param pbar: (tqdm.pbar) Progress bar object
    """

    def __init__(self, pbar):
        super().__init__()
        self._pbar = pbar

    def _on_step(self) -> bool:
        # Update the progress bar:
        self._pbar.n = self.num_timesteps
        self._pbar.update(0)
        return True  # keep training


# this callback uses the 'with' block, allowing for correct initialisation and destruction
class ProgressBarManager(object):
    def __init__(self, total_timesteps):  # init object with total timesteps
        self.pbar = None
        self.total_timesteps = total_timesteps

    def __enter__(self):  # create the progress bar and callback, return the callback
        self.pbar = tqdm(total=self.total_timesteps)

        return ProgressBarCallback(self.pbar)

    def __exit__(self, exc_type, exc_val, exc_tb):  # close the callback
        self.pbar.n = self.total_timesteps
        self.pbar.update(0)
        self.pbar.close()


model = TD3("MlpPolicy", "Pendulum-v1", verbose=0)
# Using a context manager guarantees that the tqdm progress bar closes correctly
with ProgressBarManager(2000) as callback:
    model.learn(2000, callback=callback)

  0%|          | 0/2000 [00:00<?, ?it/s]

## Fourth example: composition

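Thanks to the functional nature of callbacks, several callbacks can be composed into a single one. This means we can automatically save the best model and show a progress bar and the episodic reward during training.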

When a list is passed to the learn() method, the callbacks are composed automatically. Under the hood, a CallbackList is created.
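
One plausible way to picture the composition is a wrapper that forwards each step to all of its members and continues only if every one of them agrees. The sketch below is a toy illustration under that assumption, not the library's code; `ToyCallbackList`, `CountSteps`, and `StopAt` are hypothetical names.

```python
class CountSteps:
    """Never stops training; only counts how often it is called."""

    def __init__(self):
        self.n_calls = 0

    def on_step(self) -> bool:
        self.n_calls += 1
        return True


class StopAt:
    """Returns False on the limit-th call."""

    def __init__(self, limit):
        self.limit = limit
        self.n_calls = 0

    def on_step(self) -> bool:
        self.n_calls += 1
        return self.n_calls < self.limit


class ToyCallbackList:
    """Hypothetical composed callback: continue only if all members agree."""

    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def on_step(self) -> bool:
        continue_training = True
        for callback in self.callbacks:
            # Every member is called, even after one has asked to stop.
            continue_training = callback.on_step() and continue_training
        return continue_training


counter = CountSteps()
combined = ToyCallbackList([StopAt(limit=3), counter])
steps = 0
while combined.on_step():
    steps += 1
print(steps, counter.n_calls)
```

Calling every member on each step, rather than short-circuiting, keeps side effects such as logging or progress updates consistent across callbacks even on the step where training stops.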

In [18]:
from stable_baselines3.common.callbacks import CallbackList

# Create log dir
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = make_vec_env('CartPole-v1', n_envs=1, monitor_dir=log_dir)

# Create callbacks
auto_save_callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)

model = PPO('MlpPolicy', env, verbose=0)
with ProgressBarManager(1000) as progress_callback:
    # This is equivalent to callback=CallbackList([progress_callback, auto_save_callback])
    model.learn(1000, callback=[progress_callback, auto_save_callback])

  0%|          | 0/1000 [00:00<?, ?it/s]

## Exercise: write your own callback

The previous examples showed the basics of what a callback is and how to use one.

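The goal of this exercise is to create a callback that evaluates the model on a test environment and saves it if it is the best model seen so far.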

To keep things simple, we will use a class rather than a function with the magic method __call__.

In [24]:
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class EvalCallback(BaseCallback):
    def __init__(self, eval_env, n_eval_episodes=5, eval_freq=20, verbose=1):
        super().__init__(verbose)
        self.eval_env = eval_env
        self.n_eval_episodes = n_eval_episodes
        self.eval_freq = eval_freq
        self.best_mean_reward = -np.inf

    def _on_step(self):
        if self.n_calls % self.eval_freq == 0:
            total_rewards = []
            for _ in range(self.n_eval_episodes):
                obs = self.eval_env.reset()
                done = False
                total_reward = 0
                while not done:
                    action, _states = self.model.predict(obs, deterministic=True)
                    obs, reward, done, info = self.eval_env.step(action)
                    total_reward += reward
                total_rewards.append(total_reward)
            mean_reward = np.mean(total_rewards)
            if mean_reward > self.best_mean_reward:
                self.best_mean_reward = mean_reward
                self.model.save("best_model")
            if self.verbose > 0:
                print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(self.best_mean_reward, mean_reward))

        return True

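# Create the evaluation environment
# (the training environment `env` comes from the previous example)
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrapping the evaluation environment in a DummyVecEnv ensures that observations
# have the format the model expects. This is generally the recommended approach
# when working with a single environment.
eval_env = DummyVecEnv([lambda: gym.make("CartPole-v1")])

# Create the callback
callback = EvalCallback(eval_env, n_eval_episodes=5, eval_freq=1000)

# Create the RL model, here using the PPO algorithm
model = PPO("MlpPolicy", env, verbose=1)

# Train the RL model
model.learn(int(10000), callback=callback)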


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Best mean reward: 8.60 - Last mean reward per episode: 8.60
Best mean reward: 8.60 - Last mean reward per episode: 8.60
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 23.4     |
|    ep_rew_mean     | 23.4     |
| time/              |          |
|    fps             | 7947     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
Best mean reward: 104.80 - Last mean reward per episode: 104.80
Best mean reward: 104.80 - Last mean reward per episode: 74.80
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 26.3        |
|    ep_rew_mean          | 26.3        |
| time/                   |             |
|    fps                  | 4551        |
|    iterations           | 2           |
|    time_elapsed        

Best mean reward: 500.00 - Last mean reward per episode: 500.00
Best mean reward: 500.00 - Last mean reward per episode: 500.00
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 143         |
|    ep_rew_mean          | 143         |
| time/                   |             |
|    fps                  | 2677        |
|    iterations           | 10          |
|    time_elapsed         | 7           |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.012323601 |
|    clip_fraction        | 0.135       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.56       |
|    explained_variance   | 0.879       |
|    learning_rate        | 0.0003      |
|    loss                 | 6.51        |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.0123     |
|    value_loss           | 24.2        |
--------------------------------

Best mean reward: 500.00 - Last mean reward per episode: 500.00
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 285          |
|    ep_rew_mean          | 285          |
| time/                   |              |
|    fps                  | 2418         |
|    iterations           | 18           |
|    time_elapsed         | 15           |
|    total_timesteps      | 36864        |
| train/                  |              |
|    approx_kl            | 0.0011644698 |
|    clip_fraction        | 0.00498      |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.484       |
|    explained_variance   | 0.258        |
|    learning_rate        | 0.0003       |
|    loss                 | 5.28         |
|    n_updates            | 170          |
|    policy_gradient_loss | 0.000535     |
|    value_loss           | 50.7         |
------------------------------------------
Best mean reward: 500.00 - Last m

Best mean reward: 500.00 - Last mean reward per episode: 500.00
Best mean reward: 500.00 - Last mean reward per episode: 500.00
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 396         |
|    ep_rew_mean          | 396         |
| time/                   |             |
|    fps                  | 2344        |
|    iterations           | 26          |
|    time_elapsed         | 22          |
|    total_timesteps      | 53248       |
| train/                  |             |
|    approx_kl            | 0.005444841 |
|    clip_fraction        | 0.0496      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.413      |
|    explained_variance   | 0.496       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.00812     |
|    n_updates            | 250         |
|    policy_gradient_loss | -0.00498    |
|    value_loss           | 0.0236      |
--------------------------------

Best mean reward: 500.00 - Last mean reward per episode: 500.00
Best mean reward: 500.00 - Last mean reward per episode: 500.00
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 472          |
|    ep_rew_mean          | 472          |
| time/                   |              |
|    fps                  | 2297         |
|    iterations           | 34           |
|    time_elapsed         | 30           |
|    total_timesteps      | 69632        |
| train/                  |              |
|    approx_kl            | 0.0071001803 |
|    clip_fraction        | 0.0689       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.383       |
|    explained_variance   | 0.17         |
|    learning_rate        | 0.0003       |
|    loss                 | -0.0169      |
|    n_updates            | 330          |
|    policy_gradient_loss | -0.00322     |
|    value_loss           | 0.00107      |
------------

Best mean reward: 500.00 - Last mean reward per episode: 500.00
Best mean reward: 500.00 - Last mean reward per episode: 500.00
Best mean reward: 500.00 - Last mean reward per episode: 500.00
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 490          |
|    ep_rew_mean          | 490          |
| time/                   |              |
|    fps                  | 2250         |
|    iterations           | 42           |
|    time_elapsed         | 38           |
|    total_timesteps      | 86016        |
| train/                  |              |
|    approx_kl            | 0.0007618759 |
|    clip_fraction        | 0.0232       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.287       |
|    explained_variance   | 0.281        |
|    learning_rate        | 0.0003       |
|    loss                 | -0.00543     |
|    n_updates            | 410          |
|    policy_gradient_loss | -0.000

<stable_baselines3.ppo.ppo.PPO at 0x2a4ac5270>

## Conclusion

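In this notebook we have seen:

That good hyperparameters are key to success in RL, and that you should not expect the defaults to work on every problem.

What a callback is and what you can do with it.

How to create your own callback.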