# Unit 8: Proximal Policy Optimization (PPO) with PyTorch 🤖

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail.png" alt="Unit 8"/>


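In this notebook, you'll learn to **code your PPO agent from scratch with PyTorch, using the CleanRL implementation as a model**.

To test its robustness, we're going to train it in:

- [LunarLander-v2 🚀](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)


⬇️ Here is an example of what you will achieve. ⬇️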

In [1]:
%%html
<video controls autoplay><source src="https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>

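We're constantly trying to improve our tutorials, so **if you find any issues in this notebook**, please [open an issue on the GitHub repo](https://github.com/huggingface/deep-rl-class/issues).

## Objectives of this notebook 🏆

At the end of the notebook, you will be able to:

- **Code your PPO agent from scratch using PyTorch**.
- **Push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.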




## This notebook is from the Deep Reinforcement Learning Course
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>

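In this free course, you will:

- 📖 Study Deep Reinforcement Learning in **theory and practice**.
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
- 🤖 Train **agents in unique environments**.

Don't forget to **<a href="http://eepurl.com/ic5ZUD">sign up for the course</a>** (we collect your email so that we can **send you the links when each unit is published and give you information about the challenges and updates**).


The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5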

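## Prerequisites 🏗️
Before diving into the notebook, you need to:

🔲 📚 [Study PPO by reading Unit 8](https://huggingface.co/deep-rl-course/unit8/introduction) 🤗

To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push one model. We don't ask for a minimum score, but we **advise you to try different hyperparameter settings to get better results**.

If you can't find your model, **go to the bottom of the page and click the refresh button**.

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process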

## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">

- `Hardware Accelerator > GPU`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">

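## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so in Colab, **we need a virtual screen to render the environment** (and thus record the frames).

The following cells install the required libraries, then create and run a virtual screen 🖥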

In [1]:
!pip install setuptools==65.5.0



In [2]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!apt install swig cmake
!pip install pyglet==1.5
!pip3 install pyvirtualdisplay

In [4]:
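# Virtual display
from pyvirtualdisplay import Display

# The two lines below start the virtual screen. They are commented out here;
# uncomment them when running on Colab or another headless Linux machine.
# virtual_display = Display(visible=0, size=(1400, 900))
# virtual_display.start()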

## Install dependencies 🔽
For this exercise, we use `gym==0.22`.

In [3]:
!pip install more_itertools



In [4]:
!pip install pygame



In [3]:
!pip install gym==0.22
!pip install imageio-ffmpeg
!pip install huggingface_hub
!pip install gym[box2d]==0.22



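## Let's code PPO from scratch with Costa Huang's tutorial
- For the core implementation of PPO, we're going to follow the excellent tutorial by [Costa Huang](https://costa.sh/).
- In addition to the tutorial, to go deeper you can read the 37 core implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

👉 The video tutorial: https://youtu.be/MEt6rrxH8W4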

In [2]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MEt6rrxH8W4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')



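- It's best to write your code in the cell below first; that way, if the runtime is interrupted, **you don't lose the implementation you've written**.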

In [None]:
### Your code here:

## Add the Hugging Face Integration 🤗
- In order to push our model to the Hub, we need to define a `package_to_hub` function.

- Add the dependencies we need to push our model to the Hub

In [4]:
from huggingface_hub import HfApi, upload_folder
from huggingface_hub.repocard import metadata_eval_result, metadata_save

from pathlib import Path
import datetime
import tempfile
import json
import shutil
import imageio

from wasabi import Printer
msg = Printer()



- Add a new argument in the `parse_args()` function to define the repo ID where we want to push the model.

In [5]:
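# Adding the Hugging Face argument (this line belongs inside parse_args(), where `parser` is defined)
parser.add_argument("--repo-id", type=str, default="a1024053774/ppo-CartPole-v1", help="id of the model repository on the Hugging Face Hub {username/repo_name}")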


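- Next, we add the methods needed to push the model to the Hub

- These methods will:
  - `_evaluate_agent()`: evaluate the agent.
  - `_generate_model_card()`: generate the model card of your agent.
  - `record_video()`: record a video of your agent.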
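
As a rough illustration of what the evaluation helper computes, the sketch below runs a policy for several episodes and reports the mean and the population standard deviation of the episode returns (the quantity `np.std` gives with its default settings). `CountdownEnv`, `always_zero_policy` and `evaluate` are hypothetical stand-ins written for this example; they are not part of gym or of the course code.

```python
import statistics


class CountdownEnv:
    """Hypothetical toy environment with the old gym API:
    reset() -> obs and step(action) -> (obs, reward, done, info)."""

    def __init__(self, episode_lengths):
        self._lengths = list(episode_lengths)
        self._episode = -1
        self._remaining = 0

    def reset(self):
        # Cycle through the configured episode lengths
        self._episode += 1
        self._remaining = self._lengths[self._episode % len(self._lengths)]
        return self._remaining

    def step(self, action):
        # Every step gives a reward of 1.0; the episode ends when the counter hits 0
        self._remaining -= 1
        done = self._remaining == 0
        return self._remaining, 1.0, done, {}


def always_zero_policy(state):
    """Hypothetical policy that ignores the state."""
    return 0


def evaluate(env, n_eval_episodes, policy):
    """Return (mean, population std) of the undiscounted episode returns."""
    episode_returns = []
    for _ in range(n_eval_episodes):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total += reward
        episode_returns.append(total)
    return statistics.mean(episode_returns), statistics.pstdev(episode_returns)


mean_return, std_return = evaluate(CountdownEnv([2, 4]), 4, always_zero_policy)
print(mean_return, std_return)
```

The real helper follows the same loop shape but converts each observation to a tensor and asks the agent for an action.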

In [None]:
def package_to_hub(repo_id,
                model,
                hyperparameters,
                eval_env,
                video_fps=30,
                commit_message="Push agent to the Hub",
                token= None,
                logs=None
                ):
  """
  Evaluate, generate a video, and upload a model to the Hugging Face Hub.
  This method runs the complete pipeline:
  - It evaluates the model
  - It generates the model card
  - It generates a replay video of the agent
  - It pushes everything to the Hub
  :param repo_id: id of the model repository on the Hugging Face Hub
  :param model: trained model
  :param hyperparameters: training hyperparameters (the argparse namespace)
  :param eval_env: environment used to evaluate the agent
  :param video_fps: frames per second of the rendered video
  :param commit_message: commit message
  :param token: Hugging Face access token (optional)
  :param logs: local directory of TensorBoard logs you'd like to upload
  """
  msg.info(
        "This function will save, evaluate, generate a video of your agent, "
        "create a model card and push everything to the Hub. "
        "It might take up to 1 min. \n "
        "This is a work in progress: if you encounter a bug, please open an issue."
    )
  # Step 1: Clone or create the repo
  repo_url = HfApi().create_repo(
        repo_id=repo_id,
        token=token,
        private=False,
        exist_ok=True,
    )

  with tempfile.TemporaryDirectory() as tmpdirname:
    tmpdirname = Path(tmpdirname)

    # Step 2: Save the model
    torch.save(model.state_dict(), tmpdirname / "model.pt")

    # Step 3: Evaluate the model and build the JSON file
    mean_reward, std_reward = _evaluate_agent(eval_env,
                                           10,
                                           model)

    # First get the datetime
    eval_datetime = datetime.datetime.now()
    eval_form_datetime = eval_datetime.isoformat()

    evaluate_data = {
        "env_id": hyperparameters.env_id,
        "mean_reward": mean_reward,
        "std_reward": std_reward,
        "n_evaluation_episodes": 10,
        "eval_datetime": eval_form_datetime,
    }

    # Write a JSON file
    with open(tmpdirname / "results.json", "w") as outfile:
      json.dump(evaluate_data, outfile)

    # Step 4: Generate a video
    video_path = tmpdirname / "replay.mp4"
    record_video(eval_env, model, video_path, video_fps)

    # Step 5: Generate the model card
    generated_model_card, metadata = _generate_model_card("PPO", hyperparameters.env_id, mean_reward, std_reward, hyperparameters)
    _save_model_card(tmpdirname, generated_model_card, metadata)

    # Step 6: Add logs if needed
    if logs:
      _add_logdir(tmpdirname, Path(logs))

    msg.info(f"Pushing repo {repo_id} to the Hugging Face Hub")

    repo_url = upload_folder(
            repo_id=repo_id,
            folder_path=tmpdirname,
            path_in_repo="",
            commit_message=commit_message,
            token=token,
        )

    msg.info(f"Your model is pushed to the Hub. You can view your model here: {repo_url}")
  return repo_url


def _evaluate_agent(env, n_eval_episodes, policy):
  """
  Evaluate the agent for n_eval_episodes episodes and return the mean reward and the standard deviation of the reward.
  :param env: the evaluation environment
  :param n_eval_episodes: number of episodes to evaluate the agent
  :param policy: the agent
  """
  episode_rewards = []
  for episode in range(n_eval_episodes):
    state = env.reset()
    done = False
    total_rewards_ep = 0

    # `not done` also works when the environment returns a NumPy bool
    while not done:
      state = torch.Tensor(state).to(device)
      action, _, _, _ = policy.get_action_and_value(state)
      new_state, reward, done, info = env.step(action.cpu().numpy())
      total_rewards_ep += reward
      if done:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward


def record_video(env, policy, out_directory, fps=30):
  images = []
  done = False
  state = env.reset()
  img = env.render(mode='rgb_array')
  images.append(img)
  while not done:
    state = torch.Tensor(state).to(device)
    # Sample an action from the policy for the current state
    action, _, _, _  = policy.get_action_and_value(state)
    state, reward, done, info = env.step(action.cpu().numpy()) # For the recording logic, we directly assign next_state to state
    img = env.render(mode='rgb_array')
    images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)


def _generate_model_card(model_name, env_id, mean_reward, std_reward, hyperparameters):
  """
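  Generate the model card for the Hub
  :param model_name: name of the model
  :param env_id: name of the environment
  :param mean_reward: mean reward of the agent
  :param std_reward: standard deviation of the agent's reward
  :param hyperparameters: training arguments
  """
  # Step 1: Select the tags
  metadata = generate_metadata(model_name, env_id, mean_reward, std_reward)

  # Convert the hyperparameter namespace to a string
  converted_dict = vars(hyperparameters)
  converted_str = str(converted_dict)
  converted_str = converted_str.split(", ")
  converted_str = '\n'.join(converted_str)

  # Step 2: Generate the model card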
  model_card = f"""
  # PPO Agent Playing {env_id}

  This is a trained model of a PPO agent playing {env_id}.

  # Hyperparameters
  ```python
  {converted_str}
  ```
  """
  return model_card, metadata


def generate_metadata(model_name, env_id, mean_reward, std_reward):
  """
  Define the tags for the model card
  :param model_name: name of the model
  :param env_id: name of the environment
  :param mean_reward: mean reward of the agent
  :param std_reward: standard deviation of the agent's reward
  """
  metadata = {}
  metadata["tags"] = [
        env_id,
        "ppo",
        "deep-reinforcement-learning",
        "reinforcement-learning",
        "custom-implementation",
        "deep-rl-course"
  ]

  # Add metrics (named eval_result to avoid shadowing the built-in eval)
  eval_result = metadata_eval_result(
      model_pretty_name=model_name,
      task_pretty_name="reinforcement-learning",
      task_id="reinforcement-learning",
      metrics_pretty_name="mean_reward",
      metrics_id="mean_reward",
      metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
      dataset_pretty_name=env_id,
      dataset_id=env_id,
  )

  # Merge both dictionaries
  metadata = {**metadata, **eval_result}

  return metadata


def _save_model_card(local_path, generated_model_card, metadata):
    """Save the model card for the repository.
    :param local_path: repository directory
    :param generated_model_card: model card generated by _generate_model_card()
    :param metadata: metadata
    """
    readme_path = local_path / "README.md"
    readme = ""
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
            readme = f.read()
    else:
        readme = generated_model_card

    with readme_path.open("w", encoding="utf-8") as f:
        f.write(readme)

    # Save our metrics to the README metadata
    metadata_save(readme_path, metadata)


def _add_logdir(local_path: Path, logdir: Path):
  """Add a log directory to the repository.
  :param local_path: repository directory
  :param logdir: log directory
  """
  if logdir.exists() and logdir.is_dir():
    # Add the logdir to the repository under a new directory called logs
    repo_logdir = local_path / "logs"

    # Delete the current logs if they exist
    if repo_logdir.exists():
      shutil.rmtree(repo_logdir)

    # Copy logdir into the repository's logs directory
    shutil.copytree(logdir, repo_logdir)

- Finally, we call this function at the end of the PPO training.

In [None]:
# Create the evaluation environment
eval_env = gym.make(args.env_id)

package_to_hub(repo_id = args.repo_id,
                model = agent, # The model we want to save
                hyperparameters = args,
                eval_env = eval_env, # Reuse the environment created above
                logs= f"runs/{run_name}",
                )

- Here's what the final ppo.py file looks like

In [None]:
# Docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy

import argparse
import os
import random
import time
from distutils.util import strtobool

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.categorical import Categorical
from torch.utils.tensorboard import SummaryWriter

from huggingface_hub import HfApi, upload_folder
from huggingface_hub.repocard import metadata_eval_result, metadata_save

from pathlib import Path
import datetime
import tempfile
import json
import shutil
import imageio

from wasabi import Printer
msg = Printer()

def parse_args():
    # fmt: off
    parser = argparse.ArgumentParser()
    # splitext removes the ".py" suffix; rstrip(".py") would strip any trailing '.', 'p' or 'y' characters
    parser.add_argument("--exp-name", type=str, default=os.path.splitext(os.path.basename(__file__))[0],
        help="the name of this experiment")
    parser.add_argument("--seed", type=int, default=1,
        help="seed of the experiment")
    parser.add_argument("--torch-deterministic", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="if toggled, sets `torch.backends.cudnn.deterministic=True`")
    parser.add_argument("--cuda", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="if toggled, cuda will be enabled by default")
    parser.add_argument("--track", type=lambda x: bool(strtobool(x)), default=False, nargs="?", const=True,
        help="if toggled, this experiment will be tracked with Weights and Biases")
    parser.add_argument("--wandb-project-name", type=str, default="cleanRL",
        help="the wandb project name")
    parser.add_argument("--wandb-entity", type=str, default=None,
        help="the entity (team) of the wandb project")
    parser.add_argument("--capture-video", type=lambda x: bool(strtobool(x)), default=False, nargs="?", const=True,
        help="whether to capture videos of the agent's performance (check out the `videos` folder)")

    # Algorithm-specific arguments
    parser.add_argument("--env-id", type=str, default="CartPole-v1",
        help="the id of the environment")
    parser.add_argument("--total-timesteps", type=int, default=50000,
        help="total timesteps of the experiment")
    parser.add_argument("--learning-rate", type=float, default=2.5e-4,
        help="the learning rate of the optimizer")
    parser.add_argument("--num-envs", type=int, default=4,
        help="the number of parallel game environments")
    parser.add_argument("--num-steps", type=int, default=128,
        help="the number of steps to run in each environment per policy rollout")
    parser.add_argument("--anneal-lr", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="toggle learning rate annealing for the policy and value networks")
    parser.add_argument("--gae", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="use GAE for advantage computation")
    parser.add_argument("--gamma", type=float, default=0.99,
        help="the discount factor gamma")
    parser.add_argument("--gae-lambda", type=float, default=0.95,
        help="the lambda for generalized advantage estimation")
    parser.add_argument("--num-minibatches", type=int, default=4,
        help="the number of mini-batches")
    parser.add_argument("--update-epochs", type=int, default=4,
        help="the K epochs to update the policy")
    parser.add_argument("--norm-adv", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="toggle advantage normalization")
    parser.add_argument("--clip-coef", type=float, default=0.2,
        help="the surrogate clipping coefficient")
    parser.add_argument("--clip-vloss", type=lambda x: bool(strtobool(x)), default=True, nargs="?", const=True,
        help="toggle whether to use a clipped loss for the value function, as per the paper")
    parser.add_argument("--ent-coef", type=float, default=0.01,
        help="coefficient of the entropy")
    parser.add_argument("--vf-coef", type=float, default=0.5,
        help="coefficient of the value function")
    parser.add_argument("--max-grad-norm", type=float, default=0.5,
        help="the maximum norm for gradient clipping")
    parser.add_argument("--target-kl", type=float, default=None,
        help="the target KL divergence threshold")

    # Adding the Hugging Face argument
    parser.add_argument("--repo-id", type=str, default="ThomasSimonini/ppo-CartPole-v1", help="id of the model repository on the Hugging Face Hub {username/repo_name}")

    args = parser.parse_args()
    args.batch_size = int(args.num_envs * args.num_steps)
    args.minibatch_size = int(args.batch_size // args.num_minibatches)
    # fmt: on
    return args

def package_to_hub(repo_id,
                model,
                hyperparameters,
                eval_env,
                video_fps=30,
                commit_message="Push agent to the Hub",
                token= None,
                logs=None
                ):
  """
  Evaluate, generate a video, and upload a model to the Hugging Face Hub.
  This method runs the complete pipeline:
  - It evaluates the model
  - It generates the model card
  - It generates a replay video of the agent
  - It pushes everything to the Hub
  :param repo_id: id of the model repository on the Hugging Face Hub
  :param model: trained model
  :param hyperparameters: training hyperparameters (the argparse namespace)
  :param eval_env: environment used to evaluate the agent
  :param video_fps: frames per second of the rendered video
  :param commit_message: commit message
  :param token: Hugging Face access token (optional)
  :param logs: local directory of TensorBoard logs you'd like to upload
  """
  msg.info(
        "This function will save, evaluate, generate a video of your agent, "
        "create a model card and push everything to the Hub. "
        "It might take up to 1 min. \n "
        "This is a work in progress: if you encounter a bug, please open an issue."
    )
  # Step 1: Clone or create the repo
  repo_url = HfApi().create_repo(
        repo_id=repo_id,
        token=token,
        private=False,
        exist_ok=True,
    )

  with tempfile.TemporaryDirectory() as tmpdirname:
    tmpdirname = Path(tmpdirname)

    # Step 2: Save the model
    torch.save(model.state_dict(), tmpdirname / "model.pt")

    # Step 3: Evaluate the model and build the JSON file
    mean_reward, std_reward = _evaluate_agent(eval_env,
                                           10,
                                           model)

    # First get the datetime
    eval_datetime = datetime.datetime.now()
    eval_form_datetime = eval_datetime.isoformat()

    evaluate_data = {
        "env_id": hyperparameters.env_id,
        "mean_reward": mean_reward,
        "std_reward": std_reward,
        "n_evaluation_episodes": 10,
        "eval_datetime": eval_form_datetime,
    }

    # Write a JSON file
    with open(tmpdirname / "results.json", "w") as outfile:
      json.dump(evaluate_data, outfile)

    # Step 4: Generate a video
    video_path = tmpdirname / "replay.mp4"
    record_video(eval_env, model, video_path, video_fps)

    # Step 5: Generate the model card
    generated_model_card, metadata = _generate_model_card("PPO", hyperparameters.env_id, mean_reward, std_reward, hyperparameters)
    _save_model_card(tmpdirname, generated_model_card, metadata)

    # Step 6: Add logs if needed
    if logs:
      _add_logdir(tmpdirname, Path(logs))

    msg.info(f"Pushing repo {repo_id} to the Hugging Face Hub")

    repo_url = upload_folder(
            repo_id=repo_id,
            folder_path=tmpdirname,
            path_in_repo="",
            commit_message=commit_message,
            token=token,
        )

    msg.info(f"Your model is pushed to the Hub. You can view your model here: {repo_url}")
  return repo_url

def _evaluate_agent(env, n_eval_episodes, policy):
  """
  Evaluate the agent for n_eval_episodes episodes and return the mean reward and the standard deviation of the reward.
  :param env: the evaluation environment
  :param n_eval_episodes: number of episodes to evaluate the agent
  :param policy: the agent
  """
  episode_rewards = []
  for episode in range(n_eval_episodes):
    state = env.reset()
    done = False
    total_rewards_ep = 0

    # `not done` also works when the environment returns a NumPy bool
    while not done:
      state = torch.Tensor(state).to(device)
      action, _, _, _ = policy.get_action_and_value(state)
      new_state, reward, done, info = env.step(action.cpu().numpy())
      total_rewards_ep += reward
      if done:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward


def record_video(env, policy, out_directory, fps=30):
  images = []
  done = False
  state = env.reset()
  img = env.render(mode='rgb_array')
  images.append(img)
  while not done:
    state = torch.Tensor(state).to(device)
    # Sample an action from the policy for the current state
    action, _, _, _  = policy.get_action_and_value(state)
    state, reward, done, info = env.step(action.cpu().numpy()) # For the recording logic, we directly assign next_state to state
    img = env.render(mode='rgb_array')
    images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)


def _generate_model_card(model_name, env_id, mean_reward, std_reward, hyperparameters):
  """
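  Generate the model card for the Hub
  :param model_name: name of the model
  :param env_id: name of the environment
  :param mean_reward: mean reward of the agent
  :param std_reward: standard deviation of the agent's reward
  :param hyperparameters: training arguments
  """
  # Step 1: Select the tags
  metadata = generate_metadata(model_name, env_id, mean_reward, std_reward)

  # Convert the hyperparameter namespace to a string
  converted_dict = vars(hyperparameters)
  converted_str = str(converted_dict)
  converted_str = converted_str.split(", ")
  converted_str = '\n'.join(converted_str)

  # Step 2: Generate the model card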
  model_card = f"""
  # PPO Agent Playing {env_id}

  This is a trained model of a PPO agent playing {env_id}.

  # Hyperparameters
  ```python
  {converted_str}
  ```
  """
  return model_card, metadata

def generate_metadata(model_name, env_id, mean_reward, std_reward):
  """
  Define the tags for the model card
  :param model_name: name of the model
  :param env_id: name of the environment
  :param mean_reward: mean reward of the agent
  :param std_reward: standard deviation of the agent's reward
  """
  metadata = {}
  metadata["tags"] = [
        env_id,
        "ppo",
        "deep-reinforcement-learning",
        "reinforcement-learning",
        "custom-implementation",
        "deep-rl-course"
  ]

  # Add metrics (named eval_result to avoid shadowing the built-in eval)
  eval_result = metadata_eval_result(
      model_pretty_name=model_name,
      task_pretty_name="reinforcement-learning",
      task_id="reinforcement-learning",
      metrics_pretty_name="mean_reward",
      metrics_id="mean_reward",
      metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
      dataset_pretty_name=env_id,
      dataset_id=env_id,
  )

  # Merge both dictionaries
  metadata = {**metadata, **eval_result}

  return metadata

def _save_model_card(local_path, generated_model_card, metadata):
    """Save the model card for the repository.
    :param local_path: repository directory
    :param generated_model_card: model card generated by _generate_model_card()
    :param metadata: metadata
    """
    readme_path = local_path / "README.md"
    readme = ""
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
            readme = f.read()
    else:
        readme = generated_model_card

    with readme_path.open("w", encoding="utf-8") as f:
        f.write(readme)

    # Save our metrics to the README metadata
    metadata_save(readme_path, metadata)

def _add_logdir(local_path: Path, logdir: Path):
  """Add a log directory to the repository.
  :param local_path: repository directory
  :param logdir: log directory
  """
  if logdir.exists() and logdir.is_dir():
    # Add the logdir to the repository under a new directory called logs
    repo_logdir = local_path / "logs"

    # Delete the current logs if they exist
    if repo_logdir.exists():
      shutil.rmtree(repo_logdir)

    # Copy logdir into the repository's logs directory
    shutil.copytree(logdir, repo_logdir)

def make_env(env_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        env.seed(seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
        return env

    return thunk


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, envs.single_action_space.n), std=0.01),
        )

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        logits = self.actor(x)
        probs = Categorical(logits=logits)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action), probs.entropy(), self.critic(x)


if __name__ == "__main__":
    args = parse_args()
    run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
    if args.track:
        import wandb

        wandb.init(
            project=args.wandb_project_name,
            entity=args.wandb_entity,
            sync_tensorboard=True,
            config=vars(args),
            name=run_name,
            monitor_gym=True,
            save_code=True,
        )
    writer = SummaryWriter(f"runs/{run_name}")
    writer.add_text(
        "hyperparameters",
        "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()])),
    )

    # TRY NOT TO MODIFY: seeding
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.backends.cudnn.deterministic = args.torch_deterministic

    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")

    # Environment setup
    envs = gym.vector.SyncVectorEnv(
        [make_env(args.env_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]
    )
    assert isinstance(envs.single_action_space, gym.spaces.Discrete), "only discrete action spaces are supported"

    agent = Agent(envs).to(device)
    optimizer = optim.Adam(agent.parameters(), lr=args.learning_rate, eps=1e-5)

    # ALGO LOGIC: storage setup
    obs = torch.zeros((args.num_steps, args.num_envs) + envs.single_observation_space.shape).to(device)
    actions = torch.zeros((args.num_steps, args.num_envs) + envs.single_action_space.shape).to(device)
    logprobs = torch.zeros((args.num_steps, args.num_envs)).to(device)
    rewards = torch.zeros((args.num_steps, args.num_envs)).to(device)
    dones = torch.zeros((args.num_steps, args.num_envs)).to(device)
    values = torch.zeros((args.num_steps, args.num_envs)).to(device)

    # TRY NOT TO MODIFY: start the game
    global_step = 0
    start_time = time.time()
    next_obs = torch.Tensor(envs.reset()).to(device)
    next_done = torch.zeros(args.num_envs).to(device)
    num_updates = args.total_timesteps // args.batch_size

    for update in range(1, num_updates + 1):
        # Anneal the learning rate if instructed to do so.
        if args.anneal_lr:
            frac = 1.0 - (update - 1.0) / num_updates
            lrnow = frac * args.learning_rate
            optimizer.param_groups[0]["lr"] = lrnow

        for step in range(0, args.num_steps):
            global_step += 1 * args.num_envs
            obs[step] = next_obs
            dones[step] = next_done

            # ALGO LOGIC: action logic
            with torch.no_grad():
                action, logprob, _, value = agent.get_action_and_value(next_obs)
                values[step] = value.flatten()
            actions[step] = action
            logprobs[step] = logprob

            # TRY NOT TO MODIFY: execute the game and log data.
            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

            for item in info:
                if "episode" in item.keys():
                    print(f"global_step={global_step}, episodic_return={item['episode']['r']}")
                    writer.add_scalar("charts/episodic_return", item["episode"]["r"], global_step)
                    writer.add_scalar("charts/episodic_length", item["episode"]["l"], global_step)
                    break

        # Bootstrap the value if the episode is not done
        with torch.no_grad():
            next_value = agent.get_value(next_obs).reshape(1, -1)
            if args.gae:
                advantages = torch.zeros_like(rewards).to(device)
                lastgaelam = 0
                for t in reversed(range(args.num_steps)):
                    if t == args.num_steps - 1:
                        nextnonterminal = 1.0 - next_done
                        nextvalues = next_value
                    else:
                        nextnonterminal = 1.0 - dones[t + 1]
                        nextvalues = values[t + 1]
                    delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
                    advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
                returns = advantages + values
            else:
                returns = torch.zeros_like(rewards).to(device)
                for t in reversed(range(args.num_steps)):
                    if t == args.num_steps - 1:
                        nextnonterminal = 1.0 - next_done
                        next_return = next_value
                    else:
                        nextnonterminal = 1.0 - dones[t + 1]
                        next_return = returns[t + 1]
                    returns[t] = rewards[t] + args.gamma * nextnonterminal * next_return
                advantages = returns - values

        # Flatten the batch
        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

        # Optimize the policy and value networks
        b_inds = np.arange(args.batch_size)
        clipfracs = []
        for epoch in range(args.update_epochs):
            np.random.shuffle(b_inds)
            for start in range(0, args.batch_size, args.minibatch_size):
                end = start + args.minibatch_size
                mb_inds = b_inds[start:end]

                _, newlogprob, entropy, newvalue = agent.get_action_and_value(b_obs[mb_inds], b_actions.long()[mb_inds])
                logratio = newlogprob - b_logprobs[mb_inds]
                ratio = logratio.exp()

                with torch.no_grad():
                    # calculate approx_kl http://joschu.net/blog/kl-approx.html
                    old_approx_kl = (-logratio).mean()
                    approx_kl = ((ratio - 1) - logratio).mean()
                    clipfracs += [((ratio - 1.0).abs() > args.clip_coef).float().mean().item()]

                mb_advantages = b_advantages[mb_inds]
                if args.norm_adv:
                    mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)

                # Policy loss
                pg_loss1 = -mb_advantages * ratio
                pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)
                pg_loss = torch.max(pg_loss1, pg_loss2).mean()

                # Value loss
                newvalue = newvalue.view(-1)
                if args.clip_vloss:
                    v_loss_unclipped = (newvalue - b_returns[mb_inds]) ** 2
                    v_clipped = b_values[mb_inds] + torch.clamp(
                        newvalue - b_values[mb_inds],
                        -args.clip_coef,
                        args.clip_coef,
                    )
                    v_loss_clipped = (v_clipped - b_returns[mb_inds]) ** 2
                    v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)
                    v_loss = 0.5 * v_loss_max.mean()
                else:
                    v_loss = 0.5 * ((newvalue - b_returns[mb_inds]) ** 2).mean()

                entropy_loss = entropy.mean()
                loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef

                optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
                optimizer.step()

            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

        y_pred, y_true = b_values.cpu().numpy(), b_returns.cpu().numpy()
        var_y = np.var(y_true)
        explained_var = np.nan if var_y == 0 else 1 - np.var(y_true - y_pred) / var_y

        # TRY NOT TO MODIFY: record rewards for plotting purposes
        writer.add_scalar("charts/learning_rate", optimizer.param_groups[0]["lr"], global_step)
        writer.add_scalar("losses/value_loss", v_loss.item(), global_step)
        writer.add_scalar("losses/policy_loss", pg_loss.item(), global_step)
        writer.add_scalar("losses/entropy", entropy_loss.item(), global_step)
        writer.add_scalar("losses/old_approx_kl", old_approx_kl.item(), global_step)
        writer.add_scalar("losses/approx_kl", approx_kl.item(), global_step)
        writer.add_scalar("losses/clipfrac", np.mean(clipfracs), global_step)
        writer.add_scalar("losses/explained_variance", explained_var, global_step)
        print("SPS:", int(global_step / (time.time() - start_time)))
        writer.add_scalar("charts/SPS", int(global_step / (time.time() - start_time)), global_step)

    envs.close()
    writer.close()

    # Create the evaluation environment
    eval_env = gym.make(args.env_id)

    package_to_hub(
        repo_id=args.repo_id,
        model=agent,  # The model we want to save
        hyperparameters=args,
        eval_env=eval_env,  # Reuse the environment created above instead of making a second one
        logs=f"runs/{run_name}",
    )
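
The `explained_var` value logged in the training loop measures how much of the variance in the returns the value function accounts for: 1 means the predictions match the returns exactly, 0 means they do no better than predicting a constant. The following is a minimal sketch of the same formula using only the standard library (the script itself uses `np.var`); the function name and the sample numbers are illustrative assumptions.

```python
import math
from statistics import pvariance


def explained_variance(y_pred, y_true):
    """Return 1 - Var(y_true - y_pred) / Var(y_true), or NaN if Var(y_true) is 0."""
    var_y = pvariance(y_true)  # population variance, like np.var
    if var_y == 0:
        return math.nan
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / var_y


# Illustrative values only, not taken from a real training run.
returns = [1.0, 2.0, 3.0]
perfect = explained_variance([1.0, 2.0, 3.0], returns)   # predictions equal returns
constant = explained_variance([2.0, 2.0, 2.0], returns)  # always predicts the mean
print(perfect, constant)
```

A value that stays near or below 0 during training suggests the critic is not yet learning useful value estimates.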


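To be able to share your model with the community, there are three more steps to follow:

1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join

2️⃣ Sign in, then store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">

- Copy the token
- Run the cell below and paste the token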

In [7]:
from huggingface_hub import notebook_login
notebook_login()
!git config --global credential.helper store

If you don't want to use Google Colab or a Jupyter Notebook, use this command instead: `huggingface-cli login`
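
When the training script runs unattended, for example on a remote machine, logging in from Python can be more convenient than an interactive prompt. This is a minimal sketch, assuming `huggingface_hub` is installed and that the token has been exported in an environment variable (here `HF_TOKEN`); `login_from_env` is a hypothetical helper written for this example, not part of the library.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hugging Face Hub with a token read from an environment variable.

    Returns True if a token was found and passed to `login`, False otherwise.
    """
    token = os.environ.get(var_name)
    if not token:
        # No token available: the caller can fall back to `huggingface-cli login`.
        return False
    # Imported lazily so the helper can be defined even without the package installed.
    from huggingface_hub import login

    login(token=token)
    return True


# Example usage (uncomment in your own script):
# if not login_from_env():
#     print("No token found; run `huggingface-cli login` instead.")
```

Keeping the token in an environment variable avoids hard-coding it in `ppo.py`, which matters because the script itself is pushed to the Hub.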

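## Let's start the training 🔥
- ⚠️ ⚠️ ⚠️ Don't use **the same repo id as the one you used in Unit 1**
- Now that you've coded PPO from scratch and added the Hugging Face integration, we're ready to start the training 🔥

- First, copy all your code into a file you create called `ppo.py`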

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/step1.png" alt="PPO"/>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/step2.png" alt="PPO"/>

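- Now we just need to run this Python script with `python <name-of-python-script>.py`, followed by the additional parameters we defined using `argparse`

- You should modify more hyperparameters, otherwise the training may not be very stable.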
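
To see how command-line flags reach the training code, recall that `argparse` converts dashes in option names to underscores, so `--ent-coef` becomes `args.ent_coef`. The sketch below is a reduced, illustrative parser and not the full `parse_args` from `ppo.py`; the default values shown are assumptions for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse a small subset of PPO hyperparameters (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--env-id", type=str, default="LunarLander-v2")
    parser.add_argument("--total-timesteps", type=int, default=500000)  # assumed default
    parser.add_argument("--learning-rate", type=float, default=2.5e-4)  # assumed default
    parser.add_argument("--ent-coef", type=float, default=0.01)         # assumed default
    # Passing argv explicitly makes the function easy to try without a shell.
    return parser.parse_args(argv)


# Equivalent to: python ppo.py --total-timesteps=1000000 --ent-coef=0.005
args = parse_args(["--total-timesteps=1000000", "--ent-coef=0.005"])
print(args.total_timesteps, args.ent_coef, args.env_id)
```

Any flag you omit keeps its default, so a run only needs to list the hyperparameters you actually want to change.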

In [12]:
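# --load-model-from: the repo to load a model from in order to continue training
# --ent-coef=0.005, --clip-coef=0.1: new hyperparameters used for fine-tuning
# --total-timesteps: fine-tuning needs fewer steps than training from scratch; about 1 million
#   should be enough for it to stabilize (the command below sets 5000000)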

!python ppo.py --env-id="LunarLander-v2" --repo-id="a1024053774/ppo-LunarLander-v2" --load-model-from="a1024053774/ppo-LunarLander-v2" --total-timesteps=5000000 --num-steps=1024 --learning-rate=2e-4 --ent-coef=0.005 --clip-coef=0.1

global_step=424, episodic_return=-170.8101806640625
global_step=472, episodic_return=-320.469482421875
global_step=476, episodic_return=-81.00131225585938
global_step=484, episodic_return=-20.284530639648438
global_step=700, episodic_return=-95.84405517578125
global_step=784, episodic_return=-81.61933135986328
global_step=844, episodic_return=-123.97444152832031
global_step=948, episodic_return=-169.76583862304688
global_step=1000, episodic_return=-130.58905029296875
global_step=1136, episodic_return=-113.75289154052734
global_step=1196, episodic_return=-540.8772583007812
global_step=1236, episodic_return=-104.046142578125
global_step=1316, episodic_return=-223.84754943847656
global_step=1404, episodic_return=-171.53033447265625
global_step=1560, episodic_return=-101.09183502197266
global_step=1580, episodic_return=-314.45697021484375
global_step=1804, episodic_return=-642.2508544921875
global_step=1816, episodic_return=-153.7117919921875
global_step=1872, episodic_return=-199.86035156

Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.
Traceback (most recent call last):
  File "D:\_DevelopmentCode\Python\RL_Boot\Hugging_Face\notebooks\unit8-PPO\ppo.py", line 578, in <module>
    writer.add_scalar("losses/explained_variance", explained_var, global_step)
  File "D:\_DevelopmentCode\Python\RL_Boot\Hugging_Face\notebooks\unit8-PPO\ppo.py", line 125, in package_to_hub
    # 步骤 1：克隆或创建仓库
  File "D:\DevelopmentSoftware\Anaconda\envs\mlagents_py3_10_12\lib\site-packages\huggingface_hub\utils\_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "D:\DevelopmentSoftware\Anaconda\envs\mlagents_py3_10_12\lib\site-packages\huggingface_hub\utils\_val

Before changing the hyperparameters:
``` plain text
SPS: 977
global_step=48672, episodic_return=-118.76814270019531
global_step=48748, episodic_return=-54.866146087646484
global_step=48896, episodic_return=-135.96490478515625
global_step=49052, episodic_return=-309.20989990234375
global_step=49064, episodic_return=-207.33941650390625
global_step=49120, episodic_return=-118.00939178466797
SPS: 978
global_step=49296, episodic_return=-52.607940673828125
global_step=49364, episodic_return=-84.02847290039062
global_step=49408, episodic_return=-345.39013671875
global_step=49504, episodic_return=-342.02001953125
SPS: 979
```
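- After 500,000 timesteps, the agent still could not solve the task and flew off the screen.

**Changed hyperparameters**
- `total-timesteps`: from 500000 to 5000000
- `ent-coef`: from 0.01 to 0.02
- `num-steps`: changed to 1024
- `learning-rate`: from 2.5e-4 to 2e-4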
``` plain text
global_step=4997268, episodic_return=31.80633544921875
global_step=4997348, episodic_return=235.93096923828125
global_step=4997476, episodic_return=10.417015075683594
SPS: 644
global_step=4997932, episodic_return=-43.06093215942383
global_step=4998080, episodic_return=276.66845703125
SPS: 644
global_step=4998200, episodic_return=206.16268920898438
SPS: 644
global_step=4998692, episodic_return=233.63084411621094
global_step=4998944, episodic_return=236.6766815185547
global_step=4999028, episodic_return=-32.05982208251953
global_step=4999072, episodic_return=275.3636169433594
SPS: 644
global_step=4999456, episodic_return=-24.924705505371094
global_step=4999648, episodic_return=-20.378501892089844
global_step=4999672, episodic_return=23.991310119628906
```
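
Individual episode returns are noisy, so it is easier to compare runs by averaging the last few episodes. This is a minimal sketch that parses the `global_step=..., episodic_return=...` lines printed by the script; `parse_returns` is a hypothetical helper, and the sample text is three lines copied from the log above.

```python
import re
from statistics import mean

# Matches lines such as: global_step=4998692, episodic_return=233.63084411621094
LINE_RE = re.compile(r"global_step=(\d+), episodic_return=(-?\d+(?:\.\d+)?)")


def parse_returns(log_text):
    """Return a list of (global_step, episodic_return) pairs found in the log text."""
    return [(int(step), float(ret)) for step, ret in LINE_RE.findall(log_text)]


sample_log = """SPS: 644
global_step=4998692, episodic_return=233.63084411621094
global_step=4998944, episodic_return=236.6766815185547
global_step=4999028, episodic_return=-32.05982208251953
"""

episodes = parse_returns(sample_log)  # "SPS:" lines are ignored by the regex
average_return = mean(ret for _, ret in episodes)
print(f"{len(episodes)} episodes, mean return {average_return:.2f}")
```

LunarLander is usually considered solved at a return of about 200, so a mean well below that, as in this small sample, suggests the policy still has unstable episodes even when some of them land successfully.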

See you in Unit 8, Part 2, where we're going to train agents to play Doom 🔥
## Keep learning, stay awesome 🤗