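Reinforce is a policy-based method: a deep reinforcement learning algorithm that tries to optimize the policy directly, without using an action-value function. It is a Monte Carlo policy gradient method. More precisely, Reinforce optimizes the policy directly by using gradient ascent to estimate the weights of the optimal policy.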

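Goals of this notebook:

Be able to code the Reinforce algorithm from scratch using PyTorch.

Be able to test the robustness of your agent using a simple environment.

Be able to share your trained agent with others, together with a video replay and an evaluation score.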

### Import the modules

In [1]:
import numpy as np

from collections import deque

import matplotlib.pyplot as plt
%matplotlib inline

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

# Gym
import gym
#import gym_pygame

In [6]:
# Check whether a GPU is available

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cpu


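### Create the CartPole environment and understand how it works

The episode ends if any of the following conditions is met:

The pole angle is outside ±12°

The cart position is outside ±2.4

The episode length is greater than 500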
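
The three conditions above can be expressed as a small check. This is only an illustrative sketch, not the environment's actual implementation: the function name `episode_status`, the constant names, and the choice to flag the length limit once the step count reaches 500 are all assumptions.

```python
import math

# Thresholds taken from the termination conditions listed above.
ANGLE_LIMIT_RAD = math.radians(12)  # pole angle limit of 12 degrees, in radians
POSITION_LIMIT = 2.4                # cart position limit
MAX_EPISODE_STEPS = 500             # episode length limit

def episode_status(cart_position, pole_angle_rad, step_count):
    """Return (terminated, truncated) for a hypothetical CartPole-like state.

    terminated: the pole fell too far or the cart left the track.
    truncated:  the episode hit the length limit (assumed: step_count >= 500).
    """
    terminated = (abs(cart_position) > POSITION_LIMIT
                  or abs(pole_angle_rad) > ANGLE_LIMIT_RAD)
    truncated = step_count >= MAX_EPISODE_STEPS
    return terminated, truncated

print(episode_status(0.0, 0.05, 10))   # healthy state early in the episode
print(episode_status(2.5, 0.0, 10))    # cart has left the allowed range
print(episode_status(0.0, 0.0, 500))   # length limit reached
```

Keeping the two signals separate mirrors the `terminated` and `truncated` values that `env.step` returns later in this notebook.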

In [2]:
env_id = "CartPole-v1"
# Create the env
env = gym.make(env_id)

# Create the evaluation env
eval_env = gym.make(env_id)

# Get the state space and action space
s_size = env.observation_space.shape[0]
print(s_size)
a_size = env.action_space.n
print(a_size)

4
2


In [3]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation

print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action

_____OBSERVATION SPACE_____ 

The State Space is:  4
Sample observation [-1.8286734e+00 -2.7030667e+38 -3.4430575e-01 -8.4396607e+37]

 _____ACTION SPACE_____ 

The Action Space is:  2
Action Space Sample 1


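### Build the Reinforce architecture

REINFORCE is a policy-based deep reinforcement learning algorithm that optimizes the policy directly to maximize the expected return. It is a Monte Carlo method because it relies on complete sampled episodes to update the policy. The main components and workflow of REINFORCE are:

1. **Policy representation**: REINFORCE operates on a parameterized policy, usually written πθ(a|s), where s is the state, a is the action, and θ are the policy parameters. The policy can be represented by any model with differentiable parameters, such as a neural network.

2. **Objective function**: The goal is to maximize the expected return, so the quality of a policy is measured by the return it obtains. The return is the cumulative reward collected from a given state onward while acting according to πθ until a terminal state is reached.

3. **Gradient ascent**: To maximize the objective, REINFORCE updates the policy parameters θ by gradient ascent. The policy gradient theorem gives a way to compute the gradient of the objective, which is estimated from sampled returns.

4. **Policy gradient theorem**: According to the policy gradient theorem, the gradient of the objective can be written as the expectation of the return multiplied by the gradient of the log-probability of the chosen action, ∇θ log πθ(a|s). The policy is therefore improved by increasing the probability of actions that led to high returns and decreasing the probability of actions that led to low returns.

5. **Algorithm steps**:
   - Initialize the policy parameters θ.
   - Run one or more complete episodes in the environment with the current policy πθ, collecting the sequence of states, actions, and rewards.
   - For each episode, compute the return at every step (the sum of rewards from that step to the end of the episode).
   - For every step, compute the policy gradient and use it to update θ. The update adds the step size (learning rate) times the policy gradient to θ, in order to maximize the return.
   - Repeat until the policy converges (the change in θ becomes very small) or a preset number of iterations is reached.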
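
The steps above can be summarized in a few formulas. The notation here (the objective J, the return G_t, the episode length T, the discount factor γ, and the learning rate α) is assumed for illustration, and the gradient is written in the single-episode form commonly used in practice, which omits the extra γ^t factor of the exact discounted gradient.

```latex
\begin{aligned}
G_t &= \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}
  && \text{(return from step } t \text{ to the end of the episode)} \\
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ G_0 \right]
  && \text{(objective: expected return)} \\
\nabla_\theta J(\theta) &\approx \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
  && \text{(Monte Carlo estimate from one episode)} \\
\theta &\leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
  && \text{(gradient ascent step)}
\end{aligned}
```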

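A key property of REINFORCE is that it learns directly from experience, with no need to model the environment's dynamics (that is, the transition probabilities and the reward function). However, its gradient estimates can have high variance, so in practice a baseline or another variance-reduction technique is usually needed to improve learning efficiency and stability.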
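
As a rough illustration of the baseline idea, the sketch below subtracts the mean return of an episode from each return before it is used to weight the log-probabilities. The function name `subtract_mean_baseline` and the choice of the episode mean as the baseline are assumptions made for this example; other baselines (such as a learned value function) are possible.

```python
from statistics import mean

def subtract_mean_baseline(returns):
    """Center the returns of one episode around zero.

    Using the episode mean as the baseline is an assumption for this sketch.
    Steps with above-average return keep a positive weight; steps with
    below-average return get a negative weight.
    """
    baseline = mean(returns)
    return [g - baseline for g in returns]

# Hypothetical returns for a three-step episode with reward 1 per step and gamma = 1.
returns = [3.0, 2.0, 1.0]
print(subtract_mean_baseline(returns))  # [1.0, 0.0, -1.0]
```

Without the baseline every step in this example would push its action's probability up, only by different amounts; after centering, the weights have both signs, which tends to reduce the variance of the update.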

In [None]:
# A common mistake

class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)
    
    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = np.argmax(m)
        return action.item(), m.log_prob(action)

There is a bug here. You can find it by calling:

debug_policy = Policy(s_size, a_size, 64).to(device)

debug_policy.act(env.reset())

In [None]:
debug_policy = Policy(s_size, a_size, 64).to(device)
debug_policy.act(env.reset())

In [9]:
# The correct version of the Policy class

class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)
    
    def act(self, state):
        # If env.reset() returned a tuple, keep only the first element (assumed to be the observation array)
        if isinstance(state, tuple):
            state = state[0]
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)

In [10]:
debug_policy = Policy(s_size, a_size, 64).to(device)
debug_policy.act(env.reset())

(1, tensor([-0.6924], grad_fn=<SqueezeBackward1>))
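
Compared with the buggy version, the corrected `act` unwraps a tuple returned by `env.reset()` and draws the action with `m.sample()` instead of `np.argmax(m)`. The second change matters because REINFORCE needs actions drawn from the policy's distribution. The pure-Python sketch below illustrates the difference without PyTorch; the helper names `greedy_action` and `sampled_action` and the probabilities are made up for the example.

```python
import math
import random

def greedy_action(probs):
    """Always pick the most probable action, so the agent never explores."""
    return max(range(len(probs)), key=lambda a: probs[a])

def sampled_action(probs, rng):
    """Draw an action according to its probability and return it with its log-probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

probs = [0.7, 0.3]       # hypothetical policy output for a single state
rng = random.Random(0)   # seeded so the run is repeatable

# Collect the distinct actions each strategy produces over many draws.
greedy = {greedy_action(probs) for _ in range(1000)}
sampled = {sampled_action(probs, rng)[0] for _ in range(1000)}

print(greedy)   # only action 0 is ever chosen
print(sampled)  # both actions appear
```

Sampling lets the less likely action be tried occasionally, and the returned log-probability is the quantity the training loop later weights by the return.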

### Train the algorithm

In [15]:
def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):
    # Help us to calculate the score during the training
    scores_deque = deque(maxlen=100)
    scores = []
    # Line 3 of pseudocode
    for i_episode in range(1, n_training_episodes+1):
        saved_log_probs = []
        rewards = []
        state = env.reset()
        # Line 4 of pseudocode
        for t in range(max_t):
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            state, reward, terminated, truncated, info = env.step(action)
            rewards.append(reward)
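            # Only `terminated` is checked here, so the time-limit signal (`truncated`)
            # is ignored and an episode can run for up to max_t steps.
            if terminated:
                break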
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))
        
        # Line 6 of pseudocode: calculate the return
        returns = deque(maxlen=max_t) 
        n_steps = len(rewards) 
        # Compute the discounted return at each timestep in O(N) time,
        # where N is the number of timesteps.
        #
        # The discounted return is defined as
        #     G_t = r_(t+1) + gamma*r_(t+2) + gamma^2*r_(t+3) + ...
        # (see page 44 of Sutton & Barto, 2017 2nd draft), which gives the recursion
        #     G_t     = r_(t+1) + gamma*G_(t+1)
        #     G_(t-1) = r_t     + gamma*G_t
        # (see also page 46 of Sutton & Barto, 2017 2nd draft).
        #
        # This is a dynamic-programming approach: each return reuses the already
        # computed future return instead of recomputing the whole sum.
        #
        # In terms of the variables used here, with the return past the last step taken as 0:
        #     returns[t] = rewards[t] + gamma * returns[t+1]
        #
        # We iterate from the last timestep to the first so that returns[t+1] is
        # always available when returns[t] is computed.
        #
        # The deque "returns" therefore ends up in chronological order, from t=0 to
        # t=n_steps-1. appendleft() inserts at position 0 in constant time O(1),
        # whereas inserting at the front of a regular Python list takes O(N).
        for t in range(n_steps)[::-1]:
            disc_return_t = (returns[0] if len(returns)>0 else 0)
            returns.appendleft( gamma*disc_return_t + rewards[t]   )    
            
        ## Standardizing the returns makes training more stable
        eps = np.finfo(np.float32).eps.item()
        ## eps is the float32 machine epsilon (the gap between 1.0 and the next
        # representable float32). It is added to the standard deviation of the
        # returns to avoid division by zero.
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + eps)
        
        # Line 7:
        policy_loss = []
        for log_prob, disc_return in zip(saved_log_probs, returns):
            policy_loss.append(-log_prob * disc_return)
        policy_loss = torch.cat(policy_loss).sum()
        
        # Line 8: PyTorch prefers gradient descent 
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        
        if i_episode % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
        
    return scores
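
The return computation and standardization inside `reinforce` can be tried in isolation. The following is a minimal pure-Python sketch rather than the notebook's code: the helper names `discounted_returns` and `standardize` and the default `eps` value are assumptions made for illustration, and `statistics.stdev` stands in for `torch.Tensor.std` (both use the sample standard deviation by default).

```python
from collections import deque
from statistics import mean, stdev

def discounted_returns(rewards, gamma):
    """Compute G_t for every step by walking the rewards from last to first."""
    returns = deque()
    for r in reversed(rewards):
        # The return of the following step is at the front of the deque
        # (0.0 when we are at the last step of the episode).
        next_return = returns[0] if returns else 0.0
        returns.appendleft(r + gamma * next_return)
    return list(returns)

def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to unit sample standard deviation.

    eps is a hypothetical small constant that avoids division by zero.
    Needs at least two values, because the sample standard deviation
    is undefined for a single value.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

rewards = [1.0, 1.0, 1.0]  # a hypothetical three-step episode
print(discounted_returns(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
print(discounted_returns(rewards, gamma=1.0))  # [3.0, 2.0, 1.0]
print(standardize(discounted_returns(rewards, gamma=1.0)))
```

With `gamma=1.0`, as in the CartPole hyperparameters of this notebook, each return is simply the number of remaining steps, so earlier actions in a long episode receive larger weights before standardization.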

In [16]:
cartpole_hyperparameters = {
    "h_size": 16,
    "n_training_episodes": 1000,
    "n_evaluation_episodes": 10,
    "max_t": 1000,
    "gamma": 1.0,
    "lr": 1e-2,
    "env_id": env_id,
    "state_space": s_size,
    "action_space": a_size,
}

In [17]:
# Create policy and place it to the device
cartpole_policy = Policy(cartpole_hyperparameters["state_space"], cartpole_hyperparameters["action_space"], cartpole_hyperparameters["h_size"]).to(device)
cartpole_optimizer = optim.Adam(cartpole_policy.parameters(), lr=cartpole_hyperparameters["lr"])

In [18]:
scores = reinforce(cartpole_policy,
                   cartpole_optimizer,
                   cartpole_hyperparameters["n_training_episodes"], 
                   cartpole_hyperparameters["max_t"],
                   cartpole_hyperparameters["gamma"], 
                   100)

Episode 100	Average Score: 42.95
Episode 200	Average Score: 411.89
Episode 300	Average Score: 638.59
Episode 400	Average Score: 967.33
Episode 500	Average Score: 995.89
Episode 600	Average Score: 793.57
Episode 700	Average Score: 482.94
Episode 800	Average Score: 649.18
Episode 900	Average Score: 1000.00
Episode 1000	Average Score: 1000.00


### Define the evaluation method

In [23]:
def evaluate_agent(env, max_steps, n_eval_episodes, policy):
  """
  Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode rewards.
  :param env: The evaluation environment
  :param max_steps: Maximum number of steps per episode
  :param n_eval_episodes: Number of episodes to evaluate the agent
  :param policy: The Reinforce agent
  """
  episode_rewards = []
  for episode in range(n_eval_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards_ep = 0
    
    for step in range(max_steps):
      action, _ = policy.act(state)
      new_state, reward, terminated, truncated, info = env.step(action)
      total_rewards_ep += reward
        
      if terminated:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward

In [24]:
evaluate_agent(eval_env, 
               cartpole_hyperparameters["max_t"], 
               cartpole_hyperparameters["n_evaluation_episodes"],
               cartpole_policy)

(1000.0, 0.0)