> PPO: the default RL algorithm at OpenAI

In [2]:
from IPython.display import Image

## math tricks

$$
\nabla f(x)=f(x)\nabla \log f(x)
$$
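
The identity follows from the chain rule applied to $\log f$, assuming $f(x)>0$ so that the logarithm is defined:

$$
\nabla \log f(x)=\frac{\nabla f(x)}{f(x)}\quad\Longrightarrow\quad\nabla f(x)=f(x)\nabla \log f(x)
$$

As a sketch of why this matters for policy gradients (here $p_\theta(\tau)$ and $R(\tau)$ are notation introduced only for illustration: the probability of a trajectory under the policy and its total return), the trick turns the gradient of an expectation into an expectation that can be estimated from samples:

$$
\nabla_\theta E_{\tau\sim p_\theta}\left[R(\tau)\right]=\int \nabla_\theta p_\theta(\tau)R(\tau)\,d\tau=\int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)R(\tau)\,d\tau=E_{\tau\sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)R(\tau)\right]
$$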

## DQN -> TRPO -> PPO

- DQN (2013)
    - unstable & off-policy method
- TRPO (2015): Trust Region Policy Optimization
- PPO (2017): Proximal Policy Optimization

## policy gradient

$$
\begin{split}
\nabla_\theta J(\pi_\theta)=E_{\tau\sim \pi}\left[\sum_{t=0}^T\nabla_\theta\log \pi_\theta(a_t|s_t)G_t\right]\\
G_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2} + \cdots = \sum_{k=t}^T\gamma^{k-t}R_k
\end{split}
$$

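- $G_t$: reward-to-go (RTG).
    - $R_k$: the immediate reward at time step $k$;
    - the discounted cumulative sum of all rewards from the current time step $t$ up to the final time step $T$;
    - commonly used in policy gradient methods;
    - when computing the advantage of each action, subtracting a baseline from the RTG gives a more effective estimate of the action's value.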

```
import torch

def compute_rtgs(self, batch_rews):
    # Rewards-to-go (RTG) for every timestep in the batch.
    # The returned tensor has shape (number of timesteps in the batch,).
    batch_rtgs = []
    # Iterate over episodes, and over the rewards inside each episode, backwards;
    # inserting at the front restores the original chronological order.
    for ep_rews in reversed(batch_rews):
        discounted_reward = 0  # discounted return accumulated so far
        for rew in reversed(ep_rews):
            # G_t = R_t + gamma * G_{t+1}
            discounted_reward = rew + discounted_reward * self.gamma
            batch_rtgs.insert(0, discounted_reward)
    # Convert the rewards-to-go into a tensor
    batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float)
    return batch_rtgs
```
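
To make the recursion $G_t=R_t+\gamma G_{t+1}$ concrete, here is a minimal standalone sketch for a single episode, using plain Python lists instead of tensors. The function name `rewards_to_go` and the toy reward values are made up for illustration and are not part of the code above.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, ..., G_T] for one episode, where G_t = sum_k gamma^(k-t) * R_k."""
    rtgs = []
    running = 0.0
    # Walk backwards so each step reuses the discounted tail: G_t = R_t + gamma * G_{t+1}
    for r in reversed(rewards):
        running = r + gamma * running
        rtgs.append(running)
    # The values were collected last-to-first, so flip them back to chronological order
    rtgs.reverse()
    return rtgs


# Hypothetical episode: three steps, reward 1 at each step, gamma = 0.5
episode = [1.0, 1.0, 1.0]
print(rewards_to_go(episode, gamma=0.5))  # [1.75, 1.5, 1.0]
```

Appending and reversing once keeps the loop linear in the episode length, whereas repeated `insert(0, ...)` on a list is quadratic; for small batches the difference is negligible.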

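- Advantage function: how much better taking action $a$ in state $s$ is than the critic's baseline value of $s$. In practice $Q^\pi(s,a)$ is estimated by the reward-to-go $G_t$.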

$$
A^\pi(s,a)=Q^\pi(s,a)-V_{\phi_k}(s)
$$


```
def evaluate(self, batch_obs):
    # Query critic network for a value V for each obs in batch_obs.
    V = self.critic(batch_obs).squeeze()
    return V
  
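# Calculate V_{phi, k}(s) for every observation in the batch
V = self.evaluate(batch_obs)
# ALG STEP 5
# Calculate advantage: batch_rtgs is a Monte Carlo estimate of Q^pi(s, a).
# V is detached so the advantage acts as a constant in the actor loss.
A_k = batch_rtgs - V.detach()

# Normalize advantages (the small epsilon guards against division by zero)
A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10)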
```
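
The effect of the two steps above (subtract the baseline, then normalize) can be checked on toy numbers. The following is a small sketch in plain Python; `normalized_advantages` and the critic values are hypothetical and chosen only for illustration. It uses the sample standard deviation, which matches the default behaviour of `torch.Tensor.std()`, and it needs at least two timesteps.

```python
import statistics


def normalized_advantages(rtgs, values, eps=1e-10):
    # Raw advantage: Monte Carlo return minus the critic's baseline V(s_t)
    adv = [g - v for g, v in zip(rtgs, values)]
    mean = statistics.mean(adv)
    # Sample standard deviation (divides by n - 1); requires len(adv) >= 2
    std = statistics.stdev(adv)
    # After this step the advantages have mean ~0 and standard deviation ~1
    return [(a - mean) / (std + eps) for a in adv]


rtgs = [1.75, 1.5, 1.0]    # rewards-to-go for three timesteps
values = [1.0, 1.0, 1.0]   # hypothetical critic outputs V(s_t)
adv = normalized_advantages(rtgs, values)
print(adv)
```

Normalization keeps the ordering of the advantages but rescales them, so the size of the policy update depends less on the raw reward scale.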

## on policy vs. off policy

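- Policy gradient methods are generally on-policy. PPO uses importance sampling to reuse samples collected by the old policy for several gradient updates, which makes those updates slightly off-policy (PPO itself is still usually classified as an on-policy algorithm).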
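
Importance sampling can be illustrated with a toy discrete example: samples are drawn from an old policy, and each one is reweighted by the probability ratio new/old to estimate an expectation under the new policy. This is only a sketch; the three-action policies, the per-action values in `f`, and all names are made up for illustration.

```python
import random

old_probs = [0.5, 0.3, 0.2]   # behaviour (old) policy over 3 actions
new_probs = [0.2, 0.3, 0.5]   # target (new) policy over the same actions
f = [1.0, 2.0, 4.0]           # hypothetical per-action quantity, e.g. an advantage

# Exact expectation under the new policy, for reference
exact = sum(p * v for p, v in zip(new_probs, f))

# Draw actions from the OLD policy only (fixed seed for reproducibility)
rng = random.Random(0)
n_samples = 100_000
actions = rng.choices(range(3), weights=old_probs, k=n_samples)

# Reweight each sample by the ratio new(a) / old(a)
estimate = sum(new_probs[a] / old_probs[a] * f[a] for a in actions) / n_samples

print(exact, estimate)  # the estimate should land close to the exact value
```

The estimate is unbiased, but its variance grows as the two policies drift apart, because a few samples end up carrying large ratios.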

In [3]:
Image(url='https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ONEmrwcr-jOwyUlFoO1g-Q.png', width=400)

In [4]:
Image(url='https://huggingface.co/blog/assets/73_deep_rl_q_part2/off-on-4.jpg', width=500)

## PPO clip