## Policy Gradient
To maximize the objective function $J(\theta)$,
$$
\underset{\theta}{\operatorname{arg\,max}}\, J(\theta) = \underset{\theta}{\operatorname{arg\,max}}\, \mathbb{E}_{s \sim d_\pi}\left[ V_\pi(s) \right],
$$
we can use the policy gradient theorem:
$$
\nabla_\theta J(\theta)  = \mathbb{E}_{s \sim d_\pi}\left [\mathbb{E}_{a \sim \pi_{\theta}(\cdot \vert s)} [Q_\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)] \right ] 
$$
Sampling a state $s \sim d_\pi$ and an action $a \sim \pi_{\theta}(\cdot \vert s)$ gives a single-sample Monte Carlo estimate of $\nabla_\theta J(\theta)$:
$$
\boldsymbol{g}_\theta(s, a) \doteq Q_\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)
$$
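
To make the estimator concrete, here is a minimal sketch for a softmax policy over three actions in a single fixed state, where $\nabla_\theta \ln \pi_\theta(a \vert s)$ with respect to the logits has the closed form $\text{one\_hot}(a) - \pi_\theta(\cdot \vert s)$. The logits `theta`, the table `q_values`, and all helper names are illustrative assumptions, not part of the notebook. It compares the average of many single-sample estimates against the exact inner expectation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(probs, action):
    """Gradient of ln pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def sample_gradient(probs, q_values, rng):
    """Single-sample estimate g(s, a) = Q(s, a) * grad ln pi(a | s)."""
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return [q_values[action] * g for g in grad_log_softmax(probs, action)]


def exact_gradient(probs, q_values):
    """Exact inner expectation: sum_a pi(a) * Q(a) * grad ln pi(a)."""
    grad = [0.0] * len(probs)
    for a, p in enumerate(probs):
        for i, g in enumerate(grad_log_softmax(probs, a)):
            grad[i] += p * q_values[a] * g
    return grad


# Hypothetical values chosen only for illustration.
theta = [0.2, -0.1, 0.5]    # logits of a 3-action softmax policy
q_values = [1.0, 0.0, 2.0]  # assumed known Q(s, a) for one fixed state

rng = random.Random(0)
probs = softmax(theta)
num_draws = 20000
mc_gradient = [0.0] * len(theta)
for _ in range(num_draws):
    for i, g in enumerate(sample_gradient(probs, q_values, rng)):
        mc_gradient[i] += g / num_draws

print("Monte Carlo:", [round(g, 3) for g in mc_gradient])
print("Exact:      ", [round(g, 3) for g in exact_gradient(probs, q_values)])
```

In practice $Q_\pi$ is not known and has to be replaced by an estimate, which is where the advantage estimators in the next section come in.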

## Generalized Advantage Estimation (GAE)
![](./doc/gae_0.png)
![](./doc/gae_1.png)
![](./doc/gae_2.png)

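With the one-step TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the GAE estimator is an exponentially weighted sum of residuals, which can be computed with a backward recursion:
$$
\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l} = \delta_{t} + (\gamma \lambda) \hat{A}_{t+1}^{\mathrm{GAE}(\gamma, \lambda)}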
$$
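
As a quick sanity check of the identity above, the following sketch evaluates both the direct sum and the backward recursion on a short, made-up list of TD residuals. It assumes a finite episode, so $\delta$ is treated as zero past the last step; the numbers and function names are hypothetical.

```python
def gae_by_sum(deltas, gamma, lam):
    """Direct form: A_t = sum_l (gamma * lam)^l * delta_{t+l}, truncated at episode end."""
    n = len(deltas)
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(n - t))
        for t in range(n)
    ]


def gae_by_recursion(deltas, gamma, lam):
    """Recursive form: A_t = delta_t + gamma * lam * A_{t+1}, with A_T = 0."""
    advantages = [0.0] * len(deltas)
    next_advantage = 0.0
    for t in reversed(range(len(deltas))):
        next_advantage = deltas[t] + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages


deltas = [0.5, -0.2, 1.0, 0.3]  # hypothetical TD residuals
gamma, lam = 0.99, 0.95

direct = gae_by_sum(deltas, gamma, lam)
recursive = gae_by_recursion(deltas, gamma, lam)
for t, (a, b) in enumerate(zip(direct, recursive)):
    print(f"t={t}: sum={a:.6f} recursion={b:.6f}")
```

The recursion needs one pass over the trajectory, while the direct sum is quadratic in its length. Setting `lam = 0.0` reduces the estimator to the one-step residual $\delta_t$, and `lam = 1.0` gives the discounted sum of all remaining residuals.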

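```python
import torch
from torch import nn


class Critic(nn.Module):
    """State-value network V(s): a two-layer tanh MLP with a linear value head."""

    def __init__(self, state_dim, hidden_dim=64, dim_pred=1):
        super().__init__()

        self.proj_in = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        self.value_head = nn.Linear(hidden_dim, dim_pred)

    def forward(self, state):
        x = self.proj_in(state)
        return self.value_head(x)


# Toy rollout: random tensors stand in for one 9-step episode.
num_samples = 9
state_dim = 10
critic = Critic(state_dim)
states = torch.randn(num_samples, state_dim)
next_states = torch.randn(num_samples, state_dim)
rewards = torch.randn(num_samples, 1)
dones = torch.zeros(num_samples, 1)
dones[-1] = 1.0  # the episode terminates at the last step
gamma = 0.99
lambda_ = 0.95

# Advantages are used as fixed targets, so no gradient should flow through them.
with torch.no_grad():
    values = critic(states)
    next_values = critic(next_states)

    # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    td_target = rewards + gamma * next_values * (1.0 - dones)
    td_delta = td_target - values
    print(td_delta.shape)  # torch.Size([9, 1])

    # Backward recursion: A_t = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}
    # The (1 - done_t) mask stops advantages leaking across episode boundaries.
    advantages = torch.zeros_like(td_delta)
    next_advantage = torch.zeros(1)
    for t in reversed(range(num_samples)):
        next_advantage = td_delta[t] + gamma * lambda_ * (1.0 - dones[t]) * next_advantage
        advantages[t] = next_advantage

print(advantages.shape)  # torch.Size([9, 1])
```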

## The TRPO formula
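TRPO maximizes a surrogate objective in which actions sampled from the old policy $\pi_{\theta_\text{old}}$ are reweighted by the importance sampling ratio $\pi_\theta(a \vert s) / \pi_{\theta_\text{old}}(a \vert s)$. The maximization is subject to a trust-region constraint $\mathbb{E}_{s}\left[ D_\text{KL}\left(\pi_{\theta_\text{old}}(\cdot \vert s) \,\Vert\, \pi_\theta(\cdot \vert s)\right) \right] \le \delta_\text{KL}$, which keeps the new policy close to the old one:
$$
\underset{\theta}{\operatorname{arg\,max}}\, \mathbb{E}_{s}\left[ \mathbb{E}_{a \sim \pi_{\theta_\text{old}}(\cdot \vert s)} \left[ \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} \hat{A}_{\theta_\text{old}}(s, a) \right] \right]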
$$
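
The importance-weighted objective above can be estimated from a batch of samples collected under the old policy. The sketch below is a plain-Python illustration that computes the ratio from log-probabilities as $\exp(\ln \pi_\theta - \ln \pi_{\theta_\text{old}})$; the batch values and the function name `surrogate_objective` are hypothetical.

```python
import math


def surrogate_objective(logp_new, logp_old, advantages):
    """Sample mean of ratio * advantage, with ratio = exp(logp_new - logp_old)."""
    ratios = [math.exp(new - old) for new, old in zip(logp_new, logp_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


# Hypothetical batch of three (state, action) samples drawn under the old policy.
logp_old = [math.log(0.5), math.log(0.2), math.log(0.3)]
logp_new = [math.log(0.6), math.log(0.1), math.log(0.3)]
advantages = [1.0, -0.5, 0.2]

print("old policy:", surrogate_objective(logp_old, logp_old, advantages))
print("new policy:", surrogate_objective(logp_new, logp_old, advantages))
```

When $\theta = \theta_\text{old}$ every ratio equals 1 and the estimate reduces to the mean advantage. In this made-up batch the new policy raises the probability of the positive-advantage action and lowers that of the negative-advantage one, so the surrogate increases.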

## The PPO Algorithm
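PPO maximizes the same surrogate, but clips the probability ratio to $[1 - \epsilon, 1 + \epsilon]$ and takes the minimum of the clipped and unclipped terms, which removes the incentive to move $\pi_\theta$ far from $\pi_{\theta_\text{old}}$ in a single update:
$$
\underset{\theta}{\operatorname{arg\,max}}\, \mathbb{E}_{s}\left[ \mathbb{E}_{a \sim \pi_{\theta_\text{old}}(\cdot \vert s)} \left[ \min\left( \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} \hat{A}_{\theta_\text{old}}(s, a),\ \text{clip}\left( \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{\theta_\text{old}}(s, a) \right) \right] \right]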
$$
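
The effect of the $\min$ and $\text{clip}$ is easiest to see on single samples. The sketch below evaluates the per-sample term of the objective above in plain Python for a few ratio and advantage combinations; the function name `ppo_clip_term` and the default `epsilon=0.2` are assumptions made for illustration.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical (ratio, advantage) pairs covering the four interesting cases.
cases = [
    (1.5, 1.0),   # ratio above the clip range, positive advantage
    (0.5, 1.0),   # ratio below the clip range, positive advantage
    (1.5, -1.0),  # ratio above the clip range, negative advantage
    (0.5, -1.0),  # ratio below the clip range, negative advantage
]
for ratio, advantage in cases:
    term = ppo_clip_term(ratio, advantage)
    print(f"ratio={ratio:.1f} advantage={advantage:+.1f} term={term:+.2f}")
```

With a positive advantage the term stops growing once the ratio exceeds $1 + \epsilon$, and with a negative advantage it stops growing once the ratio falls below $1 - \epsilon$. In the other two cases the unclipped term is the smaller one, so moves that make the objective worse are still penalized in full.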