# TRUST REGION POLICY OPTIMIZATION (TRPO) and PROXIMAL POLICY OPTIMIZATION (PPO)

## I. PROBLEM of VANILLA POLICY OPTIMIZATION

### 1. Summary of **Vanilla Policy Optimization**: 
From the previous [Vanilla Policy Optimization](docs/Vanila_Policy_Optimization.ipynb), our objective is to find the policy $\pi_\theta(a|s)$ that maximizes the expected return: 
    $$\pi^*= \argmax_\pi J(\pi) \quad \text{where} \quad J(\pi)= \mathbb{E}_{\tau \sim \pi} [R(\tau)] = \int_\tau P(\tau|\pi) R(\tau) \quad\quad (1)$$
whose gradient (the policy gradient) is:
    $$ \nabla_\theta J(\pi_\theta) =  \int_\tau P(\tau|\pi_\theta) \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right]\quad\quad (2)$$
where $R(\tau)$ can be generalized into $\Phi_t$, such as the Generalized Advantage Estimation (GAE):
    $$\Phi_t^{GAE}=\sum_{n=0}^N (\lambda \gamma)^n \delta_{t+n} \quad \text{where} \quad \delta_t = r_t + \gamma V^\phi(s_{t+1}) - V^\phi(s_t) $$
where $V^\phi(s_t)$ is the approximation of the Value function $V^\pi(s_t)$. 
+ $\pi_\theta(a_t|s_t)$ is the actor network, parameterized by $\theta$ and updated by stochastic gradient ascent with learning rate $\eta_\theta$:
    $$\theta \leftarrow \theta + \eta_\theta \nabla_\theta J(\pi_\theta) $$
+ $V^\phi(s_t)$ is the critic network, parameterized by $\phi$ and updated by SGD on the squared error with learning rate $\eta_\phi$: 
    $$ \phi \leftarrow \phi - \eta_\phi \nabla_\phi (V^\phi(s_t)-R(\tau))^2 $$
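As an illustration of the GAE formula above, here is a minimal sketch in plain Python. It assumes a single finite trajectory, so the sum over $n$ is truncated at the end of the trajectory, and it assumes `values` carries one extra bootstrap entry $V^\phi(s_T)$. The function name and the default values of `gamma` and `lam` are hypothetical choices made for this example.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one finite trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (assumed: one extra bootstrap value)
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later deltas:
    # A_t = delta_t + (gamma * lam) * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Example: three steps with reward 1 and a critic that predicts 0 everywhere.
example_advantages = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0],
                                    gamma=0.5, lam=1.0)
```

The backward recursion avoids recomputing the inner sum for every $t$, so the cost is linear in the trajectory length.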

### 2. Optimization Problem
On-policy methods update the actor and critic networks using only the most recent data generated by the current policy. This causes several problems:
+ Because we use SGD to update the policy and critic networks, a small change of the parameters $\theta_k$ can lead to a large divergence between the old policy ($\pi_{\theta_k}$) and the new one ($\pi_{\theta_{k+1}}$).  
+ Consequently, if the policy changes too much, the value function $V^\phi(s_t)$ estimated on the data generated by the old policy $\pi_{\theta_k}$ no longer approximates the true value function $V^\pi(s_t)$ well. The resulting gradient step then pushes the policy in the wrong direction.
+ If the policy goes too far wrong, we stop collecting meaningful rewards, which can destroy the whole learning process. This differs from common supervised learning, where the data is static: if one batch leads us astray, the next batch can correct the mistake. In Reinforcement Learning the policy generates its own data, so after one bad step we may never be able to recover.
+ One way to reduce the divergence is to lower the learning rate, but this leads to slow updates and slow convergence. In addition, a small change in the network parameters does not guarantee a small change in the policy distribution $\pi_\theta(a|s)$.

TRPO proposes to stabilize training by explicitly adding the constraint that *the distribution of the new policy should be close to that of the old policy*. This helps ensure that the value function $V^\phi(s_t)$, and hence the advantage estimate built from it on data collected with the old policy, is still valid for updating the new policy. Mathematically, we can write the constraint as a bound on the Kullback-Leibler divergence between the new policy and the old policy:
    $$D_{KL}(\pi, \pi_{old}) \leq \epsilon $$
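For a discrete action space, the KL divergence at a single state can be computed in closed form. The sketch below assumes a softmax policy over three actions; the logits, the trust-region size `epsilon`, and the helper names are hypothetical values chosen for illustration. The direction of the divergence (here old policy first) is also an assumption of this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


old_logits = [2.0, 1.0, 0.0]
new_logits = [2.1, 1.0, 0.0]  # hypothetical small parameter change
pi_old = softmax(old_logits)
pi_new = softmax(new_logits)

kl = kl_categorical(pi_old, pi_new)
epsilon = 0.01  # hypothetical trust-region size
within_trust_region = kl <= epsilon
```

In practice the constraint is checked on the mean KL over the states visited by the old policy, not at a single state.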
To optimize the new policy using only data collected with the old policy, we rewrite the objective with importance sampling:
    $$J(\pi)= \mathbb{E}_{\tau \sim \pi} [R(\tau)] = \int_\tau P(\tau|\pi) R(\tau) = \int_\tau P(\tau|\pi_{old}) \frac{P(\tau|\pi)}{P(\tau|\pi_{old})}R(\tau) = \mathbb{E}_{\tau \sim \pi_{old}} [\frac{P(\tau|\pi)}{P(\tau|\pi_{old})}R(\tau)] \quad \quad (3)$$
Recall from the [Problem Formulation](docs/ProblemFormulation_Notation) that $P(\tau|\pi)$ is the probability of a $T$-step trajectory $\tau=(s_0,a_0,s_1,a_1,...)$, given by:
    $$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1}P(s_{t+1}|s_t,a_t)\pi(a_t|s_t) \quad\quad (4)$$
where $\rho_0(s_0)$ is the initial-state distribution and $P(s_{t+1}|s_t,a_t)$ is the transition probability of the environment, both unknown. Neither depends on the policy, so they cancel in the ratio $\frac{P(\tau|\pi)}{P(\tau|\pi_{old})}$. From (3) and (4) we can rewrite the objective as: 
    $$J(\pi) = \mathbb{E}_{\tau \sim \pi_{old}} \left[\frac{P(\tau|\pi)}{P(\tau|\pi_{old})}R(\tau)\right] =  \mathbb{E}_{\tau \sim \pi_{old}} \left[\left(\prod_{t=0}^{T-1}\frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)}\right)R(\tau)\right] \quad \quad (5)$$
To shorten the notation, denote $\pi^\tau(a|s) = \prod_{t=0}^{T-1}\pi(a_t|s_t)$, the probability of choosing the actions $\{a_t\}_{t=0}^{T-1}$ at the corresponding states $\{s_t\}_{t=0}^{T-1}$ when following the policy $\pi$ for $T$ steps. Then (5) can be written in the compact form:
    $$J(\pi) = \mathbb{E}_{\tau \sim \pi_{old}} [\frac{\pi^\tau(a|s)}{\pi_{old}^\tau(a|s)}R(\tau)] \quad \quad (6) $$
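Eq. (6) can be estimated by Monte Carlo from trajectories sampled with the old policy. The sketch below assumes that each trajectory is stored as a tuple of per-step log-probabilities under the new policy, per-step log-probabilities under the old policy, and the total return $R(\tau)$. This data layout and the function names are hypothetical. The product of ratios is computed in log space to avoid numerical underflow on long trajectories.

```python
import math


def trajectory_ratio(logp_new, logp_old):
    """pi^tau(a|s) / pi_old^tau(a|s): product of per-step ratios, via log space."""
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))


def surrogate_objective(trajectories):
    """Monte Carlo estimate of Eq. (6).

    trajectories: list of (logp_new, logp_old, total_return) tuples,
    all sampled by running the OLD policy.
    """
    weighted = [trajectory_ratio(ln, lo) * ret for ln, lo, ret in trajectories]
    return sum(weighted) / len(weighted)


# When the new policy equals the old one, every ratio is 1 and the estimate
# reduces to the plain average return.
same_policy_batch = [
    ([-0.1, -0.2], [-0.1, -0.2], 2.0),
    ([-0.3], [-0.3], 4.0),
]
estimate = surrogate_objective(same_policy_batch)
```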

In summary, the problem is redefined as:
    $$\boxed{\pi^*= \argmax_\pi J(\pi) \quad \text{where} \quad J(\pi) = \mathbb{E}_{\tau \sim \pi_{old}} [\frac{\pi^\tau(a|s)}{\pi_{old}^\tau(a|s)}R(\tau)] \quad \text{s.t.} \quad \mathbb{E}_{\tau \sim \pi_{old}} [D_{KL}(\pi(a|s), \pi_{old}(a|s))] \leq \epsilon \quad \quad (7)}$$

Intuitively, Eq. (7) says that we can use $R(\tau)$, estimated from the data generated by the old policy ($\tau \sim \pi_{old}$), to guide the SGD update of the new policy $\pi$, under the constraint that the distributions of the old and new policies are not too different. 

Note that **$\pi$ is the new policy and is not available at the current step, i.e. it is the result of the update.** All the collected data and computation are obtained using the old policy $\pi_{old}$. Therefore, the gradient of the objective is still evaluated at $\theta_{old}$. Concretely, using the log-derivative trick as done in [Vanilla Policy Optimization Solution](docs/Vanila_Policy_Optimization.ipynb),
    $$ \nabla_\theta J(\pi_\theta) \bigg\rvert_{\theta=\theta_{old}}= \mathbb{E}_{\tau \sim \pi_{old}} \left[\frac{\nabla_\theta \pi^\tau_\theta(a|s)}{\pi^\tau_{\theta_{old}}(a|s)}R(\tau)\right] \bigg\rvert_{\theta=\theta_{old}}
                                   = \mathbb{E}_{\tau \sim \pi_{old}} \left[\frac{\pi^\tau_\theta(a|s) \nabla_\theta \log \pi^\tau_\theta(a|s)}{\pi^\tau_{\theta_{old}}(a|s)}R(\tau)\right] \bigg\rvert_{\theta=\theta_{old}} =  \mathbb{E}_{\tau \sim \pi_{old}} \left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\right]\bigg\rvert_{\theta=\theta_{old}} $$
which is exactly the gradient of Vanilla PO in Eq. (2): at $\theta=\theta_{old}$ the ratio $\pi^\tau_\theta/\pi^\tau_{\theta_{old}}$ equals 1, and $\log \pi^\tau_\theta = \sum_t \log \pi_\theta(a_t|s_t)$. So even though we changed the objective function $J(\theta)$, its gradient at $\theta_{old}$ does not change. The change in the overall update comes only from the added constraint. 
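The claim that the importance-weighted objective and the vanilla objective share the same gradient at $\theta_{old}$ can be checked numerically. The sketch below assumes the simplest possible setting: a single state, one-step trajectories ($T=1$), and a softmax policy over three actions, so that the expectation over $a \sim \pi_{old}$ can be computed exactly. The parameter values, rewards, and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate(theta, theta_old, rewards):
    """Exact E_{a ~ pi_old}[(pi_theta(a) / pi_old(a)) * R(a)]."""
    pi = softmax(theta)
    pi_old = softmax(theta_old)
    return sum(po * (p / po) * r for p, po, r in zip(pi, pi_old, rewards))


def surrogate_grad_fd(theta_old, rewards, h=1e-6):
    """Central finite-difference gradient of the surrogate at theta = theta_old."""
    grad = []
    for k in range(len(theta_old)):
        plus = list(theta_old)
        minus = list(theta_old)
        plus[k] += h
        minus[k] -= h
        diff = surrogate(plus, theta_old, rewards) - surrogate(minus, theta_old, rewards)
        grad.append(diff / (2.0 * h))
    return grad


def vanilla_policy_grad(theta_old, rewards):
    """Exact E_{a ~ pi_old}[grad log pi(a) * R(a)] for a softmax policy.

    Uses d/d theta_k log pi(a) = 1[a == k] - pi(k), which simplifies the
    expectation to pi(k) * (R(k) - E[R]).
    """
    pi = softmax(theta_old)
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return [pi[k] * (rewards[k] - expected_reward) for k in range(len(pi))]


theta_old = [0.5, -0.2, 0.1]
rewards = [1.0, 0.0, 2.0]
fd = surrogate_grad_fd(theta_old, rewards)
pg = vanilla_policy_grad(theta_old, rewards)
```

The two gradients `fd` and `pg` agree up to finite-difference error, which matches the derivation above for this toy case.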

## II. Solution 
To solve the constrained problem, we can use Lagrange multipliers, as done in TRPO. However, the resulting solution is complicated and hard to implement. PPO proposes two simpler solutions:
+ PPO-Penalty: We move the constraint into the objective as a regularization (penalty) term weighted by a coefficient $\alpha$:
    $$ J(\pi) = \mathbb{E}_{\tau \sim \pi_{old}} \left[\frac{\pi(a|s)}{\pi_{old}(a|s)}R(\tau) - \alpha D_{KL}(\pi(a|s), \pi_{old}(a|s)) \right] \quad \quad (8)$$
+ PPO-Clip: We clip the ratio $r_\pi=\frac{\pi(a|s)}{\pi_{old}(a|s)}$ to the interval $[1-\epsilon,1+\epsilon]$, so that the objective gives no incentive to move the new policy far from the old one. Here $\epsilon$ is a clipping hyperparameter, distinct from the KL bound in (7). Specifically, 
    $$ J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_{old}}[L(s,a,\theta_{old},\theta)] \quad \text{where} \quad  L(s,a,\theta_{old},\theta) = \min\left(r_\pi R(\tau),\ \operatorname{clip}(r_\pi, 1-\epsilon, 1+\epsilon) R(\tau)\right) =
        \begin{cases}
            \min(r_\pi,1 + \epsilon)R(\tau) \quad \text{if} \quad R(\tau) \geq 0\\
            \max(r_\pi,1 - \epsilon)R(\tau) \quad \text{if} \quad R(\tau) < 0 \\
        \end{cases}$$
  Although PPO-Clip is much simpler to implement, it is still possible to end up with a new policy that is too far from the old policy. There are several ways to handle this case. A simple one is early stopping: if the mean KL divergence of the new policy from the old one grows beyond a threshold, we stop taking gradient steps. 
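The two PPO objectives and the early-stopping rule can be sketched per sample as follows. This is a minimal illustration in plain Python, not a full training loop. `advantage` plays the role of $R(\tau)$ (or $\Phi_t$), the clipped term is written in the commonly used form $\min(rA, \operatorname{clip}(r, 1-\epsilon, 1+\epsilon)A)$, and the default values of `eps`, `alpha`, and `target_kl` are hypothetical.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def ppo_penalty_term(ratio, advantage, kl, alpha=1.0):
    """Per-sample PPO-Penalty objective: r * A - alpha * KL."""
    return ratio * advantage - alpha * kl


def should_stop_early(mean_kl, target_kl=0.01):
    """Early stopping: halt gradient steps once the mean KL exceeds the threshold."""
    return mean_kl > target_kl


# Positive advantage: gains from pushing the ratio above 1 + eps are cut off.
capped_gain = ppo_clip_term(ratio=1.5, advantage=2.0)
# Negative advantage: gains from pushing the ratio below 1 - eps are cut off.
capped_loss = ppo_clip_term(ratio=0.5, advantage=-1.0)
```

Note that the clipping is one-sided: when a change in the ratio makes the objective worse (for example a ratio of 1.5 with a negative advantage), the unclipped value is kept, so the objective still penalizes such moves.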