# Proximal Policy Optimization

<!-- 
- Introduction 
    - Why PPO? What does it build on? How is it better? What's the point?
- Note that this post assumes the reader knows about policy gradient as explained in the previous post
- Maybe quickly re-introduce the idea of policy gradient methods
- Trust Region Methods such as TRPO
- Clipped Surrogate Objective
- Adaptive KL Penalty Coefficient
- The Algorithm itself
- Implementations
    - Note implementations are using openAI's stable-baselines3
    - CartPole-v1 and CartPole-v1k
    - CarRacing-v1
- Conclusion
-->

Schulman, _et al._ propose a policy-gradient-based reinforcement learning approach that keeps some of the benefits of trust region policy optimization (TRPO) while being much simpler to implement. The general concept is to alternate between sampling data through interaction with the environment and optimizing a so-called surrogate objective function using stochastic gradient ascent.

## Policy optimization

### Policy gradient methods

Since policy gradients were covered in a previous post, this post assumes that the reader is somewhat familiar with the topic.

Generally, policy gradient methods perform stochastic gradient ascent on an estimator of the policy gradient. The most commonly used estimator is
$$
\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta\log\pi_\theta(a_t|s_t)\hat{A}_t \right]
$$

- $\pi_\theta$ is a stochastic policy
- $\hat{A}_t$ is an estimator of the advantage function at timestep $t$
- $\hat{\mathbb{E}}_t\left[...\right]$ is the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization

Implementations of policy gradient methods typically construct an objective function whose gradient is the policy gradient estimator, so that the estimator can be obtained by differentiating the objective:
$$
L^{PG}(\theta)=\hat{\mathbb{E}}_t\left[\log\pi_\theta(a_t|s_t)\hat{A}_t\right]
$$

This objective function is also called the policy gradient loss. If the advantage estimate is positive (the agent's actions in the sampled trajectory resulted in a better-than-average return), the probability of selecting those actions again is increased. If, on the contrary, the advantage estimate is negative, the likelihood of selecting those actions again is decreased.
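
The effect of the advantage's sign can be checked on a toy example. The following is a minimal sketch, assuming a softmax policy over two discrete actions; the helper names and all numbers are made up for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Turn unnormalized action preferences into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_objective(logits_per_step, actions, advantages):
    """Empirical mean of log pi(a_t | s_t) * A_t over a batch."""
    terms = []
    for logits, a, adv in zip(logits_per_step, actions, advantages):
        log_prob = math.log(softmax(logits)[a])
        terms.append(log_prob * adv)
    return sum(terms) / len(terms)

def pg_gradient_wrt_logits(logits, action, advantage):
    """Gradient of log pi(a|s) * A with respect to the logits of one state.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]

# Toy batch: two states with two actions each (hypothetical values).
logits = [[0.0, 0.0], [0.0, 0.0]]
actions = [0, 1]
advantages = [2.0, -1.0]

objective = pg_objective(logits, actions, advantages)
grad_good = pg_gradient_wrt_logits(logits[0], actions[0], advantages[0])
grad_bad = pg_gradient_wrt_logits(logits[1], actions[1], advantages[1])
```

In this sketch the gradient entry of the chosen action is positive for the sample with a positive advantage and negative for the sample with a negative advantage, so a gradient ascent step raises the first action's probability and lowers the second's.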

<!-- Note on repeadedly running gradient-descent on the same batch

Continiously running gradient-descent on the same batch of collected experience will cause the neural network's parameter to be updated far outside the range of where the data was originally collected. Since the advantage function is basically a noisy estimate of the real advantage, it will in turn be corrupted to the point where it is completely wrong.

Therefore the policy will be destroyed if gradient descent is continiously run on the same batch of collected experience. 
-->

### Trust Region Methods

The aforementioned trust region policy optimization algorithm (TRPO) maximizes an objective function subject to a constraint on the size of the policy update. The constraint keeps a single update from moving too far away from the old policy, as that could destroy the policy:

$$
\begin{align*}
\max_\theta\,&\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t\right] \\
\text{subject to }&\hat{\mathbb{E}}_t\left[KL\left[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]\leq \delta
\end{align*}
$$

- $\theta_{\text{old}}$ is the vector of policy parameters before the update
- $\delta$ is the maximum allowed average KL divergence between the old and the new policy
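
To make the constraint concrete, the following sketch computes the average KL divergence for a batch of categorical action distributions and compares it to a trust region size. The function names, the value of `delta` and the distributions are hypothetical and only serve as an illustration.

```python
import math

def kl_divergence(p_old, p_new):
    """KL[p_old, p_new] for two categorical distributions over the same actions."""
    return sum(po * math.log(po / pn)
               for po, pn in zip(p_old, p_new) if po > 0.0)

def mean_kl(old_dists, new_dists):
    """Empirical average of the per-state KL divergence over a batch."""
    kls = [kl_divergence(po, pn) for po, pn in zip(old_dists, new_dists)]
    return sum(kls) / len(kls)

delta = 0.01  # hypothetical trust region size

# Action distributions of the old policy in two sampled states.
old = [[0.5, 0.5], [0.9, 0.1]]
# Two candidate updates: one stays close to the old policy, one does not.
small_step = [[0.52, 0.48], [0.89, 0.11]]
large_step = [[0.9, 0.1], [0.5, 0.5]]

small_ok = mean_kl(old, small_step) <= delta
large_ok = mean_kl(old, large_step) <= delta
```

Under these assumed numbers the small step satisfies the constraint while the large step violates it, so a constrained optimizer would have to reject or shrink the latter.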

The theory behind TRPO actually suggests using a penalty instead of the constraint, because the latter adds overhead to the optimization process and can sometimes lead to very undesirable training behavior.
That way, the former constraint is included directly in the optimization objective, weighted by a coefficient $\beta$:
$$
\max_{\theta}\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t-\beta\,KL\left[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]
$$

<!-- Something about pessimistic lower bound? -->

In practice, however, TRPO uses the hard constraint rather than the penalty. This is because it turns out to be very difficult to choose a single value of the coefficient $\beta$ that performs well across different problems.

Simply choosing a fixed penalty coefficient $\beta$ and optimizing the penalized objective with stochastic gradient descent is therefore not enough, which is why PPO proposes additional modifications.

## Clipped Surrogate Objective

Let $r_t(\theta)$ denote the probability ratio between the new, updated policy and the old policy:
$$
r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}
$$

It follows that $r_t(\theta_\text{old})=1$.

Given a sequence of sampled state-action pairs, $r_t(\theta)$ is larger than 1 if the action is more likely under the current policy than it was under $\pi_{\theta_\text{old}}$. If, on the other hand, the action is less probable now than before the last gradient step, $r_t(\theta)$ lies between 0 and 1.
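
In code, the ratio is usually computed from log-probabilities, since those are what a policy network typically returns. The snippet below is a small sketch of that idea; the function name and the probabilities are made up for illustration.

```python
import math

def probability_ratio(log_prob_new, log_prob_old):
    """r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), computed from log-probabilities."""
    return math.exp(log_prob_new - log_prob_old)

# The old policy selected the sampled action with probability 0.25.
log_prob_old = math.log(0.25)

r_unchanged = probability_ratio(math.log(0.25), log_prob_old)    # same probability
r_more_likely = probability_ratio(math.log(0.50), log_prob_old)  # action more likely now
r_less_likely = probability_ratio(math.log(0.10), log_prob_old)  # action less likely now
```

With these numbers the ratio is 1 for the unchanged policy, about 2 when the action's probability doubles, and about 0.4 when it drops from 0.25 to 0.10.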

Multiplying this ratio $r_t(\theta)$ by the estimated advantage yields a more readable version of the TRPO objective function, a so-called "surrogate" objective:
$$
L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t\right]=\hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]
$$

Maximizing $L^{CPI}$ without any further constraints would result in an excessively large policy update which, as already explained, might end up destroying the policy. Therefore Schulman, _et al._ suggest penalizing changes to the policy that move $r_t(\theta)$ too far away from 1.

This results in the following proposal for the main objective:
$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right]
$$

- $\epsilon$ is a hyperparameter (e.g. $\epsilon = 0.2$)

This objective is a pessimistic bound (a lower bound) on the unclipped objective, because it takes the minimum of the unclipped objective $L^{CPI}$ and a clipped version of that objective. The clipped version removes the incentive for moving $r_t$ outside of the interval $[1 - \epsilon, 1 + \epsilon]$.
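
The clipped objective is short enough to write out directly. The following is a minimal sketch in plain Python, assuming the ratios and advantage estimates are already available as lists; the function names and sample values are hypothetical.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))

def clipped_term(ratio, advantage, epsilon=0.2):
    """One term of L^CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)

def l_clip(ratios, advantages, epsilon=0.2):
    """Empirical mean of the clipped terms over a batch."""
    terms = [clipped_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Hypothetical batch of probability ratios and advantage estimates.
ratios = [1.0, 1.5, 0.5, 1.5]
advantages = [1.0, 1.0, 1.0, -1.0]

per_sample = [clipped_term(r, a) for r, a in zip(ratios, advantages)]
batch_objective = l_clip(ratios, advantages)
```

Every entry of `per_sample` is less than or equal to the corresponding unclipped product of ratio and advantage, which is what makes the objective a pessimistic bound.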

The paper provides a figure with two plots to illustrate this concept.

<center>
<img src="..\resources\img\single-term-L-CLIP-graphic.PNG">
</center>

The first plot shows a single term (a single timestep _t_) of $L^{CLIP}$ for a positive advantage, while the second shows one for a negative advantage. In the first example the selected action led to a better outcome than expected, while in the second it had a negative effect on the outcome.
For a positive advantage, the objective function flattens out for values of _r_ that are too high. If an action was good and is now a lot more likely than it was under the old policy $\pi_{\theta_\text{old}}$, the clipping prevents too large an update based on just a single estimate. Such an update might destroy the policy, because the advantage estimate is noisy.
Conversely, for terms with a negative advantage the clipping avoids over-adjusting in the opposite direction. Without it, the likelihood of those actions could be reduced to zero, which would again damage the policy based on a single estimate.

If, however, the advantage is negative and _r_ is large, meaning the chosen action was bad **and** is a lot more probable now than it was under the old policy $\pi_{\theta_\text{old}}$, then it would be beneficial to reverse the update. For a negative advantage, this is exactly the region in which the unclipped term has a lower value than the clipped term and is therefore selected by the _min_ operator. The same holds symmetrically for a positive advantage when _r_ falls below $1 - \epsilon$, that is, when a good action has become much less likely. This really showcases the finesse of PPO's objective function.
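
The four combinations of advantage sign and ratio can be enumerated numerically. The sketch below estimates the slope of a single clipped term with respect to _r_ by finite differences; a slope of zero means the sample no longer pushes the policy further. All names and values are made up for illustration.

```python
def clipped_term(ratio, advantage, epsilon=0.2):
    """One term of L^CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope_wrt_ratio(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite difference of the term with respect to r."""
    upper = clipped_term(ratio + h, advantage, epsilon)
    lower = clipped_term(ratio - h, advantage, epsilon)
    return (upper - lower) / (2.0 * h)

# (ratio, advantage) pairs for the four qualitatively different situations.
cases = {
    "good action, much more likely": (1.5, 1.0),
    "good action, much less likely": (0.5, 1.0),
    "bad action, much less likely": (0.5, -1.0),
    "bad action, much more likely": (1.5, -1.0),
}

slopes = {name: round(slope_wrt_ratio(r, a), 3) for name, (r, a) in cases.items()}
```

In this sketch the slope vanishes when the policy has already moved far in the direction the advantage suggests (good and much more likely, bad and much less likely). It stays nonzero when the policy has moved in the wrong direction, so the update can still be reversed.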

## Adaptive KL penalty coefficient

As an alternative or in addition to the clipped surrogate objective, Schulman, _et al._ describe another approach: the so-called adaptive KL penalty coefficient. The general idea is to penalize the KL divergence and to adapt the penalty coefficient after each policy update so that a target value of the KL divergence is reached. The procedure can be divided into two steps:

- First the policy is updated over several epochs by optimizing the KL-penalized objective:
$$
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t - \beta KL[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]\right]
$$

- Then $d$ is computed as $d = \hat{\mathbb{E}}_t[KL[\pi_{\theta_\text{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)]]$ to finally update the penalty coefficient $\beta$ based on some target value of the KL divergence $d_\text{targ}$:
    - if $d<\frac{d_\text{targ}}{1.5}: \beta \leftarrow \beta/2$
    - if $d>d_\text{targ} \times 1.5: \beta \leftarrow \beta\times2$
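
The update rule for $\beta$ from the second step can be sketched as a small function. The function name, the target value and the sequence of measured KL divergences below are hypothetical.

```python
def update_beta(beta, d, d_targ):
    """Adapt the KL penalty coefficient after a policy update.

    d is the measured average KL divergence between the old and new policy.
    """
    if d < d_targ / 1.5:
        return beta / 2.0  # policy barely moved: loosen the penalty
    if d > d_targ * 1.5:
        return beta * 2.0  # policy moved too far: tighten the penalty
    return beta            # close enough to the target: keep beta

beta = 1.0
d_targ = 0.01
history = []
for d in [0.05, 0.03, 0.012, 0.002]:  # made-up measured KL values
    beta = update_beta(beta, d, d_targ)
    history.append(beta)
```

For this made-up sequence the coefficient is doubled twice, kept once and then halved, giving `[2.0, 4.0, 4.0, 2.0]`.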

In the paper's experiments, however, this method generally performed worse than the clipped surrogate objective. It is included because it still makes for an important baseline.

Also note that while the parameters 1.5 and 2 are chosen heuristically, the algorithm is not particularly sensitive to them. The initial value of $\beta$ matters little as well, because the algorithm quickly adjusts it.

## Trying out PPO on OpenAI Gym

To test PPO, the algorithm was applied to several Gym environments. Please note that the PPO implementation used here comes from the Stable Baselines3 GitHub repository.
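
For orientation, a training run with Stable Baselines3 might look roughly like the sketch below. It is not the script used for the experiments in this post: the function name `train_and_evaluate` and its defaults are assumptions, and the library is imported inside the function so that the snippet can be loaded without it being installed.

```python
def train_and_evaluate(env_id="CartPole-v1", total_timesteps=15_000,
                       n_eval_episodes=100):
    """Train a PPO agent and return (mean_reward, std_reward).

    Hypothetical helper; requires the stable-baselines3 package and a
    Gym environment registered under env_id.
    """
    # Imported lazily so that defining the function has no dependencies.
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    # "MlpPolicy" selects a small fully connected actor-critic network.
    model = PPO("MlpPolicy", env_id, verbose=0)
    model.learn(total_timesteps=total_timesteps)

    # Evaluate the trained agent on the environment it was trained on.
    return evaluate_policy(model, model.get_env(),
                           n_eval_episodes=n_eval_episodes)
```

Calling `train_and_evaluate()` would then return the mean and standard deviation of the episode reward over the evaluation episodes.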

### CartPole

This environment might be familiar from previous posts in this series. To quickly summarize: a pole is attached to a cart by an un-actuated joint. The agent has a discrete action space with two actions: apply force to the left (-1) or to the right (+1). The model receives a reward of +1 for each timestep in which neither of the following conditions is met: the cart's x position exceeds a threshold of 2.4 units in either direction _or_ the pole is more than 15 degrees from vertical. Otherwise the episode ends.

<center>
<figure>
<img src="..\workspace\ppo\out\videos\cartpole-v1-ppo-15000-step-0-to-step-500.gif" height="185">
<figcaption>This model was trained over 15000 timesteps</figcaption>
</figure>
</center>

We trained a PPO agent for 15000 timesteps on the environment. The agent reached the perfect mean reward of 500 for CartPole-v1 after only approximately 7500 timesteps of training. 

Since Gym allows its environments to be modified, we created a custom version of CartPole named CartPole-v1k that extends the maximum number of timesteps in an episode from 500 (CartPole-v1) to 1000. This means that the maximum reward for an episode also increases to 1000. As shown in the following figure, the PPO agent trained on this new environment reached the maximum reward a little later, after about 8500 timesteps.

The standard deviation of the reward over the 100 episodes of each evaluation step also declines to 0 within fewer than 8500 timesteps. From that point on, the agent reaches the maximum reward in every single evaluation episode.

Interestingly, the old model that was trained on the original environment also reaches a reward of 1000 on the modified version in all 100 episodes on which it was evaluated (mean reward of 1000 with a standard deviation of 0). This suggests that the agent not only reaches its maximum reward after about 7500 timesteps but also continues to refine its balancing of the pole over the next few thousand timesteps.

<center>
<img src="..\workspace\ppo\out\charts\mean_rew_comparison.png" height="185">
<img src="..\workspace\ppo\out\charts\std_rew_comparison.png" height="185">
</center>

### CarRacing

**TODO**

This environment consists of a racing car that has to navigate a race track. Observations consist of a 96x96 pixel image and the action space comprises 3 actions: steering, gas and braking. Each frame is penalized with a reward of _-0.1_, while the completion of a track section is rewarded with _+1000/N_, where _N_ is the number of track sections. The final reward for completing the track is therefore _1000 - 0.1 × the number of frames it took the agent to complete the track_. An episode ends when the car finishes the entire track or moves outside of the 96x96 pixel plane.
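
With a reward of _-0.1_ per frame and _+1000/N_ per track section, the return of an episode follows from simple arithmetic: a fully completed track collects 1000 from the sections and loses 0.1 for every frame. The sketch below spells this out; the function name and the example numbers are hypothetical.

```python
def car_racing_return(frames, sections_visited, total_sections):
    """Episode return: +1000/N per visited track section, -0.1 per frame."""
    section_reward = 1000.0 * sections_visited / total_sections
    frame_penalty = 0.1 * frames
    return section_reward - frame_penalty

# A hypothetical track with 300 sections, fully completed in 732 frames.
full_lap = car_racing_return(frames=732, sections_visited=300, total_sections=300)

# The same number of frames, but only half of the sections were reached.
half_lap = car_racing_return(frames=732, sections_visited=150, total_sections=300)
```

In this example the completed lap yields a return of about 926.8, while reaching only half of the sections in the same time yields about 426.8.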

## Conclusion

**TODO**

# References

- [Schulman, *et al.* (2017)](https://arxiv.org/abs/1707.06347)
- [Arxiv Insights' Video on the paper](https://www.youtube.com/watch?v=5P7I-xPq8u8)
- [Stable Baselines3](https://github.com/ischubert/stable-baselines3)
- [CartPole](https://gym.openai.com/envs/CartPole-v1/)
- [CarRacing](https://gym.openai.com/envs/CarRacing-v0/)