#### Applying the PPO RL Algorithm to Natural Language Sequences

The reinforcement learning (RL) update step uses the log probabilities, value function estimates, stepwise rewards, prompt tokens, and generated tokens to update the language model.

Let's first quickly go through all the steps from part 1 to get the same log probabilities, value function estimates, stepwise rewards, prompt tokens, and generated tokens, so that we can continue from where we left off.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch

from minichatgpt.experiments.imdb import config, sent_kwargs
from minichatgpt import Lab
from minichatgpt.processdata.collators import imdb_dataloader_collator

# for the loss calculation
from minichatgpt.core import whiten, logprobs_from_logits, clip_by_value

In [3]:
# The notebook should work even without GPUs, but if you have them, confirm you do
print ('number of GPUs:', torch.cuda.device_count())

# For the sake of the speed of this demonstration, the batch_size is temporarily decreased from 256 to 4
batch_size = 4
config.batch_size = batch_size
config.forward_batch_size = batch_size//2

config.seed

number of GPUs: 0


0

In [4]:
lab = Lab(config)

dataset = lab.build_dataset(
    dataset_name="imdb",
    input_min_text_length=2,
    input_max_text_length=8,
)

Found cached dataset imdb (/Users/carson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
Loading cached processed dataset at /Users/carson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-57d76e4722ace3a5.arrow
Loading cached processed dataset at /Users/carson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-26d56c39f81904d0.arrow


In [6]:
new_policy, old_policy, tokenizer = lab.init_policies_tokenizer()
lab.set_generation_config(do_sample=True,output_min_length=4,output_max_length=16,pad_token_id=tokenizer.eos_token_id)
ppo_trainer = lab.init_ppo_trainer(
    config, 
    new_policy,old_policy, 
    tokenizer, 
    dataset, dataloader_collator=imdb_dataloader_collator,
)
reward_model = lab.init_reward_model()

In [7]:
for batch_step, batch in enumerate(ppo_trainer.dataloader):
    
    queries = batch['input_ids']
    
    #### Get response from gpt2
    responses = []
    for query in queries:
        gen_len = lab.output_length_sampler()
        lab.generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **lab.generation_kwargs)
        responses.append(response.squeeze()[-gen_len:])

    batch['response'] = [tokenizer.decode(r.squeeze()) for r in responses]

    #### Compute sentiment score
    texts = [q + r for q,r in zip(batch['query'], batch['response'])]
    pipe_outputs = lab.reward_model(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
    break
    
queries, responses, scores = ppo_trainer._step_safety_checker(batch_size, queries, responses, rewards)

scores



[tensor(1.3289), tensor(-1.1267), tensor(2.6723), tensor(-0.2299)]

Next we will study what's going on in `ppo_trainer.compute_logits_vpred` and `ppo_trainer.loss`. As you can see below, when you run these together, even for a single iteration, the policy is updated so that the rewards change.

In [8]:
old_logprobs, ref_logprobs, values = ppo_trainer.batched_forward_pass(queries, responses)

rewards, non_score_reward = ppo_trainer.compute_rewards(scores, old_logprobs, ref_logprobs)

print(rewards)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[tensor([-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 1.3289]), tensor([-0.0000, -0.0000, -0.0000, -1.1267]), tensor([-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 2.6723]), tensor([-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
        -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.2299])]


In [9]:
idx = list(range(config.batch_size))

# train_minibatch() # line 419 ppo_trainier.py

for idx in range(config.batch_size):
    
    new_logprobs, vpred, logits = ppo_trainer.compute_logits_vpred(
        model_input = torch.cat([queries[idx],responses[idx]]).unsqueeze(0), 
        query = queries[idx].unsqueeze(0), 
        response = responses[idx].unsqueeze(0), 
        rewards = rewards[idx].unsqueeze(0),
    )
    
    loss_p, loss_v, train_stats = ppo_trainer.loss(
        old_logprobs[idx].unsqueeze(0),
        values[idx].unsqueeze(0),
        rewards[idx].unsqueeze(0),
        logits,
        vpred,
        new_logprobs,
    )
    
    loss = loss_p + loss_v
    
    ppo_trainer.optimizer.zero_grad()
    ppo_trainer.accelerator.backward(loss)
    ppo_trainer.optimizer.step()
    
    break
    
train_stats

{'loss/policy': tensor(1.3659e-08, grad_fn=<MeanBackward0>),
 'loss/value': tensor(3.0568, grad_fn=<MulBackward0>),
 'loss/total': tensor(0.3057, grad_fn=<AddBackward0>),
 'policy/entropy': tensor(4.1144, grad_fn=<MeanBackward0>),
 'policy/approxkl': tensor(0., grad_fn=<MulBackward0>),
 'policy/policykl': tensor(0., grad_fn=<MeanBackward0>),
 'policy/clipfrac': tensor(0., dtype=torch.float64),
 'policy/advantages': tensor([[ 0.1825, -0.2097, -1.0063,  0.3816,  0.2100, -0.6775, -1.1533,  0.0527,
           2.2200]]),
 'policy/advantages_mean': tensor(-1.3659e-08),
 'policy/ratio': tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1.]], grad_fn=<ExpBackward0>),
 'returns/mean': tensor(1.1464),
 'returns/var': tensor(0.0075),
 'val/vpred': tensor(-0.3766, grad_fn=<MeanBackward0>),
 'val/error': tensor(6.0088, grad_fn=<MeanBackward0>),
 'val/clipfrac': tensor(0.3333, dtype=torch.float64),
 'val/mean': tensor(0.4157),
 'val/var': tensor(0.6546)}

In [13]:
old_logprobs, ref_logprobs, values = ppo_trainer.batched_forward_pass(queries, responses)

rewards, non_score_reward = ppo_trainer.compute_rewards(scores, old_logprobs, ref_logprobs)

print(rewards)


[tensor([-0.0084,  0.0823,  0.1289, -0.1553, -0.1648,  0.0288,  0.0549,  0.0689,
         1.1831]), tensor([-1.1083e-03, -3.5361e-04,  1.1785e-03, -1.1243e+00]), tensor([-1.2566e-03, -1.4192e-03, -1.8741e-03,  2.0075e-04,  7.4119e-03,
         5.8543e-03,  2.6716e+00]), tensor([-0.0084,  0.0050,  0.0030, -0.0045, -0.0047,  0.0017, -0.0051,  0.0047,
         0.0005,  0.0017,  0.0066,  0.0025, -0.0037,  0.0010, -0.2327])]


#### train_minibatch()

`train_minibatch` does 3 things:

1. calls `compute_logits_vpred`
2. combines the loss components into a single loss
3. does backpropagation

`ppo_trainer.compute_logits_vpred()` does something very similar to `ppo_trainer.batched_forward_pass()`, except that it runs only the new_policy and also returns the logits. The values it returns are also shifted one position into the future.

In [14]:
# renaming our single sample to cleanly push this one sample through PPO for demonstration

old_logprobs_ = old_logprobs[idx].unsqueeze(0)
values_ = values[idx].unsqueeze(0)
rewards_ = rewards[idx].unsqueeze(0)
queries_ = queries[idx].unsqueeze(0)
responses_ = responses[idx].unsqueeze(0)
model_input_ = torch.cat([queries[idx],responses[idx]]).unsqueeze(0)

print(old_logprobs_, old_logprobs_.shape)
print(' ')
print(values_, values_.shape) 
print(' ')
print(rewards_, rewards_.shape)
print(' ')
print(model_input_, model_input_.shape)


tensor([[-9.0937, -2.4931, -7.3628, -4.2782, -3.5021, -2.1426, -1.5011, -3.9503,
         -4.5521]]) torch.Size([1, 9])
 
tensor([[ 0.6227,  0.1640,  1.4412,  0.2018, -0.0613,  1.0342,  1.4121,  0.5708,
          0.7421]]) torch.Size([1, 9])
 
tensor([[-0.0084,  0.0823,  0.1289, -0.1553, -0.1648,  0.0288,  0.0549,  0.0689,
          1.1831]]) torch.Size([1, 9])
 
tensor([[   40,  3505,  4964, 16089,   351,   337,     5,    38,    11,   290,
           484,  6304]]) torch.Size([1, 12])


In [15]:
print(values[0])

new_logprobs_, vpred_, logits_ = ppo_trainer.compute_logits_vpred(model_input_, queries_, responses_, rewards_)

print(new_logprobs_, new_logprobs_.shape)
print(' ')
print(vpred_, vpred_.shape)


tensor([ 0.6227,  0.1640,  1.4412,  0.2018, -0.0613,  1.0342,  1.4121,  0.5708,
         0.7421])
tensor([[-9.0937, -2.4931, -7.3628, -4.2782, -3.5021, -2.1426, -1.5011, -3.9503,
         -4.5521]], grad_fn=<SliceBackward0>) torch.Size([1, 9])
 
tensor([[ 0.2078,  1.4100,  0.2574, -0.0560,  1.0054,  1.5489,  0.5861,  1.2991,
         -7.9002]], grad_fn=<SliceBackward0>) torch.Size([1, 9])


#### Advantage = Returns - Values

The term `lastgaelam` means the last [generalized advantage estimator](https://arxiv.org/pdf/1506.02438.pdf) (GAE) lambda value.
There are already many resources explaining what the [Advantage](https://huggingface.co/blog/deep-rl-a2c) function is, so I will not go into it too much. In the equations below, $A^{(k)}_{t}$ is the advantage. It answers the question: "Compared to the expected reward we should get from state $ s_{t} $ until the end of the episode, how much more or less did we get specifically as a result of taking the action we took at step t, rather than any of the other actions we could have taken, and not counting the actions we took before or after step t?" As a simple equation:

$$ Advantage = Returns - Values $$

`returns` is the discounted sum of rewards from step t onward, $ R(t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_{T} $, where T is the last timestep of the episode and $ \gamma $ is the discount factor. When $ \gamma \ll 1 $ it emphasizes short-term rewards over long-term rewards, and when $ \gamma = 1 $ it weights long-term and short-term rewards equally.

`values` is the model's prediction of what the returns will be at any given timestep t and state $s_t$.
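
As a minimal sketch of these two definitions, the snippet below computes the discounted returns for a made-up three-step episode and subtracts a made-up value estimate. The helper name `discounted_returns` and the toy numbers are illustrative assumptions, not part of the notebook or of TRL.

```python
def discounted_returns(rewards, gamma):
    """Return R(t) = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Hypothetical episode: the only reward arrives at the last step.
toy_rewards = [0.0, 0.0, 1.0]
toy_values = [0.5, 0.5, 0.5]  # hypothetical value predictions

returns_half = discounted_returns(toy_rewards, gamma=0.5)
returns_one = discounted_returns(toy_rewards, gamma=1.0)

# Advantage = Returns - Values
advantages = [R - v for R, v in zip(returns_one, toy_values)]

print(returns_half)  # [0.25, 0.5, 1.0]
print(returns_one)   # [1.0, 1.0, 1.0]
print(advantages)    # [0.5, 0.5, 0.5]
```

With $ \gamma = 0.5 $ the final reward is worth less the further away it is, while with $ \gamma = 1 $ every step sees the full reward.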


In [16]:
print('lambda', ppo_trainer.config.lam, 'gamma', ppo_trainer.config.gamma)


lambda 0.95 gamma 1


#### loss()

This method is where the specifics of GAE and PPO are implemented.

#### Generalized Advantage Estimator (GAE)

$\hat{A}^{(1)}_{t}$ does this using only the actual reward $r_t$ we got immediately after taking action $ a_{t} $ in state $ s_{t} $, plus $\gamma V(s_{t+1})$ to estimate the rest of the future, minus $V(s_t)$, the expected total reward averaged across all the action options at state $s_t$, good and bad.

\begin{align}
\hat{A}_t^{(1)} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\
\cdots &= \cdots \\ 
\hat{A}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)
\end{align}

$\hat{A}^{(2)}_{t}$ is similar to $\hat{A}^{(1)}_{t}$, only we incorporate 2 steps of actual reward before estimating the rest, and so on.
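
The k-step estimators above can be sketched in a few lines. This is an illustration on made-up numbers, assuming the value of any state past the end of the episode is 0; `k_step_advantage` is a hypothetical helper, not a function from TRL or the notebook.

```python
def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k): k real rewards, then bootstrap from V(s_{t+k}), minus V(s_t).

    Rewards and values past the end of the episode are treated as 0.
    """
    n = len(rewards)
    total = 0.0
    for l in range(k):
        if t + l < n:
            total += gamma ** l * rewards[t + l]
    bootstrap = values[t + k] if t + k < n else 0.0
    return total + gamma ** k * bootstrap - values[t]


# Hypothetical three-step episode.
toy_rewards = [0.0, 0.0, 1.0]
toy_values = [0.2, 0.4, 0.6]

a1 = k_step_advantage(toy_rewards, toy_values, t=0, k=1, gamma=1.0)
a2 = k_step_advantage(toy_rewards, toy_values, t=0, k=2, gamma=1.0)
a3 = k_step_advantage(toy_rewards, toy_values, t=0, k=3, gamma=1.0)
print(a1, a2, a3)
```

Here `a1` leans entirely on the value estimate of the next state, while `a3` uses every real reward in the episode and no value estimate of the future at all.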


`lam`, or lambda $ \lambda $, is the weighting parameter. It is taught intuitively in this other lesson about the [Exponentially Weighted Moving Average](https://medium.com/mlearning-ai/exponentially-weighted-average-5eed00181a09) (EWMA), only in that lesson it is called $ \beta $. Basically, the higher $ \lambda $ is, the more weight you place on values other than the most immediate one.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*u3MIYRnLguhjvM0tr72wBA.png" width=600 height=400>

What you see is that the lower $ \beta $ is, the noisier the signal. That's because the lower the beta, the less we take into account the more stable past values, and the more the moving average changes in response to the most recent, volatile piece of data. With a higher beta we weigh the past, known, and now static values more heavily, which produces a smoother curve.
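
To make the smoothing effect concrete, here is a small sketch of an EWMA on a deliberately jumpy signal. It assumes the common update rule $ v_t = \beta v_{t-1} + (1 - \beta) x_t $ starting from $ v_0 = 0 $ with no bias correction; the helper names and the alternating test signal are made up for illustration.

```python
def ewma(xs, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t, starting from v_0 = 0."""
    v = 0.0
    out = []
    for x in xs:
        v = beta * v + (1.0 - beta) * x
        out.append(v)
    return out


def roughness(series):
    """Mean absolute change between consecutive points."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(diffs) / len(diffs)


# A maximally noisy toy signal that flips between +1 and -1.
noisy = [1.0 if i % 2 == 0 else -1.0 for i in range(50)]

rough_low_beta = roughness(ewma(noisy, beta=0.5))
rough_high_beta = roughness(ewma(noisy, beta=0.9))
print("beta=0.5 roughness:", rough_low_beta)
print("beta=0.9 roughness:", rough_high_beta)
```

The higher beta produces the smaller roughness, which is the smoother curve seen in the figure.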

However, and I'm sorry for doing this, with respect to GAE the situation is reversed in both ways from the example shown in the graph. So why did I show it to you? The EWMA example is easier to understand, and the GAE relationship is similar, only reversed and harder to describe directly. Once you see the EWMA relationship, I think it's easier to take the inverse of a relationship you do understand than to explain the relationship in a more confusing setting.

Whereas in a typical time series EWMA the most immediate data is the most recent data in the past and the other data lies farther in the past, in GAE the most immediate data is the next reward and the other data is the rewards we estimate we might get in the future, via the state value function $V(s)$. A higher $ \lambda $ therefore puts more weight on these far-future estimates at the expense of the ones in your immediate future, which are more certain.

$\hat{A}_t^{GAE(\gamma,\lambda)}$ is the generalized advantage estimator. This [Blog on GAE](https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/) explains it well. The higher lambda is, the more future steps (k's) you are taking into account.

\begin{align}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\Big(\hat{A}_{t}^{(1)} + \lambda \hat{A}_{t}^{(2)} + \lambda^2 \hat{A}_{t}^{(3)} + \cdots \Big) \\
&= (1-\lambda)\Big(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V)+ \cdots \Big)  \\
&= (1-\lambda)\Big( \delta_t^V(1+\lambda+\lambda^2+\cdots) + \gamma\delta_{t+1}^V(\lambda+\lambda^2+\cdots) + \cdots \Big) \\
&= (1-\lambda)\left(\delta_t^V \frac{1}{1-\lambda} + \gamma \delta_{t+1}^V\frac{\lambda}{1-\lambda} + \cdots\right) \\
&= \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^{V}
\end{align}
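
The first and last lines of this derivation can be checked numerically. The sketch below uses made-up rewards and values and assumes every $\delta$ past the end of the episode is 0, so the infinite weighted average can be truncated at a large `max_k`. All function names here are hypothetical.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 past the end."""
    n = len(rewards)
    return [
        rewards[t] + gamma * (values[t + 1] if t + 1 < n else 0.0) - values[t]
        for t in range(n)
    ]


def gae_as_weighted_average(deltas, t, gamma, lam, max_k=1000):
    """(1 - lam) * sum_k lam^(k-1) * A_t^(k), truncated at max_k terms."""
    n = len(deltas)
    total = 0.0
    for k in range(1, max_k + 1):
        steps = min(k, n - t)  # deltas past the end of the episode are 0
        a_k = sum(gamma ** l * deltas[t + l] for l in range(steps))
        total += lam ** (k - 1) * a_k
    return (1.0 - lam) * total


def gae_as_discounted_sum(deltas, t, gamma, lam):
    """sum_l (gamma * lam)^l * delta_{t+l}."""
    return sum((gamma * lam) ** l * deltas[t + l] for l in range(len(deltas) - t))


# Hypothetical four-step episode.
toy_rewards = [0.1, -0.2, 0.0, 1.0]
toy_values = [0.3, 0.5, 0.2, 0.7]
gamma, lam = 0.99, 0.95

deltas = td_errors(toy_rewards, toy_values, gamma)
pairs = [
    (gae_as_weighted_average(deltas, t, gamma, lam),
     gae_as_discounted_sum(deltas, t, gamma, lam))
    for t in range(len(deltas))
]
for t, (weighted, direct) in enumerate(pairs):
    print(t, weighted, direct)
```

The two columns agree up to floating point error, which is why an implementation only ever needs the compact discounted sum of $\delta$ terms.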

***
The tradeoff here is that the estimators $A^{(k)}_{t}$ with small k have low variance but high bias, whereas those with large k have low bias but high variance. Why?

I think of it based on the number of terms. With small k, we have fewer terms to sum over (which means low variance). However, the bias is relatively large because it does not make use of extra “exact” information with $r_K$ for $K > k$.

Here’s another way to think of it as emphasized in the paper: $V(s_t)$ is constant among the estimator class, so it does not affect the relative bias or variance among the estimators: differences arise entirely due to the k-step returns.
***

In RL and machine learning, we call this noise the variance, as in the bias-variance tradeoff.

I like to sum this up as "the L in lambda is for longtermism": depending on the choices we make today, there are many variants of the future we could end up in, so a larger lambda means more long-termism and more variance.

Basically, like many tradeoffs, there exists a point of balance for your particular problem. In the example below, you get bad learning not only when lambda is too high, but also when it is too low.

<img src="https://d3i71xaburhd42.cloudfront.net/ca11ba7b2991fe07b7a99b3a3aeba2486ed36261/9-Figure4-1.png">

I'm going to rewrite the set of equations above to better mirror the code that we will implement below.

First, let's add the lambda term:

\begin{align}
\hat{A}_t^{(1)} &= \delta^{V}_{t} = r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(2)} &= \delta^{V}_{t} + (\gamma \lambda) \delta^{V}_{t+1} \\
\hat{A}_t^{(k)} &= \sum_{l=0}^{k-1} (\gamma \lambda)^l \delta_{t+l}^{V}
\end{align}

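Next, rewrite the estimator that sums every remaining $\delta$ term in the episode, written $\hat{A}_t$ here, recursively in terms of the advantage at step t + 1, so that we can calculate the GAE for each
step t in the sequence, working backwards from the last step to the first (with $\hat{A}_{t+1} = 0$ past the last step):

\begin{align}
\delta^{V}_{t} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \delta^{V}_{t} + (\gamma \lambda) \hat{A}_{t+1}
\end{align}

The above two expressions correspond to the comments numbered 2 and 3 in the code below, respectively.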

In [17]:
gen_len = rewards_.shape[-1]
print('gen_len', gen_len)

lastgaelam = 0  # lastgaelam takes the role of A_t+1, the advantage of the following step, in the recursion
advantages_reversed = []

# iterate backwards from last time step of episode, t = T -> 0 
for t in reversed(range(gen_len)):
    
    # 1. V(s_t+1) for all t except t = T
    nextvalues = values_[:, t + 1] if t < gen_len - 1 else 0.0  
    
    # 2. delta_t =  r_t + gamma*V(s_t+1) - V(s_t) 
    delta = rewards_[:, t] + ppo_trainer.config.gamma * nextvalues - values_[:, t]
    
    # 3. A_t = delta_t + gamma*lambda*A_t+1
    lastgaelam = delta + ppo_trainer.config.gamma * ppo_trainer.config.lam * lastgaelam
    
    advantages_reversed.append(lastgaelam)
    
    print(t, nextvalues, lastgaelam)
    
print(' ')
print(advantages_reversed)
print(' ')

# reverse advantages_reversed to put it back into forward chronological order then concatenate
advantages_ = torch.stack(advantages_reversed[::-1]).transpose(0, 1)

print(advantages_)

gen_len 9
8 0.0 tensor([0.4410])
7 tensor([0.7421]) tensor([0.6591])
6 tensor([0.5708]) tensor([-0.1602])
5 tensor([1.4121]) tensor([0.2545])
4 tensor([1.0342]) tensor([1.1724])
3 tensor([-0.0613]) tensor([0.6954])
2 tensor([0.2018]) tensor([-0.4498])
1 tensor([1.4412]) tensor([0.9322])
0 tensor([0.1640]) tensor([0.4185])
 
[tensor([0.4410]), tensor([0.6591]), tensor([-0.1602]), tensor([0.2545]), tensor([1.1724]), tensor([0.6954]), tensor([-0.4498]), tensor([0.9322]), tensor([0.4185])]
 
tensor([[ 0.4185,  0.9322, -0.4498,  0.6954,  1.1724,  0.2545, -0.1602,  0.6591,
          0.4410]])
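
As a sanity check that needs no torch, the same backward recursion can be run on the numbers printed earlier for this sample (`values_`, `rewards_`, lambda 0.95, gamma 1). This is only a sketch: the inputs are copied from the rounded printouts, so agreement is expected to about three decimal places, and `gae_backward` is a hypothetical name.

```python
# Rounded values copied from the printed `values_` and `rewards_` tensors.
values = [0.6227, 0.1640, 1.4412, 0.2018, -0.0613, 1.0342, 1.4121, 0.5708, 0.7421]
rewards = [-0.0084, 0.0823, 0.1289, -0.1553, -0.1648, 0.0288, 0.0549, 0.0689, 1.1831]
gamma, lam = 1.0, 0.95


def gae_backward(rewards, values, gamma, lam):
    """A_t = delta_t + gamma * lam * A_{t+1}, computed from the last step back."""
    n = len(rewards)
    advantages = [0.0] * n
    next_advantage = 0.0  # the advantage past the end of the episode is 0
    for t in reversed(range(n)):
        next_value = values[t + 1] if t + 1 < n else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages


advantages = gae_backward(rewards, values, gamma, lam)

# The advantages printed by the notebook cell above.
printed = [0.4185, 0.9322, -0.4498, 0.6954, 1.1724, 0.2545, -0.1602, 0.6591, 0.4410]
max_gap = max(abs(a - p) for a, p in zip(advantages, printed))
print([round(a, 4) for a in advantages])
print("largest difference from the printed tensor:", max_gap)
```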


In [18]:
# since advantage = returns - values_
returns_ = advantages_ + values_
print(returns_)

# whitening simply subtracts the means and divides by the standard deviation to zero mean the data and 
# impose a std of 1

advantages_ = whiten(advantages_)
advantages_ = advantages_.detach()

advantages_

tensor([[1.0412, 1.0962, 0.9914, 0.8972, 1.1112, 1.2887, 1.2519, 1.2299, 1.1831]])


tensor([[-4.2870e-02,  9.6484e-01, -1.7462e+00,  5.0038e-01,  1.4361e+00,
         -3.6457e-01, -1.1781e+00,  4.2917e-01,  1.2797e-03]])
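
The whitening step can be reproduced with the standard library. The sketch below assumes that `whiten` divides by the sample standard deviation (the one that divides by n - 1) plus a small epsilon. That assumption is consistent with the numbers printed above, but it is inferred from them rather than read from the `minichatgpt` source.

```python
import statistics

# Rounded advantages copied from the output of the GAE cell.
advantages = [0.4185, 0.9322, -0.4498, 0.6954, 1.1724, 0.2545, -0.1602, 0.6591, 0.4410]


def whiten_list(xs, eps=1e-8):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(xs)
    var = statistics.variance(xs)  # sample variance, divides by n - 1
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]


whitened = whiten_list(advantages)

# The whitened advantages printed by the notebook cell above.
printed = [-4.2870e-02, 9.6484e-01, -1.7462e+00, 5.0038e-01, 1.4361e+00,
           -3.6457e-01, -1.1781e+00, 4.2917e-01, 1.2797e-03]
max_gap = max(abs(w - p) for w, p in zip(whitened, printed))
print([round(w, 4) for w in whitened])
print("largest difference from the printed tensor:", max_gap)
```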

#### Clipping the Value Function (Critic)

This implementation detail is not explained in the PPO paper, but it shows up in the discussion here
https://github.com/openai/baselines/issues/91 and in the TensorFlow implementation here: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py

The authors add that they

 *Clip the value function objective to reduce variability during Critic training*

Here is the actual code from OpenAI baselines:

```python
        # CALCULATE THE LOSS
        # Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss

        # Clip the value to reduce variability during Critic training
        # Get the predicted value
        vpred = train_model.vf
        vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
        # Unclipped value
        vf_losses1 = tf.square(vpred - R)
        # Clipped value
        vf_losses2 = tf.square(vpredclipped - R)

        vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
```
And here are those same operations in PyTorch:

In [19]:
print(ppo_trainer.config.cliprange_value)
print(values_)
print(vpred_)

vpredclipped_ = \
clip_by_value(vpred_, values_ - ppo_trainer.config.cliprange_value, values_ + ppo_trainer.config.cliprange_value)

print(vpredclipped_)

# minimize the squared-error loss
vf_losses1_ = (vpred_ - returns_) ** 2
vf_losses2_ = (vpredclipped_ - returns_) ** 2
vf_loss_ = 0.5 * torch.mean(torch.max(vf_losses1_, vf_losses2_))

0.2
tensor([[ 0.6227,  0.1640,  1.4412,  0.2018, -0.0613,  1.0342,  1.4121,  0.5708,
          0.7421]])
tensor([[ 0.2078,  1.4100,  0.2574, -0.0560,  1.0054,  1.5489,  0.5861,  1.2991,
         -7.9002]], grad_fn=<SliceBackward0>)
tensor([[0.4227, 0.3640, 1.2412, 0.0018, 0.1387, 1.2342, 1.2121, 0.7708, 0.5421]],
       grad_fn=<MaximumBackward0>)
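
The clipping can also be checked by hand on the printed numbers. The pure-Python sketch below copies the rounded `values_`, `vpred_`, and `returns_` printouts and uses a hypothetical `clamp` helper in place of `clip_by_value`.

```python
# Rounded numbers copied from the printed tensors for this sample.
values = [0.6227, 0.1640, 1.4412, 0.2018, -0.0613, 1.0342, 1.4121, 0.5708, 0.7421]
vpred = [0.2078, 1.4100, 0.2574, -0.0560, 1.0054, 1.5489, 0.5861, 1.2991, -7.9002]
returns = [1.0412, 1.0962, 0.9914, 0.8972, 1.1112, 1.2887, 1.2519, 1.2299, 1.1831]
cliprange_value = 0.2


def clamp(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))


# Keep each new prediction within +/- cliprange_value of the old value.
vpred_clipped = [
    clamp(p, v - cliprange_value, v + cliprange_value)
    for p, v in zip(vpred, values)
]

# Squared error with and without clipping; the larger of the two is used.
vf_losses1 = [(p - r) ** 2 for p, r in zip(vpred, returns)]
vf_losses2 = [(pc - r) ** 2 for pc, r in zip(vpred_clipped, returns)]
worst_case = [max(l1, l2) for l1, l2 in zip(vf_losses1, vf_losses2)]
vf_loss = 0.5 * sum(worst_case) / len(worst_case)

print([round(pc, 4) for pc in vpred_clipped])
print(vf_loss)
```

In this sample every new prediction has moved more than 0.2 away from the old value, so every entry of `vpred_clipped` sits on the edge of its clip range.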


In the TRL version, a value called `vf_clipfrac` is calculated to report what proportion of the value losses ended up being clipped, so a value below 0.5 means that most of our value function losses were within the range.

#### Clipped Surrogate Objective

The Clipped Surrogate Objective is the main discussion point in the PPO paper.

Here is the main equation from the paper:

$$ r_t(\theta) = \frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_{old}} (a_t|s_t)} $$

\begin{align}
\mathcal{L}^{CLIP}(\theta) =
 \hat{\mathbb{E}}_t \biggl[
   \min \Bigl(r_t(\theta) \hat{A}_t,
             \text{clip} \bigl(
              r_t(\theta), 1 - \epsilon, 1 + \epsilon
             \bigr) \hat{A}_t
   \Bigr)
 \biggr]
\end{align}

In layman's terms, this says:

$r_t(\theta)$ is the ratio (not the reward) between the probabilities that the new policy and the old policy assign to the action. A ratio of 1 means the two policies are the same with respect to this particular action and state. With $\epsilon = 0.2$, as in this example, a ratio < 0.8 or > 1.2 means that the new policy differs from the old one by more than 20% for that action.

So in order to disincentivize radical, noisy, unconstructive updates to the new policy (this phrasing is from https://huggingface.co/blog/deep-rl-ppo), we update our policy only if:


1. Our ratio is in the range `[1 - epsilon, 1 + epsilon]` 

OR

2. Our ratio is outside the range, but the advantage leads to getting closer to the range. There are 2 cases:
    
    a. The ratio is below the range but the advantage is > 0
    
    b. The ratio is above the range but the advantage is < 0.
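
The cases above can be checked numerically. The sketch below evaluates the clipped objective for a single action and estimates its slope with respect to the ratio by finite differences. A slope of 0 means that action contributes no gradient, so it does not drive an update. The helper names and test ratios are made up, and $\epsilon = 0.2$ is assumed.

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single action."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def slope_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite difference of the objective with respect to the ratio."""
    upper = clipped_objective(ratio + h, advantage, eps)
    lower = clipped_objective(ratio - h, advantage, eps)
    return (upper - lower) / (2 * h)


cases = {
    "1.  in range,    A > 0": (1.0, 1.0),
    "2a. below range, A > 0": (0.5, 1.0),
    "2b. above range, A < 0": (1.5, -1.0),
    "    above range, A > 0": (1.5, 1.0),
    "    below range, A < 0": (0.5, -1.0),
}

slopes = {name: slope_wrt_ratio(r, a) for name, (r, a) in cases.items()}
for name, slope in slopes.items():
    print(name, "-> update" if abs(slope) > 1e-9 else "-> no update")
```

Only the first three cases have a non-zero slope, which matches the list above: the objective goes flat exactly when the ratio has already left the range in the direction the advantage is pushing it.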


<img src="https://huggingface.co/blog/assets/93_deep_rl_ppo/recap.jpg" height=400 width=600>

[Table from "Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)

Here it is in the baselines TensorFlow implementation:

```python
        # Calculate ratio (pi current policy / pi old policy)
        ratio = tf.exp(OLDNEGLOGPAC - neglogpac)

        # Defining Loss = - J is equivalent to max J
        pg_losses = -ADV * ratio

        pg_losses2 = -ADV * tf.clip_by_value(ratio, 1.0 - CLIPRANGE, 1.0 + CLIPRANGE)

        # Final PG loss
        pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
        approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
        clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))

        # Total loss
        loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
```

And here it is in PyTorch:


In [20]:
ratio_ = torch.exp(new_logprobs_ - old_logprobs_)
pg_losses_ = -advantages_ * ratio_
pg_losses2_ = \
    -advantages_ * torch.clamp(ratio_, 1.0 - ppo_trainer.config.cliprange, 1.0 + ppo_trainer.config.cliprange)

pg_loss_ = torch.mean(torch.max(pg_losses_, pg_losses2_))

#### Combined Objective

<img src="https://huggingface.co/blog/assets/93_deep_rl_ppo/ppo-objective.jpg" height=400 width=600> 

The TRL authors leave out the entropy term because, at this time, there does not seem to be a way to include it that gives much benefit (see https://github.com/lvwerra/trl/issues/131). This is why, in our case, after the policy gradient loss and the value function loss are returned from the function above, the final loss we backpropagate on is

`loss = pg_loss + self.config.vf_coef * vf_loss`

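The objective in the figure is maximized, which is why the squared-error value loss appears there with a negative sign: maximizing the negative of this component is the same as minimizing the original component. In the code the signs are flipped (`pg_loss` is already the negated clipped objective), so the combined `loss` is minimized.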
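
The `train_stats` printed by the single training step earlier are consistent with this formula. In the sketch below the three loss values are copied from that printout. The coefficient 0.1 is inferred from those numbers, not read from the config, so treat it as an assumption.

```python
# Copied from the train_stats printed after the single update step.
loss_policy = 1.3659e-08  # 'loss/policy'
loss_value = 3.0568       # 'loss/value'
loss_total = 0.3057       # 'loss/total'

# Solve loss_total = loss_policy + vf_coef * loss_value for vf_coef.
implied_vf_coef = (loss_total - loss_policy) / loss_value

# Recombine with the rounded coefficient (assumed to be 0.1).
assumed_vf_coef = 0.1
recombined = loss_policy + assumed_vf_coef * loss_value

print("implied vf_coef:", implied_vf_coef)
print("recombined total:", recombined)
```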