#### Applying the PPO RL Algorithm to Natural Language Sequences

The reinforcement learning (RL) update step uses the log probabilities, value function estimates, stepwise rewards, prompt tokens, and generated tokens to update the language model.

Let's first quickly go through the steps from part 1 to recompute those same log probabilities, value function estimates, stepwise rewards, prompt tokens, and generated tokens, so that we can continue from where we left off.

In [2]:
import torch

from minichatgpt.experiments.imdb import config, sent_kwargs
from minichatgpt import Lab
from minichatgpt.processdata.collators import imdb_dataloader_collator
from datasets import Dataset

# for the loss calculation
from minichatgpt.core import masked_whiten, masked_mean, clip_by_value

%load_ext autoreload
%autoreload 2

In [3]:
# The notebook should work even without GPUs, but if you have them, confirm you do
print ('number of GPUs:', torch.cuda.device_count())

# For the sake of the speed of this demonstration, the batch_size is temporarily decreased from 256 to 4
batch_size = 4
config.batch_size = batch_size
config.forward_batch_size = max(batch_size//8,1)

print('config.batch_size', config.batch_size)
print('config.forward_batch_size', config.forward_batch_size)

number of GPUs: 1
config.batch_size 4
config.forward_batch_size 1


In [4]:
lab = Lab(config)

dataset = lab.build_dataset(
    dataset_name="imdb",
    input_min_text_length=2,
    input_max_text_length=8,
)




Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


In [6]:
new_policy, old_policy, tokenizer = lab.init_policies_tokenizer()

lab.set_generation_config(
    do_sample=True,
    output_min_length=4,
    output_max_length=16,
    pad_token_id=tokenizer.eos_token_id,
)

ppo_trainer = lab.init_ppo_trainer(
    config, 
    new_policy, 
    old_policy, 
    tokenizer, 
    dataset, 
    dataloader_collator = imdb_dataloader_collator,
)

reward_model = lab.init_reward_model()



In [7]:
for batch_step, batch in enumerate(ppo_trainer.dataloader):
    
    queries = batch['input_ids']
    
    #### Get response from gpt2
    responses = []
    for query in queries:
        gen_len = lab.output_length_sampler()
        lab.generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **lab.generation_kwargs)
        responses.append(response.squeeze()[-gen_len:])

    batch['response'] = [tokenizer.decode(r.squeeze()) for r in responses]

    #### Compute sentiment score
    texts = [q + r for q,r in zip(batch['query'], batch['response'])]
    pipe_outputs = lab.reward_model(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
    break
    
queries, responses, scores = ppo_trainer._step_safety_checker(batch_size, queries, responses, rewards)

scores



[tensor(-2.1110, device='cuda:0'),
 tensor(-2.7184, device='cuda:0'),
 tensor(0.8919, device='cuda:0'),
 tensor(-1.8611, device='cuda:0')]

In [8]:
model_inputs = ppo_trainer.prepare_model_inputs(queries, responses)
model_inputs_names = list(model_inputs.keys())
print(model_inputs_names, model_inputs['input_ids'].shape)

# next we use these end of episode (end of sentence) rewards
all_logprobs, _, values, masks = ppo_trainer.batched_forward_pass(ppo_trainer.model, queries, responses, model_inputs)
ref_logprobs, _, _, _ = ppo_trainer.batched_forward_pass(ppo_trainer.ref_model, queries, responses, model_inputs)

rewards, non_score_reward = ppo_trainer.compute_rewards(scores, all_logprobs, ref_logprobs, masks)

rewards

['input_ids', 'attention_mask', 'labels'] torch.Size([4, 21])


tensor([[-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -2.1110, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000],
        [-0.0000, -0.0000, -0.0000, -0.0000, -2.7184, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000],
        [-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
          0.8919, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000],
        [-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
         -0.0000, -0.0000, -0.0000, -1.8611]], device='cuda:0',
       grad_fn=<StackBackward0>)

#### train_minibatch()

train_minibatch does 3 things:

1. computes the loss components using PPO
2. combines the loss components into `loss`
3. does backpropagation on `loss`

So far we have only calculated the tensors we need for PPO. Next we will study what's going on in the PPO update step, `ppo_trainer.train_minibatch`.

When you run the steps below, even a single iteration updates the policy, so the rewards computed afterwards will have changed.

In [9]:
mini_batch_dict = {
    "queries": queries,
    "responses": responses,
    "logprobs": all_logprobs,
    "values": values,
    "rewards": rewards,
    "masks": masks,
}

mini_batch_dict.update(model_inputs)

mini_batch_dict.keys()

dict_keys(['queries', 'responses', 'logprobs', 'values', 'rewards', 'masks', 'input_ids', 'attention_mask', 'labels'])

In [10]:
def mini_batch_collator(data):
    return_dict = dict()
    for key in data[0]:
        if key in ["queries", "responses"]:
            return_dict[key] = [d[key] for d in data]
        else:
            return_dict[key] = torch.stack([d[key] for d in data]).to(ppo_trainer.accelerator.device)
    return return_dict

mini_batch_data = Dataset.from_dict(mini_batch_dict)
mini_batch_data.set_format("torch")

mini_batch_dataloader = torch.utils.data.DataLoader(
    mini_batch_data,
    batch_size=ppo_trainer.config.mini_batch_size,
    shuffle=True,
    collate_fn=mini_batch_collator,
)

In [11]:
for batch in mini_batch_dataloader:
    model_inputs = {k: batch[k] for k in model_inputs_names}
    
    new_logprobs, logits, vpreds, _ = ppo_trainer.batched_forward_pass(
        ppo_trainer.model, batch["queries"], batch["responses"], model_inputs
    )
    
    old_logprobs = batch["logprobs"]
    values = batch["values"]
    rewards = batch["rewards"]
    masks = batch["masks"]
    
    loss_p, loss_v, train_stats = ppo_trainer.loss(
        old_logprobs, 
        values, 
        rewards, 
        logits, 
        vpreds, 
        new_logprobs, 
        masks,
    )
    
    loss = loss_p + loss_v
    
    ppo_trainer.optimizer.zero_grad()
    ppo_trainer.accelerator.backward(loss)
    ppo_trainer.optimizer.step()
    
    break
    
train_stats

{'loss/policy': tensor(-5.1090e-08, device='cuda:0', grad_fn=<DivBackward0>),
 'loss/value': tensor(10.9744, device='cuda:0', grad_fn=<MulBackward0>),
 'loss/total': tensor(1.0974, device='cuda:0', grad_fn=<AddBackward0>),
 'policy/entropy': tensor(5.0373, device='cuda:0', grad_fn=<DivBackward0>),
 'policy/approxkl': tensor(0., device='cuda:0', grad_fn=<MulBackward0>),
 'policy/policykl': tensor(0., device='cuda:0', grad_fn=<DivBackward0>),
 'policy/clipfrac': tensor(0., device='cuda:0', dtype=torch.float64),
 'policy/advantages': tensor([[-1.4882, -1.4946, -1.5671, -1.0137,  0.4114,  1.1047, -0.3123,  0.4867,
           0.8904, -1.3661, -1.3661, -1.3661, -1.3661, -1.3661, -1.3661, -1.3661,
          -1.3661, -1.3661, -1.3661, -1.3661]], device='cuda:0'),
 'policy/advantages_mean': tensor(5.1090e-08, device='cuda:0'),
 'policy/ratio': tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1.]], device='cuda:0', grad_fn=<ExpBackward0>),
 'returns/

In [12]:
model_inputs = ppo_trainer.prepare_model_inputs(queries, responses)
model_inputs_names = list(model_inputs.keys())
print(model_inputs_names, model_inputs['input_ids'].shape)

# next we use these end of episode (end of sentence) rewards
all_logprobs, _, values, masks = ppo_trainer.batched_forward_pass(ppo_trainer.model, queries, responses, model_inputs)
ref_logprobs, _, _, _ = ppo_trainer.batched_forward_pass(ppo_trainer.ref_model, queries, responses, model_inputs)

rewards, non_score_reward = ppo_trainer.compute_rewards(scores, all_logprobs, ref_logprobs, masks)

rewards

['input_ids', 'attention_mask', 'labels'] torch.Size([4, 21])


tensor([[ 2.6242e-03,  3.9878e-03,  5.5797e-04, -5.9639e-03,  1.7390e-03,
          9.9352e-05,  1.1417e-05,  2.7222e-03, -5.1538e-03,  3.3707e-03,
         -2.1136e+00, -2.5463e-05, -2.3460e-05, -2.6894e-05, -3.1090e-05,
         -3.6526e-05, -2.9755e-05, -3.8910e-05, -4.4727e-05, -4.8828e-05],
        [-4.2026e-03,  1.3269e-03,  7.7820e-05, -1.5076e-03, -2.7170e+00,
         -4.2198e-03, -5.1532e-03, -5.0203e-03, -4.9840e-03, -4.9579e-03,
         -4.8109e-03, -4.7613e-03, -4.6192e-03, -4.5017e-03, -4.3607e-03,
         -4.2481e-03, -4.0533e-03, -3.8754e-03, -3.7760e-03, -3.6945e-03],
        [-3.8300e-04, -2.5545e-02,  8.6190e-01,  1.1374e-01, -1.5309e-02,
         -7.5511e-02,  2.6089e-03, -5.4370e-02,  7.7570e-01,  1.0541e-02,
         -3.4704e-04, -3.8314e-04, -2.1830e-04, -5.6267e-06,  4.4022e-04,
          8.9178e-04,  1.4722e-03,  1.5696e-03,  1.5185e-03,  1.4860e-03],
        [ 2.0596e-03, -1.5564e-03,  2.8242e-03, -1.5065e-03, -3.0094e-03,
          2.6947e-03,  6.7685e-04, 

#### loss()

This method is where the specifics of GAE and PPO are implemented.



#### Advantage = Returns - Values

The [Advantage](https://huggingface.co/blog/deep-rl-a2c) is the $\hat{A}$ in $\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}(\mathbf{a} \mid \mathbf{s}) \hat{A}^{\pi}(\mathbf{s}, \mathbf{a})$, where $J$ is the objective function whose gradient we use to update our parameters.


You can put several different expressions in the position where $\hat{A}$ is, including $Q$, the state-action value function, or the Monte Carlo single-sample reward-to-go, i.e. the cumulative remaining rewards of the rollout or episode. Using the Advantage here is an attempt to use a lower-variance term. You can think of the Advantage as the relative value of committing to action $a$ in state $s$, compared to the general value of being in state $s$ and taking the average action the policy would have selected for you.

The term `lastgaelam` in the code below is short for "last GAE lambda": the [generalized advantage estimator](https://arxiv.org/pdf/1506.02438.pdf) (GAE) value from the previously processed timestep.

In the equations below, $\hat{A}^{(k)}_{t}$ is the advantage. It answers the question: compared to the expected reward from state $ s_{t} $ until the end of the episode, how much more or less did we get specifically as a result of the action we took at step $t$, rather than the other actions we could have taken, and not counting the actions taken before or after step $t$? As a simple equation:

$$ Advantage = Returns - Values $$

`returns` is the discounted sum of rewards from step $t$ onward, $ R(t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_{T} $, where $T$ is the final timestep of the episode and $ \gamma $ is the discount factor. A small $ \gamma \ll 1 $ emphasizes short-term rewards over long-term rewards, while $ \gamma = 1 $ weights them equally.

`values` is the model's prediction of what the returns will be at any given timestep $t$ and state $s_t$.
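To make the definition of `returns` concrete, here is a minimal pure-Python sketch. The helper name `discounted_returns` and the toy reward list are made up for illustration and are not part of `minichatgpt` or TRL. It accumulates rewards from the last step backwards, so each step sees its own reward plus the discounted return of the step after it.

```python
def discounted_returns(rewards, gamma):
    """Return R(t) = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that `running` always holds R(t + 1).
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A toy episode with a single reward at the final step, like our sentiment score.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=1.0))  # [1.0, 1.0, 1.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

With $\gamma = 1$, as in this notebook's config, every token in the response is credited with the full end-of-episode score.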

In [21]:
print('lambda', ppo_trainer.config.lam, 'gamma', ppo_trainer.config.gamma)

lambda 0.95 gamma 1


#### Generalized Advantage Estimator (GAE)

$\hat{A}^{(1)}_{t}$ does this using only the actual reward $r_t$ received immediately after taking action $ a_{t} $ in $ s_{t} $, plus $\gamma V(s_{t+1})$ to estimate the rest of the future, minus $V(s_t)$, the expected total reward averaged over all the action options at state $s_t$, good and bad.

\begin{align}
\hat{A}_t^{(1)} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\
\cdots &= \cdots \\ 
\hat{A}_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)
\end{align}

$\hat{A}^{(2)}_{t}$ is similar to $\hat{A}^{(1)}_{t}$, except that we incorporate 2 steps of actual reward before estimating the rest, and so on.
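As a rough illustration of the family of estimators above, the sketch below computes $\hat{A}^{(k)}_t$ on a toy episode. The function name, the toy numbers, and the convention that values past the end of the episode are 0 are assumptions made for this example only.

```python
def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k): k real rewards, then bootstrap from V(s_{t+k}), minus V(s_t).

    Assumption for this sketch: rewards and values past the end of the
    episode are treated as 0.
    """
    T = len(rewards)
    reward_sum = 0.0
    for l in range(k):
        if t + l < T:
            reward_sum += (gamma ** l) * rewards[t + l]
    bootstrap = values[t + k] if t + k < T else 0.0
    return reward_sum + (gamma ** k) * bootstrap - values[t]


rewards = [0.0, 0.0, 1.0]   # reward only at the end of the episode
values = [0.5, 0.6, 0.8]    # toy value predictions V(s_0), V(s_1), V(s_2)

for k in (1, 2, 3):
    print(k, k_step_advantage(rewards, values, t=0, k=k, gamma=1.0))
```

For small `k` the estimate leans on the value predictions; by `k = 3` it uses only the real rewards of the episode.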


`lam`, or lambda $ \lambda $, is the weighting parameter. It is taught intuitively in this lesson about the [Exponentially Weighted Moving Average](https://medium.com/mlearning-ai/exponentially-weighted-average-5eed00181a09) (EWMA), where it is called $ \beta $. Basically, the higher $ \lambda $ is, the more weight you place on values other than the most immediate one.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*u3MIYRnLguhjvM0tr72wBA.png" width=600 height=400>

What you see is that the lower $ \beta $ is, the noisier the signal. That's because a lower $ \beta $ takes less account of the more stable past values and instead lets the moving average change a lot based on the most recent, volatile piece of data. With a higher $ \beta $ we weigh the past, known and now static values more heavily, which produces a smoother curve.
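The smoothing effect can be checked with a few lines of Python. This is a sketch of the usual EWMA recurrence $v_t = \beta v_{t-1} + (1 - \beta) x_t$ starting from $v = 0$; the helper names and the toy signal are made up for illustration.

```python
def ewma(xs, beta):
    """Exponentially weighted moving average, starting from v = 0."""
    v = 0.0
    out = []
    for x in xs:
        v = beta * v + (1.0 - beta) * x
        out.append(v)
    return out


def spread(vs):
    """Distance between the largest and smallest value: a crude noise measure."""
    return max(vs) - min(vs)


noisy = [1.0, -1.0] * 10            # a maximally jittery toy signal
low = ewma(noisy, beta=0.1)         # follows the newest point closely
high = ewma(noisy, beta=0.9)        # leans on the accumulated past

print("beta=0.1 spread:", spread(low))
print("beta=0.9 spread:", spread(high))
```

The low-$\beta$ average swings almost as widely as the raw signal, while the high-$\beta$ average stays close to zero.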

However, and I'm sorry for doing this, with respect to GAE the situation is reversed in both ways from the example shown in the graph. So why did I show it to you? The example is easier to understand, and the GAE relationship is the same one only reversed, and harder to describe directly. Once you see the EWMA relationship, I think it's easier to invert a relationship you do understand than to learn it from scratch in a more confusing setting.

Whereas in a typical time series EWMA the most immediate data is the most recent data in the past and the other data lies in the farther past, in GAE the most immediate data is the next reward and the other data is the rewards we estimate we might get in the future, via the state value function $V(s)$. The $ \lambda $ is therefore higher when you want to weight these far-future estimates higher at the expense of the ones in your immediate future, which are more certain.

$\hat{A}_t^{GAE(\gamma,\lambda)}$ is the generalized advantage estimator. This [Blog on GAE](https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/) explains it well. The higher lambda is, the more future steps ($k$'s) you are taking into account.

\begin{align}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\Big(\hat{A}_{t}^{(1)} + \lambda \hat{A}_{t}^{(2)} + \lambda^2 \hat{A}_{t}^{(3)} + \cdots \Big) \\
&= (1-\lambda)\Big(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V)+ \cdots \Big)  \\
&= (1-\lambda)\Big( \delta_t^V(1+\lambda+\lambda^2+\cdots) + \gamma\delta_{t+1}^V(\lambda+\lambda^2+\cdots) + \cdots \Big) \\
&= (1-\lambda)\left(\delta_t^V \frac{1}{1-\lambda} + \gamma \delta_{t+1}^V\frac{\lambda}{1-\lambda} + \cdots\right) \\
&= \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^{V}
\end{align}
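The derivation above can be sanity-checked numerically. The sketch below is a check under stated assumptions, not the library's implementation: the function names and toy numbers are invented, rewards and values past the end of the episode are taken to be 0, and the infinite weighted average is truncated at `max_k` terms (the neglected tail is on the order of $\lambda^{\text{max\_k}}$).

```python
def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k) from its definition; anything past the episode end counts as 0."""
    T = len(rewards)
    reward_sum = sum(gamma ** l * rewards[t + l] for l in range(k) if t + l < T)
    bootstrap = values[t + k] if t + k < T else 0.0
    return reward_sum + gamma ** k * bootstrap - values[t]


def gae_weighted_average(rewards, values, t, gamma, lam, max_k=2000):
    """First line of the derivation: (1 - lam) * sum_k lam^(k-1) * A_t^(k)."""
    return (1.0 - lam) * sum(
        lam ** (k - 1) * k_step_advantage(rewards, values, t, k, gamma)
        for k in range(1, max_k + 1)
    )


def gae_delta_sum(rewards, values, t, gamma, lam):
    """Last line of the derivation: sum_l (gamma * lam)^l * delta_{t+l}."""
    T = len(rewards)
    total = 0.0
    for l in range(T - t):
        s = t + l
        next_value = values[s + 1] if s + 1 < T else 0.0
        delta = rewards[s] + gamma * next_value - values[s]
        total += (gamma * lam) ** l * delta
    return total


rewards = [0.0, 0.0, 0.0, -2.0]
values = [0.3, -0.1, 0.4, -1.0]
a = gae_weighted_average(rewards, values, t=0, gamma=0.99, lam=0.95)
b = gae_delta_sum(rewards, values, t=0, gamma=0.99, lam=0.95)
print(a, b)  # the two forms should agree up to floating-point error
```

The second form needs only one pass over the $\delta$ terms, which is why implementations use it rather than averaging the $k$-step estimators explicitly.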

***
The tradeoff here is that the estimators $A^{(k)}_{t}$ with small k have low variance but high bias, whereas those with large k have low bias but high variance. Why?

I think of it based on the number of terms. With small $k$, we have fewer terms to sum over (which means low variance). However, the bias is relatively large because it does not make use of extra “exact” information with $r_K$ for $K > k$.

Here’s another way to think of it as emphasized in the paper: $V(s_t)$ is constant among the estimator class, so it does not affect the relative bias or variance among the estimators: differences arise entirely due to the $k$-step returns.
***

In RL and machine learning we call this noise the variance, as in the bias-variance tradeoff.

I like to sum this up as "the L in lambda is for longtermism": depending on the choices we make today, there are many variants of the future we could end up in, so a larger $ \lambda $ means more long-termism and more variance.

Basically, like many tradeoffs, there exists a point of balance for your particular problem. In the example below, you get bad learning not only when lambda is too high but also when it is too low.

<img src="https://d3i71xaburhd42.cloudfront.net/ca11ba7b2991fe07b7a99b3a3aeba2486ed36261/9-Figure4-1.png">

I'm going to rewrite the set of equations above to better mirror the code that we will implement below.

First, let's add the lambda term:

\begin{align}
\hat{A}_t^{(1)} &= \delta^{V}_{t} = r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(2)} &= \delta^{V}_{t} + (\gamma \lambda) \delta^{V}_{t+1} \\
\hat{A}_t^{(k)} &= \sum_{l=0}^{k-1} (\gamma \lambda)^l \delta_{t+l}^{V}
\end{align}

Next, rewrite the estimate at step $t$ in terms of the estimate at step $t + 1$, so that we can calculate the GAE for each step in the sequence from the last, $T$, back to the first, 0. Writing $\hat{A}_t$ for the estimate that sums every remaining $\delta^{V}$ up to the end of the episode, and taking $\hat{A}_{T+1} = 0$:

\begin{align}
\delta^{V}_{t} &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \delta^{V}_{t} + (\gamma \lambda) \hat{A}_{t+1}
\end{align}

The above two expressions correspond to the lines marked 2 and 3 in the code below, respectively.

In [27]:
values = values * masks
rewards = rewards * masks

gen_len = rewards.shape[-1]
print('gen_len', gen_len)

lastgaelam = 0  # lastgaelam takes the role of A_t+1 in the recursion; it is 0 past the last step
advantages_reversed = []

# iterate backwards from last time step of episode, t = T -> 0 
for t in reversed(range(gen_len)):
    
    # 1. V(s_t+1) for all t except t = T
    nextvalues = values[:, t + 1] if t < gen_len - 1 else 0.0  
    
    # 2. delta_t =  r_t + gamma*V(s_t+1) - V(s_t) 
    delta = rewards[:, t] + ppo_trainer.config.gamma * nextvalues - values[:, t]
    
    # 3. A_t = delta_t + gamma*lambda*A_t+1
    lastgaelam = delta + ppo_trainer.config.gamma * ppo_trainer.config.lam * lastgaelam
    
    advantages_reversed.append(lastgaelam)
    
    #print(t, nextvalues, lastgaelam)
    
#print(' ')
#print(advantages_reversed)
#print(' ')

# reverse advantages_reversed to put it back into forward chronological order, then stack into (batch, time)
advantages = torch.stack(advantages_reversed[::-1]).transpose(0, 1)

print(advantages)

gen_len 21
tensor([[ 1.3324,  1.4025, -0.6980, -0.0607,  1.0487,  0.7870, -0.1646, -0.5177,
         -0.8657,  0.7834, -1.6866,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-0.5344, -0.5625, -3.2472, -2.1215, -2.5153, -3.4795,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 1.6768,  0.5432,  0.2625,  0.9680,  2.2337,  2.0082,  1.0950,  0.3707,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.4374,  0.4604,  0.4846,  0.5101,  0.5370,  0.5652, -0.3046, -0.9661,
         -1.8352, -2.9504, -2.5220, -2.9497, -0.9531,  0.3294, -0.3411,  0.3765,
          0.6635, -1.3289, -1.0629, -1.3277, -0.3742]],
       grad_fn=<TransposeBackward0>)
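As a cross-check of the loop above, here is a pure-Python sketch for a single sequence, with invented function names and toy numbers. It compares the backward recursion against the direct sum $\sum_l (\gamma\lambda)^l \delta_{t+l}$, and also confirms that with $\gamma = \lambda = 1$ the advantage reduces to returns minus values.

```python
def gae_backward(rewards, values, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1} for one sequence."""
    T = len(rewards)
    advantages = [0.0] * T
    lastgaelam = 0.0  # plays the role of A_{t+1}; zero past the last step
    for t in reversed(range(T)):
        nextvalue = values[t + 1] if t < T - 1 else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[t] = lastgaelam
    return advantages


def gae_direct(rewards, values, t, gamma, lam):
    """Direct evaluation of sum_l (gamma * lam)^l * delta_{t+l}."""
    T = len(rewards)
    total = 0.0
    for l in range(T - t):
        s = t + l
        nextvalue = values[s + 1] if s + 1 < T else 0.0
        delta = rewards[s] + gamma * nextvalue - values[s]
        total += (gamma * lam) ** l * delta
    return total


rewards = [0.0, 0.0, 0.0, -2.0]   # score arrives only at the final token
values = [0.3, -0.1, 0.4, -1.0]   # toy value-head predictions

recursive = gae_backward(rewards, values, gamma=1.0, lam=0.95)
direct = [gae_direct(rewards, values, t, gamma=1.0, lam=0.95) for t in range(4)]
print(recursive)
print(direct)

# With gamma = lam = 1 the advantage is exactly (sum of remaining rewards) - V(s_t).
monte_carlo = gae_backward(rewards, values, gamma=1.0, lam=1.0)
print(monte_carlo)
```

The recursion costs one pass over the sequence, whereas evaluating the direct sum separately for every `t` would cost a pass per step.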


In [31]:
# since advantage = returns - values
returns = advantages + values
print(returns)

# whitening subtracts the mean and divides by the standard deviation, giving the data
# zero mean and a std of 1

advantages = masked_whiten(advantages, masks)
advantages = advantages.detach()

advantages

tensor([[ 1.3324,  1.4025,  1.3676,  1.3654,  1.4211,  1.4601,  1.4556,  1.4275,
          1.3849,  1.4243,  1.3307,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-0.5344, -0.5625, -0.7249, -0.8308, -0.9562, -1.1285,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 1.6768,  1.7039,  2.0918,  2.0957,  2.1689,  2.4382,  2.6648,  2.6982,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.4374,  0.4604,  0.4846,  0.5101,  0.5370,  0.5652,  0.5500,  0.5031,
          0.4118,  0.2638,  0.1347, -0.0132, -0.0592, -0.0455, -0.0621, -0.0423,
         -0.0132, -0.0940, -0.1501, -0.2157, -0.2349]], grad_fn=<AddBackward0>)


tensor([[ 1.3046,  1.3521, -0.0702,  0.3614,  1.1125,  0.9353,  0.2910,  0.0519,
         -0.1837,  0.9329, -0.7396,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,
          0.4024,  0.4024,  0.4024,  0.4024,  0.4024],
        [ 0.0406,  0.0216, -1.7962, -1.0340, -1.3007, -1.9535,  0.4024,  0.4024,
          0.4024,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,
          0.4024,  0.4024,  0.4024,  0.4024,  0.4024],
        [ 1.5378,  0.7702,  0.5802,  1.0579,  1.9149,  1.7622,  1.1438,  0.6535,
          0.4024,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,  0.4024,
          0.4024,  0.4024,  0.4024,  0.4024,  0.4024],
        [ 0.6986,  0.7142,  0.7306,  0.7478,  0.7660,  0.7852,  0.1962, -0.2517,
         -0.8402, -1.5953, -1.3052, -1.5948, -0.2429,  0.6255,  0.1715,  0.6574,
          0.8517, -0.4974, -0.3173, -0.4965,  0.1491]])
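To see roughly what masked whitening does, here is a pure-Python sketch. It is an assumption about the behaviour of `masked_whiten`, not its actual source: the function name is made up, the statistics are taken over unmasked entries only, and the population variance is used (the library may use a different variance convention).

```python
import math


def masked_whiten_sketch(rows, masks, eps=1e-8):
    """Shift and scale `rows` using the mean and variance of unmasked entries only."""
    kept = [x for row, mask_row in zip(rows, masks)
            for x, m in zip(row, mask_row) if m]
    mean = sum(kept) / len(kept)
    var = sum((x - mean) ** 2 for x in kept) / len(kept)
    scale = 1.0 / math.sqrt(var + eps)
    # Every entry is transformed, including masked ones; later masked means ignore them.
    return [[(x - mean) * scale for x in row] for row in rows]


advantages = [[1.0, 3.0, 0.0]]
masks = [[1, 1, 0]]               # the last position is padding
print(masked_whiten_sketch(advantages, masks))
```

Note that the padded position no longer holds 0 after whitening. Under this reading, that is why the padded entries in the output above all share one nonzero constant; they are excluded later by the masked means in the loss.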

#### Clipping the Value Function (Critic)

This implementation detail is not explained in the PPO paper, but it appears in the discussion at https://github.com/openai/baselines/issues/91 and in the TensorFlow implementation at https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py

The authors add that they

 *Clip the value function objective to reduce variability during Critic training*

Here is the actual code from OpenAI baselines:

```python
        # CALCULATE THE LOSS
        # Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss

        # Clip the value to reduce variability during Critic training
        # Get the predicted value
        vpred = train_model.vf
        vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
        # Unclipped value
        vf_losses1 = tf.square(vpred - R)
        # Clipped value
        vf_losses2 = tf.square(vpredclipped - R)

        vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
```
And here are those same operations in PyTorch:

In [36]:
print(ppo_trainer.config.cliprange_value)
print(values)
print(vpreds)

vpredclipped = \
    clip_by_value(vpreds, values - ppo_trainer.config.cliprange_value, values + ppo_trainer.config.cliprange_value)

print(vpredclipped)

# minimize the squared-error loss
vf_losses1 = (vpreds - returns) ** 2
vf_losses2 = (vpredclipped - returns) ** 2
vf_loss = 0.5 * masked_mean(torch.max(vf_losses1, vf_losses2), masks)
vf_clipfrac = masked_mean(torch.gt(vf_losses2, vf_losses1).double(), masks)

vf_clipfrac

0.2
tensor([[ 0.0000,  0.0000,  2.0656,  1.4261,  0.3724,  0.6731,  1.6202,  1.9452,
          2.2506,  0.6409,  3.0173,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  2.5223,  1.2907,  1.5591,  2.3511,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  1.1608,  1.8292,  1.1277, -0.0648,  0.4300,  1.5698,  2.3275,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.8546,  1.4692,
          2.2470,  3.2142,  2.6567,  2.9365,  0.8939, -0.3749,  0.2790, -0.4189,
         -0.6767,  1.2350,  0.9129,  1.1119,  0.1393]], grad_fn=<MulBackward0>)
tensor([[ 0.1872,  0.9147,  2.1325,  1.1916,  1.1885, -0.1127,  1.5022,  1.6317,
          0.8113,  0.9

tensor(0.4286, dtype=torch.float64)

In the TRL version, a value called `vf_clipfrac` is calculated to report the fraction of unmasked positions where the clipped loss exceeded the unclipped one, so a value < 0.5 means that clipping changed the loss for fewer than half of the positions.
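The effect of value clipping is easiest to see on single numbers. The sketch below mirrors the per-element operations in the cell above using plain floats; the function names and the toy inputs are made up for illustration.

```python
def clip(x, lo, hi):
    """Clamp x into [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_value_loss(vpred, old_value, ret, cliprange_value):
    """Per-element value loss and whether the clipped term was the larger one."""
    vpred_clipped = clip(vpred, old_value - cliprange_value, old_value + cliprange_value)
    loss_unclipped = (vpred - ret) ** 2
    loss_clipped = (vpred_clipped - ret) ** 2
    loss = 0.5 * max(loss_unclipped, loss_clipped)
    return loss, loss_clipped > loss_unclipped


old_value, ret, cliprange_value = 1.0, 2.0, 0.2

# New prediction stays within 0.2 of the old value: clipping changes nothing.
print(clipped_value_loss(1.1, old_value, ret, cliprange_value))
# New prediction jumped far toward the return: the clipped (larger) loss is used.
print(clipped_value_loss(1.5, old_value, ret, cliprange_value))
# New prediction moved away from the return: the unclipped (larger) loss is used.
print(clipped_value_loss(0.5, old_value, ret, cliprange_value))
```

In the second case the selected term depends on the clipped prediction, which is constant with respect to the new prediction, so that element contributes no gradient. This is how the clip limits how far the critic moves in one update.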

#### Clipped Surrogate Objective

The Clipped Surrogate Objective is the main discussion point in the PPO paper.

Here is the main equation from the paper:

$$ r_t(\theta) = \frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_{old}} (a_t|s_t)} $$

\begin{align}
\mathcal{L}^{CLIP}(\theta) =
 \mathbb{E}_{a_t, s_t \sim \pi_{\theta_{old}}} \biggl[
   \min \Bigl(r_t(\theta) \hat{A}_t,
             \text{clip} \bigl(
              r_t(\theta), 1 - \epsilon, 1 + \epsilon
             \bigr) \hat{A}_t
   \Bigr)
 \biggr]
\end{align}

In layman's terms, this says:

$r_t(\theta)$ is the ratio, not the reward, between the new policy and the old policy. A ratio of 1 means they assign the same probability to this particular action in this state. With $\epsilon = 0.2$, as in this example, a ratio < 0.8 or > 1.2 means that the new policy has moved outside the allowed range around the old one.

So, in order to disincentivize radical, noisy, unconstructive updates to the new policy (this phrasing is from https://huggingface.co/blog/deep-rl-ppo), we update our policy only if:


1. Our ratio is in the range `[1 - epsilon, 1 + epsilon]` 

OR

2. Our ratio is outside the range, but the advantage leads to getting closer to the range. There are 2 cases:
    
    a. The ratio is below the range but the advantage is > 0
    
    b. The ratio is above the range but the advantage is < 0.


<img src="https://huggingface.co/blog/assets/93_deep_rl_ppo/recap.jpg" height=400 width=600>

[Table from "Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
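The cases listed above can be enumerated with a small sketch on plain floats. The function name and the sample ratios are made up for illustration; `eps` plays the role of $\epsilon$. The gradient reaches the policy only when the unclipped term is the one selected by the `min`.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped objective and whether the unclipped term is selected."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    objective = min(unclipped, clipped)
    # The clipped term is constant in theta once the ratio leaves the range,
    # so the policy only receives a gradient when the unclipped term is chosen.
    has_gradient = unclipped <= clipped
    return objective, has_gradient


cases = [
    (1.0, +1.0),   # inside the range
    (1.0, -1.0),   # inside the range
    (0.5, +1.0),   # below the range, advantage > 0
    (0.5, -1.0),   # below the range, advantage < 0
    (1.5, +1.0),   # above the range, advantage > 0
    (1.5, -1.0),   # above the range, advantage < 0
]
for ratio, adv in cases:
    objective, has_gradient = clipped_surrogate(ratio, adv)
    print(f"ratio={ratio:.1f} adv={adv:+.1f} objective={objective:+.2f} update={has_gradient}")
```

Only the two cases where the ratio is outside the range and the advantage would push it even further out are cut off from the update.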

Here it is in baselines TensorFlow:

```python
        # Calculate ratio (pi current policy / pi old policy)
        ratio = tf.exp(OLDNEGLOGPAC - neglogpac)

        # Defining Loss = - J is equivalent to max J
        pg_losses = -ADV * ratio

        pg_losses2 = -ADV * tf.clip_by_value(ratio, 1.0 - CLIPRANGE, 1.0 + CLIPRANGE)

        # Final PG loss
        pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
        approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
        clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))

        # Total loss
        loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
```

And here in PyTorch:


In [39]:
ratio = torch.exp(new_logprobs - old_logprobs)
pg_losses = -advantages * ratio
pg_losses2 = \
    -advantages * torch.clamp(ratio, 1.0 - ppo_trainer.config.cliprange, 1.0 + ppo_trainer.config.cliprange)

pg_loss = masked_mean(torch.max(pg_losses, pg_losses2), masks)
pg_clipfrac = masked_mean(torch.gt(pg_losses2, pg_losses).double(), masks)

print(pg_loss)
pg_clipfrac

tensor(-1.3624e-08, grad_fn=<DivBackward0>)


tensor(0., dtype=torch.float64)

#### Combined Objective

<img src="https://huggingface.co/blog/assets/93_deep_rl_ppo/ppo-objective.jpg" height=400 width=600> 

The TRL authors leave out the entropy term because there does not seem to be a way to include it that gives much benefit at this time (https://github.com/lvwerra/trl/issues/131). This is why, in our case, after the policy gradient loss and value function loss are returned from the function above, the final loss we backpropagate on is

`loss = pg_loss + self.config.vf_coef * vf_loss`

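The objective in the figure is maximized, which is why the squared-error value loss carries a negative sign there: maximizing the negative of this component is the same as minimizing the original component. The code instead minimizes a loss, so `pg_loss` is already the negated clipped objective and `vf_loss` is added with a positive sign.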
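As a quick arithmetic check, the combined loss can be reproduced from the `train_stats` printed earlier in this notebook. The helper name is made up, and `vf_coef=0.1` is an assumption inferred from those printed numbers (`loss/policy`, `loss/value`, `loss/total`) rather than read from the config.

```python
def total_loss(pg_loss, vf_loss, vf_coef):
    """Combined PPO loss without an entropy term."""
    return pg_loss + vf_coef * vf_loss


pg_loss = -5.1090e-08   # 'loss/policy' from the train_stats output
vf_loss = 10.9744       # 'loss/value' from the train_stats output
vf_coef = 0.1           # assumed; chosen because it reproduces 'loss/total'

print(total_loss(pg_loss, vf_loss, vf_coef))  # compare with 'loss/total' = 1.0974
```

On this first step the policy term is essentially zero because the new and old policies are still identical, so the total is dominated by the value loss.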