In [1]:
from IPython.display import Image
import matplotlib.pyplot as plt

- references
    - https://blog.csdn.net/c9Yv2cf9I06K2A9E/article/details/129053162

## loss

- pg_loss + value_loss
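    - `pg_loss` is the actor's loss in PPO. It uses the discounted reward and the importance ratio to work out how much reward the current step should be credited with.
    - `value_loss` is the critic's loss in PPO. Its purpose is to estimate the value after each token has been generated.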

In [5]:
Image(url='../imgs/pg_loss.png', width=500)

In [6]:
Image(url='../imgs/value_loss.png', width=500)
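
As a rough sketch of what the two losses look like, the snippet below implements the standard PPO clipped policy loss and clipped value loss for a single response, in plain Python so it runs without dependencies. The function name `ppo_losses`, the clip ranges, and the `vf_coef` weight are assumptions for illustration and are not taken from the figures; TRL's own implementation works on batched tensors with masks.

```python
import math


def ppo_losses(logprobs, old_logprobs, advantages, vpreds, old_values, returns,
               cliprange=0.2, cliprange_value=0.2, vf_coef=1.0):
    """Per-token PPO losses for one response (all arguments are equal-length lists)."""
    pg_terms, vf_terms = [], []
    for lp, old_lp, adv, v, old_v, ret in zip(
            logprobs, old_logprobs, advantages, vpreds, old_values, returns):
        # Importance ratio between the current policy and the rollout policy.
        ratio = math.exp(lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - cliprange), 1.0 + cliprange)
        # Pessimistic (larger) of the unclipped and clipped policy losses.
        pg_terms.append(max(-adv * ratio, -adv * clipped_ratio))
        # Keep the new value prediction close to the rollout-time value.
        v_clipped = min(max(v, old_v - cliprange_value), old_v + cliprange_value)
        vf_terms.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    pg_loss = sum(pg_terms) / len(pg_terms)
    value_loss = 0.5 * sum(vf_terms) / len(vf_terms)
    # vf_coef = 1.0 mirrors the plain "pg_loss + value_loss" sum (assumed weighting).
    return pg_loss, value_loss, pg_loss + vf_coef * value_loss
```

When `logprobs` equals `old_logprobs` the ratio is 1, so `pg_loss` reduces to the negative mean advantage.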

## batched_forward_pass(queries, responses, scores)

### input

- `queries`: list
    - len(queries) == 1024 (batch size)
    - `[q.shape for q in queries]`
- `responses`: list
    - len(responses) == len(queries)
    - `[r.shape for r in responses]`
- `scores`: tensor
    - `scores.shape == torch.Size([1024])`
- `model_inputs = self.prepare_model_inputs(queries, responses)`: 
    - `model_inputs.keys() = dict_keys(['input_ids', 'attention_mask'])`
    - `model_inputs['attention_mask'].sum(dim=-1)`
        - the number of non-padding tokens per sample, i.e. `len(q) + len(r)`

```
model_inputs = self.prepare_model_inputs(queries, responses)
all_logprobs, _, values, masks = self.batched_forward_pass(self.model, queries, responses, model_inputs, ...)
ref_logprobs, _, _, _ = self.batched_forward_pass(self.ref_model, queries, responses, model_inputs, ...)
```

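- `all_logprobs.shape == torch.Size([1024, 21])`
    - log-probabilities, already gathered (one value per token, not a distribution over the vocabulary)
- `ref_logprobs.shape == torch.Size([1024, 21])`
    - log-probabilities, already gathered in the same way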
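
The gather step can be sketched as follows: take the log-softmax of the logits at each position and pick out the entry for the token that actually came next. This is a dependency-free illustration with hypothetical helper names; in PyTorch the same thing is usually a `log_softmax` followed by `torch.gather`. The one-position shift (logits at position `t` score the token at `t + 1`) is why a sequence of `n` tokens yields `n - 1` log-probabilities.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gather_logprobs(logits, input_ids):
    """logits: [seq_len][vocab_size], input_ids: [seq_len].

    Returns the log-probability assigned to each realised next token,
    so the result has seq_len - 1 entries.
    """
    out = []
    for t in range(len(input_ids) - 1):
        logp = log_softmax(logits[t])      # distribution predicted at position t
        out.append(logp[input_ids[t + 1]])  # keep only the token that followed
    return out
```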

## compute_rewards(scores, all_logprobs, ref_logprobs, masks)

- kl penalty

    $$
    D_{KL}(P||Q)=\sum_{x}P(x)\log\frac{P(x)}{Q(x)}
    $$

    - `logprob - ref_logprob` (relative: the log-ratio of the active policy to the reference model)

    $$
    \log p-\log q=\log \frac{p}{q}
    $$
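
The link between the two formulas in this section can be made explicit. Assuming $p$ is the active policy and $q$ the reference model (roles inferred from the `logprob - ref_logprob` expression), the per-token log-ratio is a single-sample estimate of the KL divergence, because its expectation over tokens sampled from $p$ recovers the definition:

$$
\mathbb{E}_{x\sim p}\left[\log p(x)-\log q(x)\right]=\sum_{x}p(x)\log\frac{p(x)}{q(x)}=D_{KL}(p||q)
$$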

### kl_ctl

```
# self.kl_ctl = AdaptiveKLController(0.02, 6, 10000)
self.kl_ctl = AdaptiveKLController(self.config.init_kl_coef, self.config.target, self.config.horizon)
```

In [3]:
from trl.trainer import AdaptiveKLController

In [8]:
# https://arxiv.org/pdf/1909.08593.pdf, 2.2
kl_ctl = AdaptiveKLController(0.02, 6, 10000)
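
The controller's update rule, as described in section 2.2 of the paper linked in the cell above, is small enough to sketch directly. The class below is a hypothetical stand-in (named `SimpleAdaptiveKL` here), not TRL's code: the 0.2 clipping bound follows the paper, while the `update(current_kl, n_steps)` signature and the `n_steps / horizon` scaling are assumptions.

```python
class SimpleAdaptiveKL:
    """Illustrative adaptive KL coefficient (hypothetical re-implementation)."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef  # current beta
        self.target = target       # desired KL between policy and reference
        self.horizon = horizon     # controls how quickly beta adapts

    def update(self, current_kl, n_steps):
        # Proportional error, clipped so a single noisy batch cannot swing beta far.
        error = min(max(current_kl / self.target - 1.0, -0.2), 0.2)
        # Raise beta when the observed KL is above target, lower it when below.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


ctl = SimpleAdaptiveKL(0.02, 6, 10000)
ctl.update(current_kl=12.0, n_steps=1000)  # KL above target, so beta grows
```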

### $R(x,y)$

$$
R(x,y)=r(x,y)-\beta\log\frac{\pi(y|x)}{\rho(y|x)}
$$

```
reward = score - self.kl_ctl.value * kl
```
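
A minimal per-response sketch of this reward, assuming the KL penalty is applied at every token and the scalar `score` is added only at the last generated token (that placement is an assumption here, as is the function name `token_rewards`):

```python
def token_rewards(score, logprobs, ref_logprobs, kl_coef):
    """Per-token rewards for one response.

    logprobs / ref_logprobs: gathered log-probabilities of the generated tokens
    under the active policy and the reference model.
    """
    # Per-token KL estimate: log pi(y_t | ...) - log rho(y_t | ...).
    kls = [lp - ref_lp for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # Every token pays the KL penalty.
    rewards = [-kl_coef * kl for kl in kls]
    # The reward-model score is credited once, at the end of the response (assumed).
    rewards[-1] += score
    return rewards, kls
```

Summing the returned rewards gives `score - kl_coef * sum(kls)`, which matches $R(x,y)$ with $\beta$ set to `kl_coef`.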

## compute_advantages(**values**, rewards, masks)

- `values` from value head (active model)

- `lam` is GAE's $\lambda$ parameter.
    - Generalized Advantage Estimation (GAE)
- returns = advantages + values
    - one-step case ($\lambda = 0$): advantages = $r+\gamma V_{next} - V_{current}$; for $\lambda > 0$, GAE sums these TD residuals over future steps with weights $(\gamma\lambda)^k$
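
The GAE recursion can be sketched for a single response as below. This is an illustrative plain-Python version with a hypothetical name (`gae_advantages`) and assumed default values for `gamma` and `lam`; it treats the value after the last token as 0.

```python
def gae_advantages(values, rewards, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    values:  value-head predictions, one per generated token
    rewards: per-token rewards, same length as values
    """
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    # Walk backwards so each step can reuse the accumulated future residuals.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # Returns are the regression targets for the value head.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `lam=0` each advantage is just the one-step TD residual; with `lam=1` and `gamma=1` it becomes the sum of all future rewards minus the current value.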

## train_minibatch && loss