In [1]:
import trl

In [2]:
trl.__version__

'0.15.1'

## loss = 0

> Why can we still backpropagate and update the weights when the loss is 0?

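- A loss of 0 does not imply a gradient of 0
    - $f(w)=(w-1)^2-1$: at $w=0$, $f(w)=0$, yet the gradient is $f'(w)=2(w-1)=-2$
        - gradient * learning rate is what actually drives learning;
    - With $\eta=0.1$, one update step gives $w-\eta\cdot g=0-(0.1\times(-2))=0.2$
- The loss is no longer a good metric to monitor; the reward is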
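
The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `f` and `grad_f` are made up for illustration, and the derivative is written out by hand rather than computed by autograd.

```python
# Toy loss from the notes: f(w) = (w - 1)^2 - 1
def f(w: float) -> float:
    return (w - 1) ** 2 - 1

# Hand-derived gradient: f'(w) = 2 * (w - 1)
def grad_f(w: float) -> float:
    return 2 * (w - 1)

w = 0.0      # starting point where the loss happens to be 0
eta = 0.1    # learning rate used in the notes

loss = f(w)          # value of the loss at w
g = grad_f(w)        # gradient at w, non-zero even though the loss is 0
w_new = w - eta * g  # one plain gradient-descent step

print(f"loss={loss}, grad={g}, w_new={w_new}")
```

The loss value only says where the curve sits at `w`; the update depends on the slope there.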

### A loss of 0 does not mean a gradient of 0

In [1]:
import torch

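# Case 1: x - x (gradient is 0)
x = torch.tensor([3.0], requires_grad=True)
y1 = x - x  # mathematically 0; both operands stay in the graph, so their gradients (+1 and -1) cancel
y1.backward()  # backpropagate to compute the gradient
print("Gradient for x - x:", x.grad.item())  # prints 0.0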

Gradient for x - x: 0.0


In [2]:
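# Clear the gradient before the next example
x.grad.zero_()

# Case 2: x - x.detach() (gradient is 1)
y2 = x - x.detach()  # detach the second x so it is treated as a constant
y2.backward()  # backpropagate to compute the gradient
print("Gradient for x - x.detach():", x.grad.item())  # prints 1.0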

Gradient for x - x.detach(): 1.0


### loss = $\beta kl$

https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851

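- trl grpo
    - $\beta = 0.04$ (the default in `GRPOConfig`)
    - This value is actually fairly large; math tasks may use 0.001 (unconfirmed)
- Setting KL aside
    - One prompt yields multiple generations, which form a group
        - The loss for each generation = -advantage (likelihood ratio = 1, since $\pi_\theta=\pi_{\theta_{old}}$)
    - The mean loss of a group = -(mean advantage) = 0
- Where the KL term goes
    - Option 1: folded into the reward when the advantage is computed
    - Option 2: kept outside, as a separate term in the loss
    - The original GRPO formulation keeps it outside;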
        - the GRPO implementation does not include the KL-divergence as part of the reward function. Instead, it directly incorporates the KL-divergence into the loss function, arguing that this approach simplifies the computation and avoids unnecessary complexity.

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL} (\pi_\theta || \pi_{ref}) \right]
$$

- If you are using the GRPO trainer then the old policy is in effect updated every step, this means you just use a detached version of the current policy.
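    - In the formula, $\pi_{\theta_{old}}$ is a detached copy of $\pi_\theta$ (excluded from the computation graph and treated as a constant);
    - $r=\frac{\pi_\theta}{\pi_{\theta_{old}}}=1$ in value, though the gradient still flows through the numerator;
    - $\text{clip}(1, 1-\varepsilon, 1+\varepsilon)=1$
- $\hat A_{i,t}=\tilde r_i=\frac{r_i-\mu}{\sigma}$ (z-score): the token-level advantage is the z-score of the output-level reward $r_i$ within its group, where $\mu$ and $\sigma$ are the group's reward mean and standard deviation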
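
The "ratio is 1 but still carries a gradient" point can be illustrated with a small PyTorch sketch. It is a toy under stated assumptions, not the trainer's code: `logp` stands in for the log-probabilities of four completions in one group, and `advantages` is a hand-picked zero-mean vector; both are hypothetical values.

```python
import torch

# Hypothetical per-completion log-probabilities under the current policy
logp = torch.tensor([-1.2, -0.7, -2.0, -0.9], requires_grad=True)

# Hypothetical group advantages; they sum to 0, as z-scored rewards would
advantages = torch.tensor([1.0, -1.0, 0.5, -0.5])

# pi_theta / pi_theta_old with the old policy taken as a detached copy:
# exp(logp - logp.detach()) is exactly 1 in value but depends on logp in the graph
ratio = torch.exp(logp - logp.detach())

# Policy part of the loss (KL omitted): negative mean of ratio * advantage
loss = -(ratio * advantages).mean()
loss.backward()

print("ratio:", ratio.tolist())
print("loss:", loss.item())
print("grad:", logp.grad.tolist())
```

The loss evaluates to zero, while `logp.grad` equals `-advantages / 4`, so completions with a positive advantage are still pushed toward higher probability.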
 

$$
\begin{split}
\mathcal{J}_{GRPO}(\theta)&= \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta D_{KL} (\pi_\theta || \pi_{ref}) \\
&=\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\hat A_{i,t} -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&=\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\hat A_i -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&=\frac1G\sum_{i=1}^G\frac1{|o_i|} {|o_i|}\cdot \hat A_i -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&=\frac1G\sum_{i=1}^G\hat A_i-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&=\frac1G\sum_{i=1}^G\frac{r_i-\mu}{\sigma}-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&=\frac{\sum_{i=1}^G r_i-G\mu}{G\sigma}-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&= 0 -\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})\\
&=-\frac1G\sum_{i=1}^G\frac1{|o_i|}\sum_{t=1}^{|o_i|}\beta D_{KL}(\pi_\theta || \pi_{ref})
\end{split}
$$
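
A quick numeric check of the derivation, in plain Python. Everything here is a made-up illustration: the rewards, completion lengths and per-sequence KL values are hypothetical, and the population standard deviation is an assumption (a trainer may use a different estimator or add a small epsilon). The zero-mean result does not depend on that choice.

```python
from statistics import mean, pstdev

rewards = [1.0, 0.0, 0.5, 0.0]       # hypothetical rewards r_i for one group
lengths = [5, 12, 7, 3]              # hypothetical completion lengths |o_i|
kl_per_seq = [0.01, 0.02, 0.0, 0.03] # hypothetical mean per-token KL per completion
beta = 0.04

# Group-wise z-score: A_i = (r_i - mu) / sigma
mu, sigma = mean(rewards), pstdev(rewards)
advantages = [(r - mu) / sigma for r in rewards]

# With ratio = 1, every token of completion i contributes A_i,
# so (1/|o_i|) * sum_t A_i collapses back to A_i regardless of length
per_sequence = [sum([a] * n) / n for a, n in zip(advantages, lengths)]
policy_term = sum(per_sequence) / len(per_sequence)

# What remains of the objective is only the KL penalty
objective = policy_term - beta * mean(kl_per_seq)
loss = -objective

print(f"policy term = {policy_term:.2e}")
print(f"loss        = {loss:.6f}")
```

The policy term vanishes up to floating-point error, so the reported loss is just $\beta$ times the average KL.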

### per_device_train_batch_size & num_generations

https://github.com/huggingface/trl/pull/2776

- (`num_processes * per_device_batch_size`) must be divisible by `G`.
    - `per_device_batch_size` is the number of generations handled per GPU device
    - `num_processes` is the number of GPU processes;
    - `num_processes * per_device_batch_size / G`: the number of distinct prompts processed per step (prompt throughput)
- https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L571-L598
    - ensures each prompt is repeated across multiple processes. This guarantees that identical prompts are distributed to different GPUs, allowing rewards to be computed and normalized correctly within each prompt group. Using the same seed across processes ensures consistent prompt assignment, preventing discrepancies in group formation.
    - repeats the batch multiple times to allow reusing generations across multiple updates. Refer to _prepare_inputs to see how the generations are stored and reused.
    - In the following figure, the values are the prompt indices. The first row shows the first sampled batch, the second row shows the second sampled batch, and so on.
    - 3 GPUs, num_generations = 3, per_device_train_batch_size = 4
        - 3 * 4 / 3 = 4 prompts per step

    |      | GPU0   | GPU1       | GPU2     |
    |------|--------|------------|----------|
    | P0   | P00    | P01        | P02      |
    | P1   | P10    | P11        | P12      |
    | P2   | P20    | P21        | P22      |
    | P3   | P30    | P31        | P32      |

    - It further accounts for `grad_accum` = 3: gradients from several batches are accumulated before a single update
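
The divisibility rule and the prompt throughput can be sketched as a small helper. The function name `prompts_per_step` is hypothetical and not part of trl; it only mirrors the arithmetic in the notes above.

```python
def prompts_per_step(num_processes: int,
                     per_device_batch_size: int,
                     num_generations: int) -> int:
    """Number of distinct prompts covered by one global batch."""
    global_batch = num_processes * per_device_batch_size
    # Every prompt needs exactly `num_generations` slots in the global batch
    if global_batch % num_generations != 0:
        raise ValueError(
            f"global batch size {global_batch} is not divisible by "
            f"num_generations={num_generations}"
        )
    return global_batch // num_generations


# The configuration from the notes: 3 GPUs, batch size 4 per device, G = 3
print(prompts_per_step(3, 4, 3))  # 12 generation slots shared by 4 prompts

# A configuration that violates the rule
try:
    prompts_per_step(3, 4, 5)
except ValueError as err:
    print("rejected:", err)
```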