- latest `trl` version
    - https://huggingface.co/docs/trl/main/en/grpo_trainer

In [12]:
from IPython.display import Image

In [2]:
from datasets import load_dataset
dataset = load_dataset("trl-lib/tldr", split="train")

In [5]:
dataset

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 116722
})

## GRPOConfig & GRPOTrainer

In [9]:
import os
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

In [11]:
from rich.pretty import pprint
pprint(GRPOConfig(output_dir='test'))

- GRPOConfig
    - `num_generations=8`
    - `per_device_train_batch_size=1`, `gradient_accumulation_steps=8`
        - `per_device_train_batch_size * gradient_accumulation_steps * world_size` ==> train batch size
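As a quick sanity check of the arithmetic above, the sketch below computes the train batch size from the three factors. `world_size = 4` is a made-up value for illustration, and reading `train_batch // num_generations` as the number of distinct prompts per optimizer step is an assumption that may not hold for every `trl` version.

```python
# Values from the notes above; world_size is a hypothetical GPU count.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
world_size = 4
num_generations = 8  # G: completions sampled per prompt


def train_batch_size(per_device, grad_accum, world):
    """Number of samples that contribute to one optimizer step."""
    return per_device * grad_accum * world


train_batch = train_batch_size(
    per_device_train_batch_size, gradient_accumulation_steps, world_size
)

# Assumption: each prompt occupies G slots of the train batch.
prompts_per_step = train_batch // num_generations
print(train_batch, prompts_per_step)
```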

In [6]:
from trl import GRPOConfig, GRPOTrainer


## GRPO

In [14]:
# bs = 2
# G = 4
Image(url='https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_visual.png', width=400)

- GRPO is an **online learning** (on-policy) algorithm: it improves iteratively by training on data generated by the model being trained.
- It has four main steps:
    - Generating completions
        - At each training step, we sample a batch of prompts and generate a set of $G$ completions (`num_generations`) for each prompt, denoted $o_i$.
    - Computing the advantage
        - $\hat A_{i,t}=\frac{r_i-\mu(\mathbf r)}{\sigma(\mathbf r)}$, where $\mathbf r=\{r_1,\dots,r_G\}$ are the rewards of the $G$ completions for the same prompt
        - Outcome supervision provides the normalized reward at the end of each output $o_i$ and sets the advantage $\hat A_{i,t}$ of every token in that output to this normalized reward.
    - **Estimating** the KL divergence (token-level; see the figure)
        - https://huggingface.co/docs/trl/main/en/grpo_trainer#estimating-the-kl-divergence
        - `per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1`
    - Computing the loss
        - $\pi_{ref}, (\pi_{old}, \pi_\theta)$
        - https://github.com/huggingface/trl/issues/2608
```
# x - x.detach() allows for preserving gradients from x
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(per_token_loss - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
```
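To make the advantage and KL steps concrete, here is a minimal pure-Python sketch with no `torch` dependency. The helper names `group_advantages` and `k3_kl`, the epsilon value, the sample rewards, and the use of the sample standard deviation are illustrative assumptions, not the `trl` implementation.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of the G completions of ONE prompt.

    Returns one advantage per completion; every token of completion i
    would share advantages[i]. eps (assumed value) guards against a zero std.
    """
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]


def k3_kl(ref_logp, logp):
    """Per-token KL estimate: exp(d) - d - 1 with d = ref_logp - logp."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


# G = 4 completions for a single prompt, with hypothetical rewards.
advantages = group_advantages([1.0, 2.0, 3.0, 4.0])
print([round(a, 3) for a in advantages])

# The estimate is zero when both policies assign the token the same
# log-probability, and positive otherwise.
print(k3_kl(-1.2, -1.2), k3_kl(-1.0, -1.5))
```

Because the advantages are centered within each group, completions that score above their group's mean are reinforced and those below it are discouraged, without needing a separate value model.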

### reward function

```
def reward_func(completions, **kwargs):
    """Reward function that gives higher scores to longer completions."""
    return [float(len(completion)) for completion in completions]
```
- completions
    - `[bs, G]`
- 20
    - https://github.com/huggingface/trl/issues/2771
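A reward function is just a callable that maps a list of completions to one float each. The variant below is a sketch in the same style that rewards completions whose length is close to a target of 20 characters. The name `reward_target_len`, the `target` parameter, and the sample strings are illustrative assumptions, and it assumes plain-string completions rather than chat-style message lists.

```python
def reward_target_len(completions, target=20, **kwargs):
    """Reward is 0 at exactly `target` characters and drops as length deviates."""
    return [float(-abs(target - len(completion))) for completion in completions]


# Hypothetical completions: too short, exactly on target, too long.
samples = ["short", "x" * 20, "y" * 35]
rewards = reward_target_len(samples)
print(rewards)  # [-15.0, 0.0, -15.0]
```

With `trl`, a function like this would be passed through the `reward_funcs` argument of `GRPOTrainer`.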

### PPO vs. GRPO

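In RLHF, we need to fold human feedback, or the reward model's score for the whole sequence (for example, the final quality score of a conversation), into reinforcement learning training. In addition, to keep the model from drifting too far from the reference policy (reference model) during training, we usually add a KL-based penalty term.

So in PPO, the key change is how to construct the reward $r_t$ for each token (each time step). This is usually done in the following two steps: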

### training monitor

- Rely mostly on the reward, and keep an eye on the generations (risk of reward hacking).
    - https://github.com/huggingface/trl/issues/2703