- deepseekmath (grpo)
    - https://arxiv.org/abs/2402.03300
- latest `trl` version
    - https://huggingface.co/docs/trl/main/en/grpo_trainer

In [1]:
from IPython.display import Image

## RL roadmap

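- Basic concepts: on-policy vs. off-policy
    - Understand the policies that appear in the formulas ($\pi_\theta,\pi_{\theta_{old}}, \pi_{ref}$) and how each term is computed;
    - Know which probability distribution the training data is sampled from.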
- GRPO <= PPO(CLIP) <= TRPO <= **PG** (policy gradient)
    - PPO: GAE (Generalized **Advantage** Estimation), TD-error

$$
\nabla_\theta J(\pi_\theta)=\mathbb E_{\tau \sim (\pi_\theta, T)}\left[\sum_{t=0}^T\nabla_\theta \log\pi_\theta(a_t|s_t)R(\tau)\right]
$$
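
To make the formula concrete, here is a minimal sketch for the simplest possible case: a one-step episode ($T=0$) with a softmax policy over two actions, where the expectation can be computed exactly instead of sampled. The function names and the toy rewards are illustrative assumptions, not part of the notes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient(logits, rewards):
    """Exact E_a[ grad_logits log pi(a) * R(a) ] for a one-step softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[a == k] - pi(k)
            dlogp = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * dlogp * r_a
    return grad

# Toy setup (assumed): action 0 earns reward 1, action 1 earns reward 0.
grad = policy_gradient([0.0, 0.0], [1.0, 0.0])
print(grad)  # the logit of the rewarded action gets a positive gradient
```

Gradient ascent along `grad` raises the probability of the rewarded action, which is the behaviour the formula encodes.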

## grpo demo


> dataset: 7473;

https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb#file-grpo_demo-py

- Single 4090:
    ```
    export CUDA_VISIBLE_DEVICES=0
    python grpo_demo.py
    ```
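    - 4/4 * 4 => 4 prompts per optimizer step,
        - 1868 steps (7473 / 4);
    - about 13 hours;
- Dual 4090, DDP (accelerate map)
    - (4/4 * 4 * 2) => 8 prompts per optimizer step;
        - 934 steps (7473 / 8);
    - under 10 hours;
- Dual 4090,
    - deepspeed stage-2/3;
    - fsdp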
### GRPOConfig & GRPOTrainer

- GRPOConfig
    - num_generations=8,
    - old
        - per_device_train_batch_size=1, * gradient_accumulation_steps=8,
            - per_device_train_batch_size * gradient_accumulation_steps * world_size ==> train_batch
    - new: https://github.com/huggingface/trl/pull/2776#issue-2833772774
        - per_device_train_batch_size:
            - it now represents the number of generations per device.
        - per_device_train_batch_size/num_generations * gradient_accumulation_steps
            - per_device_train_batch_size/num_generations: prompts per device
            - Consequently, per_device_train_batch_size must be divisible by num_generations.
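
The new batch semantics can be sketched as a small helper. It follows the rule stated in these notes (generations per device divided by `num_generations` gives prompts per device); the function name and the `world_size` argument are assumptions for illustration and are not part of the `trl` API.

```python
def prompts_per_optimizer_step(
    per_device_train_batch_size: int,
    num_generations: int,
    gradient_accumulation_steps: int,
    world_size: int = 1,
) -> int:
    """Number of distinct prompts that contribute to one optimizer step."""
    if per_device_train_batch_size % num_generations != 0:
        raise ValueError(
            "per_device_train_batch_size must be divisible by num_generations"
        )
    # Each device holds this many prompts, each expanded into num_generations completions.
    prompts_per_device = per_device_train_batch_size // num_generations
    return prompts_per_device * gradient_accumulation_steps * world_size

# 8 generations per device, 8 generations per prompt => 1 prompt per device.
one_prompt = prompts_per_optimizer_step(8, 8, gradient_accumulation_steps=1)

# A batch of 4 cannot hold a full group of 8 generations.
try:
    prompts_per_optimizer_step(4, 8, gradient_accumulation_steps=1)
    rejected = False
except ValueError:
    rejected = True
```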

## GRPO

In [7]:
# bs = 2
# G = 4
Image(url='https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_visual.png', width=400)

- GRPO is an **online learning** algorithm, meaning it improves iteratively by using the data generated by the trained model itself during training.
- four main steps:
    - Generating completions,
        - At each training step, we sample a batch of prompts and generate a set of $G$ completions (`num_generations`) for each prompt (each completion denoted as $o_i$).
    - computing the advantage,
        - $\hat A_{i,t}=\frac{r_i-\mu(\mathbf r)}{\sigma(\mathbf r)}$
        - Outcome supervision provides the normalized reward at the end of each output $o_i$ and sets the advantages $\hat A_{i,t}$ of all tokens in the output as the normalized reward
    - **estimating** the KL divergence, (token-level see the figure)
        - https://huggingface.co/docs/trl/main/en/grpo_trainer#estimating-the-kl-divergence
        - `per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1`
    - and computing the loss.
        - $\pi_{ref}, (\pi_{\theta_{old}}, \pi_\theta)$
        - https://github.com/huggingface/trl/issues/2608
```
# x - x.detach() allows for preserving gradients from x
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(per_token_loss - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
```
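
The advantage and KL steps above can be checked on scalars without any tensor library. The sketch below is an illustration under assumptions: it uses the sample standard deviation and adds a small `eps` to the denominator for stability, and both function names are hypothetical.

```python
import math
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of G completions for a single prompt.

    Every token of completion i shares the same advantage value.
    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.stdev(rewards)  # sample standard deviation, needs G >= 2
    return [(r - mu) / (sigma + eps) for r in rewards]

def kl_estimate(ref_logp, logp):
    """Per-token estimator exp(d) - d - 1 with d = ref_logp - logp (never negative)."""
    delta = ref_logp - logp
    return math.exp(delta) - delta - 1.0

advantages = group_advantages([1.0, 0.0, 0.0, 1.0])  # G = 4 completions
kl_same = kl_estimate(-1.5, -1.5)   # identical policies
kl_diff = kl_estimate(-1.0, -2.0)   # policies disagree on this token
```

Completions that beat the group mean get a positive advantage, the rest a negative one, and the KL estimate is zero only when the two log-probabilities coincide.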

### pipeline

- grpo_trainer.py
    - `_prepare_inputs()`
    ```
    return {
        "prompt_ids": prompt_ids,
        "prompt_mask": prompt_mask,
        "completion_ids": completion_ids,
        "completion_mask": completion_mask,
        "ref_per_token_logps": ref_per_token_logps,
        "advantages": advantages,
    }
    ```
    - `compute_loss()`

> qwen2.5 vocab size: 151936

- prompts => tokenizer.apply_chat_template => model.generate
    - prompt_completion_ids = prompt_ids + completion_ids (concatenated along the sequence dimension)
    - rewards_per_func: (n_prompts, len(reward_funcs))
    - ref_per_token_logps($\pi_{ref}(q, o)$), per_token_logps ($\pi_\theta(q,o)$)
        - completion token level
            - selective_log_softmax(logits, index)
                - logits.shape: (n_prompts, n_completion, n_vocab)
                - index.shape: (n_prompts, n_completion)
                - => (n_prompts, n_completion)
$$
\exp(\log{\pi'}-\log{\pi})=\frac{\pi'}{\pi}
$$
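
The shape bookkeeping of `selective_log_softmax(logits, index)` can be illustrated with nested lists. This is a pure-Python stand-in written for these notes, not the `trl` implementation (which operates on tensors); the name `selective_log_softmax_py` is hypothetical.

```python
import math

def selective_log_softmax_py(logits, index):
    """Log-probability of each selected token.

    logits: [n_prompts][n_completion][n_vocab] floats
    index:  [n_prompts][n_completion] token ids
    returns [n_prompts][n_completion] floats
    """
    out = []
    for seq_logits, seq_ids in zip(logits, index):
        row = []
        for token_logits, token_id in zip(seq_logits, seq_ids):
            # log-sum-exp with max subtraction for numerical stability
            m = max(token_logits)
            log_z = m + math.log(sum(math.exp(x - m) for x in token_logits))
            row.append(token_logits[token_id] - log_z)
        out.append(row)
    return out

# One prompt, two completion tokens, vocabulary of size 2 (toy values).
logits = [[[0.0, 0.0], [math.log(3.0), 0.0]]]
index = [[0, 0]]
logps = selective_log_softmax_py(logits, index)
ratio = math.exp(logps[0][1] - logps[0][0])  # exp(log pi' - log pi) = pi' / pi
```

Only the log-probability of the token that was actually generated is kept at each position, so the vocabulary dimension disappears from the output.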

- The current implementation only uses $\pi_{\theta}$ and $\pi_{ref}$; there is no $\pi_{\theta_{old}}$
    - https://github.com/huggingface/trl/issues/2608
        - The policy model only has **a single update** following each exploration stage. (deepseekmath)
        - $\pi_\theta$ is updated only once per rollout (one group of generations), not multiple times;
            - this corresponds to `for GRPO iteration = 1, . . . , 𝜇` with 𝜇 == 1
    - $\pi_{\theta_{old}}=\pi_\theta$
    - `torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)`
        - `per_token_logps.detach()` is treated as a constant, so no gradient flows through it;
    - no clipping is used, only $\frac{\pi_\theta}{\pi_{\theta_{old}}}A=1\cdot A$ (ratio * advantage)
        - ratio = 1 always lies within $(1-\epsilon, 1+\epsilon)$, so clipping would never trigger.
### training monitor

- You should rely mostly on the reward. And keep an eye on the generations (risk of reward hacking)
    - https://github.com/huggingface/trl/issues/2703