In [1]:
from IPython.display import Image

## Prerequisites

- misc
    - https://github.com/huggingface/alignment-handbook/tree/main
- 3 steps
    - Pre-training: train a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. The result is called a **"base model"**.
    - Supervised fine-tuning (SFT): turn the base model into a useful assistant (chatbot).
        - The base model is trained to **generate useful completions given human instructions.**
    - Human preference fine-tuning: increase the assistant's friendliness, helpfulness, and safety.
        - Target qualities: "safe", "friendly", "harmless", "inclusive".

### align & why align

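- DPO: "Direct Preference Optimization: Your Language Model is **Secretly a Reward Model**"
    - https://arxiv.org/abs/2305.18290
- Collect human/AI feedback to learn the preference probability $p(y_w\gt y_l)$, where $y_w$ is the preferred (winning) response and $y_l$ the rejected (losing) one
- RLHF: the OG (Original Gangster, i.e. the originator) of LLM alignment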

    $$
    \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y \mid x)} \underbrace{\left[ r_\phi(x, y) \right]}_{\text{maximise rewards}} - \underbrace{\beta \mathbb{D}_{\text{KL}} \left[ \pi_\theta(y \mid x) \parallel \pi_{\text{ref}}(y \mid x) \right]}_{\text{KL penalty to prevent reward hacking (controlled by } \beta \text{)}}
    $$
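    - RL (PPO) has many hyperparameters, and training is unstable;
    - It also needs a reward model (RM), so three models have to be handled in total: the actor model, the ref model, and the reward model.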

### reward hacking

- *The Alignment Problem* (Chinese edition: 《人机对齐》)
    - University of Toronto economist Joshua Gans wanted to enlist the help of his older daughter in potty training her younger brother. So he did what any good economist would do. He offered her an incentive: anytime she helped her brother go to the bathroom, she would get a piece of candy. The daughter immediately found a loophole that her father, the economics professor, had overlooked. “I realized that the more that goes in, the more comes out,” she says. “So I was just feeding my brother buckets and buckets of water.” Gans affirms: “It didn’t really work out too well.”
    - Reward hacking is the behavior of maximizing the reward in **unintended ways** under a given reward scheme.
- The KL constraint in the RLHF objective is added to avoid "reward hacking": without it, the language model (the policy) may simply choose token sequences that achieve a high reward but are total gibberish.

### Bradley-Terry model

- convert the preferences into a score (reward)

    $$
    P(y_w > y_l) = \frac{e^{r^*(x, y_w)}}{e^{r^*(x, y_w)} + e^{r^*(x, y_l)}}
    $$

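- We want to maximize this probability, i.e. make the probability of $y_w\gt y_l$ as high as possible. In other words, the RM parameters $\phi$ are fitted by maximum likelihood estimation (MLE):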


    $$
    P(y_w > y_l) = \frac{e^{r_\phi(x, y_w)}}{e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}}=\sigma(r_\phi(x, y_w)-r_\phi(x, y_l))
    $$

  - This holds because: $\frac{e^A}{e^A+e^B}=\frac{1}{1+e^{B-A}}=\frac{1}{1+e^{-(A-B)}}=\sigma(A-B)$

- reward model loss

    
    $$
    L=-\mathbb E_{(x,y_w,y_l)\sim D}\left[\log\sigma\left(r_\phi(x,y_w)-r_\phi(x,y_l)\right)\right]
    $$

- policy objective
  
    $$
    \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y \mid x)} \underbrace{\left[ r_\phi(x, y) \right]}_{\text{maximise rewards}} - \underbrace{\beta \mathbb{D}_{\text{KL}} \left[ \pi_\theta(y \mid x) \parallel \pi_{\text{ref}}(y \mid x) \right]}_{\text{KL penalty to prevent reward hacking (controlled by } \beta \text{)}}
    $$
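
The Bradley-Terry probability and the reward model loss above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the reward values are made up for illustration and do not come from any library.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w > y_l) under Bradley-Terry, written as a two-way softmax."""
    m = max(r_w, r_l)  # subtract the max for numerical stability
    e_w, e_l = math.exp(r_w - m), math.exp(r_l - m)
    return e_w / (e_w + e_l)

def reward_model_loss(pairs) -> float:
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) reward pairs."""
    return -sum(math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# The softmax form and the sigmoid form give the same probability.
p_softmax = bt_preference_prob(2.0, 0.5)
p_sigmoid = sigmoid(2.0 - 0.5)

# Rewards that rank the preferred response higher give a lower loss.
loss_good = reward_model_loss([(2.0, 0.5), (1.0, -1.0)])
loss_bad = reward_model_loss([(0.5, 2.0), (-1.0, 1.0)])
print(p_softmax, p_sigmoid, loss_good, loss_bad)
```

Only the reward difference enters the loss, so adding the same constant to both rewards leaves it unchanged.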

### RLHF objective => DPO

$$
J_{RLHF} = \max_{\pi_\theta} \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \mathbb{D}_{KL}\left[ \pi_\theta(y|x) \parallel \pi_{ref}(y|x) \right] \right]
$$

- This objective cannot be optimized directly with gradient descent, because $y\sim \pi_\theta(y|x)$ is a sampling step (which also involves decoding strategies such as greedy search, beam search, ...)

DPO paper, Eq. 3 -> Eq. 4: the objective has a closed-form solution $\pi^*$, where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp \left( \frac{1}{\beta} r(x, y) \right)$ is the partition function:

$$
\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp \left( \frac{1}{\beta} r(x, y) \right)
$$

Next, we solve for $r(x,y)$ by taking the logarithm of both sides:

$$
\begin{split}
\log \pi^*(y|x)&= \log \left[ \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp \left( \frac{1}{\beta} r(x, y) \right) \right] \\
&= \log \pi_{\text{ref}}(y|x) - \log Z(x) + \log \exp \left( \frac{1}{\beta} r(x, y) \right) \\
&= \log \pi_{\text{ref}}(y|x) - \log Z(x) + \frac{1}{\beta} r(x, y)
\end{split}
$$

Therefore:

$$
r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)
$$
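
As a sanity check on the closed-form solution and the reward identity above, here is a small numerical sketch on a toy problem with three candidate responses. The reference probabilities, rewards, and $\beta$ are made-up values, and the variable names are illustrative only.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # toy reference policy over three responses
reward = [1.0, 0.0, -1.0]  # toy rewards r(x, y)

# Partition function Z(x) = sum_y pi_ref(y|x) * exp(r(x, y) / beta)
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(weights)
pi_star = [w / Z for w in weights]  # closed-form optimal policy

# Invert the solution: r = beta * log(pi*/pi_ref) + beta * log Z
recovered = [
    beta * math.log(ps / pr) + beta * math.log(Z)
    for ps, pr in zip(pi_star, pi_ref)
]

def objective(pi):
    """E_{y~pi}[r] - beta * KL(pi || pi_ref) for a discrete policy."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

print(pi_star, recovered, objective(pi_star), objective(pi_ref))
```

In this toy setting the objective at $\pi^*$ equals $\beta \log Z(x)$ and is larger than at the other policies tried, which is consistent with $\pi^*$ being the maximizer.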

Now revisit the Bradley-Terry model and substitute this expression for $r(x,y)$; the $\beta \log Z(x)$ terms cancel:

$$
\begin{split}
p(y_w\gt y_l)&=\sigma(r(x,y_w)-r(x,y_l))\\
&=\sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}- \beta \log Z(x)\right)\\
&=\sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)}-\beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)
\end{split}
$$

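Replacing $\pi^*$ with the trainable policy $\pi_\theta$ and maximizing the likelihood gives the final DPO loss: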

$$
L_{DPO}(\pi_{\theta}; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_{\theta}(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right]
$$
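
The DPO loss for a single preference pair can be sketched directly from sequence log-probabilities. This is a minimal illustration in plain Python, not the TRL implementation; the function names and the log-probability values are assumptions chosen for the example.

```python
import math

def log_sigmoid(z: float) -> float:
    """Stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities log pi(y|x).

    Returns the loss and the two implicit rewards
    beta * (log pi_theta - log pi_ref).
    """
    reward_w = beta * (policy_w - ref_w)  # chosen response
    reward_l = beta * (policy_l - ref_l)  # rejected response
    return -log_sigmoid(reward_w - reward_l), reward_w, reward_l

# At initialisation the policy equals the reference, so the loss is log 2.
loss_init, _, _ = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy raises log pi(y_w|x) and lowers log pi(y_l|x), the loss drops.
loss_better, rw, rl = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(loss_init, loss_better, rw, rl)
```

The loss depends only on how far the policy has moved relative to the reference, not on the absolute log-probabilities.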

## DPO (Direct Preference Optimization)

In [2]:
Image(url='./imgs/rlhf_dpo.png', width=400)

$$
\begin{split}
&\max_{\pi} \mathbb{E}_{(x, y_w, y_l) \sim D} \log \sigma \left( \beta \log \frac{\pi(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right)\\
={}&\max_{\pi} \mathbb{E}_{(x, y_w, y_l) \sim D} \log \sigma \left(\beta\left(\log\frac{\pi(y_w|x)}{\pi(y_l|x)}-\log\frac{\pi_{\text{ref}}(y_w|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)
\end{split}
$$

- trains to assign high probability to positive examples $\pi_\theta(y_w|x)$ and low probability to negative examples $\pi_\theta(y_l|x)$ 
- only two models (actor/active model, reference model (sft))
    - $\beta$ is a temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5.
- Derivative exercise (using $\sigma'(z)=\sigma(z)(1-\sigma(z))$)

    $$
    \left(\log\sigma(z)\right)'=\frac{1}{\sigma(z)}\cdot \sigma(z)(1-\sigma(z))=1-\sigma(z)=\sigma(-z)
    $$
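
The derivative identity above can be verified with a finite-difference check. This is a small sketch in plain Python; the helper names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_sigmoid(z: float) -> float:
    """Stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def numerical_derivative(f, z: float, h: float = 1e-6) -> float:
    """Central finite difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Compare d/dz log sigma(z) with the analytic result sigma(-z).
checks = [
    (z, numerical_derivative(log_sigmoid, z), sigmoid(-z))
    for z in (-3.0, -0.5, 0.0, 0.5, 3.0)
]
for z, numeric, analytic in checks:
    print(f"z={z:+.1f}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```

Since $\sigma(-z)$ is close to 1 for very negative $z$ and close to 0 for very positive $z$, pairs that the model currently ranks wrongly receive the larger gradient weight.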

$$
\begin{align*}
\nabla_{\theta} \mathcal{L}_{\text{DPO}} (\pi_{\theta}; \pi_{\text{ref}}) = & -\beta \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \underbrace{\sigma \left( \hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w) \right)}_{\text{higher weight when reward estimate is wrong} } \left[ \underbrace{\nabla_{\theta} \log \pi_{\theta}(y_w | x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_{\theta} \log \pi_{\theta}(y_l | x)}_{\text{decrease likelihood of } y_l} \right] \right]
\end{align*}
$$

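- $\hat r_\theta(x,y)=\beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ (implicit reward from the LM)
    - It measures how strongly the model $\pi_\theta$ prefers the output $y$ relative to the reference model $\pi_{\text{ref}}$.
    - Unlike an explicit reward (for example a human rating or a hand-specified reward function), the implicit reward is computed from the model's own probability distribution. In DPO it comes directly from the model's output probabilities, hence the name "implicit reward".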

In [2]:
Image(url='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487af2f0-e51d-4140-92a7-23476c5ea016_1600x1015.png', width=400)

## practices

In [1]:
# !pip install -U trl

- https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/
    - dpo_llama2.py

```
accelerate launch examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo"
```

- basemodel: `meta-llama/Llama-2-7b-hf`
    - the plain base model, not the chat/instruct version
- sft:
- dpo (alignment => rlhf):

- dataset
    - lvwerra/stack-exchange-paired
        - question, Create pairs (response_j, response_k) where j was rated better than k
        - train: "data/rl"
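            - 1652614 samples
            - When tokenizing, set `num_proc` to make full use of CPU core/thread-level parallelism and speed up data preprocessing
        - evaluation: "data/evaluation"
            - 242 samples (?) with `sanity_check` enabled, to first make sure the program runs without bugs;
    - process
        - `num_proc`: enables CPU multiprocessing, which significantly speeds up preprocessing of large datasets;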
          
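        ```
        # Excerpt from inside the dataset-loading function.
        # `return_prompt_and_responses` maps each batch to this format:
        # {
        #     'prompt': List[str],
        #     'chosen': List[str],
        #     'rejected': List[str],
        # }
        return dataset.map(
            return_prompt_and_responses,
            batched=True,                     # the function receives a batch of rows
            num_proc=num_proc,                # number of CPU worker processes
            remove_columns=original_columns,  # keep only the three keys above
        )
        ```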
        
- About the parameters
    - `total_train_batch_size` = `self._train_batch_size * args.gradient_accumulation_steps * args.world_size`
    - `max_steps`: Total optimization steps
- `DPOTrainer`

    ```
    dpo_trainer = DPOTrainer(
        model,
        ref_model=None,
    ```
    
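    - No separate `ref_model` needs to be passed. With a PEFT model, TRL can use the same base weights with the adapter disabled as the reference policy.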
    - `model = get_peft_model(model, peft_config)`
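
The preprocessing and batch-size notes above can be illustrated without downloading the dataset. The sketch below applies a mapping function to a hand-written batch. The prompt template (`"Question: ... Answer: "`) is an assumption about what the script does, and the sample text and batch-size numbers are made up.

```python
def return_prompt_and_responses(samples):
    """Map a batch of raw columns to the prompt/chosen/rejected format."""
    return {
        # Assumed prompt template; adjust to whatever the SFT stage used.
        "prompt": ["Question: " + q + "\n\nAnswer: " for q in samples["question"]],
        "chosen": samples["response_j"],    # response_j was rated better
        "rejected": samples["response_k"],  # response_k was rated worse
    }

# A hand-written batch with the same columns as the dataset described above.
batch = {
    "question": ["How do I reverse a list in Python?"],
    "response_j": ["Use `my_list[::-1]` or `my_list.reverse()`."],
    "response_k": ["You cannot."],
}
processed = return_prompt_and_responses(batch)
print(processed["prompt"][0])

# Effective batch size per optimization step (hypothetical values).
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
world_size = 2  # number of processes/GPUs
total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(total_train_batch_size)
```

With `batched=True`, `dataset.map` passes each column as a list, which is why the function operates on lists rather than on single rows.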

### loss


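`concatenated_forward`

- `concatenated_input_ids`: shape `[4, 259]` (chosen and rejected sequences concatenated along the batch dimension, padded to a common length of 259)
- `all_logits = model(concatenated_input_ids, ...).logits`
    - shape `[4, 259, 32000]` (32000 is the Llama-2 vocabulary size)
- `all_logps = get_batch_logps(all_logits, concatenated_labels)`
    - uses `torch.gather` to pick the log-probability of each label token, then sums over the sequence
    - shape `[4]`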
  
```
chosen_logps = all_logps[:len_chosen]
rejected_logps = all_logps[len_chosen:]

chosen_logits = all_logits[:len_chosen]
rejected_logits = all_logits[len_chosen:]
```
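
To make the gather-and-sum step concrete, here is a dependency-free sketch of what a `get_batch_logps`-style function computes. It uses nested Python lists instead of tensors, assumes `-100` as the ignored label value, and is an illustration rather than the TRL code.

```python
import math

LABEL_PAD = -100  # assumed "ignore" label for prompt and padding positions

def log_softmax(row):
    """Log-softmax over one vocabulary-sized list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def get_batch_logps(logits, labels):
    """Sum of log p(label_t | prefix) for each sequence.

    logits: [batch][seq][vocab], labels: [batch][seq].
    The logits at position t predict the token at t + 1, hence the shift.
    """
    out = []
    for seq_logits, seq_labels in zip(logits, labels):
        total = 0.0
        for step_logits, target in zip(seq_logits[:-1], seq_labels[1:]):
            if target == LABEL_PAD:
                continue  # masked position contributes nothing
            total += log_softmax(step_logits)[target]  # the "gather" step
        out.append(total)
    return out

# Toy batch: 2 chosen + 2 rejected sequences, length 3, vocabulary size 4.
uniform = [[0.0] * 4 for _ in range(3)]  # uniform logits: each token has p = 1/4
logits = [uniform, uniform, uniform, uniform]
labels = [
    [LABEL_PAD, 1, 2],
    [LABEL_PAD, LABEL_PAD, 3],
    [LABEL_PAD, 0, 0],
    [LABEL_PAD, LABEL_PAD, LABEL_PAD],
]
all_logps = get_batch_logps(logits, labels)

len_chosen = 2
chosen_logps = all_logps[:len_chosen]
rejected_logps = all_logps[len_chosen:]
print(chosen_logps, rejected_logps)
```

The result has one number per sequence, matching the `[4]` shape noted above, and the first half of the batch belongs to the chosen responses.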

dpo loss

```
losses, chosen_rewards, rejected_rewards = self.dpo_loss(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
)
```

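- $\pi_{\log \text{ratios}}=\pi_{\text{chosen}}-\pi_{\text{rejected}}$ (policy log-probabilities: `policy_chosen_logps - policy_rejected_logps`)
- $\rho_{\log \text{ratios}}=\rho_{\text{chosen}}-\rho_{\text{rejected}}$ (reference log-probabilities: `reference_chosen_logps - reference_rejected_logps`)
- $\text{logits} = \pi_{\log \text{ratios}} - \rho_{\log \text{ratios}}$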
- loss
    - sigmoid
      
      $$
      \text{losses} = -\log \sigma(\beta \cdot \text{logits}) \cdot (1 - \alpha) - \log \sigma(-\beta \cdot \text{logits}) \cdot \alpha
      $$
      - $\alpha$: the `label_smoothing` parameter
    - hinge:

      $$
      \text{losses} = \max(0, 1 - \beta \cdot \text{logits})
      $$
      
    - ipo:

      $$
      \text{losses} = \left( \text{logits} - \frac{1}{2\beta} \right)^2
      $$
      
    - kto pair (https://arxiv.org/abs/2402.01306)
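
The sigmoid, hinge, and IPO formulas listed above can be compared side by side. The sketch below is plain Python that mirrors those formulas; the function names, the `loss_type` strings, and the log-probability values are illustrative assumptions, not the TRL source.

```python
import math

def log_sigmoid(z: float) -> float:
    """Stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pair_logits(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    """logits = pi_logratios - ref_logratios, from sequence log-probabilities."""
    pi_logratios = policy_chosen - policy_rejected
    ref_logratios = ref_chosen - ref_rejected
    return pi_logratios - ref_logratios

def preference_loss(logits, beta=0.1, loss_type="sigmoid", label_smoothing=0.0):
    """Per-pair loss for the three variants written out above."""
    if loss_type == "sigmoid":
        return (
            -log_sigmoid(beta * logits) * (1 - label_smoothing)
            - log_sigmoid(-beta * logits) * label_smoothing
        )
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        return (logits - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unknown loss_type: {loss_type}")

# Made-up log-probabilities: the policy prefers the chosen response more
# strongly than the reference does.
logits = pair_logits(-10.0, -18.0, -12.0, -15.0)

loss_sigmoid = preference_loss(logits, loss_type="sigmoid")
loss_smoothed = preference_loss(logits, loss_type="sigmoid", label_smoothing=0.1)
loss_hinge = preference_loss(logits, loss_type="hinge")
loss_ipo = preference_loss(logits, loss_type="ipo")
print(logits, loss_sigmoid, loss_smoothed, loss_hinge, loss_ipo)
```

With $\beta = 0.1$ the IPO target margin is $1/(2\beta) = 5$, so this example pair sits at the IPO optimum, while the sigmoid and hinge losses would still push the margin further.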