In [1]:
from IPython.display import Image

- misc
    - https://github.com/huggingface/alignment-handbook/tree/main
- 3 steps
    - pre-training a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. One calls the result a **"base model"**
    - supervised fine-tuning (SFT) to turn the base model into a useful assistant (ChatBot)
        - SFT turns the "base model" into a useful assistant by training it to **generate useful completions given human instructions.**
    - human preference fine-tuning, which increases the assistant's friendliness, helpfulness and safety.
        - target qualities: "safe", "friendly", "harmless", "inclusive"

## align & why align

- DPO: "Direct Preference Optimization: Your Language Model is **Secretly a Reward Model**"
    - https://arxiv.org/abs/2305.18290
- collect human/AI feedback to learn $p(y_w \succ y_l)$, the probability that response $y_w$ is preferred over $y_l$
- RLHF - the OG (Original Gangster, i.e. the originator) of LLM alignment

    $$
    \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y \mid x)} \underbrace{\left[ r_\phi(x, y) \right]}_{\text{maximise rewards}} - \underbrace{\beta \mathbb{D}_{\text{KL}} \left[ \pi_\theta(y \mid x) \parallel \pi_{\text{ref}}(y \mid x) \right]}_{\text{use KL penalty to prevent reward hacking (controlled by } \beta \text{)}}
    $$
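    - RL (PPO) has many hyperparameters, and training is unstable;
    - an RM (reward model) is also required, so three models must be managed in total: the actor model, the ref model, and the reward model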
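The two ingredients above, a learned preference probability and a KL-penalised reward, can be sketched in a few lines of plain Python. This is a minimal illustration and not the notebook's method: it assumes the Bradley-Terry form $p(y_w \succ y_l) = \sigma(r_w - r_l)$ used in the DPO paper, and the function names (`preference_prob`, `reward_model_loss`, `kl_penalised_reward`) are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def preference_prob(reward_w: float, reward_l: float) -> float:
    """Assumed Bradley-Terry model: P(y_w preferred over y_l) = sigma(r_w - r_l)."""
    return sigmoid(reward_w - reward_l)


def reward_model_loss(reward_w: float, reward_l: float) -> float:
    """Negative log-likelihood of one preference pair under the model above."""
    return -math.log(preference_prob(reward_w, reward_l))


def kl_penalised_reward(reward: float, logp_policy: float, logp_ref: float, beta: float) -> float:
    """Single-sample estimate of r(x, y) - beta * KL(pi_theta || pi_ref).

    For y sampled from the policy, log pi_theta(y|x) - log pi_ref(y|x)
    is a one-sample estimate of the KL term.
    """
    return reward - beta * (logp_policy - logp_ref)


# Equal rewards mean the model has no preference between the two responses.
print(preference_prob(0.0, 0.0))
# A response that drifts away from the reference model is penalised.
print(kl_penalised_reward(reward=1.0, logp_policy=-2.0, logp_ref=-3.0, beta=0.1))
```

A larger `beta` keeps the policy closer to the reference model at the cost of lower raw reward.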

## DPO (Direct Preference Optimization)

$$
\max_{\pi} \mathbb{E}_{(x, y_w, y_l) \sim D} \log \sigma \left( \beta \log \frac{\pi(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right)
$$
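- only two models (actor/active model, reference model (SFT))
- equivalently, minimise the loss $\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})$, defined as the negative of the objective above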
- derivative exercise

    $$
    \left(\log\sigma(z)\right)'=\frac{1}{\sigma(z)}\cdot \sigma(z)(1-\sigma(z))=1-\sigma(z)=\sigma(-z)
    $$
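The identity $(\log\sigma(z))' = \sigma(-z)$ can be checked numerically. The sketch below compares a central finite difference against the closed form at a few arbitrarily chosen points; the helper names are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_sigmoid(z: float) -> float:
    """log(sigma(z)), the function being differentiated."""
    return math.log(sigmoid(z))


def numerical_derivative(f, z: float, eps: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


# Compare the numerical derivative with the closed form sigma(-z).
for z in (-2.0, 0.0, 1.5):
    numeric = numerical_derivative(log_sigmoid, z)
    analytic = sigmoid(-z)
    print(f"z={z:+.1f}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```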

$$
\begin{align*}
\nabla_{\theta} \mathcal{L}_{\text{DPO}} (\pi_{\theta}; \pi_{\text{ref}}) = & -\beta \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \underbrace{\sigma \left( \hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w) \right)}_{\text{higher weight when reward estimate is wrong} } \left[ \underbrace{\nabla_{\theta} \log \pi(y_w | x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_{\theta} \log \pi(y_l | x)}_{\text{decrease likelihood of } y_l} \right] \right]
\end{align*}
$$

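- $\hat r_\theta(x,y)=\beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ (implicit reward from the LM)
    - It measures how strongly the model $\pi_\theta$ prefers the output $y$ relative to the reference model $\pi_{\text{ref}}$.
    - Unlike an explicit reward (for example, a human rating or a score from an explicit reward function), the implicit reward is computed from the model's own probability distribution. In DPO it comes directly from the model's output probabilities, hence the name "implicit reward".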
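To make the implicit reward concrete, here is a minimal pure-Python sketch of the per-pair DPO loss and of the sigmoid weight that appears in the gradient. It assumes the sequence log-probabilities $\log\pi(y|x)$ have already been summed over tokens; the function names and the default `beta=0.1` are illustrative assumptions, not taken from any library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """r_hat(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigma(r_hat(x, y_w) - r_hat(x, y_l))."""
    r_w = implicit_reward(logp_w, ref_logp_w, beta)
    r_l = implicit_reward(logp_l, ref_logp_l, beta)
    return -math.log(sigmoid(r_w - r_l))


def gradient_weight(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
                    beta: float = 0.1) -> float:
    """sigma(r_hat(x, y_l) - r_hat(x, y_w)): large when the pair is ranked wrongly.

    The full gradient multiplies this weight by beta and by the difference
    of the two log-likelihood gradients.
    """
    r_w = implicit_reward(logp_w, ref_logp_w, beta)
    r_l = implicit_reward(logp_l, ref_logp_l, beta)
    return sigmoid(r_l - r_w)


# When the policy equals the reference model, both implicit rewards are zero.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
print(gradient_weight(-5.0, -7.0, -5.0, -7.0))
```

When the policy already ranks $y_w$ above $y_l$ by a wide margin relative to the reference, the weight shrinks toward zero and that pair contributes little to the update.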

In [2]:
Image(url='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487af2f0-e51d-4140-92a7-23476c5ea016_1600x1015.png', width=400)

## practices

- https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/
    - dpo_llama2.py

```
accelerate launch examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo"
```

- basemodel: `meta-llama/Llama-2-7b-hf`
- sft:
- dpo (alignment => rlhf):

- dataset
    - lvwerra/stack-exchange-paired
        - fields: `question`, plus pairs (`response_j`, `response_k`) where j was rated better than k
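A record from this dataset can be reshaped into the prompt/chosen/rejected form that preference trainers typically consume. The sketch below is a hedged illustration: the field names `question`, `response_j` and `response_k` come from the dataset description above, while the function name `to_preference_pair` and the prompt template are assumptions made for the example.

```python
def to_preference_pair(record: dict) -> dict:
    """Map one stack-exchange-paired style record to a preference example.

    response_j is the higher-rated answer, so it becomes "chosen";
    response_k becomes "rejected". The prompt template is an assumption.
    """
    return {
        "prompt": "Question: " + record["question"] + "\n\nAnswer: ",
        "chosen": record["response_j"],
        "rejected": record["response_k"],
    }


# Hypothetical record, written by hand for illustration only.
example = {
    "question": "How do I reverse a list in Python?",
    "response_j": "Use reversed(my_list) or my_list[::-1].",
    "response_k": "You cannot reverse a list.",
}
pair = to_preference_pair(example)
print(pair["prompt"])
print(pair["chosen"])
```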