In [1]:
from IPython.display import Image

- misc
    - https://github.com/huggingface/alignment-handbook/tree/main
- 3 steps
    - pre-training a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. One calls the result a **"base model"**
    - supervised fine-tuning (SFT) to turn the base model into a useful assistant (ChatBot)
        - this turns the "base model" into a useful assistant by training it to **generate useful completions given human instructions.**
    - human preference fine-tuning, which increases the assistant's friendliness, helpfulness and safety.
        - target qualities: "safe", "friendly", "harmless", "inclusive"

## align & why align

- collect human/AI feedback to learn $p(y_w \succ y_l)$, the probability that response $y_w$ is preferred over $y_l$
- RLHF: the OG (original) approach to LLM alignment

    $$
    \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y \mid x)} \underbrace{\left[ r_\phi(x, y) \right]}_{\text{maximise rewards}} - \underbrace{\beta \mathbb{D}_{\text{KL}} \left[ \pi_\theta(y \mid x) \parallel \pi_{\text{ref}}(y \mid x) \right]}_{\text{use KL penalty to prevent reward hacking (controlled by } \beta \text{)}}
    $$
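    - RL (PPO) has many hyperparameters and training is unstable;
    - it also needs a reward model (RM), so three models are in play: the actor model, the reference model, and the reward model (PPO setups usually train a value/critic model on top of these)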
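
To make the objective concrete, here is a minimal sketch for a single prompt with three candidate responses. The distributions, reward scores and β values are made-up illustration numbers, not outputs of any real model; the point is only to show how the KL term penalises a policy that collapses onto the highest-reward response.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, ref, rewards, beta):
    """Expected reward under the policy minus beta * KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, ref)

# Hypothetical toy setup: three candidate responses for one prompt.
ref = [0.5, 0.3, 0.2]        # reference (SFT) policy
rewards = [0.0, 1.0, 5.0]    # reward model scores; the last one is "hackable"
greedy = [0.01, 0.01, 0.98]  # policy that collapses onto the highest reward
mild = [0.3, 0.3, 0.4]       # policy that stays close to the reference

for beta in (0.0, 3.0):
    print(f"beta={beta}: greedy={rlhf_objective(greedy, ref, rewards, beta):.3f}, "
          f"mild={rlhf_objective(mild, ref, rewards, beta):.3f}")
```

With β = 0 the collapsed policy scores higher; with β = 3 the policy that stays near the reference wins, which is the intended effect of the penalty.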

## DPO (Direct Preference Optimization)

$$
\max_{\pi_\theta} \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right)
$$
- only two models (actor/active model, reference model (sft))
- derivative exercise

    $$
    \left(\log\sigma(z)\right)'=\frac{1}{\sigma(z)}\cdot \sigma(z)(1-\sigma(z))=1-\sigma(z)=\sigma(-z)
    $$
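
The identity can be checked numerically. The sketch below compares a central finite difference of $\log\sigma(z)$ with $\sigma(-z)$ at a few arbitrary points; the helper names are introduced here for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_sigmoid(z):
    return math.log(sigmoid(z))

def numeric_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, 0.0, 1.5):
    # The two printed values should agree up to finite-difference error.
    print(z, numeric_derivative(log_sigmoid, z), sigmoid(-z))
```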

$$
\begin{align*}
\nabla_{\theta} \mathcal{L}_{\text{DPO}} (\pi_{\theta}; \pi_{\text{ref}}) = & -\beta \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \underbrace{\sigma \left( \hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w) \right)}_{\text{higher weight when reward estimate is wrong} } \left[ \underbrace{\nabla_{\theta} \log \pi_\theta(y_w | x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_{\theta} \log \pi_\theta(y_l | x)}_{\text{decrease likelihood of } y_l} \right] \right]
\end{align*}
$$

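- $\hat r_\theta(x,y)=\beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ (implicit reward from the LM)
    - It measures how strongly the model $\pi_\theta$ prefers the generation $y$ relative to the reference model $\pi_{\text{ref}}$.
    - Unlike an explicit reward (for example, a human rating or a hand-specified reward function), the implicit reward is computed from the model's own output probability distribution, hence the name "implicit reward" in DPO.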
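
A minimal sketch of the per-example DPO loss and of the weight that scales its gradient, written with plain floats instead of tensors. The sequence log-probabilities below are hypothetical numbers chosen for illustration, and `beta=0.1` is just a placeholder value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def implicit_reward(policy_logp, ref_logp, beta):
    """r_hat(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return (loss, gradient weight) for one preference pair."""
    r_w = implicit_reward(policy_logp_w, ref_logp_w, beta)
    r_l = implicit_reward(policy_logp_l, ref_logp_l, beta)
    loss = -math.log(sigmoid(r_w - r_l))
    grad_weight = sigmoid(r_l - r_w)  # large when the implicit rewards rank the pair wrongly
    return loss, grad_weight

# Hypothetical sequence log-probabilities under the reference model.
ref_w, ref_l = -12.0, -10.0

at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)   # policy == reference
correct = dpo_loss(-8.0, -14.0, ref_w, ref_l)    # policy already favours y_w
wrong = dpo_loss(-16.0, -6.0, ref_w, ref_l)      # policy favours y_l

print(at_init, correct, wrong)
```

When the policy equals the reference, both implicit rewards are zero, so the loss is $\log 2$ and the weight is 0.5. Pairs that the policy ranks wrongly get a weight above 0.5 and therefore a larger update.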

In [2]:
Image(url='https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F487af2f0-e51d-4140-92a7-23476c5ea016_1600x1015.png', width=400)

## practices

- https://colab.research.google.com/drive/1mWiOFBy3zY6OdINEvHN9EPoQ_VIvfFKw

In [3]:
# !pip install flash-attn --no-build-isolation
import os

# Optional: route HTTP(S) traffic through a local proxy; adjust or remove for your environment
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

### dataset

In [4]:
from datasets import load_dataset

raw_datasets = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

In [5]:
# three kinds of splits (gen, sft, prefs), each with a train and a test part
raw_datasets

DatasetDict({
    train_prefs: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 61135
    })
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 61135
    })
    test_prefs: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 2000
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 1000
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 61135
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows

In [6]:
from datasets import DatasetDict

indices = range(0,100)

dataset_dict = {"train": raw_datasets["train_prefs"].select(indices),
                "test": raw_datasets["test_prefs"].select(indices)}

raw_datasets = DatasetDict(dataset_dict)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 100
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'],
        num_rows: 100
    })
})

In [7]:
print(raw_datasets['train'][0]['prompt'])
print('=================')
print(raw_datasets['train'][0]['chosen'][1]['content'])
print('=================')
print(raw_datasets['train'][0]['rejected'][1]['content'])

how can i develop a habit of drawing daily
Developing a daily habit of drawing can be challenging but with consistent practice and a few tips, it can become an enjoyable and rewarding part of your daily routine. Here are some strategies to help you develop the habit of drawing daily:

1. Set a specific time: Allocate a specific time of the day to draw. It could be in the morning, afternoon, or evening. Make drawing a part of your daily routine.
2. Set a specific duration: Determine the amount of time you want to spend on drawing each day. It can be as little as 10 minutes or as long as an hour. Be consistent with the duration to help build the habit.
3. Start small and simple: Don't try to create a masterpiece every day, start with simple and easy-to-do sketches. Focus on improving your skills gradually.
4. Use a variety of tools and mediums: Experiment with different tools like pencils, pens, markers, and different mediums like paper, canvas, or digital apps to keep your drawing pract

### tokenizer

In [8]:
from transformers import AutoTokenizer

model_id = "alignment-handbook/zephyr-7b-sft-lora"

tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

In [9]:
tokenizer.padding_side

'left'

In [10]:
tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'pad_token': '</s>'}

In [11]:
# Truncate from left to ensure we don't lose labels in final turn
tokenizer.truncation_side = "left"

# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 2048
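
Why truncate from the left: in a chat sample the final assistant turn sits at the end of the sequence, so dropping tokens from the right would cut the part the loss is computed on. The sketch below illustrates the idea on a plain Python list; it is not the tokenizer's actual implementation, and the token strings are invented.

```python
def truncate(token_ids, max_length, side="right"):
    """Keep at most max_length tokens, dropping the excess from the given side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        return list(token_ids[-max_length:])  # keep the tail (final turn)
    return list(token_ids[:max_length])       # keep the head

# Hypothetical tokenised dialogue: a user question followed by the assistant answer.
tokens = ["<|user|>", "q1", "q2", "q3", "<|assistant|>", "a1", "a2"]

print(truncate(tokens, 4, side="left"))   # the assistant answer survives
print(truncate(tokens, 4, side="right"))  # the assistant answer is lost
```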

In [12]:
print(tokenizer.chat_template)

{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
'  + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}


In [13]:
DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
print(DEFAULT_CHAT_TEMPLATE)

{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>
' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>
'  + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}


In [14]:
tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE

### Apply chat template

- used for reward modeling, i.e. for telling positive examples from negative ones
- two required keys
    - chosen
    - rejected
- `</s>`: used as both eos_token and pad_token
- roles: system, user, assistant
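
As a rough mental model of what the chat template produces, here is a plain-Python approximation. It assumes each message renders as a role tag, a newline, the content, the EOS token and a trailing newline, which matches the `text_prompt` printed later in this notebook; `render_chat` is a hypothetical helper, not part of `transformers`.

```python
def render_chat(messages, eos_token="</s>", add_generation_prompt=False):
    """Plain-Python approximation of the Zephyr-style chat template."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only handles these three roles; others are skipped.
        if role in ("user", "system", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    # The generation prompt is emitted once, after the last message.
    if messages and add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "how can i develop a habit of drawing daily"},
]
print(render_chat(messages, add_generation_prompt=True))
```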

In [15]:
import re


def apply_chat_template(example, tokenizer, assistant_prefix="<|assistant|>\n"):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if all(k in example.keys() for k in ("chosen", "rejected")):
        # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
        prompt_messages = [[msg for msg in example["chosen"] if msg["role"] == "user"][0]]
        # Insert system message
        if example["chosen"][0]["role"] != "system":
            prompt_messages.insert(0, {"role": "system", "content": ""})
        else:
            prompt_messages.insert(0, example["chosen"][0])
        # TODO: handle case where chosen/rejected also have system messages
        chosen_messages = example["chosen"][1:]
        rejected_messages = example["rejected"][1:]
        example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
        example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        example["text_prompt"] = tokenizer.apply_chat_template(
            prompt_messages, tokenize=False, add_generation_prompt=True
        )
        example["text_chosen"] = _strip_prefix(example["text_chosen"], assistant_prefix)
        example["text_rejected"] = _strip_prefix(example["text_rejected"], assistant_prefix)
    else:
        raise ValueError(
            f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
        )

    return example
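
The `re.escape` call in `_strip_prefix` matters because `<|assistant|>` contains `|`, which is the alternation operator in a regular expression. The standalone sketch below contrasts the escaped and unescaped variants on an invented completion string; `strip_prefix_unescaped` exists only to show the failure mode.

```python
import re

def strip_prefix(s, pattern):
    """Remove a literal prefix, escaping regex metacharacters first."""
    return re.sub(f"^{re.escape(pattern)}", "", s)

def strip_prefix_unescaped(s, pattern):
    """Buggy variant: '|' in the pattern is read as regex alternation."""
    return re.sub(f"^{pattern}", "", s)

prefix = "<|assistant|>\n"
text = "<|assistant|>\nSure, here is a plan.</s>\n"  # hypothetical rendered completion

print(repr(strip_prefix(text, prefix)))
print(repr(strip_prefix_unescaped(text, prefix)))
```

On Python 3.9+, `text.removeprefix(prefix)` gives the same result as the escaped version without using regular expressions.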

In [16]:
test = apply_chat_template(raw_datasets['train'][0], tokenizer)

In [17]:
print(test['text_prompt'])

<|system|>
</s>
<|user|>
how can i develop a habit of drawing daily</s>
<|assistant|>



In [18]:
list(raw_datasets['train'].features)

['prompt',
 'prompt_id',
 'chosen',
 'rejected',
 'messages',
 'score_chosen',
 'score_rejected']

In [19]:
from multiprocessing import cpu_count

column_names = list(raw_datasets["train"].features)

raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=cpu_count(),
        remove_columns=column_names,
        desc="Formatting comparisons with prompt template",
)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text_chosen', 'text_rejected', 'text_prompt'],
        num_rows: 100
    })
    test: Dataset({
        features: ['text_chosen', 'text_rejected', 'text_prompt'],
        num_rows: 100
    })
})

In [20]:
# Replace column names with what TRL needs, text_chosen -> chosen and text_rejected -> rejected
for split in ["train", "test"]:
    raw_datasets[split] = raw_datasets[split].rename_columns(
        {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
    )
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 100
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'prompt'],
        num_rows: 100
    })
})

In [21]:
import random

# Print a few random samples from the training set:
for index in random.sample(range(len(raw_datasets["train"])), 3):
    print(f"Prompt sample {index} of the raw training set:\n\n{raw_datasets['train'][index]['prompt']}")
    print(f"Chosen sample {index} of the raw training set:\n\n{raw_datasets['train'][index]['chosen']}")
    print(f"Rejected sample {index} of the raw training set:\n\n{raw_datasets['train'][index]['rejected']}")

Prompt sample 98 of the raw training set:

<|system|>
</s>
<|user|>
In the year 1630 during the month of January Captain Cook began his voyage from the port of Calais in Europe, which is in the Northern Hemisphere. His intention was  to find the mythical land in the Southern Hemisphere. After about eleven months of sailing, in December 1630, he reached Australia, which is indeed in the Southern Hemisphere. Thinking he finally reached the mythical  land, he commenced his return voyage to Europe on June,  1631, He reached Europe in the middle of the following year on July, 1632.  Given the paragraph above, please answer correctly the following question:   Did Australia exprience increased or decreased solar flux when Captain Cook reached Australia in 1630?
----
Answer: increased


David is an environmental scientist. He needed to find causes of wildfires and suggest preventive measures. First, he visited a dense forest. He marked it as location A. Then he visited a grassland, which he ma

### SFT model

In [22]:
from peft import PeftConfig

peft_config = PeftConfig.from_pretrained(model_id)
print("Adapter weights model repo:", model_id)
print("Base model weights model repo:", peft_config.base_model_name_or_path)

Adapter weights model repo: alignment-handbook/zephyr-7b-sft-lora
Base model weights model repo: mistralai/Mistral-7B-v0.1


In [23]:
import torch
from peft import PeftModel
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

In [24]:
torch.cuda.current_device()

0

In [25]:
# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

In [26]:
# Step 1: load the base model (Mistral-7B in our case) in 4-bit
model_kwargs = dict(
    attn_implementation="flash_attention_2",  # needs the flash-attn package and a supported GPU; otherwise use "sdpa" or "eager"
    torch_dtype="auto",
    use_cache=False,  # set to False as we're going to use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
)
base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, **model_kwargs)

# Step 2: load base model + SFT adapter weights
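# note: PeftModel.from_pretrained loads the adapter frozen by default (is_trainable=False),
# so no parameter is trainable at this point (see the counts below)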
model = PeftModel.from_pretrained(base_model, model_id)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [27]:
def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [28]:
count_trainable_params(model)

0

In [29]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [30]:
print_trainable_parameters(model)

trainable params: 0 || all params: 3794014208 || trainable%: 0.0
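
A 7B model reporting roughly 3.79B parameters looks odd at first. One plausible explanation, sketched below, is that 4-bit quantization stores two weights per byte, so `numel()` of each quantized linear layer is half its logical weight count. The dimensions are those of the public Mistral-7B-v0.1 config; the SFT adapter rank of 16 on all seven projections is an assumption, chosen because it reproduces the reported total.

```python
# Mistral-7B-v0.1 dimensions (public config values, treated as assumptions here).
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 14336, 32, 32000
KV_DIM = 1024  # 8 key/value heads * 128 head dim

# (in_features, out_features) of the seven linear projections in each decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV_DIM), "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN), "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE), "down_proj": (INTERMEDIATE, HIDDEN),
}

linear = LAYERS * sum(i * o for i, o in PROJECTIONS.values())
# Not quantized: input embeddings, lm_head, two RMSNorm weights per layer, final norm.
unquantized = 2 * VOCAB * HIDDEN + LAYERS * 2 * HIDDEN + HIDDEN

SFT_RANK = 16  # assumption: rank of the SFT LoRA adapter, not stated in the notebook
adapter = LAYERS * sum(SFT_RANK * (i + o) for i, o in PROJECTIONS.values())

full_precision_total = linear + unquantized
reported_total = linear // 2 + unquantized + adapter  # two 4-bit weights per stored byte

print(full_precision_total, reported_total)
```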


### DPOTrainer

In [31]:
import os
os.environ['NCCL_P2P_DISABLE'] = '1' 
os.environ['NCCL_IB_DISABLE'] = '1'

In [32]:
# DPOTrainer??

In [33]:
from trl import DPOTrainer
from peft import LoraConfig
from transformers import TrainingArguments

# path where the Trainer will save its checkpoints and logs
output_dir = 'data/zephyr-7b-dpo-lora'

# based on config
training_args = TrainingArguments(
    bf16=True,
    # beta=0.01,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=100,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant":False},
    hub_model_id="zephyr-7b-dpo-qlora",
    learning_rate=5.0e-6,
    log_level="info",
    logging_steps=10,
    lr_scheduler_type="cosine",
    # max_length=1024,
    # max_prompt_length=512,
    num_train_epochs=1,
    optim="paged_adamw_32bit",
    output_dir=output_dir,  # It is handy to append `hub_model_revision` to keep track of your local experiments
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    # push_to_hub=True,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=1,
    seed=42,
    warmup_ratio=0.1,
)

# based on the recipe: https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-beta/dpo/config_qlora.yaml
peft_config = LoraConfig(
        r=128,
        lora_alpha=128,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj",  "up_proj",  "down_proj"],
)

trainer = DPOTrainer(
    model,
    ref_model=None,
    model_init_kwargs=None,
    ref_model_init_kwargs=None,
    args=training_args,
    # beta=training_args.beta,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["test"],
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
    peft_config=peft_config,
    # loss_type=training_args.loss_type,
)


[2024-05-28 00:41:51,923] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
Using auto half precision backend
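
The `LoraConfig` above (rank 128 on seven projection types) determines how many parameters DPO will train. As a back-of-the-envelope check, the sketch below counts them, assuming the layer shapes of the public Mistral-7B-v0.1 config; `lora_param_count` is a hypothetical helper, not a PEFT function. The result matches the "Number of trainable parameters" line in the training log.

```python
# Mistral-7B-v0.1 dimensions (public config values, treated as assumptions here).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) for each entry of target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV_DIM), "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN), "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE), "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) to every adapted linear layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return num_layers * per_layer

print(lora_param_count(128))
```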


In [34]:
train_result = trainer.train()

You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.
***** Running training *****
  Num examples = 100
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 12
  Number of trainable parameters = 335,544,320
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mlanchunhui[0m ([33mloveresearch[0m). Use [1m`wandb login --relogin`[0m to force relogin



The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 482.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 435.69 MiB is free. Including non-PyTorch memory, this process has 23.20 GiB memory in use. Of the allocated memory 20.84 GiB is allocated by PyTorch, and 1.77 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
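
The batch-size figures in the training log above can be reproduced by hand. The sketch below assumes two visible GPUs (inferred from "batch size has been adjusted to: 2") and floor division of batches by the accumulation steps, which matches the logged numbers; the exact rounding may differ between library versions, and `optimization_steps` is a hypothetical helper.

```python
import math

def optimization_steps(num_examples, per_device_batch, num_devices, grad_accum, epochs=1):
    """Return (total train batch size, number of optimizer updates)."""
    batch_per_pass = per_device_batch * num_devices            # samples per forward/backward pass
    batches_per_epoch = math.ceil(num_examples / batch_per_pass)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return batch_per_pass * grad_accum, math.ceil(epochs * updates_per_epoch)

total_batch, steps = optimization_steps(
    num_examples=100, per_device_batch=1, num_devices=2, grad_accum=4
)
print(total_batch, steps)
```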