https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb#scrollTo=IQ1sMda27Zj6

In [2]:
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

### about SFT

Recall that creating a ChatGPT at home involves 3 steps:

1. Pre-training a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. The result is called a "base model".
2. Supervised fine-tuning (SFT) to turn the base model into a useful assistant.
   - base model => "chatbot"/"assistant"
   - The model is fine-tuned on human instruction data, using the cross-entropy loss.
   - This means that the model is **still trained to predict the next token**, although we now want the model to generate useful completions given an instruction like "what are 10 things to do in London?", "How can I make pancakes?" or "Write me a poem about elephants".
   - https://gizmodo.com/chatgpt-openai-ai-contractors-15-dollars-per-hour-1850415474
       - The annotators are contract workers who typically earn about $15 per hour.
3. Human preference fine-tuning, which increases the assistant's friendliness, helpfulness and safety.
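
Step 2 keeps the next-token objective, so the SFT loss is ordinary cross-entropy over shifted tokens. The block below is a minimal pure-Python sketch of that computation on toy data. The function name `next_token_cross_entropy` and the 4-token vocabulary are illustrative assumptions, not part of the notebook; real training computes the same quantity on GPU tensors.

```python
import math

def next_token_cross_entropy(logits, token_ids):
    """Mean cross-entropy for predicting token t+1 from the logits at position t.

    logits: one list of vocabulary scores per input position.
    token_ids: the token ids of the same sequence.
    """
    losses = []
    # The logits at position t are scored against the token that follows it.
    for position_logits, target_id in zip(logits[:-1], token_ids[1:]):
        # log-sum-exp, shifted by the maximum for numerical stability
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        losses.append(log_norm - position_logits[target_id])
    return sum(losses) / len(losses)

# Toy vocabulary of 4 tokens (hypothetical); uniform logits mean the model is guessing.
toy_token_ids = [0, 3, 1]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in toy_token_ids]
uniform_loss = next_token_cross_entropy(uniform_logits, toy_token_ids)
print(round(uniform_loss, 4))  # log(4), i.e. 1.3863

# Logits that favour the correct next token give a lower loss.
confident_logits = [[0.0, 0.0, 0.0, 5.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident_loss = next_token_cross_entropy(confident_logits, toy_token_ids)
```

A model that guesses uniformly over a vocabulary of size V scores log(V), which is a useful sanity check for the first logged loss value.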

SFT variants:
- RAG SFT
    - https://www.bilibili.com/video/BV1Yx4y147t4/
- Multi-Turn conversation SFT
- Tool use (function calling) SFT

### dataset

- Zephyr: distilled SFT (dSFT), distilled DPO (dDPO)
    - https://arxiv.org/pdf/2310.16944
    - https://github.com/huggingface/alignment-handbook

In [89]:
from datasets import load_dataset

# based on config
raw_datasets = load_dataset("HuggingFaceH4/ultrachat_200k")

In [90]:
raw_datasets

DatasetDict({
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 207865
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 23110
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 256032
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 28304
    })
})

In [91]:
from datasets import DatasetDict
raw_datasets = DatasetDict({
    "train": raw_datasets["train_sft"],
    "test": raw_datasets["test_sft"]
})

In [95]:
# from datasets import load_dataset

# passing a list of splits returns a list of Datasets, in the same order
# train_split, test_split = load_dataset("HuggingFaceH4/ultrachat_200k", split=["train_sft", "test_sft"])
# raw_datasets = DatasetDict({
#     "train": train_split,
#     "test": test_split
# })

In [78]:
# from datasets import DatasetDict

# # remove this when done debugging
# indices = range(0,100)

# dataset_dict = {"train": raw_datasets["train_sft"].select(indices),
#                 "test": raw_datasets["test_sft"].select(indices)}

# raw_datasets = DatasetDict(dataset_dict)
# raw_datasets

In [96]:
raw_datasets['train'][0].keys()

dict_keys(['prompt', 'prompt_id', 'messages'])

In [97]:
print(raw_datasets['train'][0]['prompt_id'], raw_datasets['train'][0]['prompt'])

f0e37e9f7800261167ce91143f98f511f768847236f133f2d0aed60b444ebe57 These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?


In [99]:
for msg in raw_datasets['train'][0]['messages']:
    role = msg['role']
    content = msg['content']
    print(f'{role:20}:  {content}')

user                :  These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?
assistant           :  This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.
user                :  Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?
assistant           :  Sure, here are the steps to enable the secondary 

### tokenizer

- pad token
    - Pre-training needs no padding, since the text is simply cut into fixed-size blocks for next-token prediction. During fine-tuning, the (instruction, completion) pairs have different lengths, so they must be padded to form batches of equal length.
- max sequence length
    - This is required in order to truncate sequences that are too long for the model. Here we train on at most 2048 tokens.
- chat template: https://huggingface.co/blog/chat-templates
    - `<|user|>` marks a user message and `<|assistant|>` marks the chatbot's response
    - In Hugging Face Transformers, the `chat_template` is defined on the tokenizer.
    - For a base model it is `None`; the tokenizer of an instruct model defines one.
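
To make the padding and truncation points concrete, here is a small pure-Python sketch of what batching does to token-id lists of different lengths. The helper `pad_and_truncate` and the toy ids are hypothetical; in practice the tokenizer or data collator performs this step.

```python
def pad_and_truncate(batch, pad_token_id, max_length):
    """Truncate to max_length, then right-pad every sequence to the longest one.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    truncated = [ids[:max_length] for ids in batch]
    target_len = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

EOS_TOKEN_ID = 2  # id of </s> in the Mistral tokenizer, reused here as the pad id
toy_batch = [[1, 5, 6, 7, 8, 9], [1, 5, 6], [1, 4]]  # made-up token ids
padded_ids, mask = pad_and_truncate(toy_batch, pad_token_id=EOS_TOKEN_ID, max_length=4)
print(padded_ids)  # [[1, 5, 6, 7], [1, 5, 6, 2], [1, 4, 2, 2]]
print(mask)        # [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0]]
```

Because the attention mask marks which positions are padding, reusing the EOS id as the pad id does not by itself make padded positions look like real tokens to the model.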

In [17]:
from transformers import AutoTokenizer

In [18]:
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [20]:
tokenizer.pad_token, tokenizer.pad_token_id

(None, None)

In [45]:
print(tokenizer.bos_token, tokenizer.eos_token)
print(tokenizer.encode('<s>', add_special_tokens=False), tokenizer.encode('</s>', add_special_tokens=False))
print(tokenizer.decode(1), tokenizer.decode(2))

<s> </s>
[1] [2]
<s> </s>


In [74]:
# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
  tokenizer.pad_token_id = tokenizer.eos_token_id

In [47]:
tokenizer.model_max_length

1000000000000000019884624838656

In [48]:
# Set reasonable default for models without max length
if tokenizer.model_max_length > 100_000:
  tokenizer.model_max_length = 2048

In [21]:
tokenizer.chat_template

In [27]:
print('meta-llama/Meta-Llama-3-8B')
print(AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B').chat_template)
print('======================')
print('meta-llama/Meta-Llama-3-8B-Instruct')
print(AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct').chat_template)
print('======================')
print('mistralai/Mistral-7B-Instruct-v0.1')
print(AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1').chat_template)

meta-llama/Meta-Llama-3-8B
None
meta-llama/Meta-Llama-3-8B-Instruct
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
mistralai/Mistral-7B-Instruct-v0.1
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}


### apply chat template

In [50]:
DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE

- `tokenizer.apply_chat_template(messages, tokenize=False)`
    - Takes a list of messages (dicts with `role` and `content`)
    - Formats each message according to its role:
        - `'<|user|>\n' + message['content'] + eos_token`
        - `'<|system|>\n' + message['content'] + eos_token`
        - `'<|assistant|>\n'  + message['content'] + eos_token`
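
The per-role formatting listed above can be mimicked without Jinja. The sketch below is an approximation written for illustration: `render_chat` is a hypothetical helper, and the newline placed after each message assumes the whitespace handling that Transformers applies when it compiles the template.

```python
EOS_TOKEN = "</s>"  # the Mistral tokenizer's eos_token

def render_chat(messages, eos_token=EOS_TOKEN, add_generation_prompt=False):
    """Approximate the text that DEFAULT_CHAT_TEMPLATE produces for a conversation."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only handles these three roles; anything else is skipped.
        if role in ("system", "user", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    # The generation prompt is emitted once, after the last message.
    if add_generation_prompt and messages:
        parts.append("<|assistant|>\n")
    return "".join(parts)

toy_messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "How can I make pancakes?"},
]
rendered = render_chat(toy_messages, add_generation_prompt=True)
print(rendered)
```

For training, `add_generation_prompt` stays off because the assistant turn is already in the data; at inference time it is switched on so that the model continues with an assistant reply.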


In [57]:
tokenizer.eos_token

'</s>'

In [101]:
from multiprocessing import cpu_count

def apply_chat_template(example, tokenizer):
    messages = example["messages"]
    # We add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)

    return example

column_names = list(raw_datasets["train"].features)
column_names

['prompt', 'prompt_id', 'messages']

In [102]:
raw_datasets = raw_datasets.map(apply_chat_template,
                                num_proc=cpu_count(),
                                fn_kwargs={"tokenizer": tokenizer},
                                remove_columns=column_names,
                                desc="Applying chat template",)

In [103]:
# create the splits
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

train_dataset

Dataset({
    features: ['text'],
    num_rows: 207865
})

In [104]:
print(raw_datasets['train'][0]['text'])

<|system|>
</s>
<|user|>
These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?</s>
<|assistant|>
This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.</s>
<|user|>
Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?</s>
<|assistant|>
Sure, here are the steps to enable the secondary image hover featur

### model

In [105]:
from transformers import BitsAndBytesConfig, TrainingArguments
import torch

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_compute_dtype=torch.bfloat16,
)
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

model_kwargs = dict(
    attn_implementation="flash_attention_2", # requires the flash-attn package and a supported GPU; use "sdpa" or "eager" otherwise
    torch_dtype="auto",
    use_cache=False, # set to False as we're going to use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
)

In [None]:
quantization_config

In [60]:
device_map

{'': 0}

In [69]:
from transformers import AutoModelForCausalLM

In [71]:
model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)

loading configuration file config.json from cache at /media/whaow/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/7231864981174d9bee8c7687c24c8344414eae6b/config.json
Model config MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-v0.1",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": false,
  "vocab_size": 32000
}

loading weights file model.safetensors from cache at /media/whaow/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1


All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /media/whaow/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/7231864981174d9bee8c7687c24c8344414eae6b/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}
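
The config printed above is enough to work out the model's size and to see why 4-bit loading matters. The sketch below is a back-of-the-envelope estimate, assuming the usual Mistral layer layout (bias-free q/k/v/o projections, a gated MLP with three weight matrices, two RMSNorm weights per layer, untied input and output embeddings). The memory figures cover weights only and ignore quantization constants, layers that stay unquantized, activations and optimizer state.

```python
# Values copied from the MistralConfig printed in the loading log.
config = {
    "vocab_size": 32000,
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
}

def count_parameters(cfg):
    """Parameter count under the assumed layer layout described above."""
    hidden = cfg["hidden_size"]
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]
    attention = hidden * q_out + 2 * hidden * kv_out + q_out * hidden  # q, k, v, o
    mlp = 3 * hidden * cfg["intermediate_size"]  # gate, up and down projections
    norms = 2 * hidden  # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * cfg["vocab_size"] * hidden  # input embeddings + untied lm_head
    return cfg["num_hidden_layers"] * per_layer + embeddings + hidden  # + final norm

total_params = count_parameters(config)
bf16_gb = total_params * 2 / 1e9    # 2 bytes per weight
nf4_gb = total_params * 0.5 / 1e9   # 4 bits per weight
print(f"{total_params:,}")                      # 7,241,732,096
print(round(bf16_gb, 1), round(nf4_gb, 1))      # 14.5 3.6
```

Under these assumptions the weights shrink from roughly 14.5 GB in bf16 to roughly 3.6 GB in 4-bit, which is what makes the model fit on a single consumer GPU.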



### trl sft trainer

In [62]:
import os
os.environ['NCCL_P2P_DISABLE'] = "1"
os.environ['NCCL_IB_DISABLE'] = '1'

In [106]:
from trl import SFTTrainer
from peft import LoraConfig
from transformers import TrainingArguments

# path where the Trainer will save its checkpoints and logs
output_dir = 'data/mistral-7b-sft-lora'

# based on config
training_args = TrainingArguments(
    fp16=True, # specify bf16=True instead when training on GPUs that support bf16
    do_eval=True,
    eval_strategy="epoch",
    per_device_eval_batch_size=4, # originally set to 8
    per_device_train_batch_size=4, # originally set to 8
    gradient_accumulation_steps=64,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2.0e-05,
    log_level="info",
    logging_steps=5,
    logging_strategy="steps",
    lr_scheduler_type="cosine",
    max_steps=-1,
    num_train_epochs=1,
    output_dir=output_dir,
    overwrite_output_dir=True,
    report_to="wandb",
    save_strategy="no",
    save_total_limit=None,
    seed=42,
)

# based on config
peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
        # model=model_id,
        # model_init_kwargs=model_kwargs,
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True,
        peft_config=peft_config,
        max_seq_length=tokenizer.model_max_length,
        dataset_num_proc=cpu_count()
    )

PyTorch: setting up devices

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]


Using auto half precision backend


In [76]:
train_result = trainer.train()

***** Running training *****
  Num examples = 67
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Training with DataParallel so batch size has been adjusted to: 2
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 128
  Total optimization steps = 1
  Number of trainable parameters = 54,525,952
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


Epoch,Training Loss,Validation Loss
1,No log,1.163194



***** Running Evaluation *****
  Num examples = 64
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)
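
Two numbers in the training log above can be checked by hand: the trainable-parameter count and the total train batch size. The sketch below is a plain arithmetic check; the helper names are hypothetical, and the projection shapes are derived from the model config (hidden size 4096, 8 key/value heads of dimension 128), assuming LoRA adds one `r x d_in` and one `d_out x r` matrix per targeted module.

```python
def lora_parameter_count(rank, module_shapes, num_layers):
    """Trainable LoRA weights: rank * (d_in + d_out) per targeted module, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers

def effective_batch_size(batch_per_step, gradient_accumulation_steps, num_processes=1):
    """Number of sequences that contribute to one optimizer update."""
    return batch_per_step * gradient_accumulation_steps * num_processes

# (d_in, d_out) of the modules listed in target_modules
attention_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),  # 8 key/value heads * head_dim 128
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

trainable = lora_parameter_count(rank=64, module_shapes=attention_shapes, num_layers=32)
print(f"{trainable:,}")  # 54,525,952, matching the log

# Values from the log: DataParallel batch of 2, 128 accumulation steps.
logged_batch = effective_batch_size(2, 128)
# Values from the TrainingArguments cell: 4 per device, 64 accumulation steps.
configured_batch = effective_batch_size(4, 64)
print(logged_batch, configured_batch)  # 256 256
```

With r=64 on the four attention projections, LoRA trains about 54.5M weights, well under 1% of the roughly 7.2B weights in the base model.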


