https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb#scrollTo=IQ1sMda27Zj6

In [2]:
import os
# Route HTTP(S) traffic through a local proxy; only needed if your network requires one.
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

### about SFT

Recall that creating a ChatGPT at home involves 3 steps:

1. Pre-training a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. The result is called a "base model".
2. Supervised fine-tuning (SFT) to turn the base model into a useful assistant.
   - base model => "chatbot"/"assistant"
   - The model is fine-tuned on human instruction data, using the cross-entropy loss.
   - This means that the model is **still trained to predict the next token**, although we now want it to generate useful completions given an instruction like "What are 10 things to do in London?", "How can I make pancakes?" or "Write me a poem about elephants".
   - https://gizmodo.com/chatgpt-openai-ai-contractors-15-dollars-per-hour-1850415474
       - The contractors who label this data typically earn about $15 per hour.
3. Human preference fine-tuning, which increases the assistant's friendliness, helpfulness and safety.
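
To make the "still trained to predict the next token" point concrete, here is a minimal pure-Python sketch of the next-token cross-entropy loss. The function name `next_token_cross_entropy` and the toy logits are illustrative assumptions, not part of the original notebook; a real training run would compute this with a deep learning framework over batches.

```python
import math

def next_token_cross_entropy(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits:    one list of vocabulary scores per position (hypothetical toy input)
    token_ids: the token ids of the sequence, same length as logits
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        target = token_ids[t + 1]  # the label is simply the next token
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]  # equals -log softmax(row)[target]
    return total / (len(token_ids) - 1)

# Toy example: vocabulary of 3 tokens and uniform scores at every position,
# so every next token has probability 1/3.
toy_logits = [[0.0, 0.0, 0.0]] * 4
toy_tokens = [0, 2, 1, 1]
toy_loss = next_token_cross_entropy(toy_logits, toy_tokens)
```

With uniform scores the loss equals `log(3)`, the value for a model that has learned nothing. SFT uses this same objective, only on instruction data instead of raw web text.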

### dataset

- Zephyr: distilled SFT (dSFT), distilled DPO (dDPO)
    - https://arxiv.org/pdf/2310.16944
    - https://github.com/huggingface/alignment-handbook

In [3]:
from datasets import load_dataset

# based on config
raw_datasets = load_dataset("HuggingFaceH4/ultrachat_200k")

In [4]:
raw_datasets

DatasetDict({
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 207865
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 23110
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 256032
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 28304
    })
})

In [5]:
from datasets import DatasetDict

# remove this when done debugging
indices = range(0,100)

dataset_dict = {"train": raw_datasets["train_sft"].select(indices),
                "test": raw_datasets["test_sft"].select(indices)}

raw_datasets = DatasetDict(dataset_dict)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 100
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 100
    })
})

In [9]:
raw_datasets['train'][0].keys()

dict_keys(['prompt', 'prompt_id', 'messages'])

In [13]:
print(raw_datasets['train'][0]['prompt_id'], raw_datasets['train'][0]['prompt'])

f0e37e9f7800261167ce91143f98f511f768847236f133f2d0aed60b444ebe57 These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?


In [14]:
for msg in raw_datasets['train'][0]['messages']:
    role = msg['role']
    content = msg['content']
    print(f'{role:20}:  {content}')

user                :  These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?
assistant           :  This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.
user                :  Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?
assistant           :  Sure, here are the steps to enable the secondary 

### tokenizer

- pad token
    - During pre-training, no padding is needed because the text is simply packed into fixed-size blocks for next-token prediction. During fine-tuning, the (instruction, completion) pairs have different lengths, so we need to pad them to create batches of equal length.
- max seqlen
    - A maximum sequence length is required in order to truncate sequences that are too long for the model. Here we train on at most 2048 tokens.
- chat template
    - Markers such as `<|user|>` indicate a user message and `<|assistant|>` the chatbot's response.
    - In HF Transformers, the `chat_template` is defined on the tokenizer.
    - For a base model it is `None`; the tokenizer of an instruct model has one defined.
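
As a rough illustration of the pad token and max seqlen points above, the sketch below truncates each sequence to a maximum length and then pads to the longest remaining sequence. The helper name `pad_and_truncate`, the choice of right-padding, and the pad id `0` are assumptions made for this example; in practice the tokenizer or a data collator does this for you.

```python
def pad_and_truncate(sequences, max_len, pad_id):
    """Truncate each token-id list to max_len, then right-pad to equal length.

    Returns (input_ids, attention_mask); mask entries are 1 for real tokens
    and 0 for padding.
    """
    truncated = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Toy batch: one sequence longer than max_len, one shorter.
toy_batch = [[5, 6, 7, 8, 9], [3, 4]]
ids, mask = pad_and_truncate(toy_batch, max_len=4, pad_id=0)
# ids  -> [[5, 6, 7, 8], [3, 4, 0, 0]]
# mask -> [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask lets the model ignore the padded positions, which is why a pad token must exist before batching.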

In [17]:
from transformers import AutoTokenizer

In [18]:
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [20]:
tokenizer.pad_token, tokenizer.pad_token_id

(None, None)

In [21]:
tokenizer.chat_template

In [24]:
print(AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct').chat_template)

{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
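
The Jinja template printed above is dense, so here is a plain-Python reading of what it does. This is a sketch for understanding only: the function name `render_llama3_chat` is made up for this example, and the default `bos_token` value is an assumption about that tokenizer. In real code you would call `tokenizer.apply_chat_template(...)` instead.

```python
def render_llama3_chat(messages, bos_token="<|begin_of_text|>",
                       add_generation_prompt=False):
    """Mirror the printed Jinja chat template step by step."""
    parts = []
    for i, message in enumerate(messages):
        # Role header, blank line, trimmed content, end-of-turn marker.
        content = ("<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
                   + message["content"].strip() + "<|eot_id|>")
        if i == 0:
            # Only the first message is prefixed with the BOS token.
            content = bos_token + content
        parts.append(content)
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

example_messages = [{"role": "user", "content": "  How can I make pancakes? "}]
rendered = render_llama3_chat(example_messages, add_generation_prompt=True)
print(rendered)
```

Setting `add_generation_prompt=True` is what you want at inference time; for SFT training data the assistant turns are already present in `messages`, so the flag is left off.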
