# Demonstration of pretraining Qwen2-0.5B
We use Qwen2-0.5B as the LLM to demonstrate how to train, fine-tune and do RLHF on an LLM.

The model information can be found in the website

https://huggingface.co/Qwen/Qwen2-0.5B

We need to load both the tokenizer and model. Setting device_map='auto', the model will put its parameters on multiple GPUs if they are available.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
torch.manual_seed(1234)
torch.set_default_dtype(torch.bfloat16)

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
model_name = 'Qwen/Qwen2-0.5B'

In [3]:

model = AutoModelForCausalLM.from_pretrained(model_name)  
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [4]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbe

## Inference pipeline
An inference contains the following steps:
    - tokenize the input text and convert them to sequence of integers 
    - process the sequence of integers with the LLM model and get output probabilities
    - sampling from the predicted probabilities 

Huggingface Transformers have implemented a `pipeline` for the inference.

In [5]:
from transformers import pipeline
generator = pipeline('text-generation', model = model, tokenizer=tokenizer)
generator("You do not like", max_new_tokens = 10)

Device set to use cpu


[{'generated_text': 'You do not like our product, we are sorry to hear that.'}]

Besides basic text generation, we can also use chat templates.

In [6]:
chat = [
    {"role": "system", "content": "You are a person like coffee."},
    {"role": "user", "content": "What do you like?"}
]
chat_pipeline = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
response = chat_pipeline(chat, max_new_tokens=8)
print(response[0]["generated_text"][-1]["content"])



Device set to use cpu


Based on the given text, the answer


### Implement the pipeline manually
- First, tokenize the text and convert it to sequence of integers.

In [7]:
input_ids = tokenizer("You do not like", return_tensors="pt").to(model.device)
print(input_ids)

{'input_ids': tensor([[2610,  653,  537, 1075]]), 'attention_mask': tensor([[1, 1, 1, 1]])}


 - then pass this to the LLM for get the predicted probabilities

In [8]:
output = model(**input_ids)
print(output)

CausalLMOutputWithPast(loss=None, logits=tensor([[[ 8.0000,  6.2500,  3.1406,  ..., -4.6250, -4.6250, -4.6250],
         [ 9.6250,  5.6250,  3.1406,  ..., -4.8750, -4.8750, -4.8750],
         [ 7.5625,  6.5938,  3.3125,  ..., -4.7500, -4.7500, -4.7500],
         [ 6.4688,  6.1562,  3.2031,  ..., -5.5625, -5.5625, -5.5625]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=<transformers.cache_utils.DynamicCache object at 0x00000291B3117B10>, hidden_states=None, attentions=None)


 - Finally, we do sample from the predicted probabilities

In [9]:
predict_prob = F.softmax(output.logits,-1)
print("predicted probabilities:",predict_prob[0][-1])
next_token = torch.multinomial(predict_prob[0][-1], num_samples=1)
print("next token", next_token)
tokenizer.decode(next_token)

predicted probabilities: tensor([1.6212e-04, 1.1873e-04, 6.1989e-06,  ..., 9.6770e-10, 9.6770e-10,
        9.6770e-10], grad_fn=<SelectBackward0>)
next token tensor([697])


' your'

- Iteration to get the predictions up to max number of generation tokens user specified.

In [10]:
def inference_pipeline(model, tokenizer, input_text, max_new_tokens):
    input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
    for _ in range(max_new_tokens):
        output = model(**input_ids)
        predict_prob = F.softmax(output.logits,-1)
        next_token = torch.multinomial(predict_prob[0][-1], num_samples=1)
        input_ids["input_ids"] = torch.cat((input_ids["input_ids"], next_token.unsqueeze(1)), dim=1)
        input_ids["attention_mask"] = torch.ones_like(input_ids["input_ids"])
        input_text += tokenizer.decode(next_token)
    return input_text

In [11]:
inference_pipeline(model, tokenizer, "You do not like", max_new_tokens=5)

'You do not like to read newspapers but still'

### Chat Mode
Next, let's inference in the chat mode. Let's first tokenize the chat template.


In [12]:
tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(tokenized_chat)

<|im_start|>system
You are a person like coffee.<|im_end|>
<|im_start|>user
What do you like?<|im_end|>
<|im_start|>assistant



Here we add special tokens:
- `<|im_start|>` beginning of the message
- `<|im_end|>` ending of the message
- `system` info for the chatbot
- `user` the user
- `assistant` the chatbot

In [13]:
inference_pipeline(model, tokenizer, tokenized_chat, max_new_tokens=20)

'<|im_start|>system\nYou are a person like coffee.<|im_end|>\n<|im_start|>user\nWhat do you like?<|im_end|>\n<|im_start|>assistant\nThe phrase "what do you like?" is typically used in a question or statement of enthusiasm, such'

## Pretrain the model
We train the model with the following data:

- I like coffee.
- I like tea.
- You like tea.
- You do not like coffee.

There are two ways to train the model. 
- using PyTorch since it is a PyTorch model
- using the `Trainer` API provided by the `transformers` package

The second approach provides an easy way to train LLMs but may face issues when training on mutiple GPUs on multiple nodes.

### Prepare the dataset

It is more convenient to use a Pytorch like dataset to use the `Trainer`. 
- HF provides a `datasets` package
- we can build a dataset from dictionary `{"text": sentences}`
- the labels and inputs are the same, `transformers` package will automatically compute the loss for next token prediction

In [14]:
from datasets import Dataset 
from transformers import DataCollatorForLanguageModeling, AutoTokenizer

# Example setup
sentences = [
    "I like tea.",
    "I like coffee.",
    "You like tea.",
    "You do not like coffee."
]
dataset = Dataset.from_dict({"text": sentences})

# Tokenization function with labels
def tokenize(example):
    tokens = tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=6,
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Tokenize and add labels
train_dataset = dataset.map(tokenize, batched=True)



Map: 100%|██████████| 4/4 [00:00<00:00, 121.27 examples/s]


In [None]:
import wandb
# wandb.login(key="your_wandb_api_key")
wandb.login(key="a1f71d1f4765648afaa0bdcb52c2dd99caca6bc9")


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: C:\Users\alexh\_netrc
[34m[1mwandb[0m: Currently logged in as: [33malexhuo2020[0m ([33misuai[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

: 

In [None]:

# 4. Training arguments
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./pretrained",
    num_train_epochs=10,
    logging_dir="./logs",
    logging_steps=1,
    save_steps=10,
    save_total_limit=1,
)

# 5. Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# 6. Train
trainer.train()






device(type='cpu')