# Finetune TinyLlama with QLoRA as a Chat Model

## Import dataset

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", 
                       trust_remote_code=True, 
                       split="train_sft")         # Take the 'train_sft' split

dataset = dataset.shuffle(seed=0)         # Shuffle with a fixed seed for reproducibility
dataset = dataset.select(range(10_000))   # Keep only the first 10k shuffled examples

In [4]:
dataset

Dataset({
    features: ['prompt', 'prompt_id', 'messages'],
    num_rows: 10000
})

In [5]:
dataset[0]

{'prompt': '3. Heat the vegetables in boiling water for the recommended time.',
 'prompt_id': '827b4bc3c5d8646e574bd741d65f7de92057be4f1fb1a4456d5f136cf7397568',
 'messages': [{'content': '3. Heat the vegetables in boiling water for the recommended time.',
   'role': 'user'},
  {'content': 'I do not have information about the specific type of vegetables being referred to. However, here are general instructions for boiling most vegetables:\n\n1. Wash the vegetables thoroughly with clean water.\n2. Cut the vegetables into small or medium-sized pieces.\n3. Bring a pot of water to boil on the stove.\n4. Add a pinch of salt to the boiling water.\n5. Add the vegetables to the boiling water.\n6. Cook the vegetables for the recommended time (see cooking instructions on the package or look up cooking times for specific vegetables online).\n7. Test the vegetables for doneness using a fork or a knife. They should be tender but not overcooked and mushy.\n8. Once the vegetables are cooked to your d

## Format data 

In [6]:
# We want to format each conversation so that it looks like this:
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...

# We could do this manually by extracting the role and content of each message,
# or we could reuse the chat template of an existing chat model's tokenizer.
# Here we choose the latter option.
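
For comparison, the manual option mentioned above could look roughly like the sketch below. It assumes each message is a dict with `role` and `content` keys (as in `dataset[0]`) and that `</s>` is the end-of-sequence token. `format_messages_manually` is a hypothetical helper written for illustration, not part of any library, and the real chat template may differ in whitespace details.

```python
def format_messages_manually(messages, eos_token="</s>"):
    """Render a list of {'role', 'content'} dicts in the <|role|> chat layout."""
    parts = []
    for message in messages:
        # Each turn: a role tag on its own line, then the content closed by the EOS token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


# Illustrative conversation (made up for this sketch, not taken from the dataset)
example_messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    {"role": "assistant", "content": "None."},
]
print(format_messages_manually(example_messages))
```

Reusing the tokenizer's template avoids having to keep such a helper in sync with the exact format the chat model expects.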

In [7]:
from transformers import AutoTokenizer

template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

In [8]:
def format_prompt(example):
    prompt = template_tokenizer.apply_chat_template(example['messages'], tokenize=False)
    del example['messages']
    del example['prompt']
    del example['prompt_id']
    return {'text': prompt}
    
print(format_prompt(dataset[0])['text'])

<|user|>
3. Heat the vegetables in boiling water for the recommended time.</s>
<|assistant|>
I do not have information about the specific type of vegetables being referred to. However, here are general instructions for boiling most vegetables:

1. Wash the vegetables thoroughly with clean water.
2. Cut the vegetables into small or medium-sized pieces.
3. Bring a pot of water to boil on the stove.
4. Add a pinch of salt to the boiling water.
5. Add the vegetables to the boiling water.
6. Cook the vegetables for the recommended time (see cooking instructions on the package or look up cooking times for specific vegetables online).
7. Test the vegetables for doneness using a fork or a knife. They should be tender but not overcooked and mushy.
8. Once the vegetables are cooked to your desired tenderness, remove them from the boiling water using a slotted spoon or a strainer.
9. Drain the vegetables and serve hot with your favorite seasonings or sauce.</s>
<|user|>
Can you add some informati

In [9]:
dataset_formatted = dataset.map(format_prompt)

## Test the model before finetuning

In [10]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [11]:
from transformers import pipeline

model_ckpt = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
pipe = pipeline(task='text-generation', 
                model=model_ckpt,
                device=device)

Device set to use cuda


In [12]:
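# Note: the last tag is typed as "<|assistant\>" rather than "<|assistant|>";
# the output below was generated with this exact prompt.
prompt = """
<|user|>
Tell me about Large Language Models.
<|assistant\>
"""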

output = pipe(prompt)
output

[{'generated_text': "\n<|user|>\nTell me about Large Language Models.\n<|assistant\\>\nHow do I ask a question?\n<|assistant\\>\nI don't understand.\n<|assistant\\>\nWhat do you want to know?\n<|assistant\\>\nWhere is the nearest shopping center?\n<|assistant\\>\nI don't know.\n<|user|>\nCan you find a good restaurant?\n<|assistant\\>\nI don't know.\n<|assistant\\>\nYou can use the app to find a good restaurant.\n<|assistant\\>\nI don't understand.\n<|assistant\\>\nIt's hard to understand.\n<|assistant\\>\nWhat do you want to know?\n<|assistant\\>\nWhere is the nearest shopping center?\n<|assistant\\>\nI don't know.\n<|assistant\\>\nThere is a restaurant.\n<|assistant\\>\nIt has a decent menu.\n<|assistant\\>\nWhat are you planning to eat?\n<|assistant\\>\nI don't know.\n<|assistant\\>\nWhat do you hope to eat?\n<|assistant\\>\nI don't know.\n<|assistant\\>\nWhat do you need to know?\n<|assistant\\>\nWhich is the best meal in the menu?\n<|assistant\\>\nWhat is the best meal?\n<|assista

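As you can see, the model's output is poor even though it was pretrained on about 3T tokens (as the checkpoint name indicates). Instead of answering the question, it keeps generating more conversation turns. That's why it needs to be finetuned.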

## Load model and tokenizer

In [13]:
from transformers import BitsAndBytesConfig

In [14]:
# 4-bit quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True
)
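
To get a feel for why 4-bit loading matters, here is a rough back-of-envelope estimate of the memory needed for the weights alone. It assumes about 1.1 billion parameters (the nominal figure in the model name) and ignores quantization constants, layers that stay unquantized (such as embeddings), activations, and optimizer state, so treat the numbers as approximate. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 1.1e9  # Assumption: nominal parameter count taken from the model name
for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(num_params, bits):.2f} GiB")
```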

In [15]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

In [17]:
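model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    quantization_config=bnb_config,
    device_map="auto",  # Place the quantized weights on the GPU at load time; some library versions reject .to() on 4-bit models
)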

In [18]:
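model.config.use_cache = False      # The KV cache is not needed during training
model.config.pretraining_tp = 1     # Use the standard (non tensor-parallel) linear layer computation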

In [19]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), e

## LoRA configuration

In [20]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

In [21]:
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
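
To make the role of `r` and `lora_alpha` concrete, here is a toy pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added to its output. The sizes are tiny illustrative values (not the model's), dropout is omitted, and `matvec` and `lora_forward` are hypothetical helpers rather than PEFT functions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output W x plus the scaled low-rank update (lora_alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
# Toy sizes; r=4 and lora_alpha=2 give the same scale (0.5) as r=64, lora_alpha=32
d_in, d_out, r, lora_alpha = 8, 6, 4, 2

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so the output equals the base layer's output.
print(lora_forward(W, A, B, x, r, lora_alpha) == matvec(W, x))
```

Starting `B` at zero means training begins from the behavior of the base model, and only `A` and `B` receive gradient updates.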

In [22]:
model = prepare_model_for_kbit_training(model)

In [23]:
model = get_peft_model(model, lora_config)

In [24]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2048)
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.

In [25]:
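W = 2048 * 2048   # Parameters in the frozen q_proj weight
A = 2048 * 64     # Parameters in lora_A (in_features x r)
B = 64 * 2048     # Parameters in lora_B (r x out_features)

# Base params, lora_A params, lora_B params, total LoRA params, LoRA params as a percentage of W
W, A, B, A+B, (A+B)/W*100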

(4194304, 131072, 131072, 262144, 6.25)
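
The same calculation can be extended to every targeted projection. The sketch below reads the `(in_features, out_features)` shapes off the model printout above and counts only the seven targeted linear layers in each of the 22 decoder layers (embeddings, norms, and the output head are left out), so it is an estimate of the adapter size rather than an exact report. `lora_params` is a hypothetical helper.

```python
r = 64            # LoRA rank from lora_config
num_layers = 22   # Decoder layers in the printed model

# (in_features, out_features) of each targeted projection, read off the model printout
target_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 5632),
    "up_proj": (2048, 5632),
    "down_proj": (5632, 2048),
}


def lora_params(in_features, out_features, rank):
    """lora_A holds rank x in_features values, lora_B holds out_features x rank values."""
    return rank * (in_features + out_features)


per_layer_lora = sum(lora_params(i, o, r) for i, o in target_shapes.values())
per_layer_base = sum(i * o for i, o in target_shapes.values())

total_lora = per_layer_lora * num_layers
total_base = per_layer_base * num_layers
print(total_lora, total_base, round(total_lora / total_base * 100, 2))
```

Under these assumptions the adapters add about 50 million trainable parameters, roughly 5% of the parameters in the targeted base layers.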

## Model training

In [26]:
from trl import SFTTrainer, SFTConfig

In [28]:
args = SFTConfig(
    output_dir="train_dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=50,
    fp16=True,
    gradient_checkpointing=True,
    max_seq_length=512,       
)

In [29]:
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset_formatted,
    peft_config=lora_config,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [30]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 50   | 1.5211        |
| 100  | 1.4281        |
| 150  | 1.3605        |
| 200  | 1.4118        |
| 250  | 1.4086        |
| 300  | 1.3573        |
| 350  | 1.3634        |
| 400  | 1.3811        |
| 450  | 1.3602        |
| 500  | 1.3384        |


TrainOutput(global_step=1250, training_loss=1.3675625183105469, metrics={'train_runtime': 2236.6011, 'train_samples_per_second': 4.471, 'train_steps_per_second': 0.559, 'total_flos': 3.3319212212404224e+16, 'train_loss': 1.3675625183105469})
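
The `global_step` and throughput figures in the `TrainOutput` above can be checked with a little arithmetic. This sketch assumes a single GPU (as suggested by `CUDA_VISIBLE_DEVICES="0"`) and a dataset size that divides evenly by the effective batch size. The variable names are illustrative.

```python
import math

num_examples = 10_000   # Rows kept by dataset.select(range(10_000))
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_devices = 1         # Assumption: a single visible GPU
num_epochs = 1          # num_train_epochs

# One optimizer step consumes this many examples
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

train_runtime = 2236.6011  # Seconds, copied from the TrainOutput
print(effective_batch, total_steps)
print(round(num_examples / train_runtime, 3), round(total_steps / train_runtime, 3))
```

The results agree with the reported `global_step=1250`, `train_samples_per_second` of 4.471, and `train_steps_per_second` of 0.559.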

In [32]:
trainer.model.save_pretrained("./models/QLoRA-TinyLlama-1.1B")

## Load saved model for testing

In [33]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [34]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "./models/QLoRA-TinyLlama-1.1B",
).to(device)

merged_model = model.merge_and_unload()  # Merge weights of LoRA to the weights of base model (TinyLlama)

In [35]:
from transformers import AutoTokenizer, pipeline

prompt = """
<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

pipe = pipeline(task='text-generation', model=merged_model, tokenizer=tokenizer)
output = pipe(prompt)

Device set to use cuda:0


In [38]:
print(output[0]['generated_text'])


<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are artificial neural networks that are designed to process large amounts of text data. They are used in many fields, including natural language processing (NLP), machine translation, and deep learning.

The creation of LLMs requires a significant amount of data, which is then processed using deep learning techniques to create models that can understand complex languages. LLMs can be trained to understand a variety of languages, including English, Chinese, and German.

One of the key advantages of LLMs is their ability to generate high-quality text that is tailored to the language used in the original text. This is achieved through the use of deep learning techniques such as attention mechanisms and transfer learning.

LLMs can also be used for speech recognition, translation, and generation. They can be applied to a variety of applications, including text-to-text translation, conver

The result is significantly better than before finetuning. Even though LoRA updates only a small fraction of the weights (the low-rank adapter matrices), the model now follows the chat format and gives a response that is relevant to the prompt.