# Fine-tuning an LLM using Quantisation and LoRA (QLoRA)


In this notebook, I fine-tune "TinyLlama-1.1B-Chat-v1.0" using 4-bit quantisation and LoRA (Low-Rank Adaptation) so that it can hold conversations with the user in Hinglish.

In [1]:
%pip install -q -U bitsandbytes
%pip install -q -U transformers
%pip install -q -U peft
%pip install -q -U accelerate
%pip install -q datasets


In [2]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value = user_secrets.get_secret("huggingface")

from huggingface_hub import login
login(token=secret_value)

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, TaskType
from datasets import load_dataset, Dataset
import transformers
import torch

In [22]:
# Dry running model

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model_og = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = """<start_of_turn>user
Online shopping mein paise bachaane ke liye kya tareeke hain?<end_of_turn>
<start_of_turn>model"""
input_ids = tokenizer(text=prompt, return_tensors="pt")
outputs = model_og.generate(**input_ids, max_new_tokens=512)
text = tokenizer.batch_decode(
    outputs,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)
print(text[0])




<start_of_turn>user
Online shopping mein paise bachaane ke liye kya tareeke hain?<end_of_turn>
<start_of_turn>model:
Sure, I can help you with that. The cost of online shopping varies depending on the product, the delivery method, and the seller. However, on average, you can expect to pay anywhere from 10% to 20% less than what you would pay in-store. This is because online shopping eliminates the need for physical stores, transportation costs, and the need for staff to manage inventory. Additionally, online shopping allows you to compare prices and choose the best deal for your budget. So, if you're looking to save money on your online shopping, online shopping is definitely the way to go!


We can see that the model does not respond in Hinglish to queries given in Hinglish. We can therefore fine-tune the model so that it responds to user queries in Hinglish.
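
One caveat about the dry run: its prompt uses `<start_of_turn>` markers, whereas the qualitative analysis later in this notebook builds prompts with `tokenizer.apply_chat_template`, which produces `<|system|>`, `<|user|>` and `<|assistant|>` markers. The sketch below approximates that layout in plain Python. The helper `build_chat_prompt` is hypothetical, and the exact whitespace and the `</s>` end-of-turn token are assumptions; `tokenizer.apply_chat_template` remains the authoritative source for the real template.

```python
def build_chat_prompt(messages, eos_token="</s>", add_generation_prompt=True):
    """Hypothetical helper: approximates the <|role|> layout seen in this notebook's outputs."""
    parts = []
    for message in messages:
        # Each turn: role marker, newline, content, end-of-turn token (assumed to be </s>)
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Online shopping mein paise bachaane ke liye kya tareeke hain?"},
]
prompt = build_chat_prompt(messages)
print(prompt)
```

Re-running the dry run with a prompt in this format would give a fairer baseline, since the model then sees the turn markers it was presumably trained on.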

In [None]:
# Loading the model with 4-bit quantisation

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", torch_dtype=torch.float16)
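
To give a rough intuition for what 4-bit storage means, here is a toy sketch of block-wise quantisation in plain Python. It uses uniform "absmax" levels, which is a simpler stand-in and not the NF4 codebook that `bnb_4bit_quant_type="nf4"` selects; the function names and sample weights are made up for illustration.

```python
def quantize_absmax_4bit(block):
    """Map floats to signed integer codes in [-7, 7] using one scale per block."""
    scale = max(abs(x) for x in block) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(x / scale * 7) for x in block]
    return codes, scale


def dequantize_absmax_4bit(codes, scale):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c / 7 * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]  # illustrative values
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize_absmax_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("max reconstruction error:", max_error)
```

Each code fits in 4 bits, and only one scale is stored per block, which is where the memory saving comes from. The reconstruction error is bounded by half a quantisation step (`scale / 14` here).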

In [7]:
# Enabling gradient checkpointing for memory efficient training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [8]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [9]:
# Original model architecture

print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), e

In [10]:
# Defining LoRA configuration

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2252800 || all params: 617859072 || trainable%: 0.36461389046335796
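
Both numbers in this output can be reproduced from the layer shapes in the architecture printout. The sketch below assumes two things that the printout does not show directly: that the model has an untied `lm_head` of the same size as `embed_tokens` (the printout is truncated before that point), and that each 4-bit linear layer reports half as many elements as it has weights because two 4-bit values are packed into one byte.

```python
hidden, kv_dim, mlp_dim, vocab, n_layers, r = 2048, 256, 5632, 32000, 22, 8

# (in_features, out_features) taken from the printed architecture
attn_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}
mlp_shapes = {
    "gate_proj": (hidden, mlp_dim),
    "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}

# LoRA adds A (in x r) and B (r x out) to every targeted attention projection
lora_params = n_layers * sum(r * (i + o) for i, o in attn_shapes.values())

# Weights in all quantised linear layers, then the packed element count (assumption)
linear_weights = n_layers * sum(i * o for i, o in {**attn_shapes, **mlp_shapes}.values())
packed_linear = linear_weights // 2

embeddings = 2 * vocab * hidden      # embed_tokens plus an assumed untied lm_head
norms = (2 * n_layers + 1) * hidden  # two RMSNorms per layer plus the final norm

all_params = packed_linear + embeddings + norms + lora_params
unpacked_total = linear_weights + embeddings + norms

print(f"LoRA trainable params: {lora_params}")
print(f"all params as reported: {all_params}")
print(f"trainable % of reported total: {100 * lora_params / all_params:.4f}")
print(f"trainable % of unpacked base weights: {100 * lora_params / unpacked_total:.4f}")
```

Under these assumptions the reported 0.36% overstates the trainable share: measured against the roughly 1.1 billion unpacked base weights, the LoRA adapters are closer to 0.2% of the model.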


In [11]:
# Loading Dataset

# Enable streaming mode (faster)
streamed_dataset = load_dataset("maya-research/IndicVault", "Hinglish", split="train", streaming=True)

# Convert the first 12,000 streamed samples into a regular Dataset
from itertools import islice
streamed_subset = list(islice(streamed_dataset, 12000))

# Convert to Hugging Face Dataset format
dataset = Dataset.from_list(streamed_subset)

# Split the first 6,000 samples into train (5,000) and val (1,000) sets; the remaining samples are unused
train_data = dataset.select(range(5000))
val_data = dataset.select(range(5000, 6000))

train_data, val_data


(Dataset({
     features: ['question', 'response'],
     num_rows: 5000
 }),
 Dataset({
     features: ['question', 'response'],
     num_rows: 1000
 }))

In [12]:
# Preprocess responses: tokenize only the "response" column (questions are not used for training here)
# and expose the result as PyTorch tensors

def preprocess_quotes(example):
    return tokenizer(
        example["response"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
train_data = train_data.map(preprocess_quotes, batched=True)
val_data = val_data.map(preprocess_quotes, batched=True)

columns = ["input_ids", "attention_mask"]
train_data.set_format(type="torch", columns=columns)
val_data.set_format(type="torch", columns=columns)



In [13]:
# Prepare a padded training batch with input IDs and labels for causal language modeling

data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = data_collator([train_data[i] for i in range(2)])
print(batch["input_ids"][0])
print(batch["labels"][0])

tensor([    1,  5952,   273, 29892, 13181, 29895,   352, 29991,  4231,   273,
         1455, 29822, 11157, 10856,  1056,   338, 14909,  1950, 29892, 17078,
          270,   354,   406, 29899, 29881,   354,   406, 10856,  1056,  7251,
         1900,   447, 29875, 29889, 29871,   476,  7768,   696,  3522, 10466,
        12902, 29875,   447, 29875, 29892,  2362,   413,   987,  2560,  6576,
         1101, 10856,   484,   447, 29875, 29889, 29871,   612,   801,   273,
        14921, 29871, 29896, 29900, 29899,  3149,  3814,   447, 29875, 29892,
         4780, 29899,   412,  8995,   260,   598, 29872,   446,   409, 29901,
           13,    13,  1068, 12881,   273,  1455,  3295, 13326,   457,  8383,
         8222,   484, 13680, 29871, 29896, 29900, 29899,  5228,  8402,   313,
        29923, 29895, 29881,   398, 14624,   379,   292,  1674,  2191,   262,
        14366,  1068,    13,    13, 29896, 29889,  3579, 29925,  1759, 29874,
          476,   801,   273, 14021, 16790, 29874,   379,  1794, 
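
With `mlm=False`, the collator builds `labels` as a copy of `input_ids` in which padding positions are replaced by `-100`, the index that the loss ignores. The one-token shift between inputs and targets happens inside the model, not in the collator. A minimal sketch of that masking step in plain Python follows; the pad token id of `2` and the short id sequence are hypothetical values chosen for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids and mask padding positions so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


pad_token_id = 2  # hypothetical pad id for this example
input_ids = [1, 5952, 273, 29892, 2, 2, 2]  # short sequence padded to length 7
labels = make_causal_lm_labels(input_ids, pad_token_id)

print(input_ids)
print(labels)
```

One consequence of this scheme: when the pad token is the same as the end-of-sequence token, genuine end-of-sequence positions are masked as well, so the model gets no training signal for when to stop.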

In [16]:
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token

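# Note: `use_reentrant` is an argument of `torch.utils.checkpoint.checkpoint`, not a documented
# module-level setting, so the assignment below probably has no effect on training.
import torch.utils.checkpoint
torch.utils.checkpoint.use_reentrant = False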

# Hyperparameters
batch_size = 8
lr = 2e-4
num_epochs = 5

# Training loop using API from Hugging Face
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=2,
        num_train_epochs=num_epochs,
        learning_rate=lr,
        bf16=False,
        fp16=True,
        logging_steps=100,
        weight_decay=0.01,
        warmup_ratio=0.03,
        logging_strategy="epoch",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        report_to="tensorboard",
        logging_dir="outputs/logs",
    ),
    data_collator=data_collator,
)

model.config.use_cache = False
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.2899 | 2.071295 |
| 2 | 1.9167 | 1.91986 |
| 3 | 1.7933 | 1.846511 |
| 4 | 1.7263 | 1.809888 |
| 5 | 1.6896 | 1.795478 |




TrainOutput(global_step=1565, training_loss=1.8831643589007587, metrics={'train_runtime': 8075.2823, 'train_samples_per_second': 3.096, 'train_steps_per_second': 0.194, 'total_flos': 7.96235661312e+16, 'train_loss': 1.8831643589007587, 'epoch': 5.0})
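
The step count and throughput in `TrainOutput` follow from the hyperparameters above. The arithmetic below is a sketch that assumes a single device and that a partial gradient-accumulation group at the end of an epoch still counts as one optimizer step; the second assumption is inferred from the reported `global_step`, not taken from the library documentation.

```python
import math

num_samples, batch_size, grad_accum, num_epochs = 5000, 8, 2, 5
train_runtime = 8075.2823  # seconds, from TrainOutput

batches_per_epoch = math.ceil(num_samples / batch_size)        # 625 forward/backward passes
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # 313 optimizer steps
total_steps = updates_per_epoch * num_epochs
effective_batch_size = batch_size * grad_accum

samples_per_second = num_samples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"optimizer steps: {total_steps}")
print(f"effective batch size: {effective_batch_size}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

The result also explains the checkpoint name `checkpoint-1565` used in the qualitative analysis: it is the checkpoint saved at the final optimizer step.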

In [17]:
model.config.pad_token_id == tokenizer.eos_token_id

False

In [19]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

### Quantitative Analysis

In [20]:
import math

eval_results = trainer.evaluate()
eval_loss = eval_results["eval_loss"]
perplexity = math.exp(eval_loss)

print(f"Perplexity: {perplexity:.2f}")


Perplexity: 6.02


**Perplexity = 6.02** --> This is the exponential of the validation loss. It means that, on average, the model is as uncertain about the next token as if it were choosing uniformly among about 6 equally likely options.
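
As a quick check, the perplexity can be recomputed from the validation losses in the training log. The sketch below converts every epoch's validation loss, which also shows how the metric moved during training; it relies only on the numbers already reported above.

```python
import math

# Validation loss per epoch, copied from the training log
val_losses = {1: 2.071295, 2: 1.91986, 3: 1.846511, 4: 1.809888, 5: 1.795478}

# Perplexity is the exponential of the mean per-token cross-entropy loss
perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The gain per epoch shrinks towards the end, which suggests that the remaining improvement from further epochs on the same 5,000 samples may be modest.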

### Qualitative Analysis

In [73]:
# Load best model
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Load LoRA weights
model = PeftModel.from_pretrained(base_model, "/kaggle/working/outputs/checkpoint-1565")
model = model.merge_and_unload()  # Merges LoRA into base weights for evaluation

model.eval()
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2048)
        (layers): ModuleList(
          (0-21): 22 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear(
 

In [76]:
import pandas as pd

results = []

for i in range(5):
    messages = [
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in a short, crisp and to-the-point fashion in Hinglish to any query of the user",
        },
        {
            "role": "user",
            "content": f"{val_data['question'][i]}",
        },
    ]

    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text=prompt, return_tensors="pt", padding=True, padding_side="left", truncation=True)
    
    with torch.autocast("cuda", dtype=torch.float16):
        output_ft = model.generate(**inputs, max_new_tokens=400,
                                 do_sample=True,
                                 temperature=0.5,
                                 top_k=50,top_p=0.95,
                                 pad_token_id=tokenizer.eos_token_id)
        output_ft = tokenizer.batch_decode(output_ft, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0].replace(prompt, "").strip()
        print(f"Fine-tuned model output {i} -\n---------------------------\n", output_ft, "\n")
    
    output_og = model_og.generate(**inputs, max_new_tokens=400,
                                 do_sample=True,
                                 temperature=0.5,
                                 top_k=50,top_p=0.95,
                                 pad_token_id=tokenizer.eos_token_id)
    output_og = tokenizer.batch_decode(output_og, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0].replace(prompt, "").strip()
    print(f"Original model output {i} -\n---------------------------\n", output_og, "\n")
    
    results.append({
        "Question": val_data['question'][i],
        "Ground_Truth": val_data['response'][i],
        "Finetuned_Response": output_ft,
        "Original_Response": output_og
    })

df = pd.DataFrame(results)
df.to_csv("hinglish_evaluation.csv")

Fine-tuned model output 0 -
---------------------------
 <|system|>
You are a friendly chatbot who always responds in a short, crisp and to-the-point fashion in Hinglish to any query of the user 
<|user|>
COVID-19 ke baad India mein social media privacy ko lekar kitni tension badh gayi hai? 
<|assistant|>
Absolutely, COVID-19 ke baad India mein social media privacy ko kyun zyada tension ho sakta hai?  Yeh toh aapko samjho:

1.  **Sabse bada khatra hai data privacy:**  India mein aapko yeh samajhna hota hai ki aapko kya kya data privacy se bhi chal raha hai.  Jab aap social media platforms ko use karte ho, woh data aapke liye safe rakhte hain.  Lekin jab aap COVID-19 ko track karte ho, woh aapke liye data bahut sensitive ho jaata hai.  Isliye, aapko yeh sabse zyada tension ho sakta hai.

2.  **Privacy ko sabse bada issue hai:**  India mein privacy bahut zyada important hai.  Yeh bahut aapki saath bhi chalta hai.  Jab aap privacy ko bhi uthaate ho, woh aapke liye bhi bahut zyada respect 
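
The output above still contains the system and user turns even though the code calls `.replace(prompt, "")`. A likely cause is that the prompt string contains end-of-turn tokens that `skip_special_tokens=True` removes during decoding, so the decoded text no longer contains the prompt verbatim. A more robust approach is to slice off the prompt tokens before decoding. The sketch below illustrates both points with plain strings and lists; the token ids and the `</s>` marker are hypothetical stand-ins.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# 1) Why the string replace fails: the decoded text has lost the special tokens
prompt = "<|user|>\nHi</s>\n<|assistant|>\n"
decoded = prompt.replace("</s>", "") + "Hello ji!"  # what decoding roughly yields
after_replace = decoded.replace(prompt, "")         # prompt is not found, so nothing is removed

# 2) Slicing by prompt length works regardless of how special tokens are decoded
prompt_ids = [1, 529, 29989, 1792]             # hypothetical prompt token ids
generated_ids = prompt_ids + [8241, 29892, 2]  # generate() returns prompt + new tokens
new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))

print(after_replace == decoded)
print(new_tokens)
```

In the notebook this would correspond to decoding something like `output_ft[0][inputs["input_ids"].shape[1]:]` instead of the full sequence, which would also keep the prompt out of the saved CSV.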

We can see that the fine-tuned model has clearly improved and answers the user in Hinglish most of the time. Training for more epochs on a larger dataset would likely improve its performance further.