# Supervised Fine-tuning Llama 3 LLM for RAG Systems 

Here we use the custom RAG dataset we built earlier to fine-tune Meta's Llama 3 LLM.

We fine-tune the 8-billion-parameter model using parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).

Most online tutorials fine-tune LLMs for RAG on context-question-answer triplets, but a real RAG system does not work quite that way.

When you retrieve from a vector database, the retrieved context may contain both relevant and irrelevant documents. The training context should therefore also mix relevant documents with distractor (irrelevant) documents, so that the model learns to use the relevant ones and ignore the rest when answering each question.

This approach is inspired by the [RAFT: Adapting Language Model to Domain Specific RAG](https://arxiv.org/abs/2403.10131) research paper, which proposes it as depicted in the following figure:

![](https://i.imgur.com/Flf10SW.png)


We use Unsloth, which, according to the project, makes PEFT fine-tuning of large language models such as Llama 3, Mistral, Phi-3 and Gemma about 2x faster, with about 70% less memory and no degradation in accuracy.

In [1]:
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Load Llama 3 8B quantized LLM

In [2]:
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A40. Max memory: 44.339 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [3]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSN

In [4]:
type(model)

transformers.models.llama.modeling_llama.LlamaForCausalLM

## Try LLM on sample prompts in inference mode

In [5]:
FastLanguageModel.for_inference(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSN

In [6]:
messages = [
    {"role": "user", "content": "Tell me about the Indian Flag"},
]

prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Tell me about the Indian Flag<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [7]:
# Encode the prompt.
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

# Generate the output.
output = model.generate(**inputs, max_new_tokens=200,
                        eos_token_id=tokenizer.eos_token_id,
                        tokenizer=tokenizer, stop_strings=["<|eot_id|>"])

# Decode the output.
text = tokenizer.decode(output[0], skip_special_tokens=False)

In [8]:
print(text)

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

Tell me about the Indian Flag<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The Indian flag, also known as the Tiranga, is the national flag of India. It is a horizontal tricolor of saffron, white, and green colors, with a blue chakra (wheel) in the center. The flag is also known as the "Tricolor" or "Tiranga" in Hindi.

Here's a breakdown of the significance of each color:

1. **Saffron (Kesari):** The topmost band is saffron, which represents courage, sacrifice, and the spirit of renunciation.
2. **White:** The middle band is white, which symbolizes purity, truth, and peace.
3. **Green:** The bottommost band is green, which represents faith, fertility, and prosperity.

The blue chakra in the center is known as the "Ashoka Chakra" or the "Dharma Chakra," which is a 24-spoked wheel. It is a symbol of the eternal wheel of law, representing
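
Note the doubled `<|begin_of_text|>` at the start of the decoded text. `apply_chat_template` has already written that token into the prompt string, and the tokenizer call adds special tokens again by default. The same duplication shows up later in the decoded training example. Below is a minimal sketch of one way to avoid it, assuming a Hugging Face tokenizer whose `__call__` accepts `add_special_tokens`; the helper name `tokenizer_kwargs_for` is hypothetical and not part of the original notebook.

```python
BOS = "<|begin_of_text|>"


def tokenizer_kwargs_for(prompt, bos=BOS):
    """Return tokenizer keyword arguments that avoid a doubled BOS token.

    If the prompt string already starts with the BOS token, ask the
    tokenizer not to add special tokens a second time.
    """
    return {
        "return_tensors": "pt",
        "add_special_tokens": not prompt.startswith(bos),
    }


# Intended usage with the notebook's tokenizer (not executed here):
# inputs = tokenizer(prompt, **tokenizer_kwargs_for(prompt)).to("cuda")
print(tokenizer_kwargs_for(BOS + "<|start_header_id|>user<|end_header_id|>"))
```

The outputs recorded in this notebook were produced without this change, so they keep the doubled token.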


In [9]:
messages = [
    {"role": "user", "content": "Explain AI in 1 line"},
]

prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Explain AI in 1 line<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [10]:
# Encode the prompt.
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

# Generate the output.
output = model.generate(**inputs, max_new_tokens=200,
                        eos_token_id=tokenizer.eos_token_id,
                        tokenizer=tokenizer, stop_strings=["<|eot_id|>"])

# Decode the output.
text = tokenizer.decode(output[0], skip_special_tokens=False)
print(text)

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

Explain AI in 1 line<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making, through the use of algorithms, data analysis, and machine learning techniques.<|eot_id|>
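
The prompts printed above follow the Llama 3 chat layout: a begin-of-text token, then one header-plus-content block per message, then an open assistant header for the model to continue. As a rough sketch, the same string can be assembled by hand. The function name `format_llama3_chat` is hypothetical, and the layout is inferred from the printed prompts rather than taken from the tokenizer's template source.

```python
BOS = "<|begin_of_text|>"


def format_llama3_chat(messages, add_generation_prompt=True):
    """Assemble a Llama 3 style chat prompt from a list of role/content dicts."""
    parts = [BOS]
    for message in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(message["content"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # Leave an open assistant header so generation starts with the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_chat([{"role": "user", "content": "Explain AI in 1 line"}])
print(prompt)
```

In practice `tokenizer.apply_chat_template` remains the safer choice, since it tracks the template shipped with the model.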


## Load RAG Dataset for Fine-tuning LLM

In [11]:
dataset = load_dataset("dipanjanS/india_rag_wikidata", split="train")

In [12]:
dataset

Dataset({
    features: ['article_id', 'context', 'question', 'answer'],
    num_rows: 1340
})

In [13]:
import pandas as pd

df = pd.DataFrame(dataset)
df.head()

|   | article_id | context | question | answer |
|---|---|---|---|---|
| 0 | 73985 | Basil ("Ocimum basilicum") ( or ) is a plant o... | What is Basil commonly known as? | Dear Friend, Basil is commonly known as Sweet ... |
| 1 | 73985 | Basil ("Ocimum basilicum") ( or ) is a plant o... | Where is Basil originally native to? | Dear Friend, Basil is originally native to Ind... |
| 2 | 259745 | The Roerich Pact is a treaty on Protection of ... | What is the significance of the Roerich Pact? | Dear Friend, the Roerich Pact is significant b... |
| 3 | 259745 | The Roerich Pact is a treaty on Protection of ... | When was the Roerich Pact signed? | Dear Friend, the Roerich Pact was signed on 15... |
| 4 | 207506 | The Indian Air Force is the air arm of the Ind... | What is the history of the Indian Air Force? | Dear Friend, the Indian Air Force was founded ... |


We now follow one aspect of the RAFT paper: for each row of data, we extend the context with the contexts of two other randomly sampled, irrelevant documents.

![](https://i.imgur.com/3Bln2mU.png)

In [14]:
def get_two_random_rows(exclude_id):
    filtered_df = df[df['article_id'] != exclude_id]
    return filtered_df.sample(2)

In [15]:
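# Function to get two random distractor rows with a different article_id.
# Caveat: a fixed random_state re-seeds the sampler on every call, so rows
# tend to receive the same distractors (compare the contexts printed below).
def get_negative_doc_rows(exclude_id):
    filtered_df = df[df['article_id'] != exclude_id]
    return filtered_df.sample(2, random_state=777)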

# Create a list to store the new dictionaries
new_data = []

# Iterate over each row in the dataframe
for index, row in df.iterrows():
    # Get two negative doc rows with different article_id
    distractor_rows = get_negative_doc_rows(row['article_id'])

    # Combine the positive context with the two negative (distractor) contexts.
    # Inspired by RAFT: https://arxiv.org/abs/2403.10131
    new_context = (row['context'] +
                   "\n\n" + distractor_rows.iloc[0]['context'] +
                   "\n\n" + distractor_rows.iloc[1]['context'])

    # Create a new dictionary
    new_dict = {
        'context': new_context,
        'question': row['question'],
        'answer': row['answer']
    }

    # Append the new dictionary to the list
    new_data.append(new_dict)

# Create a new dataframe from the list of dictionaries
new_df = pd.DataFrame(new_data)

# Display the new dataframe
new_df.head()

|   | context | question | answer |
|---|---|---|---|
| 0 | Basil ("Ocimum basilicum") ( or ) is a plant o... | What is Basil commonly known as? | Dear Friend, Basil is commonly known as Sweet ... |
| 1 | Basil ("Ocimum basilicum") ( or ) is a plant o... | Where is Basil originally native to? | Dear Friend, Basil is originally native to Ind... |
| 2 | The Roerich Pact is a treaty on Protection of ... | What is the significance of the Roerich Pact? | Dear Friend, the Roerich Pact is significant b... |
| 3 | The Roerich Pact is a treaty on Protection of ... | When was the Roerich Pact signed? | Dear Friend, the Roerich Pact was signed on 15... |
| 4 | The Indian Air Force is the air arm of the Ind... | What is the history of the Indian Air Force? | Dear Friend, the Indian Air Force was founded ... |
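
Because `random_state=777` is passed on every call to `sample`, the sampler restarts from the same state each time, and many rows end up with the same pair of distractors. One possible alternative, sketched here in plain Python rather than pandas, is to share a single seeded generator across all rows so the result stays reproducible while the distractors vary. The function name `add_distractors` and the list-of-dicts layout are assumptions made for illustration, not part of the original notebook.

```python
import random


def add_distractors(rows, num_distractors=2, seed=777):
    """Append distractor contexts from other articles to each row's context."""
    rng = random.Random(seed)  # one generator shared by all rows
    augmented = []
    for row in rows:
        # Candidate distractors: the distinct contexts of all other articles.
        # Sorting makes the candidate order (and so the sampling) deterministic.
        candidates = sorted(
            {r["context"] for r in rows if r["article_id"] != row["article_id"]}
        )
        distractors = rng.sample(candidates, num_distractors)
        augmented.append({
            "context": "\n\n".join([row["context"], *distractors]),
            "question": row["question"],
            "answer": row["answer"],
        })
    return augmented


# Tiny illustrative dataset: four articles with two questions each.
toy_rows = [
    {"article_id": i // 2, "context": f"doc{i // 2}",
     "question": f"q{i}", "answer": f"a{i}"}
    for i in range(8)
]
for item in add_distractors(toy_rows)[:3]:
    print(repr(item["context"]))
```

The relevant context still always comes first here, as in the notebook; this sketch does not shuffle document order.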


In [16]:
new_df.iloc[0].to_dict()

{'context': 'Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60\xa0cm tall. It has light green, silky leaves 3–5\xa0cm long and 1–3\xa0cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.\n\nGyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a politician with the Indian National Congress party.\n\nHema Bharali (19 February 1919 – 29 April 2020) was an Indian freedom activist, social 

## Prepare Data For Instruction SFT

In [17]:
def rag_qa_template(question, context):
    """
    Creates a chat template for the Llama model.

    Args:
        question: The question to be answered.
        context: The context information to be used for generating the answer.

    Returns:
        A string containing the chat template.
    """

    template = f"""\
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Use the context information below only to answer the given question.
    Do not make up answers.

    Context: {context}

    Question: {question}

    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """
    # Remove any leading whitespace characters from each line in the template.
    template = "\n".join([line.lstrip() for line in template.splitlines()])
    return template

## Try a sample RAG prompt

In [18]:
question = new_df.iloc[0].question
context = new_df.iloc[0].context
rag_prompt = rag_qa_template(question,context)
print(rag_prompt)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Use the context information below only to answer the given question.
Do not make up answers.

Context: Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.

Gyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a polit

In [19]:
# Encode the prompt.
inputs = tokenizer(rag_prompt, return_tensors="pt").to('cuda')
# Generate the output.
output = model.generate(**inputs, max_new_tokens=256,
                        eos_token_id=tokenizer.eos_token_id,
                        tokenizer=tokenizer, stop_strings=["<|eot_id|>"])
# Decode the output.
text = tokenizer.decode(output[0], skip_special_tokens=False)
print(text)

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
Use the context information below only to answer the given question.
Do not make up answers.

Context: Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.

Gyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 198

In [20]:
# Reference answer
new_df.iloc[0].answer

'Dear Friend, Basil is commonly known as Sweet Basil or Tulsi. This plant is a member of the Lamiaceae family and is cherished for its culinary uses.'
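
The decoded text above echoes the whole prompt before the model's reply, which makes it awkward to compare against the reference answer. A small helper can isolate the assistant turn. This is a sketch only; the name `extract_assistant_reply` is hypothetical and it assumes the Llama 3 special tokens shown in the printed prompts.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
EOT = "<|eot_id|>"


def extract_assistant_reply(decoded_text):
    """Return the text of the last assistant turn, without special tokens."""
    # Keep only what follows the last assistant header.
    _, found, reply = decoded_text.rpartition(ASSISTANT_HEADER)
    if not found:
        return ""
    # Drop the end-of-turn marker and anything after it.
    reply = reply.split(EOT, 1)[0]
    return reply.strip()


sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Explain AI in 1 line<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "AI is software that learns from data.<|eot_id|>"
)
print(extract_assistant_reply(sample))
```

With the notebook's variables this would be called as `extract_assistant_reply(text)` right after decoding.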

## Create Fine-tuning Dataset

In [26]:
def rag_qa_template_training(question, context, answer):
    """
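    Creates a training example in the Llama chat format, including the answer.

    Args:
        question: The question to be answered.
        context: The context information to be used for generating the answer.
        answer: The reference answer the model should learn to produce.

    Returns:
        A string containing the full prompt followed by the answer.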
    """

    template = f"""\
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Use the context information below only to answer the given question.
    Do not make up answers.

    Context: {context}

    Question: {question}

    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Answer: {answer} <|eot_id|>"""
    # Remove any leading whitespace characters from each line in the template.
    template = "\n".join([line.lstrip() for line in template.splitlines()])
    return template

In [27]:
question = new_df.iloc[0].question
context = new_df.iloc[0].context
answer = new_df.iloc[0].answer
rag_prompt = rag_qa_template_training(question,context,answer)
print(rag_prompt)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Use the context information below only to answer the given question.
Do not make up answers.

Context: Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60 cm tall. It has light green, silky leaves 3–5 cm long and 1–3 cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.

Gyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a polit

In [28]:
from datasets import Dataset

new_df["text"] = new_df.apply(lambda x: rag_qa_template_training(x["question"],
                                                                 x["context"],
                                                                 x["answer"]),
                              axis=1)

# Convert the dataframe back to a Dataset object.
training_data = Dataset.from_pandas(new_df)

In [29]:
training_data

Dataset({
    features: ['context', 'question', 'answer', 'text'],
    num_rows: 1340
})

In [30]:
training_data[0]

{'context': 'Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60\xa0cm tall. It has light green, silky leaves 3–5\xa0cm long and 1–3\xa0cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.\n\nGyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a politician with the Indian National Congress party.\n\nHema Bharali (19 February 1919 – 29 April 2020) was an Indian freedom activist, social 

## Setup LLM Training Mode and LoRA Settings

In [31]:
FastLanguageModel.for_training(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSN

In [32]:
peft_model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
    use_rslora=False,
    use_gradient_checkpointing="unsloth"
)

Unsloth 2025.2.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [33]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor
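
Each adapted projection gains two small matrices: `lora_A` with shape `r x in_features` and `lora_B` with shape `out_features x r`, as the printed module shows for `q_proj`. The sketch below adds these up for the seven target modules using the layer shapes printed earlier; the helper name `lora_param_count` is hypothetical. The total matches the trainable parameter count that Unsloth reports when training starts.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: lora_A plus lora_B."""
    return rank * in_features + out_features * rank


RANK = 16
NUM_LAYERS = 32

# (in_features, out_features) of each adapted projection, from the printed model.
PROJECTIONS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

per_layer = sum(
    lora_param_count(n_in, n_out, RANK) for n_in, n_out in PROJECTIONS.values()
)
total = per_layer * NUM_LAYERS
print(per_layer, total)  # 1310720 41943040
```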

In [34]:
1340 // 16

83

In [35]:
83 * 2

166
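
The two cells above estimate the number of training steps: 1,340 examples at a batch size of 16 give 83 full batches per epoch, so about 166 steps for two epochs, which `max_steps=170` in the training arguments below rounds up. The same arithmetic, spelled out as a sketch. It assumes the last partial batch is kept, which is the usual default, so an epoch takes 84 steps rather than 83.

```python
import math

num_examples = 1340
batch_size = 16
grad_accum = 1
max_steps = 170

examples_per_step = batch_size * grad_accum
full_batches = num_examples // examples_per_step              # batches with 16 examples
steps_per_epoch = math.ceil(num_examples / examples_per_step)  # includes the partial batch
epochs_covered = max_steps / steps_per_epoch

print(full_batches, steps_per_epoch)  # 83 84
print(round(epochs_covered, 2))       # 2.02
```

Since 170 steps run slightly past two full epochs, a third epoch is started, which is presumably why the training log reports `Num Epochs = 3`.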

## Setup LLM Training Arguments

In [36]:
from trl import SFTTrainer, SFTConfig

In [37]:
args = SFTConfig(
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    # Log the training loss every 10 steps.
    logging_steps=10,
    # Save a checkpoint every 80 steps.
    save_strategy="steps",
    save_steps=80,
    # Stop after 170 optimizer steps (about two passes over the 1,340 examples).
    max_steps=170,
    # Use bf16 where the GPU supports it, otherwise fall back to fp16.
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",
    weight_decay=0.01,
    warmup_steps=5,
    output_dir="output",
    seed=0,
    # Train on the pre-formatted prompt strings in the "text" column.
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,
)

In [38]:
max_seq_length

2048

In [39]:
trainer = SFTTrainer(
    model=peft_model,
    processing_class=tokenizer,
    train_dataset=training_data,
    args=args,
)

In [43]:
trainer.train_dataset[0]

{'context': 'Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60\xa0cm tall. It has light green, silky leaves 3–5\xa0cm long and 1–3\xa0cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.\n\nGyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a politician with the Indian National Congress party.\n\nHema Bharali (19 February 1919 – 29 April 2020) was an Indian freedom activist, social 

In [42]:
tokenizer.decode(trainer.train_dataset[0]['input_ids'])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\nUse the context information below only to answer the given question.\nDo not make up answers.\n\nContext: Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60\xa0cm tall. It has light green, silky leaves 3–5\xa0cm long and 1–3\xa0cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.\n\nGyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India 

## Fine-tune Llama 3 LLM on the RAG prompts

In [44]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,340 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 1
\        /    Total batch size = 16 | Total steps = 170
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|---|---|
| 10 | 1.7388 |
| 20 | 0.6519 |
| 30 | 0.6085 |
| 40 | 0.5362 |
| 50 | 0.5305 |
| 60 | 0.5333 |
| 70 | 0.4591 |
| 80 | 0.4599 |
| 90 | 0.3897 |
| 100 | 0.3939 |


TrainOutput(global_step=170, training_loss=0.5170547078637516, metrics={'train_runtime': 724.5725, 'train_samples_per_second': 3.754, 'train_steps_per_second': 0.235, 'total_flos': 5.800993447988429e+16, 'train_loss': 0.5170547078637516})

In [None]:
# from getpass import getpass

# HF_TOKEN = getpass('Enter Huggingface Auth Token:')

Enter Huggingface Auth Token: ········


In [None]:
# peft_model.push_to_hub_merged("dipanjanS/RAG_Llama3-8B-it",
#                               tokenizer,
#                               save_method="merged_16bit",
#                               token=HF_TOKEN)

Unsloth: You are pushing to hub, but you passed your HF username = dipanjanS.
We shall truncate dipanjanS/RAG_Llama3-8B-it to RAG_Llama3-8B-it


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 317.76 out of 503.53 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 32.57it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


Done.
Saved merged model to https://huggingface.co/dipanjanS/RAG_Llama3-8B-it


## Merge LoRA Adapter to Llama LLM and Save in 16 Bit precision

We save the merged weights in 16-bit precision to get a better-performing model, at the cost of higher GPU memory usage and a larger checkpoint on disk.

In [45]:
peft_model.save_pretrained_merged("RAG_Llama3-8B-it",
                                  tokenizer,
                                  save_method="merged_16bit",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 286.33 out of 503.52 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 33.85it/s]


Unsloth: Saving tokenizer... Done.
Done.


In [46]:
!ls -l --block-size=MB ./RAG_Llama3-8B-it

total 16078MB
-rw-rw-rw- 1 root root    1MB Feb 16 12:54 config.json
-rw-rw-rw- 1 root root    1MB Feb 16 12:54 generation_config.json
-rw-rw-rw- 1 root root 4977MB Feb 16 12:54 model-00001-of-00004.safetensors
-rw-rw-rw- 1 root root 5000MB Feb 16 12:54 model-00002-of-00004.safetensors
-rw-rw-rw- 1 root root 4916MB Feb 16 12:54 model-00003-of-00004.safetensors
-rw-rw-rw- 1 root root 1169MB Feb 16 12:54 model-00004-of-00004.safetensors
-rw-rw-rw- 1 root root    1MB Feb 16 12:54 model.safetensors.index.json
-rw-rw-rw- 1 root root    1MB Feb 16 12:54 special_tokens_map.json
-rw-rw-rw- 1 root root   18MB Feb 16 12:54 tokenizer.json
-rw-rw-rw- 1 root root    1MB Feb 16 12:54 tokenizer_config.json
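
The size of the saved checkpoint can be sanity-checked from the layer shapes printed earlier: every parameter takes two bytes in 16-bit precision. The sketch below counts the parameters by hand. The output head (`lm_head`) and the final norm are not visible in the truncated model printout above, so their shapes are assumptions here (an untied head with the same shape as the embedding, and one more RMSNorm weight vector).

```python
HIDDEN = 4096       # model width
KV_DIM = 1024       # output width of k_proj and v_proj
MLP_DIM = 14336     # width of the MLP projections
VOCAB = 128256      # embedding rows
NUM_LAYERS = 32

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                             # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                     # two RMSNorm weight vectors
per_layer = attention + mlp + norms

embeddings = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN   # assumption: untied output head, same shape as embeddings
final_norm = HIDDEN        # assumption: one final RMSNorm weight vector

total_params = NUM_LAYERS * per_layer + embeddings + lm_head + final_norm
size_bytes = total_params * 2  # two bytes per parameter in 16-bit precision

print(f"{total_params:,} parameters")                      # 8,030,261,248 parameters
print(f"{size_bytes / 1e6:,.0f} MB in 16-bit precision")   # 16,061 MB in 16-bit precision
```

That estimate is close to the four safetensors shards listed above, which add up to about 16,062 MB.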


## Load fine-tuned Llama 3 8B

In [47]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="RAG_Llama3-8B-it",  # local directory of the merged model saved above
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=False,
)

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A40. Max memory: 44.339 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [48]:
FastLanguageModel.for_inference(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
    

## Try Fine-tuned Llama on some RAG prompts

In [49]:
question = "Tell me about the capital of India"
context = new_df.iloc[11].context
answer = new_df.iloc[11].answer
rag_prompt = rag_qa_template(question,context)
print(rag_prompt)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Use the context information below only to answer the given question.
Do not make up answers.

Context: New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7 km. New Delhi has a population of about 9.4 Million people.

Gyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a politician with the Indian National Congress party.

Hema Bharali (19 February 1919 – 29 April 2020) was an Indian freedom activist, social worker, Sarvodaya leader and Gandhian. She was born in Assam, India. She was known for her works to the empowerment of women in India. The Government of India awarded her the fourth high

In [50]:
# Encode the prompt.
inputs = tokenizer(rag_prompt, return_tensors="pt").to('cuda')
# Generate the output.
output = model.generate(**inputs, max_new_tokens=256,
                        eos_token_id=tokenizer.eos_token_id,
                        tokenizer=tokenizer, stop_strings=["<|eot_id|>"])
# Decode the output.
text = tokenizer.decode(output[0], skip_special_tokens=False)
print(text)

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
Use the context information below only to answer the given question.
Do not make up answers.

Context: New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7 km. New Delhi has a population of about 9.4 Million people.

Gyani Zail Singh (; born Jarnail Singh, 5 May 1916 – 25 December 1994) was an Indian politician. He was the seventh President of India from 1982 to 1987. He was a politician with the Indian National Congress party.

Hema Bharali (19 February 1919 – 29 April 2020) was an Indian freedom activist, social worker, Sarvodaya leader and Gandhian. She was born in Assam, India. She was known for her works to the empowerment of women in India. The Government of India awarded he
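
The decoded output above begins with two `<|begin_of_text|>` markers. A likely explanation, assuming the tokenizer prepends its own BOS token when encoding, is that the string returned by `rag_qa_template` already starts with one. The sketch below shows one way to avoid the duplicate by trimming the marker from the prompt string before tokenizing. The helper name `drop_template_bos` and the sample prompt are hypothetical and not part of the original notebook. Passing `add_special_tokens=False` to the tokenizer call is another option.

```python
BOS = "<|begin_of_text|>"

def drop_template_bos(prompt: str) -> str:
    """Remove one leading BOS marker so the tokenizer can add its own."""
    # Hypothetical helper: only strips the marker when it is the very first text.
    if prompt.startswith(BOS):
        return prompt[len(BOS):]
    return prompt

# Hand-written stand-in for the string returned by rag_qa_template (assumption).
sample_prompt = BOS + "<|start_header_id|>user<|end_header_id|>\nUse the context ..."
cleaned = drop_template_bos(sample_prompt)
print(cleaned.startswith(BOS))  # False
```

With the marker trimmed, `tokenizer(cleaned, return_tensors="pt")` should yield a sequence with a single BOS token, provided the tokenizer adds one by default.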

In [51]:
context = """
Explore the transformative technology of Retrieval-Augmented Generation (RAG). This workshop will guide you from foundational concepts to building sophisticated RAG systems, ensuring tangible outcomes and solid takeaways.
Workshop Modules:
Module 1: Getting Started with Gen AI
Module 2: Gen AI Approaches in Building Business Applications
Module 3: Foundations to Understand RAG Better
Module 4: Building RAG Applications from Scratch
Module 5: Spice up your RAG Game with Advanced RAG Strategies
Speaker: Arun Prakash Asokan, Associate Director Data Science at Novartis

Mastering LLMs: Training, Fine-tuning, and Best Practices
Gain a comprehensive introduction to training and fine-tuning large language models. This LLM workshop covers essential methodologies and hands-on sessions with tools like the HuggingFace ecosystem, PEFT, TRL and Unsloth AI.
Workshop Modules:
Module 1: LLM and Generative AI Essentials
Module 2: Training and Fine-tuning LLMs
Module 3: Parameter-efficient Fine-tuning LLMs
Module 4: Instruction-based Fine-tuning LLMs using Reinforcement Learning
Module 5: Wrap-up and Best Practices
Speaker: Dipanjan Sarkar, Head of Community and Principal AI Scientist at Analytics Vidhya

Mastering Language Models: From Concepts to Code in PyTorch
In this workshop, learn the working behind ChatGPT and get hands-on with coding, training, and fine-tuning your own language models using PyTorch. This workshop will be conducted in a unique “StatQuest Style,” ensuring every detail is clearly explained.
Workshop Modules
Module 1: Introduction to Neural Networks and Transformers
Module 2: (BAM!): Essential Matrix Algebra for Coding Transformers
Module 3: (DOUBLE BAM!!): Coding a Language Model from Scratch
Module 4: (TRIPLE BAM!!!): Fine-tuning a Production Grade Large Language Model
Speaker: Joshua Starmer PhD, Founder & CEO at StatQuest
"""

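# Build the question from adjacent string literals so that no newline or
# indentation whitespace leaks into the prompt.
question = ("Tell me who is the speaker of the fine-tuning workshop "
            "and what do we learn from this workshop?")

rag_prompt = rag_qa_template(question, context)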
# Encode the prompt.
inputs = tokenizer(rag_prompt, return_tensors="pt").to('cuda')
# Generate the output.
output = model.generate(**inputs, max_new_tokens=512,
                        eos_token_id=tokenizer.eos_token_id,
                        tokenizer=tokenizer, stop_strings=["<|eot_id|>"])
# Decode the output.
text = tokenizer.decode(output[0], skip_special_tokens=False)
print(text)

<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
Use the context information below only to answer the given question.
Do not make up answers.

Context: 
Explore the transformative technology of Retrieval-Augmented Generation (RAG). This workshop will guide you from foundational concepts to building sophisticated RAG systems, ensuring tangible outcomes and solid takeaways.
Workshop Modules:
Module 1: Getting Started with Gen AI
Module 2: Gen AI Approaches in Building Business Applications
Module 3: Foundations to Understand RAG Better
Module 4: Building RAG Applications from Scratch
Module 5: Spice up your RAG Game with Advanced RAG Strategies
Speaker: Arun Prakash Asokan, Associate Director Data Science at Novartis

Mastering LLMs: Training, Fine-tuning, and Best Practices
Gain a comprehensive introduction to training and fine-tuning large language models. This LLM workshop covers essential methodologies and hands-on sessions with tools like the HuggingFace ec
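
Because the full sequence is decoded with `skip_special_tokens=False`, the printed text repeats the entire prompt before the model's reply. A minimal sketch for pulling out only the reply is shown below. It assumes the prompt follows the Llama 3 chat format, in which the reply comes after an `<|start_header_id|>assistant<|end_header_id|>` header and ends with `<|eot_id|>`. The helper name `extract_answer` and the demo transcript are hypothetical and are not model output.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
EOT = "<|eot_id|>"

def extract_answer(decoded_text: str) -> str:
    """Return only the assistant's reply from a decoded Llama 3 transcript."""
    # Keep the text after the last assistant header; empty string if absent.
    _, found, tail = decoded_text.rpartition(ASSISTANT_HEADER)
    if not found:
        return ""
    # Cut at the end-of-turn token if the model emitted one.
    return tail.split(EOT, 1)[0].strip()

# Hand-written demo transcript (assumption, not produced by the model).
demo = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "What is the capital of India?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "New Delhi.<|eot_id|>"
)
print(extract_answer(demo))  # New Delhi.
```

In the notebook this could be applied as `extract_answer(text)` after the `tokenizer.decode` call.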

In [52]:
import torch

# Release unused cached GPU memory held by PyTorch's caching allocator.
torch.cuda.empty_cache()