# Build End-to-End RAG System using Fine-tuned Models

In this notebook we build a simple RAG pipeline from two fine-tuned components:

- Custom Fine-tuned Embedder
- Custom Fine-tuned Llama 3 LLM



## Load RAG Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("dipanjanS/india_rag_wikidata", split="train")

In [2]:
df = dataset.to_pandas()
df.head(10)

| | article_id | context | question | answer |
|---|---|---|---|---|
| 0 | 73985 | Basil ("Ocimum basilicum") ( or ) is a plant o... | What is Basil commonly known as? | Dear Friend, Basil is commonly known as Sweet ... |
| 1 | 73985 | Basil ("Ocimum basilicum") ( or ) is a plant o... | Where is Basil originally native to? | Dear Friend, Basil is originally native to Ind... |
| 2 | 259745 | The Roerich Pact is a treaty on Protection of ... | What is the significance of the Roerich Pact? | Dear Friend, the Roerich Pact is significant b... |
| 3 | 259745 | The Roerich Pact is a treaty on Protection of ... | When was the Roerich Pact signed? | Dear Friend, the Roerich Pact was signed on 15... |
| 4 | 207506 | The Indian Air Force is the air arm of the Ind... | What is the history of the Indian Air Force? | Dear Friend, the Indian Air Force was founded ... |
| 5 | 207506 | The Indian Air Force is the air arm of the Ind... | What role does the Indian Air Force play? | Dear Friend, the Indian Air Force serves as th... |
| 6 | 4954 | Prem Pal Singh Rawat (in India called Maharaji... | Who is Prem Pal Singh Rawat? | Dear Friend, Prem Pal Singh Rawat, also known ... |
| 7 | 4954 | Prem Pal Singh Rawat (in India called Maharaji... | What organizations has Prem Pal Singh Rawat be... | Dear Friend, Prem Pal Singh Rawat has been ass... |
| 8 | 5113 | Chennai (formerly known as Madras) is the capi... | What is significant about Chennai's history? | Dear Friend, Chennai, formerly known as Madras... |
| 9 | 5113 | Chennai (formerly known as Madras) is the capi... | Where is Chennai located? | Dear Friend, Chennai is located on the Coroman... |


## Get Context Document Corpus

In [3]:
context_corpus = df['context'].drop_duplicates().tolist()
len(context_corpus)

669
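
Each article contributes several question-answer rows that share the same context, which is why the contexts are deduplicated before indexing. The sketch below shows the same idea in plain Python on a few made-up rows (the row contents are illustrative assumptions, not taken from the dataset); `dict.fromkeys` keeps the first occurrence of each context and preserves order, which matches the behaviour of `drop_duplicates()` in the cell above.

```python
# Toy rows mimicking the dataset layout: two questions per article.
# The texts are hypothetical placeholders, not real dataset entries.
rows = [
    {"article_id": 1, "context": "Basil is a plant.", "question": "What is Basil?"},
    {"article_id": 1, "context": "Basil is a plant.", "question": "Where does Basil grow?"},
    {"article_id": 2, "context": "The Roerich Pact is a treaty.", "question": "What is the Roerich Pact?"},
    {"article_id": 2, "context": "The Roerich Pact is a treaty.", "question": "When was it signed?"},
]

# dict.fromkeys keeps one key per distinct context, in first-seen order.
unique_contexts = list(dict.fromkeys(row["context"] for row in rows))

print(len(rows), "rows ->", len(unique_contexts), "unique contexts")
```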

In [4]:
context_corpus[:2]

['Basil ("Ocimum basilicum") ( or ) is a plant of the Family Lamiaceae. It is also known as Sweet Basil or Tulsi. It is a tender low-growing herb that is grown as a perennial in warm, tropical climates. Basil is originally native to India and other tropical regions of Asia. It has been cultivated there for more than 5,000 years. It is prominently featured in many cuisines throughout the world. Some of them are Italian, Thai, Vietnamese and Laotian cuisines. It grows to between 30–60\xa0cm tall. It has light green, silky leaves 3–5\xa0cm long and 1–3\xa0cm broad. The leaves are opposite each other. The flowers are quite big. They are white in color and arranged as a spike.',
 'The Roerich Pact is a treaty on Protection of Artistic and Scientific Institutions and Historic Monuments, signed by the representatives of 21 states in the Oval Office of the White House on 15 April 1935. As of January 1, 1990, the Roerich Pact had been ratified by ten nations: Brazil, Chile, Colombia, Cuba, the 

## Load up Custom Fine-tuned BGE Embedder Model

Here we load the embedder model we fine-tuned earlier, from a local directory.

In [5]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# model_name = "dipanjanS/bge-base-en-v1.5-fte"
model_name = "./bge-base-en-v1.5-fte"
ft_embedder = HuggingFaceEmbeddings(model_name=model_name)
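
The cell above switches between a local checkpoint directory and a Hub model id by commenting one line out. A small helper can make that choice automatically. This is only a sketch: `resolve_model_name` is a hypothetical helper, not part of LangChain or Hugging Face, and it assumes the commented-out name is a Hub repository id.

```python
import os

def resolve_model_name(local_path, hub_id):
    """Prefer a local checkpoint directory; otherwise fall back to the Hub id."""
    # os.path.isdir is False for missing paths, so the fallback is automatic.
    return local_path if os.path.isdir(local_path) else hub_id

# Names copied from the cell above; which one is returned depends on
# whether the local directory exists on this machine.
model_name = resolve_model_name("./bge-base-en-v1.5-fte",
                                "dipanjanS/bge-base-en-v1.5-fte")
print(model_name)
```

The returned string can then be passed as `model_name` exactly as in the cell above.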

## Create Vector Database Index

In [6]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
db = Chroma.from_texts(texts=context_corpus,
                       collection_name='wikipedia_db',
                       embedding=ft_embedder,
                       collection_metadata={"hnsw:space": "cosine"})

In [7]:
retriever = db.as_retriever(search_type="similarity",
                            search_kwargs={"k": 3})
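
To make the `"cosine"` space and `k=3` settings concrete, here is a minimal brute-force sketch of cosine top-k retrieval in plain Python. It is not how Chroma works internally (the `hnsw:` prefix of the metadata key suggests an approximate HNSW index); it only shows the exact ranking that such an index approximates. The 3-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Hypothetical toy embeddings; real ones come from the embedder model.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

print(top_k(query_vec, doc_vecs, k=3))
```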

## Test Sample Retrievals

In [8]:
query = "Tell me about the Indian flag"

docs = retriever.invoke(query)
docs

[Document(id='27705482-da7e-4925-8978-11ba2ebc3dbd', metadata={}, page_content='The modern Flag of The Republic of India has three colours, which are placed horizontally. At the top is saffron, which signifies sacrifice and patriotism. In the middle is white, which stands for truth in word and actions and purity in our thoughts. At the bottom is green, which stands for life and prosperity. In the middle of the white is a blue wheel, which is called the Ashoka Chakra. It has 24 spokes and it stands for progress.The Chakra or the wheel also symbolizes the Power of the State governed by Dharma. It is also called the tiranga or tricolour. The flag was discovered by Venkayya Pingali.'),
 Document(id='dbe30b7d-9609-4757-b96a-f5e3926115fc', metadata={}, page_content='The modern Republic of India (Hindi:); has several official National symbols including a historical document, a flag, an emblem, an anthem, a memorial tower as well as several national heroes. All the symbols were picked up at va

## Create a Retrieval Function on the Vector DB

In [9]:
def rag_retrieval(query):
    docs = retriever.invoke(query)
    context = "\n\n".join([doc.page_content for doc in docs])
    return context

query = "Tell me about the national anthem of India"
print(rag_retrieval(query))

"Jana Gana Mana" (जन-गण-मन) is the title for the national anthem of India. In 1911, it was composed and originally written in Bengali by popular Indic poet and artist Rabindranath Tagore—the same author for the Bangladeshi anthem ""Amar Shonar Bangla"". Originally a five-verse Brahmo hymn in Bengali called ""Bharoto Bhagyo Bidhata"", in January 1950, only the first verse was made to be officially part of the national anthem of India per the . In addition, the lyrics now contain many words of Sanskrit origin.

The modern Republic of India (Hindi:); has several official National symbols including a historical document, a flag, an emblem, an anthem, a memorial tower as well as several national heroes. All the symbols were picked up at various times. The design of the national flag was officially adopted by the Constituent Assembly just 21 days before Independence, on the 24th of July in 1947. There are also several other patriotic symbols including the national animal, bird, fruit, flower

## Load Fine-tuned Llama 3 LLM

Here we will use our custom fine-tuned Llama 3 8B LLM from the previous hands-on demo.

In [10]:
from unsloth import FastLanguageModel

model_name = "RAG_Llama3-8B-it"
# model_name = "dipanjanS/RAG_Llama3-8B-it"
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA A40. Max memory: 44.339 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
    

## Create a simple RAG prompt

In [11]:
def rag_qa_prompt(question, context):
    """
    Creates a chat prompt for the Llama model.

    Args:
        question: The question to be answered.
        context: The context information to be used for generating the answer.

    Returns:
        A string containing the chat template.
    """

    prompt = f"""\
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Use the context information below only to answer the given question.
    Do not make up answers.

    Context: {context}

    Question: {question}

    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """
    # Remove any leading whitespace characters from each line in the prompt.
    prompt = "\n".join([line.lstrip() for line in prompt.splitlines()])
    return prompt
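
One side effect of stripping each line after interpolation is that leading whitespace inside `context` is removed too. If that matters (for example, when a context contains an indented list), an alternative is to dedent the template once and interpolate afterwards. The sketch below is a variation under that assumption, not the function used in this notebook; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the template wording mirrors the prompt above.

```python
import textwrap

# Dedent the template once, before any user text is inserted.
PROMPT_TEMPLATE = textwrap.dedent("""\
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Use the context information below only to answer the given question.
    Do not make up answers.

    Context: {context}

    Question: {question}

    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """)

def build_prompt(question, context):
    # str.format inserts the values verbatim, so indentation inside
    # `context` survives unchanged.
    return PROMPT_TEMPLATE.format(question=question, context=context)

prompt = build_prompt("What is Basil also known as?",
                      "Basil:\n    - also known as Sweet Basil or Tulsi")
print(prompt)
```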

## Create an LLM Response Generation Function

In [12]:
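def rag_generation(model, tokenizer, prompt, max_tokens=512):
    """
    Generates a response from the Llama model using the given prompt.

    Args:
        model: The fine-tuned Llama model used for generation.
        tokenizer: The tokenizer that belongs to the model.
        prompt: The prompt to be used for generating the response.
        max_tokens: The maximum number of new tokens to generate.

    Returns:
        A string containing the decoded output, which includes the
        prompt and the special tokens as well as the generated response.
    """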

    # Encode the prompt.
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
    # Generate the output.
    output = model.generate(**inputs, max_new_tokens=max_tokens,
                            eos_token_id=tokenizer.eos_token_id,
                            tokenizer=tokenizer,
                            stop_strings=["<|eot_id|>"])
    # Decode the output.
    text = tokenizer.decode(output[0], skip_special_tokens=False)
    return text

## Create the RAG Pipeline

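Here we put everything together and chain the three functions one after the other:

- Retrieve the context documents from the vector DB (`rag_retrieval`)
- Build the RAG prompt from the user question and the retrieved context (`rag_qa_prompt`)
- Send the prompt to the LLM and generate the response (`rag_generation`)

Because the decoded text still contains the prompt and the special tokens, the pipeline finally keeps only the assistant's turn.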

In [15]:
def rag_pipeline(query, model, tokenizer, max_tokens=512):
    context = rag_retrieval(query)
    prompt = rag_qa_prompt(query, context)
    response = rag_generation(model, tokenizer,
                              prompt, max_tokens)
    response = (response.split('<|start_header_id|>assistant<|end_header_id|>')[1]
                       .split('<|eot_id|>')[0])
    return response
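
The chained `split(...)[1]` in `rag_pipeline` raises an `IndexError` if the assistant header is missing from the decoded text. A slightly more defensive extraction can be sketched with `str.partition`. `extract_assistant_reply` is a hypothetical helper, and unlike the pipeline above it also strips surrounding whitespace, so treat it as an illustration rather than a drop-in replacement.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"

def extract_assistant_reply(decoded_text):
    """Return the text between the assistant header and the next end-of-turn token."""
    _, header_found, after_header = decoded_text.partition(ASSISTANT_HEADER)
    if not header_found:
        # No assistant header: fall back to the whole text instead of failing.
        return decoded_text.strip()
    reply, _, _ = after_header.partition(END_OF_TURN)
    return reply.strip()

# Made-up decoded output in the shape produced by the prompt template.
decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Question: What is the capital of India?\n"
    "<|eot_id|>" + ASSISTANT_HEADER + "\n\n"
    "Answer: Dear Friend, the capital of India is New Delhi.<|eot_id|>"
)
print(extract_assistant_reply(decoded))
```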

## Test out the RAG system

In [16]:
query = "Tell me about the Indian flag"
result = rag_pipeline(query, model, tokenizer)
print(result)


Answer: Dear Friend, the Indian flag, also known as the tiranga or tricolour, has three colours placed horizontally. The top saffron represents sacrifice and patriotism, while the middle white symbolizes truth and purity. The green at the bottom represents life and prosperity. 


In [17]:
query = "What is the capital of India?"
result = rag_pipeline(query, model, tokenizer)
print(result)


Answer: Dear Friend, the capital of India is New Delhi. It is also a union territory and a part of the megacity of Delhi. 


In [21]:
query = "What is the capital of India? Give a detailed answer"
result = rag_pipeline(query, model, tokenizer)
print(result)


Answer: Dear Friend, the capital of India is New Delhi. It is a union territory located within the megacity of Delhi and is renowned for its rich history and cultural heritage. New Delhi serves as the seat of the federal government and is a hub for politics, education, and commerce. 


In [22]:
query = "Tell me about the national anthem of India"
result = rag_pipeline(query, model, tokenizer)
print(result)


Answer: Dear Friend, the national anthem of India is called 'Jana Gana Mana', which was composed by Rabindranath Tagore. It was originally a five-verse Brahmo hymn in Bengali and was later made official in January 1950. 


In [23]:
query = "Tell me about the national song of India"
result = rag_pipeline(query, model, tokenizer)
print(result)


Answer: Dear Friend, the national song of India is 'Vande Mataram'. It is a significant part of Indian culture and was written by Bankim Chandra Chattopadhyay. The lyrics are a mix of Sanskrit and Bengali, and the title means 'Long live the Motherland'. 


## Frameworks to Evaluate RAG Systems

- TruLens
- LLM-as-a-Judge
- Ragas
- DeepEval
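
Before reaching for one of these frameworks, a quick sanity check of the retriever alone can be done with a hit rate at k: the fraction of questions whose source context appears among the top k retrieved documents. The sketch below uses none of the frameworks' APIs. `hit_rate_at_k`, the toy corpus, and the keyword-overlap `toy_retrieve` are hypothetical stand-ins; in this notebook, `retrieve` could instead wrap `retriever.invoke` and return each document's `page_content`.

```python
def hit_rate_at_k(eval_pairs, retrieve, k=3):
    """Fraction of (question, gold_context) pairs whose gold context is in the top k."""
    hits = 0
    for question, gold_context in eval_pairs:
        if gold_context in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_pairs)

# Hypothetical three-document corpus.
corpus = [
    "Basil is a plant",
    "The Roerich Pact is a treaty",
    "Chennai is a city",
]

def toy_retrieve(question):
    """Rank documents by word overlap with the question (stand-in for a real retriever)."""
    q_words = set(question.lower().replace("?", "").split())
    return sorted(corpus,
                  key=lambda doc: len(q_words & set(doc.lower().split())),
                  reverse=True)

eval_pairs = [
    ("What is Basil?", corpus[0]),
    ("When was the Roerich Pact signed?", corpus[1]),
    ("Where is Madras?", corpus[2]),  # no keyword match, so a weak retriever misses it
]

print(hit_rate_at_k(eval_pairs, toy_retrieve, k=1))
print(hit_rate_at_k(eval_pairs, toy_retrieve, k=3))
```

The question and context columns of the dataset loaded at the top of the notebook already provide such pairs, so the same function could be applied to the fine-tuned retriever.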