# __Demo: RAG Implementation from scratch__

### Steps to be followed:

1. Install and import the dependencies
2. Load the document
3. Split the document into chunks
4. Generate embeddings for each chunk
5. Build the FAISS vector store and create a retriever
6. Design a prompt template for the language model
7. Load and configure a quantized language model
8. Set up the generation pipeline and chain the components
9. Invoke the pipeline with a query

# **Step 1: Install and import the dependencies**

In [None]:
!pip install --upgrade huggingface_hub sentence-transformers bitsandbytes

Defaulting to user installation because normal site-packages is not writeable
Collecting sentence-transformers
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 2.2.2
    Uninstalling sentence-transformers-2.2.2:
      Successfully uninstalled sentence-transformers-2.2.2
Successfully installed sentence-transformers-3.4.1


In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import os


# **Step 2: Load the document**

Load the document that will be used as the knowledge source.

**Knowledge base**: The text document serves as the underlying knowledge base. Later, when a query is made, relevant parts of this document will be retrieved to augment the LLM's response.






In [None]:
text_loader = TextLoader("state_of_union.txt")
text_document = text_loader.load()
print(text_document[:100])  # load() returns a list of Document objects, so this slice prints the list (one Document here), not the first 100 characters



[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citize

# **Step 3: Split the document into chunks**

Break down the large document into manageable pieces.

**Fine-Grained Retrieval**: Smaller chunks allow the retriever to more precisely locate the context relevant to the query, enhancing the generation step with focused context.

In [None]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
split_texts = doc_splitter.split_documents(text_document)
print(len(split_texts))  # Prints the number of chunks the document has been split into


53
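
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the library's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this toy function does not do. The name `sliding_chunks` and the dummy text are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 1200-character dummy text (hypothetical input, not the speech)
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the last 64 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk.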


# **Step 4: Generate embeddings for each chunk**

Convert text chunks into numerical vectors (embeddings) that capture semantic meaning.

**Semantic Search**: Embeddings allow the FAISS vector store to perform similarity searches, ensuring that the most relevant context is retrieved for any given query.

**Verification**: Printing the length of the embedding vector confirms the transformation was successful.

In [None]:
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content
hf_embed_result = hf_embed.embed_documents([text])
print(len(hf_embed_result[0]))  # Prints the length of the first embedded document

768


#### To quickly inspect what the embeddings for all chunks look like, run the following

In [None]:
embedded_chunks = [hf_embed.embed_query(chunk.page_content) for chunk in split_texts]

In [None]:
import pandas as pd
df_chunks = pd.DataFrame(embedded_chunks)
df_chunks


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.01757 | 0.101777 | -0.010494 | 0.024629 | 0.007867 | -0.008709 | -0.047972 | -0.011033 | -0.000209 | -0.036135 | ... | -0.019824 | -0.01993 | 0.044173 | 0.027855 | -0.045804 | -0.009089 | -0.035914 | 0.000358 | -0.003852 | -0.017099 |
| 1 | 0.033408 | 0.078639 | -0.002185 | -0.031801 | -0.012582 | -0.014186 | -2.5e-05 | -0.041827 | 0.026004 | 0.033043 | ... | -0.006442 | -0.016889 | -0.027715 | 0.053919 | 0.027535 | 0.008685 | -0.011436 | 0.068335 | -0.028501 | -0.037242 |
| 2 | 0.012558 | 0.033666 | 0.001252 | 0.005224 | -0.022721 | -0.028479 | 0.052828 | -0.022878 | -0.009452 | 0.001424 | ... | -0.012291 | -0.024313 | 0.042918 | 0.025881 | 0.014226 | -0.048886 | -0.016243 | 0.035996 | -0.012137 | -0.024365 |
| 3 | -0.038429 | 0.036189 | 0.013976 | 0.010168 | 0.009914 | -0.025288 | -0.008883 | -0.049233 | 0.026862 | 0.001242 | ... | -0.000914 | -0.014002 | 0.074753 | 0.019576 | 0.037865 | -0.024706 | -2e-06 | 0.058 | -0.010562 | 0.000267 |
| 4 | 0.014516 | 0.041241 | 0.003573 | -0.037763 | -0.0101 | -0.032179 | 0.032072 | -0.017779 | 0.000895 | -0.007063 | ... | -0.027593 | 0.015847 | 0.057699 | 0.030875 | 0.027842 | 8.2e-05 | 0.009793 | -0.00043 | -0.051848 | -0.034995 |
| 5 | 0.041825 | 0.035288 | 0.004461 | -0.040438 | -0.02373 | -0.059049 | 0.025266 | -0.003494 | 0.019276 | 0.017754 | ... | -0.069936 | 0.00159 | -0.001447 | 0.068728 | 0.011128 | -0.009241 | -0.027915 | 0.030503 | -0.057551 | -0.032336 |
| 6 | 0.021648 | 0.014052 | 0.023061 | -0.04013 | -0.01539 | -0.032677 | 0.025847 | -0.002982 | 0.084222 | -0.001781 | ... | -0.043137 | -0.005948 | 0.100516 | 0.046437 | 0.033617 | -0.013371 | 0.003821 | 0.02407 | -0.060905 | -0.036954 |
| 7 | 0.000521 | 0.041569 | 0.005422 | -0.008324 | -0.044695 | -0.049875 | 0.010778 | -0.03033 | 0.007619 | 0.015853 | ... | -0.063684 | 0.022584 | 0.050347 | 0.042918 | 0.02916 | -0.046929 | 0.000421 | 0.0004 | -0.094893 | -0.00346 |
| 8 | 0.032507 | 0.082838 | 0.01287 | 0.01796 | -0.045969 | -0.029158 | 0.019323 | -0.018328 | 0.011206 | 0.019157 | ... | -0.043993 | 0.004471 | 0.04404 | 0.051866 | 0.042437 | -0.07603 | -0.006913 | 0.02264 | -0.06728 | -0.022953 |
| 9 | -0.012315 | 0.038408 | -0.010831 | 0.002273 | 0.01661 | -0.036781 | -0.011865 | -0.025149 | 0.021735 | -0.024317 | ... | -0.04784 | 0.013128 | 0.036254 | 0.035082 | 0.044569 | -0.025483 | 0.009615 | 0.030583 | -0.03756 | -0.01257 |
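
Rows of numbers like these become useful once two of them are compared: text with related meaning tends to produce vectors that point in similar directions. The sketch below computes cosine similarity, one common way to measure this, on made-up 3-dimensional vectors. The real embeddings here have 768 dimensions, and the vector values and variable names are purely illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" for illustration only
query_vec = [0.9, 0.1, 0.0]
chunk_about_ukraine = [0.8, 0.2, 0.1]
chunk_about_colleges = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(query_vec, chunk_about_ukraine)
sim_far = cosine_similarity(query_vec, chunk_about_colleges)
print(f"{sim_close:.3f} vs {sim_far:.3f}")
```

The chunk whose vector is closer in direction to the query gets the higher score, which is the signal a similarity search ranks on.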


# **Step 5: Build the FAISS vector store and create a retriever**

Build an index (FAISS) for the document embeddings and create a retriever.

**Retrieval step**: The retriever is responsible for fetching the most relevant chunks from the document based on the query. These retrieved contexts will later be fed into the generation step to produce an informed answer.


In [None]:
vectorstore=FAISS.from_documents(split_texts, hf_embed)

# FAISS.from_documents embeds the chunks with the same model as above and builds a vector index from them. The index lives in memory only, i.e. it is temporary and non-persistent

#### Let's see if the retriever works

In [None]:
retriever=vectorstore.as_retriever()

In [None]:
# The way the retriever works

query = "What are the key points from the State Of The Union"
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)

Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges. 

And let’s pass the PRO Act when a majority of workers want to form a union—they shouldn’t be stopped.  

When we invest in our workers, when we build the economy from the bottom up and the middle out together, we can do something we haven’t done in a long time: build a better America.
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.
But in my administration, the watch

In [None]:
query2 = "How is the United States supporting Ukraine economically and militarily?"

In [None]:
docs = retriever.get_relevant_documents(query2)
for doc in docs:
    print(doc.page_content)

The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. 

Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. 

We are giving more than $1 Billion in direct assistance to Ukraine. 

And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering.
Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. 

We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. 

Together with our allies –we are right now enforcing powerful economic sanctions.
The United States is a member along with 29 other nations. 

It matters. American diplo
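
Conceptually, the retriever embeds the query and returns the k stored chunks whose vectors lie nearest to it. The sketch below does this by brute force with squared Euclidean distance on made-up 2-dimensional vectors. It is not FAISS code, and the names `squared_l2`, `top_k`, and `index` are hypothetical. The choice of L2 distance and of k = 4 as the default is an assumption about LangChain's FAISS wrapper that is worth checking against the installed version.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k entries closest to query_vec (exact, brute-force search)."""
    ranked = sorted(index, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy index of (chunk label, made-up 2-d embedding) pairs
index = [
    ("sanctions on Russia", [0.9, 0.1]),
    ("assistance to Ukraine", [0.8, 0.3]),
    ("community colleges", [0.1, 0.9]),
    ("Pell Grants", [0.2, 0.8]),
]

nearest = top_k([1.0, 0.2], index, k=2)
print(nearest)
```

A vector index such as FAISS exists to answer the same question faster than this linear scan when the number of chunks grows large.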

# **Step 6: Design a prompt template for the language model**
Establish a prompt that instructs the LLM on how to utilize the retrieved context to generate a concise answer.

**Guiding Generation**: The prompt template bridges retrieval and generation by ensuring the LLM uses the provided context (from the retriever) to answer the query accurately.

In [None]:
from langchain.prompts import ChatPromptTemplate

In [None]:
template="""You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use one sentence and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

In [None]:
prompt=ChatPromptTemplate.from_template(template)
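
In essence, the template has two slots that are filled in at query time. The sketch below mimics that substitution with plain `str.format` rather than `ChatPromptTemplate`, so it runs without LangChain. The helper `build_prompt` is hypothetical, and the two context strings are sentences taken from the retrieved output shown earlier.

```python
# Same wording as the template defined above
TEMPLATE = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use one sentence and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and substitute them, with the question, into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(question=question, context=context)


filled = build_prompt(
    "How is the United States supporting Ukraine economically and militarily?",
    [
        "We are giving more than $1 Billion in direct assistance to Ukraine.",
        "Military assistance. Economic assistance. Humanitarian assistance.",
    ],
)
print(filled)
```

The filled prompt ends with `Answer:`, which is the cue for the language model to continue with its response.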

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

In [None]:
output_parser=StrOutputParser()

# **Step 7: Load and configure a quantized language model**

Load a quantized version of a large language model (Falcon3-1B-Base) for efficient and cost-effective text generation.

**Generation Step**: This model is responsible for generating the final answer. It takes the prompt (which includes the retrieved context) and produces a response, completing the RAG pipeline.

**Efficiency**: 4-bit quantization stores the model weights in roughly a quarter of the memory needed for 16-bit weights, usually at a small cost in output quality, which helps when deploying RAG systems on limited hardware.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

MODEL_NAME = "tiiuae/Falcon3-1B-Base"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(131072, 2048)
    (layers): ModuleList(
      (0-17): 18 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): LlamaRMSNorm((2048,)
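
The layer shapes printed above are enough for a rough estimate of what 4-bit quantization saves. The sketch below counts only the weights of the `Linear4bit` layers and compares their size at 16 bits and at 4 bits per weight. It is back-of-the-envelope arithmetic under stated assumptions: it ignores the embedding table and norm layers (which the printout shows as unquantized), any output head (the printout is cut off before that point), the small per-block quantization constants, and activation memory.

```python
# Shapes read from the printed model structure
HIDDEN = 2048      # in_features of the attention projections
KV_DIM = 1024      # out_features of k_proj and v_proj
MLP_DIM = 8192     # out_features of gate_proj and up_proj
NUM_LAYERS = 18    # "(0-17): 18 x LlamaDecoderLayer"

# Weights per decoder layer (all layers are bias-free)
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                             # gate_proj, up_proj, down_proj
linear_params = NUM_LAYERS * (attention + mlp)


def gib(num_bytes: float) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / 2**30


fp16_gib = gib(linear_params * 2.0)  # 16 bits = 2 bytes per weight
nf4_gib = gib(linear_params * 0.5)   # 4 bits = 0.5 bytes per weight
print(f"{linear_params:,} weights: {fp16_gib:.2f} GiB in fp16 vs {nf4_gib:.2f} GiB in 4-bit")
```

Under these assumptions the quantized linear layers need about a quarter of the memory of their 16-bit counterparts.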

In [None]:
model.eval()
generation_config = model.generation_config
# Set the sampling temperature to 0.8 (lower values make responses more deterministic; it only applies when sampling is enabled)
generation_config.temperature = 0.8
# Set number of returned sequences to 1
generation_config.num_return_sequences = 1
# Set maximum new tokens per response
generation_config.max_new_tokens = 256
# Disable token caching
generation_config.use_cache = False
# Set repetition penalty for more diverse responses
generation_config.repetition_penalty = 1.7
# Define pad and EOS token IDs
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
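
To give a feel for what `temperature` and `repetition_penalty` do, the sketch below applies both to a made-up set of three token scores (logits). It is an illustration of the general technique, not code from the Transformers library: the function names are hypothetical, and the penalty rule (divide positive logits, multiply negative ones for tokens already generated) is written here as we understand the library's behavior.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits: list[float], seen_ids: set[int], penalty: float) -> list[float]:
    """Lower the scores of already-generated tokens: divide positive logits, multiply negative ones."""
    out = list(logits)
    for i in seen_ids:
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


logits = [2.0, 1.0, -0.5]  # made-up scores for a 3-token vocabulary

probs_default = softmax_with_temperature(logits, 1.0)
probs_cooler = softmax_with_temperature(logits, 0.8)
penalized = apply_repetition_penalty(logits, seen_ids={0, 2}, penalty=1.7)
print(probs_default, probs_cooler, penalized)
```

A temperature below 1.0 concentrates probability on the top-scoring token, while a penalty above 1.0 pushes down tokens that have already appeared. A penalty as high as 1.7 is a strong setting, so it can noticeably change which words the model picks.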

In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
)

# **Step 8: Set up the generation pipeline and chain the components**

Build an end-to-end pipeline that seamlessly connects document retrieval with text generation.

**Integration**: The chain uses the retriever to fetch context, applies the prompt template to integrate the query with the retrieved context, and then passes the final prompt to the LLM for answer generation.

**Pipeline composition**: The pipe operator (`|`) chains the components so that the output of each one becomes the input of the next, performing a complete RAG operation in a single call.

In [None]:
from langchain.llms import HuggingFacePipeline # Import HuggingFacePipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Create the HuggingFacePipeline object
llm_pipeline = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda:0


In [None]:
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()}
    | prompt
    | llm_pipeline
    | output_parser
)
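
The `|` syntax can look like magic, so here is a toy sketch of the idea behind it: each component exposes an `invoke` method, and `a | b` builds a new component that runs `a` and feeds its result to `b`. This is not LangChain's implementation. The `Step` class, the `parallel` helper, and all the `fake_*` components are hypothetical stand-ins so that the example runs without any model or vector store.

```python
from typing import Any, Callable


class Step:
    """A toy stand-in for a runnable: wraps a function and supports `|` chaining."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(branches: dict[str, Step]) -> Step:
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


# Hypothetical stand-ins for the retriever, prompt, LLM, and output parser
fake_retriever = Step(lambda question: "We are giving more than $1 Billion in direct assistance to Ukraine.")
passthrough = Step(lambda question: question)
fake_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}\nAnswer:")
fake_llm = Step(lambda prompt_text: prompt_text + " (model output would go here)")
fake_parser = Step(lambda text: text.strip())

toy_chain = parallel({"context": fake_retriever, "question": passthrough}) | fake_prompt | fake_llm | fake_parser
result_text = toy_chain.invoke("How is the United States supporting Ukraine?")
print(result_text)
```

One difference from this toy is worth knowing: the real retriever returns a list of `Document` objects rather than a string, so the `{context}` slot receives their string form unless they are joined first. A common pattern is to pass the retriever output through a small function that joins each document's `page_content`.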

# **Step 9: Invoke the pipeline with a query**

Execute the entire RAG pipeline with a sample query.

**Final output**: The pipeline retrieves relevant chunks from the document, forms a context-rich prompt, and the LLM generates a concise answer based on that context.

**End-to-end flow**: This step demonstrates the full cycle of RAG—retrieval and augmented generation—in action.

In [None]:
result = rag_chain.invoke("How is the United States supporting Ukraine economically and militarily?")



In [None]:
result

'In order to provide financial or military resources directly towards helping those affected by conflict situations such as war crimes investigations can also involve funding humanitarian organizations working within these areas which may include medical teams assisting victims during conflicts; this would fall under international law regarding human rights protection measures when dealing specifically about how funds should ideally go if there exists any form of violence occurring between different groups living together peacefully but still having disagreements over territory/resources etc., so it might make sense depending upon specific circumstances whether certain types of donations made via official channels provided through government agencies responsible overseeing security forces operating near borders where clashes often occur due lack thereof proper communication systems being established beforehand among both sides involved making sure everyone understands exactly why money

# Conclusion

This RAG (retrieval-augmented generation) pipeline shows how to combine retrieval with generative AI to produce context-driven answers. The first steps (setting up the environment, loading and splitting the document, generating embeddings, building a FAISS vector store, and creating a retriever) establish the foundation for locating the most relevant pieces of information. The prompt template then guides the language model to use the retrieved context, and a 4-bit quantized model in an end-to-end chain generates the answer with a reduced memory footprint. Answer quality still depends on the generator: a small base model such as Falcon3-1B-Base can produce rambling output like the result in Step 9, so an instruction-tuned model is a natural next step. The same structure adapts to other documents and domains by swapping the loader, the embedding model, or the LLM.