# Retrieval-Augmented Generation (RAG) architecture

## 1. Document Loading and Chunking

### 1.1 PDF Text Extraction

* **Library choice**: use a robust parser (e.g. PyPDF2, PDFMiner) that preserves text order and handles multi-column layouts.
* **Normalization**:

  * Unicode normalization (NFC/NFD)
  * Removal of control characters, page headers/footers
  * Consistent newline handling
* **Error handling**:

  * Fallback for pages that fail to parse (e.g. OCR)
  * Logging of missing or malformed sections
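The normalization steps above can be sketched with the standard library alone. The helper name `normalize_text` and the choice of NFC are illustrative assumptions; header and footer removal depends on the document and is left out.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize extracted PDF text (hypothetical helper, not a library function)."""
    # Compose characters into canonical form (NFC).
    text = unicodedata.normalize("NFC", raw)
    # Unify newline conventions to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters (Unicode category Cc) except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text)
```

Running it on each page before chunking keeps chunk boundaries from depending on stray control bytes or mixed line endings.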

### 1.2 Chunking Strategy

* **Fixed-length windows**: contiguous substrings of length $C$ (e.g. $C=500$ characters)
* **Sentence-aware splitting**: ensure splits occur at punctuation boundaries
* **Overlap / stride**: sliding window with overlap $o$ (e.g. $o=50$) to preserve context across chunk borders
* **Result**: set of fragments $\{d_i\}_{i=1}^N$, each amenable to independent embedding
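Sentence-aware splitting can be combined with overlap by packing whole sentences into each chunk. The sketch below is one possible approach, not a reference implementation: `sentence_chunks` and its parameters are hypothetical names, and the regex splitter is deliberately naive (it will break on abbreviations such as "e.g.").

```python
import re


def sentence_chunks(text: str, max_chars: int = 500, overlap_sents: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars characters.

    Consecutive chunks share up to `overlap_sents` sentences. A single
    sentence longer than max_chars is kept whole as its own chunk.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sent in sentences:
        if current and len(" ".join(current + [sent])) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentences over to preserve context.
            current = current[-overlap_sents:] if overlap_sents > 0 else []
            if len(" ".join(current + [sent])) > max_chars:
                current = []  # the overlap would not fit; start fresh
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `sentence_chunks("A b. C d. E f. G h.", max_chars=9)` yields three chunks of two sentences each, with every chunk starting on the sentence that ended the previous one.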

In [None]:
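from PyPDF2 import PdfReader

reader = PdfReader("temp.pdf")
pages_text = []
for page in reader.pages:
    # Guard against pages with no extractable text (e.g. scanned images)
    pages_text.append(page.extract_text() or "")

full_text = "\n".join(pages_text)

# Fixed-length windows with overlap: each window starts
# (chunk_size - overlap) characters after the previous one
chunk_size = 500
overlap = 50
stride = chunk_size - overlap

chunks = []
start = 0
while start < len(full_text):
    chunks.append(full_text[start : start + chunk_size])
    # Stop once the window has reached the end of the text
    if start + chunk_size >= len(full_text):
        break
    start += stride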


## 2. Loading Model Weights and Preparing for Inference

### 2.1 Configuration and Architecture

* **Model config**: JSON specifying

  * Hidden size $D$ (e.g. 768, 1024)
  * Number of layers $N$
  * Attention heads $H$
  * Vocabulary size $V$
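As a small illustration, the sketch below parses such a config and derives two quantities from it. The field names follow the Hugging Face `config.json` convention; the values are illustrative, so read the real ones from your checkpoint.

```python
import json

config_json = """
{
  "hidden_size": 1024,
  "num_hidden_layers": 24,
  "num_attention_heads": 16,
  "vocab_size": 250002
}
"""

cfg = json.loads(config_json)
D = cfg["hidden_size"]
H = cfg["num_attention_heads"]

# Each attention head works on an equal slice of the hidden vector,
# so D must be divisible by H.
assert D % H == 0, "hidden size must be divisible by the number of heads"
head_dim = D // H

# The token-embedding table alone holds V x D parameters.
embedding_params = cfg["vocab_size"] * D
print(head_dim, embedding_params)  # 64 256002048
```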

### 2.2 Weights Initialization

* **State dict**: mapping of parameter names to tensors
* **Exact name matching**: ensure loaded keys match the keys of the model's `state_dict()` (parameters and buffers)
* **Device placement**:

  * CPU vs GPU (e.g. `model.to(device)`)
  * Mixed-precision (FP16/FP32)
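A quick way to check name matching before loading is to compare the two key sets. The helper `diff_keys` below is hypothetical and works on any iterables of names; the key names in the example are made up for illustration.

```python
def diff_keys(expected, loaded):
    """Return (missing, unexpected) key lists for a checkpoint.

    `expected` and `loaded` are iterables of parameter names, for example
    model.state_dict().keys() and state_dict.keys().
    """
    expected, loaded = set(expected), set(loaded)
    missing = sorted(expected - loaded)      # in the model, absent from the file
    unexpected = sorted(loaded - expected)   # in the file, unknown to the model
    return missing, unexpected


model_keys = ["embeddings.weight", "encoder.layer.0.weight", "pooler.weight"]
ckpt_keys = ["roberta.embeddings.weight", "encoder.layer.0.weight", "pooler.weight"]
missing, unexpected = diff_keys(model_keys, ckpt_keys)
print(missing)     # ['embeddings.weight']
print(unexpected)  # ['roberta.embeddings.weight']
```

A mismatch like the one above usually points to a prefix that has to be stripped or added. PyTorch's `load_state_dict(..., strict=False)` reports the same two lists in its return value.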

### 2.3 Evaluation Mode

* **`model.eval()`**:

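  * Disables dropout layers
  * Makes batch-norm layers use their stored running statistics (layer norm behaves the same in both modes)
  * Removes dropout randomness from the forward pass; it does not disable gradient tracking, so pair it with `torch.no_grad()` for inference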
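To make the effect on dropout concrete, here is a toy version of inverted dropout in plain Python. It is a stand-in for illustration only, not PyTorch's implementation; the function name and signature are assumptions.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of floats (toy stand-in for a dropout layer)."""
    if not training or p == 0.0:
        # Evaluation mode: the layer is the identity function.
        return list(values)
    keep = 1.0 - p
    # Training mode: zero each element with probability p and rescale the
    # survivors so the expected value is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


x = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)
train_out = dropout(x, p=0.5, training=True, rng=rng)   # entries are 0.0 or doubled
eval_out = dropout(x, p=0.5, training=False, rng=rng)   # identical to x
```

Embeddings computed in training mode would differ from run to run, which is why the switch to evaluation mode matters before indexing.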

In [None]:
import torch
from transformers import AutoConfig, AutoModel

# Load configuration, requesting hidden_states
config = AutoConfig.from_pretrained(
    "intfloat/multilingual-e5-large",
    output_hidden_states=True
)

# Instantiate model architecture
model = AutoModel.from_config(config)

# Load weights from the pretrained checkpoint
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict)

# Switch to evaluation mode (disables dropout, etc.)
model.eval()



## 3. Tokenization with SentencePiece

### 3.1 Tokenizer Model

* **Type**: Unigram LM or Byte-Pair Encoding (BPE)
* **Vocabulary**: size $|V|$ (typically 30 000–100 000; multilingual models can be larger, e.g. about 250 000 for XLM-R)
* **Model artifacts**:

  * `spiece.model` (binary)
  * `spiece.vocab` (optional, human-readable piece list)

### 3.2 Encoding Procedure

* **Subword segmentation**: text $\to$ ID sequence
* **Unknown handling**: reserved `<unk>` token for characters that no vocabulary piece covers
* **Special tokens**: `<s>`, `</s>`, `<pad>`
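For the Unigram LM case, segmentation amounts to picking the most probable split of the text into vocabulary pieces. The toy Viterbi sketch below shows the idea under simplifying assumptions: the vocabulary and log-probabilities are invented, whitespace markers are ignored, and an uncovered character is emitted as `<unk>` with a fixed penalty. The real SentencePiece implementation is considerably more involved.

```python
import math


def unigram_segment(text, log_probs, unk_log_prob=-20.0):
    """Most probable segmentation of `text` under a unigram LM (Viterbi)."""
    n = len(text)
    max_piece = max(map(len, log_probs), default=1)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [None] * (n + 1)       # back[i]: (start, label) of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece), end):
            piece = text[start:end]
            if piece in log_probs:
                score, label = log_probs[piece], piece
            elif end - start == 1:
                score, label = unk_log_prob, "<unk>"  # uncovered character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = (start, label)
    # Walk the back-pointers from the end of the string.
    pieces, pos = [], n
    while pos > 0:
        start, label = back[pos]
        pieces.append(label)
        pos = start
    return pieces[::-1]


toy_vocab = {"un": -2.0, "related": -3.0, "relate": -3.5, "d": -4.0, "u": -5.0, "n": -5.0}
print(unigram_segment("unrelated", toy_vocab))   # ['un', 'related']
print(unigram_segment("unrelated!", toy_vocab))  # ['un', 'related', '<unk>']
```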

### 3.3 Sequence Formatting

* **Maximum length $L$**: truncate sequences longer than $L$ tokens
* **Padding**: append `<pad>` IDs to reach length $L$
* **Attention mask**: binary vector $m \in \{0,1\}^L$ where

  $$
    m_j = 
    \begin{cases}
      1 & \text{if token } j \text{ is real}\\
      0 & \text{if token } j \text{ is padding}
    \end{cases}
  $$


In [None]:
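import sentencepiece as spm
import torch

sp = spm.SentencePieceProcessor()
sp.load("spiece.model")

max_len = 512

# ASSUMPTION: ID layout of XLM-R-style checkpoints such as the E5 model above:
# <s>=0, <pad>=1, </s>=2, <unk>=3, and every other SentencePiece ID shifted by +1.
# Check these values against your checkpoint's tokenizer before relying on them.
BOS_ID, PAD_ID, EOS_ID, UNK_ID, ID_OFFSET = 0, 1, 2, 3, 1

all_input_ids = []
all_attention_masks = []

for chunk in chunks:
    # Encode to raw SentencePiece IDs, then map them to model IDs
    raw_ids = sp.encode(chunk, out_type=int)
    ids = [UNK_ID if i == sp.unk_id() else i + ID_OFFSET for i in raw_ids]
    # Truncate, leaving room for <s> and </s>
    ids = [BOS_ID] + ids[: max_len - 2] + [EOS_ID]
    # Mask is 1 for real tokens and 0 for padding
    pad_len = max_len - len(ids)
    mask = [1] * len(ids) + [0] * pad_len
    # Pad up to max_len
    ids = ids + [PAD_ID] * pad_len
    all_input_ids.append(ids)
    all_attention_masks.append(mask)

# Convert to tensors
input_ids_tensor = torch.tensor(all_input_ids, dtype=torch.long)
attention_mask_tensor = torch.tensor(all_attention_masks, dtype=torch.long)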


## 4. Embedding via Pre-last-Layer Mean-Pooling

### 4.1 Hidden-State Extraction

* **Forward pass** yields hidden states $\{H^{(0)},H^{(1)},\dots,H^{(N)}\}$, each $H^{(\ell)}\in\mathbb{R}^{B\times L\times D}$

### 4.2 Layer Selection

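* **Penultimate** layer $H^{(N-1)}$ can carry more general-purpose semantic features than the final layer, which is shaped most directly by the training objective; for embedding models fine-tuned with last-layer pooling, compare against $H^{(N)}$ before committing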
* **Alternative**: experiment with weighted sums of multiple layers

### 4.3 Mean-Pooling Operation

* **Masked sum**:

  $$
    s_i = \sum_{j=1}^L H^{(N-1)}_{i,j,:}\times m_{i,j}
  $$
* **Normalization**:

  $$
    e_i = \frac{s_i}{\sum_{j=1}^L m_{i,j}}
  $$
* **Result**: embedding matrix $E\in\mathbb{R}^{N\times D}$ with one row per chunk (here $N$ is the number of chunks from Section 1.2, not the layer count)
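A tiny worked example makes the two formulas concrete. It uses plain Python lists for a single sequence with $L=3$ and $D=2$; the helper name `masked_mean` is an assumption for illustration.

```python
def masked_mean(hidden, mask):
    """Mean over the token axis of one sequence, ignoring padded positions.

    hidden: L rows of D floats; mask: L entries of 0 or 1.
    """
    dim = len(hidden[0])
    summed = [0.0] * dim
    for row, m in zip(hidden, mask):
        for d in range(dim):
            summed[d] += row[d] * m
    count = max(sum(mask), 1)  # avoid division by zero for an all-padding row
    return [s / count for s in summed]


hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
print(masked_mean(hidden, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would pull the mean to roughly `[34.7, 35.3]`, which is why the division uses the count of real tokens rather than $L$.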


In [None]:
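import numpy as np

batch_size = 8  # process chunks in small batches to bound memory use
pooled_batches = []

with torch.no_grad():
    for start in range(0, input_ids_tensor.shape[0], batch_size):
        batch_ids = input_ids_tensor[start : start + batch_size]
        batch_mask = attention_mask_tensor[start : start + batch_size]
        outputs = model(input_ids=batch_ids, attention_mask=batch_mask)
        # hidden_states is a tuple: (embedding output, layer 1, …, layer N)
        # Select the pre-last layer, shape (B, L, D)
        prelast = outputs.hidden_states[-2]
        # (B, L, 1)
        mask = batch_mask.unsqueeze(-1).float()
        # Masked sum over tokens, shape (B, D)
        summed = (prelast * mask).sum(dim=1)
        # Number of real tokens per chunk, shape (B, 1)
        counts = mask.sum(dim=1).clamp(min=1)
        pooled_batches.append(summed / counts)

# One row per chunk, shape (num_chunks, D); FAISS expects float32
embeddings = torch.cat(pooled_batches).cpu().numpy().astype(np.float32)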


## 5. Building the FAISS Index

### 5.1 Index Type: Exact L2

* **`IndexFlatL2`** stores all vectors uncompressed in contiguous memory and compares each query against every stored vector (exact search)
* **Distance metric**: squared Euclidean distance $\|x-y\|_2^2$, computed with optimized SIMD/BLAS kernels
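The link to BLAS comes from a standard expansion of the squared distance. How FAISS schedules the computation internally is not covered here; the identity only shows why a matrix product does most of the work:

```latex
\|x-y\|_2^2 \;=\; \|x\|_2^2 \;-\; 2\,x^\top y \;+\; \|y\|_2^2
```

For a batch of queries $Q$ and the stored matrix $E$, all cross terms $x^\top y$ are the entries of the single product $Q E^\top$, and the squared norms can be computed once per vector.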

### 5.2 Insertion

* Add the embedding matrix $E$ directly with `index.add(E)`; FAISS expects a contiguous `float32` array of shape $(N, D)$

### 5.3 Performance Considerations

* **Memory footprint**: $N\times D\times 4$ bytes for `float32` vectors
* **Batch search throughput**: parallelized over queries
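The footprint formula is easy to evaluate up front. The sketch below assumes `float32` storage and counts only the raw vectors, with one million vectors of dimension 1024 as an illustrative size.

```python
def flat_index_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw vector storage of a flat float32 index: N x D x 4 bytes."""
    return n_vectors * dim * bytes_per_float


size = flat_index_bytes(1_000_000, 1024)
print(size, f"{size / 2**30:.2f} GiB")  # 4096000000 3.81 GiB
```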


In [None]:
import faiss

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)

## 6. Querying: Tokenize, Embed, Retrieve

### 6.1 Query Preprocessing

* Apply identical SentencePiece pipeline and padding/truncation

### 6.2 Query Embedding

* Forward pass $\to$ penultimate hidden state $\to$ mean-pooling
* Yields query vector $q\in\mathbb{R}^D$

### 6.3 Nearest-Neighbor Search

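* **`index.search(q, k)`** expects $q$ as a `float32` array of shape $(1, D)$ and returns a pair (distances, indices) holding the top-$k$ indices $\{i_1,\dots,i_k\}$ in order of increasing distance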
* **Distance**: squared L2 distances $\{d_1,\dots,d_k\}$
* **Mapping**: map indices back to original chunks
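What an exact L2 search returns can be reproduced in a few lines of plain Python. This brute-force sketch illustrates the semantics only and is not how FAISS is implemented; the function name and toy data are assumptions.

```python
import heapq


def flat_l2_search(vectors, query, k):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    top = heapq.nsmallest(k, scored)  # ascending distance, ties broken by index
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices


texts = ["alpha", "beta", "gamma"]
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, query=[1.0, 2.0], k=2)
print(distances, indices)           # [1.0, 5.0] [1, 0]
print([texts[i] for i in indices])  # ['beta', 'alpha']
```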

In [None]:
query = "What are the main topics?"

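# Tokenize the query
# ASSUMPTION: XLM-R-style ID layout (<s>=0, <pad>=1, </s>=2, <unk>=3, other
# SentencePiece IDs shifted by +1); the chunks must be encoded with the same mapping.
raw_q_ids = sp.encode(query, out_type=int)
q_ids = [3 if i == sp.unk_id() else i + 1 for i in raw_q_ids]
# Truncate, leaving room for <s> and </s>
q_ids = [0] + q_ids[: max_len - 2] + [2]
# Mask is 1 for real tokens and 0 for padding
pad_len = max_len - len(q_ids)
q_mask = [1] * len(q_ids) + [0] * pad_len
q_ids = q_ids + [1] * pad_len

q_input_ids = torch.tensor([q_ids], dtype=torch.long)
q_attention_mask = torch.tensor([q_mask], dtype=torch.long)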

# Embed the query
with torch.no_grad():
    # (1, L, D)
    q_out = model(input_ids=q_input_ids, attention_mask=q_attention_mask).hidden_states[-2]  

    # (1, L, 1)
    q_mask2 = q_attention_mask.unsqueeze(-1).float()

    # (1, D)   
    q_sum = (q_out * q_mask2).sum(dim=1)

    # (1, 1)
    q_cnt = q_mask2.sum(dim=1).clamp(min=1)

    # (1, D)
    q_emb = (q_sum / q_cnt).cpu().numpy()

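# Retrieve the top-3 chunks; D holds squared L2 distances, I the chunk indices
D, I = index.search(q_emb, 3)
retrieved_chunks = []
for idx in I[0]:
    # FAISS pads with -1 when the index holds fewer than k vectors
    if idx != -1:
        retrieved_chunks.append(chunks[idx])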


## 7. Generating the Final Answer

### 7.1 Prompt Construction

* Prepend retrieved chunks under a **Context:** heading
* Append the user query under a **Question:** label, followed by an **Answer:** label for the model to complete

### 7.2 LLM Inference

* Use a text-generation model in eval mode
* Specify decoding hyperparameters:

  * **Temperature** $T$
  * **Top-p** (nucleus) sampling
  * **Max tokens** $M$
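How temperature and top-p interact can be sketched on a toy logit vector. This is an illustrative pure-Python version, not the code a particular generation library runs; the function names are assumptions, and `temperature` must be greater than zero.

```python
import math
import random


def top_p_distribution(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: prob} after temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept tokens.
    return {i: probs[i] / mass for i in kept}


def sample(dist, rng):
    """Draw one token index from the filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
dist = top_p_distribution(logits, temperature=0.7, top_p=0.9)
token = sample(dist, random.Random(0))
```

With these settings only the two most likely tokens survive the filter. Lower temperatures sharpen the distribution, so fewer tokens are needed to reach the same `top_p` mass.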

### 7.3 Output Decoding

* Convert token IDs to text
* Post-process whitespace and formatting
* Return coherent, context-grounded response

In [None]:
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device=0,
    max_new_tokens=200,
)

# build prompt explicitly
prompt = "Context:\n"
for segment in retrieved_chunks:
    prompt += segment + "\n"
prompt += "\nQuestion: " + query + "\nAnswer:"

# generate and print only the completion (by default the pipeline echoes the prompt)
result = generator(prompt, return_full_text=False)
print(result[0]["generated_text"])

## 8. The Same Pipeline with LlamaIndex or LangChain

* LlamaIndex and LangChain orchestrate the seven steps above: document loading, chunking, embedding, indexing, retrieval, and answer generation.
* Under the hood, they delegate to specialized libraries for parsing, embedding models, vector search, and large language models.
* A complete RAG pipeline fits in a few lines of code, with no manual glue between the individual tools.
* This speeds up prototyping. The trade-off is less direct control over details such as tokenization, pooling, and index type, which fall back to framework defaults.


In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# 1. Load the PDF (chunking happens inside from_documents)
docs = SimpleDirectoryReader(input_files=["temp.pdf"]).load_data()

# 2–5. Chunk, embed with E5, and index in LlamaIndex's default in-memory vector store
embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large")
index = VectorStoreIndex.from_documents(docs, embed_model=embed_model)

# 6–7. Query + generate answer with Mistral-7B
# Set tokenizer_name explicitly so the tokenizer matches the model
llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
)
query_engine = index.as_query_engine(llm=llm)
print(query_engine.query("What are the main topics?"))



In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

# 1–2. Load & chunk PDF
loader = PyPDFLoader("temp.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3–5. Embed & index
embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
db = FAISS.from_documents(chunks, embeddings)

# 6–7. Build RAG chain and query
llm_pipe = HuggingFacePipeline(pipeline=pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device=0,
    max_new_tokens=200
))
qa = RetrievalQA.from_chain_type(llm=llm_pipe, retriever=db.as_retriever())
print(qa.run("What are the main topics?"))
