RAG Pipeline:

    A[User Query] --> B[Query Preprocessing]
    B --> C[Query Embedding]
    D[Document(s)] --> E[Document Loading & Parsing]
    E --> F[Text Chunking]
    F --> G[Chunk Embedding]
    G --> H[(Vector Store)]
    C --> I[Similarity Search]
    H --> I
    I --> J[Reranking]
    J --> K[Context Retrieval]
    K --> L[Prompt Engineering]
    A --> L
    L --> M[Large Language Model (LLM)]
    M --> N[Answer Generation]
    N --> O[Answer Post-processing]
    O --> P[Final Answer & Sources]

Installing Dependencies

In [None]:
!pip install symspellpy
!pip install PyMuPDF
!pip install faiss-cpu  # CPU build of FAISS
# condacolab is needed to install faiss-gpu; dependency issues prevent installing it directly with pip
!pip install -q condacolab
!pip install "numpy<2"  # also required for faiss
# Run this download only the first time; repeated runs create duplicate files
!wget https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell/frequency_dictionary_en_82_765.txt
# Set up conda so that faiss-gpu can be installed in the next cell
import condacolab
condacolab.install()

Collecting symspellpy
  Downloading symspellpy-6.9.0-py3-none-any.whl.metadata (3.9 kB)
Collecting editdistpy>=0.1.3 (from symspellpy)
  Downloading editdistpy-0.1.6-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading symspellpy-6.9.0-py3-none-any.whl (2.6 MB)
Downloading editdistpy-0.1.6-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)
Installing collected packages: editdistpy, symspellpy
Successfully installed editdistpy-0.1.6 symspellpy-6.9.0
Collecting PyMuPDF
  Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.0-cp39-abi3-manyli

In [None]:
!mamba install -c pytorch -c nvidia -c conda-forge faiss-gpu cudatoolkit=11.8


Looking for: ['faiss-gpu', 'cudatoolkit=11.8']

pytorch/linux-64 (check zst)                       Checked  0.1s
pytorch/noarch                                      10.2kB @ 104.6kB/s  0.1s

In [None]:
# Paste your API key for the free DeepSeek R1 model here
my_api_key = ""

In [None]:
# Query Preprocessor
from query_preprocessor import QueryPreprocessor

preprocessor = QueryPreprocessor(
  min_query_length=2,
  max_query_length=256,
  enable_spell_check=True
)

query = "What's   the mechanism of COVID-19 vaccination?   What evne is covid? How do we stpo it?"
processed_query = preprocessor.preprocess(query)
print(processed_query)

['what is the mechanism of covid-19 vaccination?', 'what even is covid?', 'how do we stop it?']
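
The output above shows the query being lowercased, stripped of extra whitespace, expanded ("what's" becomes "what is"), spell-corrected, and split into one string per question. The internals of QueryPreprocessor are not shown in this notebook, so the following is only a minimal sketch of the non-spell-check steps. The function name split_and_normalize and the CONTRACTIONS table are hypothetical, not part of the query_preprocessor module.

```python
import re

# Hypothetical contraction table; the real QueryPreprocessor may use a different one.
CONTRACTIONS = {"what's": "what is", "how's": "how is", "it's": "it is"}


def split_and_normalize(query, min_len=2, max_len=256):
    """Lowercase, collapse whitespace, expand contractions, split into sentences."""
    text = re.sub(r"\s+", " ", query.strip().lower())
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Split after sentence-ending punctuation, keeping the punctuation mark.
    parts = re.split(r"(?<=[.?!])\s+", text)
    # Drop fragments outside the allowed length range.
    return [p for p in parts if min_len <= len(p) <= max_len]


print(split_and_normalize("What's   the mechanism of COVID-19 vaccination?   How do we stop it?"))
# ['what is the mechanism of covid-19 vaccination?', 'how do we stop it?']
```

Spell correction is left out on purpose; in the notebook it is presumably handled by symspellpy with the frequency dictionary downloaded earlier.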


In [None]:
# Document Loader
from document_loader import DocumentLoader

loader = DocumentLoader()
sample_pdf = "chocolate_cake_recipe.pdf"

text, metadata = loader.load_document(sample_pdf)

print(f"Successfully parsed {metadata['page_count']} pages from {sample_pdf}")
print(f"Title: {metadata.get('title', 'N/A')}")
print(f"Cleaned Text:\n {text}")

Successfully parsed 3 pages from chocolate_cake_recipe.pdf
Title: Chocolate Cake | RecipeTin Eats
Cleaned Text:
 29/3/18, 12)34 pm
Chocolate Cake | RecipeTin Eats
Page 1 of 3
https://www.recipetineats.com/?p=28074&preview=true
Prep Time
10 mins
Cook Time
35 mins
Total Time
45 mins
Chocolate Cake
 
This is the everyday Chocolate Cake I make over and over again. The crumb is
tender and moist, it truly tastes of chocolate (rarer than you might think!) and
you only need one bowl and a whisk. It's the famous Hershey's "Perfectly
Chocolate" Cake and quite possibly the only recipe on this entire site that I use as
written, without any changes to the ingredients (but don't skip my baking tips in
the notes!). Recipe VIDEO below.
Servings: 8 -10 slices
Author: Nagi
Ingredients
2 cups / 440g white sugar (Note 1)
1 3/4 cups / 265g plain / all purpose flour
3/4 cup / 55g cocoa powder , unsweetened (Note 2)
1 1/2 tsp baking powder
1 1/2 tsp baking soda
1 tsp salt
2 eggs (~55-65g / 2 oz each)
1 cup / 

In [None]:
# Text Chunker
from text_chunker import TextChunker

chunker = TextChunker(chunk_size=512, chunk_overlap=128) # standard values
chunks = chunker.chunk_text(text, metadata)

print(f"Generated {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
  print(f"\nChunk {i+1} (Chars: {len(chunk['text'])}):")
  print(chunk['text'])
  print(f"Metadata: {chunk['metadata']}")

Generated 13 chunks:

Chunk 1 (Chars: 501):
29/3/18, 12)34 pm
Chocolate Cake | RecipeTin Eats
Page 1 of 3
https://www.recipetineats.com/?p=28074&preview=true
Prep Time
10 mins
Cook Time
35 mins
Total Time
45 mins
Chocolate Cake
 
This is the everyday Chocolate Cake I make over and over again. The crumb is
tender and moist, it truly tastes of chocolate (rarer than you might think!) and
you only need one bowl and a whisk. It's the famous Hershey's "Perfectly
Chocolate" Cake and quite possibly the only recipe on this entire site that I use as
Metadata: {'page_count': 3, 'author': 'Nagi Maehashi', 'title': 'Chocolate Cake | RecipeTin Eats', 'chunk_index': 0}

Chunk 2 (Chars: 500):
Chocolate" Cake and quite possibly the only recipe on this entire site that I use as
written, without any changes to the ingredients (but don't skip my baking tips in
the notes!). Recipe VIDEO below.
Servings: 8 -10 slices
Author: Nagi
Ingredients
2 cups / 440g white sugar (Note 1)
1 3/4 cups / 265g plain / all p
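
Chunk 2 starts with the last line of Chunk 1, which is the overlap at work. The sketch below shows the basic idea with a plain character-based sliding window. It is an assumption about the general technique, not the TextChunker implementation: the chunks printed above are 501 and 500 characters rather than exactly 512, which suggests the real chunker also snaps to line or word boundaries. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=512, chunk_overlap=128):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({"text": text[start:start + chunk_size], "chunk_index": index})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_with_overlap(sample)
print(len(pieces), [len(p["text"]) for p in pieces])
```

With these defaults each window advances by 384 characters, so the last 128 characters of one chunk are repeated at the start of the next.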

In [None]:
# Embedding Generator
from embedding_generator import EmbeddingGenerator

embedder = EmbeddingGenerator(model_name="mpnet")

# Sanity check: the two machine-learning sentences should be more similar
# to each other than the first one is to the unrelated sentence.
from sklearn.metrics.pairwise import cosine_similarity

samples = [
    "Neural networks are computing systems inspired by biological brains.",
    "Transformer models use attention mechanisms to process text.",
    "Chicken is a great source of protein."
]
embeddings = embedder.embed_text(samples)

print("Similarity between sample texts:")
print(cosine_similarity([embeddings[0]], [embeddings[1]]))
print(cosine_similarity([embeddings[0]], [embeddings[2]]))

# Embed the document chunks (this replaces the sample embeddings above)
embeddings = embedder.embed_text(chunks)

print(f"Using model: {embedder.get_model_info()['name']}")
print(f"Embedding dimensions: {embedder.embedding_size}")
print(f"Generated {len(embeddings)} embeddings")
print(f"First embedding vector (length {len(embeddings[0])}):")
print(embeddings[0][:10])  # Show first 10 dimensions

Similarity between sample texts:
[[0.32122013]]
[[0.06740446]]
Using model: mpnet
Embedding dimensions: 768
Generated 13 embeddings
First embedding vector (length 768):
[ 0.05461106 -0.02150607  0.01168926 -0.01334951 -0.07141707  0.00237363
 -0.08287673 -0.00474756 -0.06211413  0.04496194]
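
The two similarity scores above (about 0.32 for the related pair and 0.07 for the unrelated one) are cosine similarities: the dot product of two vectors divided by the product of their lengths. As a reference for what scikit-learn's cosine_similarity computes for a single pair, here is a small pure-Python sketch. The function name cosine is hypothetical, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch convention: similarity with a zero vector is 0
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, which matches the ordering of the two printed scores.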


In [None]:
# Vector Store
from vector_store import VectorStore

vector_str = VectorStore(dimension=768, index_path="my_index")

vector_str.add_chunks(embeddings, chunks)
print(f"Index contains {vector_str.get_index_size()} chunks")

query = "What's the best cocoa powder to use in this recipe?"
query_embedding = embedder.embed_text(query)

results = vector_str.search(query_embedding, k=5)
print("\nTop results:")
for i, res in enumerate(results):
  print(f"\nResult #{i+1} (Similarity: {res['similarity']:.4f})")
  print(f"Text: {res['chunk']}...")
  print(f"Metadata: {res['metadata']}")

vector_str.save_index()

Index contains 13 chunks

Top results:

Result #1 (Similarity: 0.5892)
Text: recipe up by 50%).
Recipe Notes
1. I use caster / superfine out of habit for all baking recipes, but regular is ok too.
2. Regular cocoa powder words just fine here, but dutch processed will make it a slightly more intense
chocolate flavour. I use regular for this cake. 
3. SPRINGFORM PAN (important): Even the best ones are not 100% leakproof so with very thin batters like
with this cake, you will get a small amount of leakage. The best way to combat this is to "plug" the space...
Metadata: {'page_count': 3, 'author': 'Nagi Maehashi', 'title': 'Chocolate Cake | RecipeTin Eats', 'chunk_index': 5}

Result #2 (Similarity: 0.4688)
Text: Chocolate" Cake and quite possibly the only recipe on this entire site that I use as
written, without any changes to the ingredients (but don't skip my baking tips in
the notes!). Recipe VIDEO below.
Servings: 8 -10 slices
Author: Nagi
Ingredients
2 cups / 440g white sugar (Note 1)
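
The search step returns the k chunks whose embeddings are most similar to the query embedding. VectorStore is backed by FAISS in this notebook, but its internals are not shown, so the following is only a brute-force sketch of the same idea in pure Python. It assumes the stored vectors are L2-normalized, so that the inner product equals cosine similarity. The names l2_normalize and search, and the tuple layout of the index, are hypothetical.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query_vec, index, k=5):
    """index is a list of (unit_vector, chunk_text, metadata) tuples."""
    q = l2_normalize(query_vec)
    scored = (
        {
            "chunk": chunk,
            "metadata": meta,
            # Inner product of unit vectors equals cosine similarity.
            "similarity": sum(a * b for a, b in zip(q, vec)),
        }
        for vec, chunk, meta in index
    )
    return heapq.nlargest(k, scored, key=lambda r: r["similarity"])


index = [
    (l2_normalize([1.0, 0.0]), "cocoa powder notes", {"chunk_index": 0}),
    (l2_normalize([0.0, 1.0]), "oven temperature", {"chunk_index": 1}),
    (l2_normalize([1.0, 1.0]), "ingredients list", {"chunk_index": 2}),
]
for hit in search([1.0, 0.1], index, k=2):
    print(f"{hit['similarity']:.4f}  {hit['chunk']}")
```

A scan like this compares the query against every stored vector, which is fine for the 13 chunks here; an index library such as FAISS becomes worthwhile as the collection grows.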

In [None]:
# Reranker
from reranker import Reranker

ranker = Reranker()
reranked = ranker.rerank(query, results, top_k=5)

for i, result in enumerate(reranked):
  print(f"Result {i+1}: {result}")

Result 1: {'chunk': 'recipe up by 50%).\nRecipe Notes\n1. I use caster / superfine out of habit for all baking recipes, but regular is ok too.\n2. Regular cocoa powder words just fine here, but dutch processed will make it a slightly more intense\nchocolate flavour. I use regular for this cake. \n3. SPRINGFORM PAN (important): Even the best ones are not 100% leakproof so with very thin batters like\nwith this cake, you will get a small amount of leakage. The best way to combat this is to "plug" the space', 'metadata': {'page_count': 3, 'author': 'Nagi Maehashi', 'title': 'Chocolate Cake | RecipeTin Eats', 'chunk_index': 5}, 'similarity': 0.5891648530960083, 'relevance': 4.114064693450928}
Result 2: {'chunk': 'Chocolate" Cake and quite possibly the only recipe on this entire site that I use as\nwritten, without any changes to the ingredients (but don\'t skip my baking tips in\nthe notes!). Recipe VIDEO below.\nServings: 8 -10 slices\nAuthor: Nagi\nIngredients\n2 cups / 440g white sugar 
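
Each reranked result keeps its original 'similarity' and gains a 'relevance' score, and the results are ordered by that new score. The scoring model used by Reranker is not shown in this notebook. The sketch below only illustrates the rescoring-and-sorting step, with a deliberately simple token-overlap function standing in for the real scorer. Both rerank and token_overlap are hypothetical names.

```python
def token_overlap(query, passage):
    """Stand-in scorer: fraction of query words that also appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(query, results, score_fn, top_k=5):
    """Attach a 'relevance' score to each result and keep the top_k highest."""
    rescored = [{**r, "relevance": score_fn(query, r["chunk"])} for r in results]
    rescored.sort(key=lambda r: r["relevance"], reverse=True)
    return rescored[:top_k]


candidates = [
    {"chunk": "baking soda and salt", "similarity": 0.9},
    {"chunk": "regular cocoa powder works fine", "similarity": 0.5},
]
for item in rerank("which cocoa powder", candidates, token_overlap, top_k=1):
    print(item)
```

Retrieving more candidates than needed and then rescoring them is the same pattern the full pipeline follows with top_k = 30 and final_k = 5.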

In [None]:
# Prompt Engineer
from prompt_engineer import PromptEngineer

prompt_eng = PromptEngineer()
prompt = prompt_eng.format_prompt(processed_query=query, context_chunks=reranked)

print("Generated Prompt: ")
print(prompt)

Generated Prompt: 
You are an expert research assistant. Your task is to answer questions based ONLY on the provided context.
RULES:
1. Answer the question using ONLY the context provided
2. If the question cannot be answered with the context, say "I could not find an answer in the provided document(s)"
3. Be concise but comprehensive
4. Never invent information not present in the context

CONTEXT DOCUMENTS:
### CONTEXT 1 [Relevance: 4.11]
CONTENT: recipe up by 50%).
Recipe Notes
1. I use caster / superfine out of habit for all baking recipes, but regular is ok too.
2. Regular cocoa powder words just fine here, but dutch processed will make it a slightly more intense
chocolate flavour. I use regular for this cake. 
3. SPRINGFORM PAN (important): Even the best ones are not 100% leakproof so with very thin batters like
with this cake, you will get a small amount of leakage. The best way to combat this is to "plug" the space

### CONTEXT 2 [Relevance: 0.28]
CONTENT: Chocolate" Cake and qu
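
The generated prompt has a fixed instruction header followed by one numbered block per reranked chunk, each labelled with its relevance score. A minimal sketch of how such a prompt could be assembled is given below. It is not the PromptEngineer implementation: the function name format_prompt as a free function, the character-based budget, and the QUESTION/ANSWER footer are assumptions, since the printed prompt is cut off before its end.

```python
# Instruction header copied from the prompt printed above.
SYSTEM_RULES = (
    "You are an expert research assistant. Your task is to answer questions "
    "based ONLY on the provided context.\n"
    "RULES:\n"
    "1. Answer the question using ONLY the context provided\n"
    '2. If the question cannot be answered with the context, say "I could not '
    'find an answer in the provided document(s)"\n'
    "3. Be concise but comprehensive\n"
    "4. Never invent information not present in the context\n"
)


def format_prompt(question, context_chunks, max_content_length=160000):
    """Build a prompt from reranked chunks, stopping at a character budget."""
    blocks, used = [], 0
    for i, item in enumerate(context_chunks, start=1):
        block = (
            f"### CONTEXT {i} [Relevance: {item['relevance']:.2f}]\n"
            f"CONTENT: {item['chunk']}\n"
        )
        if used + len(block) > max_content_length:
            break  # stop once the character budget is exhausted
        blocks.append(block)
        used += len(block)
    context = "\n".join(blocks)
    # Assumed footer: the notebook output is truncated before this part.
    return f"{SYSTEM_RULES}\nCONTEXT DOCUMENTS:\n{context}\nQUESTION: {question}\nANSWER:"


print(format_prompt("Which cocoa powder?", [{"chunk": "Regular cocoa powder.", "relevance": 4.114}]))
```

Because the chunks arrive sorted by relevance, cutting off at the budget drops the least relevant context first.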

In [None]:
# Full implementation
import time
from deepseek_llm import DeepSeekLLM

# top_k: number of candidates retrieved from the vector store
# final_k: number of chunks kept after reranking
top_k = 30
final_k = 5

# init pipeline parts
preprocessor = QueryPreprocessor(
  min_query_length=2,
  max_query_length=256,
  enable_spell_check=True
)
loader = DocumentLoader()
chunker = TextChunker(chunk_size=512, chunk_overlap=128)
embedder = EmbeddingGenerator(model_name="mpnet")
vector_str = VectorStore(dimension=768, index_path="my_index")
ranker = Reranker()
prompt_eng = PromptEngineer(max_content_length=160000)
llm = DeepSeekLLM(api_key=my_api_key)

# user input
doc_path = input("Hello! Welcome to the Document Question Answering Model by Vedik Upadhyay. Please enter the path for the document you wish to use: ")
query = input("What's your question?\n")

# pipeline
start = time.time()
questions = preprocessor.preprocess(query)
print(f"Processing query... {time.time() - start:.2f}s")

start = time.time()
text, metadata = loader.load_document(doc_path)
print(f"Loading text... {time.time() - start:.2f}s")

start = time.time()
chunks = chunker.chunk_text(text, metadata)
print(f"Chunking context... {time.time() - start:.2f}s")

start = time.time()
embeddings = embedder.embed_text(chunks)
print(f"Embedding chunks... {time.time() - start:.2f}s")

start = time.time()
vector_str.add_chunks(embeddings, chunks)
print(f"Storing embeddings... {time.time() - start:.2f}s")

for q in questions:
  start = time.time()  # reset per question so the printed time is not cumulative
  q_embed = embedder.embed_text(q)
  results = vector_str.search(q_embed, k=top_k)
  reranked = ranker.rerank(q, results, top_k=final_k)
  prompt = prompt_eng.format_prompt(processed_query=q, context_chunks=reranked)
  answer = llm.answer_query(prompt)
  print(f"Answering Questions... {time.time() - start:.2f}s")
  print(f"\n\nQuestion: {q}\nAnswer: {answer}")

Hello! Welcome to the Document Question Answering Model by Vedik Upadhyay. Please enter the path for the document you wish to use: nytimes_article.pdf
What's your question?
Is the World Cup going to be canceled? Why do fans fear immigration?
Processing query... 0.00s
Loading text... 0.03s
Chunking context... 0.00s
Embedding chunks... 0.19s
Storing embeddings... 0.00s
Answering Questions... 14.53s


Question: is the world cup going to be cancelled
Answer: Based solely on the provided context:

1.  **The World Cup itself is not cancelled.** Context 2 explicitly states: "It is expected to draw about 6.5 million people, mostly to the United States, where most matches will be played" and Context 3 mentions "next summer’s World Cup".
2.  **A specific game or event related to the World Cup was cancelled.** Context 1 describes the cancellation of a game  due to fears of immigration raids targeting fans without legal status. This cancellation is described as a "preview" of how immigration polic