
This notebook builds the retriever for our multi-agent RAG QA project.
We'll:
- Load the HotpotQA dataset and flatten its contexts into individual documents
- Generate sentence embeddings using BGE (`BAAI/bge-large-en`)
- Index the embeddings using FAISS for fast semantic search
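
As a rough illustration of what the semantic search step computes, here is a minimal NumPy sketch of exact top-k inner-product search. It is not the FAISS code used in this project: `top_k_inner_product` and the toy vectors are hypothetical stand-ins, and the vectors are assumed to be unit length so that inner product equals cosine similarity. A flat inner-product FAISS index (`faiss.IndexFlatIP`) performs the same exact comparison, only faster.

```python
import numpy as np


def top_k_inner_product(doc_vecs: np.ndarray, query_vec: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k documents with the highest inner product."""
    scores = doc_vecs @ query_vec      # one similarity score per document
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]      # indices sorted by descending score
    return top, scores[top]


# Toy 2-D "embeddings", already unit length (stand-ins for real BGE vectors).
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
toy_query = np.array([0.8, 0.6], dtype="float32")

toy_ids, toy_scores = top_k_inner_product(toy_docs, toy_query, k=2)
print(toy_ids, toy_scores)
```

The brute-force version scans every document for every query, which is fine for a sanity check on a few thousand vectors but is the cost a dedicated index is meant to reduce.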

In [2]:
!pip install torch transformers accelerate bitsandbytes \
    datasets faiss-cpu sentence-transformers \
    langchain langgraph uvicorn fastapi rich tqdm



In [3]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle
import os
from tqdm import tqdm

In [5]:
import json

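# Read explicitly as UTF-8 so the result does not depend on the platform's default encoding
with open("../data/raw/hotpot_train_v1.1.json", "r", encoding="utf-8") as f:
    hotpot_data = json.load(f)

# Preview the first sample
print(hotpot_data[0]["question"])
print(hotpot_data[0]["context"][0])  # one context entry: [title, list of sentences]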

Which magazine was started first Arthur's Magazine or First for Women?
['Radio City (Indian radio station)', ["Radio City is India's first private FM radio station and was started on 3 July 2001.", ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).', ' It plays Hindi, English and regional songs.', ' It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.', ' Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.', ' The Radio station currently plays a mix of Hindi and Regional music.', ' Abraham Thomas is the CEO of the company.']]


In [6]:
# Flatten every context into individual documents.
# Each context entry is [title, list of sentences], so one document = one sentence.
documents = []

for sample in hotpot_data:
    for title, sentences in sample["context"]:
        for sentence in sentences:
            documents.append({
                "title": title,
                "text": sentence
            })

print(f"Total documents: {len(documents)}")
print(documents[0])

Total documents: 3703344
{'title': 'Radio City (Indian radio station)', 'text': "Radio City is India's first private FM radio station and was started on 3 July 2001."}
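
The flattening above yields one document per sentence. If paragraph-level documents are preferred, a possible variant is sketched below: it joins each context entry's sentences back into a single paragraph and skips exact repeats, since the same paragraph may appear in the contexts of several questions. `flatten_paragraphs` and the toy samples are hypothetical and only mimic the `[title, list of sentences]` layout shown in the preview.

```python
def flatten_paragraphs(samples):
    """Join each context entry's sentences into one paragraph and drop exact repeats."""
    seen = set()
    docs = []
    for sample in samples:
        for title, sentences in sample["context"]:
            # Sentences after the first carry their own leading space,
            # so plain concatenation restores the paragraph.
            text = "".join(sentences).strip()
            key = (title, text)
            if text and key not in seen:
                seen.add(key)
                docs.append({"title": title, "text": text})
    return docs


# Toy input: the paragraph titled "A" appears in two samples.
toy_samples = [
    {"context": [["A", ["First sentence.", " Second sentence."]]]},
    {"context": [["A", ["First sentence.", " Second sentence."]],
                 ["B", ["Only one."]]]},
]

toy_paragraph_docs = flatten_paragraphs(toy_samples)
print(len(toy_paragraph_docs))
print(toy_paragraph_docs[0])
```

Paragraph-level documents give the retriever more context per hit and a smaller corpus to embed, at the price of coarser matches than sentence-level documents.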


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm

model = SentenceTransformer("BAAI/bge-large-en")
model.max_seq_length = 512  # longer inputs are truncated to 512 tokens

emb_store = []
id_store = []  # position of each embedded text in `documents`

# Embed only the first N documents to limit memory use and run time
N = 200_000  # adjust based on your system
for i in tqdm(range(min(N, len(documents)))):
    text = documents[i]["text"]
    # Unit-length vectors: inner product then equals cosine similarity
    vec = model.encode(text, normalize_embeddings=True)
    emb_store.append(vec)
    id_store.append(i)

# Stack into a float32 array of shape (num_docs, dim), the dtype FAISS expects
embs = np.stack(emb_store).astype("float32")

 32%|███▏      | 64350/200000 [45:07<1:25:49, 26.34it/s]   
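
Encoding one text per call, as in the loop above, leaves most of the model's throughput unused. A minimal sketch of chunked encoding follows. `iter_batches`, `embed_in_batches`, and `fake_encode` are hypothetical helpers, and `fake_encode` is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Iterator, List, Sequence

import numpy as np


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(texts: Sequence[str],
                     encode_fn: Callable[[List[str]], Sequence],
                     batch_size: int = 64) -> np.ndarray:
    """Encode `texts` chunk by chunk and stack the results into one float32 array."""
    chunks = [np.asarray(encode_fn(batch), dtype="float32")
              for batch in iter_batches(texts, batch_size)]
    return np.vstack(chunks)


def fake_encode(batch: List[str]) -> List[List[float]]:
    # Stand-in encoder (NOT a real model): character count and word count per text.
    return [[float(len(t)), float(len(t.split()))] for t in batch]


toy_embs = embed_in_batches(["a b", "cc", "d e f"], fake_encode, batch_size=2)
print(toy_embs.shape)
```

With the model loaded above, `encode_fn` could be `lambda batch: model.encode(batch, normalize_embeddings=True)`. `model.encode` also accepts a list of texts and a `batch_size` argument directly, so the helper is mainly useful when the corpus should be processed and saved in chunks.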