<a href="https://colab.research.google.com/github/Yokitha-07/RAG/blob/main/RetrievalMethods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
%%capture
# Installs required libraries:
# rank_bm25 → for BM25 ranking algorithm.
# sentence-transformers → for dense vector embeddings.
# faiss-cpu → for efficient similarity search (not directly used here).
# %%capture hides the installation output in Jupyter/Colab; as a cell magic it must be the first line of the cell.
!pip install rank_bm25 sentence-transformers faiss-cpu

In [12]:
#Imports the BM25 ranking model, transformer model for embeddings, and NumPy for math operations.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

In [13]:
docs = [
    "The Eiffel Tower is located in Paris, France.",
    "The Leaning Tower of Pisa is in Italy.",
    "Paris is the capital of France, known for art, fashion and history.",
    "Paris is the French capital",
    "The Great Wall of China is visible from space.",
    "France is famous for its cuisine and wines."
]

In [14]:
# Tokenization: Converts each document into lowercase word lists.
# BM25Okapi: Creates the BM25 model using the tokenized docs.
# Query scoring: BM25 scores each document based on how well it matches the query.

tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

query = "capital of France"
scores = bm25.get_scores(query.lower().split())
for i, s in sorted(enumerate(scores), key=lambda x: -x[1]):
    print(f"{i+1}: {docs[i]} (score={s:.2f})")


6: France is famous for its cuisine and wines. (score=1.32)
4: Paris is the French capital (score=0.72)
3: Paris is the capital of France, known for art, fashion and history. (score=0.49)
1: The Eiffel Tower is located in Paris, France. (score=0.00)
2: The Leaning Tower of Pisa is in Italy. (score=0.00)
5: The Great Wall of China is visible from space. (score=0.00)


BM25 combines inverse document frequency (IDF), term frequency, and document length to calculate the ranking score. As a rough intuition:

```
Score ≈ IDF / Doc_Length
```

IDF quantifies how rare a word is across the entire collection, giving higher importance to words that appear in fewer documents and lower importance to words that are common in many documents.
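
To make the scoring concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. The function name `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration; `rank_bm25` may handle edge cases differently (for example, very common terms whose IDF would be negative).

```python
import math

docs = [
    "The Eiffel Tower is located in Paris, France.",
    "The Leaning Tower of Pisa is in Italy.",
    "Paris is the capital of France, known for art, fashion and history.",
    "Paris is the French capital",
    "The Great Wall of China is visible from space.",
    "France is famous for its cuisine and wines.",
]

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    scores = []
    for doc in tokenized_docs:
        score = 0.0
        for term in query_tokens:
            # Document frequency: how many documents contain the term.
            df = sum(1 for d in tokenized_docs if term in d)
            # Rare terms get a high IDF, common terms a low one.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5))
            # Term frequency, dampened by k1 and normalized by document length.
            tf = doc.count(term)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

tokenized_docs = [doc.lower().split() for doc in docs]
scores = bm25_scores("capital of france".split(), tokenized_docs)
for i, s in sorted(enumerate(scores), key=lambda x: -x[1]):
    print(f"{i+1}: {docs[i]} (score={s:.2f})")
```

On this small corpus the sketch gives the same top three scores as the BM25 cell above (1.32, 0.72, 0.49). Note that `of` contributes nothing: it appears in half of the documents, so its IDF is zero.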

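To BM25, `French` and `France` are unrelated tokens. With whitespace tokenization, even `France.` and `France,` (with trailing punctuation) fail to match the query token `france`, which is why the Eiffel Tower document scores 0.00.

Sparse retrieval does not understand synonyms or related meanings: it matches words, not ideas.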
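
The exact-match behaviour can be seen at the token level. The sketch below compares the whitespace tokenization used above with a simple regex tokenizer; `tokenize` is a hypothetical helper written for this illustration, not part of `rank_bm25`.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text.lower())

doc = "The Eiffel Tower is located in Paris, France."
naive = doc.lower().split()   # whitespace split keeps punctuation attached
clean = tokenize(doc)         # regex split drops punctuation

print(naive[-1])   # france.
print(clean[-1])   # france
print("france" in naive, "france" in clean)        # False True
print("french" in tokenize("capital of France"))   # False
```

Stripping punctuation recovers exact matches such as `France.` versus `france`, but `french` and `france` remain different tokens, so the vocabulary gap stays.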

Dense Retrieval with Sentence Transformer

In [15]:
# Load a small pretrained sentence-embedding model (downloaded on first use).
model = SentenceTransformer("all-MiniLM-L6-v2")

In [16]:
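# Encode documents and query as L2-normalized embeddings,
# so the dot product below equals cosine similarity.
doc_embeddings = model.encode(docs, normalize_embeddings=True)

query = "capital of France"
query_embedding = model.encode([query], normalize_embeddings=True)

# Similarity of every document to the query, printed best-first.
scores = np.dot(doc_embeddings, query_embedding.T).squeeze()
for i, s in sorted(enumerate(scores), key=lambda x: -x[1]):
    print(f"{i+1}: {docs[i]} (similarity={s:.2f})")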

4: Paris is the French capital (similarity=0.89)
3: Paris is the capital of France, known for art, fashion and history. (similarity=0.75)
6: France is famous for its cuisine and wines. (similarity=0.56)
1: The Eiffel Tower is located in Paris, France. (similarity=0.43)
2: The Leaning Tower of Pisa is in Italy. (similarity=0.17)
5: The Great Wall of China is visible from space. (similarity=0.05)


Dense retrieval captures synonyms and related meanings: it matches ideas, not words. Here the `French` document ranks first even though it never contains the word `France`.
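
The dense cell scores documents with a plain dot product. Because the embeddings are normalized to unit length, that dot product is the cosine similarity. A minimal sketch with toy vectors (made up for illustration, not real model output; `l2_normalize` is a hypothetical helper):

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Toy 3-dimensional vectors, invented for this example.
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Cosine similarity computed from the raw vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the normalized vectors: the same number.
dot_of_normalized = np.dot(l2_normalize(a), l2_normalize(b))

print(round(float(cosine), 2))                 # 0.96
print(np.isclose(cosine, dot_of_normalized))   # True
```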



✅ Summary

This code demonstrates how:

* BM25 retrieves documents using exact word overlap.

* Sentence Transformers retrieve documents based on semantic similarity.

Dense retrieval is better at capturing contextual meaning, which makes it well suited to modern semantic search systems and RAG (Retrieval-Augmented Generation) applications.
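
In a RAG pipeline, the highest-ranked documents are typically passed to the generator as context. A minimal sketch of that selection step, assuming a hypothetical helper `top_k` and using the similarity values printed by the dense retrieval cell above:

```python
def top_k(docs, scores, k=2):
    """Return the k highest-scoring (document, score) pairs, best first."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

docs = [
    "The Eiffel Tower is located in Paris, France.",
    "The Leaning Tower of Pisa is in Italy.",
    "Paris is the capital of France, known for art, fashion and history.",
    "Paris is the French capital",
    "The Great Wall of China is visible from space.",
    "France is famous for its cuisine and wines.",
]
# Similarity values copied from the dense retrieval output above.
dense_scores = [0.43, 0.17, 0.75, 0.89, 0.05, 0.56]

context = top_k(docs, dense_scores, k=2)
for doc, score in context:
    print(f"{score:.2f}  {doc}")
```

The value of `k` is a design choice: a larger `k` gives the generator more context at the cost of more irrelevant text.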