# Assignment 3: Advanced Topics in RAG
Objective: Explore improvements and research directions in RAG.
Tasks (choose one):
1. Implement Maximum Marginal Relevance (MMR) for retrieval.
2. Explore self-querying retrievers (queries enhanced by LLMs).
3. Test hybrid retrieval: combining keyword search with embeddings.
4. Write a mini literature review on current challenges in RAG (hallucinations, latency, memory).

# 1. Implement Maximum Marginal Relevance (MMR) for retrieval
**What it is:** MMR is a method that balances relevance and diversity when retrieving documents. Instead of just picking the top-k similar results, it avoids redundancy by selecting results that cover different aspects.
**In the assignment:** You’d take a set of retrieved documents (via embeddings) and re-rank them with MMR, showing how it improves retrieval quality.
**Outcome:** Code that retrieves less redundant, more diverse documents.
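
One way to write the selection rule, using the same convention as the `diversity` parameter of the `mmr` function in this notebook (written here as $\lambda$). Note that some formulations attach $\lambda$ to the relevance term instead, so the meaning of 0 and 1 is swapped there. The symbols $C$ (remaining candidates) and $S$ (already selected documents) are introduced only for this illustration:

```latex
d^{*} \;=\; \arg\max_{d \in C} \Big[ (1-\lambda)\,\mathrm{sim}(d, q) \;-\; \lambda \max_{s \in S} \mathrm{sim}(d, s) \Big]
```

Here $q$ is the query embedding and $\mathrm{sim}$ is cosine similarity. With $\lambda = 0$ the rule reduces to plain top-k by relevance; as $\lambda$ grows, a candidate is penalized more heavily for resembling anything already selected.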

In [2]:
# -----------------------------------------------------------
# Task 1: Maximum Marginal Relevance (MMR) for Retrieval
# -----------------------------------------------------------

# Install required libraries (uncomment if running in Colab/first time)
#!pip install faiss-cpu sentence-transformers

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# -----------------------------------------------------------
# Step 1: Load embedding model
# -----------------------------------------------------------
# We'll use a light Sentence-BERT model for embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

# -----------------------------------------------------------
# Step 2: Create a small document collection
# -----------------------------------------------------------
documents = [
    "The capital of France is Paris.",
    "Paris is known for the Eiffel Tower.",
    "The Louvre Museum is located in Paris.",
    "France is in Europe and has a rich history.",
    "Machine learning enables computers to learn from data.",
    "Deep learning is a subset of machine learning.",
    "Neural networks are used in deep learning models.",
    "Reinforcement learning trains agents via rewards."
]

# Encode documents into embeddings
doc_embeddings = model.encode(documents, convert_to_numpy=True)

# -----------------------------------------------------------
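# Step 3: Build FAISS index for fast retrieval
# (Note: the mmr() function below re-ranks the raw doc_embeddings
#  directly and does not query this index.)
# -----------------------------------------------------------
dim = doc_embeddings.shape[1]  # dimension of embeddings
index = faiss.IndexFlatL2(dim)  # exact search; reports squared L2 distances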
index.add(doc_embeddings)       # add all documents

# -----------------------------------------------------------
# Step 4: Define Maximum Marginal Relevance (MMR) function
# -----------------------------------------------------------
def mmr(query_embedding, doc_embeddings, top_k=5, diversity=0.5):
    """
    Maximum Marginal Relevance (MMR) implementation.

    Args:
        query_embedding: embedding of the user query
        doc_embeddings: matrix of document embeddings
        top_k: number of documents to return
        diversity: trade-off parameter (0 = relevance only, 1 = diversity only)

    Returns:
        indices of selected documents
    """

    # Compute similarity between query and documents
    query_doc_sim = np.dot(doc_embeddings, query_embedding.T)

    # Normalize for cosine similarity
    query_doc_sim = query_doc_sim / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_embeddings, axis=1))

    # Initialize
    selected = []
    candidates = list(range(len(doc_embeddings)))

    for _ in range(min(top_k, len(doc_embeddings))):  # never ask for more documents than exist
        if len(selected) == 0:
            # First pick: most relevant to query
            idx = np.argmax(query_doc_sim)
            selected.append(idx)
            candidates.remove(idx)
        else:
            # Compute marginal relevance
            candidate_sims = query_doc_sim[candidates]
            selected_sims = np.max(
                np.dot(doc_embeddings[candidates], doc_embeddings[selected].T) /
                (np.linalg.norm(doc_embeddings[candidates], axis=1, keepdims=True) * np.linalg.norm(doc_embeddings[selected], axis=1)),
                axis=1
            )

            mmr_score = (1 - diversity) * candidate_sims - diversity * selected_sims
            idx = candidates[np.argmax(mmr_score)]
            selected.append(idx)
            candidates.remove(idx)

    return selected

# -----------------------------------------------------------
# Step 5: Try retrieval with MMR
# -----------------------------------------------------------
query = "Tell me about France and Paris."
query_embedding = model.encode([query], convert_to_numpy=True)[0]

# Get top 5 documents using MMR
selected_indices = mmr(query_embedding, doc_embeddings, top_k=5, diversity=0.6)

print("🔍 Query:", query)
print("\n📄 Retrieved Documents (MMR):")
for idx in selected_indices:
    print("-", documents[idx])


🔍 Query: Tell me about France and Paris.

📄 Retrieved Documents (MMR):
- The capital of France is Paris.
- Reinforcement learning trains agents via rewards.
- France is in Europe and has a rich history.
- Deep learning is a subset of machine learning.
- Paris is known for the Eiffel Tower.
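
To make the effect of the `diversity` setting easier to see, here is a minimal, dependency-free sketch that compares plain top-k with MMR on hand-made 3-D vectors. Everything in it is an assumption for illustration: the toy vectors, the helper names (`cosine`, `top_k_indices`, `mmr_select`, `mean_pairwise_similarity`), and the use of mean pairwise similarity as a rough redundancy measure. It is not the notebook's `mmr` function, although it follows the same scoring rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query, docs, k):
    """Plain relevance ranking: the k documents most similar to the query."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]

def mmr_score(i, query, docs, selected, diversity):
    """(1 - diversity) * relevance - diversity * similarity to the closest selected doc."""
    relevance = cosine(query, docs[i])
    redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
    return (1 - diversity) * relevance - diversity * redundancy

def mmr_select(query, docs, k, diversity):
    """Greedy MMR: repeatedly take the candidate with the best MMR score."""
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda i: mmr_score(i, query, docs, selected, diversity))
        selected.append(best)
        candidates.remove(best)
    return selected

def mean_pairwise_similarity(indices, docs):
    """Average cosine similarity over all pairs in a result list (lower = more diverse)."""
    pairs = [(i, j) for n, i in enumerate(indices) for j in indices[n + 1:]]
    return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)

# Hypothetical toy data: three near-duplicates on "aspect A", one document on
# "aspect B", and one unrelated document. The query touches both aspects.
toy_docs = [
    [1.0, 0.10, 0.0],   # 0: aspect A
    [1.0, 0.15, 0.0],   # 1: aspect A (near-duplicate)
    [1.0, 0.20, 0.0],   # 2: aspect A (near-duplicate)
    [0.05, 1.0, 0.0],   # 3: aspect B
    [0.0, 0.0, 1.0],    # 4: unrelated
]
toy_query = [1.0, 1.0, 0.0]

plain = top_k_indices(toy_query, toy_docs, k=3)
mild = mmr_select(toy_query, toy_docs, k=3, diversity=0.3)
strong = mmr_select(toy_query, toy_docs, k=3, diversity=0.5)

for name, result in [("top-k", plain), ("MMR 0.3", mild), ("MMR 0.5", strong)]:
    print(name, result, round(mean_pairwise_similarity(result, toy_docs), 3))
```

On these toy vectors, plain top-3 returns the three near-duplicates, `diversity=0.3` swaps one of them for the aspect-B document, and `diversity=0.5` also pulls in the unrelated vector. That last case may be what happens in the run above, where an off-topic document is ranked second at `diversity=0.6`: once the penalty is strong enough, being dissimilar to the selected set can outweigh being relevant.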


# 2. Explore self-querying retrievers (queries enhanced by LLMs)
**What it is:** Normally, retrieval uses the raw user query. A self-querying retriever uses an LLM to rewrite or enrich the query (e.g., adding keywords, synonyms, or structured filters).
**In the assignment:** You’d build a pipeline where an LLM transforms the query before retrieval, then compare results with plain retrieval.
**Outcome:** Code plus a short analysis showing how query rewriting affects retrieval performance.
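
The "structured filters" variant mentioned above can be sketched without calling a model at all. In this minimal example the LLM response is a hard-coded JSON string, and the `topic` metadata field, the JSON shape (`query` plus `filter`), and the helper names `parse_structured_query` and `apply_filter` are all assumptions made for illustration. A real pipeline would prompt a model to emit this JSON and then run embedding search over the documents that survive the filter.

```python
import json

# Hypothetical corpus: each document carries a "topic" metadata field.
DOCS = [
    {"text": "The capital of France is Paris.", "topic": "geography"},
    {"text": "The Louvre Museum is located in Paris.", "topic": "culture"},
    {"text": "Deep learning is a subset of machine learning.", "topic": "ml"},
]

def parse_structured_query(llm_output, fallback_query):
    """Parse the model's JSON; fall back to the raw query with no filter if it is malformed."""
    try:
        parsed = json.loads(llm_output)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
        return {
            "query": str(parsed.get("query") or fallback_query),
            "filter": dict(parsed.get("filter") or {}),
        }
    except (ValueError, TypeError):
        # json.JSONDecodeError is a ValueError, so bad JSON lands here too.
        return {"query": fallback_query, "filter": {}}

def apply_filter(docs, metadata_filter):
    """Keep only documents whose metadata matches every key/value pair in the filter."""
    return [
        d for d in docs
        if all(d.get(key) == value for key, value in metadata_filter.items())
    ]

# Stand-in for a model response (assumed format, not produced by a real LLM).
fake_llm_output = '{"query": "museums", "filter": {"topic": "culture"}}'

structured = parse_structured_query(fake_llm_output, "Which museums are in Paris?")
survivors = apply_filter(DOCS, structured["filter"])
print(structured["query"], [d["text"] for d in survivors])
```

The fallback path matters in practice: small models often return text that is not valid JSON, and in that case the sketch degrades to plain retrieval instead of failing.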

In [4]:
# -----------------------------------------------------------
# Task 2: Self-Querying Retriever with Free Local LLM
# -----------------------------------------------------------

# Install libraries (only once)
# !pip install sentence-transformers faiss-cpu transformers

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# -----------------------------------------------------------
# Step 1: Load models
# -----------------------------------------------------------
# Embedding model for retrieval
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Local LLM for query rewriting (Flan-T5 small is free & fast)
query_rewriter = pipeline("text2text-generation", model="google/flan-t5-small")

# -----------------------------------------------------------
# Step 2: Create sample documents
# -----------------------------------------------------------
documents = [
    "The capital of France is Paris.",
    "Paris is known for the Eiffel Tower.",
    "The Louvre Museum is located in Paris.",
    "France is in Europe and has a rich history.",
    "Machine learning enables computers to learn from data.",
    "Deep learning is a subset of machine learning.",
    "Neural networks are used in deep learning models.",
    "Reinforcement learning trains agents via rewards."
]

doc_embeddings = embedder.encode(documents, convert_to_numpy=True)

# Build FAISS index
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings)

# -----------------------------------------------------------
# Step 3: Query Rewriting with Local LLM
# -----------------------------------------------------------
def enhance_query_llm(query: str) -> str:
    """
    Use Flan-T5 to expand the query with related keywords only.
    """
    prompt = f"Expand this search query with related terms and synonyms, only as a list of keywords:\nQuery: {query}\nKeywords:"

    result = query_rewriter(
        prompt,
        max_new_tokens=20,     # keep short
        do_sample=False        # greedy decoding: deterministic, so no temperature is needed
    )

    return result[0]['generated_text']


# -----------------------------------------------------------
# Step 4: Retrieval function
# -----------------------------------------------------------
def retrieve(query, k=3, use_self_query=True):
    if use_self_query:
        enhanced = enhance_query_llm(query)
        print("🤖 Enhanced Query:", enhanced)
        query_to_use = enhanced
    else:
        print("📝 Original Query:", query)
        query_to_use = query

    q_emb = embedder.encode([query_to_use], convert_to_numpy=True)
    distances, indices = index.search(q_emb, k)
    return [documents[i] for i in indices[0]]

# -----------------------------------------------------------
# Step 5: Run test
# -----------------------------------------------------------
query = "Tell me about Paris."

print("\n=== Retrieval WITHOUT Self-Query ===")
print(retrieve(query, use_self_query=False))

print("\n=== Retrieval WITH Self-Query (LLM) ===")
print(retrieve(query, use_self_query=True))



=== Retrieval WITHOUT Self-Query ===
📝 Original Query: Tell me about Paris.
['The capital of France is Paris.', 'Paris is known for the Eiffel Tower.', 'France is in Europe and has a rich history.']

=== Retrieval WITH Self-Query (LLM) ===
🤖 Enhanced Query: Paris is a city in the U.S. state of New York.
['The capital of France is Paris.', 'Paris is known for the Eiffel Tower.', 'The Louvre Museum is located in Paris.']
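
In the run above the rewriter returned a full sentence rather than keywords, and the sentence is factually wrong. One cautious pattern is to append the rewriter's new terms to the original query instead of replacing it, so the user's own wording is always embedded. The sketch below is an assumption-laden illustration: `merge_queries` and `max_extra_terms` are hypothetical names, and the approach does not remove wrong terms, it only guarantees the original ones are kept.

```python
import re

def merge_queries(original, rewritten, max_extra_terms=8):
    """Return the original query followed by up to max_extra_terms new words from the rewrite."""
    seen = {token.lower() for token in re.findall(r"\w+", original)}
    extra = []
    for token in re.findall(r"\w+", rewritten):
        if len(extra) >= max_extra_terms:
            break
        key = token.lower()
        if key not in seen:       # skip words the query already has
            seen.add(key)
            extra.append(token)
    if not extra:
        return original           # empty or fully redundant rewrite: keep the query as-is
    return f"{original} {' '.join(extra)}"

merged = merge_queries("Tell me about Paris.", "Paris, Eiffel Tower, France")
print(merged)
```

The merged string can then be passed to the embedder in place of the bare rewrite, which makes the comparison between plain and enhanced retrieval less sensitive to a single bad generation.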


# 3. Test hybrid retrieval: combining keyword search with embeddings
**What it is:** There are two main retrieval approaches: keyword search (BM25, TF-IDF) and semantic embeddings (vector search). Hybrid retrieval combines both to improve accuracy.
**In the assignment:** You’d run retrieval with both methods, merge results (e.g., weighted scoring), and compare against using each method alone.
**Outcome:** Code that demonstrates hybrid search and results comparison.
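
The notebook code for this task uses TF-IDF for the keyword side. For comparison, here is a minimal sketch of Okapi BM25 scoring in plain Python. The assumptions are: regex word tokenization with lowercasing, no stop-word removal, the common defaults `k1=1.5` and `b=0.75`, the IDF variant with `+ 1` inside the logarithm (which keeps scores non-negative), and a non-empty corpus. The names `tokenize` and `bm25_scores` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer (an assumption; real systems often stem and drop stop words)."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))  # each distinct query term counted once

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

documents = [
    "The capital of France is Paris.",
    "Paris is known for the Eiffel Tower.",
    "The Louvre Museum is located in Paris.",
    "France is in Europe and has a rich history.",
    "Machine learning enables computers to learn from data.",
    "Deep learning is a subset of machine learning.",
    "Neural networks are used in deep learning models.",
    "Reinforcement learning trains agents via rewards.",
]

scores = bm25_scores("Paris France", documents)
for score, doc in sorted(zip(scores, documents), reverse=True)[:3]:
    print(f"({score:.4f}) {doc}")
```

BM25 scores are unbounded, unlike cosine similarities, so they would need rescaling before being mixed into a weighted sum with embedding scores. Because no stop words are removed here, a query word such as "and" counts as a match and may change the ranking.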

In [5]:
# -----------------------------------------------------------
# Task 3: Hybrid Retrieval (Keyword + Embeddings)
# -----------------------------------------------------------

# Install libraries if needed
# !pip install sentence-transformers faiss-cpu scikit-learn

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# -----------------------------------------------------------
# Step 1: Dataset
# -----------------------------------------------------------
documents = [
    "The capital of France is Paris.",
    "Paris is known for the Eiffel Tower.",
    "The Louvre Museum is located in Paris.",
    "France is in Europe and has a rich history.",
    "Machine learning enables computers to learn from data.",
    "Deep learning is a subset of machine learning.",
    "Neural networks are used in deep learning models.",
    "Reinforcement learning trains agents via rewards."
]

query = "Tell me about Paris and France."

# -----------------------------------------------------------
# Step 2: Keyword Search (TF-IDF)
# -----------------------------------------------------------
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Encode query with TF-IDF
query_vec_tfidf = vectorizer.transform([query])

# Compute cosine similarity
keyword_scores = cosine_similarity(query_vec_tfidf, tfidf_matrix)[0]

# -----------------------------------------------------------
# Step 3: Embedding Search (Sentence-BERT + FAISS)
# -----------------------------------------------------------
embedder = SentenceTransformer("all-MiniLM-L6-v2")

doc_embeddings = embedder.encode(documents, convert_to_numpy=True)
query_embedding = embedder.encode([query], convert_to_numpy=True)

dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings)

distances, indices = index.search(query_embedding, len(documents))

# Convert FAISS squared L2 distances to similarity scores: 1 / (1 + distance)
embedding_scores = 1 / (1 + distances[0])

# Reorder to align with document indices
embedding_scores_full = np.zeros(len(documents))
for rank, idx in enumerate(indices[0]):
    embedding_scores_full[idx] = embedding_scores[rank]

# -----------------------------------------------------------
# Step 4: Hybrid Scoring
# -----------------------------------------------------------
# Weighted sum: alpha controls importance of embeddings vs keywords
alpha = 0.5  # 0.5 = equal weight, can tune
hybrid_scores = alpha * embedding_scores_full + (1 - alpha) * keyword_scores

# Get top-k results
top_k = 5
ranked_indices = np.argsort(hybrid_scores)[::-1][:top_k]

# -----------------------------------------------------------
# Step 5: Show Results
# -----------------------------------------------------------
print("🔍 Query:", query)

print("\n=== Keyword Search Results ===")
for i in np.argsort(keyword_scores)[::-1][:top_k]:
    print(f"({keyword_scores[i]:.4f}) {documents[i]}")

print("\n=== Embedding Search Results ===")
for i in np.argsort(embedding_scores_full)[::-1][:top_k]:
    print(f"({embedding_scores_full[i]:.4f}) {documents[i]}")

print("\n=== Hybrid Search Results ===")
for i in ranked_indices:
    print(f"({hybrid_scores[i]:.4f}) {documents[i]}")


🔍 Query: Tell me about Paris and France.

=== Keyword Search Results ===
(0.4462) France is in Europe and has a rich history.
(0.4233) The capital of France is Paris.
(0.1586) The Louvre Museum is located in Paris.
(0.1514) Paris is known for the Eiffel Tower.
(0.0000) Machine learning enables computers to learn from data.

=== Embedding Search Results ===
(0.6699) The capital of France is Paris.
(0.6111) France is in Europe and has a rich history.
(0.5566) Paris is known for the Eiffel Tower.
(0.4893) The Louvre Museum is located in Paris.
(0.3474) Machine learning enables computers to learn from data.

=== Hybrid Search Results ===
(0.5466) The capital of France is Paris.
(0.5287) France is in Europe and has a rich history.
(0.3540) Paris is known for the Eiffel Tower.
(0.3240) The Louvre Museum is located in Paris.
(0.1737) Machine learning enables computers to learn from data.
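
The weighted sum above mixes two scores on different scales (TF-IDF cosine similarity and `1 / (1 + distance)`). A different fusion technique that sidesteps the scale question is Reciprocal Rank Fusion (RRF), which uses only each document's rank in every list. The sketch below is not what the notebook does; it is an alternative shown for comparison. The function name `reciprocal_rank_fusion` is hypothetical, `k=60` is the commonly quoted default, and the two rankings are the document indices read off the keyword and embedding results printed above.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids; returns ids sorted best first.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). Ties are broken by the smaller document id so the
    output is deterministic.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Document indices in the order shown in the printed results above.
keyword_ranking = [3, 0, 2, 1, 4]
embedding_ranking = [0, 3, 1, 2, 4]

fused_order = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused_order)
```

With only two rankers that swap the same pairs of documents, the fused scores tie pairwise and the tie-break decides the order, so this small example mainly shows the mechanics. On larger collections, where the two methods disagree in less symmetric ways, rank-based fusion is a simple baseline to compare against the weighted sum.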


# 4. Mini literature review on RAG challenges (hallucinations, latency, memory)
**What it is:** Instead of coding, this is a writing task. You’d read 3–5 recent papers or articles and summarize challenges in RAG.
**Main issues to cover:**

***Hallucinations*** – LLMs generating facts not in retrieved docs.
***Latency*** – retrieval + generation makes responses slower.
***Memory*** – handling large context windows or long conversations.


**In the assignment:** You’d write 1–2 pages reviewing these problems and possible research directions.
**Outcome:** A short research-style write-up.
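
If the write-up needs a concrete illustration of the hallucination problem, a very crude lexical check can serve as one: measure what fraction of the answer's content words also appear in the retrieved documents. This is only a sketch under strong assumptions. The stop-word list is a tiny hand-picked set, `support_ratio` and `content_tokens` are hypothetical names, and word overlap is a weak proxy that misses paraphrases and cannot detect wrong claims built from words that do appear in the context.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "of", "in", "and", "to", "for"})

def content_tokens(text):
    """Lowercased word tokens with stop words removed."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def support_ratio(answer, retrieved_docs):
    """Fraction of the answer's content words that occur somewhere in the retrieved docs."""
    context_vocab = {t for doc in retrieved_docs for t in content_tokens(doc)}
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 1.0  # nothing to check, so nothing is unsupported
    supported = sum(1 for t in answer_tokens if t in context_vocab)
    return supported / len(answer_tokens)

context = ["The capital of France is Paris."]
grounded = support_ratio("Paris is the capital of France.", context)
ungrounded = support_ratio("Paris is the capital of Germany.", context)
print(grounded, ungrounded)
```

A low ratio flags an answer for closer inspection; a high ratio does not prove the answer is faithful.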