# 📚 Week 19 – Advanced RAG: Hybrid Retrieval & Evaluation Metrics

---

## 🎯 Objectives

This week, you'll:
- Understand **Hybrid Retrieval** (dense + sparse)
- Implement BM25 + FAISS hybrid retrievers
- Combine retrieval scores
- Evaluate RAG with simple metrics (retrieval recall for the context, ROUGE for the answer)

---

## 🧠 What is Hybrid Retrieval?

**Dense retrieval** (like FAISS + embeddings):
- Captures semantic meaning
- Misses exact keyword matches sometimes

**Sparse retrieval** (like BM25):
- Captures exact token matches
- Misses semantic similarities sometimes

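👉 **Hybrid** retrieval runs both retrievers and merges their results:
- Score each document with both retrievers and combine the scores
- Usually improves recall; the effect on precision depends on how the scores are weighted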

---
## 🧠 Why Hybrid Retrieval?

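The two retrievers fail in different places:
- Dense models (like Sentence Transformers) match paraphrases, but can rank a document containing a rare exact term (an acronym such as "BM25") below loosely related text.
- Sparse models (like BM25) reward exact token overlap, but give a paraphrase with no shared words a score of zero.

**Hybrid Retrieval = Dense + Sparse → Better Recall**, because a document only needs to be found by one of the two retrievers to enter the candidate set.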
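
Written as a formula, the weighted combination that this notebook's `hybrid_retrieve` function applies can be sketched as follows. The symbols are notation introduced here for illustration: $s_{\text{dense}}$ is the cosine similarity from FAISS, $s_{\text{sparse}}$ is the BM25 score, and $\alpha$ is the `alpha` parameter.

```latex
s_{\text{hybrid}}(q, d) = \alpha \, s_{\text{dense}}(q, d) + (1 - \alpha) \, s_{\text{sparse}}(q, d),
\qquad 0 \le \alpha \le 1
```

Setting $\alpha = 1$ gives dense-only ranking and $\alpha = 0$ gives sparse-only ranking. A document returned by only one retriever contributes zero for the missing term.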

---

## 🔧 Setup

In [None]:
!pip install -q faiss-cpu rank_bm25 rouge-score sentence-transformers transformers


In [None]:
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from rank_bm25 import BM25Okapi
from rouge_score import rouge_scorer
import faiss
import numpy as np

---

## 📄 Define the Corpus

In [None]:
corpus = [
    "Transformers revolutionized NLP with attention mechanisms.",
    "BM25 is a ranking function used by search engines.",
    "Deep learning models require large datasets for training.",
    "FAISS allows efficient similarity search over vector databases.",
    "The Eiffel Tower was completed in 1889.",
    "PyTorch and TensorFlow are popular deep learning frameworks."
]

---

## 🔍 Create Dense Retriever (FAISS + SentenceTransformer)

In [None]:
# Encode the corpus, then L2-normalize so that inner product equals cosine similarity
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
dense_embeddings = dense_model.encode(corpus, convert_to_numpy=True)
dense_embeddings = dense_embeddings / np.linalg.norm(dense_embeddings, axis=1, keepdims=True)

In [None]:
# Exact (brute-force) inner-product index; fine for a corpus this small
dense_index = faiss.IndexFlatIP(dense_embeddings.shape[1])
dense_index.add(dense_embeddings)

In [None]:
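def dense_retrieve(query, k=3):
    """Return the top-k (document, cosine similarity) pairs for the query."""
    query_embedding = dense_model.encode([query], convert_to_numpy=True)
    # Normalize the query the same way as the corpus embeddings
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    scores, indices = dense_index.search(query_embedding, k)
    return [(corpus[i], float(s)) for i, s in zip(indices[0], scores[0])]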
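
`dense_retrieve` relies on the fact that the inner product of two unit-length vectors equals their cosine similarity, which is why both the corpus and the query embeddings are normalized before using `IndexFlatIP`. A minimal pure-Python check of that identity, using made-up toy vectors rather than real embeddings:

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (illustrative only, not model output)
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

inner_product_of_unit_vectors = dot(l2_normalize(a), l2_normalize(b))
print(abs(inner_product_of_unit_vectors - cosine(a, b)) < 1e-9)
```

If the normalization step were skipped, longer vectors would receive larger inner products regardless of direction, and the scores would no longer be bounded by 1.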

---

## 🔍 Create Sparse Retriever (BM25)

In [None]:
# Whitespace tokenization: simple, but punctuation stays attached ("engines.", "BM25?")
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
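
One thing to watch with `.lower().split()`: punctuation stays glued to the neighboring word, so a query token like `bm25?` will not match the corpus token `bm25`. A minimal sketch of a stricter tokenizer follows; `tokenize` is a hypothetical helper that is not part of the original notebook, and it assumes ASCII text like this corpus.

```python
import re

def tokenize(text):
    """Lowercase and keep only runs of ASCII letters/digits (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", text.lower())

print("What is BM25?".lower().split())  # ['what', 'is', 'bm25?']
print(tokenize("What is BM25?"))        # ['what', 'is', 'bm25']
```

To try it, the corpus and the query would both need the same treatment, for example `BM25Okapi([tokenize(doc) for doc in corpus])` and `bm25.get_scores(tokenize(query))`.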

In [None]:
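def sparse_retrieve(query, k=3):
    """Return the top-k (document, BM25 score) pairs for the query."""
    tokenized_query = query.lower().split()  # must match the corpus tokenization
    scores = bm25.get_scores(tokenized_query)
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(corpus[i], float(scores[i])) for i in top_k]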
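
To see roughly what happens inside `bm25.get_scores`, here is a small pure-Python sketch of the BM25 scoring formula. It is an illustration, not the `rank_bm25` implementation: `bm25_scores` is a hypothetical name, the `k1` and `b` defaults are common choices assumed here, and the IDF uses a non-negative variant, so the numbers may differ from what `BM25Okapi` returns.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher IDF weight (non-negative variant)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

toy_docs = [
    ["bm25", "is", "a", "ranking", "function"],
    ["deep", "learning", "models", "need", "data"],
]
print(bm25_scores(["bm25", "ranking"], toy_docs))
```

Only the first toy document shares tokens with the query, so it is the only one with a non-zero score. That is the "exact match or nothing" behavior that dense retrieval complements.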

---

## 🧩 Combine Dense + Sparse → Hybrid Retrieval

In [None]:
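def hybrid_retrieve(query, k_dense=3, k_sparse=3, alpha=0.5):
    """Merge dense and sparse results with a weighted sum.

    alpha weights the dense score and (1 - alpha) the sparse score.
    Note: the raw scores are on different scales (cosine similarity is at
    most 1, BM25 is unbounded), so alpha=0.5 is not a true 50/50 balance.
    """
    dense_results = dense_retrieve(query, k=k_dense)
    sparse_results = sparse_retrieve(query, k=k_sparse)

    scores = {}

    # Add dense scores
    for doc, score in dense_results:
        scores[doc] = scores.get(doc, 0) + alpha * score

    # Add sparse scores
    for doc, score in sparse_results:
        scores[doc] = scores.get(doc, 0) + (1 - alpha) * score

    # Sort combined scores and keep the top k_dense documents
    ranked_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked_docs[:k_dense]]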
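
Because cosine similarities and BM25 scores live on different scales, one common refinement is to rescale each retriever's scores to [0, 1] before applying `alpha`. The sketch below is one possible way to do that with min-max normalization; `min_max` and `fuse_normalized` are hypothetical helpers, and the toy scores are invented for illustration. It takes lists of `(doc, score)` pairs in the same shape that `dense_retrieve` and `sparse_retrieve` return.

```python
def min_max(scores):
    """Rescale a {doc: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: give full credit only if they are positive
        return {doc: (1.0 if hi > 0 else 0.0) for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse_normalized(dense_results, sparse_results, alpha=0.5, k=3):
    """Weighted sum of min-max normalized dense and sparse scores."""
    dense = min_max(dict(dense_results))
    sparse = min_max(dict(sparse_results))
    combined = {
        doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
        for doc in set(dense) | set(sparse)
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Toy (doc, score) pairs, invented for illustration
dense_toy = [("doc_a", 0.9), ("doc_b", 0.1)]
sparse_toy = [("doc_b", 12.0), ("doc_c", 2.0)]
print(fuse_normalized(dense_toy, sparse_toy, alpha=0.7))  # ['doc_a', 'doc_b', 'doc_c']
```

A known drawback of min-max scaling is that the lowest-scoring document in each list is mapped to 0, even if its raw score was close to the others.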

---

## 🧪 Test Hybrid Retriever


In [None]:
query = "What is BM25?"
docs = hybrid_retrieve(query)

print("🔍 Query:", query)
print("📚 Retrieved Documents:")
for doc in docs:
    print("-", doc)

🔍 Query: What is BM25?
📚 Retrieved Documents:
- BM25 is a ranking function used by search engines.
- PyTorch and TensorFlow are popular deep learning frameworks.
- Deep learning models require large datasets for training.
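
The weighted sum in `hybrid_retrieve` depends on the raw score scales. A rank-based alternative that sidesteps the scale question is Reciprocal Rank Fusion (RRF), where each document earns `1 / (rrf_k + rank)` from every list it appears in. This is a sketch of the general technique, not something the notebook uses; `reciprocal_rank_fusion` is a hypothetical helper, the toy rankings are invented, and `rrf_k=60` is a commonly used default assumed here.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60, top_k=3):
    """Fuse several ranked lists of documents using reciprocal ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rrf_k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Toy rankings, best first (illustrative only)
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))  # ['doc_b', 'doc_a', 'doc_d']
```

`doc_b` wins because it appears near the top of both lists, even though neither retriever's raw score is consulted.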


---

## 🤖 Generator: FLAN-T5 for Answer Generation


In [None]:
generator = pipeline("text2text-generation", model="google/flan-t5-base")


In [None]:
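def generate_answer(query, retrieved_docs):
    """Answer the query with the generator, using the retrieved docs as context."""
    context = " ".join(retrieved_docs)
    prompt = f"question: {query} context: {context}"
    # Greedy decoding (do_sample=False) keeps the output deterministic
    output = generator(prompt, max_length=64, do_sample=False)
    return output[0]['generated_text']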

---

## 🧪 End-to-End Hybrid RAG Demo


In [None]:
query = "When was the Eiffel Tower completed?"
retrieved_docs = hybrid_retrieve(query)
generated_answer = generate_answer(query, retrieved_docs)

print("📥 Query:", query)
print("📚 Context:")
for doc in retrieved_docs:
    print("-", doc)
print("🧠 Generated Answer:", generated_answer)

📥 Query: When was the Eiffel Tower completed?
📚 Context:
- The Eiffel Tower was completed in 1889.
- BM25 is a ranking function used by search engines.
- FAISS allows efficient similarity search over vector databases.
🧠 Generated Answer: 1889


---

## 📏 How to Evaluate RAG Quality


You care about:
- Retrieval Recall: Did you fetch good context?
- Generation Accuracy: Was the answer correct?
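
Retrieval recall can be measured directly once you have, for each query, a hand-labeled set of documents that should be retrieved. A minimal sketch follows; `recall_at_k` is a hypothetical helper, and the "relevant" label below is an assumption made for illustration, not a label from a real dataset.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant must contain at least one document")
    top_k = set(retrieved[:k])
    return sum(1 for doc in relevant if doc in top_k) / len(relevant)

# Context retrieved for "When was the Eiffel Tower completed?" in the demo above
retrieved_example = [
    "The Eiffel Tower was completed in 1889.",
    "BM25 is a ranking function used by search engines.",
    "FAISS allows efficient similarity search over vector databases.",
]
# Assumed gold label: the single document that answers the question
relevant_example = ["The Eiffel Tower was completed in 1889."]

print(recall_at_k(retrieved_example, relevant_example, k=3))  # 1.0
```

Averaging this value over a set of labeled queries gives a single retrieval score that can be compared across dense-only, sparse-only, and hybrid setups.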

---

## 📊 Evaluate Generated Answer with ROUGE


### ✏️ Define References and Compute Scores

In [None]:
# Ground truth reference
reference = "The Eiffel Tower was completed in 1889."

In [None]:
# Predicted/generated text
prediction = generated_answer

In [None]:
# ROUGE scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)

In [None]:
print("🔎 ROUGE Scores:")
for metric, score in scores.items():
    print(f"{metric}: Precision={score.precision:.2f}, Recall={score.recall:.2f}, F1={score.fmeasure:.2f}")

🔎 ROUGE Scores:
rouge1: Precision=1.00, Recall=0.14, F1=0.25
rouge2: Precision=0.00, Recall=0.00, F1=0.00
rougeL: Precision=1.00, Recall=0.14, F1=0.25


## 🧠 Notes on Evaluation

---

### 📊 ROUGE Score Summary

| Metric   | Precision | Recall | F1 Score |
|----------|-----------|--------|----------|
| ROUGE-1  | 1.00      | 0.14   | 0.25     |
| ROUGE-2  | 0.00      | 0.00   | 0.00     |
| ROUGE-L  | 1.00      | 0.14   | 0.25     |

---

### 🔍 Key Observations

- **ROUGE-1**:
  - Precision is 1.00 because the only generated token ("1889") appears in the reference.
  - Recall is 0.14 because that token covers just 1 of the 7 reference tokens (1/7 ≈ 0.14).
  - This pattern is typical when the answer is correct but much shorter than the reference.

- **ROUGE-2**:
  - Zero overlap in bigrams (2-word phrases).
  - A one-token answer contains no bigrams at all, so ROUGE-2 is 0 here regardless of correctness.
  - Low ROUGE-2 is also common for abstractive models or rephrasings.

- **ROUGE-L**:
  - Identical to ROUGE-1 here: with a one-token answer, the longest common subsequence is that single token.
  - Captures in-order matching sequences but is limited by the same low recall.
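
The ROUGE-1 numbers can be reproduced by hand, which makes the 1/7 recall concrete. The sketch below is a simplified re-implementation for illustration only: `rouge1` is a hypothetical helper, it uses a plain regex tokenizer, and it skips the stemming that `rouge_scorer` applies, which makes no difference for this particular pair.

```python
import re
from collections import Counter

def rouge1(reference, prediction):
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    pred_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    # Clipped overlap: each token counts at most as often as it occurs in both
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / sum(pred_counts.values()) if pred_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1("The Eiffel Tower was completed in 1889.", "1889")
print(f"Precision={p:.2f}, Recall={r:.2f}, F1={f:.2f}")
# Precision=1.00, Recall=0.14, F1=0.25
```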

---

### ✅ Interpretation for Short Factual Answers

- ROUGE can **underestimate quality** if the generated text is correct but shorter than the reference or worded differently.
- **Manual inspection** is necessary for small, factual answers.
- High precision but low recall often means "correct but incomplete" or "correct but differently phrased."

---

### 🚀 Tips for Evaluation

- Always combine **automatic metrics** with **manual review**.
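- For short answers, consider adding:
  - **Exact Match (EM)**: 1 if the normalized answer equals the reference, 0 otherwise. It works best with a short reference (e.g., "1889") rather than a full sentence.
  - **BLEU**: Measures n-gram precision against the reference; originally designed for machine translation.
  - **BERTScore**: Embedding-based semantic similarity (better for free-text generations).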
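
Exact Match is easy to sketch in plain Python. The version below follows a common convention of lowercasing, stripping punctuation, and dropping English articles before comparing; `normalize_answer` and `exact_match` are hypothetical helpers, and the normalization rules are an assumption rather than a fixed standard.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(reference))

print(exact_match("1889", "1889."))                                    # 1
print(exact_match("1889", "The Eiffel Tower was completed in 1889."))  # 0
```

The second call shows why EM needs a short gold answer: against the full-sentence reference used for ROUGE above, a correct answer of "1889" still scores 0.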

---

📚 **Summary**:  
Use ROUGE carefully. It works best for longer generations like summaries.  
For RAG-style short factual answers, rely on a **combination of ROUGE, BLEU, EM, and manual checks**.

---


## 📝 Exercises

1. Try varying `alpha` in hybrid retrieval from 0.2 to 0.8.  
   → Does generation quality change?
   
2. Test with a larger corpus (e.g., Wikipedia sections) and see whether retrieval needs scaling.

3. Compare RAG performance:
   - Dense only
   - Sparse only
   - Hybrid
   
4. Fine-tune a SentenceTransformer model on a custom domain (advanced).

---

➡️ Coming up next: **Week 20 – RAG Capstone🚀**