# Module 4: Advanced Retrieval Techniques

*Part of the RCD Workshops series: Retrieval-Augmented Generation (RAG) for Advanced Research Applications*

---

Good retrieval is essential for RAG! In this module, you'll learn about enhancements far beyond simple vector search.


## Why go beyond the basics?
If your pipeline can't fetch the right info, the LLM can't answer correctly. Let's look at state-of-the-art research & practical strategies to improve retrieval in RAG systems.

### 4.1 Hybrid Search (Semantic + Lexical)

Pure vector retrieval (using embeddings) sometimes misses exact matches for names, numbers, or domain jargon. This is the **lexical gap**. Hybrid search combines **keyword-based retrieval** (like BM25 or even simple word overlap) with **vector-based retrieval**. Keyword search is precise for rare terms, while vector search handles synonyms and related concepts. Together, they cover each other's weaknesses.

> **Diagram placeholder:** Schematic of hybrid search: keyword engine + vector engine whose results are merged before sending to LLM.


**Demo: Let's simulate a hybrid search for a specific keyword (e.g. '3°C').**
In a real system, you could use a library like Whoosh (BM25) or a search engine like Elasticsearch, or filter vector results by keyword presence. Here we combine vector retrieval with a plain-Python keyword filter.

In [None]:
# Hybrid search demonstration
# `encoder`, `index`, and `docs` are assumed to be defined in earlier cells
import numpy as np

query = "What is the GDP loss at 3°C warming?"
query_vec = encoder.encode([query])
# Normalize the query vector to unit length
query_vec = query_vec / np.linalg.norm(query_vec)

# Vector search: indices of the top-2 nearest documents
D, I = index.search(query_vec, k=2)
candidates = [int(i) for i in I[0]]

# Keyword match: add any document containing '3°C' that vector search missed
keyword = "3°C"
for idx, text in enumerate(docs):
    if keyword in text and idx not in candidates:
        candidates.append(idx)

print("Candidates by hybrid criteria:", candidates)
# You can then rerank these candidates by similarity or other heuristics
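
The demo above simply appends keyword matches to the vector hits. One common way to merge two ranked lists into a single ranking is reciprocal rank fusion (RRF). The sketch below is an illustration rather than part of the workshop code: the function name `rrf_merge`, the example ID lists, and the constant `k=60` (a frequently used default) are all assumptions.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results: doc IDs ranked by each retriever
vector_hits = [2, 0, 3]
keyword_hits = [1, 2]

fused = rrf_merge([vector_hits, keyword_hits])
print(fused)
```

Documents found by both retrievers rise to the top, and the fusion needs only ranks, so the keyword and vector scores never have to be put on a common scale.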


### 4.2 Reranking with Cross-Encoders

Fast embedding models encode the query and each passage separately, so their similarity score is only a rough estimate of relevance. A **cross-encoder reranker** (like a small BERT or the LLM itself) reads a (query, passage) pair together and judges relevance more accurately, at a higher cost per pair.
In practice: use vector search to get the top-k, then rerank only those with a cross-encoder. 🏅

> *Relevant models:* `cross-encoder/ms-marco-MiniLM-L-6-v2` and related models on Hugging Face.
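
A minimal sketch of the retrieve-then-rerank pattern, assuming the top-k passages are already in hand. To stay runnable without downloading a model, it uses a toy word-overlap scorer as a stand-in; `rerank`, `toy_overlap_score`, and the example passages are hypothetical names and data, not workshop code. The closing comment shows how a real cross-encoder from `sentence-transformers` could be plugged in instead.

```python
def rerank(query, passages, score_fn, top_n=3):
    """Re-order retrieved passages by a pairwise relevance score.

    score_fn takes a list of (query, passage) pairs and returns one score
    per pair, which is the call shape a cross-encoder expects.
    """
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda ps: ps[1], reverse=True)
    return ranked[:top_n]

def toy_overlap_score(pairs):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(query.lower().split())
        p_words = set(passage.lower().split())
        scores.append(len(q_words & p_words) / len(q_words))
    return scores

# Hypothetical top-k passages returned by vector search
retrieved = [
    "Sea level rise threatens coastal cities.",
    "GDP loss at 3°C warming is projected to be substantial.",
    "Warming affects crop yields.",
]
query = "GDP loss at 3°C warming"

reranked = rerank(query, retrieved, toy_overlap_score, top_n=2)
for passage, score in reranked:
    print(f"{score:.2f}  {passage}")

# With sentence-transformers installed, a real cross-encoder could be swapped in:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   reranked = rerank(query, retrieved, model.predict)
```

Keeping the scorer as a parameter means the same reranking step works with a toy heuristic, a cross-encoder, or an LLM-based judge.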

### 4.3 Query Expansion and Reformulation

Short or vague user queries? Expand or rephrase them (with an LLM prompt or classic IR techniques) to increase your chances of hitting relevant docs.
- E.g.: "health effects of PM2.5" → add synonyms and related phrases for a richer search.
- This can be done with classic relevance feedback, or by prompting an LLM to "rewrite the query" before retrieval.
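
As a rough illustration of the classic (non-LLM) route, the sketch below expands a query from a hand-written synonym table. The function `expand_query` and the `SYNONYMS` table are hypothetical and exist only for this example; in practice the related terms might come from a domain thesaurus or from an LLM rewrite.

```python
def expand_query(query, synonyms):
    """Append related terms for any phrase in `synonyms` found in the query."""
    query_lower = query.lower()
    extra_terms = []
    for phrase, related in synonyms.items():
        if phrase.lower() in query_lower:
            for term in related:
                # Skip terms already present in the query or already added
                if term.lower() not in query_lower and term not in extra_terms:
                    extra_terms.append(term)
    if not extra_terms:
        return query
    return query + " " + " ".join(extra_terms)

# Hypothetical, hand-written synonym table for illustration only
SYNONYMS = {
    "PM2.5": ["fine particulate matter", "air pollution"],
    "health effects": ["health impacts", "morbidity"],
}

expanded = expand_query("health effects of PM2.5", SYNONYMS)
print(expanded)
```

The expanded string can be passed to the keyword engine, the embedding model, or both. Expansion tends to raise recall at some cost in precision, so it pairs well with a reranking step.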


### 4.4 Domain-Specific Embeddings

General embedding models may miss field-specific meanings. For research, consider field-tuned embedders (e.g., SciBERT, BioBERT, SPECTER).

*If your work is biomedical, legal, or patent-focused, use a specialized encoder!*

### 4.5 Efficient Indexing and Filters
- Large corpora: use metadata (year, topic, author) to filter/scope docs before search.
- Maintain separate indexes by category/topic if relevant.

*This reduces irrelevant retrieval, improving both speed and accuracy!*
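
A small sketch of metadata pre-filtering, assuming each document is stored as a dict with `text` and `meta` keys. That layout, the function `filter_by_metadata`, and the sample corpus are assumptions made for this example. Only the surviving document IDs would then be passed to vector search.

```python
def filter_by_metadata(docs, **criteria):
    """Return indices of docs whose metadata satisfies every criterion.

    A criterion is either a plain value (exact match) or a callable that
    receives the metadata value and returns True or False.
    """
    def matches(meta, field, wanted):
        if field not in meta:
            return False
        if callable(wanted):
            return bool(wanted(meta[field]))
        return meta[field] == wanted

    return [
        i for i, doc in enumerate(docs)
        if all(matches(doc["meta"], f, w) for f, w in criteria.items())
    ]

# Hypothetical corpus with metadata attached to each document
corpus = [
    {"text": "Carbon pricing review", "meta": {"year": 2015, "topic": "economics"}},
    {"text": "Warming and GDP", "meta": {"year": 2022, "topic": "economics"}},
    {"text": "Coral bleaching survey", "meta": {"year": 2023, "topic": "ecology"}},
]

candidate_ids = filter_by_metadata(
    corpus, topic="economics", year=lambda y: y >= 2020
)
print(candidate_ids)
```

Many vector stores offer built-in metadata filters that do the same job at scale; the plain-Python version here only shows the idea.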

### 4.6 Evaluating Your Retrieval
- Hand-inspect retrieval results for key queries.
- If labeled data is available, use recall/precision, MRR (mean reciprocal rank), etc.
- When the LLM hallucinates, check whether the right passages were retrieved. If not, work on retrieval first.
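
If you do have labeled data, recall@k and MRR take only a few lines to compute. The sketch below assumes each labeled query gives a ranked list of retrieved doc IDs plus the set of IDs judged relevant; the `runs` data is made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled queries: (retrieved IDs in rank order, relevant IDs)
runs = [
    ([3, 1, 7], [1]),      # first relevant doc at rank 2
    ([4, 5, 6], [4, 9]),   # relevant doc at rank 1, one relevant doc missed
    ([2, 8, 0], [5]),      # nothing relevant retrieved
]

mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"recall@3 = {mean_recall:.3f}, MRR = {mrr:.3f}")
```

Recall@k tells you whether the right passages reach the LLM at all, while MRR rewards placing the first relevant passage near the top.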

> **PyTorch connection:** If you want to train your own embedding-based retriever for your corpus (e.g. a Dense Passage Retriever), that's a neural ranking task handled by twin-encoder (dual-encoder) architectures. Out of scope for today, but it's where retrieval connects with deep learning.

### Try it yourself!
What is one technique you would use to improve retrieval if the initial results are bad? (Hybrid, reranking, query expansion, etc. are all fair answers!)

In [None]:
from utils import create_answer_box
create_answer_box('📝 **Your Answer:** One technique is ...', question_id='mod4_improve_retrieval')