# Module 2: Document Retrieval and Embeddings

*Part of the RCD Workshops series: RAG for Advanced Research Applications*

---

In this module, we'll dive into how to fetch relevant documents for RAG, covering both "classic" (keyword) and modern (embedding) approaches, with hands-on practice for each step.


## 2.1: From Keywords to Vectors: Why Classic Search Isn’t Enough

Traditional document search relies on **keyword matching**, typically with a scoring scheme such as TF-IDF or BM25. Because these methods compare the literal terms in the query and the document, they miss synonyms and rephrasings. RAG systems therefore commonly retrieve with **embeddings**: both documents and queries are mapped to dense vectors that reflect semantic *meaning*, so a relevant document can be found even when it shares no words with the query.
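
To make the keyword baseline concrete, here is a minimal TF-IDF scoring sketch in plain Python. The tokenizer, the smoothed IDF variant, and the names `tokenize` and `tfidf_scores` are illustrative assumptions, not the API of any particular library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(query, docs):
    """Score each doc by the summed TF-IDF weight of the query terms it contains."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Smoothed IDF (an assumed variant): rarer terms weigh more
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores

toy_docs = [
    "Impacts of climate change on global economies are substantial.",
    "Recent studies discuss economic loss due to global warming.",
]
toy_scores = tfidf_scores("climate economics", toy_docs)
print(toy_scores)
```

The second document is about the same topic but scores zero, because it shares no exact term with the query. That is the gap embeddings are meant to close.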

In [None]:
# EXERCISE: Classic Keyword Search
corpus = [
    'Impacts of climate change on global economies are substantial.',
    'Recent studies discuss economic loss due to global warming.',
    'Embedding models let us search by meaning, not just words.'
]
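query = 'climate economics'

def keyword_search(query, docs):
    # Keep a document if any query word appears in it as a case-insensitive substring
    return [d for d in docs if any(word.lower() in d.lower() for word in query.split())]

# Only the first document matches (on 'climate'); the second is on-topic but shares no query word
keyword_search(query, corpus)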

> **Diagram placeholder:** Show a Venn diagram with "Keyword Search" and "Semantic Search", highlighting the overlap (recalled by keywords) versus the full "meaning space" captured by embeddings.

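Above: the search returns only the first document, which contains the literal word "climate". The second document covers the same topic in different words ("economic loss", "global warming") and is missed, as are synonyms and rephrasings in general.

---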

## 2.2: What Are Embeddings?

Embeddings are vector representations of text in which semantically similar texts lie close together in the vector space.

Let's look at a toy example based on word analogies:

In [None]:
# Embedding arithmetic example (described in words, not computed)
print("If vector('king') - vector('man') + vector('woman') lands close to vector('queen'), the model is capturing the analogy!")
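
The arithmetic can be tried out on hand-made vectors. The sketch below assumes a hypothetical 2-D space (axis 0 for "royalty", axis 1 for "gender") with invented coordinates. Real embedding models learn hundreds of dimensions, and their analogies hold only approximately.

```python
import math

# Hand-made 2-D "embeddings" (hypothetical values): axis 0 = royalty, axis 1 = gender
toy_vectors = {
    "king":     [1.0,  1.0],
    "queen":    [1.0, -1.0],
    "man":      [0.0,  1.0],
    "woman":    [0.0, -1.0],
    "prince":   [0.8,  1.0],
    "princess": [0.8, -1.0],
}

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(
    toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]

# Look for the closest word, excluding the three words used in the arithmetic
candidates = {w: v for w, v in toy_vectors.items() if w not in ("king", "man", "woman")}
best = min(candidates, key=lambda w: distance(target, candidates[w]))
print(target, "->", best)
```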

> **Diagram placeholder:** 3D space showing vectors for king, man, woman, queen with arrows illustrating the analogy.

To make this hands-on, let's compute *real sentence embeddings* for a few sentences and examine their similarity.

In [None]:
# Import the libraries we need
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small, general-purpose sentence embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = [
    'Large language models can learn from research papers.',
    'AI systems use documents to answer questions.',
    'Bananas are yellow and tasty.'
]
embs = model.encode(sentences)
# Show cosine similarities
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"Similarity('{s1}', '{s2}') = {cosine(embs[i], embs[j]):.2f}")

You should see a higher similarity between the two topically related sentences and a much lower one for any pair that includes the unrelated banana sentence.
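
To get a feel for the scale of these scores, here is a small worked example on hand-picked 2-D vectors. The vectors and the name `cosine_similarity` are illustrative assumptions. The formula is the same one used in `cosine` above, written without NumPy.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # nothing in common
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # pointing opposite ways
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Scores near 1 mean the vectors point the same way, scores near 0 mean they are unrelated, and negative scores mean they point in opposing directions.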

---

### Quick Check: In your own words
Why do we use embeddings instead of plain keyword search when building a RAG system?

In [None]:
from utils import create_answer_box
create_answer_box("📝 **Your Answer:** I think we use embeddings instead of only keywords because...", question_id="mod2_why_embeddings")

## 2.3: Preparing Documents for Retrieval: Chunking and Embedding

Documents are often too long to embed, or to pass to an LLM, in one piece. We therefore break them into chunks (for example by token count or by paragraph) before embedding.

**Why chunk?**
- Keeps each unit small enough for the embedding model and the LLM input
- Lets retrieval return focused, on-topic sections, which improves precision
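
Chunking by size can be sketched as a sliding window. The example below is a minimal illustration that treats whitespace-separated words as a stand-in for tokens. The function name `chunk_by_words` and the `chunk_size` and `overlap` parameters are assumptions made for this sketch, and the values are kept tiny so the output is easy to read.

```python
def chunk_by_words(text, chunk_size=8, overlap=2):
    """Split text into windows of `chunk_size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample_text = ("Retrieval-Augmented Generation augments LLMs by allowing retrieval "
               "from external sources. Chunking splits text into manageable parts.")
word_chunks = chunk_by_words(sample_text, chunk_size=8, overlap=2)
for i, chunk in enumerate(word_chunks):
    print(f"Chunk {i + 1}: {chunk}")
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbors, at the cost of storing some words twice.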

Let's practice chunking and embedding a custom document.

In [None]:
# Example: Manual chunking
doc = """
Retrieval-Augmented Generation (RAG) augments LLMs by allowing retrieval from external sources. \
Chunking splits text into manageable parts; for example, splitting by paragraph.

Embeddings allow searches to find relevant sections even if different words are used. Cosine similarity quantifies text closeness.

Document retrieval pipelines (using tools like FAISS) depend on these steps working well together.
"""
# Split on blank lines so that each paragraph becomes one chunk
chunks = [c.strip() for c in doc.split('\n\n') if c.strip()]
for i, chunk in enumerate(chunks):
    print(f'Chunk {i+1}: {chunk}')

In [None]:
# Embed your chunks
chunk_embs = model.encode(chunks)
print(f'Each chunk embedding shape: {chunk_embs[0].shape}')

### Try it yourself!
Type a short document (2-4 sentences, each about a different subtopic).
We'll split and embed your own text.

In [None]:
from utils import create_answer_box
create_answer_box("✍️ **Write a mini-document (2-4 sentences, each different topic):**", question_id="mod2_mini_chunk")
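
One way to split a mini-document into one chunk per sentence is sketched below. The regular expression is a naive assumption (it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations such as "e.g." will confuse it), and `mini_doc` is a hypothetical placeholder for your own text.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical mini-document; replace it with your own text
mini_doc = ("Glaciers are retreating worldwide. "
            "Transformers rely on attention! "
            "Is sourdough hard to bake?")
my_chunks = split_sentences(mini_doc)
for i, sentence in enumerate(my_chunks):
    print(f"Chunk {i + 1}: {sentence}")
```

The resulting list can then be passed to `model.encode(my_chunks)`, in the same way as `chunks` earlier in this section.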

## 2.4: Indexing a Small Corpus (with FAISS)
We'll go end-to-end: encode docs → index → retrieve.

In [None]:
import faiss
import numpy as np
docs = [
    'Estimates of GDP loss from climate change include effects of weather extremes.',
    'RAG systems combine LLMs with document retrievers for better answers.',
    'Bananas and apples are common fruits.'
]
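doc_embs = model.encode(docs)
# Normalize each row to unit length so that inner product equals cosine similarity
doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
# Exact (brute-force) inner-product index over vectors of this dimensionality
index = faiss.IndexFlatIP(doc_embs.shape[1])
index.add(doc_embs)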

query = 'How does RAG use LLMs?'
q_emb = model.encode([query])[0]
q_emb = q_emb/np.linalg.norm(q_emb)
# D holds the similarity scores, I the positions of the matching documents in `docs`
D, I = index.search(np.array([q_emb]), k=2)
print('Top result:', docs[I[0][0]], '\nScore:', D[0][0])
print('Second result:', docs[I[0][1]], '\nScore:', D[0][1])
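
Conceptually, a flat inner-product index compares the query with every stored vector and returns the highest-scoring ones. The plain-Python sketch below mimics that behavior on hypothetical 3-D vectors. It illustrates the idea only and says nothing about how FAISS is implemented internally. The names `normalize` and `search` are assumptions of this sketch.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query_vec, vectors, k):
    """Return (scores, indices) of the k vectors with the largest inner product."""
    scored = [(sum(q * x for q, x in zip(query_vec, vec)), i)
              for i, vec in enumerate(vectors)]
    scored.sort(reverse=True)  # highest score first
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]

# Hypothetical 3-D "document embeddings" (real ones have hundreds of dimensions)
toy_doc_vecs = [normalize(v) for v in ([0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9])]
toy_query = normalize([0.2, 1.0, 0.0])

toy_scores, toy_ids = search(toy_query, toy_doc_vecs, k=2)
print(toy_ids, [round(s, 3) for s in toy_scores])
```

A brute-force scan like this is exact but its cost grows linearly with the number of stored vectors, which is fine for a small corpus such as the one in this module.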

> **Diagram placeholder:** Schematic of a vector index: documents as points on a sphere, query vector arrow, nearest documents circled.

---

## Quick Knowledge Check
What would happen if you used a *very* long chunk, or failed to normalize your vectors before similarity search?
Write a brief hypothesis.

In [None]:
from utils import create_answer_box
create_answer_box("✍️ **Your Hypothesis:**\n- With a long chunk...\n- If you don't normalize vectors...", question_id="mod2_longchunk_norm")

# End of Module 2

You've now practiced the core steps of document retrieval for RAG: classic vs. semantic search, embedding, chunking, and vector indexing.

Next: We'll assemble these building blocks into a complete RAG pipeline!