## Learning Goals

- Explain embeddings and similarity at a high level.
- Chunk and embed a small corpus with a modern sentence encoder.
- Index vectors and run top-k similarity search (FAISS or exact NN).
- Assess retrieval quality and iterate on chunking/top_k.


# Module 2: Document Retrieval and Embeddings

*Part of the RCD Workshops series: RAG for Research Applications*

---

In this module, we'll dive into how to fetch relevant documents for RAG, covering both "classic" (keyword) and modern (embedding) approaches, with hands-on practice for each step.


## 2.1: From Keywords to Vectors: Why Classic Search Isn’t Enough

Traditional document search relies on **keyword matching**, for example TF-IDF or BM25 scoring. Because these methods compare surface terms, they miss synonyms and rephrasings. RAG systems commonly use **embeddings** instead: both documents and queries are mapped to dense vectors that reflect semantic *meaning*, so a relevant document can be found even if it shares no words with the query.

By "dense vectors," we mean that each document or query is represented as a point in a high-dimensional space, where similar meanings are closer together. This allows us to find relevant documents based on their semantic content rather than just exact word matches.


In [7]:
# EXERCISE: Classic Keyword Search
corpus = [
    'Impacts of climate change on global economies are substantial.',
    'Recent studies discuss economic loss due to global warming.',
    'Embedding models let us search by meaning, not just words.'
]
query = 'climate economics'
def keyword_search(query, docs):
    return [d for d in docs if any(word.lower() in d.lower() for word in query.split())]
keyword_search(query, corpus)

['Impacts of climate change on global economies are substantial.']

<img src="semantic_sim_venn.png" alt="Semantic Similarity Venn Diagram" width="600"/>

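In the search result above, only the document containing the literal word "climate" is returned. The second document, which discusses economic loss due to global warming, is missed because it contains neither query word exactly ("economic" does not match "economics"). Synonyms and non-obvious rephrasings slip through keyword search in the same way.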
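
The same limitation applies to weighted keyword methods such as TF-IDF. The sketch below is a minimal, standard-library-only illustration rather than a production scorer: the `tokenize` and `tfidf_scores` helpers are hypothetical names introduced here, the tokenizer is deliberately naive, and the IDF smoothing is one common variant among several.

```python
import math
import re
from collections import Counter

corpus = [
    'Impacts of climate change on global economies are substantial.',
    'Recent studies discuss economic loss due to global warming.',
    'Embedding models let us search by meaning, not just words.',
]

def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_scores(query, docs):
    """Score each doc as the sum of TF-IDF weights of the query terms it contains."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of docs in which each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # The +1 keeps a nonzero weight for terms that occur in every doc.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores

scores = tfidf_scores('climate economics', corpus)
for doc, s in zip(corpus, scores):
    print(f'{s:.3f}  {doc}')
```

Only the first document receives a nonzero score. Weighting terms changes how matches are ranked, but it cannot create a match for "global warming" when the query says "climate".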

## 2.2: What Are Embeddings?

Embeddings are vector representations of text, constructed so that texts with similar meanings have vectors that lie close together.

Let's see a toy example:

![Semantic similarity concept](semantic_sim_venn.png)


In [8]:
# Import the necessary tools
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = [
    'Large language models can learn from research papers.',
    'AI systems use documents to answer questions.',
    'Bananas are yellow and tasty.'
]
embs = model.encode(sentences)

# Now we have embeddings for each sentence. Let's take a look at the first chunk of each embedding.
for i, s in enumerate(sentences):
    print(f"Sentence {i}: {s}")
    print(f"Embedding: {embs[i][:15]}...\n")  # Display first 15 elements of each embedding

Sentence 0: Large language models can learn from research papers.
Embedding: [-0.02631376 -0.05902476  0.03586608  0.03195616  0.02241782  0.07368159
 -0.06222048  0.03942321  0.06609748 -0.00331412 -0.0080848   0.04686617
  0.01815479  0.04575907  0.05886647]...

Sentence 1: AI systems use documents to answer questions.
Embedding: [-0.03400004  0.03123258 -0.03424968  0.0382176   0.05766154  0.00407708
  0.02707845  0.08380622  0.07094587  0.014809   -0.07037963  0.01803808
  0.04322198 -0.00348967 -0.00045261]...

Sentence 2: Bananas are yellow and tasty.
Embedding: [-0.01373625 -0.02557316  0.03501061  0.02210407  0.01295308  0.06529091
  0.06829672 -0.05202602 -0.00095944  0.08391878  0.00758327 -0.1111851
 -0.00839011 -0.05700076  0.06420341]...



In [9]:
# Now let's look at the cosine similarities among our documents.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"Similarity('{s1}', '{s2}') = {cosine(embs[i], embs[j]):.2f}")

Similarity('Large language models can learn from research papers.', 'AI systems use documents to answer questions.') = 0.39
Similarity('Large language models can learn from research papers.', 'Bananas are yellow and tasty.') = 0.00
Similarity('AI systems use documents to answer questions.', 'Bananas are yellow and tasty.') = 0.02


You should see a noticeably higher similarity between the two topically related sentences (0.39 here) than between either of them and the unrelated banana sentence (0.00 and 0.02).
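
For more than a handful of texts, the pairwise loop can be replaced by a single matrix product. The following is a small sketch using made-up 4-dimensional vectors (`toy_embs` is illustrative data, not output from the encoder) and a hypothetical helper `cosine_matrix`: normalizing each row to unit length first means the dot products are exactly the cosine similarities.

```python
import numpy as np

# Made-up vectors for illustration only; they are not real model embeddings.
toy_embs = np.array([
    [0.9, 0.1, 0.0, 0.2],   # stands in for the "language models" sentence
    [0.8, 0.3, 0.1, 0.1],   # stands in for the "AI systems" sentence
    [0.0, 0.1, 0.9, 0.4],   # stands in for the "bananas" sentence
])

def cosine_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms      # each row now has length 1
    return unit @ unit.T        # dot products of unit vectors are cosines

sims = cosine_matrix(toy_embs)
print(np.round(sims, 2))
```

Entry `sims[i, j]` holds the similarity between rows `i` and `j`, so the diagonal is 1 and the matrix is symmetric.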

---

### Quick Check: In your own words
Why do we use embeddings instead of plain keyword search when building a RAG system?

In [10]:
from utils import create_answer_box
create_answer_box("**Your Answer:** We use embeddings instead of only keywords because...", question_id="mod2_why_embeddings")

**Your Answer:** We use embeddings instead of only keywords because...

## 2.3: Preparing Documents for Retrieval: Chunking and Embedding

Documents are often longer than an embedding model or LLM can process at once, so we split them into chunks (for example, by token count or by paragraph) before embedding.

**Why chunk?**
- Keeps each unit small enough to fit the input limits of the encoder and the LLM
- Lets retrieval return focused, topical sections, which improves precision

Let's practice chunking and embedding a custom document.

In [11]:
# Example: Manual chunking
doc = """
Retrieval-Augmented Generation (RAG) augments LLMs by allowing retrieval from external sources. \
Chunking splits text into manageable parts; for example, splitting by paragraph.

Embeddings allow searches to find relevant sections even if different words are used. Cosine similarity quantifies text closeness.

Document retrieval pipelines (using tools like FAISS) depend on these steps working well together.
"""
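# Each paragraph sits on a single line (the trailing backslash joins the first two
# source lines), so splitting on newlines and dropping blanks gives one chunk per paragraph.
chunks = [c.strip() for c in doc.split('\n') if c.strip()]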
for i, chunk in enumerate(chunks):
    print(f'Chunk {i+1}: {chunk}')

Chunk 1: Retrieval-Augmented Generation (RAG) augments LLMs by allowing retrieval from external sources. Chunking splits text into manageable parts; for example, splitting by paragraph.
Chunk 2: Embeddings allow searches to find relevant sections even if different words are used. Cosine similarity quantifies text closeness.
Chunk 3: Document retrieval pipelines (using tools like FAISS) depend on these steps working well together.
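
Paragraph splitting works when paragraphs are short. For longer text, a common alternative is a fixed-size window with some overlap, so that a sentence cut at a boundary still appears whole in a neighbouring chunk. The sketch below is one possible approach, not the method used elsewhere in this module: `chunk_words`, `max_words`, and `overlap` are hypothetical names, and it counts whitespace-separated words rather than model tokens.

```python
def chunk_words(text, max_words=50, overlap=10):
    """Split text into windows of up to max_words words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError('overlap must be smaller than max_words')
    words = text.split()
    step = max_words - overlap          # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break                       # this window already reached the end
    return chunks

# Tiny demonstration on placeholder words w0 ... w11.
sample = ' '.join(f'w{i}' for i in range(12))
sample_chunks = chunk_words(sample, max_words=5, overlap=2)
for c in sample_chunks:
    print(c)
```

Each chunk begins with the last two words of the previous one. Larger overlaps reduce the chance of splitting a key sentence, at the cost of more chunks to embed and store.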


In [12]:
# Embed your chunks
chunk_embs = model.encode(chunks)

# Print the first 15 elements of each chunk embedding
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
    print(f"Embedding: {chunk_embs[i][:15]}...\n")  # Display first 15 elements of each embedding

Chunk 1: Retrieval-Augmented Generation (RAG) augments LLMs by allowing retrieval from external sources. Chunking splits text into manageable parts; for example, splitting by paragraph.
Embedding: [-0.07808192  0.01271216  0.0162458   0.01324641 -0.02419303  0.0192296
 -0.02702066 -0.03190193  0.0106208  -0.02104903  0.0347135   0.04540067
  0.0567274  -0.082671    0.02537981]...

Chunk 2: Embeddings allow searches to find relevant sections even if different words are used. Cosine similarity quantifies text closeness.
Embedding: [-0.00037822 -0.03876137 -0.02181436 -0.03538439  0.05156246  0.06276481
 -0.03221408  0.013954    0.07484718 -0.07835811  0.05879162  0.05892429
  0.06992489  0.05355405 -0.01230072]...

Chunk 3: Document retrieval pipelines (using tools like FAISS) depend on these steps working well together.
Embedding: [-0.0423752  -0.00392205 -0.0310769  -0.00017406  0.03266594 -0.02668107
 -0.09702621  0.02626733 -0.01050339 -0.03040713  0.01340665  0.0550771
  0.04512915 

### Dataset: Demo Corpus

We will use a tiny mixed-domain corpus (AI, Climate, Biomedical, Materials paper abstracts) stored in `data/demo_corpus.jsonl`.


In [13]:
from pathlib import Path
import pandas as pd

DATA_PATH = 'data/demo_corpus.jsonl'
df = pd.read_json(DATA_PATH, lines=True)
docs = df.to_dict('records')
print(f'Loaded {len(docs)} docs from {DATA_PATH}')
display(df[['id','title','year','topics']].head())


Loaded 18 docs from data/demo_corpus.jsonl


| | id | title | year | topics |
|---|---|---|---|---|
| 0 | 2508.05366 | Can Language Models Critique Themselves? Inves... | 2025 | [NLP, Retrieval, Language Model, Biomedical] |
| 1 | 2508.07326 | Nonparametric Reaction Coordinate Optimization... | 2025 | [ML, Climate] |
| 2 | 2508.07654 | MLego: Interactive and Scalable Topic Explorat... | 2025 | [Databases, IR] |
| 3 | 2508.07798 | Generative Inversion for Property-Targeted Mat... | 2025 | [Materials, ML] |
| 4 | 2508.0814 | Data-Efficient Biomedical In-Context Learning:... | 2025 | [NLP, Retrieval, Language Model, Biomedical] |


## 2.4: Indexing Scientific Abstracts (with FAISS)
We'll go end-to-end using the demo corpus of scientific abstracts: chunk abstracts → encode chunks → index → retrieve.

In [20]:
# Build a tiny passage index from scientific abstracts with simple chunking
import faiss
import numpy as np

def chunk_text(text, max_chars=400):
    text = (text or '').strip()
    if not text:
        return []
    return [text[i:i+max_chars].strip() for i in range(0, len(text), max_chars)]

# Prepare chunk records from the loaded demo corpus (expects df/docs from above)
chunk_texts = []
chunk_meta = []
for d in docs:
    abs_text = d.get('abstract', '')
    pieces = chunk_text(abs_text, max_chars=400)
    for j, t in enumerate(pieces):
        if not t:
            continue
        chunk_texts.append(t)
        chunk_meta.append({'doc_id': d.get('id'), 'title': d.get('title'), 'chunk_id': j})

# Encode and normalize for cosine similarity via inner product
embs = model.encode(chunk_texts)
embs = np.array([v/np.linalg.norm(v) for v in embs], dtype='float32')
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)

# Simple demo query over abstracts
query = 'How do RAG systems combine LLMs with retrieval?'
q = model.encode([query])[0]
q = (q/np.linalg.norm(q)).astype('float32')
D, I = index.search(np.array([q]), k=3)
for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
    m = chunk_meta[idx]
    snippet = chunk_texts[idx][:160].replace('\n',' ')
    print(f'#{rank} score={score:.3f} | id={m["doc_id"]} | {m["title"][:60]}...')
    print(f'   {snippet}...\n')


#1 score=0.429 | id=2508.13107 | All for law and law for all: Adaptive RAG Pipeline for Legal...
   contributions demonstrate the potential of task-aware, component-level tuning to deliver legally grounded, reproducible, and cost-effective RAG systems for lega...

#2 score=0.349 | id=2508.12863 | Word Meanings in Transformer Language Models...
   serves to rule out certain "meaning eliminativist" hypotheses about how transformer LLMs process semantic information....

#3 score=0.340 | id=2508.05366 | Can Language Models Critique Themselves? Investigating Self-...
   Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iterativel...



> **Diagram placeholder:** Schematic of a vector index: documents as points on a sphere, query vector arrow, nearest documents circled.
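
For a corpus this small, exact nearest-neighbour search in plain NumPy is a reasonable alternative to FAISS: score every vector against the query and keep the top k. The sketch below assumes a hypothetical helper `exact_top_k` and uses random vectors as stand-ins for real chunk embeddings, so the printed indices carry no meaning beyond showing the mechanics.

```python
import numpy as np

def exact_top_k(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k rows of doc_vecs most similar to query_vec.

    Both inputs are L2-normalized here, so the dot product equals cosine similarity.
    """
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ q_unit          # one cosine score per document
    top = np.argsort(-scores)[:k]       # indices sorted by descending score
    return top, scores[top]

# Stand-in data: random vectors instead of real chunk embeddings.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(100, 8)).astype('float32')
# The query is row 42 plus a little noise, so row 42 should rank first.
query_vec = doc_vecs[42] + 0.01 * rng.normal(size=8).astype('float32')

idx, sc = exact_top_k(query_vec, doc_vecs, k=3)
print(idx, np.round(sc, 3))
```

A full `argsort` is simple but sorts every score; for larger collections, a partial selection or an approximate index is usually the better trade-off.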

---

## Quick Knowledge Check
What would happen if you used a *very* long chunk, or failed to normalize your vectors before similarity search?
Write a brief hypothesis.

In [21]:
from utils import create_answer_box
create_answer_box("**Your Hypothesis:**\n- With a long chunk...\n- If you don't normalize vectors...", question_id="mod2_longchunk_norm")

**Your Hypothesis:**
- With a long chunk...
- If you don't normalize vectors...
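
Once you have written your hypothesis, a small experiment can help check the normalization half of it. This is a toy sketch with hand-picked 2-dimensional vectors (`toy_query` and `toy_docs` are made-up values, not embeddings from the model): it ranks the same two documents by raw inner product and by cosine similarity.

```python
import numpy as np

toy_query = np.array([1.0, 0.0])
toy_docs = np.array([
    [0.9, 0.1],   # points almost the same way as the query, but is short
    [3.0, 4.0],   # points in a different direction, but is much longer
])

# Raw inner product: vector length contributes to the score.
raw_ip = toy_docs @ toy_query

# Cosine similarity: normalize both sides to unit length first.
unit_docs = toy_docs / np.linalg.norm(toy_docs, axis=1, keepdims=True)
cosine_scores = unit_docs @ (toy_query / np.linalg.norm(toy_query))

print('raw inner product:', raw_ip, '-> best doc', int(np.argmax(raw_ip)))
print('cosine similarity:', np.round(cosine_scores, 3), '-> best doc', int(np.argmax(cosine_scores)))
```

The two scoring rules disagree on which document is the best match, which is why the indexing cell normalizes vectors before adding them to an inner-product index.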

# End of Module 2

You've now practiced the core steps of document retrieval for RAG: classic vs. semantic search, embedding, chunking, and vector indexing.

Next: We'll assemble these building blocks into a complete RAG pipeline!