# Hybrid Retrieval (BM25 + Dense) with RRF — Concept Overview

When a user asks a question, we need to **search through documents** (e.g., uv.es university content).

There are two popular ways to search:

- **BM25 (sparse retrieval)** — based on **exact word matching**.  
  It works great when the query and the document share the same words  
  *(e.g., “matrícula”, “plazos”, “UV”)*.

- **Dense retrieval (embeddings)** — based on **semantic meaning**.  
  Even if the words differ, it can match similar meanings  
  *(e.g., “inscripción” ≈ “matrícula”)*.

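Each method tends to fail where the other succeeds: BM25 misses synonyms and paraphrases, while dense retrieval can miss exact terms such as acronyms, codes, or names.  
The usual solution is to **combine both**, which typically raises recall while keeping the top of the ranking precise.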
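
To make the lexical-mismatch point concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the `rank_bm25` implementation used later; the function name `bm25_scores`, the parameter defaults (`k1 = 1.5`, `b = 0.75`), the non-negative IDF variant, and the two toy documents are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no shared word, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (assumed for illustration).
docs = [
    "plazos de matrícula en la uv".split(),
    "horario de la biblioteca".split(),
]
same_word = bm25_scores(["matrícula"], docs)
synonym = bm25_scores(["inscripción"], docs)
print(same_word)  # first document scores above zero, second scores zero
print(synonym)    # all zeros: BM25 cannot see that "inscripción" is close to "matrícula"
```

The synonym query scores zero everywhere because BM25 only rewards shared tokens; this is the gap that the dense retriever is expected to cover.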

# What is a Hybrid Retriever?

A **hybrid retriever** runs **two searches in parallel**:

1. One using **BM25** — ranking documents by **textual relevance**.
2. Another using **embeddings** — ranking by **semantic similarity**.

Finally, it **merges both ranked lists** into a single, more reliable ranking.
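
The two-search step can be sketched as below. This is a hedged illustration rather than the notebook's implementation: `retrieve_both`, the `Retriever` type alias, and the two placeholder retrievers are hypothetical names, and the retrievers simply return fixed rankings. The merge itself is left to the fusion rule.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# A retriever takes (query, top_k) and returns chunk ids, best first.
Retriever = Callable[[str, int], List[str]]

def retrieve_both(query: str, sparse: Retriever, dense: Retriever,
                  top_k: int = 5) -> Tuple[List[str], List[str]]:
    """Run both retrievers concurrently and return their ranked id lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sparse_future = pool.submit(sparse, query, top_k)
        dense_future = pool.submit(dense, query, top_k)
        return sparse_future.result(), dense_future.result()

# Placeholder retrievers with fixed rankings, for illustration only.
def fake_bm25(query: str, top_k: int) -> List[str]:
    return ["d1", "d2", "d3"][:top_k]

def fake_dense(query: str, top_k: int) -> List[str]:
    return ["d2", "d3", "d4"][:top_k]

bm25_ids, dense_ids = retrieve_both("plazos de matrícula", fake_bm25, fake_dense, top_k=3)
print(bm25_ids, dense_ids)
```

Threads are a reasonable choice here because a dense query is usually dominated by network calls (embedding plus vector search), so the two searches can overlap in time.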

# RRF (Reciprocal Rank Fusion)

**Reciprocal Rank Fusion (RRF)** is a simple yet powerful rule for combining ranked lists.

### Inputs:
- **List A** (e.g., BM25) returns documents with ranks: `rank_A(d) = 1, 2, 3, ...` (1 = best)
- **List B** (e.g., Dense) returns ranks: `rank_B(d) = 1, 2, 3, ...`

### Formula (per document `d`):
RRF\_score(d) = 1/(k + rank_A(d)) + 1/(k + rank_B(d))


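- **k** is a smoothing constant (typically `60`). Higher-ranked documents still contribute more, but a larger `k` narrows the gap between neighbouring ranks, so a single first place in one list cannot dominate the fused score.
- If a document doesn't appear in a list, that list contributes nothing to its score (equivalent to treating its rank there as very large).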

### Intuition:
- A document ranked very high in **either** list gets a **strong boost** (`1/(k+1)`).
- A document that ranks moderately in **both** lists also gains a solid combined score.
- **RRF is robust** — no need to normalize scores or tune probabilities, just use ranks.
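
A quick numeric check shows how `k` shapes the contribution of a single list. The helper name `rrf_term` and the comparison of rank 1 against rank 10 are choices made for this sketch.

```python
def rrf_term(rank: int, k: int) -> float:
    """Contribution of one list to a document's RRF score."""
    return 1.0 / (k + rank)

ratios = {}
for k in (0, 60):
    top, tenth = rrf_term(1, k), rrf_term(10, k)
    ratios[k] = top / tenth
    print(f"k={k:>2}: rank 1 -> {top:.4f}, rank 10 -> {tenth:.4f}, ratio = {ratios[k]:.2f}")
```

With `k = 0` the first place is worth ten times the tenth place; with `k = 60` the ratio drops to roughly 1.15, which is why agreement between lists matters more than a single top position.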


# Simple Example of RRF

Assume we take the **top-3** documents from each retriever:

**BM25 (List A)** → `d1`, `d2`, `d3`  
**Dense (List B)** → `d2`, `d3`, `d4`  

We set `k = 60` (a common choice).

---

### Step-by-step computation

| Document | rank_A | rank_B | RRF score calculation | Total |
|-----------|--------|--------|------------------------|--------|
| **d1** | 1 | - | 1/(60+1) + 0 | **1/61 ≈ 0.0164** |
| **d2** | 2 | 1 | 1/(60+2) + 1/(60+1) | **1/62 + 1/61 ≈ 0.0325** |
| **d3** | 3 | 2 | 1/(60+3) + 1/(60+2) | **1/63 + 1/62 ≈ 0.0320** |
| **d4** | - | 3 | 0 + 1/(60+3) | **1/63 ≈ 0.0159** |

---

### Final order (highest to lowest RRF score):
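1. **d2** → highest combined score (rank 2 in BM25, rank 1 in Dense)  
2. **d3** → close behind (rank 3 in BM25, rank 2 in Dense)  
3. **d1** → first in BM25 but absent from Dense  
4. **d4** → third in Dense and absent from BM25

**d2 wins** because it ranks near the top in *both* search types.  
**d1** was best in BM25 but missing from Dense, so its single contribution of 1/61 leaves it behind the two documents that appear in both lists.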
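
The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the RRF rule as stated in the formula; the function name `rrf_fuse` and its signature are assumptions for illustration. It accepts any number of ranked lists, of which the two-list case is the one used here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def rrf_fuse(ranked_lists: Sequence[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked id lists with Reciprocal Rank Fusion; best document first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # A document missing from a list simply receives no contribution from it.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_top3 = ["d1", "d2", "d3"]
dense_top3 = ["d2", "d3", "d4"]
fused = rrf_fuse([bm25_top3, dense_top3], k=60)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
# d2: 0.0325
# d3: 0.0320
# d1: 0.0164
# d4: 0.0159
```

Only ranks enter the computation, so BM25 scores and embedding distances never need to be put on a common scale.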


### 1 — Environment & Imports

In [12]:
# Environment & Imports
import os

from openai import OpenAI
import chromadb
# BM25
from rank_bm25 import BM25Okapi

from typing import List, Dict, Any, Tuple
from dataclasses import dataclass

import re

import glob
from datetime import datetime
from uuid import uuid4

from pypdf import PdfReader
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
)
from dotenv import load_dotenv
load_dotenv()

# Configuration expected from .env.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
CHROMADB_TOKEN = os.environ.get("CHROMADB_TOKEN")
CHROMA_TENANT = os.environ.get("CHROMA_TENANT")
CHROMA_DB = os.environ.get("CHROMA_DB")
CHROMA_HOST = os.environ.get("CHROMA_HOST")  # validated here, but not passed to CloudClient below
CHROMA_COLLECTION = os.environ.get("CHROMA_COLLECTION")

# Fail fast so that configuration errors surface before any API call.
assert OPENAI_API_KEY, "Set OPENAI_API_KEY"
assert CHROMADB_TOKEN and CHROMA_TENANT and CHROMA_DB and CHROMA_HOST and CHROMA_COLLECTION, "Set Chroma Cloud env vars"

In [13]:
# OpenAI client
llm = OpenAI(api_key=OPENAI_API_KEY)

# Chroma client (Cloud)
client = chromadb.CloudClient(
    api_key=CHROMADB_TOKEN,
    tenant=CHROMA_TENANT,
    database=CHROMA_DB,
)
collection = client.get_or_create_collection(name=CHROMA_COLLECTION)

# Minimal tokenizer dedicated to BM25: lowercasing + simple word/symbol segmentation.
# This is sufficient for BM25's bag-of-words scoring.
def simple_tokenize(text: str) -> List[str]:
    return re.findall(r"\w+|\S", (text or "").lower())
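
To see what this tokenizer produces, the sketch below repeats the same regular expression on a short sample string (the sample text is made up for illustration).

```python
import re
from typing import List

def simple_tokenize(text: str) -> List[str]:
    # Same rule as above: runs of word characters, or any single non-space symbol.
    return re.findall(r"\w+|\S", (text or "").lower())

tokens = simple_tokenize("¿Plazos de Matrícula? UV-2025")
print(tokens)
# ['¿', 'plazos', 'de', 'matrícula', '?', 'uv', '-', '2025']
```

Two details are worth noting. In Python 3, `\w` is Unicode-aware for `str` patterns, so accented words such as "matrícula" stay in one piece. Punctuation marks are kept as separate tokens; they rarely help BM25, so filtering them out would be a reasonable variation.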

### 2 - Embedding helpers (OpenAI)

In [25]:
EMBEDDING_MODEL = "text-embedding-3-small"  

def embed_texts(texts: List[str]) -> List[List[float]]:
    """Vectorize a batch of texts using OpenAI embeddings. 
    Returns one embedding per input text, preserving order."""
    if not texts:
        return []
    # Send the texts to the OpenAI embedding model to generate embeddings.
    # 'resp' is an API response object containing metadata + one embedding per text.
    resp = llm.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    # Extract only the embedding vectors from the response (ignore metadata).
    # resp.data is a list of objects; each has an attribute 'embedding' which is a list of floats.
    return [d.embedding for d in resp.data]

def embed_query(query: str) -> List[float]:
    """Vectorize a single query (thin wrapper over embed_texts)."""
    return embed_texts([query])[0]


### 3 — PDF Ingestion, Two-Pass Chunking, Embedding, and Upsert to Chroma

#### Scan `./data` and extract text from each PDF

For each PDF:
- read pages, extract text
- join per-document text with double newlines
- build a minimal metadata record
- keep both full document text and per-page text for traceability


In [15]:
DATA_DIR = "./data"
assert os.path.isdir(DATA_DIR), f"Data folder not found: {DATA_DIR}"

pdf_paths = sorted(glob.glob(os.path.join(DATA_DIR, "**/*.pdf"), recursive=True))
print(f"Found {len(pdf_paths)} PDFs under {DATA_DIR}")

raw_docs: List[Dict[str, Any]] = []  # items: {"doc_id","filename","path","text","pages_text","last_modified"}

for pdf_path in pdf_paths:
    try:
        reader = PdfReader(pdf_path)
    except Exception as e:
        print(f"[WARN] Skipping {pdf_path}: {e}")
        continue

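    # Extract per-page text (strip leading/trailing whitespace).
    # Empty pages are kept as "" so that list positions stay aligned with
    # the PDF's page numbers; the chunker skips them later.
    pages_text: List[str] = []
    for p in reader.pages:
        t = (p.extract_text() or "").strip()
        pages_text.append(t)

    # Join the non-empty page texts into a single document text
    full_text = "\n\n".join(t for t in pages_text if t)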

    # Skip empty documents (e.g., scanned PDFs without OCR)
    if not full_text.strip():
        print(f"[WARN] Empty text: {pdf_path}")
        continue

    # Metadata
    stat = os.stat(pdf_path)
    last_modified = datetime.fromtimestamp(stat.st_mtime).isoformat()
    doc_id = str(uuid4())  # alternatively, derive a stable id from filepath+mtime if desired

    raw_docs.append({
        "doc_id": doc_id,
        "filename": os.path.basename(pdf_path),
        "path": os.path.abspath(pdf_path),
        "text": full_text,
        "pages_text": pages_text,  # retain per-page traceability
        "last_modified": last_modified,
    })

print(f"Ingested {len(raw_docs)} PDF documents with text.")

Found 25 PDFs under ./data
Ingested 25 PDF documents with text.
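
Because `uuid4()` yields a fresh `doc_id` on every run, the chunk ids derived from it also change, so re-running the ingestion would add new records to the collection instead of overwriting the old ones. The comment in the cell above hints at a stable alternative; a minimal sketch of that idea follows. The helper name `stable_doc_id`, the SHA-1 choice, and the 16-character truncation are assumptions.

```python
import hashlib

def stable_doc_id(path: str, mtime: float) -> str:
    """Hypothetical helper: derive a repeatable id from file path and modification time."""
    key = f"{path}|{mtime}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()[:16]

a = stable_doc_id("/data/paper.pdf", 1700000000.0)
b = stable_doc_id("/data/paper.pdf", 1700000000.0)
c = stable_doc_id("/data/paper.pdf", 1700000500.0)
print(a == b, a == c)  # True False
```

Including the modification time means an edited file gets a new id, which leaves its old chunks in the collection; hashing the path alone would make a re-run overwrite them instead.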


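#### Two-pass splitting
The first pass splits at character level along paragraph, line, and sentence separators, which preserves coarse structure and avoids awkward breaks.

The second pass splits at token level and adds overlap to improve recall near chunk boundaries. A 1000-character segment usually fits within 384 tokens, so in practice this pass mostly acts as a cap, and overlap only appears when a segment exceeds it. The token splitter also returns text decoded from its tokenizer, which likely explains the lowercased, re-spaced text in the stored chunks.

- character chunks: up to 1000 characters, no overlap
- token chunks: up to 384 tokens, with a 64-token overlap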

In [18]:
char_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0,
)

token_splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=384,
    chunk_overlap=64,
)
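
The idea behind token overlap can be illustrated with a plain sliding window. This sketch is not the library's internal implementation; `sliding_windows` is a hypothetical helper, and small numbers (`size=4`, `overlap=2`) stand in for 384 and 64.

```python
from typing import List

def sliding_windows(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return windows

tokens = [f"t{i}" for i in range(10)]
windows = sliding_windows(tokens, size=4, overlap=2)
for w in windows:
    print(w)
```

Each window repeats the last two tokens of the previous one, so a sentence that straddles a boundary is fully contained in at least one chunk.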


#### Split documents into page-aware chunks with rich metadata

For each document:

- iterate page by page
- apply character-level splitting, then token-level splitting
- construct chunks with metadata capturing the source PDF and page

In [20]:
def split_into_chunks_by_page(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """
    Produce page-aware chunks from a document.
    Pipeline: page -> character-level segments -> token-level chunks (with overlap).
    Each chunk carries metadata sufficient for page-level citation and source tracing.
    """
    chunks: List[Dict[str, Any]] = []
    pages: List[str] = doc.get("pages_text") or [doc["text"]]
    total_pages = len(pages)

    global_chunk_idx = 0  # running index of chunks across the whole document

    for page_num, page_text in enumerate(pages, start=1):
        if not page_text or not page_text.strip():
            continue

        # First pass: character-level segmentation
        char_segments = char_splitter.split_text(page_text)

        # Second pass: token-level segmentation with overlap
        page_chunk_idx = 0
        for seg in char_segments:
            token_chunks = token_splitter.split_text(seg)
            for chunk_text in token_chunks:
                # Skip degenerate chunks that are purely whitespace
                if not chunk_text.strip():
                    continue

                chunk_id = f"{doc['doc_id']}::p{page_num}::ch{global_chunk_idx}"

                chunks.append({
                    "id": chunk_id,
                    "text": chunk_text,
                    "metadata": {
                        "doc_id": doc["doc_id"],
                        "filename": doc["filename"],
                        "path": doc["path"],
                        "page": page_num,                    # 1-based page number for citation
                        "page_chunk_index": page_chunk_idx,  # chunk index within the page
                        "chunk_index": global_chunk_idx,     # global chunk index within the doc
                        "total_pages": total_pages,
                        "last_modified": doc["last_modified"],
                        # Useful anchor for viewers that support page fragments:
                        "source": f"{doc['path']}#page={page_num}",
                        # Additional domain-specific fields can be added here (e.g., "section", "language").
                    }
                })

                page_chunk_idx += 1
                global_chunk_idx += 1

    print(f"[{doc['filename']}] produced {len(chunks)} chunks with page metadata.")
    return chunks

#### Build the complete chunk list across the corpus

Aggregate page-aware chunks from all documents into a single list for embedding and upsert.

In [21]:
all_chunks: List[Dict[str, Any]] = []
for d in raw_docs:
    all_chunks.extend(split_into_chunks_by_page(d))

print(f"Total chunks prepared: {len(all_chunks)}")


[2024.findings-acl.456.pdf] produced 89 chunks with page metadata.
[2025.acl-long.366.pdf] produced 111 chunks with page metadata.
[2401.10020v3.pdf] produced 80 chunks with page metadata.
[2401.10774v3.pdf] produced 107 chunks with page metadata.
[2401.15884v3.pdf] produced 69 chunks with page metadata.
[2401.18059v1.pdf] produced 92 chunks with page metadata.
[2402.01306v4.pdf] produced 95 chunks with page metadata.
[2402.13547v2.pdf] produced 88 chunks with page metadata.
[2402.13753v1.pdf] produced 78 chunks with page metadata.
[2403.03206v1.pdf] produced 111 chunks with page metadata.
[2403.19887v2.pdf] produced 57 chunks with page metadata.
[2404.16130v2.pdf] produced 105 chunks with page metadata.
[2405.04437v3.pdf] produced 109 chunks with page metadata.
[2405.14734v3.pdf] produced 131 chunks with page metadata.
[2405.21060v1.pdf] produced 216 chunks with page metadata.
[2406.04692v1.pdf] produced 61 chunks with page metadata.
[2407.08608v2.pdf] produced 81 chunks with page metadata.
...

#### Embed and upsert chunks into Chroma (batched)

Procedure:

1. batch the chunk list to respect rate limits
2. compute embeddings for each batch
3. upsert (ids, documents, embeddings, metadatas) into the target Chroma collection

In [22]:
BATCH = 256  # tune based on provider limits and latency characteristics

def batched(items: List[Any], n: int):
    """Yield successive n-sized slices of a list (relies on len() and slicing)."""
    for i in range(0, len(items), n):
        yield items[i : i + n]

upserted = 0
for batch in batched(all_chunks, BATCH):
    texts = [c["text"] for c in batch]
    ids   = [c["id"] for c in batch]
    metas = [c["metadata"] for c in batch]

    # Embeddings (OpenAI-compatible)
    embeds = embed_texts(texts)

    # Upsert into Chroma
    collection.upsert(
        ids=ids,
        documents=texts,
        embeddings=embeds,
        metadatas=metas,
    )
    upserted += len(batch)

print(f"Upserted {upserted} chunks into Chroma")

Upserted 2922 chunks into Chroma
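
Batching limits the size of each request, but a batch can still fail when the provider throttles traffic. One common remedy is to retry with exponential backoff; the sketch below shows the pattern under stated assumptions. `with_backoff` is a hypothetical helper, it catches every `Exception` for brevity (a real pipeline would catch only the client's rate-limit error), and the demo uses a fake call with recorded sleeps instead of real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(call: Callable[[], T], retries: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call()`; after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("retries must be at least 1")

# Demo: a fake call that fails twice, then succeeds.
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []  # sleeps are recorded here rather than executed
result = with_backoff(flaky, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

In the loop above, the embedding call could then be wrapped as `with_backoff(lambda: embed_texts(texts))`.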


In [24]:
query_text = "What did Qwen2.5-1M models do to enhance training efficiency and reduce costs?"

q_embed = embed_query(query_text)  
results = collection.query(
    query_embeddings=[q_embed],
    n_results=5,
    include=["documents", "metadatas", "distances"],
)

docs      = results.get("documents", [[]])[0]
metadatas = results.get("metadatas", [[]])[0]
distances = results.get("distances", [[]])[0]  # lower distance = closer match

for doc_text, meta, dist in zip(docs, metadatas, distances):
    filename = meta.get("filename", "unknown.pdf")
    page     = meta.get("page", "?")
    source   = meta.get("source", "")
    print(f"Source: {filename} — page {page}  (distance={dist:.4f})")
    if source:
        print(f"Link: {source}")
    print(doc_text[:400].strip(), "...\n")


Source: 2501.15383v1.pdf — page 15  (distance=0.5904)
Link: c:\Users\Zakaria\Downloads\bm25-dense-etrieval\data\2501.15383v1.pdf#page=15
its processing time from 4. 9 minutes to only 68 seconds. these improvements significantly reduce user waiting times for long - sequence tasks. compared to the open - source qwen2. 5 - 1m models, qwen2. 5 - turbo excels in short tasks and achieves competitive results on long - context tasks, while delivering shorter processing times and lower costs. consequently, it offers an excellent balance of ...

Source: 2501.15383v1.pdf — page 2  (distance=0.6577)
Link: c:\Users\Zakaria\Downloads\bm25-dense-etrieval\data\2501.15383v1.pdf#page=2
qwen2. 5 - 14b - instruct - 1m. compared to the 128k versions, these models exhibit significantly enhanced long - context capabilities. additionally, we provide an api - accessible model based on mixture of experts ( moe ), called qwen2. 5 - turbo, which offers performance comparable to gpt - 4o - mini but with longer con