# Sparse Retrieval - BM25 on HotpotQA

## Installation

In [1]:
!pip install rank_bm25 nltk datasets faiss-cpu transformers evaluate

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)

## Imports

In [6]:
import re, nltk, math, json, random, evaluate, tqdm
from itertools import chain
from datasets import load_dataset
from rank_bm25 import BM25Okapi
nltk.download("punkt")
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## Load Data

**Load the first 10,000 HotpotQA training examples (distractor setting)**

In [7]:
hp = load_dataset("hotpot_qa", "distractor", split="train[:10000]")
hp

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for hotpot_qa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/hotpot_qa.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/90447 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/7405 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context'],
    num_rows: 10000
})

In [8]:
corpus_raw = hp["context"]

corpus = []
for context in corpus_raw:
    corpus.extend(list(chain.from_iterable(context['sentences'])))

corpus[:10]

["Radio City is India's first private FM radio station and was started on 3 July 2001.",
 ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).',
 ' It plays Hindi, English and regional songs.',
 ' It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.',
 ' Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.',
 ' The Radio station currently plays a mix of Hindi and Regional music.',
 ' Abraham Thomas is the CEO of the company.',
 'Football in Albania existed before the Albanian Football Federation (FSHF) was created.',
 " This was evidenced by the team's registration at the Balkan Cup tournament during 1929-1931, which started in 1929 (although Albania eventually had pressure 

## Tokenization

In [9]:
# Integer ID for every chunk (one chunk = one corpus sentence)
doc_ids = list(range(len(corpus)))
len(doc_ids)

408264
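
The 408,264 entries are individual sentences, not paragraphs, because each `context['sentences']` list of lists is flattened before indexing. The sketch below uses a made-up record with the same shape to contrast that sentence-level chunking with a paragraph-level alternative. The names `sentence_docs` and `paragraph_docs` are illustrative and not part of the notebook.

```python
from itertools import chain

# Toy record shaped like one HotpotQA "context" field (titles and text are made up)
context = {
    "title": ["Radio City", "Football in Albania"],
    "sentences": [
        ["Radio City is an FM station.", " It plays Hindi songs."],
        ["Football in Albania predates the federation."],
    ],
}

# Sentence-level chunks, as in the notebook: one document per sentence
sentence_docs = list(chain.from_iterable(context["sentences"]))

# Paragraph-level alternative: one document per title, sentences joined back together
paragraph_docs = ["".join(sents).strip() for sents in context["sentences"]]

print(len(sentence_docs), "sentence chunks vs", len(paragraph_docs), "paragraph chunks")
```

Sentence-level chunks keep each retrieved unit short, while paragraph-level chunks keep pronouns such as "It" next to the entity they refer to. Which works better here is an empirical question.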

In [10]:
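## Tokenise ~ (NLTK word tokenisation + lowercase + punctuation strip)

def tokenize(text):
    """Lowercase, split with NLTK, then strip non-word characters from each token."""
    tokens = nltk.word_tokenize(text.lower())
    # Removing \W also merges tokens such as "91.1" into "911"
    stripped = (re.sub(r"\W+", "", t) for t in tokens)
    # Drop tokens that were pure punctuation and are now empty
    return [t for t in stripped if t]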

tokenized_corpus = [tokenize(doc) for doc in tqdm.tqdm(corpus)]

tokenized_corpus

100%|██████████| 408264/408264 [01:28<00:00, 4618.04it/s]


[['radio',
  'city',
  'is',
  'india',
  's',
  'first',
  'private',
  'fm',
  'radio',
  'station',
  'and',
  'was',
  'started',
  'on',
  '3',
  'july',
  '2001'],
 ['it',
  'broadcasts',
  'on',
  '911',
  'earlier',
  '910',
  'in',
  'most',
  'cities',
  'megahertz',
  'from',
  'mumbai',
  'where',
  'it',
  'was',
  'started',
  'in',
  '2004',
  'bengaluru',
  'started',
  'first',
  'in',
  '2001',
  'lucknow',
  'and',
  'new',
  'delhi',
  'since',
  '2003'],
 ['it', 'plays', 'hindi', 'english', 'and', 'regional', 'songs'],
 ['it',
  'was',
  'launched',
  'in',
  'hyderabad',
  'in',
  'march',
  '2006',
  'in',
  'chennai',
  'on',
  '7',
  'july',
  '2006',
  'and',
  'in',
  'visakhapatnam',
  'october',
  '2007'],
 ['radio',
  'city',
  'recently',
  'forayed',
  'into',
  'new',
  'media',
  'in',
  'may',
  '2008',
  'with',
  'the',
  'launch',
  'of',
  'a',
  'music',
  'portal',
  'planetradiocitycom',
  'that',
  'offers',
  'music',
  'related',
  'news',
 

## Build BM25 Index

In [11]:
bm25 = BM25Okapi(tokenized_corpus)
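
`BM25Okapi` is built here without extra arguments, so it uses the library's default parameters. For reference, the textbook Okapi BM25 score of a document $D$ for a query $Q$ is given below. This is the standard form, and `rank_bm25` may differ in details such as how it handles negative IDF values.

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
\frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\frac{N - n_t + 0.5}{n_t + 0.5}
$$

Here $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$. The parameter $k_1$ controls term-frequency saturation and $b$ controls length normalization.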

## Evaluate Retriever Recall@k

In [12]:
questions = hp["question"][:100]
gold_answers = hp["answer"][:100]

list(zip(questions[:2], gold_answers[:2]))

[("Which magazine was started first Arthur's Magazine or First for Women?",
  "Arthur's Magazine"),
 ('The Oberoi family is part of a hotel company that has a head office in what city?',
  'Delhi')]

In [13]:
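def recall_at_k(k=5):
    """Percentage of questions whose gold answer string appears in a top-k passage."""
    hit = 0
    for q, gold in zip(questions, gold_answers):
        q_tokenized = tokenize(q)
        doc_scores = bm25.get_scores(q_tokenized)
        # Indices of the k highest-scoring passages, best first
        topk_idx = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:k]
        retrieved_docs = [corpus[i] for i in topk_idx]
        # Note: only the gold answer is lowercased; passages are compared as-is
        if any(gold.lower() in p for p in retrieved_docs):
            hit += 1
    return hit * 100 / len(questions)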

for k in [1, 3, 5]:
    print(f"Recall@{k} : {recall_at_k(k):.1f}%")

Recall@1 : 5.0%
Recall@3 : 11.0%
Recall@5 : 11.0%
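
The hit test in `recall_at_k` lowercases the gold answer but not the retrieved passage, so an answer such as "Arthur's Magazine" cannot match text that keeps its capitals, which may depress the numbers above. The sketch below shows a case-insensitive check, plus a top-k selection that avoids sorting every score. The helper names `top_k_indices` and `is_hit` and the toy data are assumptions for illustration; no recall figures were recomputed with them.

```python
import heapq

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first, without sorting the whole list."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

def is_hit(gold, passages):
    """True if the gold answer occurs in any passage, ignoring case on both sides."""
    needle = gold.lower()
    return any(needle in p.lower() for p in passages)

# Toy data (made up) standing in for BM25 scores and corpus sentences
scores = [0.2, 3.1, 0.0, 1.7]
toy_corpus = ["a", "Arthur's Magazine was an American periodical.", "c", "d"]

top = top_k_indices(scores, k=2)
print(top, is_hit("Arthur's Magazine", [toy_corpus[i] for i in top]))
```

In the notebook, `top_k_indices(doc_scores, k)` could replace the `sorted(...)[:k]` expression and `is_hit(gold, retrieved_docs)` could replace the `any(...)` test.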


## Plug into Generation

In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import os
import torch
from huggingface_hub import login
# Read the access token from the HF_TOKEN environment variable; never hard-code it
login(token=os.environ["HF_TOKEN"])


In [15]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model =  AutoModelForCausalLM.from_pretrained(model_name,
                     device_map="auto", torch_dtype=torch.float16)

generator =  pipeline("text-generation", model=model, tokenizer=tokenizer,
               temperature=0.1, max_new_tokens=128)


Device set to use cpu


In [1]:
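def generate_answer_with_bm25(q, k=3):
    """Retrieve the top-k BM25 passages for q and ask the LLM to answer from them."""
    doc_scores = bm25.get_scores(tokenize(q))
    context_ids = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:k]

    context = "\n".join([corpus[i] for i in context_ids])
    # f-strings so that {context} and {q} are actually substituted into the prompt
    prompt = ("You are a question-answering system. Use the context. \n\n"
        f"Context: {context}\n\n"
        f"Question: {q}\n\n"
        "Answer briefly:"
    )

    # The pipeline output includes the prompt, so keep only the text after the last marker
    return generator(prompt)[0]["generated_text"].split("Answer briefly:")[-1].strip()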

generate_answer_with_bm25(questions[0], k=3)


## Evaluation

In [None]:
predictions = [generate_answer_with_bm25(q, k=3) for q in tqdm.tqdm(questions[:100])]

predictions_formatted = []
references_formatted = []

for i, (pred, ref) in enumerate(zip(predictions, gold_answers[:100])):
    predictions_formatted.append({"id": str(i), "prediction_text": pred})
    references_formatted.append({"id": str(i), "answers": {"text": [ref], "answer_start": [0]}})
squad = evaluate.load("squad")
results = squad.compute(predictions=predictions_formatted, references=references_formatted)
print(json.dumps(results, indent=2))
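
The `squad` metric reports `exact_match` and `f1`. As a rough guide to what those numbers mean, the sketch below approximates SQuAD-style scoring for a single prediction: both strings are normalized, then compared exactly (EM) or by token overlap (F1). It is a simplified stand-in, not the metric's actual code, and the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

pred, gold = "The city of Delhi.", "Delhi"
print(exact_match(pred, gold), token_f1(pred, gold))
```

A verbose answer that contains the gold string still scores zero on exact match and only partial F1, so a generator that answers in full sentences may look worse than its retrieval quality suggests.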

## Knobs and Experiments

The following knobs can be tuned to improve retrieval performance:

- Increase $k$ (the number of retrieved chunks): recall may rise, but with diminishing returns, and the LLM context window may overflow
- Tune the BM25 hyper-parameters
    - $k_1$ ~ [0.9, 1.2, 1.8]
        - Smaller = less TF influence
    - $b$ ~ [0.4, 1.0]
        - Lower $b$ weakens length normalization, which may help short passages

- Stemming vs no stemming (Snowball): may raise recall by ~1-2%
- Stop-word dropping: often hurts QA; keep them
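
To make the effect of $k_1$ and $b$ concrete, the sketch below evaluates only the term-frequency factor of the standard BM25 formula, with IDF left out. The function `tf_weight`, its default values, and the document lengths are assumptions chosen for illustration; nothing here is measured on HotpotQA.

```python
def tf_weight(tf, doc_len, avg_len, k1=1.5, b=0.75):
    """BM25 term-frequency factor for one term in one document (IDF omitted)."""
    norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * norm)

# How much more a term counts when it occurs 5 times instead of once
# (document of average length, so length normalization has no effect)
gains = {}
for k1 in (0.9, 1.2, 1.8):
    gains[k1] = tf_weight(5, 20, 20, k1=k1) / tf_weight(1, 20, 20, k1=k1)
    print(f"k1={k1}: tf=5 vs tf=1 gain = {gains[k1]:.2f}")

# Weight of a single match in a document twice the average length,
# relative to the same match in an average-length document
ratios = {}
for b in (0.4, 1.0):
    ratios[b] = tf_weight(1, 40, 20, b=b) / tf_weight(1, 20, 20, b=b)
    print(f"b={b}: long-document weight ratio = {ratios[b]:.2f}")
```

The gain grows with $k_1$, which is the sense in which a smaller $k_1$ means less TF influence. The long-document ratio moves closer to 1 as $b$ drops, meaning length matters less.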

## Strengths & Points Observed

**What BM25 did well**

- Simple "Who is the husband of ..." questions where the answer string exists verbatim
- Token-overlap-heavy queries (names, dates)

**Where it failed**

- Synonymy ("movie" vs "film")
- Morphology ("run" vs "running")
- Long multi-hop queries: BM25 may retrieve only one of the required passages
- Rare entities absent from the corpus can never be matched
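
The synonymy and morphology failures come from BM25 scoring only exact token matches. The toy sketch below uses made-up sentences and plain whitespace tokens rather than the notebook's tokenizer to show which query terms can contribute to a score at all.

```python
def overlap(query, passage):
    """Set of lowercase whitespace tokens shared by the query and the passage."""
    return set(query.lower().split()) & set(passage.lower().split())

passage = "the film stars an actor running a marathon"
query = "which movie features a marathon run"

shared = overlap(query, passage)
print(sorted(shared))
```

Only "a" and "marathon" are shared. "movie" never meets "film" and "run" never meets "running", so neither pair contributes to the score.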

**These limitations point directly at dense semantic embeddings.**