# BM25 vs Vector Search: Lexical and Semantic Retrieval

This notebook compares two fundamental search approaches: BM25 (lexical/keyword-based) and vector search (semantic/meaning-based).

## Learning Objectives

By the end of this notebook, you will be able to:
- Implement BM25 retrieval for keyword-based search
- Implement vector search using embeddings
- Compare the results returned by lexical and semantic approaches
- Understand when to use each method
- Recognize the benefits of hybrid retrieval systems

## Search Paradigms

### BM25 (Best Matching 25)
- **Lexical search** - Matches based on exact keywords and term frequency
- **Fast and efficient** - No embeddings or neural networks required
- **Exact matching** - Finds documents with specific terminology
- **Limitations** - Cannot understand synonyms or semantic relationships

### Vector Search
- **Semantic search** - Matches based on meaning, not just keywords
- **Understanding** - Captures context and related concepts
- **Neural-based** - Uses embeddings from pre-trained models
- **Limitations** - May miss exact keyword matches
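
The lexical limitation listed above can be illustrated with a minimal sketch. The helper `shared_terms` is hypothetical (it is not part of any library) and uses plain whitespace tokenization, which is cruder than what a real BM25 implementation does. It only shows that a lexical method has nothing to score when the query and document share no tokens.

```python
def shared_terms(query, document):
    """Return the lowercase tokens that appear in both query and document."""
    return set(query.lower().split()) & set(document.lower().split())

doc = "Transformers are neural network models that use self-attention mechanisms"

# Exact wording overlaps, so a lexical method has terms to score
print(shared_terms("neural network models", doc))

# Related meaning but different wording: no shared tokens at all
print(shared_terms("attention-based architectures", doc))
```

A semantic method can still relate the second query to the document, because it compares embeddings rather than tokens.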

## Setup: Install Required Libraries

In [None]:
import os
os.environ['UV_LINK_MODE'] = 'copy'

!uv pip install llama-index-retrievers-bm25 llama-index-embeddings-huggingface

print("✓ Required libraries installed successfully!")

## Sample Documents
We'll create a collection of sample documents related to machine learning concepts to test our search methods.

In [2]:
# Sample docs
from llama_index.core import Document

# Creating a diverse set of AI/ML related documents for our retrieval experiments
documents = [
    Document(text="Machine learning is a branch of artificial intelligence focused on building systems that learn from data.",
             metadata={"title": "Machine Learning Basics"}),
    Document(text="Transformers are neural network models that use self-attention mechanisms to process sequential data.",
             metadata={"title": "Transformer Architecture"}),
    Document(text="Python code for neural networks typically uses libraries like TensorFlow or PyTorch.",
             metadata={"title": "Neural Network Code"}),
    Document(text="The backpropagation algorithm calculates gradients by applying the chain rule backwards through the network.",
             metadata={"title": "Backpropagation Algorithm"}),
    Document(text="BM25 is a ranking function used in information retrieval systems based on term frequency.",
             metadata={"title": "BM25 Algorithm"}),
    Document(text="Artificial intelligence concepts include reasoning, learning, and adaptation in complex environments.",
             metadata={"title": "AI Concepts"}),
    Document(text="Deep learning is a subset of machine learning that uses multi-layered neural networks to extract complex patterns.",
             metadata={"title": "Deep Learning Introduction"}),
    Document(text="Convolutional Neural Networks (CNNs) are specialized neural architectures designed for image processing and computer vision tasks.",
             metadata={"title": "CNN Architecture"}),
    Document(text="Natural Language Processing (NLP) uses computational techniques to analyze and understand human language text and speech.",
             metadata={"title": "NLP Fundamentals"}),
    Document(text="Reinforcement learning is a training method based on rewarding desired behaviors and punishing undesired ones.",
             metadata={"title": "Reinforcement Learning"}),
    Document(text="Vector databases store high-dimensional vectors for efficient similarity search and retrieval.",
             metadata={"title": "Vector Database Systems"}),
    Document(text="The BERT language model uses bidirectional training to understand context from both directions in text.",
             metadata={"title": "BERT Model"}),
    Document(text="Hybrid retrieval systems combine multiple search techniques like BM25 and vector search for improved results.",
             metadata={"title": "Hybrid Retrieval"})
]

## Test Helper Function
This function will help us test our retrieval methods with different queries and display the results.

In [3]:
from llama_index.core.schema import QueryBundle

# Helper function to display retrieval results in a readable format
def test_bm25_retrieval(retriever, queries):
    """Test BM25 retriever with a list of queries."""
    for query in queries:
        print(f"\n{'='*80}\nQuery: {query}\n{'='*80}")
        query_bundle = QueryBundle(query_str=query)
        results = retriever.retrieve(query_bundle)

        print(f"Found {len(results)} relevant documents\n")
        for i, result in enumerate(results):
            print(f"Result {i+1} (Score: {result.score:.8f}):")
            print(f"  {result.node.get_content()[:200]}...\n")

## BM25 Retrieval

### What is BM25?
BM25 (Best Matching 25) is a ranking function used in information retrieval. It is a lexical search method that scores each document by the query terms it contains, weighting term frequency by inverse document frequency (rarer terms count more) and normalizing for document length.

Key characteristics:
- Purely lexical (keyword-based)
- No embeddings or neural networks required
- Works well for exact term matching
- Cannot understand semantic relationships between words
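
To make the scoring concrete, here is a minimal from-scratch sketch of the Okapi BM25 formula. It assumes whitespace tokenization and the commonly used parameter values `k1=1.5` and `b=0.75`; the function `bm25_scores` and the toy corpus are illustrative only, and the library retriever used in this notebook may tokenize and weight terms differently.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_docs = [
    "the backpropagation algorithm calculates gradients",
    "vector databases store high-dimensional vectors",
    "bm25 is a ranking function based on term frequency",
]
print(bm25_scores("backpropagation algorithm", toy_docs))
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score. This is the same behavior as the zero-score results a BM25 retriever returns when it must fill `similarity_top_k` slots without any matching terms.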

In [4]:
# Import necessary modules
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.node_parser import SentenceSplitter

print(f"Loaded {len(documents)} documents")

# Split documents into nodes (chunks)
# The SentenceSplitter breaks documents into smaller chunks for processing
splitter = SentenceSplitter(chunk_size=200)
nodes = splitter.get_nodes_from_documents(documents)

# Create BM25 Retriever - Note: no embedding model needed for BM25!
# BM25 works purely on lexical matching (word frequencies)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=3  # Return the top 3 most relevant documents
)

# Run test queries - see how BM25 performs on basic questions
test_queries = [
    "What is machine learning?",
    "How do transformers work?",
    "Explain the limitations of BM25",
    "Python code examples for neural networks"
]
test_bm25_retrieval(bm25_retriever, test_queries)

# Compare with specific technical terms - BM25 does well when the exact term
# appears in the corpus; terms absent from it (e.g. "dropout") have nothing to match
technical_queries = [
    "dropout regularization technique",
    "backpropagation algorithm",
    "cross-entropy loss function",
    "BERT pre-training objective"
]
print("\n\nTesting with technical queries:")
test_bm25_retrieval(bm25_retriever, technical_queries)


Loaded 13 documents

Query: What is machine learning?
Found 3 relevant documents

Result 1 (Score: 1.72518969):
  Machine learning is a branch of artificial intelligence focused on building systems that learn from data....

Result 2 (Score: 1.36624670):
  Deep learning is a subset of machine learning that uses multi-layered neural networks to extract complex patterns....

Result 3 (Score: 0.65637398):
  Reinforcement learning is a training method based on rewarding desired behaviors and punishing undesired ones....


Query: How do transformers work?
Found 3 relevant documents

Result 1 (Score: 1.29171598):
  Transformers are neural network models that use self-attention mechanisms to process sequential data....

Result 2 (Score: 0.00000000):
  The BERT language model uses bidirectional training to understand context from both directions in text....

Result 3 (Score: 0.00000000):
  Vector databases store high-dimensional vectors for efficient similarity search and retrieval....


Query:

## Comparing BM25 and Vector Search

### Vector Search Overview
Vector search uses embeddings to represent documents and queries in a high-dimensional space, then finds documents that are "close" to the query in this space.

Key characteristics:
- Based on semantic similarity (meaning) rather than exact words
- Uses neural networks to create embeddings
- Can understand synonyms and related concepts
- May miss exact keyword matches that BM25 would catch
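
The notion of "closeness" is commonly measured with cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors in place of real embeddings (a model such as the one used later in this notebook produces vectors with hundreds of dimensions), so it only illustrates the ranking step, not how embeddings are learned.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", chosen by hand for illustration only
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

# Rank the other words by similarity to the query vector
query_vec = toy_embeddings["car"]
ranked = sorted(
    ((cosine_similarity(query_vec, vec), word)
     for word, vec in toy_embeddings.items() if word != "car"),
    reverse=True,
)
for score, word in ranked:
    print(f"{word}: {score:.3f}")
```

"automobile" ranks above "banana" even though it shares no characters with "car", which is the kind of synonym match a purely lexical method cannot make.
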

Let's compare both approaches on the same queries:

In [5]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.schema import QueryBundle
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print(f"Created {len(documents)} sample documents")

# Create both retriever types for comparison
def create_retrievers(documents):
    # Parse into nodes - Using a large chunk size to keep each document as one node
    parser = SentenceSplitter(chunk_size=2000, chunk_overlap=0)
    nodes = parser.get_nodes_from_documents(documents)
    
    # 1. BM25 Retriever - lexical search based on term frequencies
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=nodes, similarity_top_k=5)

    # 2. Vector Retriever - semantic search based on embeddings
    # Load a pre-trained embedding model
    embed_model = HuggingFaceEmbedding(
        model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector index and retriever
    vector_index = VectorStoreIndex(
        nodes,
        embed_model=embed_model,
    )
    vector_retriever = vector_index.as_retriever(similarity_top_k=5)

    return {"BM25": bm25_retriever, "Vector": vector_retriever}

# Compare the two retrieval methods
def compare_retrievers(retrievers, queries):
    results = {}

    for name, retriever in retrievers.items():
        method_results = []

        for query in queries:
            print(f"Running {name} retriever on: {query}")
            query_bundle = QueryBundle(query_str=query)
            retrieved = retriever.retrieve(query_bundle)

            # Store results with titles for better readability
            result = {
                "query": query,
                # Fall back to a placeholder so a missing title cannot break ', '.join later
                "titles": [hit.node.metadata.get("title", "Untitled") for hit in retrieved]
            }
            method_results.append(result)

        results[name] = method_results

    return results

# Display the results in a human-readable format
def display_results(results):
    for query_idx, query in enumerate([r["query"] for r in results["BM25"]]):
        print(f"\n\nQuery: {query}")
        print("-" * 50)

        # Show what each method found
        for method in results:
            titles = results[method][query_idx]["titles"]
            print(f"{method} found: {', '.join(titles)}")

        # Calculate and display overlap and differences
        bm25_titles = set(results["BM25"][query_idx]["titles"])
        vector_titles = set(results["Vector"][query_idx]["titles"])
        overlap = bm25_titles.intersection(vector_titles)

        print(
            f"Overlap: {len(overlap)} documents ({', '.join(overlap) if overlap else 'None'})")
        print(
            f"Unique to BM25: {', '.join(bm25_titles - vector_titles) if bm25_titles - vector_titles else 'None'}")
        print(
            f"Unique to Vector: {', '.join(vector_titles - bm25_titles) if vector_titles - bm25_titles else 'None'}")


# Run comparison tests
queries = [
    "What is machine learning?",
    "transformer architecture",
    "python neural network code",
    "backpropagation algorithm",
    "information retrieval"
]

retrievers = create_retrievers(documents)
results = compare_retrievers(retrievers, queries)
display_results(results)

Created 13 sample documents


Running BM25 retriever on: What is machine learning?
Running BM25 retriever on: transformer architecture
Running BM25 retriever on: python neural network code
Running BM25 retriever on: backpropagation algorithm
Running BM25 retriever on: information retrieval
Running Vector retriever on: What is machine learning?
Running Vector retriever on: transformer architecture
Running Vector retriever on: python neural network code
Running Vector retriever on: backpropagation algorithm
Running Vector retriever on: information retrieval


Query: What is machine learning?
--------------------------------------------------
BM25 found: Machine Learning Basics, Deep Learning Introduction, Reinforcement Learning, AI Concepts, Hybrid Retrieval
Vector found: Machine Learning Basics, Deep Learning Introduction, Reinforcement Learning, AI Concepts, NLP Fundamentals
Overlap: 4 documents (AI Concepts, Machine Learning Basics, Reinforcement Learning, Deep Learning Introduction)
Unique to BM25: Hybrid Retriev

## Summary

We've compared BM25 lexical search and vector-based semantic search.

### Key Takeaways

1. **BM25 excels at exact matching** - Best for technical terms, specific keywords, precise queries
2. **Vector search excels at semantics** - Best for conceptual queries, synonyms, related topics
3. **Different results** - Methods often return different documents, showing complementary strengths
4. **Hybrid often works best** - Combining both methods can leverage the strengths of each approach

### When to Use Each Method

**Use BM25 when:**
- Exact terminology matters (legal, medical, technical docs)
- Specific keywords are critical
- Fast performance needed (no embedding computation)
- Dealing with rare or specialized terms

**Use Vector Search when:**
- Semantic meaning matters more than exact words
- Handling synonyms and paraphrases
- Cross-lingual search (with a multilingual embedding model)
- Conceptual similarity important

**Use Hybrid (both) when:**
- Production search systems
- Maximum recall needed
- Diverse query types
- Best overall retrieval quality required
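
One common way to combine two ranked lists is reciprocal rank fusion (RRF). The sketch below is a generic illustration, not the approach of any particular library: the function name `reciprocal_rank_fusion`, the constant `k=60` (a conventional default), and the shortened example rankings are all assumptions made for this example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked highly by multiple methods rise to the top.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Illustrative, shortened rankings keyed by document title
bm25_ranking = ["Machine Learning Basics", "Deep Learning Introduction", "Hybrid Retrieval"]
vector_ranking = ["Machine Learning Basics", "Deep Learning Introduction", "NLP Fundamentals"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.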

### Next Steps

In the following notebooks, we'll explore:
- Implementing hybrid retrieval that combines both methods
- Ranking and reranking strategies
- Production deployment patterns

## Conclusion

In this notebook, we've explored two fundamental retrieval approaches:
- BM25 (lexical search) which works with keywords and term frequencies
- Vector search which uses embeddings to capture semantic meaning

Each method has its strengths and weaknesses. BM25 excels at finding exact matches and specific terminology, while vector search is better at understanding the meaning behind queries. 

For production systems, consider using a hybrid approach that combines the best of both worlds.