# Building a Complete Retrieval Pipeline

This notebook demonstrates building a production-ready retrieval pipeline that combines BM25, vector search, hybrid fusion, and cross-encoder reranking.

## Learning Objectives

By the end of this notebook, you will be able to:
- Build modular retrieval components (BM25, vector, hybrid)
- Implement configurable pipelines for different use cases
- Compare performance across pipeline configurations
- Understand latency-quality trade-offs
- Deploy production-ready retrieval systems

## Pipeline Architecture

**Components:**
1. **BM25 Retriever** - Fast lexical search
2. **Vector Retriever** - Semantic similarity search
3. **Weighted Fusion** - Combines BM25 + Vector
4. **Cross-Encoder Reranker** - Refines top results

**Complete Pipeline:**
```
Query → [BM25 + Vector] → Fusion → Reranking → Final Results
```

This multi-stage approach balances speed and accuracy for production systems.

## Setup and Imports

Let's start by importing the necessary libraries:

In [1]:
import time
from typing import Dict, List

# LlamaIndex imports
from llama_index.core.schema import Document, NodeWithScore, QueryBundle, TextNode
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import BaseRetriever
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Sentence Transformers imports
from sentence_transformers import SentenceTransformer, CrossEncoder

## Creating Sample Documents

Let's create a set of sample documents to work with throughout this notebook:

In [2]:
texts = [
    "Python is a high-level programming language known for its readability.",
    "Machine learning is a subset of AI that enables systems to learn from data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Deep learning uses neural networks with many layers to extract features from data.",
    "Natural language processing helps computers understand human language.",
    "Python libraries like PyTorch and TensorFlow are used for deep learning.",
    "BM25 is a bag-of-words retrieval function used in information retrieval.",
    "Vector search finds documents by measuring similarity in embedding space.",
    "Reranking refines initial search results with a more complex model.",
    "Retrieval pipelines combine multiple techniques for better search results."
]

# Convert to Document objects
documents = [Document(text=text, id_=f"doc_{i}") for i, text in enumerate(texts)]

# Convert to Nodes for retrieval
nodes = [TextNode(text=doc.text, id_=doc.id_) for doc in documents]

print(f"Created {len(documents)} sample documents")

Created 10 sample documents


## Creating a Testing Function

Let's create a helper function to test our retrievers consistently:

In [3]:
def test_retriever(retriever, query_text, name="Retriever"):
    """Test a retriever with a query and print results."""
    print(f"\n=== Testing {name} ===")
    start_time = time.time()
    results = retriever.retrieve(QueryBundle(query_text))
    end_time = time.time()

    print(f"Query: '{query_text}'")
    print(f"Retrieved {len(results)} documents in {(end_time - start_time):.4f} seconds")

    for i, node in enumerate(results[:3], 1):  # Show top 3 for brevity
        print(f"{i}. Score: {node.score:.4f} - {node.node.get_content()}")

    return results

## Basic Retrieval: BM25

Let's start with a traditional lexical search method: BM25. This algorithm ranks documents based on term frequency and inverse document frequency, essentially looking for keyword matches.

BM25 is great for finding documents containing specific terms in the query, but it doesn't understand synonyms or semantic meaning.
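
For intuition about what a BM25 score is made of, here is a minimal pure-Python sketch of one common BM25 variant. The function name `bm25_scores`, the whitespace tokenization, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF formula are assumptions chosen for illustration; `BM25Retriever` may tokenize and weight terms differently, so the numbers will not match its output.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


toy_docs = [
    "python is readable",
    "machine learning learns from data",
    "python libraries for deep learning",
    "bm25 ranks documents",
]
toy_scores = bm25_scores("python learning", toy_docs)
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best, [round(s, 3) for s in toy_scores])
```

The third toy document wins because it is the only one containing both query terms, and the last one scores zero because it shares none. That is the keyword-matching behavior described above: a synonym of a query term would contribute nothing.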

In [4]:
# Create a BM25 retriever
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

# Test the BM25 retriever
query = "How is Python used in machine learning?"
bm25_results = test_retriever(bm25_retriever, query, "BM25 Retriever")


=== Testing BM25 Retriever ===
Query: 'How is Python used in machine learning?'
Retrieved 5 documents in 0.0012 seconds
1. Score: 1.5418 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 1.4118 - Machine learning is a subset of AI that enables systems to learn from data.
3. Score: 0.8041 - Deep learning uses neural networks with many layers to extract features from data.


## Basic Retrieval: Vector Search

Now let's try semantic search using vector embeddings. This method converts both the query and documents into vector representations and finds the most similar documents based on vector similarity.

Vector search is better at understanding semantic meaning, even when exact keywords aren't present.
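
The core of vector retrieval can be sketched in a few lines, assuming cosine similarity as the distance measure (a common choice). The three-dimensional vectors below are made up for illustration, and `cosine_similarity` and `top_k_by_cosine` are hypothetical helper names; a real embedding model such as the one used in this notebook produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return (doc_index, similarity) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy "embeddings": documents 0 and 1 point in similar directions, document 2 does not.
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
toy_query_vec = [0.9, 0.1, 0.0]

ranking = top_k_by_cosine(toy_query_vec, toy_doc_vecs, k=2)
for doc_index, score in ranking:
    print(doc_index, round(score, 4))
```

This brute-force version compares the query against every document, which is fine for ten documents but is exactly the part that approximate nearest neighbor indexes replace at scale.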

In [5]:
# Create the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_model = HuggingFaceEmbedding(model_name=model_name)

# Create the bi-encoder to generate embeddings
bi_encoder = SentenceTransformer(model_name, device="cpu")

# Generate embeddings for all nodes.
# encode() returns a NumPy array; node.embedding expects a plain list of floats.
for node in nodes:
    node.embedding = bi_encoder.encode(node.get_content()).tolist()

# Create vector index and retriever
vector_index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)
vector_retriever = vector_index.as_retriever(similarity_top_k=5)

# Test the vector retriever
vector_results = test_retriever(vector_retriever, query, "Vector Retriever")


=== Testing Vector Retriever ===
Query: 'How is Python used in machine learning?'
Retrieved 5 documents in 0.0143 seconds
1. Score: 0.6273 - Python is a high-level programming language known for its readability.
2. Score: 0.6004 - Python libraries like PyTorch and TensorFlow are used for deep learning.
3. Score: 0.5428 - Machine learning is a subset of AI that enables systems to learn from data.


## Comparing the Results So Far

Notice the differences between BM25 and vector search results:

- **BM25** ranks the Python libraries document highest because it shares the most query terms, followed closely by the "Machine learning" document.
- **Vector Search** ranks the general "Python" document highest due to semantic similarity, then the libraries document.

Each method has strengths and weaknesses:
- BM25 is precise with keywords but misses semantic relationships
- Vector search understands semantics but might miss exact term matches

What if we could combine their strengths?

## Hybrid Retrieval: Combining BM25 and Vector Search

Now let's implement a hybrid approach that combines both methods. This weighted fusion retriever will:
1. Get results from both BM25 and vector retrievers
2. Assign weights to each method (e.g., 70% vector, 30% BM25)
3. Combine and rerank the results based on the weighted scores
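
To make the weighting step concrete, the sketch below applies 70/30 weights to the scores that the BM25 and vector retrievers printed above for the two documents that appear in both top-3 lists. The function name `weighted_fusion` and the short document labels are hypothetical, introduced only for this illustration.

```python
def weighted_fusion(scores_by_retriever, weights):
    """Sum weight * score per document across retrievers, highest first."""
    fused = {}
    for name, scores in scores_by_retriever.items():
        weight = weights.get(name, 1.0)
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return dict(sorted(fused.items(), key=lambda item: item[1], reverse=True))


# Scores copied from the retriever outputs above (labels are shorthand).
bm25_scores_seen = {"python_libraries": 1.5418, "machine_learning": 1.4118}
vector_scores_seen = {"python_libraries": 0.6004, "machine_learning": 0.5428}

fused = weighted_fusion(
    {"vector": vector_scores_seen, "bm25": bm25_scores_seen},
    {"vector": 0.7, "bm25": 0.3},
)
for doc_id, score in fused.items():
    print(f"{doc_id}: {score:.4f}")
# python_libraries: 0.8828
# machine_learning: 0.8035
```

Note that BM25 scores are not bounded to [0, 1]. For the top document, the BM25 term (0.3 × 1.5418 ≈ 0.46) contributes more than the vector term (0.7 × 0.6004 ≈ 0.42) even though its weight is smaller. Normalizing each retriever's scores before weighting, for example with min-max scaling, is one common way to make the weights behave as intended.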

In [12]:
class WeightedFusionRetriever(BaseRetriever):
    """Combines results from multiple retrievers with weights."""

    def __init__(self, retrievers: Dict[str, BaseRetriever], weights: Dict[str, float]):
        self.retrievers = retrievers
        self.weights = weights
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        all_results = {}

        # Get results from each retriever
        for name, retriever in self.retrievers.items():
            results = retriever.retrieve(query_bundle)
            weight = self.weights.get(name, 1.0)

            # Combine results with weighting
            for node in results:
                node_id = node.node.node_id
                weighted_score = node.score * weight

                if node_id not in all_results:
                    all_results[node_id] = {"node": node.node, "scores": {}}
                all_results[node_id]["scores"][name] = weighted_score

        # Create final results with combined scores
        final_results = [
            NodeWithScore(node=data["node"], score=sum(data["scores"].values()))
            for node_id, data in all_results.items()
        ]

        return sorted(final_results, key=lambda x: x.score, reverse=True)


# Create hybrid retriever
hybrid_retriever = WeightedFusionRetriever(
    retrievers={"vector": vector_retriever, "bm25": bm25_retriever},
    weights={"vector": 0.7, "bm25": 0.3}
)

# Test the hybrid retriever
hybrid_results = test_retriever(hybrid_retriever, query, "Hybrid Retriever")


=== Testing Hybrid Retriever ===
Query: 'How is Python used in machine learning?'
Retrieved 6 documents in 0.0154 seconds
1. Score: 0.8828 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.8035 - Machine learning is a subset of AI that enables systems to learn from data.
3. Score: 0.6208 - Python is a high-level programming language known for its readability.


## Enhancing Results with Cross-Encoder Reranking

Our hybrid retriever improves results by combining methods, but it still relies on the initial retrieval scores.

Let's take it to the next level with cross-encoder reranking:

**Bi-Encoders vs Cross-Encoders:**
- **Bi-Encoders** (like our vector retriever) encode queries and documents separately
- **Cross-Encoders** process query-document pairs together, capturing complex interactions

While cross-encoders are more accurate, they're too computationally expensive to use on an entire collection. The solution? Use them only to rerank a smaller set of candidates from our initial retrieval.
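
The retrieve-then-rerank pattern can be sketched independently of any model. In the sketch below, `term_overlap` stands in for a cheap first-stage scorer and `CountingScorer` stands in for an expensive reranker; both are hypothetical placeholders, not real models. The point is only that the expensive scorer is called `fetch_k` times, no matter how large the corpus is.

```python
def term_overlap(query, doc):
    """Cheap stand-in scorer: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


class CountingScorer:
    """Stand-in for an expensive scorer; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, query, doc):
        self.calls += 1
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / len(q | d)  # Jaccard similarity


def two_stage_search(query, corpus, cheap_score, expensive_score, fetch_k=10, top_k=5):
    # Stage 1: the cheap scorer sees every document; keep the best fetch_k.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:fetch_k]
    # Stage 2: the expensive scorer sees only those candidates.
    reranked = sorted(candidates, key=lambda doc: expensive_score(query, doc), reverse=True)
    return reranked[:top_k]


toy_corpus = [
    "Python libraries like PyTorch are used for deep learning",
    "Python is a readable programming language",
    "Machine learning enables systems to learn from data",
    "BM25 is a bag-of-words retrieval function",
    "Vector search measures similarity in embedding space",
    "Reranking refines initial search results",
]
expensive = CountingScorer()
top = two_stage_search("python machine learning", toy_corpus, term_overlap, expensive,
                       fetch_k=3, top_k=2)
print(expensive.calls)  # 3
print(top)
```

The trade-off is that a relevant document missed in stage 1 can never be recovered in stage 2, which is why `fetch_k` is usually set larger than `top_k`.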

In [7]:
class RerankedRetriever(BaseRetriever):
    """Two-stage retriever: initial retrieval + cross-encoder reranking."""

    def __init__(self, base_retriever, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
                 fetch_k=10, top_k=5):
        self.base_retriever = base_retriever
        self.reranker = CrossEncoder(model_name, device="cpu")
        self.fetch_k = fetch_k
        self.top_k = top_k
        super().__init__()
        print(f"Loaded CrossEncoder model: {model_name}")

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # Stage 1: Get initial candidates from base retriever
        base_nodes = self.base_retriever.retrieve(query_bundle)[:self.fetch_k]

        # Early return if no results
        if not base_nodes:
            return []

        # Stage 2: Rerank candidates with cross-encoder
        query = query_bundle.query_str
        node_texts = [node.node.get_content() for node in base_nodes]
        rerank_scores = self.reranker.predict(
            [(query, text) for text in node_texts])

        # Create reranked nodes
        reranked_nodes = [
            NodeWithScore(node=node.node, score=float(score))
            for node, score in zip(base_nodes, rerank_scores)
        ]

        # Sort and filter
        reranked_nodes.sort(key=lambda x: x.score, reverse=True)
        return reranked_nodes[:self.top_k] if self.top_k else reranked_nodes


# Create reranked retriever
reranked_retriever = RerankedRetriever(
    base_retriever=hybrid_retriever,
    fetch_k=10,
    top_k=5
)

# Test the reranked retriever
reranked_results = test_retriever(reranked_retriever, query, "Reranked Retriever")

Loaded CrossEncoder model: cross-encoder/ms-marco-MiniLM-L-6-v2

=== Testing Reranked Retriever ===
Query: 'How is Python used in machine learning?'
Retrieved 5 documents in 0.0512 seconds
1. Score: 5.1547 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.4848 - Python is a high-level programming language known for its readability.
3. Score: -2.4892 - Machine learning is a subset of AI that enables systems to learn from data.


## Observations about Reranking

Notice how the cross-encoder changed the ranking and the score scale:

1. The document about "Python libraries for deep learning" keeps the top position, now with a much larger margin (score 5.15)
2. The general "Python" document moves from third to second, and the "Machine learning" document drops from second to third
3. The scores now range from positive to negative because the cross-encoder returns raw, unbounded relevance scores rather than bounded similarities
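
Because these scores are unbounded, they cannot be compared directly with cosine similarities or read as percentages. If values in (0, 1) are more convenient, for example for thresholding, one option is to pass the raw scores through a logistic sigmoid, which preserves the ranking. The sketch below uses the three scores printed above; treating the results as calibrated probabilities is an assumption that would need checking on your own data.

```python
import math


def sigmoid(x):
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Raw cross-encoder scores from the reranked output above.
raw_scores = [5.1547, 0.4848, -2.4892]
squashed = [sigmoid(score) for score in raw_scores]

for raw, value in zip(raw_scores, squashed):
    print(f"{raw:>8.4f} -> {value:.4f}")
```

Since the sigmoid is monotonic, sorting by the squashed values gives the same order as sorting by the raw scores.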

These changes make intuitive sense. For the query "How is Python used in machine learning?", the most relevant document is indeed the one that directly mentions Python libraries used for deep learning.

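However, reranking added latency: about 0.051 s for this query, compared with about 0.015 s for hybrid retrieval alone. This is the efficiency-effectiveness trade-off that makes the two-stage approach valuable, because the expensive model only ever sees a small candidate set.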

## Building a Complete Configurable Pipeline

Now that we've explored the individual components, let's build a complete, configurable retrieval pipeline that can adapt to different needs:

In [13]:
class RetrievalPipeline:
    """Configurable retrieval pipeline combining multiple techniques."""

    def __init__(self, use_bm25=True, use_vector=True, use_hybrid=True, use_reranking=True,
                 vector_weight=0.7, bm25_weight=0.3, top_k=5, rerank_top_k=10):
        self.config = {
            "use_bm25": use_bm25, "use_vector": use_vector,
            "use_hybrid": use_hybrid, "use_reranking": use_reranking,
            "vector_weight": vector_weight, "bm25_weight": bm25_weight,
            "top_k": top_k, "rerank_top_k": rerank_top_k
        }
        self.pipeline = None

    def build(self, nodes):
        """Build the pipeline based on configuration."""
        enabled = [k for k, v in self.config.items()
                   if v and k.startswith('use_')]
        print(f"\nBuilding pipeline: {', '.join(enabled)}")

        # Set up retrievers
        retrievers = {}
        if self.config["use_bm25"]:
            retrievers["bm25"] = BM25Retriever.from_defaults(
                nodes=nodes, similarity_top_k=self.config["top_k"]
            )

        if self.config["use_vector"]:
            embed_model = HuggingFaceEmbedding(
                model_name="sentence-transformers/all-MiniLM-L6-v2"
            )
            vector_index = VectorStoreIndex(
                nodes=nodes, embed_model=embed_model)
            retrievers["vector"] = vector_index.as_retriever(
                similarity_top_k=self.config["top_k"]
            )

        # Select base retriever
        if self.config["use_hybrid"] and len(retrievers) > 1:
            weights = {
                "vector": self.config["vector_weight"],
                "bm25": self.config["bm25_weight"]
            }
            base_retriever = WeightedFusionRetriever(
                retrievers=retrievers, weights=weights)
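        elif retrievers:
            # Fall back to the first enabled retriever (BM25 if present, else vector)
            base_retriever = next(iter(retrievers.values()))
        else:
            raise ValueError("No retriever enabled. Set use_bm25 or use_vector to True.")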

        # Add reranking if enabled
        if self.config["use_reranking"]:
            self.pipeline = RerankedRetriever(
                base_retriever=base_retriever,
                fetch_k=self.config["rerank_top_k"],
                top_k=self.config["top_k"]
            )
        else:
            self.pipeline = base_retriever
        return self

    def retrieve(self, query, verbose=True):
        """Execute the retrieval pipeline on a query."""
        if self.pipeline is None:
            raise ValueError("Pipeline not built. Call build() first.")

        if isinstance(query, str):
            query = QueryBundle(query)

        start_time = time.time()
        results = self.pipeline.retrieve(query)
        elapsed = time.time() - start_time

        if verbose:
            print(f"\n=== Results ({elapsed:.4f}s) ===")
            for i, node in enumerate(results[:3], 1):
                print(f"{i}. {node.score:.4f} - {node.node.get_content()}")

        return results, elapsed


# Try out the pipeline with the full configuration
full_pipeline = RetrievalPipeline().build(nodes)
results, elapsed = full_pipeline.retrieve(query)


Building pipeline: use_bm25, use_vector, use_hybrid, use_reranking
Loaded CrossEncoder model: cross-encoder/ms-marco-MiniLM-L-6-v2

=== Results (0.0483s) ===
1. 5.1547 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. 0.4848 - Python is a high-level programming language known for its readability.
3. -2.4892 - Machine learning is a subset of AI that enables systems to learn from data.


## Comparing Different Pipeline Configurations

Now let's compare different configurations of our pipeline to see how each component affects performance and results. This will help us understand the trade-offs between efficiency and effectiveness.

In [14]:
def test_configurations(nodes, query):
    """Compare different pipeline configurations on the same query."""
    configurations = {
        "BM25 Only": {"use_bm25": True, "use_vector": False, "use_hybrid": False, "use_reranking": False},
        "Vector Only": {"use_bm25": False, "use_vector": True, "use_hybrid": False, "use_reranking": False},
        "Hybrid": {"use_bm25": True, "use_vector": True, "use_hybrid": True, "use_reranking": False},
        "Full Pipeline": {"use_bm25": True, "use_vector": True, "use_hybrid": True, "use_reranking": True}
    }

    print(f"\n=== Pipeline Configuration Comparison ===")
    print(f"Query: '{query}'")
    
    results = {}

    for name, config in configurations.items():
        pipeline = RetrievalPipeline(**config).build(nodes)
        retrieval_results, elapsed = pipeline.retrieve(query, verbose=False)
        results[name] = {"time": elapsed, "results": retrieval_results}
        
        print(f"\n{name} ({elapsed:.4f}s):")
        for i, node in enumerate(retrieval_results[:2], 1):
            print(f"{i}. Score: {node.score:.4f} - {node.node.get_content()}")
    
    # Print performance summary
    print("\n=== Performance Summary ===")
    baseline_time = results["BM25 Only"]["time"]
    for name, result in results.items():
        line = f"{name}: {result['time']:.4f}s"
        if name != "BM25 Only":
            relative = result["time"] / baseline_time
            line += f" ({relative:.1f}x slower than BM25)"
        print(line)
    
    return results

# Run the comparison
comparison_results = test_configurations(nodes, query)


=== Pipeline Configuration Comparison ===
Query: 'How is Python used in machine learning?'

Building pipeline: use_bm25

BM25 Only (0.0009s):
1. Score: 1.5418 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 1.4118 - Machine learning is a subset of AI that enables systems to learn from data.

Building pipeline: use_vector

Vector Only (0.0132s):
1. Score: 0.6273 - Python is a high-level programming language known for its readability.
2. Score: 0.6004 - Python libraries like PyTorch and TensorFlow are used for deep learning.

Building pipeline: use_bm25, use_vector, use_hybrid

Hybrid (0.0188s):
1. Score: 0.8828 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.8035 - Machine learning is a subset of AI that enables systems to learn from data.

Building pipeline: use_bm25, use_vector, use_hybrid, use_reranking
Loaded CrossEncoder model: cross-encoder/ms-marco-MiniLM-L-6-v2

Full Pipeline (0.0717s):
1. Score: 5.1547

## Trying a Different Query

Let's see how our pipeline performs on a different query. This will help us understand how the retrieval methods adapt to different information needs.

In [15]:
new_query = "What is the difference between deep learning and neural networks?"
new_comparison = test_configurations(nodes, new_query)

# Analyze the results
print("\n=== Comparison of Top Results ===")
print("For this query, BM25 and hybrid agree on the top 2 documents; vector search ranks the Python libraries document second instead.")
print("The reranker gave significantly higher scores to both documents, showing its confidence in their relevance.")


=== Pipeline Configuration Comparison ===
Query: 'What is the difference between deep learning and neural networks?'

Building pipeline: use_bm25

BM25 Only (0.0009s):
1. Score: 1.9626 - Deep learning uses neural networks with many layers to extract features from data.
2. Score: 1.7196 - Neural networks are computing systems inspired by biological neural networks.

Building pipeline: use_vector

Vector Only (0.0148s):
1. Score: 0.6050 - Deep learning uses neural networks with many layers to extract features from data.
2. Score: 0.5264 - Python libraries like PyTorch and TensorFlow are used for deep learning.

Building pipeline: use_bm25, use_vector, use_hybrid

Hybrid (0.0156s):
1. Score: 1.0123 - Deep learning uses neural networks with many layers to extract features from data.
2. Score: 0.8801 - Neural networks are computing systems inspired by biological neural networks.

Building pipeline: use_bm25, use_vector, use_hybrid, use_reranking
Loaded CrossEncoder model: cross-encoder/ms-

## Understanding the Latency-Quality Tradeoff

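Let's estimate how our pipeline's latency would grow with larger document collections. The estimate scales the measured 10-document timings linearly, so treat it as a rough upper bound rather than a prediction: with so few documents, much of the measured time is likely fixed per-query overhead (such as embedding the query) that does not grow with collection size.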

In [16]:
def estimate_scaling(base_times, doc_counts):
    """Estimate latency scaling with document collection size."""
    print("\n=== Latency Scaling Analysis ===")
    print("Estimating retrieval times for different collection sizes:\n")
    
    # Print header
    methods = list(base_times.keys())
    header = "Document Count | " + " | ".join(methods)
    print(header)
    print("-" * len(header) * 2)
    
    current_count = 10  # Our current document count
    
    # For each document count
    for i, count in enumerate(doc_counts):
        row = []
        # Format the document count
        if i == 0:
            row.append(f"{count} (current)")
        else:
            row.append(f"{count:,}")
            
        # For each method
        for method in methods:
            # For the full pipeline, only the initial retrieval scales with document count
            # The reranking time is constant based on fetch_k
            if method == "Full Pipeline" and i > 0:
                # Estimate the base retrieval time (without reranking)
                base_retrieval_time = base_times["Hybrid"] * (count / current_count)
                # Add the constant reranking time
                reranking_time = base_times["Full Pipeline"] - base_times["Hybrid"]
                time_estimate = base_retrieval_time + reranking_time
            else:
                # For other methods, simply scale linearly
                time_estimate = base_times[method] * (count / current_count)
                
            row.append(f"{time_estimate:.4f}s")
            
        print(" | ".join(row).ljust(16))
    
    print("\nNote: This is a simplified linear estimate. In practice, retrieval systems use")
    print("optimization techniques like indexing and approximate nearest neighbor search")
    print("to achieve sub-linear scaling.")

# Get the base times from our previous comparison
base_times = {
    "BM25": comparison_results["BM25 Only"]["time"],
    "Vector": comparison_results["Vector Only"]["time"],
    "Hybrid": comparison_results["Hybrid"]["time"],
    "Full Pipeline": comparison_results["Full Pipeline"]["time"]
}

# Document counts to estimate for
doc_counts = [10, 100, 1000, 10000, 100000]

# Run the scaling analysis
estimate_scaling(base_times, doc_counts)


=== Latency Scaling Analysis ===
Estimating retrieval times for different collection sizes:

Document Count | BM25 | Vector | Hybrid | Full Pipeline
--------------------------------------------------------------------------------------------------------------
10 (current) | 0.0009s | 0.0132s | 0.0188s | 0.0717s
100 | 0.0087s | 0.1322s | 0.1876s | 0.2405s
1,000 | 0.0872s | 1.3222s | 1.8763s | 1.9292s
10,000 | 0.8717s | 13.2222s | 18.7628s | 18.8157s
100,000 | 8.7166s | 132.2222s | 187.6283s | 187.6812s

Note: This is a simplified linear estimate. In practice, retrieval systems use
optimization techniques like indexing and approximate nearest neighbor search
to achieve sub-linear scaling.


## Practical Considerations for Production

When implementing a retrieval pipeline in production, consider these factors:

### Performance Optimization

1. **Pre-compute and cache embeddings**: Generate embeddings at indexing time, not query time
2. **Use approximate nearest neighbor (ANN) search**: For vector search at scale (e.g., FAISS, Annoy, ScaNN)
3. **Optimize fetch_k and top_k**: More candidates improve quality but increase latency
4. **Batch processing**: Process multiple queries in parallel when possible
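
As a small illustration of the first point, the sketch below caches embeddings by a hash of the text so that each distinct text is embedded only once. `EmbeddingCache` and `fake_embed` are hypothetical names; `fake_embed` is a placeholder for a real embedding model call, and a production cache would typically live in a persistent store rather than an in-memory dictionary.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings so each distinct text is embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times the underlying model was called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Hypothetical stand-in for a real (slow) embedding model call.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
for text in ["BM25 is lexical", "vector search", "BM25 is lexical"]:
    cache.get(text)
print(cache.misses)  # 2
```

The repeated text is served from the cache, so the model is called twice for three lookups.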

### Model Selection

1. **Domain-specific embeddings**: Choose models trained on data similar to your domain
2. **Model size tradeoffs**: Larger models are more accurate but slower
3. **Distilled models**: Consider knowledge-distilled models for better speed/quality tradeoff
4. **Cross-encoder selection**: Test different cross-encoder models for your specific task

### Resource Allocation

1. **BM25 is CPU-bound**: Allocate sufficient memory for index, but CPU is the primary constraint
2. **Vector search benefits from GPU**: Especially for larger models and collections
3. **Cross-encoders are resource-intensive**: Use judiciously on a carefully filtered candidate set

### Adaptive Configuration

1. **Query-dependent weights**: Adjust fusion weights based on query type
2. **Different pipelines for different use cases**: Simple/fast pipeline for suggestions, complex/accurate for main search
3. **A/B testing**: Continuously test configuration changes with real users
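
One way to picture query-dependent weights is a simple rule that shifts weight toward BM25 when a query looks like an exact-match lookup. The heuristic below (quoted phrases, acronyms, identifiers with digits or underscores) and the 0.4/0.6 split are illustrative assumptions, not tuned values; `choose_fusion_weights` is a hypothetical helper whose return value has the same shape as the `weights` argument of `WeightedFusionRetriever`.

```python
import re

# Tokens that suggest an exact-match lookup: "k1", "BM25", "API", "top_k".
EXACT_MATCH_PATTERN = re.compile(r"[A-Za-z]+\d+|[A-Z]{2,}|_")


def choose_fusion_weights(query):
    """Return fusion weights, favoring BM25 for exact-match style queries."""
    has_quoted_phrase = '"' in query
    has_identifier = bool(EXACT_MATCH_PATTERN.search(query))
    if has_quoted_phrase or has_identifier:
        return {"vector": 0.4, "bm25": 0.6}
    # Default: the semantic-leaning weights used earlier in this notebook.
    return {"vector": 0.7, "bm25": 0.3}


print(choose_fusion_weights("How is Python used in machine learning?"))
print(choose_fusion_weights("BM25 k1 parameter"))
```

Any rule like this should be validated against real queries, which is where the A/B testing mentioned above comes in.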

## Summary

We've built a complete, configurable retrieval pipeline combining multiple techniques.

### Key Takeaways

1. **Multi-stage pipelines tend to work best** - Combining methods usually beats any single approach
2. **Configuration matters** - Different use cases need different pipeline settings
3. **Trade-offs are real** - Speed vs accuracy, choose based on requirements
4. **Modularity enables flexibility** - Build components that can be mixed and matched

### Pipeline Configurations

**BM25 Only:**
- Speed: Fastest (~1ms)
- Use when: Keyword search, large collections, speed critical
- Trade-off: Misses semantic relationships

**Vector Only:**
- Speed: Medium (~15ms for 10 docs)
- Use when: Semantic understanding needed
- Trade-off: Misses exact keyword matches

**Hybrid:**
- Speed: Medium (~20ms for 10 docs)
- Use when: General purpose, balanced needs
- Trade-off: Slight overhead over vector search alone; a good default

**Full Pipeline (Hybrid + Reranking):**
- Speed: Slower (~70ms for 10 docs)
- Use when: Quality critical, can tolerate latency
- Trade-off: 3-4x slower than Hybrid in these runs, but typically the most accurate

### Production Best Practices

**1. Pre-computation:**
- Generate embeddings offline during indexing
- Cache frequently accessed results
- Use batch processing where possible

**2. Optimization:**
- Use ANN (FAISS, HNSW) for vector search at scale
- Tune fetch_k and top_k parameters
- Consider GPU acceleration for embeddings

**3. Adaptive Configuration:**
- Simple queries → BM25 only
- Complex queries → Full pipeline
- A/B test configuration changes

**4. Resource Allocation:**
- BM25: CPU + memory for index
- Vector search: GPU beneficial for large collections
- Cross-encoder: Most resource-intensive, use carefully

### Scaling Considerations

| Collection Size | Recommended Approach |
|----------------|---------------------|
| <1K documents | Any method works well |
| 1K-10K | Hybrid without reranking |
| 10K-100K | ANN + selective reranking |
| 100K-1M+ | Distributed search, ANN, caching |

### Configuration Decision Tree

```
Is speed critical?
├─ YES → Use BM25 only
└─ NO → Need semantic understanding?
    ├─ YES → Collection size?
    │   ├─ <10K → Full pipeline
    │   └─ >10K → Hybrid + selective reranking
    └─ NO → Use BM25 + keyword optimization
```
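
The same tree can be written as a small function, which makes it easy to unit-test or to plug into a router. `recommend_configuration` is a hypothetical name, and the tree above does not say which branch a collection of exactly 10,000 documents falls into; this sketch assigns it to the larger-collection branch as an assumption.

```python
def recommend_configuration(speed_critical, needs_semantics, collection_size):
    """Mirror the decision tree above and return the recommended setup."""
    if speed_critical:
        return "BM25 only"
    if not needs_semantics:
        return "BM25 + keyword optimization"
    # Assumption: exactly 10,000 documents counts as the larger branch.
    if collection_size < 10_000:
        return "Full pipeline"
    return "Hybrid + selective reranking"


print(recommend_configuration(True, True, 500))
print(recommend_configuration(False, True, 500))
print(recommend_configuration(False, True, 50_000))
print(recommend_configuration(False, False, 50_000))
```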

### Next Steps for Production

1. **Monitoring** - Track latency, quality metrics
2. **Evaluation** - Test on real queries with ground truth
3. **Optimization** - Profile bottlenecks, optimize hot paths
4. **Scaling** - Implement distributed search, sharding
5. **Personalization** - Add user context, query understanding
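
For the evaluation step, two widely used ranking metrics are recall@k and mean reciprocal rank (MRR). The sketch below computes both from ranked document IDs and a set of relevant IDs per query. The IDs follow the `doc_{i}` pattern used in this notebook, but the relevance labels are invented for illustration; real ground truth has to come from human judgments or logged user behavior.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical labels: which documents a human judged relevant to each query.
runs = [
    (["doc_5", "doc_0", "doc_1"], {"doc_5", "doc_1"}),
    (["doc_3", "doc_2"], {"doc_2"}),
]
print(recall_at_k(runs[0][0], runs[0][1], k=2))  # 0.5
print(mean_reciprocal_rank(runs))  # 0.75
```

Running metrics like these over the same query set for each pipeline configuration turns the latency comparison in this notebook into a latency-versus-quality comparison.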

### Real-World Applications

- **Search engines** - Full pipeline for main results, BM25 for suggestions
- **RAG systems** - Hybrid for document retrieval + LLM generation
- **Recommendation** - Vector search with metadata filtering
- **Question answering** - Reranking critical for accuracy
- **E-commerce** - Hybrid with product-specific weights

This configurable pipeline approach provides the foundation for building production search systems that balance quality, speed, and resource constraints!