# Reranking: Enhancing Retrieval with Cross-Encoders

This notebook demonstrates two-stage retrieval with reranking for improved search accuracy.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand bi-encoder vs cross-encoder architectures
- Implement bi-encoder retrieval for initial candidate selection
- Build cross-encoder rerankers for accurate relevance scoring
- Compare performance and accuracy trade-offs
- Deploy efficient two-stage retrieval pipelines

## Two-Stage Retrieval

**Stage 1: Fast Retrieval (Bi-Encoder)**
- Encode queries and documents separately
- Pre-compute document embeddings
- Fast similarity search
- Returns candidate set (e.g., top 100)

**Stage 2: Accurate Reranking (Cross-Encoder)**
- Process query-document pairs together
- Capture complex interactions
- More accurate relevance scores
- Rerank the candidates and keep only the best (e.g., top 10)

**Why This Works:**
- Bi-encoders: Fast but less accurate
- Cross-encoders: Accurate but slow
- Combined: The bi-encoder cheaply narrows the collection, so the cross-encoder only has to score a small candidate set
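
The control flow of the two stages can be sketched without any model at all. In the minimal sketch below, `two_stage_search` is a hypothetical helper, and `word_overlap` and `bigram_overlap` are toy stand-ins for a bi-encoder and a cross-encoder (an order-insensitive score and an order-sensitive one). Only the structure reflects the pattern described above; the scoring functions are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    docs: List[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    n_candidates: int = 100,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: score every document with the cheap function, keep the best n_candidates
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: score only the surviving candidates with the expensive function
    rescored = [(d, expensive_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


def word_overlap(query: str, doc: str) -> float:
    # Toy "bi-encoder": ignores word order entirely
    return len(set(query.lower().split()) & set(doc.lower().split()))


def bigram_overlap(query: str, doc: str) -> float:
    # Toy "cross-encoder": sensitive to word order
    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))


toy_docs = [
    "learning machine repair",
    "machine learning with python",
    "cooking with garlic",
]
toy_results = two_stage_search(
    "machine learning", toy_docs, word_overlap, bigram_overlap, n_candidates=2, top_k=1
)
print(toy_results)
```

The cheap scorer cannot tell the first two documents apart, but it does discard the third; the more expensive scorer then separates the remaining two.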

## Understanding the Key Technologies

### Bi-Encoders vs Cross-Encoders

- **Bi-Encoders**:
  - Encode queries and documents separately
  - Allow pre-computation of document embeddings
  - Fast for initial retrieval across large collections
  - Less accurate, because the query and the document never see each other during encoding; relevance is reduced to the similarity of two independent vectors

- **Cross-Encoders**:
  - Process query and document pairs together
  - Capture complex interactions between query and document
  - More accurate for relevance assessment
  - Computationally expensive: every query-document pair needs its own model forward pass at query time, so nothing can be pre-computed
  - Best used for reranking a small set of candidates
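
The practical consequence of pre-computation is easiest to see by counting model calls. The sketch below uses two hypothetical counting stubs, `CountingEncoder` and `CountingPairScorer`, in place of real models; their scores are meaningless, and only the call counts matter.

```python
class CountingEncoder:
    """Toy stand-in for a bi-encoder: counts calls, returns a letter-count vector."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        return [text.lower().count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]


class CountingPairScorer:
    """Toy stand-in for a cross-encoder: counts calls, scores by shared words."""

    def __init__(self):
        self.calls = 0

    def score(self, query, doc):
        self.calls += 1
        return float(len(set(query.split()) & set(doc.split())))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


toy_corpus = ["python for data", "neural networks", "search engines", "vector search"]
toy_queries = ["python", "neural search", "vector data"]

# Bi-encoder pattern: documents are encoded once and reused for every query
bi_encoder = CountingEncoder()
doc_vectors = [bi_encoder.encode(d) for d in toy_corpus]
for q in toy_queries:
    q_vec = bi_encoder.encode(q)
    _ = [dot(q_vec, d_vec) for d_vec in doc_vectors]

# Cross-encoder pattern: every (query, document) pair is a separate model call
pair_scorer = CountingPairScorer()
for q in toy_queries:
    _ = [pair_scorer.score(q, d) for d in toy_corpus]

print(f"bi-encoder model calls:    {bi_encoder.calls}")
print(f"cross-encoder model calls: {pair_scorer.calls}")
```

With N documents and Q queries, the bi-encoder pattern needs N + Q model calls in total, while the cross-encoder pattern needs N × Q.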

## Setting Up Our Environment

First, let's import the necessary libraries:

In [1]:
from sentence_transformers import CrossEncoder, SentenceTransformer, util
import time

## Creating Sample Data

Let's create some sample documents to work with for our demonstration:

In [2]:
documents = [
    "Python is a high-level, interpreted programming language known for its readability.",
    "Machine learning is a subset of artificial intelligence that learns from data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Deep learning uses neural networks with many layers to extract features from data.",
    "Natural language processing helps computers understand human language.",
    "Python libraries like PyTorch and TensorFlow are used for deep learning.",
    "BM25 is a bag-of-words retrieval function used in information retrieval.",
    "Vector search finds documents by measuring similarity in embedding space.",
    "Reranking refines initial search results with a more complex model.",
    "Hybrid search combines multiple retrieval methods to improve search quality."
]

## Implementing a Bi-Encoder Retriever

We start with the first stage of the system, a bi-encoder model. It encodes documents and queries separately, so document embeddings can be computed once and reused for every query.

In [3]:
class BiEncoderRetriever:
    """Simple implementation of a Bi-Encoder retriever"""
    
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2", top_k=5):
        """Initialize the retriever with a pre-trained model"""
        print(f"Loading bi-encoder model: {model_name}")
        self.model = SentenceTransformer(model_name, device="cpu")
        self.top_k = top_k
        self.doc_embeddings = None
        self.documents = None
        
    def index_documents(self, documents):
        """Generate and store embeddings for all documents"""
        print(f"Indexing {len(documents)} documents...")
        self.documents = documents
        self.doc_embeddings = self.model.encode(documents)
        print(f"Created embeddings with {self.doc_embeddings.shape[1]} dimensions")
        
    def retrieve(self, query):
        """Retrieve top documents for a given query"""
        # Encode the query
        query_embedding = self.model.encode(query)
        
        # Calculate similarity scores
        scores = []
        for i, doc_embedding in enumerate(self.doc_embeddings):
            # Compute cosine similarity
            similarity = util.cos_sim(query_embedding, doc_embedding).item()
            scores.append((i, similarity, self.documents[i]))
        
        # Sort by similarity score (descending)
        scores.sort(key=lambda x: x[1], reverse=True)
        
        # Return top_k results
        return scores[:self.top_k]
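
For larger collections, the similarity between the query and all documents is usually computed as a single matrix operation, and only the top k scores are sorted. The following is a minimal sketch of that idea with NumPy; `top_k_cosine` is a hypothetical helper (not part of sentence-transformers), and random vectors stand in for real embeddings.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    # L2-normalise so that a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    # argpartition selects the k largest without sorting everything;
    # only those k entries are then put in descending order
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(1000, 384))  # assumed: 1000 documents, 384 dimensions
toy_query = toy_embeddings[42] + 0.01 * rng.normal(size=384)  # a near-copy of document 42

idx, sims = top_k_cosine(toy_query, toy_embeddings, k=5)
print(idx, sims)
```

If document embeddings are normalised once at indexing time, the per-query work reduces to one matrix-vector product plus the top-k selection.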

Let's initialize our bi-encoder retriever and index our documents:

In [4]:
# Initialize our retriever
retriever = BiEncoderRetriever(top_k=5)

# Index our documents
retriever.index_documents(documents)

Loading bi-encoder model: sentence-transformers/all-MiniLM-L6-v2
Indexing 10 documents...
Created embeddings with 384 dimensions


Now, let's test our bi-encoder retriever with a query:

In [5]:
# Define a query
query = "How is Python used in machine learning?"
print(f"Query: '{query}'\n")

# Perform retrieval
start_time = time.time()
bi_encoder_results = retriever.retrieve(query)
retrieval_time = time.time() - start_time

# Display results
print(f"Retrieved {len(bi_encoder_results)} documents in {retrieval_time:.4f}s")
for i, (doc_id, score, doc) in enumerate(bi_encoder_results, 1):
    print(f"{i}. Score: {score:.4f} - {doc}")

Query: 'How is Python used in machine learning?'

Retrieved 5 documents in 0.0156s
1. Score: 0.6194 - Python is a high-level, interpreted programming language known for its readability.
2. Score: 0.6004 - Python libraries like PyTorch and TensorFlow are used for deep learning.
3. Score: 0.5793 - Machine learning is a subset of artificial intelligence that learns from data.
4. Score: 0.4139 - Deep learning uses neural networks with many layers to extract features from data.
5. Score: 0.4065 - Natural language processing helps computers understand human language.


## Implementing a Cross-Encoder Reranker

Now, let's implement the second stage of our system using a cross-encoder model. This will process query-document pairs together to provide more accurate relevance scores.

In [6]:
class CrossEncoderReranker:
    """Simple implementation of a Cross-Encoder reranker"""
    
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=3):
        """Initialize with a pre-trained cross-encoder model"""
        print(f"Loading cross-encoder model: {model_name}")
        self.model = CrossEncoder(model_name, device="cpu")
        self.top_k = top_k
        
    def rerank(self, query, results):
        """Rerank results using cross-encoder model"""
        if not results:
            return []
        
        # Create query-document pairs
        query_doc_pairs = [(query, doc) for _, _, doc in results]
        
        # Get scores from cross-encoder
        rerank_scores = self.model.predict(query_doc_pairs)
        
        # Combine with original results
        reranked = [(results[i][0], float(score), results[i][2]) 
                   for i, score in enumerate(rerank_scores)]
        
        # Sort by new scores (descending)
        reranked.sort(key=lambda x: x[1], reverse=True)
        
        # Return top_k results
        return reranked[:self.top_k]

Now, let's initialize our cross-encoder reranker:

In [7]:
# Initialize our reranker
reranker = CrossEncoderReranker(top_k=3)

Loading cross-encoder model: cross-encoder/ms-marco-MiniLM-L-6-v2


## Applying Reranking to Our Results

Let's apply the cross-encoder reranker to the initial results from the bi-encoder retriever:

In [8]:
# Using the same query from before
print(f"Query: '{query}'\n")

# Apply reranking
start_time = time.time()
reranked_results = reranker.rerank(query, bi_encoder_results)
rerank_time = time.time() - start_time

# Display results
print(f"Reranked {len(reranked_results)} documents in {rerank_time:.4f}s")
for i, (doc_id, score, doc) in enumerate(reranked_results, 1):
    print(f"{i}. Score: {score:.4f} - {doc}")

Query: 'How is Python used in machine learning?'

Reranked 3 documents in 0.1398s
1. Score: 5.1547 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.4332 - Python is a high-level, interpreted programming language known for its readability.
3. Score: -2.5431 - Machine learning is a subset of artificial intelligence that learns from data.
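
The reranked scores are not confined to [0, 1] (one of them is negative), which suggests the model returns unbounded relevance scores rather than probabilities. If a bounded score is more convenient, for example for thresholding, a logistic sigmoid is one common mapping; because it is monotonic, it leaves the ranking unchanged. A minimal sketch using the three scores printed above (the result should not be read as a calibrated probability):

```python
import math


def sigmoid(x: float) -> float:
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


raw_scores = [5.1547, 0.4332, -2.5431]  # cross-encoder scores from the run above
bounded = [sigmoid(s) for s in raw_scores]

for raw, b in zip(raw_scores, bounded):
    print(f"raw={raw:8.4f} -> bounded={b:.4f}")
```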


## Comparison: Bi-Encoder vs Cross-Encoder Results

Let's compare how the rankings changed after applying the cross-encoder reranker:

In [9]:
print("BEFORE RERANKING (BI-ENCODER)")
for i, (doc_id, score, doc) in enumerate(bi_encoder_results[:3], 1):
    print(f"{i}. Score: {score:.4f} - {doc}")

print("\nAFTER RERANKING (CROSS-ENCODER)")
for i, (doc_id, score, doc) in enumerate(reranked_results, 1):
    print(f"{i}. Score: {score:.4f} - {doc}")

# Analyze changes in ranking
print("\nCHANGES IN RANKING:")
initial_top_docs = [doc for _, _, doc in bi_encoder_results[:3]]
reranked_top_docs = [doc for _, _, doc in reranked_results]

for i, doc in enumerate(reranked_top_docs, 1):
    if doc in initial_top_docs:
        old_rank = initial_top_docs.index(doc) + 1
        if old_rank != i:
            print(f"Document moved from position {old_rank} to {i}")
    else:
        print(f"New document at position {i} (wasn't in top 3 before)")

BEFORE RERANKING (BI-ENCODER)
1. Score: 0.6194 - Python is a high-level, interpreted programming language known for its readability.
2. Score: 0.6004 - Python libraries like PyTorch and TensorFlow are used for deep learning.
3. Score: 0.5793 - Machine learning is a subset of artificial intelligence that learns from data.

AFTER RERANKING (CROSS-ENCODER)
1. Score: 5.1547 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.4332 - Python is a high-level, interpreted programming language known for its readability.
3. Score: -2.5431 - Machine learning is a subset of artificial intelligence that learns from data.

CHANGES IN RANKING:
Document moved from position 2 to 1
Document moved from position 1 to 2
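
The comparison above matches documents by their text. A variant keyed on document IDs avoids ambiguity when two documents share the same text and also reports documents whose rank did not change. The sketch below uses a hypothetical helper, `rank_changes`; the ID lists are the document indices from the run above.

```python
def rank_changes(before_ids, after_ids):
    """Map each doc id in after_ids to (old_rank, new_rank).

    Ranks are 1-based; old_rank is None if the document was not in before_ids.
    """
    old_rank = {doc_id: rank for rank, doc_id in enumerate(before_ids, 1)}
    return {doc_id: (old_rank.get(doc_id), rank) for rank, doc_id in enumerate(after_ids, 1)}


before = [0, 5, 1, 3, 4]  # bi-encoder order (indices into `documents`)
after = [5, 0, 1]         # cross-encoder order

changes = rank_changes(before, after)
for doc_id, (old, new) in changes.items():
    status = "unchanged" if old == new else f"moved from {old} to {new}"
    print(f"doc {doc_id}: {status}")
```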


## Deep Dive: Why Cross-Encoders Are More Accurate

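Let's directly compare how bi-encoders and cross-encoders score the same document. The two scores are on different scales (a cosine similarity versus an unbounded relevance score), so they cannot be compared numerically; what matters is the ordering each model produces across documents.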

In [10]:
# Select a document for our comparison
test_doc = "Python libraries like PyTorch and TensorFlow are used for deep learning."
print(f"Query: '{query}'")
print(f"Document: '{test_doc}'\n")

# Calculate bi-encoder similarity
query_emb = retriever.model.encode(query)
doc_emb = retriever.model.encode(test_doc)
bi_sim = util.cos_sim(query_emb, doc_emb).item()

# Get cross-encoder score
cross_score = reranker.model.predict([(query, test_doc)])

print("BI-ENCODER")
print(f"Similarity score: {bi_sim:.4f}")
print("The bi-encoder encodes query and document separately,")
print("then calculates similarity between these independent representations.\n")

print("CROSS-ENCODER")
print(f"Relevance score: {float(cross_score[0]):.4f}")
print("The cross-encoder processes the query and document together,")
print("allowing it to capture complex interactions between terms.")

Query: 'How is Python used in machine learning?'
Document: 'Python libraries like PyTorch and TensorFlow are used for deep learning.'

BI-ENCODER
Similarity score: 0.6004
The bi-encoder encodes query and document separately,
then calculates similarity between these independent representations.

CROSS-ENCODER
Relevance score: 5.1547
The cross-encoder processes the query and document together,
allowing it to capture complex interactions between terms.


## Performance Comparison

Let's measure the performance difference between the two approaches:

In [11]:
# Define a function to time retrieval operations
def time_retrieval(retriever_func, n_runs=10):
    """Return the mean wall-clock time of retriever_func over n_runs calls"""
    times = []
    for _ in range(n_runs):
        # perf_counter is a monotonic, high-resolution clock intended for timing
        start = time.perf_counter()
        _ = retriever_func()
        times.append(time.perf_counter() - start)

    return sum(times) / len(times)

# Time bi-encoder retrieval
bi_encoder_time = time_retrieval(lambda: retriever.retrieve(query))

# Time cross-encoder reranking (on top of bi-encoder)
def full_retrieval():
    initial_results = retriever.retrieve(query)
    _ = reranker.rerank(query, initial_results)
    
full_pipeline_time = time_retrieval(full_retrieval)
reranking_overhead = full_pipeline_time - bi_encoder_time

print("PERFORMANCE COMPARISON")
print(f"Bi-encoder retrieval time: {bi_encoder_time:.4f}s")
print(f"Full pipeline (retrieval + reranking) time: {full_pipeline_time:.4f}s")
print(f"Reranking overhead: {reranking_overhead:.4f}s")
print(f"Percentage increase: {(reranking_overhead/bi_encoder_time)*100:.1f}%")

PERFORMANCE COMPARISON
Bi-encoder retrieval time: 0.0113s
Full pipeline (retrieval + reranking) time: 0.0375s
Reranking overhead: 0.0262s
Percentage increase: 231.5%
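
A mean over a handful of runs can be skewed by warm-up effects and outliers. One slightly more robust variant, sketched below under the assumption that a few discarded warm-up calls are acceptable, reports the median alongside the minimum and maximum. `benchmark` is a hypothetical helper, and the timed lambda is a placeholder workload.

```python
import statistics
import time


def benchmark(func, n_runs=10, n_warmup=2):
    """Time func with warm-up runs discarded; return median, min and max in seconds."""
    for _ in range(n_warmup):
        func()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Placeholder workload; in this notebook it could be
# benchmark(lambda: retriever.retrieve(query))
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```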


## What If We Used Only Cross-Encoders?

Let's simulate what would happen if we tried to use cross-encoders for the initial retrieval on all documents:

In [12]:
# Function to retrieve using only cross-encoder (brute force)
def cross_encoder_only_retrieval(query, documents, top_k=3):
    # Create query-document pairs for all documents
    query_doc_pairs = [(query, doc) for doc in documents]
    
    # Get scores from cross-encoder
    start_time = time.time()
    scores = reranker.model.predict(query_doc_pairs)
    retrieval_time = time.time() - start_time
    
    # Combine with documents
    results = [(i, float(score), doc) 
              for i, (score, doc) in enumerate(zip(scores, documents))]
    
    # Sort by scores (descending)
    results.sort(key=lambda x: x[1], reverse=True)
    
    return results[:top_k], retrieval_time

# Run cross-encoder-only retrieval
cross_only_results, cross_only_time = cross_encoder_only_retrieval(query, documents)

print("CROSS-ENCODER ONLY APPROACH")
print(f"Retrieved {len(cross_only_results)} documents in {cross_only_time:.4f}s")
for i, (doc_id, score, doc) in enumerate(cross_only_results, 1):
    print(f"{i}. Score: {score:.4f} - {doc}")

print("\nPERFORMANCE COMPARISON")
print(f"Bi-encoder retrieval time: {bi_encoder_time:.4f}s")
print(f"Two-stage pipeline time: {full_pipeline_time:.4f}s")
print(f"Cross-encoder only time: {cross_only_time:.4f}s")
print(f"Cross-encoder is {cross_only_time/bi_encoder_time:.1f}x slower than bi-encoder")

# Calculate how this would scale
print("\nSCALING ANALYSIS")
doc_counts = [100, 1000, 10000, 100000]
print("Estimated retrieval times for different collection sizes:")
print("Doc Count | Bi-encoder | Cross-encoder | Speedup")
print("-" * 55)
for count in doc_counts:
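    # Simplification: scale both timings linearly with document count.
    # This overstates the bi-encoder cost, because its measured time includes
    # encoding the query, which does not grow with the collection size.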
    bi_time = bi_encoder_time * (count / len(documents))
    cross_time = cross_only_time * (count / len(documents))
    print(f"{count:8d} | {bi_time:.4f}s     | {cross_time:.4f}s      | {cross_time/bi_time:.1f}x")

CROSS-ENCODER ONLY APPROACH
Retrieved 3 documents in 0.0796s
1. Score: 5.1547 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.4332 - Python is a high-level, interpreted programming language known for its readability.
3. Score: -2.5431 - Machine learning is a subset of artificial intelligence that learns from data.

PERFORMANCE COMPARISON
Bi-encoder retrieval time: 0.0113s
Two-stage pipeline time: 0.0375s
Cross-encoder only time: 0.0796s
Cross-encoder is 7.0x slower than bi-encoder

SCALING ANALYSIS
Estimated retrieval times for different collection sizes:
Doc Count | Bi-encoder | Cross-encoder | Speedup
-------------------------------------------------------
     100 | 0.1132s     | 0.7958s      | 7.0x
    1000 | 1.1317s     | 7.9581s      | 7.0x
   10000 | 11.3173s     | 79.5805s      | 7.0x
  100000 | 113.1728s     | 795.8055s      | 7.0x
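
The table extrapolates both timings linearly, yet the bi-encoder time measured above includes encoding the query, which stays constant as the collection grows. A rough cost model that separates fixed and per-document costs is sketched below. The per-pair cross-encoder cost is derived from the run above (0.0796 s for 10 pairs); the query-encoding cost and the per-document similarity cost are illustrative assumptions, not measurements.

```python
def cross_only_cost(n_docs, t_pair):
    """Cross-encoder only: one forward pass per document."""
    return n_docs * t_pair


def two_stage_cost(n_docs, n_candidates, t_query, t_sim, t_pair):
    """Two-stage: encode the query once, compare against all documents, rerank the candidates."""
    return t_query + n_docs * t_sim + min(n_candidates, n_docs) * t_pair


T_PAIR = 0.0796 / 10  # seconds per (query, document) pair, from the run above
T_QUERY = 0.01        # assumed: seconds to encode one query
T_SIM = 1e-7          # assumed: seconds per vector similarity

print("Doc Count | Two-stage | Cross-encoder only")
for n in (100, 1_000, 10_000, 100_000):
    two = two_stage_cost(n, 100, T_QUERY, T_SIM, T_PAIR)
    cross = cross_only_cost(n, T_PAIR)
    print(f"{n:9d} | {two:8.3f}s | {cross:9.3f}s")
```

Under these assumptions the two-stage cost is dominated by reranking a fixed number of candidates, so it stays nearly flat while the cross-encoder-only cost grows with the collection.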


## Summary

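We've demonstrated two-stage retrieval with reranking on a ten-document collection: the cross-encoder moved the document about Python deep learning libraries from second to first place, while the full pipeline took about three times as long as bi-encoder retrieval alone (0.0375 s vs 0.0113 s).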

### Key Takeaways

1. **Bi-encoders are fast** - Process queries and documents independently, enable pre-computation
2. **Cross-encoders are accurate** - Process pairs together, capture interactions
3. **Reranking combines both** - Fast initial retrieval + accurate final ranking
4. **Scalability matters** - A cross-encoder needs one forward pass per document per query, so using it alone doesn't scale to large collections

### Architecture Comparison

| Feature | Bi-Encoder | Cross-Encoder |
|---------|------------|---------------|
| Speed | Fast (pre-compute) | Slow (compute per query) |
| Accuracy | Good | Excellent |
| Scalability | Millions of docs | Hundreds of docs |
| Use case | Initial retrieval | Reranking |

### Performance Trade-offs

**Bi-Encoder Only:**
- Pros: Very fast, scales well
- Cons: Lower accuracy
- Use when: Speed critical, large collections

**Cross-Encoder Only:**
- Pros: Highest accuracy
- Cons: Too slow for large collections
- Use when: Small collections (<1000 docs)

**Two-Stage (Recommended):**
- Pros: Fast + accurate, best overall
- Cons: More complex to implement
- Use when: Production systems, large collections

### Production Implementation

**Stage 1 - Bi-Encoder Retrieval:**
```python
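from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pre-compute embeddings offline
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Fast retrieval at query time
query_emb = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_embeddings, top_k=100)[0]
candidates = [documents[hit["corpus_id"]] for hit in hits]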
```

**Stage 2 - Cross-Encoder Reranking:**
```python
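from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Accurate reranking of candidates (a list of document strings from stage 1)
pairs = [(query, doc) for doc in candidates]
scores = cross_encoder.predict(pairs)
final_results = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:10]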
```

### Scaling Considerations

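With a fixed candidate set of 100, stage 2 scores 100 pairs per query regardless of collection size, while a cross-encoder-only search scores every document:

**10K documents:** Cross-encoder only needs ~100x more cross-encoder forward passes than two-stage  
**100K documents:** ~1,000x more forward passes  
**1M+ documents:** Cross-encoder only becomes impractical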

### Next Steps

- Implement in production with caching
- Monitor accuracy vs speed trade-offs
- Consider learned sparse retrievers for stage 1
- Explore distillation to compress cross-encoders
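
As a starting point for the caching item above, repeated (query, document) pairs can be served from memory instead of being rescored. The sketch below wraps an arbitrary pair-scoring function with `functools.lru_cache`; `make_cached_scorer` and `slow_score` are hypothetical names, and `slow_score` is a toy stand-in for a cross-encoder call.

```python
from functools import lru_cache


def make_cached_scorer(score_pair, maxsize=10_000):
    """Wrap score_pair(query, doc) so that repeated pairs are served from an LRU cache."""

    @lru_cache(maxsize=maxsize)
    def cached(query: str, doc: str) -> float:
        return score_pair(query, doc)

    return cached


calls = []


def slow_score(query, doc):
    # Toy stand-in for an expensive model call; records each invocation
    calls.append((query, doc))
    return float(len(set(query.split()) & set(doc.split())))


cached_score = make_cached_scorer(slow_score)
first = cached_score("python ml", "python libraries for ml")
second = cached_score("python ml", "python libraries for ml")  # served from the cache
print(first, second, len(calls), cached_score.cache_info())
```

With a real cross-encoder it would likely be more efficient to collect the cache misses and score them in one batched call rather than pair by pair; this sketch only shows the caching idea.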