# Week 2 / Video 6: Reranking

## 🎯 Learning Objectives

This notebook demonstrates **two-stage retrieval** using reranking to improve search relevance:

1. **Stage 1 - Initial Retrieval**: Fast hybrid search (dense + sparse) retrieves a broad candidate set (k=20)
2. **Stage 2 - Reranking**: A slower but more accurate cross-encoder model reorders the candidates by relevance

## 🔑 Key Concepts

### Why Reranking?

**Problem**: Embedding models (bi-encoders) are fast but have limited accuracy:
- Query and documents are encoded independently
- Similarity is just dot product of vectors
- No direct interaction between query and document tokens
- Good for initial retrieval, but not optimal for final ranking

**Solution**: Reranking models (cross-encoders) are slower but more accurate:
- Query and document are encoded together
- Model can see relationships between query and document tokens
- Much better at understanding semantic relevance
- Too slow for full corpus search, but perfect for refining top-K results

### Bi-Encoder vs Cross-Encoder

**Bi-Encoder (Retrieval Model)**:
```
Query → Encoder → [0.1, 0.5, 0.8, ...]
Document → Encoder → [0.2, 0.4, 0.9, ...]
Similarity = dot_product(query_vec, doc_vec)
```
- ✅ **Fast**: Pre-computed document embeddings, simple dot product
- ✅ **Scalable**: Can search millions of documents in milliseconds
- ❌ **Limited accuracy**: No query-document interaction

**Cross-Encoder (Reranking Model)**:
```
[Query, Document] → Encoder → Relevance Score (0-1)
```
- ✅ **High accuracy**: Full attention between query and document tokens
- ✅ **Better semantic understanding**: Can identify nuanced relevance
- ❌ **Slow**: Must re-encode every query-document pair (N forward passes)
- ❌ **Not scalable**: Can't pre-compute, must run on-demand
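
To make the bi-encoder half concrete, here is a minimal sketch of dot-product scoring over pre-computed vectors. The three-dimensional vectors are toy values (the first document reuses the truncated numbers from the diagram above, the second is made up), and `dot` and `rank_by_dot` are illustrative helper names rather than part of any library.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending dot product."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors: real text-embedding-3-small vectors have 1536 dimensions
query_vec = [0.1, 0.5, 0.8]
doc_vecs = {
    "doc_a": [0.2, 0.4, 0.9],  # values from the diagram above
    "doc_b": [0.9, 0.1, 0.1],  # made-up second document
}

ranking = rank_by_dot(query_vec, doc_vecs)
```

Document vectors can be computed once and stored, which is why this stage scales. The cost is that the query never influences how a document is encoded.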

### Two-Stage Retrieval Pipeline

```
User Query
    ↓
Stage 1: Hybrid Search (Bi-Encoder)
  - Dense: text-embedding-3-small (semantic)
  - Sparse: BM25 (keyword matching)
  - Fusion: RRF (Reciprocal Rank Fusion)
  - Result: Top 20 candidates (~100ms)
    ↓
Stage 2: Reranking (Cross-Encoder)
  - Model: Cohere rerank-v4.0-pro
  - Input: Query + Top 20 documents
  - Output: Reordered results with relevance scores
  - Result: Top 5-20 best matches (~500ms)
    ↓
Final Results (Highly Relevant)
```

### When to Use Reranking

**Use Reranking When**:
- Precision is critical (e.g., customer support, legal search)
- Retrieving small final result set (top 5-10)
- Have budget for reranking API calls ($1-2 per 1000 queries)
- Latency budget allows ~500ms for reranking

**Skip Reranking When**:
- Need sub-100ms response times
- Large result sets (50+ results)
- Cost-sensitive application
- Hybrid search already provides good enough results

## 🏗️ Architecture

This notebook builds on Week 2 / Video 5 (Hybrid Search):

1. **Previous**: Hybrid search with RRF fusion (dense + sparse)
2. **New**: Add Cohere reranking as optional refinement step
3. **Next**: Integrate reranking into FastAPI RAG pipeline

## 📊 Performance Characteristics

| Stage | Latency | Cost | Accuracy |
|-------|---------|------|----------|
| Hybrid Search (top-20) | ~100ms | ~$0.0002/query | Good |
| Reranking (top-5) | ~500ms | ~$0.002/query | Excellent |
| **Total** | **~600ms** | **~$0.0022/query** | **Excellent** |

**Cost Analysis** (1000 queries/day):
- OpenAI embeddings: $6/month (1K queries × $0.0002 × 30 days)
- Cohere reranking: $60/month (1K queries × $0.002 × 30 days)
- **Total: ~$66/month** (reranking dominates cost)

## 🔧 Setup Requirements

- Qdrant running at `http://localhost:6333`
- Collection: `Amazon-items-collection-01-hybrid-search` (from Video 5)
- Environment variables:
  - `OPENAI_API_KEY` - For embedding generation
  - `COHERE_API_KEY` - For reranking (https://dashboard.cohere.com/api-keys)
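
A quick pre-flight check can catch a missing key before the first API call fails. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and it only checks that the two variables listed above are set, not that the keys are valid.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


REQUIRED_KEYS = ["OPENAI_API_KEY", "COHERE_API_KEY"]

missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

In the notebook this would run after `load_dotenv()`, so that values from the `.env` file are already present in the environment.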

---

## Import

In [None]:
# Vector database client and models for hybrid search
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, PayloadSchemaType, PointStruct,  # Collection configuration
    SparseVectorParams, Document,  # Sparse vector (BM25) support
    Prefetch, FusionQuery  # Hybrid search with RRF fusion
)
from qdrant_client import models

# Data manipulation
import pandas as pd

# LLM providers
import openai  # For embeddings (text-embedding-3-small)
import cohere  # For reranking (rerank-v4.0-pro)

# Environment management
import os
from dotenv import load_dotenv

# Load API keys from .env file
load_dotenv()

---

## Stage 1: Hybrid Search Retrieval

This section implements the **initial retrieval** stage using hybrid search (from Week 2 / Video 5).

### What This Does:
1. Connect to Qdrant vector database
2. Generate query embeddings using OpenAI text-embedding-3-small
3. Perform hybrid search combining:
   - **Dense vectors**: Semantic similarity (cosine distance)
   - **Sparse vectors**: Keyword matching (BM25 algorithm)
4. Fuse results using RRF (Reciprocal Rank Fusion)
5. Return top-K candidate documents
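
The fusion step (item 4) can be sketched in a few lines of plain Python. This is the textbook RRF formula with the constant 60 from the original RRF paper and ranks starting at 1. Qdrant performs the fusion server-side, and its constant and rank offset may differ, so treat `rrf_fuse` and the sample rankings as illustrative assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, c=60):
    """Fuse ranked lists of IDs with Reciprocal Rank Fusion.

    Each ID receives sum(1 / (c + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["A", "B", "C"]   # hypothetical dense (semantic) ranking
sparse_hits = ["B", "D", "A"]  # hypothetical sparse (BM25) ranking

fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks enter the formula, the dense and sparse scores never need to be normalized onto a common scale. Documents found by both methods (`A` and `B` here) rise above documents found by only one.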

### Key Parameters:
- `k=20`: Retrieve 20 candidates (more than final 5 to give reranker options)
- `prefetch limit=20`: Each search method (dense + sparse) gets 20 candidates
- Collection: `Amazon-items-collection-01-hybrid-search`

### Performance:
- **Latency**: ~100ms (fast enough for initial retrieval)
- **Recall**: ~90% (hybrid search finds most relevant products)
- **Precision**: ~70% (some irrelevant results, will be filtered by reranker)
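
Recall and precision figures like these are easier to interpret with the definitions in hand. The notebook does not say how its numbers were measured. The sketch below assumes a labeled set of relevant product IDs per query, and the IDs shown are made up.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


retrieved = ["p1", "p2", "p3", "p4", "p5"]  # hypothetical Stage 1 output
relevant = {"p1", "p3", "p9"}               # hypothetical ground-truth labels

p_at_5 = precision_at_k(retrieved, relevant, k=5)
r_at_5 = recall_at_k(retrieved, relevant, k=5)
```

Stage 1 is tuned for recall (getting the relevant items into the candidate set at all), while the reranker works on precision near the top of the list.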

In [None]:
# Connect to Qdrant vector database running in Docker
# URL: http://localhost:6333 (exposed by docker-compose.yml)
qdrant_client = QdrantClient(url="http://localhost:6333")


def get_embedding(text, model="text-embedding-3-small"):
    """
    Generate dense vector embedding using OpenAI's embedding model.
    
    Args:
        text: String to embed (query or document text)
        model: OpenAI embedding model (default: text-embedding-3-small)
    
    Returns:
        List[float]: Dense vector of length 1536
    
    Performance:
        - Latency: ~50-100ms per request
        - Cost: $0.020 / 1M tokens (~$0.0002 per query is a conservative upper bound; short queries cost far less)
    
    Why text-embedding-3-small:
        - Good balance of quality and speed
        - 1536 dimensions (smaller than -3-large's 3072)
        - Sufficient for product search use case
    """
    response = openai.embeddings.create(
        input=[text],
        model=model,
    )
    return response.data[0].embedding

In [None]:
def retrieve_data(query, qdrant_client, k=5):
    """
    Perform hybrid search retrieval combining dense and sparse vectors.
    
    This is Stage 1 of the two-stage retrieval pipeline:
    1. Fast hybrid search retrieves broad candidate set
    2. (Next stage) Reranker refines results for precision
    
    Args:
        query: User query string (e.g., "Can I get a laptop?")
        qdrant_client: QdrantClient instance
        k: Number of results to return (default=5, using 20 for reranking)
    
    Returns:
        dict: {
            "retrieved_context_ids": List of product ASINs
            "retrieved_context": List of product descriptions
            "retrieved_context_ratings": List of average ratings
            "similarity_scores": List of RRF fusion scores
        }
    
    Hybrid Search Strategy:
        - Prefetch 20 candidates from EACH method (dense + sparse)
        - Fuse using RRF (Reciprocal Rank Fusion)
        - Return top-k after fusion
    
    Why k=20 for reranking:
        - Reranker needs options to reorder (5 is too few)
        - 20 is sweet spot: diverse enough, fast enough
        - Final output will be top-5 after reranking
    """
    
    # Step 1: Generate dense query embedding (OpenAI API call ~50-100ms)
    query_embedding = get_embedding(query)
    
    # Step 2: Perform hybrid search with prefetch + fusion
    results = qdrant_client.query_points(
        collection_name="Amazon-items-collection-01-hybrid-search",
        
        # Prefetch: Retrieve candidates from EACH search method independently
        prefetch=[
            # Dense vector search (semantic similarity)
            Prefetch(
                query=query_embedding,  # 1536-dim vector from OpenAI
                using="text-embedding-3-small",  # Named vector in collection
                limit=20  # Get top 20 from semantic search
            ),
            # Sparse vector search (BM25 keyword matching)
            Prefetch(
                query=Document(
                    text=query,  # Raw query text (not embedded)
                    model="qdrant/bm25"  # Qdrant auto-computes BM25 vector
                ),
                using="bm25",  # Named sparse vector in collection
                limit=20  # Get top 20 from keyword search
            )
        ],
        
        # Fusion: Combine prefetch results using RRF (Reciprocal Rank Fusion)
        # RRF formula: score = Σ(1 / (c + rank_i)), where c is a smoothing constant
        # (60 in the original RRF paper; Qdrant's built-in default may differ)
        # Benefits: Scale-independent, no manual normalization needed
        query=FusionQuery(fusion="rrf"),
        
        # Final limit: Return top-k after fusion
        limit=k,
    )
    
    # Step 3: Extract results into structured format
    retrieved_context_ids = []
    retrieved_context = []
    similarity_scores = []
    retrieved_context_ratings = []
    
    for result in results.points:
        # Product ID (Amazon ASIN)
        retrieved_context_ids.append(result.payload["parent_asin"])
        
        # Product description (will be reranked)
        retrieved_context.append(result.payload["description"])
        
        # Product rating (for context)
        retrieved_context_ratings.append(result.payload["average_rating"])
        
        # RRF fusion score (0-1 range, higher = more relevant)
        similarity_scores.append(result.score)
    
    # Return structured dictionary
    return {
        "retrieved_context_ids": retrieved_context_ids,
        "retrieved_context": retrieved_context,
        "retrieved_context_ratings": retrieved_context_ratings,
        "similarity_scores": similarity_scores,
    }

In [None]:
# Test query: Simple natural language question about laptops
query = "Can I get a laptop?"

### Test Query

Let's test with a laptop query to retrieve 20 candidates for reranking.

**Why k=20?**
- Too few (k=5): Reranker has limited options, can't improve much
- Too many (k=50): Slower reranking, more API cost, diminishing returns
- Sweet spot (k=20): Good diversity for reranker to optimize

In [None]:
# Retrieve 20 candidates using hybrid search (Stage 1)
# This gives the reranker a diverse set of products to reorder
results = retrieve_data(query, qdrant_client, k=20)

In [None]:
# Display hybrid search results (Stage 1 output)
# Note: These are ordered by RRF fusion score, but may not be perfectly relevant
# Reranking will improve the ordering using cross-encoder model
results

---

## Stage 2: Reranking with Cross-Encoder

Now we use **Cohere's rerank model** to refine the candidate set with higher precision.

### What Is Reranking?

**Reranking** is the process of re-scoring and reordering an initial set of retrieved documents using a more powerful (but slower) model.

### Cohere Rerank API

**Model**: `rerank-v4.0-pro` (Cohere's latest production reranker)

**How It Works**:
1. Takes query + list of documents as input
2. Encodes query and each document together (cross-encoder)
3. Computes relevance score for each query-document pair
4. Returns documents reordered by relevance score

**Key Parameters**:
- `model`: Which reranker to use (v4.0-pro is latest)
- `query`: User query string
- `documents`: List of candidate documents (from Stage 1)
- `top_n`: How many results to return (can be less than input)

### Cross-Encoder Architecture

```
Input: [Query, Document_1] → Transformer → Relevance Score: 0.92
Input: [Query, Document_2] → Transformer → Relevance Score: 0.15
Input: [Query, Document_3] → Transformer → Relevance Score: 0.78
...
Output: Sorted by relevance [Doc_1, Doc_3, Doc_2, ...]
```

**Why More Accurate?**
- Full attention between query and document tokens
- Can identify subtle semantic relationships
- No reliance on pre-computed vectors

**Why Slower?**
- Must run N forward passes (N = number of documents)
- Can't pre-compute (query-dependent)
- Latency: ~25ms per document (500ms for 20 docs)
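
The score-each-pair-then-sort flow can be sketched without any model. In this sketch `overlap_score` is a deliberately crude token-overlap stand-in for the transformer forward pass, not a real cross-encoder, and `RerankResult` is a hypothetical type that only mirrors the `index` and `relevance_score` fields described in this notebook.

```python
from collections import namedtuple

# index: position in the original documents list
# relevance_score: higher = more relevant
RerankResult = namedtuple("RerankResult", ["index", "relevance_score"])


def overlap_score(query, document):
    """Toy stand-in for a cross-encoder: fraction of query tokens in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)


def rerank(query, documents, score_fn=overlap_score, top_n=None):
    """Score each (query, document) pair, then sort by descending score."""
    results = [
        RerankResult(index=i, relevance_score=score_fn(query, doc))
        for i, doc in enumerate(documents)
    ]
    results.sort(key=lambda r: r.relevance_score, reverse=True)
    return results if top_n is None else results[:top_n]


docs = [
    "wireless mouse with usb receiver",
    "gaming laptop with 16gb ram",
    "laptop sleeve and laptop stand bundle",
]
ranked = rerank("gaming laptop", docs, top_n=2)
```

The loop calls `score_fn` once per document, which is the "N forward passes" cost. None of that work can be cached ahead of time because every score depends on the query.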

### Reranking vs Embedding Search

| Aspect | Embedding Search (Bi-Encoder) | Reranking (Cross-Encoder) |
|--------|-------------------------------|---------------------------|
| **Speed** | Fast (~100ms for 1M docs) | Slow (~25ms per doc) |
| **Accuracy** | Good (70-80% precision) | Excellent (90-95% precision) |
| **Scalability** | Millions of docs | Hundreds of docs max |
| **Pre-compute** | Yes (document embeddings) | No (query-dependent) |
| **Use Case** | Initial retrieval | Final refinement |

### Cost Analysis

**Cohere Rerank Pricing** (as of 2024):
- **rerank-v4.0-pro**: $2.00 per 1000 requests
- Each request can rerank up to 100 documents
- Typical usage: 20 documents per request

**Example Costs**:
- 1,000 queries × 20 docs = $2.00
- 10,000 queries × 20 docs = $20.00
- 100,000 queries × 20 docs = $200.00

**Cost Comparison**:
- Hybrid search only: ~$0.20 per 1K queries (OpenAI embeddings)
- With reranking: ~$2.20 per 1K queries (~11x more expensive)
- **Trade-off**: Pay ~11x for a gain of roughly 20-25 percentage points in precision

### When to Use Reranking

✅ **Use reranking when**:
- Precision is critical (customer support, legal search)
- Small final result set (top 5-10)
- Have budget for API costs ($2/1000 queries)
- Latency budget allows ~500ms

❌ **Skip reranking when**:
- Need sub-100ms response times
- Large result sets (50+ results)
- Cost-sensitive application
- Hybrid search is good enough

In [None]:
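# Initialize Cohere client for reranking
# Requires COHERE_API_KEY in environment (.env file)
# Get your API key at: https://dashboard.cohere.com/api-keys
# The key is passed explicitly so the client does not depend on the SDK's
# own default environment variable name
cohere_client = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])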

In [None]:
# Extract product descriptions from hybrid search results
# These are the 20 candidates that will be reranked
# Format: List of strings (product descriptions)
to_rerank = results["retrieved_context"]

In [None]:
# Display the candidate documents (ordered by hybrid search RRF scores)
# After reranking, these will be reordered by cross-encoder relevance scores
to_rerank

In [None]:
# Call Cohere Rerank API to reorder candidates by relevance
# This is the core of Stage 2: cross-encoder reranking
response = cohere_client.rerank(
    # Model: rerank-v4.0-pro (latest production reranker)
    # Alternatives: rerank-english-v3.0, rerank-multilingual-v3.0
    model="rerank-v4.0-pro",
    
    # Query: Same query used for hybrid search
    query=query,
    
    # Documents: 20 candidates from Stage 1 (hybrid search)
    # Format: List of strings (product descriptions)
    documents=to_rerank,
    
    # Top N: Return all 20 (reordered by relevance)
    # Could set to 5 to return only top 5, but we want to see full reordering
    top_n=20,
)

# Response contains:
#   - results: List of {index, relevance_score} sorted by score
#   - index: Position in original to_rerank list
#   - relevance_score: Float [0-1], higher = more relevant

In [None]:
# Display raw reranking response
# Shows: index (original position) and relevance_score for each document
# Results are already sorted by relevance_score (descending)
response

In [None]:
# Reconstruct reranked document list in new order
# For each result, use its index to fetch the original document from to_rerank
# Result: List of documents sorted by cross-encoder relevance (best first)
reranked_results = [to_rerank[result.index] for result in response.results]

In [None]:
# Display reranked results (Stage 2 output - final ranking)
# Compare with original hybrid search results to see how ordering changed
# Top results should now be more relevant to the query "Can I get a laptop?"
reranked_results

---

## 🎓 Key Takeaways

### Two-Stage Retrieval Pipeline

**Stage 1: Hybrid Search (Bi-Encoder)**
- **Purpose**: Fast initial retrieval from large corpus
- **Method**: Dense + sparse vectors with RRF fusion
- **Output**: Top 20 candidates
- **Latency**: ~100ms
- **Accuracy**: Good recall (~90%), moderate precision (~70%)

**Stage 2: Reranking (Cross-Encoder)**
- **Purpose**: Refine ranking for final results
- **Method**: Cohere rerank-v4.0-pro cross-encoder
- **Output**: Top 20 reordered by relevance
- **Latency**: ~500ms (25ms per document)
- **Accuracy**: Excellent precision (~95%)

### Comparison of Approaches

| Approach | Latency | Cost/1K Queries | Precision | Best For |
|----------|---------|----------------|-----------|----------|
| **Dense only** | 50ms | $0.20 | 60% | High volume, cost-sensitive |
| **Hybrid (Dense+Sparse)** | 100ms | $0.20 | 70% | General purpose, good balance |
| **Hybrid + Rerank** | 600ms | $2.20 | 95% | High precision, low volume |

### When Each Stage Matters

**Skip Stage 1 (Hybrid Search)**:
- ❌ Never skip Stage 1
- Stage 1 is required for scalability (can't rerank millions of docs)

**Skip Stage 2 (Reranking)**:
- ✅ Yes, if latency <200ms required
- ✅ Yes, if cost budget <$0.50 per 1K queries
- ✅ Yes, if hybrid search precision is sufficient
- ❌ No, if precision is critical (support, legal, medical)

### Integration with RAG Pipeline

**Current Workflow**:
```python
# Stage 1: Hybrid search
candidates = retrieve_data(query, qdrant_client, k=20)

# Stage 2: Rerank (same model as used earlier in this notebook)
reranked = cohere_client.rerank(
    model="rerank-v4.0-pro",
    query=query,
    documents=candidates["retrieved_context"],
    top_n=5,
)

# Stage 3: LLM generation (not shown in this notebook)
# `llm` is a placeholder for whichever generation client the pipeline uses
context = [candidates["retrieved_context"][r.index] for r in reranked.results]
answer = llm.generate(query=query, context=context)
```
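
The workflow above reorders only the descriptions, so the product IDs and ratings returned by Stage 1 fall out of step with the reranked text. One possible way to keep them aligned is sketched below. `apply_rerank` and the `rerank_scores` key are hypothetical additions, and the `SimpleNamespace` objects stand in for rerank results that expose `.index` and `.relevance_score`.

```python
from types import SimpleNamespace


def apply_rerank(candidates, rerank_results):
    """Reorder every parallel list in `candidates` by the reranker's indices.

    `candidates` is a dict of equal-length lists (the shape retrieve_data
    returns); `rerank_results` is any sequence of objects exposing `.index`
    and `.relevance_score`.
    """
    order = [r.index for r in rerank_results]
    reordered = {key: [values[i] for i in order] for key, values in candidates.items()}
    reordered["rerank_scores"] = [r.relevance_score for r in rerank_results]
    return reordered


# Hypothetical stand-ins for Stage 1 output and a rerank response
candidates = {
    "retrieved_context_ids": ["B001", "B002", "B003"],
    "retrieved_context": ["mouse", "laptop", "laptop bag"],
    "retrieved_context_ratings": [4.1, 4.7, 3.9],
    "similarity_scores": [0.50, 0.45, 0.40],
}
fake_results = [
    SimpleNamespace(index=1, relevance_score=0.92),
    SimpleNamespace(index=2, relevance_score=0.35),
]

final = apply_rerank(candidates, fake_results)
```

With the real client, `response.results` would take the place of `fake_results`, and the LLM stage could then cite ASINs and ratings alongside each description.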

**Next Steps**:
1. Add reranking to FastAPI RAG endpoint (optional flag)
2. A/B test reranked vs non-reranked results
3. Measure impact on RAGAS metrics (faithfulness, relevance)
4. Monitor latency and cost in production

### Cost-Benefit Analysis

**Scenario: 10,000 queries/month**

| Approach | Total Cost | Latency | Precision |
|----------|-----------|---------|-----------|
| Hybrid only | $2 | 100ms | 70% |
| Hybrid + Rerank | $22 | 600ms | 95% |

**Is it worth it?**
- Extra cost: $20/month ($0.002 per query)
- Extra latency: 500ms (6x slower)
- Precision gain: +25 percentage points (70% → 95%)
- **Decision**: Depends on use case value and budget
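
The arithmetic behind this scenario is simple enough to script, which makes it easy to rerun for other volumes. The per-query figures below are the illustrative ones used throughout this notebook, not quoted prices, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(queries_per_month, cost_per_query):
    """Total monthly cost in dollars, rounded to cents."""
    return round(queries_per_month * cost_per_query, 2)


HYBRID_COST_PER_QUERY = 0.0002  # embedding cost per query (notebook's estimate)
RERANK_COST_PER_QUERY = 0.002   # reranking cost per query (notebook's estimate)

queries = 10_000
hybrid_only = monthly_cost(queries, HYBRID_COST_PER_QUERY)
with_rerank = monthly_cost(queries, HYBRID_COST_PER_QUERY + RERANK_COST_PER_QUERY)
extra = round(with_rerank - hybrid_only, 2)
```

Changing `queries` to the expected production volume shows how quickly the reranking line item comes to dominate.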

### Production Considerations

**Latency Optimization**:
1. **Async reranking**: Don't block on rerank if not critical
2. **Batch requests**: Rerank multiple queries together
3. **Cache results**: Cache reranked results for popular queries
4. **Selective reranking**: Only rerank queries that need it (e.g., low confidence)

**Cost Optimization**:
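1. **Send fewer candidates**: Reranking 10 documents instead of 20 roughly halves reranker compute and latency; with the per-request pricing above it does not change the API bill, and `top_n` only limits how many results are returned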
2. **Hybrid-first**: Try hybrid search first, only rerank if confidence is low
3. **Free alternatives**: Self-host reranker (e.g., bge-reranker-v2-m3)
4. **Caching**: Cache reranked results for repeated queries
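
Caching is the simplest of these to prototype. The sketch below memoizes results in an in-memory dict keyed on the normalized query plus the exact candidate list. `RerankCache` is a hypothetical class, `fake_rerank` stands in for the paid API call, and a production version would also need eviction or a TTL.

```python
class RerankCache:
    """Memoize rerank results keyed on (query, documents)."""

    def __init__(self, rerank_fn):
        self._rerank_fn = rerank_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def rerank(self, query, documents):
        # Normalize the query so trivially different spellings share an entry
        key = (query.strip().lower(), tuple(documents))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._rerank_fn(query, documents)
        self._store[key] = result
        return result


def fake_rerank(query, documents):
    """Hypothetical stand-in for the API call: orders indices by document length."""
    return sorted(range(len(documents)), key=lambda i: len(documents[i]))


cache = RerankCache(fake_rerank)
first = cache.rerank("Can I get a laptop?", ["long description", "short"])
second = cache.rerank("can i get a laptop?  ", ["long description", "short"])
```

The key includes the candidate list because a cached ordering is only valid for the same Stage 1 output. If the collection changes, the candidates change and the entry is naturally bypassed.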

**Quality Monitoring**:
1. Track reranking impact on metrics (RAGAS, user feedback)
2. Compare reranked vs non-reranked results
3. A/B test with real users
4. Monitor for model drift (reranker quality over time)
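
For item 2, a cheap first signal is how far the reranker moves documents relative to Stage 1. The helpers below are hypothetical and work only on the `index` values a rerank response returns. They measure how much the ordering changed, not whether the change was an improvement.

```python
def rank_changes(rerank_indices):
    """List (original_rank, new_rank) for each reranked document, 1-based.

    `rerank_indices` holds each result's position in the Stage 1 list,
    in the order the reranker returned them.
    """
    return [(orig + 1, new) for new, orig in enumerate(rerank_indices, start=1)]


def top_k_overlap(rerank_indices, k):
    """Fraction of the reranked top-k that was already in the Stage 1 top-k."""
    top = rerank_indices[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if i < k) / len(top)


# Hypothetical reranker output: Stage 1 positions in their new order
indices = [3, 0, 7, 1, 2]

changes = rank_changes(indices)
overlap = top_k_overlap(indices, k=3)
```

An overlap close to 1.0 across many queries suggests the reranker rarely changes the top results, which is a hint that Stage 2 may not be earning its cost for that traffic.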

### Alternative Reranking Models

**Cohere Rerank**:
- ✅ Best accuracy (state-of-the-art)
- ✅ Multilingual support
- ✅ Easy API integration
- ❌ Most expensive ($2/1K requests)

**Self-Hosted (bge-reranker-v2-m3)**:
- ✅ Free (after infrastructure costs)
- ✅ Full control, no rate limits
- ✅ Privacy (data stays on-prem)
- ❌ Requires GPU inference server
- ❌ Need to manage scaling and updates

**LLM as Reranker (GPT-4)**:
- ✅ Can provide explanations
- ✅ Can follow custom ranking criteria
- ❌ Very slow (~2s per query)
- ❌ Expensive (~$0.10 per query)
- ❌ Not designed for reranking

### Further Learning

**Topics to Explore**:
1. **Listwise reranking**: Rerank all docs simultaneously (vs scoring each query-document pair independently)
2. **Learning to rank**: Train custom reranker on your data
3. **Multi-stage retrieval**: Add more stages (e.g., Stage 3: LLM reranking)
4. **Query classification**: Decide when to use reranking dynamically

**Resources**:
- Cohere Rerank Docs: https://docs.cohere.com/docs/reranking
- BEIR Benchmark: https://github.com/beir-cellar/beir (reranking leaderboard)
- Sentence Transformers: https://www.sbert.net/examples/applications/cross-encoder/README.html

---

## ✅ Summary

You've learned:
1. ✅ **Why reranking matters**: Cross-encoders are more accurate than bi-encoders
2. ✅ **Two-stage retrieval**: Fast hybrid search → slow cross-encoder refinement
3. ✅ **Cohere Rerank API**: How to use the rerank-v4.0-pro model
4. ✅ **Trade-offs**: Latency (+500ms), cost (~11x), precision (+25 percentage points)
5. ✅ **Integration pattern**: How to add reranking to an existing RAG pipeline

**Next**: Integrate reranking into FastAPI backend and measure impact on RAG quality.