# Day 23: Advanced Retrieval Implementation

In this notebook, we'll build upon our basic RAG system from Day 22 by implementing advanced retrieval techniques to improve the quality of the context provided to the LLM. 

## Overview

We will cover:
1.  **Setup**: Prepare our data and baseline retrieval system.
2.  **Evaluating Retrieval**: Implement and calculate metrics like Recall@k and Mean Reciprocal Rank (MRR).
3.  **Reranking**: Use a more powerful Cross-Encoder model to rerank initial search results for higher precision.
4.  **Hybrid Search**: Combine traditional keyword search (BM25) with semantic search for more robust retrieval.
5.  **Comparison**: Put the metrics of all three retrievers side by side.

## 1. Setup

First, let's install the necessary libraries and set up our initial data and retrieval system from Day 22.

In [None]:
!pip install sentence-transformers faiss-cpu rank_bm25 numpy pandas matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
from rank_bm25 import BM25Okapi

# --- Data and Baseline System from Day 22 ---
document = (
    "The planet Zoltar is a marvel of the Andromeda galaxy, known for its twin suns, Helios Prime and Helios Beta, which create a perpetual twilight. "
    "The surface of Zoltar is covered in crystalline forests that hum with a low-frequency energy. This energy is harnessed by the native Zoltarians, a species of sentient, silicon-based lifeforms. "
    "Zoltarians communicate through a complex series of light patterns, a language known as 'Luminar'. Their society is structured around the 'Great Crystal', a massive geological formation at the planet's north pole that is believed to be the source of all life. "
    "The Zoltarian diet consists of absorbing geothermal energy from volcanic vents scattered across the planet. They reproduce asexually, budding off smaller versions of themselves once every 'Great Cycle', which corresponds to 50 Earth years. "
    "Zoltar's atmosphere is composed mainly of nitrogen and argon, and is unbreathable for carbon-based lifeforms. The planet has a strong magnetic field, which protects it from the intense solar winds of its twin suns. The average temperature on Zoltar is a cool -30 degrees Celsius."
)

def chunk_text(text, chunk_size=30, overlap=5):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

text_chunks = chunk_text(document)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
chunk_embeddings = bi_encoder.encode(text_chunks, convert_to_tensor=False)

d = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(chunk_embeddings.astype('float32'))

print(f"Created {len(text_chunks)} chunks and indexed them in FAISS.")
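
The chunker above slides a window of `chunk_size` words forward by `chunk_size - overlap` words at a time, so the number of chunks depends on both settings, and the last chunk can be shorter than the rest. The following is a minimal sketch of the same windowing on a toy word list; `chunk_words` and the guard on `overlap` are illustrative additions, not part of the Day 22 code.

```python
def chunk_words(words, chunk_size=30, overlap=5):
    """Split a list of words into overlapping windows (hypothetical helper)."""
    # Assumption: reject settings that would make the step zero or negative
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

# Toy input: 12 numbered "words", windows of 5 with an overlap of 2 (step of 3)
toy_words = [f"w{i}" for i in range(12)]
toy_chunks = chunk_words(toy_words, chunk_size=5, overlap=2)

for i, chunk in enumerate(toy_chunks):
    print(i, chunk)
```

Because evaluation labels refer to chunk positions, it is worth printing the real `text_chunks` in the same way before assigning a relevant index to each query.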

## 2. Evaluating Retrieval

To measure our improvements, we first need an evaluation set and functions for our metrics.
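
Before wiring the metrics into a retriever, it can help to compute them by hand on a few made-up rankings. Recall@k is the fraction of queries whose relevant chunk appears in the top k results, and MRR is the mean of 1/rank of the relevant chunk (counted as 0 when it is not retrieved). The sketch below assumes exactly one relevant chunk per query, as in this notebook; the helper names and the toy rankings are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant item (1-based), or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_hit(ranked_ids, relevant_id, k):
    """Return 1 if the relevant item appears in the top k, else 0."""
    return int(relevant_id in ranked_ids[:k])

# Hypothetical rankings for three queries: (retrieved ids, id of the relevant chunk)
toy_runs = [
    ([2, 0, 1], 2),  # relevant chunk at rank 1 -> RR = 1
    ([0, 3, 1], 3),  # relevant chunk at rank 2 -> RR = 1/2
    ([1, 0, 2], 4),  # relevant chunk not retrieved -> RR = 0
]

k = 3
mrr = sum(reciprocal_rank(r, rel) for r, rel in toy_runs) / len(toy_runs)
recall_at_k = sum(recall_hit(r, rel, k) for r, rel in toy_runs) / len(toy_runs)
print(f"MRR = {mrr:.3f}, Recall@{k} = {recall_at_k:.3f}")
# prints: MRR = 0.500, Recall@3 = 0.667
```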

In [None]:
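# Create a small evaluation dataset (query, index_of_correct_chunk).
# Indices refer to positions in text_chunks, i.e. the 30-word windows produced by
# chunk_text (7 chunks with the defaults), not to the five source strings above.
eval_dataset = [
    {"query": "How do Zoltarians talk to each other?", "relevant_chunk_idx": 2},
    {"query": "What is the Zoltarian diet?", "relevant_chunk_idx": 3},
    # The temperature sentence falls in the last chunk (index 6); chunk 4 covers diet and reproduction
    {"query": "What is the weather like on Zoltar?", "relevant_chunk_idx": 6},
    {"query": "What are the suns of Zoltar called?", "relevant_chunk_idx": 0}
]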

def evaluate_retrieval(retriever_fn, eval_dataset, k=5):
    reciprocal_ranks = []
    recall_at_k = 0
    
    for item in eval_dataset:
        query = item["query"]
        true_idx = item["relevant_chunk_idx"]
        
        retrieved_indices = retriever_fn(query, k=k)
        
        # Check for recall
        if true_idx in retrieved_indices:
            recall_at_k += 1
            # Find rank for MRR
            rank = retrieved_indices.index(true_idx) + 1
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)
            
    mrr = np.mean(reciprocal_ranks)
    recall = recall_at_k / len(eval_dataset)
    return {"mrr": mrr, f"recall@{k}": recall}

# --- Baseline Retriever Function ---
def baseline_retriever(query, k=5):
    # Clamp k to the index size: FAISS pads missing results with -1
    k = min(k, index.ntotal)
    query_embedding = bi_encoder.encode([query])
    _, indices = index.search(query_embedding.astype('float32'), k)
    return indices[0].tolist()

# Evaluate the baseline system
baseline_metrics = evaluate_retrieval(baseline_retriever, eval_dataset)
print(f"Baseline Metrics: {baseline_metrics}")

## 3. Reranking with a Cross-Encoder

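Now, let's add a second stage to our retrieval. A bi-encoder embeds the query and each chunk separately, which makes it fast enough to search the whole index. A cross-encoder reads the query and a chunk together and outputs a single relevance score, which is usually more accurate but too slow to run over every chunk. So we use the bi-encoder to fetch a small set of candidates and the cross-encoder to rerank only those.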
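
The retrieve-then-rerank pattern can be sketched without loading any model. In the sketch below, `overlap_score` is a deliberately crude stand-in for the cross-encoder (it only counts shared tokens), and `rerank`, `toy_texts`, and `first_stage` are hypothetical names used for illustration.

```python
def overlap_score(query, text):
    """Stand-in scorer (NOT a cross-encoder): fraction of query tokens found in text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0

def rerank(query, candidate_ids, texts, score_fn, k):
    """Rescore first-stage candidates with score_fn and keep the top k."""
    scored = [(idx, score_fn(query, texts[idx])) for idx in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in scored[:k]]

toy_texts = [
    "crystalline forests hum with energy",
    "zoltarians communicate through light patterns",
    "the atmosphere is nitrogen and argon",
]

# Pretend the first stage returned every chunk in this (imperfect) order
first_stage = [0, 2, 1]
top = rerank("how do zoltarians communicate", first_stage, toy_texts, overlap_score, k=2)
print(top)  # chunk 1 moves to the front
```

Any function with the signature `score_fn(query, text)` can be swapped in for the stand-in scorer, which is what makes the second stage easy to replace.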

In [None]:
# Load a cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def reranked_retriever(query, k=5, candidates_k=10):
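    # 1. Get initial candidates from the fast retriever.
    #    FAISS pads with -1 when fewer than candidates_k chunks exist; drop those,
    #    since text_chunks[-1] would silently score the last chunk.
    candidate_indices = [i for i in baseline_retriever(query, k=candidates_k) if i >= 0]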
    
    # 2. Create pairs of (query, chunk) for the cross-encoder
    pairs = [(query, text_chunks[i]) for i in candidate_indices]
    
    # 3. Score the pairs with the cross-encoder
    scores = cross_encoder.predict(pairs)
    
    # 4. Combine indices and scores, then sort
    scored_indices = list(zip(candidate_indices, scores))
    scored_indices.sort(key=lambda x: x[1], reverse=True)
    
    # 5. Return the top k reranked indices
    reranked_indices = [idx for idx, score in scored_indices[:k]]
    return reranked_indices

# Evaluate the reranked system
reranked_metrics = evaluate_retrieval(reranked_retriever, eval_dataset)
print(f"Reranked Metrics: {reranked_metrics}")

## 4. Hybrid Search (BM25 + Dense)

Let's implement hybrid search by combining our dense vector search with a traditional keyword search algorithm, BM25.
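
One common way to combine two ranked lists is Reciprocal Rank Fusion (RRF), which needs only ranks, so BM25 scores and vector distances never have to be put on the same scale. The widely used formulation scores each document as the sum of `1 / (k_rrf + rank)` over the lists it appears in, with a smoothing constant that is often set to 60. The sketch below is a minimal, self-contained version; `rrf_fuse`, `k_rrf`, and the toy rankings are illustrative names and values, not part of this notebook's pipeline.

```python
def rrf_fuse(rankings, k_rrf=60, top_n=None):
    """Fuse ranked id lists: score(d) = sum of 1 / (k_rrf + rank), with 1-based ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)
    # Only documents that appear in at least one list receive a score
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n] if top_n is not None else fused

keyword_ranking = [3, 0, 2]  # hypothetical BM25 order
dense_ranking = [0, 4, 3]    # hypothetical dense order
print(rrf_fuse([keyword_ranking, dense_ranking], top_n=3))
# prints: [0, 3, 4]
```

Documents 0 and 3 appear in both lists, so they outrank documents that only one retriever found. A larger `k_rrf` flattens the difference between adjacent ranks.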

In [None]:
# 1. Set up the BM25 index
tokenized_chunks = [chunk.split(" ") for chunk in text_chunks]
bm25 = BM25Okapi(tokenized_chunks)

def hybrid_retriever(query, k=5):
    # Get BM25 (keyword) results
    tokenized_query = query.split(" ")
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_top_indices = np.argsort(bm25_scores)[::-1][:k]
    
    # Get dense (semantic) results
    dense_top_indices = baseline_retriever(query, k=k)
    
    # 2. Fuse the results using Reciprocal Rank Fusion (RRF)
    # Here the score of a document is the sum of its reciprocal ranks across both lists.
    # This is a simplified form of RRF that omits the usual smoothing constant.
    rrf_scores = {i: 0 for i in range(len(text_chunks))}
    
    for rank, idx in enumerate(bm25_top_indices):
        rrf_scores[idx] += 1 / (rank + 1) # enumerate is 0-based; +1 gives the 1-based rank
    for rank, idx in enumerate(dense_top_indices):
        rrf_scores[idx] += 1 / (rank + 1)
        
    # 3. Sort documents by their fused RRF score
    sorted_docs = sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)
    
    hybrid_indices = [doc[0] for doc in sorted_docs[:k]]
    return hybrid_indices

# Evaluate the hybrid system
hybrid_metrics = evaluate_retrieval(hybrid_retriever, eval_dataset)
print(f"Hybrid Search Metrics: {hybrid_metrics}")
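
One detail of the BM25 setup above: splitting on spaces keeps case and punctuation, so a query token like `diet?` does not match the chunk token `diet`. The sketch below compares that behaviour with a slightly more forgiving tokenizer; `simple_tokenize` and its regular expression are an assumed, illustrative choice rather than anything `rank_bm25` requires.

```python
import re

def simple_tokenize(text):
    """Lowercase and keep runs of letters/digits, allowing inner hyphens and apostrophes."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

query = "What is the Zoltarian diet?"
chunk = "The Zoltarian diet consists of absorbing geothermal energy."

# Tokens shared by query and chunk under each tokenization
naive_overlap = set(query.split(" ")) & set(chunk.split(" "))
clean_overlap = set(simple_tokenize(query)) & set(simple_tokenize(chunk))

print(sorted(naive_overlap))  # prints: ['Zoltarian']
print(sorted(clean_overlap))  # prints: ['diet', 'the', 'zoltarian']
```

If you adopt a tokenizer like this, apply it to both the chunks passed to `BM25Okapi` and the query passed to `get_scores`, so that both sides are normalized the same way.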

## 5. Comparison and Conclusion

Let's compare the performance of our three retrieval strategies.

In [None]:
# Create a DataFrame for comparison
metrics_df = pd.DataFrame([baseline_metrics, reranked_metrics, hybrid_metrics], 
                          index=['Baseline', 'Reranked', 'Hybrid'])

print("Comparison of Retrieval Metrics:")
display(metrics_df)

# Plot the results
metrics_df.plot(kind='bar', figsize=(10, 6), rot=0)
plt.title('Retrieval Performance Comparison')
plt.ylabel('Score')
plt.ylim(0, 1.1)
plt.grid(axis='y', linestyle='--')
plt.show()

print("\nCompare the rows above. With only four queries and a handful of chunks, differences between the methods can be small or noisy. "
      "In general, reranking tends to improve precision at the top of the list (higher MRR), while hybrid search tends to improve robustness (higher recall), especially for queries that hinge on exact keywords. "
      "In a production system, you might even combine them: use hybrid search for initial retrieval and then a reranker on top.")