# SREnity - Enterprise SRE Agent Prototype

This notebook contains the development and testing of the SREnity agentic RAG system for production incident resolution.


In [4]:
# Install required packages
%pip install openai langchain langchain-community qdrant-client python-dotenv pandas numpy requests beautifulsoup4 ragas rank-bm25 tavily-python cohere langsmith markdownify rapidfuzz tiktoken



Note: you may need to restart the kernel to use updated packages.


In [48]:
# Imports and Setup
import os
import sys
import logging
from pathlib import Path

# Add project root to Python path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Set up minimal logging
logging.basicConfig(level=logging.WARNING)

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Configuration
from src.utils.config import get_config
config = get_config()


## Data Loading - GitLab Runbooks

**Data Source:** Production runbooks from https://runbooks.gitlab.com/ - comprehensive SRE documentation covering infrastructure, databases, CI/CD pipelines, monitoring, and incident response procedures. These are real-world operational guides used by GitLab's SRE team.

**Multi-Service Foundation:** 696 enterprise runbooks covering Redis, PostgreSQL, Elasticsearch, CI/CD, monitoring, and more. This notebook focuses on the **Redis service** (33 documents after filtering) to demonstrate the RAG pipeline, but the architecture supports filtering by any combination of services for real-world multi-system incidents.

**Smart loading:** Loads `data/runbooks/gitlab_runbooks.json` if it already exists; otherwise downloads the runbooks and saves them to that file.
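
The load-or-download behavior can be sketched with the standard library alone. This is a minimal illustration, not the project's `document_loader` code: `load_or_fetch` and `fake_download` are hypothetical names, and the record fields are assumed for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_file: Path, fetch: Callable[[], list]) -> list:
    """Return cached records if the file exists; otherwise fetch and cache them."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    records = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(records), encoding="utf-8")
    return records


def fake_download() -> list:
    """Hypothetical stand-in for the real download step."""
    return [{"source": "redis/memory.md", "page_content": "Check INFO MEMORY"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "runbooks" / "gitlab_runbooks.json"
    first = load_or_fetch(cache, fake_download)  # cache miss: fetches and writes the file
    second = load_or_fetch(cache, lambda: [])    # cache hit: the fetch callable is not used
    cache_was_written = cache.exists()
```

Passing the download step as a callable keeps the caching logic independent of where the documents come from.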


In [49]:
# Smart GitLab Runbook Loading with Service Filtering
from src.utils.document_loader import download_gitlab_runbooks, save_documents, load_saved_documents
from pathlib import Path

def filter_by_service(documents, services=('redis',)):
    """Keep documents whose source path contains any of the given service names"""
    filtered = []
    for doc in documents:
        source = doc.metadata.get('source', '').lower()
        if any(service in source for service in services):
            filtered.append(doc)
    return filtered

# Check if runbooks file exists
runbooks_file = Path("../data/runbooks/gitlab_runbooks.json")

if runbooks_file.exists():
    print("Loading saved runbooks...")
    documents = load_saved_documents()
    print(f"Loaded {len(documents)} total documents")
else:
    print("Downloading fresh runbooks...")
    documents = download_gitlab_runbooks()
    print(f"Downloaded {len(documents)} documents")
    
    print("Saving documents...")
    filepath = save_documents(documents)
    print(f"Saved to {filepath}")

# Filter to Redis services only
documents = filter_by_service(documents, ['redis'])
print(f"Filtered to {len(documents)} Redis documents")


Loading saved runbooks...
Loaded 696 total documents
Filtered to 33 Redis documents


## RAG Pipeline - Document Processing

This section implements document preprocessing, chunking, and vector storage:
1. **HTML to Markdown** - Convert documents to Markdown, dropping HTML tags that add size without adding retrievable content
2. **Token-based Chunking** - Split documents using tiktoken
3. **Vector Database** - Generate embeddings and store in Qdrant
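
The idea behind token-based chunking with overlap can be sketched as a sliding window over a token list. This is a simplified illustration under stated assumptions: whitespace splitting stands in for tiktoken, `chunk_tokens` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` additionally prefers to split on separators such as blank lines.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace tokens are an assumption made to keep the example dependency-free.
tokens = "redis memory usage can be checked with the info memory command".split()
chunks = chunk_tokens(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last two tokens of the previous one, which is what lets a sentence that straddles a boundary still appear intact in at least one chunk.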

In [50]:
# Document Preprocessing and Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken
from src.utils.document_loader import preprocess_html_documents

def chunk_documents_with_tiktoken(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents using tiktoken for accurate token counting"""
    
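    # Get tiktoken encoding for the configured model; fall back to cl100k_base
    # if tiktoken does not recognize the model name
    try:
        encoding = tiktoken.encoding_for_model(config.openai_model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")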
    
    # Create text splitter with tiktoken length function
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=lambda text: len(encoding.encode(text)),
        separators=["\n\n", "\n", " ", ""]
    )
    
    # Split documents
    chunks = text_splitter.split_documents(documents)
    
    # Calculate statistics
    total_tokens = sum(len(encoding.encode(chunk.page_content)) for chunk in chunks)
    avg_tokens = total_tokens / len(chunks) if chunks else 0
    
    print(f"Created {len(chunks)} chunks ({total_tokens:,} tokens, avg {avg_tokens:.0f} tokens/chunk)")
    
    return chunks

# Preprocess HTML documents to markdown
print("Preprocessing HTML documents to markdown...")
processed_documents = preprocess_html_documents(documents)

# Chunk the preprocessed documents
print("Chunking preprocessed documents...")
chunks = chunk_documents_with_tiktoken(processed_documents, chunk_size=1000, chunk_overlap=200)


Preprocessing HTML documents to markdown...
HTML to Markdown conversion results:
  Original: 290,437 - 575,312 chars
  Markdown: 52,226 - 96,814 chars
  Reduction: 81.5%
Chunking preprocessed documents...
Created 685 chunks (631,830 tokens, avg 922 tokens/chunk)


In [51]:
# Qdrant Vector Database Setup
from src.utils.config import get_model_factory
from langchain_community.vectorstores import Qdrant
from pathlib import Path

def create_embeddings_and_store(chunks):
    """Create embeddings and store in Qdrant"""
    
    # Get model factory and create embeddings
    model_factory = get_model_factory()
    embeddings = model_factory.get_embeddings()

    # Log the Qdrant URL configuration
    print(f"Creating vector store at: {config.qdrant_url}")
    print(f"Using collection name: {config.qdrant_collection_name}")
    
    # Create vector store with local file storage and proper collection name
    vector_store = Qdrant.from_documents(
        documents=chunks,
        embedding=embeddings,
        path=config.qdrant_url,
        collection_name=config.qdrant_collection_name
    )

    print(f"Stored {len(chunks)} chunks in Qdrant at {config.qdrant_url}")
    return vector_store

def load_existing_vector_store():
    """Load existing Qdrant vector store"""
    model_factory = get_model_factory()
    embeddings = model_factory.get_embeddings()
    
    # Load existing vector store from file path
    vector_store = Qdrant.from_existing_collection(
        embedding=embeddings,
        path=config.qdrant_url,
        collection_name=config.qdrant_collection_name
    )
    
    print(f"Loaded existing vector store from {config.qdrant_url}")
    return vector_store

# Check if vector database exists, otherwise create it
qdrant_path = Path(config.qdrant_url)

if qdrant_path.exists():
    print("Vector database exists. Loading...")
    vector_store = load_existing_vector_store()
else:
    print("Vector database not found. Creating new one...")
    vector_store = create_embeddings_and_store(chunks)


Vector database exists. Loading...
Loaded existing vector store from ../qdrant_db


## Synthetic Data Generation (SDG)

This section creates test data for runbook helper evaluation:
1. **Question Generation** - Create how-to questions from Redis runbook chunks
2. **Answer Generation** - Generate expected answers from runbook content
3. **Test Dataset** - Create evaluation dataset with ground truth


In [None]:
# Synthetic Data Generation using RAGAS
from ragas.testset.synthesizers.generate import TestsetGenerator
from src.utils.config import get_model_factory
from langchain_core.documents import Document
import pandas as pd


def generate_test_dataset(documents, num_questions=10):
    """Generate synthetic test data using RAGAS"""
    
    print(f"Generating {num_questions} test questions from {len(documents)} documents...")
    
    # Use preprocessed documents (already converted to markdown)
    print("Using preprocessed markdown documents for SDG...")
    
    # Get model factory for LLM and embeddings
    model_factory = get_model_factory()
    
    # Generate test dataset using TestsetGenerator
    generator = TestsetGenerator.from_langchain(
        llm=model_factory.get_llm(),
        embedding_model=model_factory.get_embeddings()
    )
    
    test_data = generator.generate_with_langchain_docs(
        documents=documents,  # Use preprocessed documents
        testset_size=num_questions
    )
    
    print(f"Generated {len(test_data)} test samples")
    return test_data

# Generate test dataset (start with smaller size to test)
print("Creating synthetic test data using RAGAS...")
test_dataset = generate_test_dataset(processed_documents, num_questions=8)

In [None]:

# Display cached test questions
df = pd.DataFrame(test_dataset)
print(f"Loaded {len(df)} test questions")
display(df)


### Caching the SDG questions for reuse

- Questions are cached in `data/sdg/redis_sdg_questions.json`.
- Delete that file to clear the cache.

In [31]:
# Cache SDG Questions using utils function
from src.utils.sdg_utils import save_test_dataset

# Save the generated test dataset with correct field mapping
if 'test_dataset' in locals():
    cache_file = save_test_dataset(test_dataset)
    print("Test dataset cached with correct field mapping")
else:
    print("No test_dataset found to cache")

No test_dataset found to cache


## RAG Pipeline - Retrieval and Testing

This section implements the core RAG functionality:
1. **Naive Retrieval** - Basic semantic search using vector similarity
2. **Incident Testing** - Test the system with sample Redis incidents
3. **Response Generation** - Generate runbook recommendations
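
Naive retrieval amounts to ranking stored chunks by vector similarity to the query. The sketch below shows that ranking with cosine similarity in plain Python. The three-dimensional vectors and chunk texts are made up for illustration, `top_k` is a hypothetical helper, and the distance metric actually configured in Qdrant may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: real embeddings have hundreds or thousands of dimensions.
index = [
    ("Check INFO MEMORY output", [0.9, 0.1, 0.0]),
    ("Rotate TLS certificates", [0.0, 0.2, 0.9]),
    ("Tune maxmemory eviction policy", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]
results = top_k(query, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```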


In [52]:
# Redis Runbook Assistant - Runnable Retrieval System
from src.utils.config import get_model_factory
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter

# Create the naive retriever as a runnable

def create_naive_retriever(vector_store, k=3):
    """Create a runnable retriever from vector store"""
    def retrieve_docs(question):
        docs = vector_store.similarity_search(question, k=k)
        return docs
    return RunnableLambda(retrieve_docs)

# Create the RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""
You are a Redis expert helping SREs find the right procedures. Based on the question and relevant runbook documentation, provide a clear, step-by-step answer.

**Question:**
{question}

**Relevant Runbook Documentation:**
{context}

**Please provide:**
1. **Direct Answer** - Clear response to the question
2. **Step-by-Step Instructions** - Detailed procedure from the runbooks
3. **Key Commands** - Specific commands or configurations needed
4. **Important Notes** - Warnings, prerequisites, or additional context

Format your response clearly with headers and numbered steps.
""")

# Create the naive retrieval chain using Runnable approach
def create_naive_retrieval_chain(vector_store, model_factory, k=3):
    """Create a Runnable chain for naive retrieval"""
    
    # Create retriever and model
    naive_retriever = create_naive_retriever(vector_store, k)
    chat_model = model_factory.get_llm()
    
    # Create the chain - optimized approach
    naive_retrieval_chain = (
        # Input: {"question": "user question"}
        # Output: {"docs": [Document], "question": "user question"}
        {"docs": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
        # Generate response and extract contexts in one pass
        | RunnablePassthrough.assign(
            response=lambda x: chat_model.invoke(
                rag_prompt.format(
                    question=x["question"],
                    context="\n\n".join([doc.page_content for doc in x["docs"]])
                )
            ).content,
            contexts=lambda x: [doc.page_content for doc in x["docs"]]  # Extract contexts as strings
        )
    )
    
    return naive_retrieval_chain

# Create the chain
naive_chain = create_naive_retrieval_chain(vector_store, get_model_factory(), k=3)

# Test the Runnable chain
test_query = "How to monitor Redis memory usage?"
print(f"Query: {test_query}")
print("=" * 50)

# Invoke the chain
result = naive_chain.invoke({"question": test_query})
print("Response:")
print(result["response"])
print(f"\nSources Used: {len(result['contexts'])} document chunks")


Query: How to monitor Redis memory usage?
Response:
### 1. Direct Answer
To monitor Redis memory usage effectively, you should regularly check Redis memory metrics, especially the `used_memory`, `maxmemory`, and `memory_stats`. Additionally, monitor the `redis_evicted_keys_total` metric to detect memory saturation issues. Use Redis commands like `INFO MEMORY` and `MEMORY STATS`, and set up Prometheus/ Grafana dashboards for continuous monitoring.

---

### 2. Step-by-Step Instructions

#### Step 1: Access Redis CLI
- Connect to your Redis instance using `redis-cli`:
  ```bash
  redis-cli -h <redis-host> -p <port>
  ```

#### Step 2: Check Basic Memory Usage
- Run the `INFO MEMORY` command:
  ```bash
  INFO MEMORY
  ```
- Review the output for:
  - `used_memory`: current memory used by Redis (bytes)
  - `maxmemory`: configured maximum memory limit (bytes)
  - `maxmemory_policy`: eviction policy in use

#### Step 3: Review Memory Statistics
- Run the `MEMORY STATS` command:
  (output truncated)

## RAGAS Evaluation Framework

This section implements comprehensive evaluation of our RAG pipeline using RAGAS metrics:

1. **Evaluation Dataset Creation** - Run RAG pipeline on test questions
2. **RAGAS Metrics** - Measure faithfulness, relevancy, precision, recall
3. **Performance Analysis** - Compare naive vs advanced retrieval
4. **Results Summary** - Certification-ready metrics table
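
A summary table of mean, standard deviation, minimum, and maximum can be derived from per-question scores as sketched below. The scores are hypothetical, and the use of the sample standard deviation is an assumption; how `process_ragas_results` computes its statistics is not shown in this notebook.

```python
import statistics


def summarize(scores):
    """Aggregate per-question metric scores into summary statistics."""
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation (n - 1 denominator); assumed, not confirmed.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical scores for eight questions: six perfect, two zero.
context_precision_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
summary = summarize(context_precision_scores)
print(f"mean={summary['mean']:.3f} std={summary['std']:.3f} "
      f"min={summary['min']:.3f} max={summary['max']:.3f}")
```

With only eight questions, a single question flipping between 0 and 1 moves the mean by 0.125, so small differences between retrievers should be read with caution.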


In [53]:
from src.utils.sdg_utils import load_cached_test_questions

# Test the updated functions
print("Testing updated functions with proper RAGAS field names...")
cached_questions = load_cached_test_questions()

if cached_questions:
    print(f"Found {len(cached_questions)} test questions")
    print("\nSample test questions:")
    for i, q in enumerate(cached_questions[:3], 1):
        print(f"{i}. {q['user_input']}")
    print(f"\nSample structure: {cached_questions[0].keys()}")
else:
    print("No cached questions found. Please run SDG first.")


Testing updated functions with proper RAGAS field names...
Loaded 8 cached test questions from ../data/sdg/redis_sdg_questions.json
Found 8 test questions

Sample test questions:
1. What is Hashicorp Vault used for in managing Redis cluster passwords?
2. Is GitLab used for managing redis configurations in the infrastructure setup?
3. What is an SSL Certficate?

Sample structure: dict_keys(['user_input', 'reference', 'reference_contexts', 'retrieved_contexts', 'response'])


In [54]:
# Create Evaluation Dataset
from src.utils.sdg_utils import create_evaluation_dataset

print("Creating evaluation dataset...")
evaluation_data = create_evaluation_dataset(cached_questions, naive_chain)

print(f"✅ Created evaluation dataset with {len(evaluation_data)} samples")

Creating evaluation dataset...
Creating evaluation dataset from 8 test questions...
Processing question 1/8: What is Hashicorp Vault used for in managing Redis...
Processing question 2/8: Is GitLab used for managing redis configurations i...
Processing question 3/8: What is an SSL Certficate?...
Processing question 4/8: gitlab.com what is it...
Processing question 5/8: How does connecting to various services via Telepo...
Processing question 6/8: How do migration procedures for Redis sharding, in...
Processing question 7/8: how to scale up redis cluster and manage nodes fro...
Processing question 8/8: How do failover and recovery procedures help in tr...
✅ Created evaluation dataset with 8 samples


In [55]:
# Run RAGAS Evaluation
from src.utils.ragas_utils import run_ragas_evaluation

print("Running RAGAS evaluation...")
naive_results = run_ragas_evaluation(evaluation_data, "Naive Retriever")

if naive_results:
    print("✅ RAGAS evaluation completed successfully!")
else:
    print("❌ RAGAS evaluation failed")


Running RAGAS evaluation...

RAGAS EVALUATION RESULTS - NAIVE RETRIEVER
Converting to RAGAS evaluation format...
Evaluation dataset created with 8 samples

Running RAGAS evaluation metrics...
This may take a few minutes...


Evaluating: 100%|██████████| 48/48 [02:26<00:00,  3.04s/it]


✅ RAGAS evaluation completed successfully!
✅ RAGAS evaluation completed successfully!


In [56]:
# Process RAGAS Results
from src.utils.ragas_utils import process_ragas_results

print("Processing RAGAS results...")
processed_naive_results = process_ragas_results(naive_results)

if processed_naive_results:
    print("✅ Results processing completed successfully!")
else:
    print("❌ Results processing failed")


Processing RAGAS results...

📊 PROCESSING RESULTS FOR NAIVE RETRIEVER

📊 SUMMARY METRICS TABLE:
Metric               Mean Score   Std Dev    Min Score  Max Score 
----------------------------------------------------------------------
faithfulness         0.524        0.325      0.000      1.000     
answer_relevancy     0.838        0.106      0.695      0.976     
context_precision    0.750        0.463      0.000      1.000     
context_recall       0.708        0.375      0.000      1.000     
answer_correctness   0.537        0.268      0.145      0.953     
context_entity_recall 0.025        0.071      0.000      0.200     

📈 PERFORMANCE INTERPRETATION:
--------------------------------------------------
Faithfulness: 0.524 - 🟠 Fair
Answer Relevancy: 0.838 - 🟢 Excellent
Context Precision: 0.750 - 🟡 Good
Context Recall: 0.708 - 🟡 Good
Answer Correctness: 0.537 - 🟠 Fair (Critical for production)
Context Entity Recall: 0.025 - 🔴 Needs Improvement (Command/entity coverage)

📋 DETAILED

## Advanced Retrieval - BM25 + Reranker

This section implements advanced retrieval using BM25 + Cohere Reranker, with the goal of improving on naive vector retrieval:

1. **BM25 Retrieval** - Keyword-based retrieval for exact term matching
2. **Cohere Reranker** - Cross-attention reranking for relevance scoring  
3. **Performance Comparison** - Compare against naive vector retrieval
4. **RAGAS Evaluation** - Measure improvement in retrieval metrics
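
To make the keyword-matching stage concrete, here is a small pure-Python sketch of Okapi BM25 scoring. It is an illustration only: the notebook itself relies on `create_bm25_reranker_chain`, whose internals are not shown, and the `k1`, `b` values and IDF variant below are common choices assumed for the example. The Cohere reranking stage is not reproduced here.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus with whitespace tokenization (an assumption for brevity).
docs = [
    "redis memory usage info memory command",
    "postgres replication lag alert",
    "redis failover sentinel procedure",
]
tokenized = [doc.split() for doc in docs]
scores = bm25_scores("redis memory".split(), tokenized)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for i in ranked:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

Because BM25 rewards exact term matches, it tends to surface chunks containing literal command or metric names, which a reranker can then reorder by semantic relevance.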


In [57]:
# BM25 + Reranker Implementation
from src.rag.advanced_retrieval import create_bm25_reranker_chain
from src.utils.config import get_model_factory

print("Setting up advanced retrieval chain...")
bm25_reranker_chain = create_bm25_reranker_chain(
    chunked_docs=chunks, 
    model_factory=get_model_factory(),
    bm25_k=12,
    rerank_k=5
)

print("Advanced retrieval chain ready!")
print(f"BM25 + Reranker: {'✅' if bm25_reranker_chain else '❌'}")

Setting up advanced retrieval chain...
Creating BM25 + Reranker chain...
Creating BM25 retriever from 685 documents...
BM25 retriever created (k=12)
BM25 + Reranker chain created (BM25 k=12, Rerank k=5)
Advanced retrieval chain ready!
BM25 + Reranker: ✅


### Test BM25 + Rerank

In [58]:
# Test BM25 + Reranker chain
test_query = "How to monitor Redis memory usage?"
print(f"Testing Query: '{test_query}'")
print("=" * 60)

if bm25_reranker_chain:
    result = bm25_reranker_chain.invoke({"question": test_query})  # Use dict format
    print(f"Response: {result['response'][:200]}...")
    print(f"Contexts: {len(result['contexts'])} chunks")
else:
    print("BM25 + Reranker chain not available")

Testing Query: 'How to monitor Redis memory usage?'
Response: ### 1. Direct Answer
To monitor Redis memory usage effectively, you should use `redis-cli` commands such as `INFO memory` and `MEMORY STATS`, and leverage Prometheus metrics if available. Additionally...
Contexts: 5 chunks


## Evaluate BM25 + Reranker retriever

In [60]:
# RAGAS Evaluation for BM25 + Reranker
from src.utils.sdg_utils import create_evaluation_dataset
from src.utils.ragas_utils import run_ragas_evaluation, process_ragas_results

print("Creating evaluation dataset for BM25 + Reranker...")
reranker_evaluation_data = create_evaluation_dataset(cached_questions, bm25_reranker_chain)

print(f"Created BM25 + Reranker evaluation dataset with {len(reranker_evaluation_data)} samples")

print("\nRunning RAGAS evaluation for BM25 + Reranker...")
reranker_results = run_ragas_evaluation(reranker_evaluation_data, "BM25 + Reranker")

if reranker_results:
    print("BM25 + Reranker RAGAS evaluation completed!")
    processed_reranker_results = process_ragas_results(reranker_results)
else:
    print("BM25 + Reranker RAGAS evaluation failed")

Creating evaluation dataset for BM25 + Reranker...
Creating evaluation dataset from 8 test questions...
Processing question 1/8: What is Hashicorp Vault used for in managing Redis...
Processing question 2/8: Is GitLab used for managing redis configurations i...
Processing question 3/8: What is an SSL Certficate?...
Processing question 4/8: gitlab.com what is it...
Processing question 5/8: How does connecting to various services via Telepo...
Processing question 6/8: How do migration procedures for Redis sharding, in...
Processing question 7/8: how to scale up redis cluster and manage nodes fro...
Processing question 8/8: How do failover and recovery procedures help in tr...
Created BM25 + Reranker evaluation dataset with 8 samples

Running RAGAS evaluation for BM25 + Reranker...

RAGAS EVALUATION RESULTS - BM25 + RERANKER
Converting to RAGAS evaluation format...
Evaluation dataset created with 8 samples

Running RAGAS evaluation metrics...
This may take a few minutes...


Evaluating: 100%|██████████| 48/48 [02:09<00:00,  2.71s/it]


✅ RAGAS evaluation completed successfully!
BM25 + Reranker RAGAS evaluation completed!

📊 PROCESSING RESULTS FOR BM25 + RERANKER

📊 SUMMARY METRICS TABLE:
Metric               Mean Score   Std Dev    Min Score  Max Score 
----------------------------------------------------------------------
faithfulness         0.735        0.311      0.020      1.000     
answer_relevancy     0.817        0.153      0.593      0.971     
context_precision    0.500        0.535      0.000      1.000     
context_recall       0.667        0.436      0.000      1.000     
answer_correctness   0.443        0.350      0.103      0.947     
context_entity_recall 0.218        0.332      0.000      1.000     

📈 PERFORMANCE INTERPRETATION:
--------------------------------------------------
Faithfulness: 0.735 - 🟡 Good
Answer Relevancy: 0.817 - 🟢 Excellent
Context Precision: 0.500 - 🟠 Fair
Context Recall: 0.667 - 🟡 Good
Answer Correctness: 0.443 - 🟠 Fair (Critical for production)
Context Entity Recall: 0.218 

## Comparison

In [63]:
# Check the structure of your processed results
print("Naive results structure:")
print(type(processed_naive_results))
if hasattr(processed_naive_results, 'keys'):
    print("Keys:", list(processed_naive_results.keys()))
else:
    print("Not a dict")

print("\nReranker results structure:")
print(type(processed_reranker_results))
if hasattr(processed_reranker_results, 'keys'):
    print("Keys:", list(processed_reranker_results.keys()))
else:
    print("Not a dict")

Naive results structure:
<class 'dict'>
Keys: ['result', 'summary_stats', 'chain_name', 'dataframe']

Reranker results structure:
<class 'dict'>
Keys: ['result', 'summary_stats', 'chain_name', 'dataframe']


In [65]:
# Performance Comparison
import pandas as pd

# Create comparison table
comparison_data = [
    {
        'Retriever': 'Naive Vector',
        'Faithfulness': f"{processed_naive_results['summary_stats']['faithfulness']['mean']:.3f}",
        'Answer Relevancy': f"{processed_naive_results['summary_stats']['answer_relevancy']['mean']:.3f}",
        'Context Precision': f"{processed_naive_results['summary_stats']['context_precision']['mean']:.3f}",
        'Context Recall': f"{processed_naive_results['summary_stats']['context_recall']['mean']:.3f}",
        'Answer Correctness': f"{processed_naive_results['summary_stats']['answer_correctness']['mean']:.3f}",
        'Context Entity Recall': f"{processed_naive_results['summary_stats']['context_entity_recall']['mean']:.3f}"
    },
    {
        'Retriever': 'BM25 + Reranker',
        'Faithfulness': f"{processed_reranker_results['summary_stats']['faithfulness']['mean']:.3f}",
        'Answer Relevancy': f"{processed_reranker_results['summary_stats']['answer_relevancy']['mean']:.3f}",
        'Context Precision': f"{processed_reranker_results['summary_stats']['context_precision']['mean']:.3f}",
        'Context Recall': f"{processed_reranker_results['summary_stats']['context_recall']['mean']:.3f}",
        'Answer Correctness': f"{processed_reranker_results['summary_stats']['answer_correctness']['mean']:.3f}",
        'Context Entity Recall': f"{processed_reranker_results['summary_stats']['context_entity_recall']['mean']:.3f}"
    }
]

df = pd.DataFrame(comparison_data)
print("📊 RETRIEVAL PERFORMANCE COMPARISON")
print("=" * 60)
print(df.to_string(index=False))

# Show improvements in percentage
print("\n🎯 IMPROVEMENTS (BM25 + Reranker vs Naive Vector):")
print("-" * 50)
for metric in ['Faithfulness', 'Answer Relevancy', 'Context Precision', 'Context Recall', 'Answer Correctness', 'Context Entity Recall']:
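    naive_val = float(comparison_data[0][metric])
    advanced_val = float(comparison_data[1][metric])
    if naive_val == 0:
        # Relative change is undefined when the baseline is zero
        print(f"➖ {metric}: n/a (baseline is 0)")
        continue
    improvement_pct = ((advanced_val - naive_val) / naive_val) * 100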
    if improvement_pct > 0:
        print(f"✅ {metric}: +{improvement_pct:.1f}%")
    elif improvement_pct < 0:
        print(f"❌ {metric}: {improvement_pct:.1f}%")
    else:
        print(f"➖ {metric}: No change")

📊 RETRIEVAL PERFORMANCE COMPARISON
      Retriever Faithfulness Answer Relevancy Context Precision Context Recall Answer Correctness Context Entity Recall
   Naive Vector        0.524            0.838             0.750          0.708              0.537                 0.025
BM25 + Reranker        0.735            0.817             0.500          0.667              0.443                 0.218

🎯 IMPROVEMENTS (BM25 + Reranker vs Naive Vector):
--------------------------------------------------
✅ Faithfulness: +40.3%
❌ Answer Relevancy: -2.5%
❌ Context Precision: -33.3%
❌ Context Recall: -5.8%
❌ Answer Correctness: -17.5%
✅ Context Entity Recall: +772.0%
