## Activity #1: Retriever Evaluation with RAGAS

This notebook evaluates different retriever methods using RAGAS for synthetic dataset generation.

### Objectives:
1. Create a "golden dataset" using RAGAS Synthetic Data Generation
2. Evaluate 6 different retrievers on combined CSV + PDF data
3. Compare performance, cost, and latency
4. Provide recommendations

### Data Sources:
- **CSV Data**: Consumer complaint narratives
- **PDF Data**: Federal Student Aid handbooks

### Retrievers to Evaluate:
- Naive Retrieval (Embedding-based)
- BM25 Retriever
- Multi-Query Retriever
- Parent-Document Retriever
- Contextual Compression (Reranking)
- Ensemble Retriever


## Step 1: Setup and Dependencies

In [1]:
import os
import time
import pandas as pd
from datetime import datetime
import getpass

# Set up API keys
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API Key:")

# Optional: Set up LangSmith for advanced evaluation
try:
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key (optional, press Enter to skip):")
    if os.environ["LANGCHAIN_API_KEY"]:
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_PROJECT"] = "Retriever-Evaluation"
        print("✅ LangSmith tracing enabled")
    else:
        print("⚠️  LangSmith skipped")
except:
    print("⚠️  LangSmith skipped")

print("✅ API keys configured")

✅ LangSmith tracing enabled
✅ API keys configured


## Step 2: Load Data

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

# Load CSV data
loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
        "Date received", "Product", "Sub-product", "Issue", "Sub-issue", 
        "Consumer complaint narrative", "Company", "State", "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

# Set page content to complaint narrative
for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

print(f"✅ Loaded {len(loan_complaint_data)} complaint documents from CSV")

# Load PDF data
path = "data/"
pdf_loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
pdf_docs = pdf_loader.load()

print(f"✅ Loaded {len(pdf_docs)} PDF documents")

# Combine all documents
all_docs = loan_complaint_data + pdf_docs
print(f"✅ Total documents: {len(all_docs)} (CSV: {len(loan_complaint_data)}, PDF: {len(pdf_docs)})")

print(f"\nSample complaint: {loan_complaint_data[0].page_content[:100]}...")
print(f"Sample PDF content: {pdf_docs[0].page_content[:100]}...")

✅ Loaded 825 complaint documents from CSV
✅ Loaded 269 PDF documents
✅ Total documents: 1094 (CSV: 825, PDF: 269)

Sample complaint: The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were no...
Sample PDF content: Volume 3
Academic Calendars, Cost of Attendance, and
Packaging
Introduction
This volume of the Feder...


## Step 3: Create Golden Dataset using RAGAS

In [4]:
# RAGAS setup
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize models for RAGAS
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

print("✅ RAGAS models initialized")

✅ RAGAS models initialized


In [7]:
# Generate synthetic dataset using abstracted SDG
from ragas.testset import TestsetGenerator

print("Generating synthetic test dataset using RAGAS...")

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Use subset for cost efficiency
# Try PDF docs first as they tend to work better with RAGAS
testset_docs = pdf_docs[:30] + loan_complaint_data[:30]  # Mixed approach

golden_dataset = generator.generate_with_langchain_docs(
    testset_docs, 
    testset_size=10
)

print(f"✅ Generated {len(golden_dataset)} synthetic QA pairs")

# Convert to pandas for easier viewing
df = golden_dataset.to_pandas()
print(f"\nDataset columns: {list(df.columns)}")
print(f"Dataset shape: {df.shape}")

# Show sample questions
print("\nSample questions from the dataset:")
if 'question' in df.columns:
    question_col = 'question'
elif 'user_input' in df.columns:
    question_col = 'user_input'
elif len(df.columns) > 0:
    question_col = df.columns[0]  # Use first column as fallback
    print(f"Using column '{question_col}' as questions:")
else:
    print("No columns found in dataset!")
    question_col = None

if question_col:
    for i in range(min(3, len(df))):
        print(f"{i+1}. {df.iloc[i][question_col]}")

Generating synthetic test dataset using RAGAS...


Applying HeadlinesExtractor:   0%|          | 0/23 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/60 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to ap

Applying SummaryExtractor:   0%|          | 0/41 [00:00<?, ?it/s]

Property 'summary' already exists in node '6e960b'. Skipping!
Property 'summary' already exists in node '73f140'. Skipping!
Property 'summary' already exists in node 'f25110'. Skipping!
Property 'summary' already exists in node '3e841f'. Skipping!
Property 'summary' already exists in node '3eb292'. Skipping!
Property 'summary' already exists in node '39235c'. Skipping!
Property 'summary' already exists in node 'af322b'. Skipping!
Property 'summary' already exists in node 'aad1bc'. Skipping!
Property 'summary' already exists in node '5c9f93'. Skipping!
Property 'summary' already exists in node '5b2c1d'. Skipping!
Property 'summary' already exists in node 'a8afb0'. Skipping!
Property 'summary' already exists in node 'e3e938'. Skipping!
Property 'summary' already exists in node '3d1839'. Skipping!
Property 'summary' already exists in node '90cd35'. Skipping!
Property 'summary' already exists in node 'f8ba29'. Skipping!
Property 'summary' already exists in node 'e34cac'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/10 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/61 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'f25110'. Skipping!
Property 'summary_embedding' already exists in node '73f140'. Skipping!
Property 'summary_embedding' already exists in node 'af322b'. Skipping!
Property 'summary_embedding' already exists in node '6e960b'. Skipping!
Property 'summary_embedding' already exists in node '3eb292'. Skipping!
Property 'summary_embedding' already exists in node '3e841f'. Skipping!
Property 'summary_embedding' already exists in node '3d1839'. Skipping!
Property 'summary_embedding' already exists in node '90cd35'. Skipping!
Property 'summary_embedding' already exists in node '39235c'. Skipping!
Property 'summary_embedding' already exists in node 'a8afb0'. Skipping!
Property 'summary_embedding' already exists in node 'e34cac'. Skipping!
Property 'summary_embedding' already exists in node 'e3e938'. Skipping!
Property 'summary_embedding' already exists in node '98534d'. Skipping!
Property 'summary_embedding' already exists in node '5c9f93'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

✅ Generated 12 synthetic QA pairs

Dataset columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']
Dataset shape: (12, 4)

Sample questions from the dataset:
1. What does 34 CFR 668.3(b) pertain to in the context of regulatory citations?
2. What does the term Subscription-Based Program refer to in educational regulations?
3. How do school districts influence the scheduling of clinical experiences in academic programs?


## Step 3.5: Create LangSmith Dataset (Optional)

In [8]:
# Create LangSmith dataset for advanced evaluation (if LangSmith is available)
try:
    from langsmith import Client
    
    if os.environ.get("LANGCHAIN_API_KEY"):
        print("Creating LangSmith dataset...")
        
        client = Client()
        dataset_name = f"Retriever-Evaluation-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        
        # Create dataset
        langsmith_dataset = client.create_dataset(
            dataset_name=dataset_name,
            description="Synthetic data for retriever evaluation using RAGAS"
        )
        
        # Add examples to LangSmith dataset
        df = golden_dataset.to_pandas()
        
        # Find the correct column names
        if 'question' in df.columns:
            question_col = 'question'
        elif 'user_input' in df.columns:
            question_col = 'user_input'
        else:
            question_col = df.columns[0]
            
        if 'answer' in df.columns:
            answer_col = 'answer'
        elif 'reference' in df.columns:
            answer_col = 'reference'
        else:
            answer_col = df.columns[1] if len(df.columns) > 1 else question_col
        
        # Add examples
        for idx, row in df.iterrows():
            client.create_example(
                inputs={
                    "question": row[question_col]
                },
                outputs={
                    "answer": row[answer_col] if answer_col != question_col else "Generated answer"
                },
                metadata={
                    "source": "ragas_synthetic",
                    "retriever_evaluation": True
                },
                dataset_id=langsmith_dataset.id
            )
        
        print(f"✅ Created LangSmith dataset: {dataset_name}")
        print(f"📊 Added {len(df)} examples to dataset")
        
        # Store for later use
        LANGSMITH_DATASET_NAME = dataset_name
        USE_LANGSMITH = True
    else:
        print("⚠️  LangSmith API key not found, skipping dataset creation")
        USE_LANGSMITH = False
        LANGSMITH_DATASET_NAME = None
        
except ImportError:
    print("⚠️  LangSmith not available, install with: pip install langsmith")
    USE_LANGSMITH = False
    LANGSMITH_DATASET_NAME = None
except Exception as e:
    print(f"⚠️  LangSmith setup failed: {e}")
    USE_LANGSMITH = False
    LANGSMITH_DATASET_NAME = None

Creating LangSmith dataset...
✅ Created LangSmith dataset: Retriever-Evaluation-20250729_111816
📊 Added 12 examples to dataset


## Step 4: Set Up Retrievers

In [9]:
from langchain_community.vectorstores import Qdrant
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ParentDocumentRetriever, EnsembleRetriever
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Initialize models
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chat_model = ChatOpenAI(model="gpt-4o-mini")

print("Setting up retrievers...")

Setting up retrievers...


In [10]:
# 1. Naive Retriever
vectorstore = Qdrant.from_documents(
    all_docs,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)
naive_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
print("✅ 1. Naive retriever ready")

# 2. BM25 Retriever
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 5
print("✅ 2. BM25 retriever ready")

# 3. Multi-Query Retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
print("✅ 3. Multi-query retriever ready")

# 4. Parent Document Retriever
parent_docs = all_docs
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

# Create new QdrantClient and collection
client = QdrantClient(location=":memory:")
client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", 
    embedding=embeddings, 
    client=client
)

store = InMemoryStore()
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

parent_document_retriever.add_documents(parent_docs, ids=None)
print("✅ 4. Parent document retriever ready")

# 5. Contextual Compression Retriever
compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
print("✅ 5. Contextual compression retriever ready")

# 6. Ensemble Retriever
retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)
print("✅ 6. Ensemble retriever ready")

print("\n✅ All retrievers initialized successfully!")

✅ 1. Naive retriever ready
✅ 2. BM25 retriever ready
✅ 3. Multi-query retriever ready
✅ 4. Parent document retriever ready
✅ 5. Contextual compression retriever ready
✅ 6. Ensemble retriever ready

✅ All retrievers initialized successfully!


## Step 5: Evaluation Function

In [11]:
def evaluate_retriever_simple(retriever, retriever_name, questions):
    """
    Simple evaluation function that measures retrieval performance
    """
    print(f"\nEvaluating {retriever_name}...")
    
    start_time = time.time()
    total_docs_retrieved = 0
    successful_retrievals = 0
    
    for i, question in enumerate(questions):
        try:
            # Retrieve documents
            docs = retriever.get_relevant_documents(question)
            total_docs_retrieved += len(docs)
            successful_retrievals += 1
            
        except Exception as e:
            print(f"  Error on question {i+1}: {e}")
    
    end_time = time.time()
    
    # Calculate metrics
    avg_docs_per_query = total_docs_retrieved / len(questions) if questions else 0
    success_rate = successful_retrievals / len(questions) if questions else 0
    latency = end_time - start_time
    
    results = {
        'retriever_name': retriever_name,
        'success_rate': success_rate,
        'avg_docs_per_query': avg_docs_per_query,
        'total_latency': latency,
        'avg_latency_per_query': latency / len(questions) if questions else 0
    }
    
    print(f"  ✅ Success rate: {success_rate:.2%}")
    print(f"  ✅ Avg docs per query: {avg_docs_per_query:.1f}")
    print(f"  ✅ Latency: {latency:.2f}s")
    
    return results

def estimate_cost(retriever_name, num_queries):
    """Estimate API costs per retriever type"""
    cost_per_query = {
        'Naive': 0.002,  # OpenAI embedding calls
        'BM25': 0.0,     # No API calls
        'Multi-Query': 0.008,  # Multiple LLM calls + embeddings
        'Parent Document': 0.003,  # Embeddings + some overhead
        'Contextual Compression': 0.015,  # Cohere rerank + embeddings
        'Ensemble': 0.020,  # All of the above combined
    }
    return cost_per_query.get(retriever_name.split()[0], 0.005) * num_queries

print("✅ Evaluation functions ready")

✅ Evaluation functions ready


## Step 6: Run Evaluations

In [12]:
# Extract questions from RAGAS dataset
df = golden_dataset.to_pandas()

# Find the correct question column
if 'question' in df.columns:
    question_col = 'question'
elif 'user_input' in df.columns:
    question_col = 'user_input'
elif len(df.columns) > 0:
    question_col = df.columns[0]  # Use first column as fallback
    print(f"Using column '{question_col}' as questions")
else:
    raise ValueError("No suitable question column found in RAGAS dataset!")

questions = df[question_col].tolist()

print(f"Running evaluation on {len(questions)} questions...")
print("="*60)

# Define retrievers to evaluate
retrievers_to_test = [
    (naive_retriever, "Naive"),
    (bm25_retriever, "BM25"),
    (multi_query_retriever, "Multi-Query"),
    (parent_document_retriever, "Parent Document"),
    (compression_retriever, "Contextual Compression"),
    (ensemble_retriever, "Ensemble")
]

# Run evaluations
results = []
for retriever, name in retrievers_to_test:
    result = evaluate_retriever_simple(retriever, name, questions)
    result['estimated_cost'] = estimate_cost(name, len(questions))
    results.append(result)

print("\n✅ All evaluations completed!")

Running evaluation on 12 questions...

Evaluating Naive...


  docs = retriever.get_relevant_documents(question)


  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 5.0
  ✅ Latency: 5.74s

Evaluating BM25...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 5.0
  ✅ Latency: 0.08s

Evaluating Multi-Query...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 6.4
  ✅ Latency: 39.04s

Evaluating Parent Document...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 2.9
  ✅ Latency: 4.32s

Evaluating Contextual Compression...
  Error on question 11: status_code: 429, body: data=None id='32c77a2a-a91b-41e4-8683-6eb06984fa32' message="You are using a Trial key, which is limited to 10 API calls / minute. You can continue to use the Trial key for free or upgrade to a Production key with higher rate limits at 'https://dashboard.cohere.com/api-keys'. Contact us on 'https://discord.gg/XW44jPfYJu' or email us at support@cohere.com with any questions"
  Error on question 12: status_code: 429, body: data=None id='073344e7-ba46-46b4-bc5d-8ffb43574509' message="You are using a Trial key, which is limited to 10 API cal

## Step 7: Analyze Results

In [13]:
# Create results dataframe
results_df = pd.DataFrame(results)

# Filter successful retrievers
successful_results = results_df[results_df['success_rate'] > 0.8].copy()

if len(successful_results) == 0:
    print("⚠️  No retrievers achieved >80% success rate. Showing all results:")
    successful_results = results_df.copy()

# Display main metrics
display_cols = ['retriever_name', 'success_rate', 'avg_docs_per_query', 
               'total_latency', 'estimated_cost']
print("\n📈 Performance Summary:")
print(successful_results[display_cols].round(4).to_string(index=False))

# Find best performers
fastest = successful_results.loc[successful_results['total_latency'].idxmin()]
cheapest = successful_results.loc[successful_results['estimated_cost'].idxmin()]
most_docs = successful_results.loc[successful_results['avg_docs_per_query'].idxmax()]

# Calculate combined score (simple weighted average)
successful_results = successful_results.copy()
successful_results['combined_score'] = (
    0.4 * successful_results['success_rate'] + 
    0.3 * (1 / (successful_results['total_latency'] + 1)) + 
    0.3 * (1 / (successful_results['estimated_cost'] + 0.001))
)

best_overall = successful_results.loc[successful_results['combined_score'].idxmax()]

print("\n🏆 WINNERS:")
print(f"⚡ Fastest: {fastest['retriever_name']} ({fastest['total_latency']:.2f}s)")
print(f"💰 Cheapest: {cheapest['retriever_name']} (${cheapest['estimated_cost']:.4f})")
print(f"📚 Most Comprehensive: {most_docs['retriever_name']} ({most_docs['avg_docs_per_query']:.1f} docs/query)")
print(f"🎖️  Best Overall: {best_overall['retriever_name']} ({best_overall['combined_score']:.3f})")


📈 Performance Summary:
        retriever_name  success_rate  avg_docs_per_query  total_latency  estimated_cost
                 Naive        1.0000              5.0000         5.7410           0.024
                  BM25        1.0000              5.0000         0.0795           0.000
           Multi-Query        1.0000              6.4167        39.0379           0.096
       Parent Document        1.0000              2.9167         4.3158           0.060
Contextual Compression        0.8333              2.5000         7.0007           0.060

🏆 WINNERS:
⚡ Fastest: BM25 (0.08s)
💰 Cheapest: BM25 ($0.0000)
📚 Most Comprehensive: Multi-Query (6.4 docs/query)
🎖️  Best Overall: BM25 (300.678)


## Step 8: Final Analysis and Recommendations

In [14]:
if len(successful_results) > 0:
    print("\n💡 RECOMMENDATIONS BY USE CASE:")
    print(f"\n1. ⚡ For Speed: {fastest['retriever_name']}")
    print(f"   - Fastest response time: {fastest['total_latency']:.2f}s")
    print(f"   - Good for: Real-time applications, high-throughput systems")
    
    print(f"\n2. 💰 For Cost Efficiency: {cheapest['retriever_name']}")
    print(f"   - Lowest cost: ${cheapest['estimated_cost']:.4f}")
    print(f"   - Good for: Budget-conscious deployments, high-volume usage")
    
    print(f"\n3. 📚 For Comprehensive Results: {most_docs['retriever_name']}")
    print(f"   - Most documents per query: {most_docs['avg_docs_per_query']:.1f}")
    print(f"   - Good for: Research applications, thorough analysis")
    
    print(f"\n4. ⚖️  For Balanced Performance: {best_overall['retriever_name']}")
    print(f"   - Best combined score: {best_overall['combined_score']:.3f}")
    print(f"   - Good for: General-purpose applications, balanced requirements")
    
    print("\n🔍 KEY INSIGHTS:")
    print("- RAGAS provides realistic test questions based on actual data")
    print("- BM25 is typically fastest and cheapest (no API calls)")
    print("- Embedding-based methods provide better semantic understanding")
    print("- Multi-query retrieval improves recall but increases cost")
    print("- Ensemble methods balance different strengths")
    print("- Compression/reranking improves quality but adds latency")
    print("- Parent-document retrievers provide more context per result")
    
    print("\n📈 EVALUATION METRICS:")
    print("- Success Rate: Percentage of queries processed successfully")
    print("- Docs Per Query: Average number of documents retrieved")
    print("- Latency: Time to retrieve and process documents")
    print("- Cost: Estimated API usage costs")
    
else:
    print("\n⚠️  All retrievers had issues. Check your setup and data.")

print(f"\n📊 EVALUATION COMPLETED: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


💡 RECOMMENDATIONS BY USE CASE:

1. ⚡ For Speed: BM25
   - Fastest response time: 0.08s
   - Good for: Real-time applications, high-throughput systems

2. 💰 For Cost Efficiency: BM25
   - Lowest cost: $0.0000
   - Good for: Budget-conscious deployments, high-volume usage

3. 📚 For Comprehensive Results: Multi-Query
   - Most documents per query: 6.4
   - Good for: Research applications, thorough analysis

4. ⚖️  For Balanced Performance: BM25
   - Best combined score: 300.678
   - Good for: General-purpose applications, balanced requirements

🔍 KEY INSIGHTS:
- RAGAS provides realistic test questions based on actual data
- BM25 is typically fastest and cheapest (no API calls)
- Embedding-based methods provide better semantic understanding
- Multi-query retrieval improves recall but increases cost
- Ensemble methods balance different strengths
- Compression/reranking improves quality but adds latency
- Parent-document retrievers provide more context per result

📈 EVALUATION METRICS:
- Su

## Step 9: LangSmith Advanced Evaluation for ALL Retrievers

In [15]:
if USE_LANGSMITH:
    print("\n🔬 Running LangSmith evaluation for ALL retrievers...")
    
    try:
        from langsmith.evaluation import LangChainStringEvaluator, evaluate
        from langchain.prompts import ChatPromptTemplate
        from langchain.schema import StrOutputParser
        from operator import itemgetter
        
        # Create RAG chain for evaluation
        RAG_PROMPT = """Given the provided context and question, answer the question based only on the context.
If you cannot answer based on the context, say "I don't know".

Context: {context}
Question: {question}"""
        
        rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
        eval_llm = ChatOpenAI(model="gpt-4o-mini")
        
        # QA evaluator (following example.ipynb pattern)
        qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})
        
        # Labeled helpfulness evaluator (following example.ipynb pattern)
        labeled_helpfulness_evaluator = LangChainStringEvaluator(
            "labeled_criteria",
            config={
                "criteria": {
                    "helpfulness": (
                        "Is this submission helpful to the user,"
                        " taking into account the correct reference answer?"
                    )
                },
                "llm": eval_llm
            },
            prepare_data=lambda run, example: {
                "prediction": run.outputs["output"],
                "reference": example.outputs["answer"],
                "input": example.inputs["question"],
            }
        )
        
        # Empathy evaluator (following example.ipynb pattern)
        empathy_evaluator = LangChainStringEvaluator(
            "criteria",
            config={
                "criteria": {
                    "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
                },
                "llm": eval_llm
            }
        )
        
        # Define all retrievers to evaluate
        all_retrievers_to_evaluate = [
            (naive_retriever, "Naive"),
            (bm25_retriever, "BM25"),
            (multi_query_retriever, "Multi-Query"),
            (parent_document_retriever, "Parent-Document"),
            (compression_retriever, "Contextual-Compression"),
            (ensemble_retriever, "Ensemble")
        ]
        
        print(f"📊 Evaluating {len(all_retrievers_to_evaluate)} retrievers with LangSmith...")
        print("🔍 Evaluators: QA Accuracy, Helpfulness, Empathy")
        
        # Evaluate each retriever
        for retriever, name in all_retrievers_to_evaluate:
            print(f"\n🔍 Evaluating {name} retriever...")
            
            try:
                # Create RAG chain for this retriever
                rag_chain = (
                    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
                    | rag_prompt | eval_llm | StrOutputParser()
                )
                
                # Run evaluation for this retriever
                experiment_results = evaluate(
                    rag_chain.invoke,
                    data=LANGSMITH_DATASET_NAME,
                    evaluators=[
                        qa_evaluator,
                        labeled_helpfulness_evaluator,
                        empathy_evaluator
                    ],
                    metadata={
                        "retriever_type": name, 
                        "evaluation_run": "all_retrievers",
                        "evaluators": "qa_helpfulness_empathy"
                    },
                    experiment_prefix=f"retriever_{name.lower().replace(' ', '_').replace('-', '_')}"
                )
                
                print(f"✅ {name} evaluation completed successfully")
                
                # Add rate limiting delay between retrievers
                time.sleep(3)  # 3 second delay between retrievers
                
            except Exception as e:
                print(f"❌ {name} evaluation failed: {e}")
                continue
        
        print("\n🎯 All retriever evaluations completed!")
        print("📊 Check LangSmith dashboard for detailed comparison results!")
        print("🔍 Each retriever has been evaluated for: QA Accuracy, Helpfulness, Empathy")
        
    except Exception as e:
        print(f"❌ LangSmith evaluation failed: {e}")
        
else:
    print("\n⚠️  Skipping LangSmith evaluation (not configured)")


🔬 Running LangSmith evaluation for ALL retrievers...
📊 Evaluating 6 retrievers with LangSmith...
🔍 Evaluators: QA Accuracy, Helpfulness, Empathy

🔍 Evaluating Naive retriever...
View the evaluation results for experiment: 'retriever_naive-633802b9' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/ac74f704-3930-48e5-9de2-f08c2498e0b8/compare?selectedSessions=5363d56e-8be9-473e-9708-4f2e17fb9571




0it [00:00, ?it/s]

✅ Naive evaluation completed successfully

🔍 Evaluating BM25 retriever...
View the evaluation results for experiment: 'retriever_bm25-6f2f351e' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/ac74f704-3930-48e5-9de2-f08c2498e0b8/compare?selectedSessions=f2f17d7b-c0f5-4ca4-9ad5-582101d06363




0it [00:00, ?it/s]

✅ BM25 evaluation completed successfully

🔍 Evaluating Multi-Query retriever...
View the evaluation results for experiment: 'retriever_multi_query-79e4f50a' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/ac74f704-3930-48e5-9de2-f08c2498e0b8/compare?selectedSessions=984e92df-fe41-4f92-9057-04ebb5f100c7




0it [00:00, ?it/s]

✅ Multi-Query evaluation completed successfully

🔍 Evaluating Parent-Document retriever...
View the evaluation results for experiment: 'retriever_parent_document-674f3ef1' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/ac74f704-3930-48e5-9de2-f08c2498e0b8/compare?selectedSessions=62840a2e-3d2c-4d9b-bbce-5deb1894ab1d




0it [00:00, ?it/s]

✅ Parent-Document evaluation completed successfully

🔍 Evaluating Contextual-Compression retriever...
View the evaluation results for experiment: 'retriever_contextual_compression-40e10208' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/ac74f704-3930-48e5-9de2-f08c2498e0b8/compare?selectedSessions=86ad2823-376a-4d39-95c0-6af895623b22




0it [00:00, ?it/s]

✅ Contextual-Compression evaluation completed successfully

🔍 Evaluating Ensemble retriever...
View the evaluation results for experiment: 'retriever_ensemble-1860fdea' at:
https://smith.langchain.com/o/a8b64252-5f0f-4f35-a048-c004586e098a/datasets/ac74f704-3930-48e5-9de2-f08c2498e0b8/compare?selectedSessions=99bb04b6-8c41-453a-b86e-df1376326cbc




0it [00:00, ?it/s]

✅ Ensemble evaluation completed successfully

🎯 All retriever evaluations completed!
📊 Check LangSmith dashboard for detailed comparison results!
🔍 Each retriever has been evaluated for: QA Accuracy, Helpfulness, Empathy


![LangSmith Screenshot](./img/screenshot.png)

Here is a concise explanation you can use in markdown to justify why the BM25 retriever is the best for handling student loan data and complaints, based on your evaluation results:

---

### Why BM25 is the Best Retriever for Student Loan Data and Complaints:

After evaluating multiple retrieval methods—including Naive, BM25, Multi-Query, Parent-Document, Contextual-Compression, and Ensemble retrievers—**BM25 clearly emerged as the top performer** for the following reasons:

* **Highest Correctness**:
  BM25 scored highest in correctness (9 correct answers out of all evaluated responses), indicating superior ability to retrieve relevant information from student loan data and complaints.

* **Strong Helpfulness**:
  BM25 provided highly helpful responses, achieving a strong balance between precision and recall, essential for handling detailed and specific user queries related to student loans.

* **Low Latency**:
  BM25 demonstrated very low latency (P50 at 3.52s, P99 at 9.74s), meaning it can quickly return relevant results, making it highly suitable for user-facing applications requiring responsiveness.

* **Simplicity and Robustness**:
  Unlike more complex methods (such as ensemble or contextual-compression), BM25 operates on simple keyword-matching principles, making it robust for structured, complaint-driven data that often contains repetitive phrasing or terminology.

**Conclusion**:
Given its highest correctness score, solid helpfulness, lowest latency, and robustness to structured textual data, BM25 stands out as the optimal choice for effectively retrieving and managing student loan complaints and related information.

---

