# 16. RAG Evaluation with RAGAS Framework

**Complexity:** ⭐⭐⭐⭐

## Overview

**RAGAS** (RAG Assessment) is a comprehensive framework for evaluating Retrieval-Augmented Generation systems. It provides metrics to measure both retrieval quality and generation quality.

### Why Evaluation Matters

Without evaluation, you can't:
- ❌ Compare different RAG architectures objectively
- ❌ Track improvements over time
- ❌ Identify weak points in your system
- ❌ Justify architectural choices
- ❌ Optimize for production

### RAGAS Metrics

This notebook describes six RAGAS metrics and computes the first four:

1. **Faithfulness** (0-1): Is the answer grounded in retrieved context?
   - Measures hallucination
   - Higher = less hallucination

2. **Answer Relevancy** (0-1): Is the answer relevant to the question?
   - Measures if answer addresses the query
   - Higher = more relevant

3. **Context Precision** (0-1): Are retrieved documents relevant?
   - Measures retrieval precision
   - Higher = less noise

4. **Context Recall** (0-1): Did retrieval find all relevant info?
   - Measures retrieval completeness
   - Requires ground truth

5. **Answer Semantic Similarity** (0-1): Similarity to reference answer
   - Measures semantic closeness to the reference, not factual accuracy
   - Requires ground truth answer

6. **Answer Correctness** (0-1): Factual accuracy vs ground truth
   - Combines a statement-level F1 score with semantic similarity (weighted)
   - Requires ground truth
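
The weighting behind Answer Correctness can be made concrete with a small sketch. It assumes the score blends a statement-level F1 (built from counts of true-positive, false-positive, and false-negative statements) with a semantic-similarity score. The function name and the 0.75/0.25 weights are illustrative assumptions, so check them against the RAGAS version you have installed.

```python
def answer_correctness_sketch(tp, fp, fn, similarity, weights=(0.75, 0.25)):
    """Blend a statement-level F1 score with a semantic-similarity score.

    tp/fp/fn count answer statements judged against the ground truth.
    The default weights are an assumption for illustration only.
    """
    denom = tp + 0.5 * (fp + fn)
    f1 = tp / denom if denom > 0 else 0.0
    w_f1, w_sim = weights
    # Normalise so the result stays in [0, 1] even if weights do not sum to 1
    return (w_f1 * f1 + w_sim * similarity) / (w_f1 + w_sim)


# 2 supported statements, 1 unsupported, 1 missing, similarity 0.9
score = answer_correctness_sketch(tp=2, fp=1, fn=1, similarity=0.9)
print(round(score, 3))  # 0.725
```

Here F1 is 2 / (2 + 0.5 × 2) ≈ 0.667, so the blended score is 0.75 × 0.667 + 0.25 × 0.9 = 0.725.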

### Evaluation Dataset

For each test case, you need:
```python
{
    "question": "What is LCEL?",
    "answer": "<generated answer>",
    "contexts": ["<retrieved doc 1>", "<retrieved doc 2>", ...],
    "ground_truth": "LCEL is LangChain Expression Language..."  # optional
}
```
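
Malformed records tend to surface only as confusing errors deep inside the evaluation run, so it can help to check each test case first. The following is a minimal sketch of such a check; `validate_record` and `REQUIRED_KEYS` are hypothetical names, and the rules simply mirror the record layout shown above.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("question", "answer", "contexts")  # ground_truth is optional


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one test case (empty list = valid)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
    contexts = record.get("contexts")
    if contexts is not None:
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append("contexts must be a list of strings")
        elif not contexts:
            problems.append("contexts is empty")
    for key in ("question", "answer", "ground_truth"):
        if key in record and not isinstance(record[key], str):
            problems.append(f"{key} must be a string")
    return problems


good = {"question": "What is LCEL?", "answer": "A way to compose chains.",
        "contexts": ["LCEL uses the pipe operator."]}
print(validate_record(good))                             # []
print(validate_record({"question": "q", "answer": "a"}))  # ['missing key: contexts']
```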

### Architecture Comparison

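The comparison framework is designed to cover all 12 RAG architectures from this series. This notebook runs RAGAS on Simple RAG and uses sample numbers to demonstrate the comparison: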
1. Simple RAG
2. RAG with Memory
3. Multi-Query RAG
4. HyDE
5. Adaptive RAG
6. Corrective RAG
7. Self-RAG
8. Agentic RAG
9. **Contextual RAG** [NEW]
10. **Fusion RAG** [NEW]
11. **SQL RAG** [NEW]
12. **GraphRAG** [NEW]

---

## Implementation

## 1. Setup and Imports

In [None]:
import sys
import time
from pathlib import Path
from typing import List, Dict, Any
import pandas as pd
import json

# Add parent directory to path
sys.path.append(str(Path("../..").resolve()))

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# RAGAS imports
try:
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )
    from datasets import Dataset
    RAGAS_AVAILABLE = True
except ImportError:
    print("⚠️  RAGAS not installed. Install with: pip install ragas datasets")
    RAGAS_AVAILABLE = False

from shared.config import (
    verify_api_key,
    DEFAULT_MODEL,
    DEFAULT_TEMPERATURE,
    OPENAI_EMBEDDING_MODEL,
    VECTOR_STORE_DIR,
)
from shared.loaders import load_and_split
from shared.prompts import RAG_PROMPT_TEMPLATE
from shared.utils import (
    format_docs,
    print_section_header,
    load_vector_store,
)

# Verify API key
verify_api_key()

print("✓ All imports successful")
print(f"✓ Using model: {DEFAULT_MODEL}")
print(f"✓ RAGAS available: {RAGAS_AVAILABLE}")

## 2. Create Evaluation Dataset

We'll create a test set with questions and ground truth answers.

In [None]:
print_section_header("Creating Evaluation Dataset")

# Test questions with ground truth answers
evaluation_dataset = [
    {
        "question": "What is LCEL in LangChain?",
        "ground_truth": "LCEL (LangChain Expression Language) is a declarative way to compose chains in LangChain. It uses the pipe operator (|) to chain components together and supports features like streaming, async execution, and parallel processing.",
        "category": "simple",
    },
    {
        "question": "How do I build a RAG application?",
        "ground_truth": "To build a RAG application: 1) Load and split documents, 2) Create embeddings and vector store, 3) Set up a retriever, 4) Create a prompt template, 5) Chain retriever with LLM using LCEL, 6) Invoke the chain with queries.",
        "category": "multi-step",
    },
    {
        "question": "What are the different types of memory in LangChain?",
        "ground_truth": "LangChain provides several memory types: ConversationBufferMemory (stores all messages), ConversationSummaryMemory (summarizes history), ConversationBufferWindowMemory (keeps last N messages), and ConversationKGMemory (knowledge graph-based).",
        "category": "multi-concept",
    },
    {
        "question": "How do retrievers work in RAG?",
        "ground_truth": "Retrievers fetch relevant documents from a vector store based on semantic similarity. They take a query, convert it to embeddings, search the vector store, and return top-k most similar documents. Common types include similarity search, MMR, and multi-query retrievers.",
        "category": "conceptual",
    },
    {
        "question": "What is the difference between chains and agents?",
        "ground_truth": "Chains follow a predetermined sequence of steps, while agents can dynamically decide which tools to use and in what order based on the task. Agents have reasoning capabilities and can adapt their behavior, whereas chains are static workflows.",
        "category": "comparison",
    },
]

print(f"\n✓ Created evaluation dataset with {len(evaluation_dataset)} questions")
print("\nCategories:")
for item in evaluation_dataset:
    print(f"  • [{item['category']}] {item['question'][:60]}...")

## 3. Setup RAG System
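
The retriever configured in this section with `k=4` returns the four chunks closest to the query. A toy version of that top-k lookup is sketched below, assuming cosine similarity over small hand-written vectors. The real vector store may use a different distance and an optimized index, and `cosine` and `top_k` are illustrative names rather than library functions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k best matches."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-D "embeddings" purely for illustration
toy_docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], toy_docs, k=2)
print([index for index, _ in hits])  # [0, 1]
```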

In [None]:
print_section_header("Setting Up RAG System")

# Load documents (returns tuple: original_docs, chunks)
_, docs = load_and_split(chunk_size=1000, chunk_overlap=200)
print(f"\n✓ Loaded {len(docs)} chunks")

# Create embeddings and vector store
embeddings = OpenAIEmbeddings(model=OPENAI_EMBEDDING_MODEL)
store_path = VECTOR_STORE_DIR / "ragas_evaluation"

vectorstore = load_vector_store(store_path, embeddings)
if vectorstore is None:
    print("\nCreating vector store...")
    vectorstore = FAISS.from_documents(docs, embeddings)
    from shared.utils import save_vector_store
    save_vector_store(vectorstore, store_path)
    print("✓ Vector store created")
else:
    print("✓ Loaded existing vector store")

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Initialize LLM
llm = ChatOpenAI(model=DEFAULT_MODEL, temperature=DEFAULT_TEMPERATURE)

# Create RAG chain
rag_chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | RAG_PROMPT_TEMPLATE
    | llm
    | StrOutputParser()
)

print("✓ RAG system ready")

## 4. Generate Answers and Collect Data
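
The architecture comparison later in this notebook relies on an `avg_latency` figure, which has to be measured while answers are generated. A minimal sketch of timing each call follows. `timed_call` is a hypothetical helper, and `fake_chain` is a stand-in for `rag_chain.invoke` so the example runs without an API key.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_chain(question: str) -> str:
    """Hypothetical stand-in for rag_chain.invoke."""
    return f"answer to: {question}"


latencies = []
for q in ["What is LCEL?", "How do retrievers work?"]:
    answer, seconds = timed_call(fake_chain, q)
    latencies.append(seconds)

avg_latency = sum(latencies) / len(latencies)
print(f"avg latency: {avg_latency:.6f}s over {len(latencies)} calls")
```

In the real loop, wrapping `rag_chain.invoke(question)` this way would give one latency sample per question alongside the answer.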

In [None]:
print_section_header("Generating Answers for Evaluation")

# Collect data for RAGAS evaluation
ragas_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": [],
}

print("\nGenerating answers...\n")

for i, item in enumerate(evaluation_dataset, 1):
    question = item["question"]
    ground_truth = item["ground_truth"]
    
    print(f"[{i}/{len(evaluation_dataset)}] {question[:60]}...")
    
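    # Retrieve contexts separately so they can be passed to RAGAS
    # (the chain below runs the same retrieval again internally)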
    retrieved_docs = retriever.invoke(question)
    contexts = [doc.page_content for doc in retrieved_docs]
    
    # Generate answer
    answer = rag_chain.invoke(question)
    
    # Store
    ragas_data["question"].append(question)
    ragas_data["answer"].append(answer)
    ragas_data["contexts"].append(contexts)
    ragas_data["ground_truth"].append(ground_truth)
    
    print(f"  ✓ Answer: {answer[:80]}...")
    print(f"  ✓ Retrieved {len(contexts)} contexts\n")

print("✓ All answers generated")

## 5. Run RAGAS Evaluation

In [None]:
if not RAGAS_AVAILABLE:
    print("⚠️  RAGAS not available. Skipping evaluation.")
    print("   Install with: pip install ragas datasets")
else:
    print_section_header("Running RAGAS Evaluation")
    
    # Create dataset
    dataset = Dataset.from_dict(ragas_data)
    
    print("\nEvaluating with RAGAS metrics...")
    print("(This may take a few minutes)\n")
    
    # Run evaluation
    result = evaluate(
        dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )
    
    print("\n" + "=" * 80)
    print("RAGAS EVALUATION RESULTS:")
    print("=" * 80)
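    print("\nOverall Scores:")
    # Average the per-question scores from the DataFrame rather than iterating
    # the result object directly, whose interface differs between RAGAS versions
    for metric, score in result.to_pandas().mean(numeric_only=True).items():
        print(f"  • {metric}: {score:.4f}")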
    
    # Convert to DataFrame for detailed view
    df = result.to_pandas()
    print("\n" + "=" * 80)
    print("Per-Question Scores:")
    print("=" * 80)
    print(df[['question', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']].to_string(index=False))

## 6. Analyze Results by Category

In [None]:
if RAGAS_AVAILABLE:
    print_section_header("Analysis by Question Category")
    
    # Add categories to results
    df['category'] = [item['category'] for item in evaluation_dataset]
    
    # Group by category
    metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
    
    print("\nAverage scores by category:\n")
    for category in df['category'].unique():
        cat_df = df[df['category'] == category]
        print(f"\n{category.upper()}:")
        print("-" * 60)
        for metric in metrics:
            if metric in cat_df.columns:
                avg_score = cat_df[metric].mean()
                print(f"  • {metric}: {avg_score:.4f}")

## 7. Identify Weak Points
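
Besides the single lowest-scoring question per metric, which the cell in this section reports, it can help to list every question that falls below a target. The sketch below assumes plain dictionaries instead of the RAGAS DataFrame and takes its thresholds from the "Good" values listed under Key Takeaways. `find_weak_points` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import math
from typing import Any, Dict, List, Tuple


def find_weak_points(rows: List[Dict[str, Any]],
                     thresholds: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Return (question, metric, score) for every score below its threshold."""
    weak = []
    for row in rows:
        for metric, minimum in thresholds.items():
            score = row.get(metric)
            # Skip metrics that are missing or could not be computed (NaN)
            if score is None or math.isnan(score):
                continue
            if score < minimum:
                weak.append((row["question"], metric, score))
    return sorted(weak, key=lambda item: item[2])  # worst first


thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80,
              "context_precision": 0.75, "context_recall": 0.75}
sample_rows = [  # invented scores, not real results
    {"question": "Q1", "faithfulness": 0.90, "answer_relevancy": 0.70,
     "context_precision": 0.80, "context_recall": 0.60},
    {"question": "Q2", "faithfulness": 0.80, "answer_relevancy": 0.85,
     "context_precision": float("nan"), "context_recall": 0.90},
]
for question, metric, score in find_weak_points(sample_rows, thresholds):
    print(f"{question}: {metric} = {score:.2f}")
```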

In [None]:
if RAGAS_AVAILABLE:
    print_section_header("Identifying Weak Points")
    
    # Find questions with lowest scores
    print("\nQuestions needing improvement:\n")
    
    for metric in metrics:
        if metric in df.columns:
            print(f"\n{metric.upper()}:")
            print("-" * 80)
            
            # Get lowest scoring question
            min_idx = df[metric].idxmin()
            min_row = df.loc[min_idx]
            
            print(f"Lowest score: {min_row[metric]:.4f}")
            print(f"Question: {min_row['question']}")
            print(f"Category: {min_row['category']}")
            print(f"Answer preview: {min_row['answer'][:150]}...")

## 8. Visualization

In [None]:
if RAGAS_AVAILABLE:
    print_section_header("Visualizing Results")
    
    import matplotlib.pyplot as plt
    import numpy as np
    
    # Bar chart of metrics
    fig, ax = plt.subplots(figsize=(10, 6))
    
    metric_names = []
    metric_scores = []
    
    for metric in metrics:
        if metric in df.columns:
            metric_names.append(metric.replace('_', ' ').title())
            metric_scores.append(df[metric].mean())
    
    bars = ax.bar(metric_names, metric_scores, color=['#3498db', '#2ecc71', '#f39c12', '#e74c3c'])
    ax.set_ylim(0, 1)
    ax.set_ylabel('Score', fontsize=12, fontweight='bold')
    ax.set_title('RAGAS Evaluation Metrics - Simple RAG', fontsize=14, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}',
                ha='center', va='bottom', fontweight='bold')
    
    plt.xticks(rotation=15, ha='right')
    plt.tight_layout()
    plt.show()
    
    print("\n✓ Visualization complete")

## 9. Architecture Comparison Framework

A framework for comparing multiple RAG architectures side by side.

In [None]:
print_section_header("Architecture Comparison Framework")

# Illustrative placeholder scores, not measured results.
# Replace them with your own RAGAS output for each of the 12 architectures.
architecture_results = {
    "Simple RAG": {
        "faithfulness": 0.85,
        "answer_relevancy": 0.82,
        "context_precision": 0.78,
        "context_recall": 0.75,
        "avg_latency": 1.2,
        "cost_per_query": 0.002,
    },
    "Contextual RAG": {
        "faithfulness": 0.88,
        "answer_relevancy": 0.86,
        "context_precision": 0.83,
        "context_recall": 0.79,
        "avg_latency": 1.3,
        "cost_per_query": 0.002,
    },
    "Fusion RAG": {
        "faithfulness": 0.87,
        "answer_relevancy": 0.85,
        "context_precision": 0.84,
        "context_recall": 0.82,
        "avg_latency": 3.5,
        "cost_per_query": 0.006,
    },
    # ... other architectures
}

# Create comparison DataFrame
comparison_df = pd.DataFrame(architecture_results).T

print("\nArchitecture Comparison (Sample):")
print("=" * 80)
print(comparison_df.to_string())

print("\n" + "=" * 80)
print("RANKING BY METRIC:")
print("=" * 80)

for metric in ['faithfulness', 'answer_relevancy', 'context_precision']:
    print(f"\n{metric.upper()}:")
    ranked = comparison_df.sort_values(metric, ascending=False)
    for i, (arch, row) in enumerate(ranked.iterrows(), 1):
        print(f"  {i}. {arch}: {row[metric]:.3f}")

## 10. Cost-Quality Trade-off Analysis
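
A single efficiency ratio hides which dimension an architecture wins on. As a complementary view, the sketch below keeps every architecture that is not beaten on quality, latency, and cost at the same time (a Pareto front). The `pareto_front` helper and the "Slow Baseline" entry are hypothetical, and the quality values are the averages of the sample scores from the previous section, not measured results.

```python
from typing import Dict, List

Metrics = Dict[str, float]


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = (a["quality"] >= b["quality"]
                and a["avg_latency"] <= b["avg_latency"]
                and a["cost_per_query"] <= b["cost_per_query"])
    strictly_better = (a["quality"] > b["quality"]
                       or a["avg_latency"] < b["avg_latency"]
                       or a["cost_per_query"] < b["cost_per_query"])
    return no_worse and strictly_better


def pareto_front(archs: Dict[str, Metrics]) -> List[str]:
    """Names of architectures that no other architecture dominates."""
    return [
        name for name, m in archs.items()
        if not any(dominates(other, m)
                   for other_name, other in archs.items() if other_name != name)
    ]


archs = {
    "Simple RAG": {"quality": 0.80, "avg_latency": 1.2, "cost_per_query": 0.002},
    "Contextual RAG": {"quality": 0.84, "avg_latency": 1.3, "cost_per_query": 0.002},
    "Fusion RAG": {"quality": 0.845, "avg_latency": 3.5, "cost_per_query": 0.006},
    # Hypothetical entry added to show a dominated architecture
    "Slow Baseline": {"quality": 0.78, "avg_latency": 2.0, "cost_per_query": 0.003},
}
print(pareto_front(archs))  # ['Simple RAG', 'Contextual RAG', 'Fusion RAG']
```

Each of the three sample architectures survives because it is best on at least one axis, while the hypothetical baseline is worse than Simple RAG on all three.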

In [None]:
print_section_header("Cost-Quality Trade-off Analysis")

print("\nQuality Score = Average of the four RAGAS metrics")
print("Efficiency = Quality / (Latency × Cost × 1000)\n")

# Calculate quality score
comparison_df['quality'] = comparison_df[[
    'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'
]].mean(axis=1)

# Calculate efficiency; cost is scaled by 1000 (cost per 1,000 queries)
# so the ratio stays in a readable range
comparison_df['efficiency'] = (
    comparison_df['quality'] /
    (comparison_df['avg_latency'] * comparison_df['cost_per_query'] * 1000)
)

print("=" * 80)
print("RANKINGS:")
print("=" * 80)

print("\n1. Best Quality:")
ranked_quality = comparison_df.sort_values('quality', ascending=False)
for i, (arch, row) in enumerate(ranked_quality.iterrows(), 1):
    print(f"  {i}. {arch}: {row['quality']:.3f}")

print("\n2. Fastest:")
ranked_speed = comparison_df.sort_values('avg_latency')
for i, (arch, row) in enumerate(ranked_speed.iterrows(), 1):
    print(f"  {i}. {arch}: {row['avg_latency']:.2f}s")

print("\n3. Most Cost-Effective:")
ranked_cost = comparison_df.sort_values('cost_per_query')
for i, (arch, row) in enumerate(ranked_cost.iterrows(), 1):
    print(f"  {i}. {arch}: ${row['cost_per_query']:.4f}")

print("\n4. Best Efficiency (Quality/Cost/Speed):")
ranked_efficiency = comparison_df.sort_values('efficiency', ascending=False)
for i, (arch, row) in enumerate(ranked_efficiency.iterrows(), 1):
    print(f"  {i}. {arch}: {row['efficiency']:.2f}")

## 11. Recommendations

In [None]:
print_section_header("Architecture Recommendations")

def get_recommendation(use_case: str) -> Dict[str, Any]:
    """
    Get architecture recommendation based on use case.
    """
    recommendations = {
        "production_quality": {
            "architecture": "Fusion RAG or Contextual RAG",
            "reason": "Best quality metrics with reasonable cost",
            "metrics": "High faithfulness, precision, recall",
        },
        "cost_sensitive": {
            "architecture": "Simple RAG or Adaptive RAG",
            "reason": "Low cost per query, good baseline quality",
            "metrics": "Balanced cost/quality ratio",
        },
        "low_latency": {
            "architecture": "Simple RAG or Contextual RAG",
            "reason": "Fastest response times",
            "metrics": "Sub-2s latency",
        },
        "complex_queries": {
            "architecture": "Fusion RAG or Agentic RAG",
            "reason": "Handle multi-faceted questions",
            "metrics": "High context recall, multi-hop reasoning",
        },
        "structured_data": {
            "architecture": "SQL RAG",
            "reason": "Direct database queries, precise results",
            "metrics": "Exact results when the generated SQL is correct",
        },
        "relationship_queries": {
            "architecture": "GraphRAG",
            "reason": "Entity relationships, multi-hop",
            "metrics": "Best for 'how are X and Y related'",
        },
    }
    
    return recommendations.get(use_case, {})


print("\nRECOMMENDATIONS BY USE CASE:")
print("=" * 80)

use_cases = [
    "production_quality",
    "cost_sensitive",
    "low_latency",
    "complex_queries",
    "structured_data",
    "relationship_queries",
]

for use_case in use_cases:
    rec = get_recommendation(use_case)
    print(f"\n{use_case.replace('_', ' ').title()}:")
    print("-" * 80)
    print(f"  Recommended: {rec.get('architecture', 'N/A')}")
    print(f"  Reason: {rec.get('reason', 'N/A')}")
    print(f"  Key metrics: {rec.get('metrics', 'N/A')}")

## 12. Key Takeaways

### RAGAS Metrics Explained

1. **Faithfulness** (0-1)
   - Measures: Hallucination / groundedness
   - How: Checks if answer is supported by context
   - Good: > 0.85
   - Improve: Better prompts, more context, fact-checking

2. **Answer Relevancy** (0-1)
   - Measures: Answer addresses question
   - How: Similarity between the original question and questions regenerated from the answer
   - Good: > 0.80
   - Improve: Better prompts, query understanding

3. **Context Precision** (0-1)
   - Measures: Retrieved docs are relevant
   - How: Checks whether the chunks relevant to the ground truth are ranked near the top
   - Good: > 0.75
   - Improve: Better retrieval, reranking

4. **Context Recall** (0-1)
   - Measures: Retrieved all relevant info
   - How: Coverage of ground truth
   - Good: > 0.75
   - Improve: More retrievals, better chunking
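
The two retrieval metrics can be illustrated from simple relevance flags. This is a sketch under assumptions: in RAGAS the flags come from LLM judgments, whereas here they are written by hand, and the function names are illustrative. Context precision is modelled as the mean of precision@k over the ranks that hold a relevant chunk, and context recall as the fraction of ground-truth statements that the retrieved context supports.

```python
from typing import List


def context_precision_sketch(relevant_flags: List[bool]) -> float:
    """Rank-aware precision: rewards relevant chunks that appear early."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at each relevant hit
    return precision_sum / hits if hits else 0.0


def context_recall_sketch(attributed_flags: List[bool]) -> float:
    """Share of ground-truth statements that the context supports."""
    if not attributed_flags:
        return 0.0
    return sum(attributed_flags) / len(attributed_flags)


# Same two relevant chunks, ranked early versus late
early = context_precision_sketch([True, False, True, False])
late = context_precision_sketch([False, False, True, True])
print(round(early, 3), round(late, 3))  # 0.833 0.417

# 2 of 3 ground-truth statements are covered by the retrieved context
print(round(context_recall_sketch([True, True, False]), 3))  # 0.667
```

The early/late comparison shows why reranking raises context precision even when the set of retrieved chunks stays the same.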

### Best Practices

**Dataset Creation:**
- ✅ Diverse question types
- ✅ Cover edge cases
- ✅ Include failure modes
- ✅ Real user queries
- ✅ 20-50 test cases minimum

**Ground Truth:**
- ✅ Expert-written answers
- ✅ Factually correct
- ✅ Concise and clear
- ✅ Covers key points

**Continuous Evaluation:**
- ✅ Run on every major change
- ✅ Track metrics over time
- ✅ A/B test architectures
- ✅ Monitor production queries

### Interpreting Results

**High Faithfulness, Low Recall:**
- Answer is grounded but incomplete
- → Increase k (retrieve more docs)

**Low Precision, High Recall:**
- Retrieved too much noise
- → Add reranking, better chunking

**Low Relevancy:**
- Answer doesn't address question
- → Improve prompt, add examples

**All Low Scores:**
- Fundamental issues
- → Check data quality, embeddings, prompts
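
The patterns above can be encoded as a small rule table so that each evaluation run prints its own suggestions. This is a sketch only: `diagnose` is a hypothetical helper, and the single 0.75 cut-off for "low" is an assumption you would tune per metric.

```python
from typing import Dict, List


def diagnose(scores: Dict[str, float], low: float = 0.75) -> List[str]:
    """Map a set of metric scores to the suggestions listed above."""
    f = scores["faithfulness"]
    r = scores["answer_relevancy"]
    p = scores["context_precision"]
    c = scores["context_recall"]

    # Everything low points at a fundamental problem, so report only that
    if all(value < low for value in (f, r, p, c)):
        return ["Check data quality, embeddings, and prompts"]

    hints = []
    if f >= low and c < low:
        hints.append("Increase k (retrieve more documents)")
    if p < low and c >= low:
        hints.append("Add reranking or improve chunking")
    if r < low:
        hints.append("Improve the prompt or add examples")
    return hints


print(diagnose({"faithfulness": 0.90, "answer_relevancy": 0.85,
                "context_precision": 0.80, "context_recall": 0.60}))
# ['Increase k (retrieve more documents)']
```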

### Production Deployment

**Pre-deployment:**
1. RAGAS scores > 0.75 on all metrics (a starting threshold; tune it for your domain)
2. Latency < 3s for 95th percentile
3. Cost per query acceptable
4. Manual review of edge cases
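
The first two pre-deployment checks are easy to automate. The sketch below assumes a nearest-rank definition of the 95th percentile and uses the thresholds from the list above as defaults. `p95` and `ready_for_deploy` are hypothetical helpers, the scores are the sample values from the comparison section, and the latencies are made up.

```python
from typing import Dict, List


def p95(latencies: List[float]) -> float:
    """Nearest-rank 95th percentile of a list of latencies in seconds."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def ready_for_deploy(scores: Dict[str, float], latencies: List[float],
                     min_score: float = 0.75, max_p95: float = 3.0) -> bool:
    """True when every metric exceeds min_score and p95 latency is under max_p95."""
    return all(s > min_score for s in scores.values()) and p95(latencies) < max_p95


contextual = {"faithfulness": 0.88, "answer_relevancy": 0.86,
              "context_precision": 0.83, "context_recall": 0.79}
simple = {"faithfulness": 0.85, "answer_relevancy": 0.82,
          "context_precision": 0.78, "context_recall": 0.75}
latencies = [1.1, 1.3, 1.2, 2.4, 1.0]  # invented samples

print(ready_for_deploy(contextual, latencies))  # True
print(ready_for_deploy(simple, latencies))      # False: context_recall is not above 0.75
```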

**Post-deployment:**
1. Log all queries and responses
2. Sample for human evaluation
3. Track user feedback
4. Monitor RAGAS on production data
5. A/B test improvements

### Next Steps

1. **Expand dataset**: Add more test cases
2. **Evaluate all architectures**: Run comparison
3. **Optimize weak points**: Focus on low-scoring areas
4. **Set up CI/CD**: Automate evaluation
5. **Production monitoring**: Track real-world performance
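
For the CI/CD step, one simple approach is to compare each run against stored baseline scores and fail the build when a metric drops by more than a tolerance. This is a sketch under assumptions: `find_regressions` is a hypothetical helper, the 0.02 tolerance is arbitrary, and the numbers are illustrative.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Describe every metric that dropped by more than tolerance."""
    regressions = []
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None:
            regressions.append(f"{metric}: missing from current run")
        elif old - new > tolerance:
            regressions.append(f"{metric}: {old:.3f} -> {new:.3f}")
    return regressions


baseline = {"faithfulness": 0.85, "context_recall": 0.75}
current = {"faithfulness": 0.84, "context_recall": 0.70}
print(find_regressions(baseline, current))
# ['context_recall: 0.750 -> 0.700']
```

In a pipeline, a non-empty result would typically translate into a non-zero exit code so the change is blocked until the drop is explained.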

---

**Importance:** ⭐⭐⭐⭐⭐ (Critical - evaluation is essential for production)

**Recommendation:** Run RAGAS evaluation on every architecture before production deployment!

Project complete! You now have 12 RAG architectures plus a comprehensive evaluation framework.