# LLM-as-a-Judge Evaluation with OpenRouter

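This notebook evaluates several LLMs (via OpenRouter) on a QA dataset, using a separate LLM as the judge. `LLMJudge` defaults to GPT-4o; the run recorded below uses Mistral Large (`mistralai/mistral-large`) so that the judge comes from a different family than the models under test.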

## How it works:
1. Load a QA dataset (questions generated by GPT-4o)
2. Test 3 different models via OpenRouter on the same questions:
   - Claude 3.5 Sonnet (Anthropic)
   - GPT-3.5-turbo (OpenAI)
   - Gemini 2.5 Flash (Google)
3. Use the judge model (GPT-4o by default, Mistral Large in the recorded run) to evaluate answer quality
4. Get scores on 3 dimensions:
   - **Retrieval Relevance**: How relevant are the retrieved documents to the query?
   - **Faithfulness (Groundedness)**: Is the answer faithful to the retrieved context?
   - **Answer Quality**: Overall quality of the answer

## Setup

### 1. Environment Variables
In your `.env` file, you need:
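- `OPENAI_API_KEY`: Required when the judge model is called through the OpenAI API directly (for example, the default `gpt-4o`)
- `OPENROUTER_API_KEY`: Required for testing different models via OpenRouter, and for any judge model with a provider prefix such as `mistralai/mistral-large`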
- `FORCE_OPENROUTER=true`: Add this to force the system to use OpenRouter for all model calls

Example `.env`:
```
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-v1-...
FORCE_OPENROUTER=true
```
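
Before starting the server or the notebook, it can help to confirm that these variables are actually set. The following is a minimal sketch and not part of the notebook: `missing_env_vars` is a hypothetical helper, and treating a blank value as missing is an assumption.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the setup notes above.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENROUTER_API_KEY", "FORCE_OPENROUTER")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing variables:", ", ".join(missing))
    else:
        print("All required variables are set")
```

Call `load_dotenv()` first if the values live in a `.env` file, since the check only reads the process environment.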

### 2. Server Setup
- FastAPI server must be running on `http://localhost:8000`
- PDFs must be uploaded to the system before running evaluation
- Restart the server after setting `FORCE_OPENROUTER=true` in `.env`


In [135]:
# Install required packages (if needed)
# !pip install openai requests python-dotenv


In [136]:
# Import libraries
import os
import json
import time
import requests
from typing import List, Dict, Optional
from pathlib import Path
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Configuration
API_BASE_URL = "http://localhost:8000"
EVALUATION_OUTPUT_DIR = Path("llm_judge_results")
EVALUATION_OUTPUT_DIR.mkdir(exist_ok=True)

print("✓ Libraries imported")
print(f"API Base URL: {API_BASE_URL}")
print(f"Output Directory: {EVALUATION_OUTPUT_DIR}")


✓ Libraries imported
API Base URL: http://localhost:8000
Output Directory: llm_judge_results


## LLM Judge Class

This class uses an LLM to evaluate answer quality on 3 dimensions: retrieval relevance, faithfulness, and answer quality.


In [137]:
class LLMJudge:
    """Use LLM as a judge to evaluate answer quality."""
    
    def __init__(self, judge_model: str = "gpt-4o", api_key: Optional[str] = None):
        """Initialize the LLM judge."""
        self.judge_model = judge_model
        
        # Check if judge model is from OpenRouter (has provider prefix like anthropic/, google/, etc.)
        use_openrouter = '/' in judge_model
        
        if use_openrouter:
            # Use OpenRouter for judge model
            openrouter_key = api_key or os.getenv("OPENROUTER_API_KEY")
            if not openrouter_key:
                raise ValueError("OpenRouter API key required for judge model. Set OPENROUTER_API_KEY environment variable.")
            self.client = OpenAI(
                api_key=openrouter_key,
                base_url="https://openrouter.ai/api/v1",
                default_headers={
                    "HTTP-Referer": "http://localhost:8000",
                    "X-Title": "PDF RAG Q&A System - Judge"
                }
            )
        else:
            # Use OpenAI API directly for judge model
            openai_key = api_key or os.getenv("OPENAI_API_KEY")
            if not openai_key:
                raise ValueError("OpenAI API key required for LLM judge. Set OPENAI_API_KEY environment variable.")
            self.client = OpenAI(api_key=openai_key)
    
    def evaluate_answer(
        self,
        query: str,
        retrieved_docs: List[Dict],
        model_answer: str
    ) -> Dict:
        """
        Evaluate an answer using LLM as judge.
        
        Args:
            query: The question/query
            retrieved_docs: List of retrieved documents with 'text', 'pdf_filename', 'page'
            model_answer: The answer generated by the model
            
        Returns:
            Dictionary with scores and explanations
        """
        # Format retrieved context
        retrieved_context = "\n\n".join([
            f"[Document {i+1} - {doc.get('pdf_filename', 'Unknown')}, Page {doc.get('page', 'Unknown')}]:\n{doc.get('text', '')}"
            for i, doc in enumerate(retrieved_docs)
        ])
        
        # System prompt
        system_prompt = """You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
You must score the system's performance on the following dimensions:

1. **Retrieval Relevance**: How relevant are the retrieved documents to the query?
   - Score 1: Documents are completely irrelevant to the query
   - Score 2: Documents are mostly irrelevant, with minimal connection to the query
   - Score 3: Documents are somewhat relevant but miss key aspects of the query
   - Score 4: Documents are highly relevant and cover most aspects of the query
   - Score 5: Documents are perfectly relevant and comprehensively address the query

2. **Faithfulness (Groundedness)**: Is the answer faithful to the retrieved context?
   - Score 1: Answer contains significant information not in the retrieved documents or contradicts them
   - Score 2: Answer includes some unsupported claims or minor contradictions
   - Score 3: Answer is mostly faithful but includes some unsupported details
   - Score 4: Answer is highly faithful with only minor unsupported additions
   - Score 5: Answer is completely faithful and only uses information from retrieved documents

3. **Answer Quality**: Overall quality of the answer
   - Score 1: Answer is unclear, incomplete, or unhelpful
   - Score 2: Answer addresses the query but is poorly structured or lacks clarity
   - Score 3: Answer is adequate but could be improved in clarity or completeness
   - Score 4: Answer is clear, well-structured, and mostly complete
   - Score 5: Answer is excellent, clear, comprehensive, and well-structured

Scores range from 1 to 5.
You must provide a short explanation for each score."""

        # User prompt - emphasize JSON format
        user_prompt = f"""Evaluate the following:

Query:
{query}

Retrieved Context:
{retrieved_context}

Model Answer:
{model_answer}

IMPORTANT: You must respond ONLY with valid JSON. No additional text before or after.

Return output in this exact JSON format:
{{
    "relevance": {{"score": X, "explanation": "..."}},
    "faithfulness": {{"score": X, "explanation": "..."}},
    "answer_quality": {{"score": X, "explanation": "..."}}
}}

Where X is a number from 1 to 5."""

        try:
            # Try without response_format first (some models don't support it)
            try:
                response = self.client.chat.completions.create(
                    model=self.judge_model,
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_prompt}
                    ],
                    temperature=0.3
                )
            except Exception as format_error:
                # If that fails, try with response_format
                response = self.client.chat.completions.create(
                    model=self.judge_model,
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_prompt}
                    ],
                    temperature=0.3,
                    response_format={"type": "json_object"}
                )
            
            result_text = response.choices[0].message.content.strip() if response.choices[0].message.content else ""
            
            # Check if response is empty
            if not result_text:
                raise ValueError(f"Judge model returned empty response. Model: {self.judge_model}")
            
            # Try to parse JSON directly
            try:
                evaluation = json.loads(result_text)
            except json.JSONDecodeError:
                # If direct parsing fails, try to extract JSON from text
                import re
                json_match = re.search(r'\{.*\}', result_text, re.DOTALL)
                if json_match:
                    evaluation = json.loads(json_match.group(0))
                else:
                    raise ValueError(f"Could not parse JSON from response. Response text (first 500 chars): {result_text[:500]}")
            
            return {
                "success": True,
                "scores": {
                    "relevance": evaluation.get("relevance", {}).get("score", 0),
                    "faithfulness": evaluation.get("faithfulness", {}).get("score", 0),
                    "answer_quality": evaluation.get("answer_quality", {}).get("score", 0)
                },
                "explanations": {
                    "relevance": evaluation.get("relevance", {}).get("explanation", ""),
                    "faithfulness": evaluation.get("faithfulness", {}).get("explanation", ""),
                    "answer_quality": evaluation.get("answer_quality", {}).get("explanation", "")
                },
                "judge_model": self.judge_model
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

print("✓ LLMJudge class defined")


✓ LLMJudge class defined
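
`evaluate_answer` falls back to a score of 0 when a dimension is missing from the judge's JSON. That value lies outside the 1 to 5 rubric and would silently pull averages down. One possible guard, sketched here as an assumption rather than as part of the notebook, is to validate the parsed scores before aggregating them. `validate_scores` is a hypothetical helper.

```python
from typing import Dict, List

# Dimension keys as they appear in the judge's JSON output
DIMENSIONS = ("relevance", "faithfulness", "answer_quality")


def validate_scores(scores: Dict[str, object]) -> List[str]:
    """Return a list of problems found in a judge `scores` dict (empty if valid)."""
    problems = []
    for name in DIMENSIONS:
        value = scores.get(name)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: missing or non-numeric ({value!r})")
        elif not 1 <= value <= 5:
            problems.append(f"{name}: {value} is outside the 1-5 range")
    return problems
```

An evaluation whose `scores` produce a non-empty list could be recorded as a judge error instead of being averaged.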


## Helper Functions


In [138]:
def load_qa_dataset(file_path: str) -> List[Dict]:
    """Load QA dataset from JSON file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    if isinstance(data, list):
        return data
    elif isinstance(data, dict) and "questions" in data:
        return data["questions"]
    else:
        raise ValueError("Invalid dataset format. Expected list or dict with 'questions' key.")


def format_model_name_for_openrouter(model_name: str) -> str:
    """
    Format model name for OpenRouter API.
    
    OpenRouter requires format: provider/model-name
    Examples:
    - gpt-3.5-turbo -> openai/gpt-3.5-turbo
    - gpt-4 -> openai/gpt-4
    - claude-3.5-sonnet -> anthropic/claude-3-5-sonnet (note: hyphen, not dot)
    - claude-3-sonnet -> anthropic/claude-3-sonnet
    - llama-2 -> meta-llama/llama-2
    - gemini-pro -> google/gemini-pro
    """
    # If already has provider prefix, check if it needs conversion
    if '/' in model_name:
        # Handle special cases for OpenRouter model IDs
        # OpenRouter uses claude-3-5-sonnet (with hyphen) not claude-3.5-sonnet (with dot)
        if 'claude-3.5' in model_name:
            return model_name.replace('claude-3.5', 'claude-3-5')
        else:
            # Keep model name as-is if it already has provider prefix
            return model_name
    
    # Map common model names to providers
    if model_name.startswith('gpt'):
        return f"openai/{model_name}"
    elif model_name.startswith('claude'):
        # OpenRouter uses claude-3-5-sonnet (hyphen) not claude-3.5-sonnet (dot)
        formatted = model_name.replace('3.5', '3-5')
        return f"anthropic/{formatted}"
    elif model_name.startswith('llama'):
        # OpenRouter lists Meta's Llama models under the meta-llama/ prefix
        return f"meta-llama/{model_name}"
    elif model_name.startswith('gemini'):
        return f"google/{model_name}"
    else:
        # Default: assume OpenAI format
        return f"openai/{model_name}"


def ask_question_with_model(question: str, model_name: Optional[str] = None, use_openrouter: bool = True) -> Dict:
    """
    Ask a question using the API.
    
    Args:
        question: The question to ask
        model_name: Model name (will be formatted for OpenRouter if needed)
        use_openrouter: Whether to format model name for OpenRouter
    """
    payload = {"question": question}
    
    if model_name:
        # Format model name for OpenRouter if needed
        if use_openrouter:
            formatted_model = format_model_name_for_openrouter(model_name)
        else:
            formatted_model = model_name
        payload["model_name"] = formatted_model
    
    start_time = time.time()
    
    try:
        response = requests.post(
            f"{API_BASE_URL}/api/ask",
            json=payload,
            timeout=120
        )
        
        response_time = time.time() - start_time
        
        if response.status_code == 200:
            data = response.json()
            return {
                "success": True,
                "answer": data.get("answer", ""),
                "citations": data.get("citations", []),
                "retrieved_docs": data.get("retrieved_docs", []),  # Include retrieved documents
                "response_time": response_time,
                "model_used": data.get("model_used", formatted_model if model_name else "default")
            }
        else:
            return {
                "success": False,
                "error": response.text,
                "response_time": response_time
            }
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "response_time": time.time() - start_time
        }

print("✓ Helper functions defined")


✓ Helper functions defined


In [139]:
# Configuration
DATASET_PATH = "qa_dataset.json"  # Path to your QA dataset JSON file

# Models to test via OpenRouter (3 models from different families)
MODELS_TO_TEST = [
    "claude-3.5-sonnet",   # Anthropic Claude 3.5 Sonnet
    "gpt-3.5-turbo",       # OpenAI GPT-3.5 Turbo
    "google/gemini-2.5-flash",    # Google Gemini 2.5 Flash
]

# Judge model: Must be from a different family than tested models
# Since we're testing Claude, GPT, and Gemini, we need a model from another family
# Using Mistral (different from Anthropic, OpenAI, and Google)
# Note: llama-guard is a safety model, not suitable for evaluation tasks
JUDGE_MODEL = "mistralai/mistral-large"  # Model to use as judge (different family: Mistral)
OUTPUT_FILE = None  # None = auto-generate filename with timestamp
USE_OPENROUTER = True  # Whether to use OpenRouter for testing models

print("Configuration:")
print(f"  Dataset: {DATASET_PATH}")
print(f"  Models to test (via OpenRouter): {MODELS_TO_TEST}")
print(f"  Judge model: {JUDGE_MODEL}")
print(f"  Use OpenRouter: {USE_OPENROUTER}")
print("\n⚠️  IMPORTANT: Make sure FORCE_OPENROUTER=true is set in .env file!")
print("   This ensures all model calls go through OpenRouter.")


Configuration:
  Dataset: qa_dataset.json
  Models to test (via OpenRouter): ['claude-3.5-sonnet', 'gpt-3.5-turbo', 'google/gemini-2.5-flash']
  Judge model: mistralai/mistral-large
  Use OpenRouter: True

⚠️  IMPORTANT: Make sure FORCE_OPENROUTER=true is set in .env file!
   This ensures all model calls go through OpenRouter.
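
The configuration comments require the judge to come from a different model family than the models under test. A small check along these lines can catch an accidental overlap before a long run. This is a sketch under assumptions: `model_family` and `judge_conflicts` are hypothetical helpers, and the prefix table only covers the names used in this notebook.

```python
from typing import List

# Prefix-to-family mapping for unprefixed names; an assumption that mirrors
# the naming used in this notebook, not an exhaustive list.
FAMILY_BY_PREFIX = {
    "gpt": "openai",
    "claude": "anthropic",
    "gemini": "google",
    "mistral": "mistralai",
}


def model_family(model_name: str) -> str:
    """Guess the provider family of a model name (hypothetical helper)."""
    if "/" in model_name:
        # Names such as "google/gemini-2.5-flash" already carry the provider
        return model_name.split("/", 1)[0].lower()
    for prefix, family in FAMILY_BY_PREFIX.items():
        if model_name.lower().startswith(prefix):
            return family
    return "unknown"


def judge_conflicts(judge_model: str, tested_models: List[str]) -> List[str]:
    """Return the tested models that share a family with the judge."""
    judge_family = model_family(judge_model)
    return [m for m in tested_models if model_family(m) == judge_family]


tested = ["claude-3.5-sonnet", "gpt-3.5-turbo", "google/gemini-2.5-flash"]
print(judge_conflicts("mistralai/mistral-large", tested))
```

With the configuration above the list is empty; switching the judge back to `gpt-4o` would flag `gpt-3.5-turbo`.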


## Load QA Dataset

Load your questions and reference answers.
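
`load_qa_dataset` accepts either a bare JSON list or an object with a `questions` key, and the evaluation loop reads the fields `id`, `question`, `answer` (or `reference_answer`), and `context`. A minimal sketch of a file in that shape follows; the question text and the file name are made-up placeholders, not items from the real dataset.

```python
import json
import tempfile
from pathlib import Path

# Illustrative content only; this item is not from the real dataset.
sample_dataset = {
    "questions": [
        {
            "id": 1,
            "question": "What is overfitting?",
            "answer": "A model fits noise in the training data and generalizes poorly.",
            "context": "Overfitting occurs when a model memorizes training noise.",
        }
    ]
}

# Write the sample to a temporary file and read it back
sample_path = Path(tempfile.mkdtemp()) / "sample_qa_dataset.json"
sample_path.write_text(json.dumps(sample_dataset, indent=2), encoding="utf-8")

loaded = json.loads(sample_path.read_text(encoding="utf-8"))
# Same normalization rule as load_qa_dataset: a list, or a dict with "questions"
items = loaded if isinstance(loaded, list) else loaded["questions"]
print(f"{len(items)} question(s); fields: {sorted(items[0])}")
```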


In [140]:
# Load dataset
print(f"Loading QA dataset from: {DATASET_PATH}")
qa_items = load_qa_dataset(DATASET_PATH)
print(f"✓ Loaded {len(qa_items)} questions")

# Show first question as example
if qa_items:
    print("\nExample question:")
    print(f"  Question: {qa_items[0].get('question', 'N/A')}")
    print(f"  Has reference answer: {'answer' in qa_items[0] or 'reference_answer' in qa_items[0]}")
    print(f"  Has context: {'context' in qa_items[0]}")


Loading QA dataset from: qa_dataset.json
✓ Loaded 50 questions

Example question:
  Question: What are the main topics covered in the Machine Learning Interview Cheat Sheet?
  Has reference answer: True
  Has context: True


## Initialize Judge

Initialize the LLM judge that will evaluate all answers.


In [141]:
# Initialize judge
print(f"Initializing LLM judge: {JUDGE_MODEL}")
judge = LLMJudge(judge_model=JUDGE_MODEL)
print("✓ Judge initialized")


Initializing LLM judge: mistralai/mistral-large
✓ Judge initialized


## Run Evaluation

This will test each model (via OpenRouter) on each question and get judge evaluations.


In [142]:
# Setup output
if not OUTPUT_FILE:
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    OUTPUT_FILE = f"llm_judge_evaluation_{timestamp}.json"

output_path = EVALUATION_OUTPUT_DIR / OUTPUT_FILE

print(f"{'='*60}")
print(f"LLM-as-a-Judge Evaluation (via OpenRouter)")
print(f"{'='*60}")
print(f"Dataset: {DATASET_PATH}")
print(f"Questions: {len(qa_items)}")
print(f"Models to test: {MODELS_TO_TEST}")
print(f"Judge model: {JUDGE_MODEL}")
print(f"Output: {output_path}")
print(f"{'='*60}\n")

results = {
    "metadata": {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "dataset_path": DATASET_PATH,
        "total_questions": len(qa_items),
        "models_tested": MODELS_TO_TEST,
        "judge_model": JUDGE_MODEL,
        "api_base_url": API_BASE_URL,
        "use_openrouter": USE_OPENROUTER
    },
    "evaluations": []
}

models_to_test = MODELS_TO_TEST if MODELS_TO_TEST else [None]
total_tests = len(qa_items) * len(models_to_test)
current_test = 0

for item_index, qa_item in enumerate(qa_items, start=1):
    question = qa_item.get("question", "")
    reference_answer = qa_item.get("answer") or qa_item.get("reference_answer")
    context = qa_item.get("context", "")
    # Fall back to the item's position when the dataset has no explicit id
    qa_id = qa_item.get("id", item_index)
    
    if not question:
        print(f"⚠ Skipping item {qa_id}: No question found")
        continue
    
    print(f"\nQuestion {qa_id}/{len(qa_items)}: {question[:60]}...")
    
    for model in models_to_test:
        current_test += 1
        model_display = model or "default"
        print(f"  [{current_test}/{total_tests}] Testing {model_display}...", end=" ", flush=True)
        
        # Get answer from model (via OpenRouter)
        answer_result = ask_question_with_model(question, model, use_openrouter=USE_OPENROUTER)
        
        if not answer_result["success"]:
            print(f"✗ Error: {answer_result.get('error', 'Unknown')[:50]}")
            evaluation = {
                "qa_id": qa_id,
                "question": question,
                "model": model_display,
                "success": False,
                "error": answer_result.get("error", "Unknown error")
            }
            results["evaluations"].append(evaluation)
            continue
        
        answer = answer_result["answer"]
        retrieved_docs = answer_result.get("retrieved_docs", [])
        print(f"✓ Got answer ({answer_result.get('response_time', 0):.2f}s)", end=" ", flush=True)
        
        # When the QA dataset provides a reference context, the judge sees it instead of
        # the documents the server retrieved, so the Retrieval Relevance score then
        # reflects the reference context rather than the actual retrieval
        context_for_evaluation = retrieved_docs
        if context and context.strip():
            # Convert context string to retrieved_docs format for evaluation
            # This represents the "ground truth" context that should have been retrieved
            context_for_evaluation = [{
                "text": context,
                "pdf_filename": "Reference Context",
                "page": "N/A"
            }]
        
        # Judge the answer
        print("Judging...", end=" ", flush=True)
        judge_result = judge.evaluate_answer(
            query=question,
            retrieved_docs=context_for_evaluation,
            model_answer=answer
        )
        
        if judge_result["success"]:
            # Calculate average score
            scores = judge_result["scores"]
            avg_score = (scores["relevance"] + scores["faithfulness"] + scores["answer_quality"]) / 3
            print(f"✓ Avg Score: {avg_score:.2f}/5")
        else:
            print(f"✗ Judge error: {judge_result.get('error', 'Unknown')[:50]}")
        
        # Build evaluation record
        evaluation = {
            "qa_id": qa_id,
            "question": question,
            "model": model_display,
            "model_used": answer_result.get("model_used", model_display),
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "answer": answer,
            "retrieved_docs": retrieved_docs,
            "citations": answer_result.get("citations", []),
            "response_time": answer_result.get("response_time", 0),
            "num_citations": len(answer_result.get("citations", [])),
            "num_retrieved_docs": len(retrieved_docs),
            "answer_length": len(answer),
            "judge_evaluation": judge_result if judge_result["success"] else None,
            "judge_error": judge_result.get("error") if not judge_result["success"] else None
        }
        
        results["evaluations"].append(evaluation)
        
        # Save progress after each judged answer (one write per question/model pair)
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

print(f"\n{'='*60}")
print(f"Evaluation Complete!")
print(f"Results saved to: {output_path}")
print(f"{'='*60}\n")


LLM-as-a-Judge Evaluation (via OpenRouter)
Dataset: qa_dataset.json
Questions: 50
Models to test: ['claude-3.5-sonnet', 'gpt-3.5-turbo', 'google/gemini-2.5-flash']
Judge model: mistralai/mistral-large
Output: llm_judge_results/llm_judge_evaluation_20251205_143859.json


Question 1/50: What are the main topics covered in the Machine Learning Int...
  [1/150] Testing claude-3.5-sonnet... ✓ Got answer (6.10s) Judging... ✓ Avg Score: 3.00/5
  [2/150] Testing gpt-3.5-turbo... ✓ Got answer (1.78s) Judging... ✓ Avg Score: 2.33/5
  [3/150] Testing google/gemini-2.5-flash... ✓ Got answer (1.26s) Judging... ✓ Avg Score: 3.00/5

Question 2/50: What are some key concepts related to machine learning menti...
  [4/150] Testing claude-3.5-sonnet... ✓ Got answer (12.46s) Judging... ✓ Avg Score: 4.33/5
  [5/150] Testing gpt-3.5-turbo... ✓ Got answer (5.81s) Judging... ✓ Avg Score: 4.00/5
  [6/150] Testing google/gemini-2.5-flash... ✓ Got answer (5.67s) Judging... ✓ Avg Score: 4.67/5

Question 3/50: Wha

## Calculate Summary Statistics


In [143]:
# Calculate summary statistics
summary = {}

for eval_item in results["evaluations"]:
    model = eval_item["model"]
    if model not in summary:
        summary[model] = {
            "total_questions": 0,
            "successful_evaluations": 0,
            "avg_scores": {
                "relevance": 0,
                "faithfulness": 0,
                "answer_quality": 0,
                "overall_score": 0
            },
            "total_response_time": 0,
            "avg_response_time": 0,
            "total_citations": 0,
            "avg_citations": 0,
            "total_retrieved_docs": 0,
            "avg_retrieved_docs": 0
        }
    
    stats = summary[model]
    # Count every attempted question, including failed answers and failed judge calls
    stats["total_questions"] += 1
    
    # Only successfully judged answers contribute to the averages
    if not eval_item.get("success", True) or not eval_item.get("judge_evaluation"):
        continue
    
    judge_eval = eval_item.get("judge_evaluation", {})
    if judge_eval and judge_eval.get("success"):
        stats["successful_evaluations"] += 1
        scores = judge_eval.get("scores", {})
        
        # Calculate overall score as average of three dimensions
        overall = (scores.get("relevance", 0) + scores.get("faithfulness", 0) + scores.get("answer_quality", 0)) / 3
        
        stats["avg_scores"]["relevance"] += scores.get("relevance", 0)
        stats["avg_scores"]["faithfulness"] += scores.get("faithfulness", 0)
        stats["avg_scores"]["answer_quality"] += scores.get("answer_quality", 0)
        stats["avg_scores"]["overall_score"] += overall
        
        stats["total_response_time"] += eval_item.get("response_time", 0)
        stats["total_citations"] += eval_item.get("num_citations", 0)
        stats["total_retrieved_docs"] += eval_item.get("num_retrieved_docs", 0)

# Calculate averages
for model, stats in summary.items():
    if stats["successful_evaluations"] > 0:
        for score_type in stats["avg_scores"]:
            stats["avg_scores"][score_type] /= stats["successful_evaluations"]
        stats["avg_response_time"] = stats["total_response_time"] / stats["successful_evaluations"]
        stats["avg_citations"] = stats["total_citations"] / stats["successful_evaluations"]
        stats["avg_retrieved_docs"] = stats["total_retrieved_docs"] / stats["successful_evaluations"]

results["summary"] = summary

# Save final results
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print("✓ Summary calculated and saved")


✓ Summary calculated and saved


## View Summary Results


In [144]:
# Print summary
print("\n" + "="*60)
print("SUMMARY - Model Comparison")
print("="*60)

for model, stats in summary.items():
    print(f"\nModel: {model}")
    print(f"  Total Questions: {stats['total_questions']}")
    print(f"  Successful Evaluations: {stats['successful_evaluations']}")
    print(f"  Average Scores (1-5 scale):")
    print(f"    Retrieval Relevance: {stats['avg_scores']['relevance']:.2f}/5")
    print(f"    Faithfulness: {stats['avg_scores']['faithfulness']:.2f}/5")
    print(f"    Answer Quality: {stats['avg_scores']['answer_quality']:.2f}/5")
    print(f"    Overall Score: {stats['avg_scores']['overall_score']:.2f}/5")
    print(f"  Avg Response Time: {stats['avg_response_time']:.2f}s")
    print(f"  Avg Citations: {stats['avg_citations']:.2f}")
    print(f"  Avg Retrieved Docs: {stats['avg_retrieved_docs']:.2f}")

print(f"\n✓ Full results saved to: {output_path}")
print("\nYou can now analyze the results to compare different models' performance.")



SUMMARY - Model Comparison

Model: claude-3.5-sonnet
  Total Questions: 50
  Successful Evaluations: 50
  Average Scores (1-5 scale):
    Retrieval Relevance: 3.82/5
    Faithfulness: 3.96/5
    Answer Quality: 4.12/5
    Overall Score: 3.97/5
  Avg Response Time: 10.84s
  Avg Citations: 2.98
  Avg Retrieved Docs: 29.18

Model: gpt-3.5-turbo
  Total Questions: 50
  Successful Evaluations: 50
  Average Scores (1-5 scale):
    Retrieval Relevance: 3.68/5
    Faithfulness: 3.58/5
    Answer Quality: 3.70/5
    Overall Score: 3.65/5
  Avg Response Time: 3.64s
  Avg Citations: 3.56
  Avg Retrieved Docs: 29.18

Model: google/gemini-2.5-flash
  Total Questions: 50
  Successful Evaluations: 50
  Average Scores (1-5 scale):
    Retrieval Relevance: 3.94/5
    Faithfulness: 4.20/5
    Answer Quality: 4.32/5
    Overall Score: 4.15/5
  Avg Response Time: 2.17s
  Avg Citations: 3.50
  Avg Retrieved Docs: 29.18

✓ Full results saved to: llm_judge_results/llm_judge_evaluation_20251205_143859.json
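
The overall score is the mean of the three dimension scores, so the figures above can be cross-checked and ranked with a few lines. This sketch copies the rounded averages from the printed summary; because the inputs are already rounded, agreement is only expected to about two decimals.

```python
# Rounded per-model averages copied from the summary output above
reported = {
    "claude-3.5-sonnet": {
        "relevance": 3.82, "faithfulness": 3.96, "answer_quality": 4.12, "overall": 3.97,
    },
    "gpt-3.5-turbo": {
        "relevance": 3.68, "faithfulness": 3.58, "answer_quality": 3.70, "overall": 3.65,
    },
    "google/gemini-2.5-flash": {
        "relevance": 3.94, "faithfulness": 4.20, "answer_quality": 4.32, "overall": 4.15,
    },
}


def mean_of_dimensions(row: dict) -> float:
    """Average the three judged dimensions, as the summary cell does per answer."""
    return (row["relevance"] + row["faithfulness"] + row["answer_quality"]) / 3


# Rank models from highest to lowest recomputed overall score
ranking = sorted(reported, key=lambda m: mean_of_dimensions(reported[m]), reverse=True)

for model in ranking:
    row = reported[model]
    print(f"{model}: recomputed {mean_of_dimensions(row):.2f}, reported {row['overall']:.2f}")
```

In this run the ordering is Gemini 2.5 Flash, then Claude 3.5 Sonnet, then GPT-3.5 Turbo.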

