# Evaluating LLM Outputs with LLM-as-Judge

## Overview

Production systems must **validate LLM outputs** before sending them to users. This notebook demonstrates the **LLM-as-judge pattern**: using one LLM call to evaluate another LLM's response.

## Why Evaluation Matters

### Quality Assurance
- **Accuracy**: Does the response match the facts?
- **Completeness**: Does it answer the question?
- **Relevance**: Is it on-topic?
- **Safety**: Is it free of harmful or inappropriate content?

### Use Cases
1. **Pre-deployment**: Test prompt changes
2. **Production**: Real-time quality gates
3. **Monitoring**: Track quality metrics over time
4. **A/B Testing**: Compare prompt variants

## LLM-as-Judge Pattern

**Concept**: Use a separate LLM call to evaluate the primary LLM's output.

**Benefits**:
- Automated at scale (no human review of every response)
- Consistent criteria
- Cost-effective (roughly $0.0001-$0.0003 per evaluation with gpt-3.5-turbo, depending on prompt length)
- Fast (<500ms)

**Limitations**:
- Not 100% accurate (LLMs can misjudge)
- Requires clear evaluation criteria
- Best for binary/simple judgments

## Environment Setup

In [None]:
import os
import openai
import tiktoken
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

---

## Layer 1: Content Moderation

Before evaluating quality, ensure the response passes content safety checks using OpenAI's Moderation API.

In [None]:
client = openai.OpenAI()

def get_completion_from_messages(
    messages,
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=500,
):
    """Send a chat conversation to the model and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,  # 0 keeps judgments as repeatable as possible
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content


**Expected Result**: `flagged=False` (no policy violations)

The agent response being checked is a benign customer service message, so it should pass moderation.
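A minimal sketch of turning a moderation result into a pass/fail decision plus the list of flagged categories. It assumes the result has already been converted to a plain dict with `flagged` and `categories` keys (for example via `model_dump()` on the SDK's response object, which is an assumption about the installed `openai` version). `summarize_moderation` is a hypothetical helper, not part of the SDK.

```python
def summarize_moderation(result: dict) -> dict:
    """Return whether the text passed and which categories were flagged.

    `result` is assumed to look like:
        {"flagged": bool, "categories": {category_name: bool, ...}}
    """
    categories = result.get("categories") or {}
    flagged_categories = sorted(name for name, hit in categories.items() if hit)
    return {
        # A missing "flagged" key raises KeyError instead of silently passing
        "passed": not result["flagged"],
        "flagged_categories": flagged_categories,
    }


# Illustrative input shaped like a benign result (values are made up)
example = {"flagged": False, "categories": {"hate": False, "violence": False}}
print(summarize_moderation(example))
```

Keeping the flagged category names alongside the boolean makes the later logging and alerting steps more informative than a bare `flagged` value.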

---

## Layer 2: LLM-as-Judge Evaluation

### Evaluation Criteria

1. **Factual Accuracy**: Does the response match the product information?
2. **Completeness**: Does it answer the user's question?

### System Message Design

The system message sets up an evaluator LLM with two jobs:
- Check the cited facts against the source data
- Verify that the question was answered

**Output**: Binary Y/N (simple, fast, low-cost)
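A binary output is only useful if the calling code reads it strictly. The sketch below shows one way to map the judge's raw text to a boolean; `parse_verdict` is a hypothetical helper, and failing closed (anything other than a lone `Y` counts as a rejection) is a design assumption rather than something the notebook prescribes.

```python
def parse_verdict(raw):
    """Map the judge's raw text to True (pass) or False (fail).

    Fails closed: anything other than a lone Y (ignoring case and
    surrounding whitespace) is treated as a failure.
    """
    if raw is None:
        return False
    return raw.strip().upper() == "Y"


for sample in ["Y", " y\n", "N", "Yes", ""]:
    print(repr(sample), parse_verdict(sample))
```

Rejecting `"Yes"` is deliberate: if the judge stops following the single-letter format, that drift should show up as failures instead of being quietly accepted.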

In [None]:
final_response_to_customer = f"""
The SmartX ProPhone has a 6.1-inch display, 128GB storage, \
12MP dual camera, and 5G. The FotoSnap DSLR Camera \
has a 24.2MP sensor, 1080p video, 3-inch LCD, and \
interchangeable lenses. We have a variety of TVs, including \
the CineView 4K TV with a 55-inch display, 4K resolution, \
HDR, and smart TV features. We also have the SoundMax \
Home Theater system with 5.1 channel, 1000W output, wireless \
subwoofer, and Bluetooth. Do you have any specific questions \
about these products or any other products we offer?
"""
response = client.moderations.create(
    input=final_response_to_customer
)
moderation_output = response.results[0]
print(moderation_output)

**Expected Output**: `Y` (response is accurate and complete)

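The response correctly states:
- SmartX ProPhone: 6.1-inch display, 128GB storage, 12MP dual camera, 5G ✓
- FotoSnap DSLR: 24.2MP sensor, 1080p video, 3-inch LCD, interchangeable lenses ✓
- CineView 4K TV: 55-inch display, 4K resolution, HDR, smart TV features ✓
- SoundMax Home Theater: 5.1 channel, 1000W output, wireless subwoofer, Bluetooth ✓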

---

## Test Case 2: Irrelevant Response

Test with a nonsensical response to verify that the judge catches bad outputs.

In [None]:
system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
The product information and user and customer \
service agent messages will be delimited by \
3 backticks, i.e. ```.
Respond with a Y or N character, with no punctuation:
Y - if the output sufficiently answers the question \
AND the response correctly uses product information
N - otherwise

Output a single letter only.
"""
customer_message = f"""
tell me about the smartx pro phone and \
the fotosnap camera, the dslr one. \
Also tell me about your tvs"""
product_information = """{ "name": "SmartX ProPhone", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-PP10", "warranty": "1 year", "rating": 4.6, "features": [ "6.1-inch display", "128GB storage", "12MP dual camera", "5G" ], "description": "A powerful smartphone with advanced camera features.", "price": 899.99 } { "name": "FotoSnap DSLR Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-DSLR200", "warranty": "1 year", "rating": 4.7, "features": [ "24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses" ], "description": "Capture stunning photos and videos with this versatile DSLR camera.", "price": 599.99 } { "name": "CineView 4K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-4K55", "warranty": "2 years", "rating": 4.8, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "A stunning 4K TV with vibrant colors and smart features.", "price": 599.99 } { "name": "SoundMax Home Theater", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-HT100", "warranty": "1 year", "rating": 4.4, "features": [ "5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth" ], "description": "A powerful home theater system for an immersive audio experience.", "price": 399.99 } { "name": "CineView 8K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-8K65", "warranty": "2 years", "rating": 4.9, "features": [ "65-inch display", "8K resolution", "HDR", "Smart TV" ], "description": "Experience the future of television with this stunning 8K TV.", "price": 2999.99 } { "name": "SoundMax Soundbar", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-SB50", "warranty": "1 year", "rating": 4.3, "features": [ "2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth" ], "description": "Upgrade 
your TV's audio with this sleek and powerful soundbar.", "price": 199.99 } { "name": "CineView OLED TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-OLED55", "warranty": "2 years", "rating": 4.7, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "Experience true blacks and vibrant colors with this OLED TV.", "price": 1499.99 }"""
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{final_response_to_customer}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages, max_tokens=1)
print(response)

**Expected Output**: `N` (fails both criteria)

"Life is like a box of chocolates" doesn't:
- Use product information ✗
- Answer the question ✗

**Key Insight**: The evaluator should reject irrelevant responses; an `N` here confirms that the judge prompt catches off-topic output.

---

## Production Implementation

### Complete Evaluation Pipeline

```python
def evaluate_and_send_response(user_query, llm_response, product_data):
    """
    Multi-layer evaluation before sending to user.
    Returns: (should_send: bool, reason: str or None)
    """
    # Layer 1: Content safety
    moderation = client.moderations.create(input=llm_response)
    if moderation.results[0].flagged:
        return False, "Content policy violation"

    # Layer 2: Factual accuracy + completeness
    eval_prompt = f"""
    Evaluate this response:
    User: {user_query}
    Product Data: {product_data}
    Response: {llm_response}

    Check: Facts correct? Question answered?
    Output: Y or N
    """

    # Reuse the chat helper defined earlier in this notebook
    judgment = get_completion_from_messages(
        [{'role': 'user', 'content': eval_prompt}], max_tokens=1
    )

    # Fail closed: anything other than "Y" is treated as a rejection
    if judgment is None or judgment.strip().upper() != 'Y':
        # log_failed_response is a placeholder for your own logging hook
        log_failed_response(user_query, llm_response, "Failed evaluation")
        return False, "Quality check failed"

    # Passed all checks
    return True, None

# Usage (query, response, data, send_to_user and send_fallback_response
# are placeholders for your application's own objects)
should_send, reason = evaluate_and_send_response(query, response, data)
if should_send:
    send_to_user(response)
else:
    # Fallback or regenerate
    send_fallback_response(reason)
```
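The "fallback or regenerate" branch can be sketched as a bounded retry loop. This is an illustration under assumptions: `respond_with_retries`, `generate`, and `evaluate` are hypothetical names, and the callables are injected so the loop can be exercised without calling any API.

```python
def respond_with_retries(generate, evaluate, max_attempts=3,
                         fallback="Sorry, I can't answer that right now."):
    """Generate a response, regenerating until one passes evaluation.

    generate: hypothetical callable () -> str that produces a candidate
    evaluate: hypothetical callable (str) -> (should_send, reason)
    Returns (text, attempts_used, passed).
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        should_send, _reason = evaluate(candidate)
        if should_send:
            return candidate, attempt, True
    # Every attempt was rejected: send the canned fallback instead
    return fallback, max_attempts, False


# Stand-in callables for illustration only
candidates = iter(["off-topic", "still wrong", "good answer"])
text, attempts, passed = respond_with_retries(
    generate=lambda: next(candidates),
    evaluate=lambda r: (r == "good answer", None),
)
print(text, attempts, passed)
```

Each retry costs one generation plus one evaluation, so `max_attempts` bounds both the latency and the spend of a single user request.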

### Evaluation Dimensions

Expand beyond binary Y/N:

```python
system_message = """
Evaluate the response on a 1-5 scale for each criterion:
1. Factual Accuracy (uses data correctly)
2. Completeness (answers all parts)
3. Relevance (on-topic)
4. Tone (professional, friendly)
5. Clarity (easy to understand)

Output JSON: {
    "accuracy": 5,
    "completeness": 4,
    "relevance": 5,
    "tone": 5,
    "clarity": 4,
    "overall_pass": true
}
"""
```
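Scored output needs validation before it can gate anything. The sketch below parses the JSON and recomputes pass/fail locally instead of trusting `overall_pass`; `parse_scores` is a hypothetical helper, and the `min_score=4` threshold is an assumption, not a value from this notebook.

```python
import json

DIMENSIONS = ("accuracy", "completeness", "relevance", "tone", "clarity")


def parse_scores(raw, min_score=4):
    """Parse the judge's JSON and decide pass/fail locally.

    Returns (scores, passed). Malformed or out-of-range output is
    treated as a failure, with scores set to None.
    """
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None, False
    if not isinstance(data, dict):
        return None, False
    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            return None, False
        if not 1 <= value <= 5:
            return None, False
        scores[name] = value
    return scores, min(scores.values()) >= min_score


raw = ('{"accuracy": 5, "completeness": 4, "relevance": 5, '
       '"tone": 5, "clarity": 4, "overall_pass": true}')
print(parse_scores(raw))
```

Deciding the pass locally keeps the threshold in code, where it can be tuned and audited, and not inside the judge's own output.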

### Cost Analysis

**Per Evaluation**:
- Input tokens: ~500 (query + product data + response)
- Output tokens: ~50 (judgment + reasoning)
- Cost: ~$0.0003 (gpt-3.5-turbo)

**At Scale** (10K evaluations/day):
- Daily cost: ~$3 (10,000 × $0.0003)
- Monthly cost: ~$90 (30 days × $3)

**Worth It If**:
- It prevents bad customer interactions (a support ticket costs roughly $20-50)
- ROI positive after preventing ~2-5 escalations/month ($90 / $50 ≈ 2, $90 / $20 ≈ 5, rounding up)
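The arithmetic behind these figures, as a small worked example. The per-evaluation cost and ticket costs are the estimates quoted in this section; the 30-day month and the helper name `break_even_tickets` are assumptions for illustration.

```python
import math

COST_PER_EVAL = 0.0003   # USD, the per-evaluation estimate quoted above
EVALS_PER_DAY = 10_000
DAYS_PER_MONTH = 30      # assumption: a 30-day month

daily_cost = COST_PER_EVAL * EVALS_PER_DAY
monthly_cost = daily_cost * DAYS_PER_MONTH


def break_even_tickets(monthly_cost, cost_per_ticket):
    """Smallest whole number of avoided tickets that covers the monthly cost."""
    return math.ceil(monthly_cost / cost_per_ticket)


print(f"daily ${daily_cost:.2f}, monthly ${monthly_cost:.2f}")
print("break-even tickets at $50:", break_even_tickets(monthly_cost, 50))
print("break-even tickets at $20:", break_even_tickets(monthly_cost, 20))
```

At $50 per ticket the evaluation pays for itself after 2 avoided tickets per month; at $20 per ticket it takes 5.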

### Automated Quality Metrics

```python
from collections import Counter

class EvaluationMetrics:
    def __init__(self):
        self.total = 0
        self.passed = 0
        self.failed = 0
        # Counter (not defaultdict) because report() relies on most_common()
        self.fail_reasons = Counter()

    def record(self, passed, reason=None):
        self.total += 1
        if passed:
            self.passed += 1
        else:
            self.failed += 1
            self.fail_reasons[reason or "unspecified"] += 1

    def report(self):
        return {
            # Avoid division by zero before anything has been recorded
            "pass_rate": self.passed / self.total if self.total else None,
            "total_evaluated": self.total,
            "common_failures": dict(self.fail_reasons.most_common(5))
        }

# Track over time
metrics = EvaluationMetrics()
# ...after each evaluation (the reason is only stored for failures)...
metrics.record(judgment == 'Y', reason="Quality check failed")

# Weekly report
print(metrics.report())
# Example shape: {"pass_rate": 0.94, "total_evaluated": 1000, ...}
```

---

## Advanced Evaluation Patterns

### 1. Multi-Model Consensus

Use multiple models as judges and take the majority vote:

```python
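from collections import Counter

def consensus_evaluation(response, criteria):
    judges = ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"]
    votes = []

    for model in judges:
        # evaluate_with_model is a placeholder: it should run the judge
        # prompt against `model` and return 'Y' or 'N'
        judgment = evaluate_with_model(response, criteria, model)
        votes.append(judgment)

    # Majority vote (an odd number of judges avoids ties)
    return Counter(votes).most_common(1)[0][0]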
```
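`Counter.most_common` orders equal counts by first appearance, so with an even number of judges a tie would be decided by judge order. A tie-safe variant, sketched here under the assumption that votes are `'Y'`/`'N'` strings and that ties should fail closed (`majority_verdict` is a hypothetical helper):

```python
from collections import Counter


def majority_verdict(votes):
    """Return 'Y' only if strictly more than half of the votes are 'Y'.

    Ties, empty vote lists, and unrecognized votes all count against 'Y'.
    """
    votes = list(votes)
    counts = Counter(v.strip().upper() for v in votes)
    return "Y" if counts["Y"] * 2 > len(votes) else "N"


print(majority_verdict(["Y", "N", "Y"]))
print(majority_verdict(["Y", "N"]))
```

Requiring a strict majority means a judge that errors out or returns junk counts as a vote against sending.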

### 2. Reference-Based Evaluation

Compare against known good responses:

```python
# Template: fill the placeholders before sending, e.g.
# system_message.format(reference_answer=..., agent_response=...)
system_message = """
Compare the agent's response to this reference answer:

Reference: {reference_answer}
Agent Response: {agent_response}

Rate similarity (1-5) considering:
- Same facts mentioned
- Similar level of detail
- Comparable tone

Output: Score and reasoning
"""
```
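The prompt above is a template, so the placeholders have to be filled before it is sent; otherwise the judge sees the literal `{reference_answer}` text. A minimal sketch using `str.format`, with a shortened copy of the template and hypothetical names (`REFERENCE_TEMPLATE`, `build_reference_prompt`):

```python
REFERENCE_TEMPLATE = """\
Compare the agent's response to this reference answer:

Reference: {reference_answer}
Agent Response: {agent_response}

Rate similarity (1-5) and explain your reasoning.
"""


def build_reference_prompt(reference_answer, agent_response):
    """Fill the template; braces inside the inserted values are left untouched."""
    return REFERENCE_TEMPLATE.format(
        reference_answer=reference_answer,
        agent_response=agent_response,
    )


print(build_reference_prompt(
    "The CineView 4K TV has a 55-inch display.",
    "Our CineView 4K TV is 55 inches.",
))
```

Note that `str.format` would trip over literal braces in the template itself (for example an inline JSON example), which would need to be doubled as `{{` and `}}`.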

### 3. Chain-of-Thought Evaluation

Ask the evaluator to explain its reasoning before giving a verdict:

```python
system_message = """
Evaluate this response step-by-step:

Step 1: List facts from product data
Step 2: Check if response matches those facts
Step 3: Identify any hallucinations
Step 4: Assess if question is fully answered
Step 5: Final judgment (Y/N)

Format:
Step 1: ...
Step 2: ...
...
Judgment: Y
"""
```
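With step-by-step output, the verdict has to be pulled out of the text. A sketch that reads the last `Judgment:` line, assuming the judge follows the format above; `extract_judgment` is a hypothetical helper.

```python
import re

# Matches a line of the form "Judgment: Y" or "Judgment: N" (case-insensitive)
_JUDGMENT_RE = re.compile(r"^\s*Judgment:\s*([YN])\s*$",
                          re.IGNORECASE | re.MULTILINE)


def extract_judgment(evaluation_text):
    """Return 'Y' or 'N' from the last 'Judgment:' line, or None if absent."""
    matches = _JUDGMENT_RE.findall(evaluation_text or "")
    return matches[-1].upper() if matches else None


sample = "Step 1: ...\nStep 2: ...\nJudgment: Y"
print(extract_judgment(sample))
```

A `None` result should be treated as a failed evaluation. The `max_tokens=1` cap used for the binary judge would truncate this format, so the token limit needs to be raised for chain-of-thought evaluation.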

---

## Monitoring and Dashboards

### Real-Time Quality Dashboard

```python
{
    "last_hour": {
        "responses_evaluated": 847,
        "pass_rate": 0.92,
        "avg_latency_ms": 450,
        "moderation_failures": 3,
        "accuracy_failures": 24,
        "completeness_failures": 45
    },
    "top_failure_patterns": [
        "Missing pricing information",
        "Mentioned unavailable products",
        "Incorrect warranty details"
    ]
}
```

### Alerting Thresholds

```python
if pass_rate < 0.85:
    alert_team("Quality degradation detected")
    # Auto-switch to fallback system
    enable_human_review_mode()

# moderation_failures_last_hour: count of flagged responses in the past hour
if moderation_failures_last_hour > 10:
    alert_security("Unusual content policy violations")
```
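The `pass_rate` used for alerting has to be computed over some window. One possible sketch keeps the most recent N results in a `deque`; the class name, window size, and the rule of alerting only on a full window are assumptions for illustration.

```python
from collections import deque


class RollingPassRate:
    """Pass rate over the most recent `window` evaluations."""

    def __init__(self, window=100, threshold=0.85):
        self.results = deque(maxlen=window)  # old results drop off automatically
        self.threshold = threshold

    def record(self, passed):
        self.results.append(bool(passed))

    @property
    def pass_rate(self):
        return sum(self.results) / len(self.results) if self.results else None

    def should_alert(self):
        # Only alert once the window is full, to avoid noise from tiny samples
        full = len(self.results) == self.results.maxlen
        return full and self.pass_rate < self.threshold


monitor = RollingPassRate(window=4, threshold=0.85)
for passed in [True, True, False, True]:
    monitor.record(passed)
print(monitor.pass_rate, monitor.should_alert())
```

A count-based window is simple but ignores time; a low-traffic system may prefer a time-based window so that stale results do not mask a recent regression.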

---

## Summary: Evaluation Best Practices

### Implementation
1. **Multi-Layer**: Moderation → Factual check → Completeness
2. **Binary First**: Y/N for speed, expand to scores later
3. **Log Everything**: Track patterns in failures
4. **Auto-Fallback**: Have a plan B for responses that fail evaluation

### When to Use LLM-as-Judge
✅ High-volume systems (can't human-review all)  
✅ Consistent criteria needed  
✅ Cost of bad output > cost of evaluation  
✅ Clear right/wrong answers

### When NOT to Use
❌ Subjective quality (creativity, style)  
❌ Low volume (human review feasible)  
❌ Cost-sensitive (evaluation adds overhead)

### Key Metrics to Track
- Pass rate (target: >90%)
- Evaluation latency (<500ms)
- False positive rate (judge rejects good responses)
- False negative rate (judge accepts bad responses)
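The two error rates can only be measured against human labels on a sample of responses. A sketch of the calculation, using the definitions above and hypothetical names (`judge_error_rates`, and pairs of `(human_ok, judge_ok)` booleans):

```python
def judge_error_rates(pairs):
    """Compute judge error rates from (human_ok, judge_ok) boolean pairs.

    Uses this notebook's convention:
      false positive = judge rejects a response a human rated good
      false negative = judge accepts a response a human rated bad
    Each rate is None when there are no samples of that kind.
    """
    pairs = list(pairs)
    good = [judge_ok for human_ok, judge_ok in pairs if human_ok]
    bad = [judge_ok for human_ok, judge_ok in pairs if not human_ok]
    fp = sum(1 for j in good if not j) / len(good) if good else None
    fn = sum(1 for j in bad if j) / len(bad) if bad else None
    return {"false_positive_rate": fp, "false_negative_rate": fn}


# Made-up labels for illustration: 4 good responses, 2 bad ones
sample = [(True, True), (True, False), (True, True), (True, True),
          (False, False), (False, True)]
print(judge_error_rates(sample))
```

False negatives are usually the costlier error in this setting, since they are the bad responses that actually reach users.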

### Next Steps
- **G.CustomerServiceBot** - Full system with evaluation
- **H.EvaluatingLLMPerformance** - Performance benchmarking
- **I.EvaluationUsingRubric** - Multi-dimensional scoring

In [None]:
another_response = "life is like a box of chocolates"
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{another_response}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages)
print(response)