# Session 2: Testing Strategies for AI Applications

**Salesforce AI Workshop Series**

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Understand why** traditional testing approaches fail for LLM applications
2. **Build automated evaluation harnesses** that run 100 tests in 3 minutes
3. **Implement G-Eval scoring** with custom criteria and multi-dimensional quality assessment
4. **Generate synthetic test datasets** with paraphrases and adversarial inputs
5. **Catch regressions** before they reach production through systematic testing

## Prerequisites

- Basic Python knowledge (functions, classes, decorators)
- Session 1 completed (understanding of DevHub architecture and observability)
- No prior AI testing experience required

## Session Format

- **~2.5 hours hands-on**
- Instructor demos followed by your labs
- All code runs in this notebook
- We'll test the DevHub application from Session 1

## The Problem: "The New Prompt Broke Everything"

Picture this scenario...

You manually test 20 prompt variations over **2 days**. Everything looks good. You deploy the new prompt to production.

**2 hours later**, your Slack explodes:

> "The refund responses are completely wrong now!" - @support-lead
> "Order status queries are returning gibberish" - @customer-success
> "Did someone change something??" - @engineering-manager

You check your manual test results. All 20 tests passed! But those tests were all about **refunds**. Nobody tested order status. Nobody tested shipping queries. Nobody tested edge cases.

**The new prompt fixed refunds but broke everything else.**

You have NO IDEA:
- How many features are actually broken
- Which prompt change caused the regression
- How to prevent this from happening again

**This session teaches you how to NEVER be in this situation again.**

## What We'll Build Today

1. **Understand** why LLM outputs can't be tested like traditional software
2. **Learn** DeepEval - the pytest-like framework for LLM testing
3. **Master** G-Eval - the state-of-the-art LLM-as-judge approach
4. **Build** an automated evaluation harness for DevHub (Lab 1)
5. **Generate** synthetic test data and run regression tests (Lab 2)

---

## Google Colab Setup

If you're running this in Google Colab:

1. **Runtime → Change runtime type → Python 3**
2. No GPU needed for this session
3. All data is loaded from this notebook

Let's start by installing the required packages...

In [None]:
# =============================================================================
# INSTALL REQUIRED PACKAGES
# =============================================================================
# Run this cell first! It installs all dependencies needed for this session.
# This may take 1-2 minutes on first run.

# Version specifiers are quoted so the shell does not treat ">" as output redirection.
!pip install -q \
    "deepeval>=1.0.0" \
    "openai>=1.0.0" \
    "chromadb>=0.4.0" \
    "pandas>=2.0.0" \
    "rich>=13.0.0" \
    "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    langchain-core \
    langchain-community \
    langchain-text-splitters

print("All packages installed successfully!")

In [None]:
# =============================================================================
# CONFIGURATION
# =============================================================================
# These credentials are needed for LLM-as-judge evaluation.

import os

# -----------------------------------------------------------------------------
# OpenAI Configuration (for G-Eval metrics)
# -----------------------------------------------------------------------------
# Your instructor will provide this key
OPENAI_API_KEY = "sk-..."  # INSTRUCTOR: Fill this before class
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# -----------------------------------------------------------------------------
# Student Identity
# -----------------------------------------------------------------------------
# Change this to your name (lowercase, no spaces)
# This helps identify your test results
STUDENT_NAME = "your-name-here"  # Example: "john-smith"

# Validate student name
if STUDENT_NAME == "your-name-here" or " " in STUDENT_NAME or STUDENT_NAME != STUDENT_NAME.lower():
    print("ERROR: Please set STUDENT_NAME to your name (lowercase, no spaces)")
    print("   Example: STUDENT_NAME = 'john-smith'")
else:
    print(f"Student identity set: {STUDENT_NAME}")

In [None]:
# =============================================================================
# TEST OPENAI CONNECTION
# =============================================================================
# G-Eval metrics use OpenAI for LLM-as-judge evaluation.

from openai import OpenAI

try:
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say 'Testing ready!' in exactly 2 words"}],
        max_tokens=10
    )

    result = response.choices[0].message.content
    print("OpenAI connection successful!")
    print(f"   Response: {result}")

except Exception as e:
    print(f"OpenAI connection failed: {e}")
    print("   Check that OPENAI_API_KEY is set correctly")

In [None]:
# =============================================================================
# TEST DEEPEVAL INSTALLATION
# =============================================================================
# Verify DeepEval is installed and working.

try:
    from deepeval import evaluate
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase

    print("DeepEval imported successfully!")
    print("   Available: evaluate, GEval, LLMTestCase")

except ImportError as e:
    print(f"DeepEval import failed: {e}")
    print("   Try running the install cell again")

---

## Setup Complete!

If you see success messages above for:
- Packages installed
- Student identity set
- OpenAI connection
- DeepEval imported

**You're ready to begin!**

If any step failed, raise your hand or message in the workshop chat.

---

**Next:** We'll understand why traditional testing doesn't work for LLMs.

---

# Topic 1: Why Traditional Testing Fails for LLMs

Before we can fix the problem, we need to understand it.

Traditional software testing works because outputs are **deterministic**. Given the same input, you always get the same output. LLMs break this assumption completely.

## The Three Fundamental Challenges

### Challenge 1: Non-Deterministic Outputs

The same prompt can produce different responses:

```python
# Traditional software
assert add(2, 3) == 5  # Always true

# LLM output
response = llm("What is 2+3?")
# Could be: "5", "The answer is 5", "2+3=5", "Five", etc.
assert response == "5"  # Often fails!
```

### Challenge 2: Multi-Dimensional Quality

LLM outputs have many quality dimensions:
- **Correctness**: Is the answer factually right?
- **Relevance**: Does it answer what was asked?
- **Coherence**: Is it logically structured?
- **Fluency**: Is it grammatically correct?
- **Tone**: Is it appropriately professional?

A response can be correct but incoherent, or fluent but wrong.
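
One way to make this concrete is to score each dimension separately and apply a minimum bar per dimension. The sketch below is illustrative only: `QualityScores`, `MIN_SCORES`, and all the numbers are hypothetical and not part of any library.

```python
from dataclasses import dataclass, asdict

# Hypothetical minimum score per dimension (0.0 - 1.0); tune for your application.
MIN_SCORES = {
    "correctness": 0.8,
    "relevance": 0.7,
    "coherence": 0.7,
    "fluency": 0.6,
    "tone": 0.6,
}

@dataclass
class QualityScores:
    """Scores for one response, one value per quality dimension."""
    correctness: float
    relevance: float
    coherence: float
    fluency: float
    tone: float

    def failing_dimensions(self) -> list[str]:
        """Return the dimensions whose score is below its minimum."""
        return [name for name, score in asdict(self).items() if score < MIN_SCORES[name]]

# A fluent, well-structured answer that is factually wrong
fluent_but_wrong = QualityScores(
    correctness=0.2, relevance=0.9, coherence=0.9, fluency=0.95, tone=0.9
)
print(fluent_but_wrong.failing_dimensions())  # ['correctness']
```

A single averaged score would hide this failure: the mean of the five values is well above 0.7 even though the answer is wrong.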

### Challenge 3: Scale

Manual evaluation doesn't scale:
- **20 test cases**: 2 days of manual review
- **100 test cases**: 2 weeks of manual review
- **1000 test cases**: Impossible

Industry surveys suggest that **only 5% of teams** track LLM quality metrics in production, and that **81%** do not test prompts before deployment.

## Traditional Testing vs LLM Testing

![Traditional vs LLM Testing](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/01_traditional_vs_llm_testing.svg)

**Traditional Testing:**
- Input → Deterministic Function → Expected Output
- Binary: Pass or Fail
- Fast: Milliseconds per test

**LLM Testing:**
- Input → Non-Deterministic LLM → Many Valid Outputs
- Spectrum: Quality Score (0.0 - 1.0)
- Slow: Needs LLM judge (seconds per test)

**Key Insight:** We need to test *quality*, not *equality*.

In [None]:
# =============================================================================
# DEMO: Why assert == Doesn't Work for LLMs
# =============================================================================
# Let's see the problem firsthand.

from openai import OpenAI

client = OpenAI()

def ask_llm(question: str) -> str:
    """Simple LLM call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,  # Some randomness
        max_tokens=50
    )
    return response.choices[0].message.content

# Ask the same question 3 times
question = "What is the capital of France?"
expected = "Paris"

print(f"Question: {question}")
print(f"Expected: '{expected}'")
print("-" * 50)

for i in range(3):
    response = ask_llm(question)
    matches = response.strip() == expected
    print(f"Response {i+1}: '{response}'")
    print(f"   Exact match: {matches}")
    print()

print("Notice that every response is CORRECT, yet few (if any) match the expected string exactly.")
print("A traditional assert == would fail on most of these.")

## What We Actually Need

Instead of checking **equality**, we need to check **quality**.

### The LLM-as-Judge Approach

Use another LLM to evaluate the output:

```python
# Instead of:
assert response == "Paris"

# We need:
score = evaluate_with_llm(
    question="What is the capital of France?",
    response="The capital of France is Paris.",
    criteria="Is the answer correct?"
)
assert score >= 0.8  # 80% or better
```

### Why This Works

- LLMs understand **semantics**, not just strings
- "Paris" and "The capital is Paris" mean the same thing
- We can define **custom criteria** for what "good" means
- Research on LLM-as-judge evaluation reports that strong LLM judges agree with human raters **more than 80% of the time**, comparable to how often humans agree with each other

In [None]:
# =============================================================================
# DEMO: Quality-Based Testing
# =============================================================================
# Let's build a simple quality checker.

def check_correctness(question: str, response: str, expected_answer: str) -> float:
    """Use LLM to check if response is semantically correct."""

    prompt = f"""Rate how well the response answers the question correctly.

Question: {question}
Expected Answer: {expected_answer}
Actual Response: {response}

Score from 0.0 (completely wrong) to 1.0 (perfectly correct).
Consider semantic equivalence, not exact wording.

Respond with ONLY a number between 0.0 and 1.0."""

    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=10
    )

    try:
        score = float(result.choices[0].message.content.strip())
        return min(max(score, 0.0), 1.0)  # Clamp to 0-1
    except (ValueError, AttributeError):  # Non-numeric or empty judge reply
        return 0.0

# Test it
test_cases = [
    ("What is the capital of France?", "Paris", "Paris"),
    ("What is the capital of France?", "The capital of France is Paris.", "Paris"),
    ("What is the capital of France?", "London", "Paris"),
    ("What is the capital of France?", "Paris is the beautiful capital city of France.", "Paris"),
]

print("Quality-Based Testing Results:")
print("-" * 60)

for question, response, expected in test_cases:
    score = check_correctness(question, response, expected)
    status = "PASS" if score >= 0.8 else "FAIL"
    print(f"Response: '{response[:40]}...'")
    print(f"   Score: {score:.2f} [{status}]")
    print()

## The Cost Problem (And Why DeepEval Exists)

Building quality checkers from scratch has problems:

### 1. Prompt Engineering is Hard
- Getting reliable scores requires careful prompt design
- Different criteria need different prompts
- Edge cases break simple prompts

### 2. Scoring is Inconsistent
- Raw LLM outputs vary
- Need normalization techniques
- Token probabilities help but add complexity
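
A simple form of normalization is to parse the judge's reply defensively and average several samples. The following is a minimal sketch under that assumption; `parse_score` and `average_score` are hypothetical helpers, and the replies are invented for illustration.

```python
import re
from statistics import mean
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [0.0, 1.0].

    Returns None when the reply contains no number, so the caller can
    distinguish "unparseable" from a genuine score of 0.0.
    """
    match = _NUMBER.search(reply)
    if match is None:
        return None
    return min(max(float(match.group()), 0.0), 1.0)

def average_score(replies: list[str]) -> Optional[float]:
    """Average the parseable scores from several judge samples."""
    scores = [s for s in (parse_score(r) for r in replies) if s is not None]
    return mean(scores) if scores else None

# Replies a judge might plausibly return for the same response
replies = ["0.9", "Score: 0.8", "I would rate this 1.0.", "N/A"]
print(average_score(replies))
```

Averaging reduces run-to-run noise at the cost of extra judge calls, which is part of why hand-rolled scoring gets expensive quickly.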

### 3. No Standardization
- Every team builds their own approach
- Hard to compare results across projects
- No best practices

**DeepEval solves all of this** with:
- **50+ research-backed metrics** (G-Eval, Answer Relevancy, Faithfulness, etc.)
- **Pytest-like interface** (familiar to developers)
- **Normalization built-in** (consistent scores)
- **Battle-tested** by thousands of teams

## The State of LLM Testing (Research Statistics)

From Confident AI and industry surveys:

| Statistic | Value | Implication |
|-----------|-------|-------------|
| Teams tracking LLM quality in production | **5%** | Most teams are flying blind |
| Teams that test prompts before deployment | **19%** | 81% deploy without testing |
| AI initiatives that fail to meet outcomes | **70-85%** | Lack of quality assurance |
| G-Eval correlation with human judgment | **0.87** | Better than human-human agreement |
| Automated vs manual testing speed | **100 tests in 3 min** vs **20 tests in 2 days** | Orders of magnitude faster |

**The opportunity:** Implementing proper testing gives you a massive competitive advantage.

## Real-World Example: Gorgias

**Gorgias** is a customer service platform handling millions of support tickets.

### Before Systematic Testing:
- Product managers manually tested 20 prompt variations
- Took 2 days per prompt change
- Still had production incidents from missed edge cases

### After Implementing DeepEval:
- **100+ test cases** run on every prompt change
- **3 minutes** per evaluation cycle
- **Autonomous prompt engineering team** - LLMs improving LLMs
- Catches regressions **before deployment**

> "We use DeepEval daily to version control prompts, run evaluations on regression datasets, and review logs."

Source: [PromptLayer Case Study](https://www.promptlayer.com)

---

## Key Insight: Test Quality, Not Equality

**Traditional Testing:**
```python
assert response == expected_output  # Fails for LLMs
```

**LLM Testing:**
```python
score = evaluate_quality(response, criteria)
assert score >= threshold  # Works!
```

**The shift:**
- From **exact matching** to **quality scoring**
- From **binary pass/fail** to **continuous scores**
- From **manual review** to **automated LLM judges**

---

**Next:** We'll learn DeepEval - the framework that makes this easy.

---

# Topic 2: DeepEval - The Pytest for LLMs

DeepEval is an open-source framework that makes LLM testing as easy as unit testing.

Think of it as **pytest, but for AI outputs**.

## Why DeepEval?

### The Problem with Building Your Own
- Reinventing the wheel for every metric
- Inconsistent scoring across projects
- No community best practices
- Hard to maintain

### What DeepEval Provides

| Feature | Benefit |
|---------|---------|
| **50+ metrics** | Research-backed, ready to use |
| **Pytest integration** | Familiar developer experience |
| **LLM-as-judge** | Semantic evaluation, not string matching |
| **Synthetic data** | Generate test cases automatically |
| **CI/CD ready** | Integrate with GitHub Actions, etc. |
| **Vendor-neutral** | Works with OpenAI, Anthropic, local models |

### Core Philosophy

> "You can't improve what you don't measure, and you can't trust what you don't test."

## DeepEval Architecture

![DeepEval Architecture](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/02_deepeval_architecture.svg)

**Components:**

1. **Test Cases** (`LLMTestCase`)
   - Input, actual output, expected output, context
   - The "unit" of LLM testing

2. **Metrics** (G-Eval, AnswerRelevancy, Faithfulness, etc.)
   - Define what "good" means
   - Each metric produces a score (0-1)

3. **LLM Judge** (GPT-4, Claude, etc.)
   - Evaluates test cases against metrics
   - Uses chain-of-thought reasoning

4. **Evaluation Results**
   - Scores, reasons, pass/fail status
   - Aggregated reports
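
To see how the four components fit together, here is a toy stand-in that runs offline. It does not use DeepEval: `ToyTestCase`, `KeywordMetric`, and `run_evaluation` are hypothetical names, and the "judge" is a keyword check rather than an LLM.

```python
from dataclasses import dataclass

@dataclass
class ToyTestCase:
    """Component 1: one LLM interaction to evaluate."""
    input: str
    actual_output: str

@dataclass
class KeywordMetric:
    """Component 2: defines "good" as the fraction of required keywords present.

    A real metric would delegate to an LLM judge (component 3); the keyword
    check here is only a stand-in so the example runs offline.
    """
    name: str
    keywords: list[str]
    threshold: float = 0.5

    def measure(self, case: ToyTestCase) -> float:
        text = case.actual_output.lower()
        hits = sum(1 for kw in self.keywords if kw.lower() in text)
        return hits / len(self.keywords)

def run_evaluation(cases: list[ToyTestCase], metric: KeywordMetric) -> list[dict]:
    """Component 4: collect score and pass/fail status for every test case."""
    results = []
    for case in cases:
        score = metric.measure(case)
        results.append(
            {"input": case.input, "score": score, "passed": score >= metric.threshold}
        )
    return results

cases = [
    ToyTestCase("How do I authenticate?", "Use OAuth 2.0 and send a Bearer token."),
    ToyTestCase("How do I authenticate?", "Please contact support."),
]
metric = KeywordMetric(name="MentionsAuth", keywords=["oauth", "token"], threshold=0.5)
for result in run_evaluation(cases, metric):
    print(result)
```

The shape is the point here: test cases go in, a metric turns each one into a score, and the threshold turns the score into pass/fail.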

## Anatomy of an LLM Test Case

A test case in DeepEval represents a single LLM interaction:

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    # REQUIRED
    input="What is the capital of France?",           # The prompt
    actual_output="Paris is the capital of France.",  # LLM response

    # OPTIONAL (depending on metric)
    expected_output="Paris",                          # Ground truth
    context=["France is a country in Europe..."],    # Ideal (ground-truth) context
    retrieval_context=["Doc 1", "Doc 2"],            # Chunks your RAG pipeline actually retrieved
)
```

### When to Use Each Field

| Field | Use When |
|-------|----------|
| `input` | Always (what you asked) |
| `actual_output` | Always (what LLM said) |
| `expected_output` | Testing correctness |
| `context` | Testing against knowledge base |
| `retrieval_context` | Testing RAG systems |

In [None]:
# =============================================================================
# DEMO: Create Your First LLM Test Case
# =============================================================================
# A test case captures one LLM interaction for evaluation.

from deepeval.test_case import LLMTestCase

# Create a simple test case
test_case = LLMTestCase(
    input="How do I authenticate with the Payments API?",
    actual_output="""To authenticate with the Payments API, use OAuth 2.0:
    1. Get your client_id and client_secret from the Developer Portal
    2. Make a POST request to /oauth/token
    3. Include the access_token in the Authorization header as 'Bearer {token}'""",
    expected_output="Use OAuth 2.0 with client credentials flow. Get credentials from Developer Portal."
)

print("Test Case Created!")
print(f"   Input: {test_case.input[:50]}...")
print(f"   Output: {test_case.actual_output[:50]}...")
print(f"   Expected: {test_case.expected_output[:50]}...")

## DeepEval Metrics: Measuring Quality

DeepEval provides **50+ metrics** organized by use case:

### Correctness & Accuracy
| Metric | What It Measures |
|--------|-----------------|
| **G-Eval** | Custom criteria (most flexible) |
| **AnswerRelevancy** | Does output answer the input? |
| **Correctness** (defined via G-Eval) | Factual accuracy vs expected output |

### RAG-Specific
| Metric | What It Measures |
|--------|-----------------|
| **Faithfulness** | Is output grounded in context? |
| **ContextualRelevancy** | Is retrieved context relevant? |
| **ContextualPrecision** | How precise is the retrieval? |

### Safety & Quality
| Metric | What It Measures |
|--------|-----------------|
| **Hallucination** | Did LLM make things up? |
| **Toxicity** | Harmful content detection |
| **Bias** | Unfair or biased responses |

**For this workshop, we'll focus on G-Eval** - the most flexible and powerful metric.

## Test Case Structure

![Test Case Structure](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/04_test_case_structure.svg)

**Key Points:**
- `input` and `actual_output` are always required
- Other fields depend on which metric you're using
- G-Eval can use any combination of fields
- The metric determines which fields matter
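
Because the metric decides which fields matter, it can be worth checking a test case before spending judge calls on it. The sketch below assumes a hand-written `REQUIRED_FIELDS` mapping and a hypothetical `missing_fields` helper; the mapping is an illustration, not DeepEval's own validation table.

```python
# Hypothetical mapping from metric name to the test case fields it reads.
REQUIRED_FIELDS = {
    "answer_relevancy": ["input", "actual_output"],
    "correctness": ["input", "actual_output", "expected_output"],
    "faithfulness": ["input", "actual_output", "retrieval_context"],
}

def missing_fields(test_case: dict, metric_name: str) -> list[str]:
    """Return the fields the metric needs that are absent or empty in the test case."""
    return [f for f in REQUIRED_FIELDS[metric_name] if not test_case.get(f)]

case = {
    "input": "What is the capital of France?",
    "actual_output": "Paris is the capital of France.",
}

print(missing_fields(case, "answer_relevancy"))  # []
print(missing_fields(case, "correctness"))       # ['expected_output']
```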

In [None]:
# =============================================================================
# DEMO: Run Your First Evaluation
# =============================================================================
# Let's evaluate a test case with a simple metric.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris, a beautiful city known for the Eiffel Tower."
)

# Create metric
relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,  # Pass if score >= 0.7
    model="gpt-4o-mini"
)

# Run evaluation (prints a report for the whole batch)
print("Running evaluation...")
result = evaluate([test_case], [relevancy_metric])

# Measure directly so the score and reason can be read from the metric object.
# Note: this repeats the judge calls that evaluate() already made.
relevancy_metric.measure(test_case)

# Show results
print("\nResults:")
print(f"   Score: {relevancy_metric.score:.2f}")
print(f"   Passed: {relevancy_metric.score >= relevancy_metric.threshold}")
print(f"   Reason: {relevancy_metric.reason}")

In [None]:
# =============================================================================
# DEMO: Evaluate Multiple Test Cases
# =============================================================================
# In practice, you'll run many test cases at once.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Create multiple test cases
test_cases = [
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France."
    ),
    LLMTestCase(
        input="What is the capital of France?",
        actual_output="France is a country in Europe with great food."  # Off-topic!
    ),
    LLMTestCase(
        input="Who wrote Romeo and Juliet?",
        actual_output="William Shakespeare wrote Romeo and Juliet in the late 16th century."
    ),
]

# Create metric
relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini"
)

# Run evaluation on all test cases
print(f"Evaluating {len(test_cases)} test cases...")
results = evaluate(test_cases, [relevancy_metric])

# Show summary
print("\nResults Summary:")
for i, tc in enumerate(test_cases):
    # Re-run metric on each case for individual scores
    relevancy_metric.measure(tc)
    status = "PASS" if relevancy_metric.score >= 0.7 else "FAIL"
    print(f"\nTest {i+1}: [{status}]")
    print(f"   Input: {tc.input[:40]}...")
    print(f"   Output: {tc.actual_output[:40]}...")
    print(f"   Score: {relevancy_metric.score:.2f}")

## Pytest Integration

DeepEval integrates seamlessly with pytest for CI/CD:

```python
# test_llm_outputs.py

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_capital_question():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France."
    )

    metric = AnswerRelevancyMetric(threshold=0.7)

    # This raises an exception if the test fails
    assert_test(test_case, [metric])
```

**Run with:**
```bash
deepeval test run test_llm_outputs.py
```

**CI/CD Integration:**
```yaml
# .github/workflows/llm-tests.yml
- name: Run LLM Tests
  run: deepeval test run tests/
```

## Key Concept: Thresholds

Every metric has a **threshold** that determines pass/fail:

```python
metric = AnswerRelevancyMetric(
    threshold=0.7  # Score must be >= 0.7 to pass
)
```

### Choosing Thresholds

| Threshold | Use Case |
|-----------|----------|
| **0.9+** | Critical accuracy (medical, legal) |
| **0.7-0.9** | Standard quality (most applications) |
| **0.5-0.7** | Lenient (creative tasks, drafts) |

### Best Practice: Start Low, Increase Over Time

1. Start with `threshold=0.5` to establish baseline
2. Measure current performance
3. Gradually increase as you improve prompts
4. Different metrics may need different thresholds
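
The first three steps could be sketched as follows: record baseline scores for the current prompt, then place the threshold near the low end of what it already achieves. `suggest_threshold` is a hypothetical helper, and the 10th-percentile rule and the baseline numbers are assumptions chosen for illustration.

```python
from statistics import quantiles

def suggest_threshold(baseline_scores: list[float], floor: float = 0.5) -> float:
    """Suggest a threshold at the 10th percentile of baseline scores.

    Roughly 90% of the baseline scores sit at or above the returned value, so
    the current prompt mostly passes while clearly worse outputs fail. The
    result never drops below `floor`.
    """
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline scores")
    p10 = quantiles(baseline_scores, n=10)[0]  # first decile cut point
    return round(max(p10, floor), 2)

# Made-up baseline scores from one evaluation run
baseline = [0.62, 0.71, 0.75, 0.78, 0.80, 0.82, 0.85, 0.88, 0.90, 0.93]
print(suggest_threshold(baseline))
```

Re-running this after each prompt improvement gives a data-driven way to raise the bar gradually rather than guessing a number.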

In [None]:
# =============================================================================
# DEMO: Impact of Different Thresholds
# =============================================================================
# Same test case, different thresholds = different results.

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A borderline response
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris, known for the Eiffel Tower, is a major European city."
)

# Test with different thresholds
thresholds = [0.5, 0.7, 0.9]

print("Same response, different thresholds:")
print("-" * 50)

for threshold in thresholds:
    metric = AnswerRelevancyMetric(
        threshold=threshold,
        model="gpt-4o-mini"
    )
    metric.measure(test_case)

    status = "PASS" if metric.score >= threshold else "FAIL"
    print(f"\nThreshold: {threshold}")
    print(f"   Score: {metric.score:.2f}")
    print(f"   Result: [{status}]")

---

## Coming Up: Lab 1

In Lab 1, you'll build an evaluation harness for DevHub that:

1. Creates test cases for different query types:
   - Documentation search
   - Owner lookup
   - Status checks

2. Uses G-Eval with custom criteria:
   - Correctness
   - Helpfulness
   - Handling edge cases

3. Runs automated evaluation:
   - 10+ test cases
   - Multiple metrics
   - Clear pass/fail results

But first, we need to understand G-Eval - the most powerful metric in DeepEval.

---

## Key Takeaways: DeepEval Basics

1. **Test Cases** capture LLM interactions (input, output, optional fields)

2. **Metrics** define what "good" means (50+ available)

3. **Thresholds** determine pass/fail (start low, increase over time)

4. **Pytest integration** enables CI/CD pipelines

5. **Batch evaluation** tests many cases quickly (100 in 3 minutes)

---

**Next:** Deep dive into G-Eval - the most flexible and powerful metric.

---

# Topic 3: G-Eval - LLM-as-Judge with Chain-of-Thought

G-Eval is the **most flexible metric** in DeepEval. It lets you define **any evaluation criteria** in plain English, then uses an LLM to score against that criteria.

Think of it as having an expert reviewer who follows your exact rubric.

## How G-Eval Works

### The G-Eval Algorithm (from the research paper)

1. **Criteria Definition**: You define what "good" looks like in plain English

2. **CoT Generation**: G-Eval generates evaluation steps (chain of thought)

3. **Form-Filling**: Creates a structured evaluation prompt

4. **Token Probabilities**: Uses LLM output probabilities for scoring

5. **Weighted Summation**: Normalizes scores (1-5 scale internally, 0-1 output)
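
Steps 4 and 5 can be illustrated with a small calculation. This is a sketch that assumes the judge exposes a probability for each rating token on the 1-5 scale; `probability_weighted_score` is a hypothetical helper and the probabilities are made up.

```python
def probability_weighted_score(token_probs: dict[int, float],
                               low: int = 1, high: int = 5) -> float:
    """Compute the expected rating under the judge's token probabilities,
    then rescale it from the [low, high] rating scale to [0.0, 1.0]."""
    total = sum(token_probs.values())
    if total <= 0:
        raise ValueError("token probabilities must sum to a positive value")
    # Renormalize in case the probabilities of the rating tokens do not sum to 1
    expected = sum(rating * p for rating, p in token_probs.items()) / total
    return (expected - low) / (high - low)

# Made-up probabilities for the rating tokens 1..5
probs = {1: 0.0, 2: 0.0, 3: 0.25, 4: 0.5, 5: 0.25}
print(probability_weighted_score(probs))  # 0.75
```

Compared with taking only the single most likely rating, the weighted form produces finer-grained scores and fewer ties between responses.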

### Why This Matters

- **0.87 correlation** with human judgment (better than human-human!)
- **Consistent** across evaluations (unlike raw LLM scoring)
- **Customizable** for any use case
- **Explainable** - provides reasoning for each score

## G-Eval Evaluation Pipeline

![G-Eval Pipeline](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/03_geval_pipeline.svg)

**Step-by-step:**

1. **Your Criteria** → "Is the response helpful and actionable?"

2. **CoT Generation** → LLM generates evaluation steps:
   - Check if response addresses the question
   - Check if it provides specific actions
   - Check if information is accurate

3. **Form-Filling** → Structured prompt with test case data

4. **LLM Scoring** → Model outputs score with probabilities

5. **Normalization** → Token probabilities weighted for final score
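
The form-filling step can be pictured as plain string templating. The template wording below is an assumption for illustration, not the prompt DeepEval actually sends, and `build_judge_prompt` is a hypothetical helper.

```python
def build_judge_prompt(criteria: str, steps: list[str], fields: dict[str, str]) -> str:
    """Assemble a form-filling evaluation prompt from criteria, steps, and test case fields."""
    numbered_steps = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    field_block = "\n".join(f"{name}: {value}" for name, value in fields.items())
    return (
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Evaluation steps:\n{numbered_steps}\n\n"
        f"{field_block}\n\n"
        "Rating (1-5):"
    )

prompt = build_judge_prompt(
    criteria="Is the response helpful and actionable?",
    steps=[
        "Check if the response addresses the question",
        "Check if it provides specific actions",
        "Check if the information is accurate",
    ],
    fields={
        "Input": "How do I reset my password?",
        "Actual output": "Click 'Forgot Password' on the login page.",
    },
)
print(prompt)
```

Ending the prompt at the rating slot is what lets the judge's next-token probabilities be read off as a score distribution.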

## G-Eval Configuration

### Required Parameters

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

metric = GEval(
    name="Helpfulness",                    # Metric name (for reports)
    criteria="Is the response helpful?",   # What to evaluate
    evaluation_params=[                    # Which test case fields to use
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
)
```

### Optional Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `threshold` | 0.5 | Pass/fail cutoff |
| `model` | gpt-4o (may differ by DeepEval version) | LLM for judging |
| `evaluation_steps` | Auto-generated | Custom evaluation steps |
| `strict_mode` | False | Only pass if score == 1.0 |

In [None]:
# =============================================================================
# DEMO: Basic G-Eval Usage
# =============================================================================
# G-Eval with a simple criteria.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the metric
helpfulness_metric = GEval(
    name="Helpfulness",
    criteria="Determine if the response provides helpful, actionable information that directly addresses the user's question.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.7,
    model="gpt-4o-mini"
)

# Create test case
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="""To reset your password:
    1. Go to the login page
    2. Click 'Forgot Password'
    3. Enter your email
    4. Check your inbox for the reset link
    5. Create a new password (8+ characters, include numbers)"""
)

# Evaluate
helpfulness_metric.measure(test_case)

print("G-Eval Helpfulness Results:")
print(f"   Score: {helpfulness_metric.score:.2f}")
print(f"   Passed: {helpfulness_metric.score >= 0.7}")
print(f"   Reason: {helpfulness_metric.reason}")

In [None]:
# =============================================================================
# DEMO: Multiple G-Eval Metrics
# =============================================================================
# Evaluate the same response against multiple criteria.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define multiple metrics
metrics = [
    GEval(
        name="Correctness",
        criteria="Is the response factually correct and accurate?",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT
        ],
        threshold=0.8,
        model="gpt-4o-mini"
    ),
    GEval(
        name="Coherence",
        criteria="Is the response logically structured and easy to follow?",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT
        ],
        threshold=0.7,
        model="gpt-4o-mini"
    ),
    GEval(
        name="Conciseness",
        criteria="Is the response appropriately concise without unnecessary information?",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT
        ],
        threshold=0.6,
        model="gpt-4o-mini"
    ),
]

# Create test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="""The capital of France is Paris. Paris is located in northern France
    along the Seine River. It's known as the City of Light and is famous for the Eiffel Tower,
    the Louvre Museum, and Notre-Dame Cathedral. The city has a population of about 2 million
    in the city proper and over 12 million in the metropolitan area.""",
    expected_output="Paris"
)

# Evaluate with all metrics
print("Multi-Metric Evaluation:")
print("=" * 60)

for metric in metrics:
    metric.measure(test_case)
    status = "PASS" if metric.score >= metric.threshold else "FAIL"
    print(f"\n{metric.name}:")
    print(f"   Score: {metric.score:.2f} (threshold: {metric.threshold})")
    print(f"   Result: [{status}]")
    print(f"   Reason: {metric.reason[:100]}...")

## Choosing Evaluation Parameters

The `evaluation_params` tell G-Eval which test case fields to consider:

### Available Parameters

```python
from deepeval.test_case import LLMTestCaseParams

LLMTestCaseParams.INPUT           # The user's question/prompt
LLMTestCaseParams.ACTUAL_OUTPUT   # The LLM's response
LLMTestCaseParams.EXPECTED_OUTPUT # Ground truth (if available)
LLMTestCaseParams.CONTEXT         # Knowledge base content
LLMTestCaseParams.RETRIEVAL_CONTEXT  # RAG retrieved documents
```

### Common Combinations

| Use Case | Parameters |
|----------|------------|
| **Answer relevancy** | INPUT + ACTUAL_OUTPUT |
| **Correctness** | INPUT + ACTUAL_OUTPUT + EXPECTED_OUTPUT |
| **Faithfulness (RAG)** | ACTUAL_OUTPUT + CONTEXT |
| **Groundedness** | ACTUAL_OUTPUT + RETRIEVAL_CONTEXT |
| **Coherence** | ACTUAL_OUTPUT only |

In [None]:
# =============================================================================
# DEMO: G-Eval with Custom Evaluation Steps
# =============================================================================
# You can provide your own evaluation steps instead of auto-generating.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define metric with explicit evaluation steps
api_quality_metric = GEval(
    name="API Documentation Quality",
    criteria="Does the response provide high-quality API documentation guidance?",
    evaluation_steps=[
        "Check if the response includes the correct API endpoint or method",
        "Verify that authentication requirements are mentioned",
        "Confirm that example code or usage patterns are provided",
        "Assess if error handling guidance is included",
        "Evaluate if the response is actionable for a developer"
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.7,
    model="gpt-4o-mini"
)

# Test case
test_case = LLMTestCase(
    input="How do I authenticate with the Payments API?",
    actual_output="""To authenticate with the Payments API:

1. **Get Credentials**: Obtain your client_id and client_secret from the Developer Portal.

2. **Request Token**:
```
POST /oauth/token
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=YOUR_ID&client_secret=YOUR_SECRET
```

3. **Use Token**: Include in all requests:
```
Authorization: Bearer YOUR_ACCESS_TOKEN
```

**Note**: Tokens expire after 1 hour. Handle 401 errors by refreshing the token."""
)

# Evaluate
api_quality_metric.measure(test_case)

print("API Documentation Quality Evaluation:")
print(f"   Score: {api_quality_metric.score:.2f}")
print(f"   Reason: {api_quality_metric.reason}")

## Designing Metrics for DevHub

For our DevHub application (from Session 1), we need metrics that capture its specific quality needs:

### Metric 1: Documentation Helpfulness
**For:** `search_docs` tool responses
**Criteria:** Does the response provide helpful documentation that addresses the query?

### Metric 2: Owner Information Accuracy
**For:** `find_owner` tool responses
**Criteria:** Is the owner information accurate and up-to-date?

### Metric 3: Status Clarity
**For:** `check_status` tool responses
**Criteria:** Is the status clearly communicated with actionable information?

### Metric 4: Stale Data Handling
**For:** Any response with inactive owners
**Criteria:** Does the response appropriately flag stale or inactive information?

### Metric 5: Error Acknowledgment
**For:** Responses when tools fail
**Criteria:** Does the response acknowledge errors honestly without making things up?
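
Because each metric targets a specific tool or situation, it can help to record that mapping explicitly and select metrics per response. The following is a minimal sketch under assumed names: the `tool`, `has_inactive_owner`, and `tool_failed` keys are hypothetical fields you would need to capture yourself (for example from the Session 1 traces), and metrics are represented by their names rather than by `GEval` objects.

```python
# Illustrative routing table; keys and flags are assumptions, not DevHub APIs.
METRICS_BY_TOOL = {
    "search_docs": ["Documentation Helpfulness"],
    "find_owner": ["Owner Information Accuracy"],
    "check_status": ["Status Clarity"],
}


def metrics_for(response: dict) -> list[str]:
    """Pick the metric names that apply to a single recorded response."""
    selected = list(METRICS_BY_TOOL.get(response.get("tool"), []))
    if response.get("has_inactive_owner"):
        selected.append("Stale Data Handling")
    if response.get("tool_failed"):
        selected.append("Error Acknowledgment")
    return selected


print(metrics_for({"tool": "find_owner", "has_inactive_owner": True}))
# ['Owner Information Accuracy', 'Stale Data Handling']
```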

In [None]:
# =============================================================================
# DEMO: DevHub-Specific G-Eval Metrics
# =============================================================================
# Metrics tailored for our DevHub use cases.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Metric 1: Documentation Helpfulness
docs_helpfulness = GEval(
    name="Documentation Helpfulness",
    criteria="""Evaluate if the response provides helpful documentation guidance that:
    1. Directly addresses the user's question
    2. Includes specific, actionable steps
    3. References relevant code examples or endpoints when appropriate
    4. Is technically accurate""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.7,
    model="gpt-4o-mini"
)

# Metric 4 (from the list above): Stale Data Handling
stale_data_handling = GEval(
    name="Stale Data Handling",
    criteria="""Evaluate if the response appropriately handles potentially stale data:
    1. If an owner is marked as inactive, this should be mentioned
    2. Alternative contacts (team channel) should be suggested
    3. The response should not present stale info as current
    4. Any uncertainty about data freshness should be acknowledged""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.8,  # Higher threshold for data quality
    model="gpt-4o-mini"
)

print("DevHub Metrics Defined!")
print(f"   1. {docs_helpfulness.name} (threshold: {docs_helpfulness.threshold})")
print(f"   2. {stale_data_handling.name} (threshold: {stale_data_handling.threshold})")

## G-Eval Best Practices

### 1. Write Clear Criteria
```python
# BAD - vague
criteria="Is the response good?"

# GOOD - specific
criteria="Does the response provide accurate API authentication steps including endpoint, required parameters, and example code?"
```

### 2. Choose Appropriate Thresholds
- Start with **0.5-0.6** for new metrics
- Measure baseline performance
- Increase threshold as prompts improve
- Critical metrics: **0.8-0.9**
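
One way to turn "measure baseline performance" into a number is to take the scores from a baseline run and set the threshold slightly below their low end, so current behavior passes while a clear drop fails. The sketch below assumes a hypothetical `suggest_threshold` helper; the 10th percentile, the 0.05 margin, and the 0.5/0.9 bounds are arbitrary starting points, not recommendations from DeepEval.

```python
import math


def suggest_threshold(baseline_scores, percentile=10, margin=0.05,
                      floor=0.5, ceiling=0.9):
    """Suggest a threshold just below the low end of the baseline scores."""
    if not baseline_scores:
        return floor
    ordered = sorted(baseline_scores)
    # Nearest-rank percentile: position of the score at the given percentile.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    low_score = ordered[rank - 1]
    # Leave a small margin, then clamp into the [floor, ceiling] range.
    return round(min(ceiling, max(floor, low_score - margin)), 2)


# Made-up baseline scores for one metric.
baseline = [0.62, 0.71, 0.74, 0.78, 0.80, 0.81, 0.85, 0.88, 0.90, 0.93]
print(suggest_threshold(baseline))  # 0.57
```

As prompts improve and the baseline scores rise, re-running the helper yields a higher suggested threshold.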

### 3. Use Custom Evaluation Steps for Complex Criteria
- Auto-generated steps work for simple criteria
- Complex domains benefit from explicit steps
- Steps should be checkable (yes/no or scale)

### 4. Test Your Metrics
- Run on known good/bad examples
- Verify scores match expectations
- Adjust criteria if needed

### 5. Name Metrics Descriptively
- Good: `"API_Auth_Completeness"`
- Bad: `"Metric1"`

In [None]:
# =============================================================================
# DEMO: Testing G-Eval Metrics
# =============================================================================
# Always test your metrics on known good/bad examples.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Create a metric
clarity_metric = GEval(
    name="Response Clarity",
    criteria="Is the response clear, well-structured, and easy to understand?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model="gpt-4o-mini"
)

# Test on known examples
test_examples = [
    {
        "name": "Good response (should score high)",
        "output": """Here are the steps to reset your password:
        1. Go to login page
        2. Click "Forgot Password"
        3. Enter your email
        4. Check inbox for reset link
        Done!"""
    },
    {
        "name": "Bad response (should score low)",
        "output": "password reset you need to do stuff with email thing and then click and login idk it's confusing just ask someone"
    }
]

print("Testing Clarity Metric:")
print("=" * 60)

for example in test_examples:
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output=example["output"]
    )
    clarity_metric.measure(test_case)

    print(f"\n{example['name']}")
    print(f"   Score: {clarity_metric.score:.2f}")
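    expected = "High (>=0.7)" if "Good" in example["name"] else "Low (<0.7)"
    print(f"   Expected: {expected}")
    # Use the metric's own threshold so this check agrees with its pass/fail rule
    matches = (clarity_metric.score >= clarity_metric.threshold) == ("Good" in example["name"])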
    print(f"   Matches: {'Yes' if matches else 'NO - Adjust metric!'}")

## When to Use G-Eval vs Built-in Metrics

### Use G-Eval When:
- You need **custom criteria** specific to your domain
- Built-in metrics don't capture what you need
- You want **explainable scoring** with reasons
- Evaluating **tone, style, or domain-specific requirements**

### Use Built-in Metrics When:
- Standard evaluations (relevancy, faithfulness)
- **RAG systems** (ContextualPrecision, Faithfulness)
- **Safety checks** (Toxicity, Bias)
- You want **faster evaluation** (some built-ins are optimized)

### Common Combinations

```python
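from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# For a RAG chatbot:
metrics = [
    AnswerRelevancyMetric(),   # Is it relevant?
    FaithfulnessMetric(),      # Is it grounded in context?
    GEval(name="Helpfulness", criteria="Is it helpful?",  # Custom quality check
          evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]),
]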
```

---

## Key Takeaways: G-Eval

1. **G-Eval is the most flexible metric** - define any criteria in plain English

2. **Chain-of-thought reasoning** provides explainable scores

3. **0.87 correlation** with human judgment - better than human-human!

4. **Choose parameters carefully** - they determine what's evaluated

5. **Test your metrics** on known good/bad examples before deploying

---

**Next:** Lab 1 - Build an evaluation harness for DevHub using G-Eval.

---

# Lab 1: Build an Evaluation Harness for DevHub

Time to get your hands dirty! In this lab, you'll build a complete testing system for the instrumented DevHub from Session 1.

**Duration:** ~30 minutes

**What you'll do:**
1. Load DevHub (instrumented version from Session 1)
2. Create test cases for different query types
3. Define G-Eval metrics for quality dimensions
4. Run automated evaluation
5. Analyze results

**Scaffolding decreases** as you progress:
- Task 1: Full guidance
- Task 2: Medium guidance
- Task 3: Minimal guidance

## What You'll Build

By the end of this lab, you'll have:

```
DevHub Evaluation Harness
├── Test Cases (10+)
│   ├── Documentation queries
│   ├── Owner lookup queries
│   └── Status check queries
│
├── G-Eval Metrics (3+)
│   ├── Response Correctness
│   ├── Response Helpfulness
│   └── Edge Case Handling
│
└── Evaluation Results
    ├── Per-test scores
    ├── Aggregate metrics
    └── Pass/fail summary
```

This harness can be run automatically to catch regressions!
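
The "Evaluation Results" branch of the tree can be pictured as a small aggregation over (test, metric, score, threshold) records. Here is a sketch in plain Python with made-up scores; the record layout and the `summarize` helper are assumptions for illustration, not DeepEval's result format.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Aggregate per-(test, metric) records into harness-level results."""
    by_metric = defaultdict(list)
    test_passed = {}
    for r in records:
        ok = r["score"] >= r["threshold"]
        by_metric[r["metric"]].append(r["score"])
        # A test passes only if every one of its metrics passes.
        test_passed[r["test"]] = test_passed.get(r["test"], True) and ok
    passed = sum(test_passed.values())
    return {
        "mean_score_by_metric": {m: round(mean(s), 2) for m, s in by_metric.items()},
        "passed": passed,
        "failed": len(test_passed) - passed,
        "failing_tests": sorted(t for t, ok in test_passed.items() if not ok),
    }


# Made-up scores, for illustration only.
records = [
    {"test": "Who owns billing?", "metric": "Correctness", "score": 0.9, "threshold": 0.8},
    {"test": "Who owns billing?", "metric": "Helpfulness", "score": 0.8, "threshold": 0.7},
    {"test": "Is staging working?", "metric": "Correctness", "score": 0.6, "threshold": 0.8},
    {"test": "Is staging working?", "metric": "Helpfulness", "score": 0.9, "threshold": 0.7},
]
summary = summarize(records)
print(summary["passed"], summary["failed"], summary["failing_tests"])
# 1 1 ['Is staging working?']
```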

In [None]:
# =============================================================================
# LOAD DEVHUB (INSTRUMENTED VERSION)
# =============================================================================
# We'll use a simplified stand-in for the instrumented DevHub from Session 1.
# It mirrors the same data and query pattern but leaves out the OpenTelemetry tracing.

import json
from openai import OpenAI

# Simplified DevHub for testing (mirrors the instrumented version structure)
class DevHubSimple:
    """Simplified DevHub for evaluation testing.

    This mirrors the structure of the instrumented DevHub from Session 1,
    with the same data sources and query patterns.
    """

    def __init__(self):
        self.client = OpenAI()

        # Same data as instrumented DevHub
        self.docs = {
            "payments-auth": "Use OAuth 2.0 with client_id and client_secret from Developer Portal.",
            "error-handling": "Return standard error format with code, message, and correlation_id.",
            "rate-limiting": "100 requests/minute authenticated, 10 requests/minute unauthenticated.",
            "search-api": "POST /api/v1/search with query in body. Returns ranked results.",
        }
        self.owners = {
            "billing": {"name": "Sarah Chen", "slack": "@sarah.chen", "team": "payments-team", "is_active": True},
            "vector-search": {"name": "David Kim", "slack": "@david.kim", "team": "search-team", "is_active": False},  # STALE!
            "auth": {"name": "Michael Brown", "slack": "@mbrown", "team": "security-team", "is_active": True},
            "api-gateway": {"name": "Lisa Wang", "slack": "@lwang", "team": "platform-team", "is_active": True},
        }
        self.status = {
            "payments-api": {"status": "healthy", "uptime": 99.95},
            "staging": {"status": "degraded", "incident": "Database connection pool exhaustion"},
            "search-api": {"status": "healthy", "uptime": 99.99},
            "auth-service": {"status": "healthy", "uptime": 99.90},
        }

    def query(self, user_query: str) -> str:
        """Process a query and return response."""
        context = f"""You have access to:
Docs: {json.dumps(self.docs, indent=2)}
Owners: {json.dumps(self.owners, indent=2)}
Status: {json.dumps(self.status, indent=2)}"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"You are DevHub, an internal developer assistant. Use this data to answer queries accurately. If an owner is_active=False, mention this and suggest using the team channel instead.\n\n{context}"},
                {"role": "user", "content": user_query}
            ],
            temperature=0.3,
            max_tokens=500
        )
        return response.choices[0].message.content

# Initialize
devhub = DevHubSimple()

print("DevHub loaded successfully!")
print("   Ready for evaluation testing")
print(f"   Docs: {len(devhub.docs)} entries")
print(f"   Owners: {len(devhub.owners)} entries")
print(f"   Services: {len(devhub.status)} entries")

## Task 1: Create Test Cases

**Goal:** Create test cases that cover DevHub's main functionality.

**What to do:**

1. Create 4+ test cases for **documentation queries** (one per docs entry, so the suite reaches 10+ cases)
2. Create 3+ test cases for **owner lookup queries**
3. Create 3+ test cases for **status check queries**

For each test case, include:
- `input`: The user's question
- `actual_output`: DevHub's response (from `devhub.query()`)
- `expected_output`: What a good answer should contain

**Example:**
```python
test_case = LLMTestCase(
    input="How do I authenticate with the Payments API?",
    actual_output=devhub.query("How do I authenticate with the Payments API?"),
    expected_output="Use OAuth 2.0 with client credentials"
)
```
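
Typing each query twice (once for `input`, once for the `devhub.query()` call) invites copy-paste mismatches. The sketch below builds test cases from (question, expected answer) pairs instead. So that it runs without API access, it uses a stand-in dataclass and a stub query function; `StubTestCase`, `build_test_cases`, and `fake_query` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class StubTestCase:
    """Stand-in with the same three fields used in this lab."""
    input: str
    actual_output: str
    expected_output: str


def build_test_cases(pairs, query_fn, case_cls=StubTestCase):
    """Call query_fn once per question and wrap the result in a test case."""
    return [
        case_cls(input=question, actual_output=query_fn(question), expected_output=expected)
        for question, expected in pairs
    ]


def fake_query(question: str) -> str:
    # Placeholder for devhub.query(); returns a canned string.
    return f"(stub answer to: {question})"


pairs = [
    ("What are the rate limits?", "100 requests/minute authenticated, 10/minute unauthenticated"),
    ("Who owns billing?", "Sarah Chen (@sarah.chen) from payments-team - active owner"),
]
cases = build_test_cases(pairs, fake_query)
print(len(cases), cases[0].actual_output)
```

In the lab, the equivalent call would be `build_test_cases(pairs, devhub.query, case_cls=LLMTestCase)`.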

**Time:** ~10 minutes

In [None]:
# =============================================================================
# TASK 1: Create Test Cases (SOLUTION)
# =============================================================================
# Create test cases for DevHub evaluation.
# =============================================================================

from deepeval.test_case import LLMTestCase

# -----------------------------------------------------------------------------
# Documentation Test Cases
# -----------------------------------------------------------------------------
doc_test_cases = [
    LLMTestCase(
        input="How do I authenticate with the Payments API?",
        actual_output=devhub.query("How do I authenticate with the Payments API?"),
        expected_output="Use OAuth 2.0 with client_id and client_secret from Developer Portal"
    ),
    LLMTestCase(
        input="What is the error handling format?",
        actual_output=devhub.query("What is the error handling format?"),
        expected_output="Standard error format with code, message, and correlation_id"
    ),
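    LLMTestCase(
        input="What are the rate limits?",
        actual_output=devhub.query("What are the rate limits?"),
        expected_output="100 requests/minute authenticated, 10/minute unauthenticated"
    ),
    LLMTestCase(
        input="How do I use the search API?",
        actual_output=devhub.query("How do I use the search API?"),
        expected_output="POST /api/v1/search with query in body, returns ranked results"
    ),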
]

# -----------------------------------------------------------------------------
# Owner Lookup Test Cases
# -----------------------------------------------------------------------------
owner_test_cases = [
    LLMTestCase(
        input="Who owns billing?",
        actual_output=devhub.query("Who owns billing?"),
        expected_output="Sarah Chen (@sarah.chen) from payments-team - active owner"
    ),
    LLMTestCase(
        input="Who owns vector-search?",
        actual_output=devhub.query("Who owns vector-search?"),
        expected_output="David Kim is listed but is INACTIVE - contact #search-team channel instead"
    ),
    LLMTestCase(
        input="Who owns auth?",
        actual_output=devhub.query("Who owns auth?"),
        expected_output="Michael Brown (@mbrown) from security-team - active owner"
    ),
]

# -----------------------------------------------------------------------------
# Status Check Test Cases
# -----------------------------------------------------------------------------
status_test_cases = [
    LLMTestCase(
        input="Is the payments API working?",
        actual_output=devhub.query("Is the payments API working?"),
        expected_output="Payments API is healthy with 99.95% uptime"
    ),
    LLMTestCase(
        input="Is staging working?",
        actual_output=devhub.query("Is staging working?"),
        expected_output="Staging is DEGRADED due to database connection pool exhaustion"
    ),
    LLMTestCase(
        input="What is the status of all services?",
        actual_output=devhub.query("What is the status of all services?"),
        expected_output="List all services: payments-api healthy, staging degraded, search-api healthy, auth-service healthy"
    ),
]

# -----------------------------------------------------------------------------
# Combine all test cases
# -----------------------------------------------------------------------------
all_test_cases = doc_test_cases + owner_test_cases + status_test_cases

print(f"Total test cases created: {len(all_test_cases)}")
print(f"   Documentation: {len(doc_test_cases)}")
print(f"   Owner lookup: {len(owner_test_cases)}")
print(f"   Status check: {len(status_test_cases)}")

In [None]:
# =============================================================================
# SOLUTION: Task 1 - Create Test Cases
# =============================================================================
# Expand if you get stuck.

from deepeval.test_case import LLMTestCase

# Documentation Test Cases
doc_test_cases = [
    LLMTestCase(
        input="How do I authenticate with the Payments API?",
        actual_output=devhub.query("How do I authenticate with the Payments API?"),
        expected_output="Use OAuth 2.0 with client_id and client_secret from Developer Portal"
    ),
    LLMTestCase(
        input="What is the error handling format?",
        actual_output=devhub.query("What is the error handling format?"),
        expected_output="Standard error format with code, message, and correlation_id"
    ),
    LLMTestCase(
        input="What are the rate limits?",
        actual_output=devhub.query("What are the rate limits?"),
        expected_output="100 requests/minute authenticated, 10/minute unauthenticated"
    ),
    LLMTestCase(
        input="How do I use the search API?",
        actual_output=devhub.query("How do I use the search API?"),
        expected_output="POST /api/v1/search with query in body, returns ranked results"
    ),
]

# Owner Lookup Test Cases
owner_test_cases = [
    LLMTestCase(
        input="Who owns billing?",
        actual_output=devhub.query("Who owns billing?"),
        expected_output="Sarah Chen (@sarah.chen) from payments-team - active owner"
    ),
    LLMTestCase(
        input="Who owns vector-search?",
        actual_output=devhub.query("Who owns vector-search?"),
        expected_output="David Kim is listed but is INACTIVE - contact #search-team channel instead"
    ),
    LLMTestCase(
        input="Who owns auth?",
        actual_output=devhub.query("Who owns auth?"),
        expected_output="Michael Brown (@mbrown) from security-team - active owner"
    ),
]

# Status Check Test Cases
status_test_cases = [
    LLMTestCase(
        input="Is the payments API working?",
        actual_output=devhub.query("Is the payments API working?"),
        expected_output="Payments API is healthy with 99.95% uptime"
    ),
    LLMTestCase(
        input="Is staging working?",
        actual_output=devhub.query("Is staging working?"),
        expected_output="Staging is DEGRADED due to database connection pool exhaustion"
    ),
    LLMTestCase(
        input="What is the status of all services?",
        actual_output=devhub.query("What is the status of all services?"),
        expected_output="List all services: payments-api healthy, staging degraded, search-api healthy, auth-service healthy"
    ),
]

all_test_cases = doc_test_cases + owner_test_cases + status_test_cases
print(f"Solution: {len(all_test_cases)} test cases created")

## Task 2: Define G-Eval Metrics

**Goal:** Create G-Eval metrics to evaluate DevHub responses.

**What to do:**

Create at least 3 metrics:

1. **Correctness**: Is the information accurate?
2. **Helpfulness**: Is the response actionable and useful?
3. **Edge Case Handling**: Does it handle stale data/degraded services properly?

**Hints:**
- Use appropriate `evaluation_params` for each metric
- Set thresholds based on importance (critical = 0.8+)
- Write clear, specific criteria

**Time:** ~8 minutes

In [None]:
# =============================================================================
# TASK 2: Define G-Eval Metrics (SOLUTION)
# =============================================================================
# Create metrics to evaluate DevHub response quality.
# =============================================================================

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# -----------------------------------------------------------------------------
# Metric 1: Correctness
# -----------------------------------------------------------------------------
correctness_metric = GEval(
    name="Correctness",
    criteria="""Evaluate if the response provides factually correct information that aligns with the expected output.
    Consider:
    1. Are the key facts accurate?
    2. Does it match the expected answer?
    3. Are there any factual errors?""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.8,
    model="gpt-4o-mini"
)

# -----------------------------------------------------------------------------
# Metric 2: Helpfulness
# -----------------------------------------------------------------------------
helpfulness_metric = GEval(
    name="Helpfulness",
    criteria="""Evaluate if the response is helpful and actionable:
    1. Does it directly address the user's question?
    2. Does it provide specific, actionable information?
    3. Would a developer be able to proceed based on this response?""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.7,
    model="gpt-4o-mini"
)

# -----------------------------------------------------------------------------
# Metric 3: Edge Case Handling
# -----------------------------------------------------------------------------
edge_case_metric = GEval(
    name="Edge Case Handling",
    criteria="""Evaluate if the response appropriately handles edge cases:
    1. If an owner is inactive, is this clearly communicated?
    2. If a service is degraded, is the issue explained?
    3. Does it suggest alternatives when appropriate?
    4. Is uncertainty acknowledged rather than hidden?""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.75,
    model="gpt-4o-mini"
)

# -----------------------------------------------------------------------------
# Collect metrics
# -----------------------------------------------------------------------------
all_metrics = [correctness_metric, helpfulness_metric, edge_case_metric]

print(f"Metrics defined: {len(all_metrics)}")
for m in all_metrics:
    print(f"   - {m.name} (threshold: {m.threshold})")

In [None]:
# =============================================================================
# SOLUTION: Task 2 - Define G-Eval Metrics
# =============================================================================

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Metric 1: Correctness
correctness_metric = GEval(
    name="Correctness",
    criteria="""Evaluate if the response provides factually correct information that aligns with the expected output.
    Consider:
    1. Are the key facts accurate?
    2. Does it match the expected answer?
    3. Are there any factual errors?""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.8,
    model="gpt-4o-mini"
)

# Metric 2: Helpfulness
helpfulness_metric = GEval(
    name="Helpfulness",
    criteria="""Evaluate if the response is helpful and actionable:
    1. Does it directly address the user's question?
    2. Does it provide specific, actionable information?
    3. Would a developer be able to proceed based on this response?""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.7,
    model="gpt-4o-mini"
)

# Metric 3: Edge Case Handling
edge_case_metric = GEval(
    name="Edge Case Handling",
    criteria="""Evaluate if the response appropriately handles edge cases:
    1. If an owner is inactive, is this clearly communicated?
    2. If a service is degraded, is the issue explained?
    3. Does it suggest alternatives when appropriate?
    4. Is uncertainty acknowledged rather than hidden?""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.75,
    model="gpt-4o-mini"
)

all_metrics = [correctness_metric, helpfulness_metric, edge_case_metric]
print(f"Solution: {len(all_metrics)} metrics defined")

## Task 3: Run Evaluation

**Goal:** Run all test cases against all metrics and analyze results.

**What to do:**

1. Use `deepeval.evaluate()` to run all tests
2. Print per-test results with pass/fail
3. Calculate aggregate statistics

**No starter code** - apply what you've learned!

**Time:** ~10 minutes

In [None]:
# =============================================================================
# TASK 3: Run Evaluation (SOLUTION)
# =============================================================================
# Run all test cases against all metrics.
# =============================================================================

from deepeval import evaluate
import time

print("Running DevHub Evaluation...")
print("=" * 60)
start_time = time.time()

# 1. Run evaluation
results = evaluate(all_test_cases, all_metrics)

# 2. Print per-test results
print("\nPer-Test Results:")
print("-" * 60)

passed = 0
failed = 0

for i, test_case in enumerate(all_test_cases):
    test_passed = True
    print(f"\nTest {i+1}: {test_case.input[:40]}...")

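    for metric in all_metrics:
        # Note: evaluate() above has already scored every test/metric pair.
        # Calling measure() again re-runs the judge model so we can read
        # metric.score directly, at the cost of a second set of LLM calls.
        metric.measure(test_case)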
        status = "PASS" if metric.score >= metric.threshold else "FAIL"
        if metric.score < metric.threshold:
            test_passed = False
        print(f"   {metric.name}: {metric.score:.2f} [{status}]")

    if test_passed:
        passed += 1
    else:
        failed += 1

# 3. Calculate and print summary
elapsed = time.time() - start_time
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
print(f"Total Tests: {len(all_test_cases)}")
print(f"Passed: {passed}")
print(f"Failed: {failed}")
print(f"Pass Rate: {passed/len(all_test_cases)*100:.1f}%")
print(f"Time: {elapsed:.1f} seconds")

In [None]:
# =============================================================================
# SOLUTION: Task 3 - Run Evaluation
# =============================================================================

from deepeval import evaluate
import time

print("Running DevHub Evaluation...")
print("=" * 60)
start_time = time.time()

# Run evaluation
results = evaluate(all_test_cases, all_metrics)

# Per-test results
print("\nPer-Test Results:")
print("-" * 60)

passed = 0
failed = 0

for i, test_case in enumerate(all_test_cases):
    test_passed = True
    print(f"\nTest {i+1}: {test_case.input[:40]}...")

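    for metric in all_metrics:
        # Note: evaluate() above has already scored every test/metric pair.
        # Calling measure() again re-runs the judge model so we can read
        # metric.score directly, at the cost of a second set of LLM calls.
        metric.measure(test_case)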
        status = "PASS" if metric.score >= metric.threshold else "FAIL"
        if metric.score < metric.threshold:
            test_passed = False
        print(f"   {metric.name}: {metric.score:.2f} [{status}]")

    if test_passed:
        passed += 1
    else:
        failed += 1

# Summary
elapsed = time.time() - start_time
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
print(f"Total Tests: {len(all_test_cases)}")
print(f"Passed: {passed}")
print(f"Failed: {failed}")
print(f"Pass Rate: {passed/len(all_test_cases)*100:.1f}%")
print(f"Time: {elapsed:.1f} seconds")

---

## Lab 1 Verification Checklist

Verify your evaluation harness:

### 1. Test Cases
- [ ] At least 10 test cases created
- [ ] Mix of documentation, owner, and status queries
- [ ] Each has input, actual_output, expected_output

### 2. Metrics
- [ ] At least 3 G-Eval metrics defined
- [ ] Each has clear criteria
- [ ] Thresholds set appropriately

### 3. Results
- [ ] All tests ran without errors
- [ ] Got scores for each test/metric combination
- [ ] Summary shows pass/fail counts

### 4. Analysis
- [ ] Can identify which tests failed
- [ ] Can see which metrics need improvement
- [ ] Understand why failures occurred

**If all checked, you've built an evaluation harness!**

---

## Lab 1 Key Takeaways

1. **Test cases capture LLM interactions** for systematic evaluation

2. **G-Eval metrics define quality** in plain English

3. **Batch evaluation** runs many tests quickly (~3 min for 100 tests)

4. **Results show exactly where** quality issues occur

5. **This harness can run on every prompt change** to catch regressions!

---

**Next:** Lab 2 - Generate synthetic test data and run regression tests.

---

# Lab 2: Synthetic Data Generation & Regression Testing

Now that you can evaluate DevHub, let's scale up your test suite automatically!

**Duration:** ~25 minutes

**What you'll do:**
1. Generate synthetic test cases from seed examples
2. Create paraphrased variations of queries
3. Build a regression testing pipeline
4. Compare prompt versions

**Why this matters:**
- 10 seed examples → 100 test cases automatically
- Catch edge cases you wouldn't think of
- Ensure prompt changes don't break existing functionality
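
At its core, comparing prompt versions means comparing per-test scores from two runs and flagging the drops. This is a minimal sketch with made-up scores; `find_regressions` and its `tolerance` parameter are hypothetical, and the tolerance exists only because judge-model scores vary slightly between runs.

```python
def find_regressions(baseline, candidate, tolerance=0.05):
    """Return tests whose score dropped by more than tolerance.

    baseline and candidate map a test input to its score for one metric.
    Tests missing from candidate are reported as regressions too.
    """
    regressions = {}
    for test, old_score in baseline.items():
        new_score = candidate.get(test)
        if new_score is None or old_score - new_score > tolerance:
            regressions[test] = (old_score, new_score)
    return regressions


# Made-up scores for two prompt versions.
prompt_v1 = {"Who owns billing?": 0.90, "Is staging working?": 0.85, "What are the rate limits?": 0.80}
prompt_v2 = {"Who owns billing?": 0.92, "Is staging working?": 0.55, "What are the rate limits?": 0.78}

print(find_regressions(prompt_v1, prompt_v2))
# {'Is staging working?': (0.85, 0.55)}
```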

## Synthetic Data Generation for LLM Testing

### The Problem
Writing test cases manually is:
- **Time-consuming**: 10 minutes per good test case
- **Limited**: You only test what you think of
- **Biased**: Your examples reflect your assumptions

### The Solution: LLM-Powered Data Generation

Use LLMs to generate test variations:
1. **Paraphrasing**: Same intent, different wording
2. **Complexity scaling**: Simple → complex versions
3. **Adversarial cases**: Edge cases and tricky inputs
4. **Style variations**: Formal, casual, technical
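
To make the four variation types concrete, the sketch below builds one generation prompt per type for a seed query. The template wording is an assumption for illustration, not what DeepEval sends to the model, and `build_variation_prompts` is a hypothetical helper. Each prompt would then be passed to an LLM of your choice.

```python
# Illustrative prompt templates; the wording is an assumption, not DeepEval's.
VARIATION_TEMPLATES = {
    "paraphrase": "Rewrite this question with the same intent but different wording: {query}",
    "complexity": "Rewrite this question so it needs multi-step reasoning to answer: {query}",
    "adversarial": "Rewrite this question with a misleading assumption or typo in it: {query}",
    "style_casual": "Rewrite this question in a casual chat tone: {query}",
}


def build_variation_prompts(seed_query: str) -> dict[str, str]:
    """Return one generation prompt per variation type for a seed query."""
    return {
        kind: template.format(query=seed_query)
        for kind, template in VARIATION_TEMPLATES.items()
    }


for kind, prompt in build_variation_prompts("Who owns billing?").items():
    print(f"{kind}: {prompt}")
```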

### DeepEval's Synthesizer

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthetic_data = synthesizer.generate_goldens_from_docs(
    document_paths=["your_documentation.txt"],  # file paths, not raw text
    max_goldens_per_context=2
)
```

## How Synthetic Data Evolution Works

![Synthetic Data Evolution](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/05_synthetic_data_evolution.svg)

**Process:**

1. **Seed Examples** (10 hand-crafted)
   - High-quality, verified test cases
   - Cover main functionality

2. **Filtration**
   - Remove low-quality seeds
   - Ensure diversity

3. **Evolution**
   - Generate paraphrases
   - Increase complexity
   - Add adversarial variations

4. **Styling**
   - Apply tone variations
   - Formal/casual/technical

5. **Golden Dataset** (100+ test cases)
   - Stratified by type and difficulty
   - Ready for regression testing
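
"Stratified by type and difficulty" means the golden dataset draws a fixed number of cases from each (type, difficulty) group instead of sampling the pool at random. Below is a minimal sketch in plain Python, assuming each candidate carries hypothetical `type` and `difficulty` labels; `stratified_sample` is an illustrative helper, not part of DeepEval.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, per_stratum, seed=0):
    """Pick up to per_stratum items from each (type, difficulty) group."""
    groups = defaultdict(list)
    for item in candidates:
        groups[(item["type"], item["difficulty"])].append(item)
    rng = random.Random(seed)  # Fixed seed keeps the golden set reproducible.
    golden = []
    for key in sorted(groups):
        members = groups[key]
        golden.extend(rng.sample(members, min(per_stratum, len(members))))
    return golden


# Made-up candidate pool: 3 types x 2 difficulties x 5 variants each.
candidates = [
    {"type": t, "difficulty": d, "input": f"{t}-{d}-{i}"}
    for t in ("docs", "owner", "status")
    for d in ("easy", "hard")
    for i in range(5)
]
golden = stratified_sample(candidates, per_stratum=2)
print(len(golden))  # 12
```

Sampling per group keeps rare combinations, such as hard owner-lookup queries, from being crowded out by the most common query type.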

In [None]:
# =============================================================================
# SYNTHETIC DATA: DeepEval Synthesizer Setup
# =============================================================================
# Using DeepEval's built-in Synthesizer with document files.
# This provides research-backed generation with quality filtering.

from deepeval.synthesizer import Synthesizer
import tempfile
import os

# Initialize synthesizer
synthesizer = Synthesizer(model="gpt-4o-mini")

# Create temporary document files from DevHub's data
# (In production, you'd use your actual documentation files)
temp_dir = tempfile.mkdtemp()
document_paths = []

# Write DevHub data to temp files
docs_content = {
    "payments_auth.txt": f"Payments API Authentication: {devhub.docs['payments-auth']}",
    "error_handling.txt": f"Error Handling Standards: {devhub.docs['error-handling']}",
    "rate_limiting.txt": f"Rate Limiting Policy: {devhub.docs['rate-limiting']}",
    "ownership.txt": """Service Ownership Information:
- Billing: Sarah Chen (@sarah.chen) from payments-team, currently active.
- Vector Search: David Kim (@david.kim) from search-team, currently INACTIVE - use team channel instead.
- Auth: Michael Brown (@mbrown) from security-team, currently active.""",
    "status.txt": """Service Status Information:
- Payments API: Currently healthy with 99.95% uptime.
- Staging Environment: Currently DEGRADED due to database connection pool exhaustion.
- Search API: Currently healthy with 99.99% uptime.""",
}

for filename, content in docs_content.items():
    path = os.path.join(temp_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    document_paths.append(path)

print("Synthesizer initialized!")
print("   Model: gpt-4o-mini")
print(f"   Document files created: {len(document_paths)}")
for p in document_paths:
    print(f"      - {os.path.basename(p)}")

In [None]:
# =============================================================================
# TASK: Generate Synthetic Test Cases with DeepEval Synthesizer
# =============================================================================
# Generate test cases automatically from documentation files.

from deepeval.test_case import LLMTestCase

print("Generating synthetic test cases from documents...")
print("=" * 60)

# Generate goldens from document files
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=document_paths,
    include_expected_output=True,
    max_goldens_per_context=2,
)

print(f"\nGenerated {len(goldens)} golden test cases")
print("-" * 60)

# Convert goldens to LLMTestCase with actual DevHub responses
synthetic_test_cases = []

for i, golden in enumerate(goldens):
    print(f"\n[{i+1}/{len(goldens)}] Processing: {golden.input[:50]}...")
    
    # Get actual response from DevHub
    actual_response = devhub.query(golden.input)
    
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_response,
        expected_output=golden.expected_output
    )
    synthetic_test_cases.append(test_case)

print(f"\n{'=' * 60}")
print(f"Total synthetic test cases: {len(synthetic_test_cases)}")
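
Questions generated from overlapping documents can end up nearly identical, which inflates the test count without adding coverage. A minimal sketch of the filtration idea, assuming plain question strings and using `difflib` from the standard library as a rough similarity measure (the 0.9 threshold and the sample questions are illustrative assumptions, and this is not how DeepEval filters internally):

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def drop_near_duplicates(inputs, threshold=0.9):
    """Keep the first occurrence of each question; drop later near-copies."""
    kept = []
    for text in inputs:
        candidate = normalize(text)
        is_new = all(
            SequenceMatcher(None, candidate, normalize(existing)).ratio() < threshold
            for existing in kept
        )
        if is_new:
            kept.append(text)
    return kept

# Hypothetical generated questions
questions = [
    "Who owns the billing service?",
    "who owns the  billing service?",
    "What is the current status of the staging environment?",
]
unique_questions = drop_near_duplicates(questions)
print(f"Kept {len(unique_questions)} of {len(questions)} questions")
```

In the notebook, the same filter could be applied to `[g.input for g in goldens]` before querying DevHub, which also saves API calls.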

In [None]:
# =============================================================================
# Evaluate Synthetic Test Cases
# =============================================================================
# Run our metrics on the synthetic test cases.

import time

print("Evaluating synthetic test cases...")
print("=" * 60)
start_time = time.time()

# Use the metrics from Lab 1.
# Each metric is measured once per test case here. Also calling
# evaluate(synthetic_test_cases, all_metrics) would score every case a second
# time, doubling the judge calls and inflating the timing below.
passed = 0
failed = 0

for test_case in synthetic_test_cases:
    test_passed = True
    for metric in all_metrics:
        metric.measure(test_case)
        if metric.score < metric.threshold:
            test_passed = False
    if test_passed:
        passed += 1
    else:
        failed += 1

elapsed = time.time() - start_time
total = len(synthetic_test_cases)
pass_rate = passed / total * 100 if total else 0.0  # avoid dividing by zero

print(f"\nSynthetic Test Results:")
print(f"   Total: {total}")
print(f"   Passed: {passed}")
print(f"   Failed: {failed}")
print(f"   Pass Rate: {pass_rate:.1f}%")
print(f"   Time: {elapsed:.1f} seconds")
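
A pass rate alone does not say which cases failed or on which metric. One way to get that breakdown is to record each score as it is measured and group the failures afterwards. The sketch below assumes records shaped like `{"input": ..., "scores": {metric_name: score}}`; the metric names, scores, and thresholds are made-up examples, not output from this notebook.

```python
def summarize_failures(records, thresholds):
    """Group failing inputs by metric, given per-metric pass thresholds."""
    failures = {name: [] for name in thresholds}
    for record in records:
        for name, threshold in thresholds.items():
            score = record["scores"].get(name)
            # A missing score is treated as a failure so it cannot go unnoticed
            if score is None or score < threshold:
                failures[name].append((record["input"], score))
    return failures

# Hypothetical scores; in the notebook you would append metric.score to a
# record right after each metric.measure(test_case) call
example_records = [
    {"input": "Who owns billing?", "scores": {"Correctness": 0.9, "Completeness": 0.4}},
    {"input": "Is staging healthy?", "scores": {"Correctness": 0.3, "Completeness": 0.8}},
    {"input": "What are the rate limits?", "scores": {"Correctness": 0.8, "Completeness": 0.9}},
]
example_thresholds = {"Correctness": 0.7, "Completeness": 0.7}

failures_by_metric = summarize_failures(example_records, example_thresholds)
for name, items in failures_by_metric.items():
    print(f"{name}: {len(items)} failing")
    for text, score in items:
        print(f"   {score} <- {text}")
```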

## Regression Testing for Prompts

### What is Regression Testing?

Regression testing ensures that **changes don't break existing functionality**.

For LLMs, this means:
- New prompt version should pass all existing tests
- Quality scores should not decrease significantly
- Edge cases should still be handled correctly

### The Regression Testing Pipeline

1. **Establish Baseline**: Run tests on current prompt version
2. **Make Changes**: Modify the prompt
3. **Run Tests Again**: Same tests, new prompt
4. **Compare Results**: Did scores improve or regress?
5. **Decision**: Accept or reject the change

In [None]:
# =============================================================================
# REGRESSION TESTING: Capture Baseline
# =============================================================================
# Establish baseline scores before making changes.

def capture_baseline(test_cases: list, metrics: list) -> dict:
    """Capture baseline scores for regression testing."""
    baseline = {
        "total_tests": len(test_cases),
        "scores_by_metric": {},
        "test_results": []
    }

    for test_case in test_cases:
        test_result = {"input": test_case.input, "scores": {}}

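        for metric in metrics:
            metric.measure(test_case)
            # Fall back to the class name if the metric has no `.name` attribute
            metric_name = getattr(metric, "name", None) or type(metric).__name__

            if metric_name not in baseline["scores_by_metric"]:
                baseline["scores_by_metric"][metric_name] = []

            baseline["scores_by_metric"][metric_name].append(metric.score)
            test_result["scores"][metric_name] = metric.score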

        baseline["test_results"].append(test_result)

    # Calculate averages
    baseline["averages"] = {}
    for metric_name, scores in baseline["scores_by_metric"].items():
        baseline["averages"][metric_name] = sum(scores) / len(scores)

    return baseline

# Capture baseline
print("Capturing baseline scores...")
baseline = capture_baseline(all_test_cases[:5], all_metrics)  # Use subset for demo

print("\nBaseline Established:")
for metric_name, avg in baseline["averages"].items():
    print(f"   {metric_name}: {avg:.2f}")
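
A baseline is only useful if it survives the notebook session. Because the baseline is a plain dict of strings, lists, and floats, it can be stored as JSON. This is a minimal sketch: the `baseline.json` file name and the example values are assumptions, and the dict only mirrors the shape that `capture_baseline()` returns.

```python
import json
import tempfile
from pathlib import Path

def save_baseline(baseline, path):
    """Write the baseline dict to disk as JSON."""
    Path(path).write_text(json.dumps(baseline, indent=2), encoding="utf-8")

def load_baseline(path):
    """Return the saved baseline, or None if no baseline exists yet."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text(encoding="utf-8"))

# Hypothetical baseline with the same shape as capture_baseline() output
example_baseline = {
    "total_tests": 2,
    "scores_by_metric": {"Correctness": [0.9, 0.7]},
    "test_results": [
        {"input": "Who owns billing?", "scores": {"Correctness": 0.9}},
        {"input": "Is staging healthy?", "scores": {"Correctness": 0.7}},
    ],
    "averages": {"Correctness": 0.8},
}

with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"  # file name is an assumption
    missing = load_baseline(baseline_path)       # None: nothing saved yet
    save_baseline(example_baseline, baseline_path)
    restored = load_baseline(baseline_path)

print(restored["averages"])
```

Returning `None` for a missing file lets a first run fall back to capturing a fresh baseline instead of failing.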

## Regression Testing Flow

![Regression Testing Flow](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/06_regression_testing_flow.svg)

**Key Decision Points:**

1. **Score Comparison**
   - If new scores >= baseline: Consider accepting
   - If new scores < baseline: Investigate regression

2. **Threshold for Regression**
   - Typical: 5-10% decrease is a warning
   - Critical metrics: Any decrease is a failure

3. **Handling Trade-offs**
   - One metric improves, another regresses?
   - Document and decide based on priorities
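
These decision points can be written down as a small rule. The sketch below is one possible reading: a drop of 5% or more relative to the baseline is a warning, any drop on a critical metric is a failure, and (an assumption beyond the list above) a drop of 10% or more is a failure for every metric. The metric names are hypothetical.

```python
def classify_change(metric, baseline_avg, new_avg,
                    critical=frozenset(), warn_pct=5.0, fail_pct=10.0):
    """Return 'ok', 'warn', 'fail', or 'unknown' for one metric's change."""
    if baseline_avg <= 0:
        return "unknown"  # a relative drop is undefined without a positive baseline
    drop_pct = (baseline_avg - new_avg) / baseline_avg * 100
    if drop_pct <= 0:
        return "ok"  # unchanged or improved
    if metric in critical:
        return "fail"  # critical metrics tolerate no decrease
    if drop_pct >= fail_pct:
        return "fail"
    if drop_pct >= warn_pct:
        return "warn"
    return "ok"

critical_metrics = {"Correctness"}  # hypothetical choice of critical metric
small_drop = classify_change("Tone", 0.80, 0.79, critical_metrics)
medium_drop = classify_change("Tone", 0.80, 0.75, critical_metrics)
large_drop = classify_change("Tone", 0.80, 0.70, critical_metrics)
critical_drop = classify_change("Correctness", 0.90, 0.89, critical_metrics)
print(small_drop, medium_drop, large_drop, critical_drop)
```

For trade-offs where one metric improves and another regresses, the per-metric labels give a concrete record to document alongside the decision.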

In [None]:
# =============================================================================
# REGRESSION TESTING: Compare Results
# =============================================================================
# Compare new results against baseline.

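def compare_with_baseline(new_results: dict, baseline: dict, regression_threshold: float = 0.05) -> dict:
    """Compare new results with baseline and detect regressions.

    regression_threshold is an absolute difference in average score
    (0.05 on a 0-1 score scale), not a percentage of the baseline.
    """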
    comparison = {
        "regressions": [],
        "improvements": [],
        "unchanged": [],
        "summary": {}
    }

    for metric_name in baseline["averages"]:
        baseline_avg = baseline["averages"][metric_name]
        new_avg = new_results["averages"][metric_name]
        diff = new_avg - baseline_avg
        pct_change = (diff / baseline_avg) * 100 if baseline_avg > 0 else 0

        result = {
            "metric": metric_name,
            "baseline": baseline_avg,
            "new": new_avg,
            "diff": diff,
            "pct_change": pct_change
        }

        if diff < -regression_threshold:
            comparison["regressions"].append(result)
        elif diff > regression_threshold:
            comparison["improvements"].append(result)
        else:
            comparison["unchanged"].append(result)

    comparison["summary"] = {
        "total_metrics": len(baseline["averages"]),
        "regressions": len(comparison["regressions"]),
        "improvements": len(comparison["improvements"]),
        "unchanged": len(comparison["unchanged"]),
        "overall": "PASS" if len(comparison["regressions"]) == 0 else "FAIL"
    }

    return comparison

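# Simulate a "new version". In practice, you'd change the prompt and regenerate
# each test case's actual_output before re-scoring. Here the same test cases are
# re-scored, so any difference comes from run-to-run variation in scoring.
print("Simulating prompt change and re-running tests...")
new_results = capture_baseline(all_test_cases[:5], all_metrics)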

# Compare
comparison = compare_with_baseline(new_results, baseline)

print("\nRegression Test Results:")
print("=" * 60)
print(f"Overall: {comparison['summary']['overall']}")
print(f"   Regressions: {comparison['summary']['regressions']}")
print(f"   Improvements: {comparison['summary']['improvements']}")
print(f"   Unchanged: {comparison['summary']['unchanged']}")

if comparison["regressions"]:
    print("\nRegressions Detected:")
    for r in comparison["regressions"]:
        print(f"   {r['metric']}: {r['baseline']:.2f} → {r['new']:.2f} ({r['pct_change']:.1f}%)")
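
Averages can hide a single badly broken case: one query dropping from 0.9 to 0.3 may barely move the mean across many tests. Since `capture_baseline()` also stores per-test scores under `test_results`, individual cases can be compared as well. A minimal sketch, with hypothetical data in that same shape and matching cases by their `input` string:

```python
def find_case_regressions(baseline, new_results, threshold=0.05):
    """List per-test, per-metric score drops larger than `threshold`."""
    new_by_input = {r["input"]: r["scores"] for r in new_results["test_results"]}
    regressions = []
    for old in baseline["test_results"]:
        new_scores = new_by_input.get(old["input"])
        if new_scores is None:
            continue  # test case not present in the new run
        for metric_name, old_score in old["scores"].items():
            new_score = new_scores.get(metric_name)
            if new_score is not None and old_score - new_score > threshold:
                regressions.append({
                    "input": old["input"],
                    "metric": metric_name,
                    "baseline": old_score,
                    "new": new_score,
                    "drop": old_score - new_score,
                })
    # Largest drops first
    return sorted(regressions, key=lambda r: r["drop"], reverse=True)

# Hypothetical runs
old_run = {"test_results": [
    {"input": "Who owns billing?", "scores": {"Correctness": 0.9}},
    {"input": "Is staging healthy?", "scores": {"Correctness": 0.8}},
]}
new_run = {"test_results": [
    {"input": "Who owns billing?", "scores": {"Correctness": 0.6}},
    {"input": "Is staging healthy?", "scores": {"Correctness": 0.8}},
]}

case_regressions = find_case_regressions(old_run, new_run)
for r in case_regressions:
    print(f"{r['metric']} dropped {r['baseline']:.2f} -> {r['new']:.2f} for: {r['input']}")
```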

In [None]:
# =============================================================================
# COMPLETE REGRESSION TESTING PIPELINE
# =============================================================================
# Put it all together into a reusable function.

def run_regression_test(
    test_cases: list,
    metrics: list,
    baseline: dict = None,
    regression_threshold: float = 0.05
) -> dict:
    """Run a complete regression test.

    Args:
        test_cases: List of LLMTestCase to evaluate
        metrics: List of metrics to use
        baseline: Previous baseline (if None, just capture)
        regression_threshold: Threshold for detecting regression

    Returns:
        Dict with results and comparison
    """
    # Capture current results
    current = capture_baseline(test_cases, metrics)

    if baseline is None:
        return {
            "mode": "baseline_capture",
            "results": current,
            "message": "Baseline captured. Save this for future comparisons."
        }

    # Compare with baseline
    comparison = compare_with_baseline(current, baseline, regression_threshold)

    return {
        "mode": "comparison",
        "results": current,
        "comparison": comparison,
        "passed": comparison["summary"]["overall"] == "PASS"
    }

# Demo the pipeline
print("Running regression test pipeline...")
result = run_regression_test(all_test_cases[:5], all_metrics, baseline=baseline)

print(f"\nPipeline Result: {'PASSED' if result['passed'] else 'FAILED'}")
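
To use the pipeline as a gate in a script, the result dict needs to become a process exit code and a readable report. The helper names and the example result below are hypothetical; the dict only mirrors the shape that `run_regression_test()` returns.

```python
def regression_exit_code(result):
    """Return 0 when the run may proceed, 1 when a regression should block it."""
    if result["mode"] == "baseline_capture":
        return 0  # nothing to compare against yet
    return 0 if result["passed"] else 1

def format_report(result):
    """Build a short plain-text summary of a pipeline result."""
    if result["mode"] == "baseline_capture":
        return result["message"]
    lines = [f"Overall: {result['comparison']['summary']['overall']}"]
    for r in result["comparison"]["regressions"]:
        lines.append(
            f"REGRESSION {r['metric']}: {r['baseline']:.2f} -> {r['new']:.2f} "
            f"({r['pct_change']:.1f}%)"
        )
    return "\n".join(lines)

# Hypothetical result in the shape returned by run_regression_test()
example_result = {
    "mode": "comparison",
    "passed": False,
    "comparison": {
        "summary": {"overall": "FAIL"},
        "regressions": [
            {"metric": "Correctness", "baseline": 0.80, "new": 0.60,
             "diff": -0.20, "pct_change": -25.0},
        ],
    },
}

print(format_report(example_result))
exit_code = regression_exit_code(example_result)
# In a standalone script you would finish with: sys.exit(exit_code)
```

A non-zero exit code is what makes a CI step fail, so no exception handling is needed in the workflow itself.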

---

## Lab 2 Verification Checklist

### 1. Synthetic Data Generation
- [ ] Generated goldens from the DevHub document files
- [ ] Created 9+ synthetic test cases
- [ ] Verified synthetic cases have expected_output

### 2. Synthetic Evaluation
- [ ] Ran metrics on synthetic test cases
- [ ] Calculated pass rate
- [ ] Identified any failing cases

### 3. Regression Testing
- [ ] Captured baseline scores
- [ ] Ran comparison against baseline
- [ ] Understood regression vs improvement detection

### 4. Pipeline
- [ ] Regression pipeline returns pass/fail
- [ ] Can identify which metrics regressed
- [ ] Ready to integrate into CI/CD

**If all checked, you've built a regression testing system!**

## Taking It Further: CI/CD Integration

```yaml
# .github/workflows/llm-regression.yml
name: LLM Regression Tests
on:
  pull_request:
  push:
    branches: [main]  # needed so the baseline-update step can run after a merge

jobs:
  regression-test:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # the LLM-based metrics need an API key
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install deepeval openai

      - name: Load baseline
        run: |
          # Load saved baseline from artifact or file
          python scripts/load_baseline.py

      - name: Run regression tests
        run: |
          # Assumes your regression module also exposes test_cases, metrics, and baseline
          python -c "
          from regression import run_regression_test, test_cases, metrics, baseline
          result = run_regression_test(test_cases, metrics, baseline)
          if not result['passed']:
              raise Exception('Regression detected!')
          "

      - name: Update baseline on merge
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: python scripts/save_baseline.py
```

**This ensures:**
- Every PR is tested for regressions
- Baseline is updated after merges
- No surprises in production

---

## Lab 2 Key Takeaways

1. **Synthetic data expands test coverage** automatically (10 → 100 cases)

2. **Paraphrasing catches wording variations** you wouldn't think of

3. **Regression testing compares prompt versions** systematically

4. **Baselines establish quality benchmarks** for comparison

5. **CI/CD integration prevents regressions** from reaching production

---

**Next:** Wrap-up and take-home exercise.

---

# Session 2: Wrap-Up

## What You Learned Today

### 1. Why Traditional Testing Fails
- Non-deterministic outputs
- Multi-dimensional quality
- Scale challenges

### 2. DeepEval Framework
- Test cases as units of evaluation
- Pytest-like interface
- 50+ research-backed metrics

### 3. G-Eval Mastery
- Custom criteria in plain English
- Chain-of-thought evaluation
- 0.87 correlation with humans

### 4. Practical Skills
- Built evaluation harness for DevHub
- Generated synthetic test cases
- Ran automated regression tests
- Caught quality issues systematically

## Before vs After Testing

### Before
```
PM: "I tested 20 prompts manually, looks good!"
*Deploys to production*
2 hours later...
Support: "Everything is broken!"
*Emergency rollback at 2am*
```

### After
```
Engineer: "Running automated evaluation..."
*100 tests in 3 minutes*
"New prompt passes 98/100 tests but fails 2 refund edge cases."
*Fixes before deployment*
*Zero production incidents*
```

**From 2 days of manual testing to 3 minutes of automated evaluation.**

![Before/After Testing](https://raw.githubusercontent.com/axel-sirota/salesforce-ai-workshops/main/exercises/session_02/charts/07_before_after_testing.svg)

## 5 Key Takeaways

### 1. Test Quality, Not Equality
LLM outputs are non-deterministic. Use quality scoring, not exact matching.

### 2. G-Eval is Your Swiss Army Knife
Define any criteria in plain English. It handles the complexity.

### 3. Automation is Essential
Manual testing doesn't scale: 100 automated tests run in 3 minutes, versus 2 days of manual testing.

### 4. Synthetic Data Expands Coverage
10 seed examples → 100 test cases automatically.

### 5. Catch Regressions Before Deployment
Run tests on every prompt change. No surprises in production.

## Take-Home Exercise: CI/CD Integration

**Integrate testing into your deployment pipeline.**

### Option 1: GitHub Actions
```yaml
# .github/workflows/llm-tests.yml
name: LLM Quality Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install deepeval openai
      - run: deepeval test run tests/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

### What to Deliver
1. Set up CI workflow in your project
2. Run DeepEval tests on every PR
3. Fail PRs that don't pass threshold
4. Test by making a breaking change

Bring your results to Session 3!

---

## Coming Up: Session 3 - Debugging AI Applications

In Session 3, we'll learn how to **debug production AI systems** when things go wrong.

### Topics:
- Distributed tracing for AI workflows
- Real-time dashboards
- Root cause analysis
- Common failure patterns

### You'll Build:
- Complete observability stack
- Debugging workflows
- Alerting system

---

## Congratulations!

You've completed Session 2: Testing Strategies for AI Applications!

**Skills gained:**
- Understand why LLM testing is different
- Build automated evaluation harnesses
- Use G-Eval for custom quality metrics
- Generate synthetic test data
- Run 100 tests in 3 minutes
- Catch regressions before deployment

See you in Session 3!