# Evaluating voice AI agents with Evalion and Braintrust

Evaluating voice AI agents presents unique challenges compared to traditional text-based systems. Voice conversations require assessing real-time performance (response latency under 500ms, interruption handling within 200ms), multi-turn dialogue across dozens of exchanges, and subjective qualities like customer satisfaction and naturalness.

Consider a customer service agent handling a flight cancellation. Success depends not just on providing correct information, but on maintaining context across the conversation, showing empathy under pressure, handling mid-sentence interruptions, and adapting to the customer's emotional state—all while processing background noise and accents in real-time.

This cookbook demonstrates how to build a systematic evaluation pipeline for voice AI agents by combining **Braintrust** (an evaluation and observability platform) with **Evalion** (a voice AI testing platform that simulates realistic customer interactions). Together, they enable you to evaluate complex voice conversations at scale.

## What you'll learn

- How to set up end-to-end voice agent evaluation pipelines
- How to use Braintrust datasets to define test scenarios
- How to orchestrate automated voice simulations with Evalion
- How to extract and track metrics across multiple evaluation runs
- How to build your own integration between evaluation platforms

## Prerequisites

1. **Braintrust Account**: Sign up at [braintrust.dev](https://www.braintrust.dev)
2. **Evalion Access**: Running Evalion backend with API access
3. **API Credentials**: Both Braintrust and Evalion API tokens


# Getting started

## Braintrust platform setup

Before running the evaluation programmatically, let's see how to set up the key components in the Braintrust platform UI.

### 1. Creating a dataset in Braintrust

In the Braintrust platform, you can create and manage datasets through the web interface. Here's what the dataset creation looks like:

![braintrust-dataset.png](./assets/braintrust-dataset.png)

*The dataset view shows your test scenarios with input/expected pairs that will be used for evaluation.*

### 2. Setting up a playground

The Playground allows you to test and iterate on your prompts before running full evaluations:

![braintrust-playground.png](./assets/braintrust-playground.png)

*The Playground interface lets you experiment with different agent configurations and see immediate results.*

### 3. Running experiments

Once your dataset and prompts are configured, you can run experiments to evaluate performance:

![braintrust-experiment.png](./assets/braintrust-experiment.png)

*The experiment view displays comprehensive metrics, scores, and comparisons across multiple runs.*

---

## Setup & Installation

First, install the required dependencies:

In [None]:
# Install required packages
!pip install braintrust httpx pydantic  # asyncio ships with Python, so it is not installed from PyPI

In [None]:
import os
import asyncio
import json
from typing import Any, Dict, List

# Set your API keys as environment variables
# os.environ["BRAINTRUST_API_KEY"] = "<YOUR_BRAINTRUST_API_KEY>"
# os.environ["EVALION_API_TOKEN"] = "<YOUR_EVALION_API_TOKEN>"
# os.environ["EVALION_PROJECT_ID"] = "<YOUR_EVALION_PROJECT_ID>"
# os.environ["EVALION_PERSONA_ID"] = "<YOUR_EVALION_PERSONA_ID>"

# Read credentials from the environment (each falls back to an empty string if unset)
BRAINTRUST_API_KEY = os.getenv("BRAINTRUST_API_KEY", "")
EVALION_API_TOKEN = os.getenv("EVALION_API_TOKEN", "")
EVALION_PROJECT_ID = os.getenv("EVALION_PROJECT_ID", "")
EVALION_PERSONA_ID = os.getenv("EVALION_PERSONA_ID", "")
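
Because each variable above falls back to an empty string, a missing credential would only surface later as a failed API call. A minimal sketch of an early check follows; `find_missing_env_vars` is a hypothetical helper written for this illustration, not part of either SDK.

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_ENV_VARS = [
    "BRAINTRUST_API_KEY",
    "EVALION_API_TOKEN",
    "EVALION_PROJECT_ID",
    "EVALION_PERSONA_ID",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Running this before the first API call makes configuration problems visible up front rather than halfway through a simulation.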

## Understanding the workflow

Braintrust manages the evaluation pipeline—organizing datasets, launching evaluations, and tracking metrics over time. Evalion provides voice-specific capabilities: simulating realistic customer interactions and measuring voice metrics.

The workflow uses Braintrust's evaluation primitives:
1. **Datasets**: Define test scenarios for your voice agent
2. **Tasks**: Configure Evalion to simulate realistic customer conversations with your voice agent
3. **Scorers**: Measure voice-specific metrics (latency, customer satisfaction, goal completion)
4. **Experiments**: Run evaluations across multiple scenarios and track results over time
5. **Analysis**: Review results in Braintrust dashboard to identify improvements and regressions

Let's build each component step by step.

## 1: Creating test scenarios in Braintrust

Braintrust datasets define test scenarios for your voice agent. Each scenario specifies three components: the customer's situation (input), their behavioral characteristics (persona), and what constitutes successful handling (expected outcome).

For our airline customer service agent, scenarios range from straightforward bookings to high-stress cancellation handling.

In [None]:
from braintrust import init_dataset

# Initialize Braintrust
project_name = "Voice Agent Evaluation Demo"
dataset_name = "Customer Service Scenarios"

# Create test scenarios
test_scenarios = [
    {
        "input": "Customer calling to book a flight from New York to Los Angeles for next Tuesday. They want a morning flight and have a budget of $400.",
        "expected": [
            "Agent introduces themselves professionally",
            "Agent confirms the departure city (New York) and destination (Los Angeles)",
            "Agent confirms the date (next Tuesday)",
            "Agent asks about preferred time of day (morning)",
            "Agent presents available flight options within budget",
            "Agent confirms the booking details before finalizing",
        ],
    },
    {
        "input": "Frustrated customer calling because their flight was cancelled. They need to get to Chicago for an important business meeting tomorrow morning.",
        "expected": [
            "Agent shows empathy for the situation",
            "Agent apologizes for the inconvenience",
            "Agent asks for booking reference number",
            "Agent proactively searches for alternative flights",
            "Agent offers multiple rebooking options",
            "Agent provides compensation information if applicable",
        ],
    },
    {
        "input": "Customer wants to change their existing reservation to add extra baggage and select a window seat.",
        "expected": [
            "Agent asks for booking confirmation number",
            "Agent retrieves existing reservation details",
            "Agent explains baggage fees and options",
            "Agent checks seat availability",
            "Agent confirms changes and new total cost",
            "Agent sends confirmation of modifications",
        ],
    },
]

# Create or update dataset in Braintrust
dataset = init_dataset(project_name, dataset_name)

# Insert test scenarios
for scenario in test_scenarios:
    dataset.insert(**scenario)

print(f"Created dataset '{dataset_name}' with {len(test_scenarios)} scenarios")

These scenarios cover a range of real customer interactions, from a straightforward booking to a stressful cancellation. Persona traits (for example, interrupting when confused or getting frustrated when interrupted mid-sentence) are configured in Evalion rather than in the dataset, and they make simulations behave like actual customers instead of scripted test cases.
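
Before inserting records, it can help to confirm that each one has the shape the rest of the pipeline expects. The sketch below assumes the `input`/`expected` layout used in this cookbook; `validate_scenario` is a hypothetical helper, not a Braintrust API.

```python
from typing import Any, Dict, List


def validate_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one scenario record (empty if valid)."""
    problems = []
    text = scenario.get("input")
    if not isinstance(text, str) or not text.strip():
        problems.append("'input' must be a non-empty string")
    expected = scenario.get("expected")
    if not isinstance(expected, list) or not expected:
        problems.append("'expected' must be a non-empty list")
    elif not all(isinstance(item, str) and item.strip() for item in expected):
        problems.append("every 'expected' item must be a non-empty string")
    return problems


# Illustrative records only; the second one is deliberately malformed.
samples = [
    {"input": "Customer wants to add a bag.", "expected": ["Agent explains baggage fees"]},
    {"input": "", "expected": []},
]
for index, sample in enumerate(samples):
    print(index, validate_scenario(sample) or "ok")
```

A check like this could run over `test_scenarios` before the insert loop, so that malformed records never reach the dataset.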


## 2: Creating Scorers

Voice scorers must evaluate what traditional scorers can't. Beyond "Did the agent book the flight?" you need to assess:

- **Latency**: Did the agent find alternatives within 800ms?
- **Customer satisfaction (CSAT)**: Did the agent acknowledge the customer's frustration?
- **Goal completion**: Was the cancelled flight successfully rebooked?

Evalion provides both objective metrics (latency, duration) and subjective assessments (CSAT, clarity). All scores normalize to 0-1 for Braintrust tracking:

In [None]:
from typing import Optional
from braintrust import Score


def normalize_score(
    score_value: Optional[float], has_succeeded: Optional[bool] = None
) -> Optional[float]:
    """Normalize scores to 0-1 range for Braintrust."""
    if has_succeeded is not None:
        return 1.0 if has_succeeded else 0.0

    if score_value is None:
        return None

    # Normalize 1-10 scale to 0-1
    return max(0.0, min(1.0, score_value / 10.0))


def extract_custom_metrics(output: Dict[str, Any]) -> List[Score]:
    """Extract custom metric scores from simulation results."""
    scores = []

    simulations = output.get("simulations", [])
    if not simulations:
        return scores

    simulation = simulations[0]
    evaluations = simulation.get("evaluations", [])

    for evaluation in evaluations:
        metric = evaluation.get("metric", {})
        metric_name = metric.get("name", "unknown")
        measurement_type = metric.get("measurement_type")

        if not evaluation.get("is_applicable", True):
            continue

        if measurement_type == "boolean":
            score_value = normalize_score(None, evaluation.get("has_succeeded"))
        else:
            score_value = normalize_score(evaluation.get("score"))

        if score_value is not None:
            scores.append(
                Score(
                    name=metric_name,
                    score=score_value,
                    metadata={
                        "reasoning": evaluation.get("reasoning"),
                        "improvement_suggestions": evaluation.get(
                            "improvement_suggestions"
                        ),
                    },
                )
            )

    return scores


def extract_builtin_metrics(output: Dict[str, Any]) -> List[Score]:
    """Extract builtin metric scores from simulation results."""
    scores = []

    simulations = output.get("simulations", [])
    if not simulations:
        return scores

    simulation = simulations[0]
    builtin_evaluations = simulation.get("builtin_evaluations", [])

    for evaluation in builtin_evaluations:
        builtin_metric = evaluation.get("builtin_metric", {})
        metric_name = builtin_metric.get("name", "unknown")
        measurement_type = builtin_metric.get("measurement_type")

        if not evaluation.get("is_applicable", True):
            continue

        # avg_latency is reported in milliseconds, so score it against a target
        # latency instead of using the 1-10 normalization below
        print(f"metric_name: {metric_name}")
        if metric_name == "avg_latency":
            latency_ms = evaluation.get("score")

            print(f"latency_ms: {latency_ms} ms")
            if latency_ms is None:
                continue

            # Score based on distance from 1500ms target
            target_latency = 1500
            if latency_ms <= target_latency:
                normalized_score = 1.0
            else:
                normalized_score = max(
                    0.0, 1.0 - (latency_ms - target_latency) / target_latency
                )

            print(f"Latency: {latency_ms} ms, Score: {normalized_score}")

            scores.append(
                Score(
                    name="avg_latency_ms",
                    score=normalized_score,
                    metadata={
                        "actual_latency_ms": latency_ms,
                        "target_latency_ms": target_latency,
                        "is_within_target": latency_ms <= target_latency,
                    },
                )
            )
            continue

        if measurement_type == "boolean":
            score_value = normalize_score(None, evaluation.get("has_succeeded"))
        else:
            score_value = normalize_score(evaluation.get("score"))

        if score_value is not None:
            scores.append(
                Score(
                    name=metric_name,
                    score=score_value,
                    metadata={
                        "reasoning": evaluation.get("reasoning"),
                    },
                )
            )

    return scores


print("✅ Scorer functions created")
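
To make the two normalization rules concrete, here is a small standalone sketch that mirrors the arithmetic in the scorers above: latency gets full credit up to the 1500 ms target and then drops linearly to zero at twice the target, while ratings are divided by 10 and clamped. The function names `latency_score` and `scale_score` are introduced for this illustration only.

```python
def latency_score(latency_ms: float, target_ms: float = 1500.0) -> float:
    """Full credit at or below the target, then a linear drop to 0 at twice the target."""
    if latency_ms <= target_ms:
        return 1.0
    return max(0.0, 1.0 - (latency_ms - target_ms) / target_ms)


def scale_score(score_value: float) -> float:
    """Map a 1-10 rating to 0-1 by dividing by 10 and clamping."""
    return max(0.0, min(1.0, score_value / 10.0))


for latency in (900, 1500, 2250, 3000, 4000):
    print(f"{latency} ms -> {latency_score(latency):.2f}")
# 900 ms -> 1.00
# 1500 ms -> 1.00
# 2250 ms -> 0.50
# 3000 ms -> 0.00
# 4000 ms -> 0.00
```

One consequence of dividing by 10 is that the lowest rating on a 1-10 scale maps to 0.1 rather than 0, which is worth keeping in mind when reading score averages.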

## 3: Defining your tasks through Evalion

For voice agent evaluation, the "task" is the actual conversation between a simulated customer and your agent. Unlike traditional functions with simple inputs/outputs, Evalion creates autonomous testing agents that call your voice agent's phone number and conduct realistic conversations—interrupting mid-sentence, changing their mind, expressing frustration just like real customers.

Now let's create a service class to interact with Evalion's API. This handles all the HTTP requests for creating agents, test setups, and running simulations:

In [None]:
import httpx
import time
import uuid
from typing import Any, Dict, List, Optional


class EvalionAPIService:
    """Service class for interacting with the Evalion API."""

    def __init__(
        self, base_url: str = "https://api.evalion.ai/api/v1", api_token: str = ""
    ):
        self.base_url = base_url
        self.headers = {"Authorization": f"Bearer {api_token}"}

    async def create_hosted_agent(
        self, prompt: str, name: Optional[str] = None
    ) -> Dict[str, Any]:
        """Create a subject agent to be tested with the given prompt. This agent will handle incoming calls from test scenarios."""
        if not name:
            name = f"Voice Agent - {uuid.uuid4()}"
        else:
            name = f"{name} - {uuid.uuid4()}"

        payload = {
            "name": name,
            "description": "Agent created for evaluation",
            "agent_type": "outbound",
            "prompt": prompt,
            "is_active": True,
            "speaks_first": False,
            "llm_provider": "openai",
            "llm_model": "gpt-4o-mini",
            "llm_temperature": 0.7,
            "tts_provider": "elevenlabs",
            "tts_model": "eleven_turbo_v2_5",
            "tts_voice": "5IDdqnXnlsZ1FCxoOFYg",
            "stt_provider": "openai",
            "stt_model": "gpt-4o-mini-transcribe",
            "language": "en",
            "max_conversation_time_in_minutes": 5,
            "llm_max_tokens": 800,
        }

        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/hosted-agents",
                headers=self.headers,
                json=payload,
            )
            response.raise_for_status()
            return response.json()

    async def delete_hosted_agent(self, hosted_agent_id: str) -> None:
        """Delete a hosted agent."""
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.delete(
                f"{self.base_url}/hosted-agents/{hosted_agent_id}",
                headers=self.headers,
            )
            response.raise_for_status()
            return None

    async def create_agent(
        self,
        project_id: str,
        hosted_agent_id: str,
        prompt: str,
        name: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Create an agent that references a hosted agent."""
        if not name:
            name = f"Test Agent {int(time.time())}"

        payload = {
            "name": name,
            "description": "Agent for evaluation testing",
            "agent_type": "inbound",
            "interaction_mode": "voice",
            "integration_type": "phone",
            "language": "en",
            "speaks_first": False,
            "prompt": prompt,
            "is_active": True,
            "hosted_agent_id": hosted_agent_id,
            "project_id": project_id,
        }

        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/projects/{project_id}/agents",
                headers=self.headers,
                json=payload,
            )
            response.raise_for_status()
            return response.json()

    async def delete_agent(self, project_id: str, agent_id: str) -> None:
        """Delete an agent."""
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.delete(
                f"{self.base_url}/projects/{project_id}/agents/{agent_id}",
                headers=self.headers,
            )
            response.raise_for_status()
            return None

    async def create_test_set(
        self, project_id: str, name: Optional[str] = None
    ) -> Dict[str, Any]:
        """Create a test set."""
        if not name:
            name = f"Test Set {int(time.time())}"

        payload = {
            "name": name,
            "description": "Test set for evaluation",
            "project_id": project_id,
        }

        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/projects/{project_id}/test-sets",
                headers=self.headers,
                json=payload,
            )
            response.raise_for_status()
            return response.json()

    async def delete_test_set(self, project_id: str, test_set_id: str) -> None:
        """Delete a test set."""
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.delete(
                f"{self.base_url}/projects/{project_id}/test-sets/{test_set_id}",
                headers=self.headers,
            )
            response.raise_for_status()
            return None

    async def create_test_case(
        self, project_id: str, test_set_id: str, scenario: str, expected_outcome: str
    ) -> Dict[str, Any]:
        """Create a test case."""
        payload = {
            "name": f"Test Case {int(time.time())}",
            "description": "Test case for evaluation",
            "scenario": scenario,
            "expected_outcome": expected_outcome,
            "test_set_id": test_set_id,
        }

        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/projects/{project_id}/test-cases",
                headers=self.headers,
                json=payload,
            )
            response.raise_for_status()
            return response.json()

    async def delete_test_case(self, project_id: str, test_case_id: str) -> None:
        """Delete a test case."""
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.delete(
                f"{self.base_url}/projects/{project_id}/test-cases/{test_case_id}",
                headers=self.headers,
            )
            response.raise_for_status()
            return None

    async def create_test_setup(
        self,
        project_id: str,
        agent_id: str,
        persona_id: str,
        test_set_id: str,
        metrics: Optional[List[str]] = None,
    ) -> Dict[str, Any]:
        """Create a test setup."""
        payload = {
            "name": f"Test Setup {int(time.time())}",
            "description": "Test setup for evaluation",
            "project_id": project_id,
            "agents": [agent_id],
            "personas": [persona_id],
            "test_sets": [test_set_id],
            "metrics": metrics or [],
            "testing_mode": "voice",
        }

        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/test-setups",
                headers=self.headers,
                json=payload,
            )
            response.raise_for_status()
            return response.json()

    async def delete_test_setup(self, project_id: str, test_setup_id: str) -> None:
        """Delete a test setup."""
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.delete(
                f"{self.base_url}/test-setups/{test_setup_id}?project_id={project_id}",
                headers=self.headers,
            )
            response.raise_for_status()
            return None

    async def run_test_setup(self, project_id: str, test_setup_id: str) -> str:
        """Prepare and run a test setup."""
        # First prepare
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/test-setup-runs/prepare",
                headers=self.headers,
                json={"project_id": project_id, "test_setup_id": test_setup_id},
            )
            response.raise_for_status()
            test_setup_run_id = response.json()["test_setup_run_id"]

        # Then run
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(
                f"{self.base_url}/test-setup-runs/{test_setup_run_id}/run",
                headers=self.headers,
                json={"project_id": project_id},
            )
            response.raise_for_status()

        return test_setup_run_id

    async def poll_for_completion(
        self, project_id: str, test_setup_run_id: str, max_wait: int = 600
    ) -> Optional[Dict[str, Any]]:
        """Poll for simulation completion."""
        start_time = time.time()

        while time.time() - start_time < max_wait:
            async with httpx.AsyncClient(timeout=300.0) as client:
                response = await client.get(
                    f"{self.base_url}/test-setup-runs/{test_setup_run_id}/simulations",
                    headers=self.headers,
                    params={"project_id": project_id},
                )

                if response.status_code == 200:
                    data = response.json()
                    simulations = data.get("data", [])

                    if simulations:
                        sim = simulations[0]
                        status = sim.get("run_status")

                        if status in ["COMPLETED", "FAILED"]:
                            return sim

            await asyncio.sleep(5)

        return None


print("EvalionAPIService class created")

This service class provides methods for the complete evaluation lifecycle:
- Creating voice agents with your prompt
- Setting up test sets, test cases, and test setups
- Running simulations (agents call each other)
- Polling for completion and retrieving results
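
The polling step follows a common pattern: fetch the status, return on a terminal state, otherwise sleep and retry until a deadline. The sketch below isolates that loop so it can be exercised without network access. `poll_until_terminal` and the fake fetcher are hypothetical, and the `PENDING` and `RUNNING` statuses are assumptions for the demo; only `COMPLETED` and `FAILED` appear in the service class above.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


async def poll_until_terminal(
    fetch: Callable[[], Awaitable[Optional[Dict[str, Any]]]],
    max_wait: float = 600.0,
    interval: float = 5.0,
) -> Optional[Dict[str, Any]]:
    """Call fetch() until it returns a record in a terminal status, or time runs out."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        record = await fetch()
        if record and record.get("run_status") in TERMINAL_STATUSES:
            return record
        await asyncio.sleep(interval)
    return None


async def demo() -> Optional[Dict[str, Any]]:
    # Fake status sequence standing in for successive API responses.
    statuses = iter(["PENDING", "RUNNING", "COMPLETED"])

    async def fake_fetch() -> Dict[str, Any]:
        return {"run_status": next(statuses)}

    return await poll_until_terminal(fake_fetch, max_wait=1.0, interval=0.01)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Inside a notebook, where an event loop is already running, `await demo()` would replace the `asyncio.run` call. Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock adjustments.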

## 4: Running your experiment

Now we orchestrate the complete workflow: create agents, define scenarios, run simulations, and measure performance. The power is **reproducibility**—run the same evaluation after each prompt change to track improvements systematically.

Here's the main evaluation function that ties everything together:


In [None]:
from braintrust import EvalAsync

# Define the agent prompt to evaluate
AGENT_PROMPT = """
You are a professional travel agent assistant. Your role is to help customers with:
- Booking flights
- Modifying existing reservations
- Handling cancellations and rebooking
- Answering questions about flights and policies

Guidelines:
- Always introduce yourself at the beginning of the call
- Be empathetic, especially with frustrated customers
- Confirm all details before making changes
- Provide clear pricing information
- Thank the customer at the end of the call
"""


async def run_evaluation_task(input: Dict[str, Any] | str) -> Dict[str, Any]:
    """Main task function that orchestrates the evaluation workflow."""

    # Extract scenario and expected outcome from input
    # Handle both old format (input is a string) and new format (input is a dict)
    if isinstance(input, dict):
        # New format: input is {"scenario": "...", "expected": [...]}
        scenario = input.get("scenario", "")
        expected_list = input.get("expected", [])
        # Convert expected list to string format for Evalion API
        expected_outcome = (
            "\n".join(expected_list)
            if isinstance(expected_list, list)
            else str(expected_list)
        )
    else:
        # Old format: input is just the scenario string
        # (any other type is coerced to a string so `scenario` is always defined)
        scenario = str(input)
        expected_outcome = ""

    # Initialize Evalion API service
    api_service = EvalionAPIService(
        base_url="https://api.evalion.ai/api/v1", api_token=EVALION_API_TOKEN
    )

    # Store resource IDs for cleanup
    hosted_agent_id = None
    agent_id = None
    test_set_id = None
    test_setup_id = None

    try:
        # Step 1: Create hosted agent
        print("Creating hosted agent...")
        hosted_agent = await api_service.create_hosted_agent(
            prompt=AGENT_PROMPT, name="Travel Agent Eval"
        )
        hosted_agent_id = hosted_agent["id"]

        # Step 2: Create agent
        print("Creating agent...")
        agent = await api_service.create_agent(
            project_id=EVALION_PROJECT_ID,
            hosted_agent_id=hosted_agent_id,
            prompt=AGENT_PROMPT,
        )
        agent_id = agent["id"]

        # Step 3: Create test set
        print("Creating test set...")
        test_set = await api_service.create_test_set(project_id=EVALION_PROJECT_ID)
        test_set_id = test_set["id"]

        # Step 4: Create test case
        print("Creating test case...")
        await api_service.create_test_case(
            project_id=EVALION_PROJECT_ID,
            test_set_id=test_set_id,
            scenario=scenario,
            expected_outcome=expected_outcome,
        )

        # Step 5: Create test setup
        print("Creating test setup...")
        test_setup = await api_service.create_test_setup(
            project_id=EVALION_PROJECT_ID,
            agent_id=agent_id,
            persona_id=EVALION_PERSONA_ID,
            test_set_id=test_set_id,
            metrics=None,  # No extra metrics; pass a list of Evalion metric IDs to add some
        )
        test_setup_id = test_setup["id"]

        # Step 6: Run test setup
        print("Running test setup...")
        test_setup_run_id = await api_service.run_test_setup(
            project_id=EVALION_PROJECT_ID, test_setup_id=test_setup_id
        )

        # Step 7: Poll for completion
        print("Waiting for simulation to complete...")
        simulation = await api_service.poll_for_completion(
            project_id=EVALION_PROJECT_ID, test_setup_run_id=test_setup_run_id
        )

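        if not simulation:
            return {"success": False, "error": "Simulation timed out", "transcript": ""}

        # Return results
        return {
            "success": True,
            "transcript": simulation.get("transcript", ""),
            "simulations": [simulation],
        }

    except Exception as e:
        return {"success": False, "error": str(e), "transcript": ""}

    finally:
        # Step 8: Clean up whichever Evalion resources were created, even after a failure
        cleanup_steps = [
            ("test setup", test_setup_id,
             lambda: api_service.delete_test_setup(EVALION_PROJECT_ID, test_setup_id)),
            ("agent", agent_id,
             lambda: api_service.delete_agent(EVALION_PROJECT_ID, agent_id)),
            ("test set", test_set_id,
             lambda: api_service.delete_test_set(EVALION_PROJECT_ID, test_set_id)),
            ("hosted agent", hosted_agent_id,
             lambda: api_service.delete_hosted_agent(hosted_agent_id)),
        ]
        for label, resource_id, delete in cleanup_steps:
            if resource_id is None:
                continue  # This resource was never created
            try:
                print(f"Deleting {label}...")
                await delete()
            except Exception as cleanup_error:
                # A failed delete should not hide the evaluation result
                print(f"Could not delete {label}: {cleanup_error}")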


print("Evaluation task function created")
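
Since the task creates several remote resources in sequence, a failure partway through can leave some of them behind. One possible way to guarantee teardown is `contextlib.AsyncExitStack`, which runs registered callbacks in reverse order when the block exits, whether normally or through an exception. The sketch below uses fake resources and a hypothetical `run_with_cleanup` function purely to show the ordering; it does not call Evalion.

```python
import asyncio
from contextlib import AsyncExitStack
from typing import List


async def run_with_cleanup(fail_at_step: int, log: List[str]) -> None:
    """Create three fake resources; registered cleanups run in reverse order on exit."""

    async def delete(name: str) -> None:
        log.append(f"delete {name}")

    async with AsyncExitStack() as stack:
        for step, name in enumerate(["hosted agent", "agent", "test set"], start=1):
            if step == fail_at_step:
                raise RuntimeError(f"creating {name} failed")
            log.append(f"create {name}")
            # Register the matching delete as soon as the resource exists.
            stack.push_async_callback(delete, name)


async def demo() -> List[str]:
    log: List[str] = []
    try:
        await run_with_cleanup(fail_at_step=3, log=log)
    except RuntimeError as error:
        log.append(f"error: {error}")
    return log


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

In the demo, the third creation step fails, and the two resources that already exist are still deleted, newest first, before the error reaches the caller.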

In [None]:
# Run the evaluation (this would typically be run via Braintrust CLI)
# For demonstration, we show how it would be structured

# Note: In production, you would run this with:
# braintrust eval path/to/eval_script.py --braintrust_project "Your Project" --braintrust_dataset "Your Dataset"
await EvalAsync(
    "Voice Agent Evaluation Demo",
    data=dataset,
    task=run_evaluation_task,
    scores=[
        extract_custom_metrics,
        extract_builtin_metrics,
    ],
    parameters={
        "main": {
            "type": "prompt",
            "description": "Prompt to be tested by Evalion simulations",
            "default": {
                "prompt": {
                    "type": "chat",
                    "messages": [
                        {
                            "role": "system",
                            "content": AGENT_PROMPT,
                        }
                    ],
                },
                "options": {"model": "gpt-4o"},
            },
        },
    },
)

## 5: Analyzing results

After running the evaluation, navigate to Evaluations > Experiments in the Braintrust UI to see your results.

Here you will see metrics like average latency, CSAT scores, and goal completion rates across all test scenarios. You can drill down into individual scenarios to identify specific strengths and weaknesses of your voice agent.

![braintrust-results.png](./assets/braintrust-results.png)

In [None]:
# Example of what the results look like (illustrative values)
example_results = {
    "scenario": "Customer calling to book a flight from New York to Los Angeles",
    "scores": {
        "Expected Outcome": 0.9,  # Custom metric: Did agent meet expectations?
        "conversation_flow": 0.85,  # Builtin: Was conversation natural?
        "empathy": 0.92,  # Builtin: Did agent show empathy?
        "clarity": 0.88,  # Builtin: Was agent clear?
        "avg_latency_ms": 0.95,  # Builtin: Response time (1575ms actual, target 1500ms)
    },
    "metadata": {
        "transcript_length": 450,
        "duration_seconds": 180,
        "reasoning": "Agent performed well overall, successfully gathered all required information...",
        "improvement_suggestions": "Could be more proactive in offering seat selection options",
    },
}

print("Example Results Structure:")
print(json.dumps(example_results, indent=2))

The combination of scores reveals the full picture:

- **High empathy (0.92) + Lower clarity (0.88)**: Agent is warm but needs clearer fee explanations
- **Great latency (0.95) + Lower flow (0.85)**: Fast searches but awkward booking transitions

**Compare iterations**: Run v1 → Adjust prompt → Run v2 → See what improved (and what regressed).
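
Braintrust's experiment view handles this comparison in the UI. For a quick local check, the idea can be sketched as a per-metric delta with a small tolerance band. `compare_runs` is a hypothetical helper, and the scores below are made-up values for two prompt versions, not real results.

```python
from typing import Dict, List, Tuple


def compare_runs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, List[Tuple[str, float]]]:
    """Bucket shared metrics into improved, regressed, or unchanged by score delta."""
    report: Dict[str, List[Tuple[str, float]]] = {
        "improved": [],
        "regressed": [],
        "unchanged": [],
    }
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = round(candidate[name] - baseline[name], 4)
        if delta > tolerance:
            report["improved"].append((name, delta))
        elif delta < -tolerance:
            report["regressed"].append((name, delta))
        else:
            report["unchanged"].append((name, delta))
    return report


# Hypothetical scores for two prompt versions (not real results).
v1 = {"empathy": 0.92, "clarity": 0.88, "conversation_flow": 0.85}
v2 = {"empathy": 0.86, "clarity": 0.93, "conversation_flow": 0.86}

for bucket, entries in compare_runs(v1, v2).items():
    print(bucket, entries)
# improved [('clarity', 0.05)]
# regressed [('empathy', -0.06)]
# unchanged [('conversation_flow', 0.01)]
```

The tolerance keeps run-to-run noise from being reported as a change; the right value depends on how much scores vary between identical runs.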

## Summary

Evaluating voice agents requires three shifts from traditional AI evaluation:

**Full conversations**: Assess entire interactions (goal achievement, engagement, flow), not isolated responses.

**Dynamic simulations**: Test with realistic behaviors (interruptions, frustration, topic changes), not fixed prompts.

**Interaction quality**: Measure technical performance (latency, interruption handling) and subjective experience (customer satisfaction, clarity).

This integration enables systematic evaluation at scale—reproducible testing across iterations with multidimensional metrics combining technical and qualitative assessment.

---

**Congratulations!** You've learned how to:

- Set up an end-to-end voice agent evaluation pipeline  
- Integrate Evalion's simulation testing with Braintrust's evaluation platform  
- Create custom scorers and metrics  
- Orchestrate automated testing workflows  
- Extract and analyze evaluation results  

### Next steps
1. **Start small**: Create 5-10 core scenarios (bookings, cancellations, common questions)
2. **Establish baselines**: Run first evaluation to benchmark current performance
3. **Iterate systematically**: Adjust prompt → Run eval → Compare → Deploy
4. **Automate**: Integrate with CI/CD to test every prompt change
5. **Track trends**: Monitor improvement over time via Braintrust dashboard

### Resources

- [Braintrust Documentation](https://www.braintrust.dev/docs)
- [Evalion API Documentation](http://docs.evalion.ai)
- [Contact Evalion](mailto:support@evalion.ai)
