# Effectively evaluating a prompt chain

When building applications that chain prompts—where one LLM call’s output feeds into the next—it’s best to measure more than a single metric. In this cookbook, we’ll demonstrate how to:

1. Trace and evaluate a complete end-to-end agent in Braintrust.
2. Isolate and evaluate a particular step in the chain to identify and measure issues.

We’ll walk through a travel-planning agent that decides what actions to take (e.g., calling a weather or flight API) and uses a judge function to decide if each step is valid. Finally, it produces an itinerary. We’ll do an end-to-end evaluation, then zoom in on the judge step to see how effectively it flags unnecessary actions.


## Getting started

Before getting started, make sure you have a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/). Add the OpenAI key to your Braintrust account's [AI providers](https://www.braintrust.dev/app/settings?subroute=secrets) configuration and create a [BRAINTRUST_API_KEY](https://www.braintrust.dev/app/settings?subroute=api-keys). You can also add an API key for any other AI provider, but be sure to change the `model` argument in the code to a model that provider offers. Lastly, add your `BRAINTRUST_API_KEY` to your Python environment.

```bash
export BRAINTRUST_API_KEY="YOUR_BRAINTRUST_API_KEY"
```

<Callout type="info">
Best practice is to export your API key as an environment variable. However, to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
</Callout>

Install the required Python dependencies:

In [None]:
pip install braintrust openai autoevals jsonschema

Next, we'll import all of the modules we need and initialize our OpenAI client.

In [None]:
import os
import json
import random
import openai
import jsonschema
from datetime import datetime, timedelta
from typing import Dict, Any, List, Optional

import braintrust
import autoevals


# Prefer the environment variable; replace the placeholder only for local experiments,
# and never commit a real key.
BRAINTRUST_API_KEY = os.environ.get(
    "BRAINTRUST_API_KEY", "YOUR_BRAINTRUST_API_KEY"
)
os.environ["BRAINTRUST_API_KEY"] = BRAINTRUST_API_KEY

client = braintrust.wrap_openai(
    openai.OpenAI(
        api_key=BRAINTRUST_API_KEY,
        base_url="https://api.braintrust.dev/v1/proxy",
    )
)

## Mock APIs

Here we define placeholder “mock” APIs for weather and flight searches. In real applications, you’d call external services or databases. However, for testing and illustration purposes, we simulate dynamic outputs (e.g., randomly chosen weather, airfare prices, seat availability) to confirm the agent logic works without external dependencies.


In [3]:
def get_future_date() -> str:
    base = datetime(2025, 1, 23)
    if random.random() < 0.7:
        days_ahead = random.randint(1, 10)
    else:
        days_ahead = random.randint(11, 365)
    return (base + timedelta(days=days_ahead)).strftime("%Y-%m-%d")


def mock_weather_api(city: str, date: str) -> Dict[str, Any]:
    return {
        "condition": random.choice(["sunny", "rainy", "cloudy"]),
        "temperature": random.randint(40, 95),
        "date": date,
    }


def mock_flight_api(origin: str, destination: str) -> Dict[str, Any]:
    return {
        "economy_price": random.randint(200, 800),
        "business_price": random.randint(800, 2000),
        "seats_left": random.randint(0, 100),
    }
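
Because the mocks draw from Python's global `random` state, two runs of the same experiment can see different weather and prices. If you want repeatable runs, one option is to pass in a seeded generator. The sketch below is an illustration rather than part of the cookbook's agent; `seeded_weather_api` and its `rng` parameter are hypothetical names.

```python
import random
from typing import Any, Dict


def seeded_weather_api(city: str, date: str, rng: random.Random) -> Dict[str, Any]:
    """Hypothetical variant of mock_weather_api that draws from an explicit RNG."""
    return {
        "condition": rng.choice(["sunny", "rainy", "cloudy"]),
        "temperature": rng.randint(40, 95),
        "date": date,
    }


# Two generators created with the same seed produce the same "forecast".
first = seeded_weather_api("Miami", "2025-02-01", random.Random(42))
second = seeded_weather_api("Miami", "2025-02-01", random.Random(42))
print(first == second)  # True
```

The same pattern would apply to `mock_flight_api` if you need stable prices across experiments.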

## JSON schema & validation helpers for agent output

We use a JSON schema to keep the agent’s output consistent. The agent can only return one of four actions: `GET_WEATHER`, `GET_FLIGHTS`, `GENERATE_ITINERARY`, or `DONE`. This constraint ensures we can reliably parse the agent’s response and handle it safely.

In [4]:
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {
            "type": "string",
            "enum": ["GET_WEATHER", "GET_FLIGHTS", "GENERATE_ITINERARY", "DONE"],
        },
        "parameters": {"type": "object"},
    },
    "required": ["action"],
    "additionalProperties": False,
}

We also validate the agent’s JSON output and attempt to fix it if it’s invalid or violates the schema. This prevents downstream parsing errors caused by unexpected or malformed JSON.

In [5]:
def prompt_llm_fix_json(original_content: str, schema: Dict[str, Any]) -> Optional[str]:
    """
    If the LLM returns invalid or schema-violating JSON, we ask it to fix it.
    Added a brief example of a valid fix to strengthen instructions.
    """
    fix_prompt = f"""The following JSON is invalid or does not match the required schema:
Original JSON:
{original_content}

Schema (must match exactly):
{json.dumps(schema, indent=2)}

Here are examples of valid JSON matching this schema:

Example 1 (GET_WEATHER):
{{
  "action": "GET_WEATHER",
  "parameters": {{
    "city": "Boston",
    "date": "2025-01-30"
  }}
}}

Example 2 (GET_FLIGHTS):
{{
  "action": "GET_FLIGHTS",
  "parameters": {{
    "origin": "NYC",
    "destination": "Miami"
  }}
}}

Now please correct the JSON so it is valid and follows the schema exactly.
Return ONLY valid JSON, nothing else.
"""

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a JSON validator. Return valid JSON only.",
                },
                {"role": "user", "content": fix_prompt},
            ],
            temperature=0,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        braintrust.current_span().log(error=f"Error during LLM fix: {e}")
        return None


def validate_json(
    content: str, schema: Dict[str, Any], retries: int = 3
) -> Optional[Dict[str, Any]]:
    span = braintrust.current_span()
    for attempt in range(retries):
        try:
            data = json.loads(content)
            jsonschema.validate(instance=data, schema=schema)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as e:
            span.log(error=f"JSON validation error (attempt {attempt+1}): {e}")
            fixed_content = prompt_llm_fix_json(content, schema)
            if not fixed_content:
                break
            content = fixed_content
    span.log(error="Failed to validate JSON after retries.")
    return None
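
A common reason an LLM response fails `json.loads` is that the model wraps otherwise valid JSON in a Markdown code fence. Stripping that wrapper locally before validation can save a repair round trip. The following is a minimal sketch under that assumption; `strip_code_fence` and `try_parse_action` are hypothetical helpers, not part of the cookbook or of any library.

```python
import json
import re
from typing import Any, Dict, Optional

# Built programmatically so the example does not contain a literal fence.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL | re.IGNORECASE
)


def strip_code_fence(content: str) -> str:
    """Hypothetical helper: remove a surrounding Markdown code fence, if any."""
    text = content.strip()
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text


def try_parse_action(content: str) -> Optional[Dict[str, Any]]:
    """Return the parsed JSON object, or None if the text is not a JSON object."""
    try:
        data = json.loads(strip_code_fence(content))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


wrapped = FENCE + "json\n" + '{"action": "DONE"}' + "\n" + FENCE
print(try_parse_action(wrapped))  # {'action': 'DONE'}
```

A helper like this would run before the schema check; anything it cannot parse would still fall through to the LLM-based repair in `validate_json`.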

## Step validation and correction

The agent may propose actions that are logically unnecessary (e.g., fetching weather it already has) or that contradict existing data. To solve this, we define a judge function to validate each proposed step. For example, if the agent attempts to `GET_WEATHER` a second time for data it already fetched, the judge flags it, then we prompt the LLM to fix it.

In [6]:
def judge_step_with_cot(
    step_description: str, context_data: Optional[Dict[str, Any]] = None
) -> tuple[bool, str]:
    """
    Returns (is_ok, chain_of_thought).
    If the final decision is 'N', is_ok=False; if 'Y', is_ok=True.

    NOTE: We include some of the context (origin, destination, etc.) so the judge
    doesn't assume we have zero info if 'parameters' are empty.
    """
    with braintrust.start_span(name="judge_step") as jspan:
        # We gather minimal context details to show the judge that we do have relevant info
        context_snippet = ""
        if context_data:
            origin = context_data["input_data"].get("origin", "")
            destination = context_data["input_data"].get("destination", "")
            budget = context_data["input_data"].get("budget", "")
            preferences = context_data["input_data"].get("preferences", {})
            # Summarize the existing weather/flight data too
            wdata = context_data["weather_data"]
            fdata = context_data["flight_data"]

            context_snippet = (
                f"Context:\n"
                f" - Origin: {origin}\n"
                f" - Destination: {destination}\n"
                f" - Budget: {budget}\n"
                f" - Preferences: {preferences}\n"
                f" - Known Weather: {json.dumps(wdata, indent=2)}\n"
                f" - Known Flight: {json.dumps(fdata, indent=2)}\n"
            )

        prompt_msg = f"""You are a strict judge of correctness in a travel-planning chain. Your task is to determine whether or not the next task is a valid step to take. Typically a valid step is if the context/parameter does not yet have information. If the context/parameter already has information, the step is not valid. If all of the context is filled out, then generating the itinerary is a valid step.

{context_snippet}

Step description:
\"\"\"
{step_description}
\"\"\"

Provide a short chain-of-thought. 
Then end with: "Final Decision: Y" or "Final Decision: N"
"""

        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a meticulous correctness judge.",
                    },
                    {"role": "user", "content": prompt_msg},
                ],
                temperature=0,
            )
            content = response.choices[0].message.content.strip()
            jspan.log(metadata={"raw_judge_response": content})

            lines = content.splitlines()
            final_decision = "N"
            rationale_lines = []
            for line in lines:
                if line.strip().startswith("Final Decision:"):
                    if "Y" in line.upper():
                        final_decision = "Y"
                    else:
                        final_decision = "N"
                else:
                    rationale_lines.append(line)

            rationale_text = "\n".join(rationale_lines).strip()
            is_ok = final_decision.upper() == "Y"
            return is_ok, rationale_text

        except Exception as e:
            jspan.log(error=f"Judge LLM error: {e}")
            return False, "Error in judge LLM"


def fix_step_with_cot(current_action_json: str) -> str:
    """
    Try to fix an incorrect or incomplete travel planning step by providing examples
    of a valid step. Strengthened instructions to guide the LLM better.
    """
    fix_prompt = f"""We have an incorrect or incomplete travel planning step:

\"\"\"{current_action_json}\"\"\"

Below are examples of valid steps:

Example 1:
{{
  "action": "GET_WEATHER",
  "parameters": {{
    "city": "London",
    "date": "2025-05-10"
  }}
}}

Example 2:
{{
  "action": "GET_FLIGHTS",
  "parameters": {{
    "origin": "SFO",
    "destination": "LAS"
  }}
}}

Example 3 (generate final itinerary if we already have enough data):
{{
  "action": "GENERATE_ITINERARY",
  "parameters": {{}}
}}

Please produce a corrected JSON object that fully addresses the problem, 
following this schema:
{json.dumps(ACTION_SCHEMA, indent=2)}

Return ONLY the corrected JSON, no additional commentary.
"""
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You fix incomplete travel planning steps into valid JSON.",
                },
                {"role": "user", "content": fix_prompt},
            ],
            temperature=0,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        braintrust.current_span().log(error=f"Error fixing step with LLM: {e}")
        return ""
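
The judge's verdict is recovered by scanning the response for a line that starts with `Final Decision:`. Models do not always reproduce that format exactly (lowercase, extra spaces, "Yes" instead of "Y"), so a slightly more tolerant parser can reduce false rejections. The sketch below is one possible approach; `parse_judge_response` is a hypothetical name, and like the original logic it fails closed when no verdict is found.

```python
import re
from typing import Tuple

# Accepts "Final Decision: Y", "final decision : no", etc. at the start of a line.
_DECISION_RE = re.compile(
    r"^\s*final decision\s*:\s*(yes|no|y|n)\b", re.IGNORECASE | re.MULTILINE
)


def parse_judge_response(content: str) -> Tuple[bool, str]:
    """Hypothetical parser: return (is_ok, rationale) from the judge's raw text."""
    matches = list(_DECISION_RE.finditer(content))
    # Everything that is not a verdict line is treated as the rationale.
    rationale = "\n".join(
        line for line in content.splitlines() if not _DECISION_RE.match(line)
    ).strip()
    if not matches:
        return False, rationale  # no recognizable verdict: reject the step
    verdict = matches[-1].group(1)  # if the model repeats itself, the last one wins
    return verdict[0].upper() == "Y", rationale


print(parse_judge_response("Weather is missing.\nFinal Decision: Y"))
# (True, 'Weather is missing.')
```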

## Final itinerary generation

Once the agent gathers enough information (e.g., weather and flight details), we expect a final itinerary to be generated. Below is a function that takes all the gathered data (user preferences, API responses, and budget details) and constructs a coherent multi-day travel plan. The result is a textual description of the trip, including recommended accommodations, daily activities, and tips.

In [7]:
def generate_final_itinerary(context: Dict[str, Any]) -> Optional[str]:
    with braintrust.start_span(name="generate_itinerary"):
        input_data = context["input_data"]
        weather_data = context["weather_data"]
        flight_data = context["flight_data"]

        prompt = (
            f"Based on the data, generate a travel itinerary.\n\n"
            f"Origin: {input_data['origin']}\n"
            f"Destination: {input_data['destination']}\n"
            f"Start Date: {input_data['start_date']}\n"
            f"Budget: {input_data['budget']}\n"
            f"Preferences: {json.dumps(input_data['preferences'])}\n\n"
            f"Weather Data: {json.dumps(weather_data, indent=2)}\n"
            f"Flight Data: {json.dumps(flight_data, indent=2)}\n\n"
            "Create a day-by-day plan, mention booking recs, accommodations, etc. "
            "Use a helpful, concise style."
        )
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "You are a thorough travel planner."},
                    {"role": "user", "content": prompt},
                ],
                temperature=0.3,
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            braintrust.current_span().log(error=f"Error generating itinerary: {e}")
            return None

## Defining the agent prompt for the "next action"

We create a system prompt that summarizes known data (e.g., weather, flights) and reiterates the JSON schema requirements. This ensures the agent doesn’t redundantly fetch data and responds in valid JSON.

In [None]:
def generate_agent_prompt(context: Dict[str, Any]) -> str:
    input_data = context["input_data"]
    weather_data = context["weather_data"]
    flight_data = context["flight_data"]

    # System instructions encouraging valid JSON and minimal iteration
    system_instructions = (
        "You are an autonomous travel planning assistant. "
        "You MUST produce strictly valid JSON that matches the schema below. "
        "If you need to GET_WEATHER, ensure you have both 'city' and 'date'. "
        "If you need to GET_FLIGHTS, ensure you have 'origin' and 'destination'. "
        "If the data is sufficient, proceed to GENERATE_ITINERARY. Avoid repeating an action if it's not needed. "
        "Actions can be: GET_WEATHER, GET_FLIGHTS, GENERATE_ITINERARY, or DONE."
    )

    # Summarize current context
    user_prompt = (
        "Current Travel Context:\n"
        f" - Origin: {input_data['origin']}\n"
        f" - Destination: {input_data['destination']}\n"
        f" - Start Date: {input_data['start_date']}\n"
        f" - Budget: {input_data['budget']}\n"
        f" - Preferences: {json.dumps(input_data['preferences'])}\n\n"
    )
    if weather_data:
        user_prompt += f"Weather Data: {json.dumps(weather_data, indent=2)}\n\n"
    if flight_data:
        user_prompt += f"Flight Data: {json.dumps(flight_data, indent=2)}\n\n"

    # State final instructions for returning JSON
    user_prompt += (
        "Next action? Respond only with valid JSON like:\n"
        "{\n"
        '  "action": "GET_WEATHER" | "GET_FLIGHTS" | "GENERATE_ITINERARY" | "DONE",\n'
        '  "parameters": { ... }\n'
        "}\n\n"
        "Here is the schema to follow:\n"
        f"{json.dumps(ACTION_SCHEMA, indent=2)}"
    )

    return system_instructions + "\n\n" + user_prompt

## The main agent loop

Here we build the core loop that powers our entire travel planning agent. It runs for a maximum number of iterations, doing the following each time:

- **Prompt** the LLM for the next action.
- **Validate** the JSON response against our schema.
- **Judge** if the step is logical in context. If it fails, attempt to fix it.
- **Execute** the step if valid (e.g., calling the mock weather/flight APIs).
- If the agent indicates `GENERATE_ITINERARY`, produce the final itinerary and exit.

By iterating until a final plan is reached (or until we exhaust retries), we create a semi-autonomous workflow that can correct missteps along the way.

In [9]:
@braintrust.traced
def agent_loop(client: openai.OpenAI, input_data: Dict[str, Any]) -> str:
    """
    Up to 10 iterations. If step is invalid or judge says 'N', we fix the step.
    Returns a JSON string with final itinerary + iteration logs.
    """
    context: Dict[str, Any] = {
        "input_data": input_data,
        "weather_data": {},
        "flight_data": {},
        "decisions": {},
        "itinerary": None,
        "iteration_logs": [],
    }

    max_iterations = 10
    iteration = 0
    current_action_json = ""
    fix_attempts = 0

    while iteration < max_iterations:
        iteration += 1
        with braintrust.start_span(
            name=f"travel_planning_iteration_{iteration}"
        ) as iter_span:
            # 1) Acquire or fix the next action JSON
            if not current_action_json:
                llm_prompt = generate_agent_prompt(context)
                try:
                    resp = client.chat.completions.create(
                        model="gpt-4",
                        messages=[
                            {"role": "system", "content": llm_prompt},
                        ],
                        temperature=0,
                    )
                    current_action_json = resp.choices[0].message.content.strip()
                except Exception as e:
                    iter_span.log(error=f"Error calling LLM for next action: {e}")
                    context["itinerary"] = "Failed to call LLM."
                    break
            else:
                # We have an action JSON that was flagged, fix it
                fixed = fix_step_with_cot(current_action_json)
                if fixed:
                    current_action_json = fixed
                else:
                    iter_span.log(error="Failed to fix step with LLM.")
                    context["itinerary"] = "Could not fix step."
                    break

            # 2) Validate
            action_data = validate_json(current_action_json, ACTION_SCHEMA)
            if not action_data:
                fix_attempts += 1
                if fix_attempts > 3:
                    iter_span.log(
                        error="Could not produce valid action after multiple fixes."
                    )
                    context["itinerary"] = "Invalid action, gave up."
                    break
                # re-try
                continue

            fix_attempts = 0
            action = action_data["action"]
            parameters = action_data.get("parameters", {})

            # 3) Judge step - pass in some context so it doesn't automatically reject
            step_desc = f"Action: {action}, Params: {parameters}"
            is_ok, rationale = judge_step_with_cot(step_desc, context)

            iteration_log = {
                "iteration": iteration,
                "raw_llm_json": current_action_json,
                "action": action,
                "parameters": parameters,
                "judge_decision": "Y" if is_ok else "N",
                "judge_rationale": rationale,
            }
            context["iteration_logs"].append(iteration_log)

            current_action_json = ""
            if not is_ok:
                iter_span.log(error="Judge flagged an error => fix next iteration.")
                current_action_json = iteration_log["raw_llm_json"]
                fix_attempts += 1
                if fix_attempts > 3:
                    context["itinerary"] = "Judge flagged error repeatedly. Gave up."
                    break
                continue

            # 4) Execute action
            if action == "GET_WEATHER":
                city = parameters.get("city")
                date = parameters.get("date")
                if not city or not date:
                    iter_span.log(
                        error="Missing GET_WEATHER params => fix next iteration."
                    )
                    current_action_json = iteration_log["raw_llm_json"]
                    fix_attempts += 1
                    continue
                wdata = mock_weather_api(city, date)
                context["weather_data"][date] = wdata
                iter_span.log(metadata={"fetched_weather": wdata})

            elif action == "GET_FLIGHTS":
                origin = parameters.get("origin")
                dest = parameters.get("destination")
                if not origin or not dest:
                    iter_span.log(
                        error="Missing GET_FLIGHTS params => fix next iteration."
                    )
                    current_action_json = iteration_log["raw_llm_json"]
                    fix_attempts += 1
                    continue
                fdata = mock_flight_api(origin, dest)
                context["flight_data"] = fdata
                iter_span.log(metadata={"fetched_flight": fdata})

            elif action == "GENERATE_ITINERARY":
                itinerary = generate_final_itinerary(context)
                context["itinerary"] = itinerary or "Failed to generate itinerary."
                break

            elif action == "DONE":
                iter_span.log(metadata={"status": "LLM indicated DONE"})
                break

            else:
                iter_span.log(error=f"Unknown action '{action}' => fix next iteration.")
                current_action_json = iteration_log["raw_llm_json"]
                fix_attempts += 1
                if fix_attempts > 3:
                    context["itinerary"] = f"Unknown action '{action}', gave up."
                    break

    final_data = {
        "final_itinerary": context["itinerary"],
        "iteration_logs": context["iteration_logs"],
        "input_data": context["input_data"],
    }
    return json.dumps(final_data, indent=2)
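
Since `agent_loop()` returns a JSON string containing `final_itinerary`, `iteration_logs`, and `input_data`, you can post-process it to see how a run unfolded before opening the trace. The following sketch condenses the logs into a few counts; `summarize_run` is a hypothetical helper, and the sample payload is hand-written for illustration, not real agent output.

```python
import json
from collections import Counter
from typing import Any, Dict


def summarize_run(result_json: str) -> Dict[str, Any]:
    """Hypothetical helper: condense agent_loop()'s JSON output into a few counts."""
    result = json.loads(result_json)
    logs = result.get("iteration_logs", [])
    return {
        "iterations_logged": len(logs),
        "actions": dict(Counter(log["action"] for log in logs)),
        "judge_rejections": sum(1 for log in logs if log["judge_decision"] == "N"),
    }


# Hand-written stand-in that uses the same keys as final_data.
sample = json.dumps(
    {
        "final_itinerary": "Day 1: ...",
        "iteration_logs": [
            {"action": "GET_WEATHER", "judge_decision": "Y"},
            {"action": "GET_WEATHER", "judge_decision": "N"},
            {"action": "GET_FLIGHTS", "judge_decision": "Y"},
        ],
        "input_data": {},
    }
)
print(summarize_run(sample))
```

A high `judge_rejections` count relative to `iterations_logged` is one signal that either the agent prompt or the judge deserves a closer look.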

## Evaluation dataset

Our workflow needs sample input data for testing. Below are three hardcoded test cases with different origins, destinations, budgets, and preferences. In a real application, you'd use a more extensive dataset with dozens, if not hundreds, of test cases.

In [10]:
def dataset() -> List[braintrust.EvalCase]:
    return [
        braintrust.EvalCase(
            input={
                "origin": "NYC",
                "destination": "Miami",
                "start_date": get_future_date(),
                "budget": "high",
                "preferences": {"activity_level": "high", "foodie": True},
            },
        ),
        braintrust.EvalCase(
            input={
                "origin": "SFO",
                "destination": "Seattle",
                "start_date": get_future_date(),
                "budget": "medium",
                "preferences": {"activity_level": "low"},
            },
        ),
        braintrust.EvalCase(
            input={
                "origin": "IAH",
                "destination": "Paris",
                "start_date": get_future_date(),
                "budget": "low",
                "preferences": {"activity_level": "low"},
            },
        ),
    ]
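
One inexpensive way to grow beyond a handful of cases is to cross a few dimensions of the input. The sketch below builds plain input dictionaries in the same shape as the cases above; `build_inputs` and the constant names are hypothetical, and each resulting dictionary could be wrapped in `braintrust.EvalCase(input=...)`.

```python
from itertools import product
from typing import Any, Dict, List

ROUTES = [("NYC", "Miami"), ("SFO", "Seattle"), ("IAH", "Paris")]
BUDGETS = ["low", "medium", "high"]
ACTIVITY_LEVELS = ["low", "high"]


def build_inputs(start_date: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: cross routes, budgets, and activity levels."""
    return [
        {
            "origin": origin,
            "destination": destination,
            "start_date": start_date,
            "budget": budget,
            "preferences": {"activity_level": level},
        }
        for (origin, destination), budget, level in product(
            ROUTES, BUDGETS, ACTIVITY_LEVELS
        )
    ]


inputs = build_inputs("2025-02-01")
print(len(inputs))  # 18
```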

## Defining our scoring function

We implement a custom LLM-based scorer that checks whether the final itinerary actually meets the user’s preferences. For instance, if the user wants a “high-activity trip,” but the final plan doesn’t suggest outdoor excursions or active elements, the scorer may judge that it’s missing key requirements.

In [11]:
judge_itinerary = autoevals.LLMClassifier(
    name="LLM Itinerary Judge",
    prompt_template=(
        "User preferences: {{input.preferences}}\n\n"
        "Here is the final itinerary:\n{{output}}\n\n"
        "Does this itinerary meet the user preferences? (Y/N)\n"
        "Provide a short chain-of-thought, then say 'Final: Y' or 'Final: N'.\n"
    ),
    choice_scores={"Y": 1.0, "N": 0.0},
    use_cot=True,
)
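
An LLM judge is more informative when paired with a cheap deterministic check, because a run that never produced an itinerary should score zero regardless of what the judge thinks of the failure message. The sketch below assumes Braintrust accepts plain Python functions as scorers and calls them with keyword arguments such as `input`, `output`, and `expected`; `itinerary_produced` is a hypothetical name, and the prefixes mirror the failure strings that `agent_loop()` writes.

```python
import json
from typing import Any, Optional

# Failure messages that agent_loop() stores in place of an itinerary.
FAILURE_PREFIXES = (
    "Failed to",
    "Could not",
    "Invalid action",
    "Judge flagged",
    "Unknown action",
)


def itinerary_produced(
    output: str, expected: Optional[Any] = None, **kwargs: Any
) -> float:
    """Hypothetical scorer: 1.0 if the run ended with a real itinerary, else 0.0."""
    try:
        itinerary = json.loads(output).get("final_itinerary")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return 0.0  # output was not the JSON object agent_loop() returns
    if not isinstance(itinerary, str) or not itinerary.strip():
        return 0.0
    return 0.0 if itinerary.startswith(FAILURE_PREFIXES) else 1.0


print(itinerary_produced(json.dumps({"final_itinerary": "Invalid action, gave up."})))
# 0.0
```

If the assumption about function scorers holds for your SDK version, this could be listed next to `judge_itinerary` in the `scores` argument.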

## Evaluating the agent's end-to-end performance

We define a `chain_task` that calls `agent_loop()`, then run an eval. Because `agent_loop()` is wrapped with `@braintrust.traced`, each iteration and sub-step gets logged in the Braintrust UI.


In [None]:
def chain_task(input_data: Dict[str, Any], hooks) -> str:
    hooks.metadata["origin"] = input_data["origin"]
    hooks.metadata["destination"] = input_data["destination"]
    return agent_loop(client, input_data)


if __name__ == "__main__":
    braintrust.Eval(
        name="TravelPlanner",
        data=dataset,
        task=chain_task,
        scores=[judge_itinerary],
        experiment_name="end-to-end-eval",
        metadata={"notes": "End to end evaluation of our travel planning agent"},
    )

![end-to-end](./assets/e2e.png)

Starting with this top-down approach is generally recommended because it allows you to spot where the chain might be breaking or not performing as expected. The Braintrust UI allows you to click into any given component and view information such as the prompt or metadata. Viewing each step can help you decide which sub-component (weather fetch, flight fetch, judge) might need a closer look or some tuning. You would then run a separate evaluation on that component.


## Evaluating the judge step in isolation

After evaluating the end-to-end performance of an agent, you might want to take a closer look at a single component. For instance, if you notice that the agent frequently repeats certain actions when it shouldn't, you might suspect the judge logic is misclassifying steps. To do this, we'll need to create a new experiment, a new dataset of test cases, and new scorers to evaluate specific components.

<Callout type="info">
Depending on the complexity of your agent, or how you like to organize your work in Braintrust, you can choose to create a new project for this evaluation instead of adding it to the existing project like we do here.
</Callout>

For demonstration purposes, our approach is going to be simple. We create a judge-only dataset, along with a minimal `judge_eval_task` that passes the sample inputs through `judge_step_with_cot()` and then compares the response to our expected label using a heuristic scorer called `ExactMatch()`.

In [13]:
def dataset_judge_eval() -> List[braintrust.EvalCase]:
    """
    A small dataset focusing on testing judge_step_with_cot in isolation.
    Each EvalCase includes a step_description and some minimal context.
    We also specify an expected final decision ("Y" or "N") from the judge.
    """
    return [
        braintrust.EvalCase(
            input={
                "step_description": "Action: GET_WEATHER, Params: {'city': 'NYC', 'date': '2025-02-01'}",
                "context_data": {
                    "input_data": {
                        "origin": "NYC",
                        "destination": "Miami",
                        "budget": "medium",
                        "preferences": {"foodie": True},
                    },
                    "weather_data": {},  # no existing weather => judge should say "Y"
                    "flight_data": {},
                },
            },
            expected="Y",
        ),
        braintrust.EvalCase(
            input={
                "step_description": "Action: GET_FLIGHTS, Params: {'origin': 'NYC', 'destination': 'Miami'}",
                "context_data": {
                    "input_data": {
                        "origin": "NYC",
                        "destination": "Miami",
                        "budget": "low",
                        "preferences": {},
                    },
                    # Suppose we already have flight data => judge might say "N"
                    "weather_data": {},
                    "flight_data": {
                        "economy_price": 300,
                        "business_price": 1200,
                        "seats_left": 10,
                    },
                },
            },
            expected="N",
        ),
    ]


def judge_eval_task(inputs: Dict[str, Any], hooks) -> str:
    """
    A mini-task to evaluate judge_step_with_cot in isolation.
    We pass the step_description and context_data to the function,
    then return the judge's final decision ("Y" or "N").
    """
    step_description = inputs["step_description"]
    context_data = inputs["context_data"]

    is_ok, _ = judge_step_with_cot(step_description, context_data)

    return "Y" if is_ok else "N"

In [None]:
if __name__ == "__main__":

    braintrust.Eval(
        name="TravelPlanner",
        data=dataset_judge_eval,
        task=judge_eval_task,
        scores=[autoevals.ExactMatch()],
        experiment_name="judge-step-eval",
        metadata={"notes": "Evaluating the judge_step_with_cot function in isolation."},
    )
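
`ExactMatch()` gives a single accuracy number, but for a judge the two kinds of mistakes are not equivalent: rejecting a valid step costs an extra iteration, while approving an unnecessary step is the failure the judge exists to prevent. A rough sketch of separating the two is to compute precision and recall for the "N" label. `rejection_metrics` is a hypothetical helper, and the pairs below are hand-written for illustration rather than results from a real run.

```python
from typing import Dict, Iterable, Tuple


def rejection_metrics(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Hypothetical helper: precision/recall of the judge's "N" (reject) decisions.

    Each pair is (expected, predicted), both "Y" or "N".
    """
    tp = fp = fn = 0
    for expected, predicted in pairs:
        if predicted == "N" and expected == "N":
            tp += 1  # correctly rejected an unnecessary step
        elif predicted == "N" and expected == "Y":
            fp += 1  # rejected a step that was actually valid
        elif predicted == "Y" and expected == "N":
            fn += 1  # let an unnecessary step through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


# Hand-written (expected, predicted) pairs for illustration only.
pairs = [("Y", "Y"), ("N", "N"), ("N", "Y"), ("Y", "N")]
print(rejection_metrics(pairs))  # {'precision': 0.5, 'recall': 0.5}
```

Low recall here would point at the symptom described earlier, where the agent repeats actions because the judge lets them through.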

After you run this evaluation, you can return to your original project in Braintrust. There you will see the new experiment for the judge step.

![homepage](./assets/homepage.png)

If you click into the experiment, you can see all of the different evaluations and summaries. You can also click an individual row to view a full trace which includes the task function, metadata, and the scorers.

![judge-eval](./assets/judge.png)

## What’s next:

- [Read about](https://www.braintrust.dev/blog/evaluating-agents) best practices for evaluating agents.
- [Learn](https://www.braintrust.dev/blog/after-evals) what to do after you run an eval.
- Try out another [agent cookbook](https://www.braintrust.dev/docs/cookbook/recipes/APIAgent-Py).

