# How good eval infra works

An introduction to all things Braintrust via a speed-run through building evals for a customer support bot


In this demo, we aim to enhance a city's customer support application by developing an AI workflow that streamlines the request process and ensures each request is routed to the correct department.

This is what their current customer support app looks like:

<img src="./data/city-of-encinitas-customer-support.gif" height="300"/>

Instead of requiring citizens to know exactly where their requests should be routed, jump through all these hoops, and dismiss one pop-up after another, we hypothesize a simple UI where the user can describe their issue and an AI pipeline that will take that description, route the request to the appropriate contact, prompt the UI for any required information, and provide an answer or informational response if warranted.

Let's start by importing the required libraries.


In [None]:
import enum
import json
import os
import random

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from functools import partial
from textwrap import dedent
from typing import Annotated, Literal

import anthropic as ac
import braintrust as bt
import openai as oai
import pandas as pd
import requests

from braintrust import wrap_anthropic, wrap_openai
from dotenv import load_dotenv
from pydantic import BaseModel, BeforeValidator, Field
from thefuzz import process as fuzzy_process

load_dotenv(override=True)

## Setup

Once you're signed up on [Braintrust](https://www.braintrust.dev/) and have set the appropriate API keys in your `.env` file, you're ready to go. For this particular demo, you'll need the following API keys:

- `BRAINTRUST_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`

With that in place, we can create our [Braintrust project](https://www.braintrust.dev/docs/guides/projects) using `braintrust.projects.create()` (this method will create or return the project if it already exists).

We also define the Anthropic and OpenAI clients as usual, along with a "wrapped" version of each. The wrapped clients automatically capture usage data (e.g., prompt tokens) in Braintrust when used within the context of a logger or experiment.


In [None]:
BT_PROJECT_NAME = "wayde-ai-evals-course-2025"
MAX_WORKERS = 5

bt_project = bt.projects.create(name=BT_PROJECT_NAME)

ac_client = ac.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
oai_client = oai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

wrapped_ac_client = wrap_anthropic(ac_client)
wrapped_oai_client = wrap_openai(oai_client)


## Data


"Every error analysis starts with _traces_: the full sequence of inputs, outputs, and actions taken by the pipeline for a given input"

The recommendation is to start with ~100 traces of diverse user intents, with preference given to real-world data where possible. In this case, we don't have access to such data, so we'll follow the AIE methodology for creating synthetic data. We'll braintrustify this workflow a bit along the way :)


### Step 1: Define dimensions

"First, before prompting anything, we define key dimensions of the query space... [to] help us systematically vary different aspects of a user’s request"


In [None]:
class DimensionTuple(BaseModel):
    inquiry_type: str = Field(
        description="What kind of inquiry are they making (e.g. informational_ask, complaint, request_for_action_or_service)"
    )
    query_style_and_detail: str = Field(
        description="The style and detail of the query (e.g. short_keywords_minimal_details, concise_moderate_detail, verbose_detailed_request)"
    )
    customer_persona: str = Field(
        description="Who is making the request (e.g. business_owner, citizen)",
    )
    customer_age: str = Field(
        description="The age of the person making the request (e.g. senior_over_65, youth_under_18, adult)",
    )


class DimensionTuples(BaseModel):
    tuples: list[DimensionTuple]
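
To get a feel for how large this query space is, here is a minimal sketch that enumerates the full cross product of the example values for each dimension. The `DIMENSIONS` dict is an illustrative assumption and not part of the notebook's pipeline; the LLM step later samples from this space rather than exhausting it.

```python
from itertools import product

# Example values for each dimension (illustrative; mirrors the definitions in this section)
DIMENSIONS = {
    "inquiry_type": ["informational_ask", "complaint", "request_for_action_or_service"],
    "query_style_and_detail": [
        "short_keywords_minimal_details",
        "concise_moderate_detail",
        "verbose_detailed_request",
    ],
    "customer_persona": ["business_owner", "citizen"],
    "customer_age": ["senior_over_65", "youth_under_18", "adult"],
}

# Every possible combination of one value per dimension
all_combinations = [dict(zip(DIMENSIONS, values)) for values in product(*DIMENSIONS.values())]

print(len(all_combinations))  # 3 * 3 * 2 * 3 = 54
```

Even four small dimensions yield dozens of combinations, which is why we sample a balanced subset instead of testing every one.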

### Step 2: Define dimension values


In [None]:
TUPLES_GEN_PROMPT = """\
I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of inquiries citizens might submit. I have provided you with several dimensions that constitute the parts of
such queries, along with a list of possible values for each dimension.

## Instructions

Generate {{{num_tuples_to_generate}}} unique combinations of dimension values based on the dimensions provided below.
- Each combination should represent a different user scenario. 
- Ensure balanced coverage across all dimensions - don't over-represent any particular value or combination.
- Vary the query styles naturally.
- Attempt to make the dimension value combinations as realistic as possible.
- Never generate a tuple where the customer_persona is a business_owner and the customer_age is youth_under_18.

## Dimensions

inquiry_type:
- informational_ask
- complaint
- request_for_action_or_service

query_style_and_detail:
- concise_moderate_detail: [includes some details relevant to the `inquiry_topic`, e.g., location, names, etc.]
- verbose_detailed_request: [includes specific details relevant to the `inquiry_topic`, e.g., location, names, etc.]
- short_keywords_minimal_details

customer_persona: Who is making the request
- business_owner
- citizen

customer_age: The age of the person making the request
- senior_over_65
- youth_under_18
- adult

Generate {{{num_tuples_to_generate}}} unique dimension tuples following these patterns. Remember to maintain balanced diversity across all dimensions."""

# print(TUPLES_GEN_PROMPT.replace("{{{num_tuples_to_generate}}}", "10"))
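
The prompt asks the model never to pair `business_owner` with `youth_under_18`, but an LLM may not always comply. As a rough sketch, a small programmatic guard could catch violations before human review. The `is_valid_tuple` helper and the sample records are hypothetical and assume tuples are represented as plain dicts.

```python
def is_valid_tuple(dim_tuple: dict) -> bool:
    """Return False for combinations the prompt explicitly forbids."""
    is_minor_business_owner = (
        dim_tuple.get("customer_persona") == "business_owner"
        and dim_tuple.get("customer_age") == "youth_under_18"
    )
    return not is_minor_business_owner


# Hypothetical generated tuples; the second one violates the rule
candidates = [
    {"customer_persona": "citizen", "customer_age": "youth_under_18"},
    {"customer_persona": "business_owner", "customer_age": "youth_under_18"},
    {"customer_persona": "business_owner", "customer_age": "adult"},
]

valid = [t for t in candidates if is_valid_tuple(t)]
print(len(valid))  # 2
```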

Create a versioned prompt and save it to Braintrust.


In [None]:
tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")
try:
    tuples_gen_prompt.build(num_tuples_to_generate=20)
except Exception as e:
    bt_tuples_gen_prompt = bt_project.prompts.create(
        name="DimensionTuplesGenPrompt",
        slug="dimension-tuples-gen-prompt",
        description="Prompt for generating dimension tuples",
        model="claude-4-sonnet-20250514",
        messages=[{"role": "user", "content": TUPLES_GEN_PROMPT}],
        if_exists="replace",
    )

    bt_project.publish()
    tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")


With this in place, we can engage our domain experts in the prompt-building process via our [playgrounds](https://www.braintrust.dev/docs/guides/playground) and use our eval-specific AI assistant, [Loop](https://www.braintrust.dev/docs/guides/loop).

We'll prompt Loop to improve our "prompt" like this:

> "Help me improve this prompt to capture all the different dimensions a customer support inquiry may have for the city of Encinitas.
>
> The dimensions and their values should represent realistic aspects of a customer support request they may get"

We can then get our updated prompt, make changes to the structured output definition, and generate some data!


Fetch our improved prompt and update our Pydantic model:


In [None]:
tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")

In [None]:
# _p = tuples_gen_prompt.build(num_tuples_to_generate=20)
# print(_p["messages"][0]["content"])

In [None]:
class DimensionTuple(BaseModel):
    inquiry_type: str = Field(
        description="What kind of inquiry are they making (e.g. informational_ask, complaint, request_for_action_or_service, report_or_notification, permit_or_license_inquiry, payment_or_billing_question)"
    )
    query_style_and_detail: str = Field(
        description="The style and detail of the query (e.g. concise_moderate_detail, verbose_detailed_request, short_keywords_minimal_details)"
    )
    customer_persona: str = Field(
        description="Who is making the request (e.g. business_owner, citizen, property_owner, visitor_tourist, contractor_developer)",
    )
    customer_age: str = Field(
        description="The age of the person making the request (e.g. senior_over_65, youth_under_18, adult)",
    )
    urgency_level: str = Field(
        description="How urgent the customer perceives their issue to be (e.g. low_routine_inquiry, medium_timely_response_needed, high_urgent_immediate_attention)"
    )
    communication_tone: str = Field(
        description="The emotional tone of the inquiry (e.g. neutral_professional, frustrated_annoyed, polite_respectful, demanding_aggressive, confused_seeking_clarification)"
    )
    time_sensitivity: str = Field(
        description="When the issue occurred or needs resolution (e.g. ongoing_chronic_issue, recent_within_week, immediate_happening_now, future_planning_ahead)"
    )


class DimensionTuples(BaseModel):
    tuples: list[DimensionTuple]

### Step 3: Use an LLM to generate N unique combinations of these dimension values


In [None]:
def generate_synth_data_dimension_tuples(num_tuples: int = 20, model: str = "gpt-4o-mini", model_kwargs: dict = {}):
    """Generate a list of dimension tuples based on the provided prompt."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    prompt = tuples_gen_prompt.build(num_tuples_to_generate=num_tuples)

    rsp = oai_client.beta.chat.completions.parse(
        model=model,
        messages=prompt["messages"],
        response_format=DimensionTuples,
        **model_kwargs,
    )

    tuples_list: DimensionTuples = rsp.choices[0].message.parsed  # type: ignore

    unique_tuples = []
    seen = set()

    for tup in tuples_list.tuples:
        tuple_str = tup.model_dump_json()
        if tuple_str in seen:
            continue

        seen.add(tuple_str)
        unique_tuples.append(tup)

    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"synth_tuples_it_{timestamp}")
    for uniq_tup in unique_tuples:
        with bt_experiment.start_span(name="generate_dimension_tuples") as span:
            span.log(input=prompt["messages"], output=uniq_tup, metadata=dict(model=model, model_kwargs=model_kwargs))

    summary = bt_experiment.summarize(summarize_scores=False)
    return summary, rsp.choices[0].message.parsed

In [None]:
exp_summary, dim_tuples = generate_synth_data_dimension_tuples(20)

print(exp_summary)
print(dim_tuples)
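
Since the prompt asks for balanced coverage across dimensions, it can help to check that claim before moving on. The sketch below counts how often each value appears per dimension. `coverage_report` and the sample data are hypothetical; in the notebook you could presumably pass `[t.model_dump() for t in dim_tuples.tuples]` instead.

```python
from collections import Counter


def coverage_report(tuples: list[dict]) -> dict[str, Counter]:
    """Count how often each value appears within each dimension."""
    report: dict[str, Counter] = {}
    for dim_tuple in tuples:
        for dimension, value in dim_tuple.items():
            report.setdefault(dimension, Counter())[value] += 1
    return report


# Hypothetical sample of generated tuples
sample_tuples = [
    {"inquiry_type": "complaint", "customer_persona": "citizen"},
    {"inquiry_type": "complaint", "customer_persona": "business_owner"},
    {"inquiry_type": "informational_ask", "customer_persona": "citizen"},
]

for dimension, counts in coverage_report(sample_tuples).items():
    print(dimension, dict(counts))
```

A heavily skewed count for any one value is a signal to regenerate or to tighten the prompt.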

### Step 4: Remove invalid dimension tuples (human/SME review)

We'll do this in the Braintrust UI


### Step 5: Demonstrate an approach to building 20+ queries with the help of an SME


Working with an SME and ChatGPT, we were able to capture all of the contacts that customer support requests get routed to, along with a description of the types of inquiries each one handles. We even curated 5 example support requests for each of these contacts.


In [None]:
# Use a context manager so the file handle is closed after loading
with open("./data/dept_support_areas.json") as f:
    dept_support_areas = json.load(f)

triage_targets = list(set([route["slug"] for route in dept_support_areas]))

print(len(triage_targets))
triage_targets

In [None]:
for idx, triage_point in enumerate(dept_support_areas):
    if idx >= 2:
        break
    print(f"{triage_point['name']} ({triage_point['slug']})")
    print("- " + "\n- ".join(triage_point["support_areas"][:5]))
    print("-> " + "\n-> ".join(triage_point["examples"][:5]))
    print()

In [None]:
triage_examples = {}
for triage_contact in dept_support_areas:
    triage_target = triage_contact["slug"]
    if triage_target not in triage_examples:
        triage_examples[triage_target] = {"name": triage_contact["name"], "support_areas": triage_contact["support_areas"], "examples": []}
    triage_examples[triage_target]["examples"].extend(triage_contact["examples"])

print(f"Total triage targets: {len(triage_examples)}")
print(f"Total examples: {sum(len(data['examples']) for data in triage_examples.values())}")
print(f"First 5 examples for '{list(triage_examples.keys())[0]}': {triage_examples[list(triage_examples.keys())[0]]['examples'][:5]}")

### Step 6: Scale up to 100 or more tuples + queries using an LLM

We can use the "BTQL Sandbox" to help us construct a query to get the "good" dimension tuples from our annotation exercise, and then use that query to get those records to build some synthetic queries.


In [None]:
def get_valid_dimension_tuples():
    rows = []
    cursor = None
    while True:
        response = requests.post(
            "https://staging-api.braintrust.dev/btql",
            json={
                "query": dedent("""
                        select: output
                        from: experiment('36b63064-7f4e-46a4-9673-5533fa327dc5')
                        filter: scores."is_good" = 1
                """)
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        rows.extend(row["output"] for row in data)

        # Keep paging until the API stops returning a cursor (or rows)
        cursor = response_json.get("cursor")
        if not cursor or not data:
            break

    return rows


valid_dim_tuples = get_valid_dimension_tuples()

# print(len(valid_dim_tuples))
# valid_dim_tuples[:5]
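
The cursor-based paging pattern is easier to see in isolation. Here is a minimal sketch with the HTTP call swapped out for a stand-in. `fetch_all_rows`, `fake_fetch_page`, and the page contents are all hypothetical, and the sketch assumes each response carries a `data` list and an optional `cursor`, as the code above does.

```python
from typing import Callable, Optional


def fetch_all_rows(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Collect rows from every page, following the cursor until it runs out."""
    rows = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        data = page.get("data", [])
        rows.extend(data)

        cursor = page.get("cursor")
        if not cursor or not data:
            break
    return rows


# Stand-in for the HTTP request: two pages keyed by the cursor that fetches them
_PAGES = {
    None: {"data": [{"output": "a"}, {"output": "b"}], "cursor": "page-2"},
    "page-2": {"data": [{"output": "c"}], "cursor": None},
}


def fake_fetch_page(cursor: Optional[str]) -> dict:
    return _PAGES[cursor]


all_rows = fetch_all_rows(fake_fetch_page)
print([row["output"] for row in all_rows])  # ['a', 'b', 'c']
```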

Create a versioned prompt and save it to Braintrust.


In [None]:
SYNTH_DATA_GEN_PROMPT = """\
I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of realistic citizen inquiries that would appropriately be routed to this contact:

Contact: {{{contact_name}}}
Support Areas: {{{support_areas}}}

## Objective

Produce {{{num_queries_to_generate}}} additional natural language customer support requests using a sample of actual customer support requests 
below, but augmented to have one of the characteristics listed below:

==== Inquiry Characteristics =====
{{{dimension_tuple_json}}}

===== Example Inquiries =====
{{{example_inquiries}}}

## Instructions
To produce each inquiry, follow these steps:
1. Choose a random example from the list of example inquiries above.
2. Choose a random record from the list of inquiry characteristics above.
3. Augment the example inquiry to have the chosen characteristic of that record.
4. Return the augmented inquiry.

The queries should:
1. Sound like real users asking for assistance
2. Naturally incorporate all the dimension values
3. Vary in style and detail level
4. Be realistic and practical
5. Change any named entities (locations, addresses, street names, person names, etc.) to diversify the content
6. Include natural variations in typing style, such as:
   - Some queries in all lowercase
   - Some with random capitalization
   - Some with common typos
   - Some with missing punctuation
   - Some with extra spaces or missing spaces
   - Some with emojis or text speak

Here are examples of realistic query variations for a request to get a pothole fixed from 
an adult, business_owner, concise_moderate_detail:

Proper formatting:
- "There is a huge pothole on the corner of 123 Main St. It's been there for weeks affecting customers trying to park nearby"
- "Can someone please fix the pothole on 123 Main St."

All lowercase:
- "need someone to fix the pothole on 123 main st."
- "can a crew come out and fix the pothole on the corner of 123 main st. and grand ave."

Random caps:
- "NEED someone to fix the pothole on 123 main st. ASAP"
- "can a crew come out and fix the pothole on the corner of 123 main st. and grand ave??? It needs to happen NOW! Like today!"

Common typos:
- "need someone to fx the pothol on 123 main st. asap pleze!!!"
- "Can a crew come out and fx the pothole on the corner of 123 main st. & grand ave??? thx much!"

Missing punctuation:
- "need someone to fx the pothole on 123 main st asap plz"
- "Can a crew come out and fix the pothole on the corner of 123 main st & grand ave ... thx much"

With emojis/text speak:
- "the pothole needs to be fixed now on 123 main st. and grand ave! 🥗"
- "pls help get the pothole on 123 main st. & grand ave fixed thx"

Generate {{{num_queries_to_generate}}} unique queries, varying the text style naturally."""


# print(
#     (
#         SYNTH_DATA_GEN_PROMPT.replace("{{{num_queries_to_generate}}}", "4")
#         .replace("{{{dimension_tuple_json}}}", json.dumps(valid_dim_tuples[:4], indent=2))
#         .replace("{{{example_inquiries}}}", "- " + "\n- ".join(triage_examples["public_works"]["examples"][:5]))
#         .replace("{{{contact_name}}}", triage_examples["public_works"]["name"])
#         .replace("{{{support_areas}}}", ", ".join(triage_examples["public_works"]["support_areas"]))
#     )
# )

In [None]:
synth_query_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="synth-query-gen-prompt")
try:
    synth_query_gen_prompt.prompt
except Exception as e:
    bt_synth_query_gen_prompt = bt_project.prompts.create(
        name="SynthQueryGenPrompt",
        slug="synth-query-gen-prompt",
        description="Prompt for generating synthetic queries",
        model="claude-4-sonnet-20250514",
        messages=[{"role": "user", "content": SYNTH_DATA_GEN_PROMPT}],
        if_exists="replace",
    )

    bt_project.publish()
    synth_query_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="synth-query-gen-prompt")


In [None]:
# _p = synth_query_gen_prompt.build(
#     num_queries_to_generate=4,
#     dimension_tuple_json=json.dumps(valid_dim_tuples[:4], indent=2),
#     example_inquiries="- " + "\n- ".join(triage_examples["public_works"]["examples"][:5]),
#     contact_name=triage_examples["public_works"]["name"],
#     support_areas=", ".join(triage_examples["public_works"]["support_areas"]),
# )

# _p
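
To make the `{{{placeholder}}}` substitution concrete, here is a rough stand-in for what template rendering does conceptually. `render_template` is a hypothetical helper written for illustration only; in the notebook, the prompt's `build()` method handles the real rendering.

```python
import re


def render_template(template: str, **variables: object) -> str:
    """Replace each {{{name}}} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Missing template variable: {name}")
        return str(variables[name])

    return re.sub(r"\{\{\{\s*(\w+)\s*\}\}\}", substitute, template)


rendered = render_template(
    "Contact: {{{contact_name}}}\nSupport Areas: {{{support_areas}}}",
    contact_name="Public Works",
    support_areas="potholes, street lights",
)
print(rendered)
```

Raising on a missing variable, rather than leaving the placeholder in place, surfaces template mistakes before they reach the model.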

Generate synthetic data we can later run through our customer support bot:


In [None]:
class QueryList(BaseModel):
    queries: list[str]

In [None]:
def generate_synth_queries(triage_slug: str, num_queries: int = 10, model: str = "gpt-4o-mini", model_kwargs: dict = {}) -> dict:
    random_dim_tuples = random.sample(valid_dim_tuples, min(num_queries, len(valid_dim_tuples)))

    prompt = synth_query_gen_prompt.build(
        num_queries_to_generate=num_queries,
        dimension_tuple_json=json.dumps(random_dim_tuples, indent=2),
        example_inquiries="- " + "\n- ".join(triage_examples[triage_slug]["examples"][:5]),
        contact_name=triage_examples[triage_slug]["name"],
        support_areas=", ".join(triage_examples[triage_slug]["support_areas"]),
    )

    rsp = oai_client.beta.chat.completions.parse(
        model=model,
        messages=prompt["messages"],
        response_format=QueryList,
        **model_kwargs,
    )

    query_list: QueryList = rsp.choices[0].message.parsed  # type: ignore

    return {
        "prompt": prompt["messages"],
        "triage_slug": triage_slug,
        "sampled_dim_tuples": random_dim_tuples,
        "handcoded_queries": triage_examples[triage_slug]["examples"],
        "synth_queries": query_list.queries,
    }

In [None]:
# rsp = generate_synth_queries("public_works", num_queries=10)

# for q in rsp["handcoded_queries"]:
#     print(q)
# print("-" * 100)
# for q in rsp["synth_queries"]:
#     print(q)

In [None]:
def generate_queries_parallel(triage_targets: list[str], num_queries: int = 10, model: str = "gpt-4o-mini", model_kwargs: dict = {}):
    """Generate queries in parallel for all triage targets."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")

    print(f"Generating {num_queries} queries each for {len(triage_targets)} triage targets...")
    # Run in parallel
    worker = partial(generate_synth_queries, num_queries=num_queries, model=model, model_kwargs=model_kwargs)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        responses = list(executor.map(worker, triage_targets))

    # Add query items
    all_queries = []
    for response in responses:
        prompt = response["prompt"]
        triage_slug = response["triage_slug"]

        queries = [{"prompt": prompt, "triage_slug": triage_slug, "query": q, "query_source": "synth"} for q in response["synth_queries"]]
        queries.extend(
            [{"prompt": prompt, "triage_slug": triage_slug, "query": q, "query_source": "handcoded"} for q in response["handcoded_queries"]]
        )

        all_queries.extend(queries)

    # Add to experiment
    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"add_queries_it_{timestamp}")
    query_id = 1

    for query_item in all_queries:
        qid = f"{timestamp}_{query_id:03d}"
        query_id += 1

        with bt_experiment.start_span(name="add_query") as span:
            span.log(
                input=query_item["prompt"],
                output=query_item["query"],
                metadata={
                    "id": qid,
                    "triage_slug": query_item["triage_slug"],
                    "query_source": query_item["query_source"],
                    "model": model,
                    "model_kwargs": model_kwargs,
                },
            )

    summary = bt_experiment.summarize(summarize_scores=False)
    return summary, all_queries  # return every query, not just the last response's batch

In [None]:
summary, queries = generate_queries_parallel(triage_targets, num_queries=10, model="gpt-4o-mini")
# summary, queries = generate_queries_parallel(triage_targets[:2], num_queries=10, model="gpt-4o-mini")

print(len(queries))
print(summary)

### Step 7: Remove invalid queries (human/SME review)

We'll do this in the Braintrust UI


## Tasks

Let's use our "good" synthetic and hand-coded queries to build a real-world customer support bot with two main components:

1. A web search retrieval task to gather context for answering and routing requests.
2. A retrieval synthesis task that takes the output from the retrievals to determine where to route support, what information is needed to resolve the request, and to generate some form of response for the user.

To achieve this, we'll define some tasks and [log](https://www.braintrust.dev/docs/guides/logs) them to Braintrust.


In [None]:
bt_logger = bt.init_logger(project=BT_PROJECT_NAME)

with open("./data/dept_support_areas.json") as f:
    dept_support_areas = json.load(f)
triage_targets = list(set([route["slug"] for route in dept_support_areas]))

Let's use our old friend, the "BTQL Sandbox", to build a query that fetches only the queries marked as good in the human review above.


In [None]:
def get_test_data():
    rows = []
    cursor = None
    while True:
        response = requests.post(
            "https://staging-api.braintrust.dev/btql",
            json={
                "query": dedent("""
                        select: output, metadata
                        from: experiment('b715cf02-86f6-4471-bc10-cef617e128ec')
                        filter: scores."is_good" = 1
                """)
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        rows.extend({"input": row["output"], "metadata": row["metadata"]} for row in data)

        # Keep paging until the API stops returning a cursor (or rows)
        cursor = response_json.get("cursor")
        if not cursor or not data:
            break

    return rows


test_data = get_test_data()

print(len(test_data))
test_data[:5]

### Retrieval

The first sub-task we need is a retrieval task to gather the appropriate context for routing and responding to the customer support request.


#### Prompt

We will create a versioned prompt as we did above.


In [None]:
WEB_SEARCH_SYSTEM_PROMPT = dedent("""\
    Your name is Sunny and you are in charge of answering questions about the city of {{{city}}} and routing requests to the correct departments for triage.
    You are given access to a web search retrieval tool to use in an attempt to satisfy the user's inquiry.
    
    ## Instructions
    - Always include any relevant links in your response! Never respond with a generic link placeholder like '(Note: The specific registration link would be on the city's website)'     
    - If you cannot find the information you need or fulfill the request, make sure to tell the user that in the final answer
""")

In [None]:
bt_web_search_prompt = bt_project.prompts.create(
    name="WebSearchPrompt",
    slug="web-search-prompt",
    description="Prompt for retrieving context from the web",
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": WEB_SEARCH_SYSTEM_PROMPT},
        {"role": "user", "content": "{{{query}}}"},
    ],
    if_exists="replace",
)

bt_project.publish()

In [None]:
web_search_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="web-search-prompt")
web_search_prompt.build(city="Encinitas", query="What is the weather in Encinitas?")

#### Task


We'll use Anthropic's built-in [web search tool](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/web-search-tool) for our initial build.

We'll configure it to only use this particular city's domain and also assume the user's location to be within the city.


In [None]:
# limit the number of searches per request
N_SEARCHES = 1

# To limit the scope of the search to specific domains
RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS = [
    "encinitasca.gov",
    # "anc.apm.activecommunities.com/encinitasparksandrec",  # P&R activity registration
    # "/issuu.com/encinitasca.gov/",  # P&R activity guides
]

# To localize search results based on user's location
USER_LOCATION = {
    "type": "approximate",
    "city": "Encinitas",
    "region": "California",
    "country": "US",
    "timezone": "America/Los_Angeles",
}
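
The web search tool applies the domain allow-list on its side, but it can still be useful in evals to confirm that cited URLs actually fall within the allowed domains. Below is a minimal sketch of such a check. `is_allowed_url` and `ALLOWED_DOMAINS` are hypothetical names, and the subdomain-matching rule is an assumption rather than a description of how the tool itself matches domains.

```python
from urllib.parse import urlparse

# Illustrative copy of the allow-list used in this section
ALLOWED_DOMAINS = ["encinitasca.gov"]


def is_allowed_url(url: str, allowed_domains: list = ALLOWED_DOMAINS) -> bool:
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed_domains)


print(is_allowed_url("https://www.encinitasca.gov/parks"))      # True
print(is_allowed_url("https://example.com/encinitasca.gov"))    # False
```

Matching on the parsed hostname, rather than a substring of the full URL, avoids false positives from paths or look-alike domains.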

In [None]:
@bt.traced()
def ac_web_search(query: str, city: str, model="claude-sonnet-4-20250514", model_kwargs={}):
    """Search the web for the given query."""

    prompt = web_search_prompt.build(city=city, query=query)

    rsp = wrapped_ac_client.messages.create(
        model=model,
        system=prompt["messages"].pop(0)["content"],
        messages=prompt["messages"],
        tools=[
            {
                "type": "web_search_20250305",
                "name": "web_search",
                "max_uses": N_SEARCHES,
                "allowed_domains": RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS,
                "user_location": USER_LOCATION,
            }  # type: ignore
        ],
        **model_kwargs,
    )
    return rsp


In [None]:
_example = test_data[0]
# _example

In [None]:
# with bt_logger.start_span(name="web_search") as trace:
rsp = ac_web_search(
    _example["input"],
    "Encinitas",
    model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
)

rsp
# rsp.model_dump()

#### Parse utility


Let's create a utility function to parse the web search tool's response into something more digestible. This will allow us to control exactly what bits we want to pass in as context to our synthesis task.


In [None]:
def parse_ac_web_search_response(message):
    """
    Parse a web search response message to extract searches, results, citations, and final answers.

    Args:
        message: The message object from Anthropic's web search tool

    Returns:
        dict: Contains 'searches', 'results', 'citations', and 'final_answer'
    """

    searches = []
    results = []
    citations = []
    final_answer_parts = []

    # Iterate through all content blocks
    for block in message.content:
        # Extract search queries
        if hasattr(block, "type") and block.type == "server_tool_use":
            if hasattr(block, "name") and block.name == "web_search":
                search_query = block.input.get("query", "No query found")
                searches.append({"id": block.id, "query": search_query, "input": block.input})

        # Extract search results
        elif hasattr(block, "type") and block.type == "web_search_tool_result":
            for result_block in block.content:
                if hasattr(result_block, "type") and result_block.type == "web_search_result":
                    results.append(
                        {
                            "title": result_block.title,
                            "url": result_block.url,
                            "encrypted_content": result_block.encrypted_content,
                            "page_age": getattr(result_block, "page_age", None),
                        }
                    )

        # Extract final answer text blocks (with or without citations)
        elif hasattr(block, "type") and block.type == "text":
            text_content = block.text
            block_citations = getattr(block, "citations", None)

            # Add to final answer
            final_answer_parts.append(text_content)

            # If this text block has citations, extract them
            if block_citations:
                for citation in block_citations:
                    citations.append(
                        {"cited_text": citation.cited_text, "title": citation.title, "url": citation.url, "type": citation.type}
                    )

    return {"searches": searches, "results": results, "citations": citations, "final_answer": "".join(final_answer_parts)}


In [None]:
# Test the parsing function with the existing response
parsed = parse_ac_web_search_response(rsp)

print("=== SEARCHES ===")
for i, search in enumerate(parsed["searches"], 1):
    print(f"{i}. Query: {search['query']}")
    print(f"   ID: {search['id']}")
    print()

print("=== RESULTS ===")
for i, result in enumerate(parsed["results"], 1):
    print(f"{i}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Page Age: {result['page_age']}")
    print()

print("=== CITATIONS ===")
for i, citation in enumerate(parsed["citations"], 1):
    print(f"{i}. Cited Text: {citation['cited_text'][:100]}...")
    print(f"   Title: {citation['title']}")
    print(f"   URL: {citation['url']}")
    print()

print("=== FINAL ANSWER ===")
print(parsed["final_answer"])  # [:500] + "..." if len(parsed["final_answer"]) > 500 else parsed["final_answer"])


#### Task (v2)

After review, we only need to pass the `final_answer` (plus its supporting citations) as context to formulate a proper response. We'll update our task function accordingly:


In [None]:
@bt.traced()
def ac_web_search(query: str, city: str, model="claude-sonnet-4-20250514", model_kwargs={}):
    """Search the web for the given query."""

    prompt = web_search_prompt.build(city=city, query=query)

    rsp = wrapped_ac_client.messages.create(
        model=model,
        system=prompt["messages"].pop(0)["content"],
        messages=prompt["messages"],
        **model_kwargs,
        tools=[
            {
                "type": "web_search_20250305",
                "name": "web_search",
                "max_uses": N_SEARCHES,
                "allowed_domains": RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS,
                "user_location": USER_LOCATION,
            }  # type: ignore
        ],
    )

    # return rsp
    parsed = parse_ac_web_search_response(rsp)
    return {"citations": parsed["citations"], "final_answer": parsed["final_answer"]}

In [None]:
rsp = ac_web_search(
    _example["input"],
    "Encinitas",
    model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
)

rsp
# rsp.model_dump()
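
Because the task now returns a dict, one option is to flatten it into plain text before handing it to a downstream prompt as context. The sketch below is one possible way to do that. `format_context` is a hypothetical helper, and it assumes the `citations` / `final_answer` shape returned above with citations represented as dicts.

```python
def format_context(retrieval: dict) -> str:
    """Flatten the retrieval output into a text block: answer first, then unique sources."""
    lines = [retrieval.get("final_answer", "").strip()]

    seen_urls = set()
    sources = []
    for citation in retrieval.get("citations", []):
        url = citation.get("url")
        # Skip citations without a URL and de-duplicate repeated sources
        if url and url not in seen_urls:
            seen_urls.add(url)
            sources.append(f"- {citation.get('title') or url}: {url}")

    if sources:
        lines.append("Sources:")
        lines.extend(sources)
    return "\n".join(lines)


# Hypothetical retrieval output with a duplicated citation
example_retrieval = {
    "final_answer": "Trash pickup is weekly.",
    "citations": [
        {"title": "Trash Services", "url": "https://www.encinitasca.gov/trash"},
        {"title": "Trash Services", "url": "https://www.encinitasca.gov/trash"},
    ],
}

print(format_context(example_retrieval))
```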

### Retrieval synthesis

The second sub-task we need is a synthesis task that takes the retrieved context and attempts to provide an appropriate response, route the request to the correct contact, and finally identify any other information that would be needed to properly respond.


#### Prompt

We will create a versioned prompt as we have for the other tasks above.


In [None]:
for dept in dept_support_areas[:2]:
    print(dept["name"])
    print("- " + "\n- ".join(dept["support_areas"]))
    print("-> " + "\n-> ".join(dept["examples"]))
    print()

In [None]:
def build_triage_route_details() -> str:
    output = "=" * 10 + "\n"
    for dept in dept_support_areas:
        output += f"Triage Target: {dept['slug']}\n"
        output += f"Triage Name: {dept['name']}\n"
        output += f"Support Areas: {','.join(dept['support_areas'])}"
        output += "\n" + "=" * 10 + "\n"

    return output.strip()


In [None]:
print(build_triage_route_details()[:550])

In [None]:
SYNTHESIZE_RETRIEVAL_SYSTEM_PROMPT = f"""\
Your name is Sunny and you are an expert in providing responses to customer support inquiries for the city of {{{{city}}}} based on the original user inquiry,
triage route descriptions, and the context provided.

## Instructions
- Provide a final answer to the user inquiry ONLY if it is answerable from the "Context" provided below.
- Using the user's inquiry and the "Context" provided below, choose the most specific triage target to route the user to based on the "Triage Options" defined below.
- If provided, make sure you include any relevant links, phone numbers, emails, or other information from the "Context" and how the user can use them to get the information they need.
- Use markdown formatting for the full response.
- Always direct complaints about personnel to "human_resources"

## Triage Options    
{build_triage_route_details()}

## Context
{{{{context}}}}"""
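
A note on the braces, since they are easy to misread: the prompt is a Python f-string, so `{build_triage_route_details()}` is filled in immediately, while each doubled brace collapses to a single literal brace. `{{{{city}}}}` therefore survives as the template variable `{{city}}`, to be filled in later when the prompt is built. A minimal sketch of that behavior, using a hypothetical `triage_options` string (with a made-up slug) in place of the real helper:

```python
# Hypothetical stand-in for the output of build_triage_route_details()
triage_options = "Triage Target: public_works"

# In an f-string, "{{" and "}}" each render as one literal brace,
# so four braces leave a two-brace template placeholder behind.
template = f"""\
City: {{{{city}}}}
Options: {triage_options}"""

print(template)
```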

In [None]:
bt_retrieval_synthesis_prompt = bt_project.prompts.create(
    name="WebSearchRetrievalSynthesis",
    slug="web-search-retrieval-synthesis",
    description="Prompt for synthesizing the retrieval results into a final answer",
    model="claude-4-sonnet-20250514",
    messages=[
        {"role": "system", "content": SYNTHESIZE_RETRIEVAL_SYSTEM_PROMPT},
        {"role": "user", "content": "User Inquiry: {{{user_inquiry}}}"},
    ],
    if_exists="replace",
)

bt_project.publish()

In [None]:
retrieval_synth_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="web-search-retrieval-synthesis")

In [None]:
# _p = retrieval_synth_prompt.build(city="Encinitas", user_inquiry=_example["input"], context=rsp)
# print(_p["messages"][0]["content"])

#### Structured output definition

We structure the output as a Pydantic class so that each aspect of the pipeline's response (the routing decision, the answer, and the follow-up flags) can be accessed and evaluated separately.


In [None]:
class CustomerSupportResponse(BaseModel):
    """The response from the customer support agent."""

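    def convert_str_to_triage_target(v: str) -> str:  # type: ignore
        """Coerce a raw string to a valid triage target, fuzzy-matching near misses."""
        if v in triage_targets:
            return v
        else:
            try:
                # Compare case-insensitively, then map the match back to its original casing
                upper_to_target = {s.upper(): s for s in triage_targets}
                match, score = fuzzy_process.extractOne(v.upper(), list(upper_to_target))  # type: ignore
                return upper_to_target[match] if score >= 80 else "General"
            except ValueError:
                return "General"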

    chain_of_thought: str = Field(
        ...,
        description="Explain your decision about whether this question is answerable and why you chose to route it to the triage contact you chose",
    )
    is_answerable_from_context: bool = Field(
        ..., description="Whether the given response can answer the question from the context provided"
    )
    requires_follow_up: bool = Field(
        ...,
        description="Whether the inquiry requires follow-up from the triage contact",
    )
    requires_location: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide their location to be resolved by the triage contact",
    )
    requires_photo_or_video: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide a photo or video to be resolved by the triage contact",
    )
    requires_user_contact_info: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide their contact information to be resolved by the triage contact",
    )
    full_response: str = Field(
        ...,
        description="The detailed answer to the question based on the context provided",
    )
    final_answer: str = Field(
        ...,
        description="A succinct and concise answer to the question based on the context provided",
    )
    route_to: Annotated[Literal[*triage_targets], BeforeValidator(convert_str_to_triage_target)]  # type: ignore
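
The `BeforeValidator` on `route_to` runs before the `Literal` check, which gives us a chance to repair near-miss strings from the model. As a rough, standalone illustration of that coercion idea, here is a sketch that uses the standard library's `difflib` in place of `thefuzz` (its similarity scores are not the same as `thefuzz`'s, so the cutoff is only indicative). The slugs in `EXAMPLE_TARGETS`, the `coerce_triage_target` name, and the lowercase `"general"` fallback are all assumptions for illustration:

```python
from difflib import get_close_matches

# Hypothetical slugs; the notebook's real triage_targets list is defined elsewhere
EXAMPLE_TARGETS = ["public_works", "human_resources", "parks_and_recreation", "general"]


def coerce_triage_target(value: str, targets: list[str], fallback: str = "general", cutoff: float = 0.8) -> str:
    """Return value if valid, else the closest target above cutoff, else fallback."""
    if value in targets:
        return value
    # Match case-insensitively but hand back the target in its original casing
    by_normalized = {t.lower(): t for t in targets}
    matches = get_close_matches(value.strip().lower(), list(by_normalized), n=1, cutoff=cutoff)
    return by_normalized[matches[0]] if matches else fallback


print(coerce_triage_target("Human_Resources", EXAMPLE_TARGETS))
print(coerce_triage_target("public works", EXAMPLE_TARGETS))
print(coerce_triage_target("zzz", EXAMPLE_TARGETS))
```

Returning the original-cased target matters because the coerced value still has to pass the `Literal` validation afterwards.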


#### Task

We set things up in this task to allow teams to use either OpenAI or Anthropic models.


In [None]:
class MODEL_VENDOR(str, enum.Enum):
    ANTHROPIC = "anthropic"
    OPENAI = "openai"

In [None]:
@bt.traced()
def synthesize_retrieval(
    user_inquiry: str,
    retrieval_response: str,
    city: str,
    vendor: MODEL_VENDOR = MODEL_VENDOR.ANTHROPIC,
    model="claude-sonnet-4-20250514",
    model_kwargs={},
) -> CustomerSupportResponse:
    """Synthesize the retrieval results into a final answer."""

    prompt = retrieval_synth_prompt.build(city=city, user_inquiry=user_inquiry, context=retrieval_response)

    rsp = None
    if vendor == MODEL_VENDOR.ANTHROPIC:
        tools = [
            {
                "name": "customer_support_response",
                "description": "Build CustomerSupportResponse object",
                "input_schema": CustomerSupportResponse.model_json_schema(),
            }
        ]
        rsp = wrapped_ac_client.messages.create(
            model=model,
            system=prompt["messages"].pop(0)["content"],  # type: ignore
            messages=prompt["messages"],
            tools=tools,
            tool_choice={"type": "tool", "name": "customer_support_response"},
            **model_kwargs,
        )
        return CustomerSupportResponse(**rsp.content[0].input)

    elif vendor == MODEL_VENDOR.OPENAI:
        result = wrapped_oai_client.beta.chat.completions.parse(
            model=model,
            messages=prompt["messages"],
            response_format=CustomerSupportResponse,
            **model_kwargs,
        )
        return result.choices[0].message.parsed

    else:
        raise ValueError(f"Invalid vendor: {vendor}")
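
The Anthropic branch reads `rsp.content[0]`, which relies on the forced tool call being the first content block. A slightly more defensive variant searches for the `tool_use` block by type and name instead. The sketch below uses `SimpleNamespace` objects as stand-ins for real response blocks, and the helper name `first_tool_input` is hypothetical:

```python
from types import SimpleNamespace


def first_tool_input(content_blocks, tool_name: str) -> dict:
    """Return the input of the first tool_use block matching tool_name."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and getattr(block, "name", None) == tool_name:
            return block.input
    raise ValueError(f"No tool_use block named {tool_name!r} in response")


# Stand-ins for response content blocks; real ones come from the Anthropic SDK
fake_content = [
    SimpleNamespace(type="text", text="Routing the request..."),
    SimpleNamespace(type="tool_use", name="customer_support_response", input={"route_to": "public_works"}),
]

tool_input = first_tool_input(fake_content, "customer_support_response")
print(tool_input)
```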


In [None]:
# parsed["final_answer"]

Let's try it with Anthropic:


In [None]:
rsp = synthesize_retrieval(
    user_inquiry=_example["input"],
    retrieval_response=parsed["final_answer"],
    city="Encinitas",
    model_kwargs={"max_tokens": 1024},
)
rsp

In [None]:
rsp.model_dump()

Let's try it with OpenAI:


In [None]:
rsp = synthesize_retrieval(
    user_inquiry=_example["input"],
    retrieval_response=parsed["final_answer"],
    city="Encinitas",
    vendor=MODEL_VENDOR.OPENAI,
    model="gpt-4o-mini",
    model_kwargs={},
)
rsp

In [None]:
rsp.model_dump()

### Customer support task

Let's put our entire workflow together under the `ask_customer_support` task.


In [None]:
@bt.traced()
def ask_customer_support(
    user_inquiry: str,
    city: str,
    search_model: str = "claude-sonnet-4-20250514",
    synthesis_model: str = "gpt-4o",
    synthesis_vendor: MODEL_VENDOR = MODEL_VENDOR.OPENAI,
    search_model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    synthesis_model_kwargs: dict = {},
) -> CustomerSupportResponse:
    """Ask the customer support agent a question."""

    search_results = ac_web_search(user_inquiry, city, model=search_model, model_kwargs=search_model_kwargs)
    customer_support_resp = synthesize_retrieval(
        user_inquiry=user_inquiry,
        retrieval_response=search_results["final_answer"],
        city=city,
        vendor=synthesis_vendor,
        model=synthesis_model,
        model_kwargs=synthesis_model_kwargs,
    )

    return customer_support_resp

Let's test how this works on a few inquiries from our test data.


In [None]:
len(test_data)

In [None]:
def process_single_request(example):
    """Process a single customer support request."""
    city = "Encinitas"

    with bt.start_span(name="request_support_task") as span:
        span.log(input=example["input"])
        customer_support_resp = ask_customer_support(
            user_inquiry=example["input"],
            city=city,
            search_model="claude-sonnet-4-20250514",
            synthesis_model="gpt-4o-mini",
            synthesis_vendor=MODEL_VENDOR.OPENAI,
            search_model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
            synthesis_model_kwargs={},
        )

        metadata = example.get("metadata", {})
        filtered_metadata = {
            "id": metadata.get("id"),
            "query_source": metadata.get("query_source"),
            "triage_slug": metadata.get("triage_slug"),
        }

        span.log(output=customer_support_resp, metadata=filtered_metadata)
        return customer_support_resp

In [None]:
# Run a small sample of the test data in parallel
sample = test_data[:5]
print(f"Processing {len(sample)} customer support requests in parallel...")

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    results = list(executor.map(process_single_request, sample))

print(f"Completed processing {len(results)} requests!")
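
One caveat with `executor.map`: it re-raises the first exception when the results are consumed, so a single failing request can abort the whole batch. A minimal sketch of one way around that is to capture errors per item. The names `run_batch` and `fake_request` are hypothetical, with `fake_request` standing in for `process_single_request`:

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(fn, items, max_workers: int = 4):
    """Apply fn to each item in parallel, returning (item, result, error) tuples in input order."""

    def safe_call(item):
        try:
            return (item, fn(item), None)
        except Exception as exc:  # keep going so one bad request doesn't sink the batch
            return (item, None, exc)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(safe_call, items))


# Hypothetical stand-in for process_single_request
def fake_request(example):
    if not example["input"]:
        raise ValueError("empty inquiry")
    return example["input"].upper()


outcomes = run_batch(fake_request, [{"input": "pothole on main st"}, {"input": ""}])
for item, result, error in outcomes:
    print(item["input"] or "<empty>", "->", result if error is None else f"failed: {error}")
```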

Once we have our traces logged, we can move them into a "golden [dataset](https://www.braintrust.dev/docs/guides/datasets)" to build evals on.

We'll do this in the Braintrust UI and then use this dataset in our initial experiments.


In [None]:
ds = bt.init_dataset(project=BT_PROJECT_NAME, name="initial_error_analysis")
rows = list(ds)[:5]

In [None]:
row = rows[0]
row

## Scoring

We now have our `task` and our `data`. All we need to start building evals is a way to score how well the task did on each input in the dataset.

Braintrust comes with several helpful scorers out of the box via our Autoevals library. Fun fact: you can use them outside of Braintrust as well. In addition to these scoring functions, we can define our own, as demonstrated below, where we configure a reference-based scorer to see how well the task predicts the right contact to route the inquiry to (if available).


In [None]:
def is_correct_triage_contact(
    input: str | None = None,
    expected: dict | None = None,
    output: CustomerSupportResponse | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> int | None:
    if output and metadata and metadata.get("triage_slug"):
        return int(output.route_to == metadata.get("triage_slug"))

    if output and expected:
        return int(expected["route_to"] == output.route_to)

    return None

In [None]:
is_correct_triage_contact(
    input=row.get("input", ""),
    output=CustomerSupportResponse.model_validate(row.get("expected", {})),
    expected=row.get("expected", None),
    metadata=row.get("metadata", None),  # type: ignore
)
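
To make the scorer's precedence rules concrete, here is a standalone sketch that exercises the three paths: a reference in `metadata`, a reference in `expected`, and no reference at all. `FakeResponse` is a hypothetical stand-in for `CustomerSupportResponse`, and the slugs are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for CustomerSupportResponse; only route_to matters to the scorer."""

    route_to: str


def is_correct_triage_contact(output=None, expected=None, metadata=None, **kwargs):
    # Same logic as the scorer above, repeated so this sketch runs on its own
    if output and metadata and metadata.get("triage_slug"):
        return int(output.route_to == metadata.get("triage_slug"))
    if output and expected:
        return int(expected["route_to"] == output.route_to)
    return None


out = FakeResponse(route_to="public_works")
from_metadata = is_correct_triage_contact(output=out, metadata={"triage_slug": "public_works"})
from_expected = is_correct_triage_contact(output=out, expected={"route_to": "human_resources"})
no_reference = is_correct_triage_contact(output=out)
print(from_metadata, from_expected, no_reference)  # prints: 1 0 None
```

Note that when both are present, `metadata["triage_slug"]` takes priority over `expected["route_to"]`.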

## Evals

We now have the three things required to run an offline eval, a.k.a. an [experiment](https://www.braintrust.dev/docs/guides/experiments):

**Data**: Hand-coded and synthetic customer support requests/inquiries \
**Task**: A customer support task that uses two subtasks to route and attempt to answer the request \
**Scorers**: An initial reference-based scorer that scores the response on whether it routed the request to the right contact


In [None]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")

exp_metadata = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
    "synthesis_vendor": MODEL_VENDOR.OPENAI,
    "search_model_kwargs": {"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    "synthesis_model_kwargs": {"temperature": 0.75},
}
await bt.EvalAsync(
    name=BT_PROJECT_NAME,
    experiment_name=f"initial_error_analysis_{timestamp}",
    data=rows,  # type: ignore
    task=partial(ask_customer_support, **exp_metadata),
    metadata=exp_metadata,
    scores=[is_correct_triage_contact],  # type: ignore
)

We can now start iterating on improving the routing.

This can be done in various ways. For example:

1. We can change the models and/or generation kwargs.
2. We can improve the prompts.
3. We can make corrections to our dataset.
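
For the first option, one way to keep iterations organized is to enumerate the variants up front and run one experiment per variant. The sketch below only builds the metadata dictionaries. The keys mirror `exp_metadata`, while the specific grid (including the `0.0` temperature) is an assumption for illustration:

```python
from itertools import product

# Settings held fixed across variants
base = {"city": "Encinitas", "search_model": "claude-sonnet-4-20250514"}

# Hypothetical grid of synthesis settings to compare
synthesis_models = ["gpt-4o-mini", "gpt-4o"]
temperatures = [0.0, 0.75]

variants = [
    {**base, "synthesis_model": model, "synthesis_model_kwargs": {"temperature": temperature}}
    for model, temperature in product(synthesis_models, temperatures)
]

for variant in variants:
    print(variant["synthesis_model"], variant["synthesis_model_kwargs"])
```

Each dictionary could then be passed both to the task (via `partial`) and as experiment metadata, so every experiment records exactly what was changed.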


In [None]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
what_changed = "fixed routing error in dataset"

exp_metadata = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
    "synthesis_vendor": MODEL_VENDOR.OPENAI,
    "search_model_kwargs": {"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    "synthesis_model_kwargs": {"temperature": 0.75},
}
await bt.EvalAsync(
    name=BT_PROJECT_NAME,
    experiment_name=f"initial_error_analysis_{timestamp}",
    data=rows,  # type: ignore
    task=partial(ask_customer_support, **exp_metadata),
    metadata={**exp_metadata, "what_changed": what_changed},
    scores=[is_correct_triage_contact],  # type: ignore
)

## Next Steps

From here, we can begin iterating. In particular, we can:

- Build a reference-free scorer, such as an LLM-as-a-judge, to score routing in both offline and online evals
- Work with SMEs to identify other failure modes and iterate on them
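
As a rough idea of what a reference-free scorer could look like, the sketch below scores routing from the input and output alone. Everything here is hypothetical: the rubric, the prompt wording, the `make_routing_judge` helper, and the stub judge that stands in for a real model call.

```python
from types import SimpleNamespace
from typing import Callable, Optional

# Hypothetical rubric: the judge answers with a single letter
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

JUDGE_PROMPT = (
    "A resident wrote: {inquiry}\n"
    "The request was routed to: {route_to}\n"
    "Is this routing (A) clearly right, (B) plausible, or (C) wrong? Answer with one letter."
)


def make_routing_judge(ask_judge: Callable[[str], str]):
    """Build a scorer that needs no expected value, only the input and output."""

    def routing_judge(input, output, **kwargs) -> Optional[float]:
        prompt = JUDGE_PROMPT.format(inquiry=input, route_to=output.route_to)
        choice = ask_judge(prompt).strip().upper()[:1]
        # Unrecognized answers yield None instead of a made-up score
        return CHOICE_SCORES.get(choice)

    return routing_judge


# Stub judge so the sketch runs offline; swap in a real model call in practice
scorer = make_routing_judge(lambda prompt: "a")
score = scorer(input="There is a pothole on Main St", output=SimpleNamespace(route_to="public_works"))
print(score)
```

Because the scorer does not depend on `expected` or `metadata`, the same function shape could be applied to live traffic as well as to a dataset.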


## Fin
