# How Good Eval Infra Works

An introduction to all things Braintrust


In [1]:
import enum
import json
import os
import random

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from functools import partial
from textwrap import dedent
from typing import Annotated, Literal

import anthropic as ac
import braintrust as bt
import openai as oai
import pandas as pd
import requests

from braintrust import wrap_anthropic, wrap_openai
from dotenv import load_dotenv
from pydantic import BaseModel, BeforeValidator, Field
from thefuzz import process as fuzzy_process

load_dotenv(override=True)

  from .autonotebook import tqdm as notebook_tqdm


True

## Setup

Once you're signed up on [Braintrust](https://www.braintrust.dev/) and have set the appropriate API keys in your `.env` file, you're ready to go! For this particular workflow, you'll need the following keys:

- `BRAINTRUST_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`

With that in place we can create our [Braintrust project](https://www.braintrust.dev/docs/guides/projects) using `braintrust.projects.create()` (this method will create or else return the project if it already exists)

We also define Anthropic and OpenAI clients like usual, as well as a "wrapped" version of each client. These "wrapped" versions automatically capture usage data (e.g., prompt tokens, etc...) in Braintrust when used within the context of a logger or experiement.


In [2]:
BT_PROJECT_NAME = "wayde-ai-evals-course-2025"
MAX_WORKERS = 5

bt_project = bt.projects.create(name=BT_PROJECT_NAME)

ac_client = ac.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
oai_client = oai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

wrapped_ac_client = wrap_anthropic(ac_client)
wrapped_oai_client = wrap_openai(oai_client)


## Data


"Every error analysis starts with _traces_: the full sequence of inputs, outputs, and actions taken by the pipeline for a given input"

The recommendation is to start with ~ 100 traces of diverse user intents, with preference given to real world data where possible. In this case, we don't have access to such data and so we'll follow the AIE methodology for creating synthetic data. We'll braintrustify this workflow a bit so that it looks like this:

1. Define at least 3 "key dimensions" (e.g., categories for different parts of a user's query) of the query space.

2. For each dimension, define at least 3-4 values

3. Use an LLM to generate 10-20 unique combinations of these dimension values (e.g., dimension tuples) for each triage contact

4. Remove invalid dimension tuples (human/SME review)

5. Demonstrate an approach to building + 20 queries with the help of a SME

6. Scale up 100 or more tuples + queries using an LLM

7. Remove invalid synthetic queries (human/SME review)


### Step 1: Define Dimensions

"First, before prompting anything, we define key dimensions of the query space .. [to] help us systematically vary different aspects of a user’s request" \
-- from the course reader


In [3]:
class DimensionTuple(BaseModel):
    inquiry_type: str = Field(
        description="What kind of inquiry are they making (e.g. informational_ask, complaint,request_for_action_or_service)"
    )
    query_style_and_detail: str = Field(
        description="The style and detail of the query (e.g. short_keywords_minimal_detail, concise_moderate_detail, verbose_detailed_request)"
    )
    customer_persona: str = Field(
        description="Who is making the request (e.g. business_owner, citizen)",
    )
    customer_age: str = Field(
        description="The age of the person making the request (e.g. senior_over_65, youth_under_18, adult)",
    )


class DimensionTuples(BaseModel):
    tuples: list[DimensionTuple]

### Step 2: Define Dimension Values


In [4]:
TUPLES_GEN_PROMPT = """\
I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of inquiries citizens might submit. I have provided you with several dimensions and that constitute the parts of
such a queries along with a list of possible values for each dimension.

## Instructions

Generate {{{num_tuples_to_generate}}} unique combinations of dimension values for based on the dimensions provided below. 
- Each combination should represent a different user scenario. 
- Ensure balanced coverage across all dimensions - don't over-represent any particular value or combination.
- Vary the query styles naturally.
- Attempt to make the dimension value combinations as realistic as possible.
- Never generate a tuple where the customer_persona is a business_owner and the customer_age is youth_under_18.

## Dimensions

inquiry_type:
- informational_ask
- complaint
- request_for_action_or_service

query_style_and_detail:
- concise_moderate_detail: [includes some details relevant to the `inquiry_topic`, e.g., location, names, etc.]
- verbose_detailed_request: [includes specific details relevant to the `inquiry_topic`, e.g., location, names, etc.]
- short_keywords_minimal_details

customer_persona: Who is making the request
- business_owner
- citizen

customer_age: The age of the person making the request
- senior_over_65
- youth_under_18
- adult

Generate {{{num_tuples_to_generate}}} unique dimension tuples following these patterns. Remember to maintain balanced diversity across all dimensions."""

print(TUPLES_GEN_PROMPT.replace("{{{num_tuples_to_generate}}}", "10"))

I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of inquiries citizens might submit. I have provided you with several dimensions and that constitute the parts of
such a queries along with a list of possible values for each dimension.

## Instructions

Generate 10 unique combinations of dimension values for based on the dimensions provided below. 
- Each combination should represent a different user scenario. 
- Ensure balanced coverage across all dimensions - don't over-represent any particular value or combination.
- Vary the query styles naturally.
- Attempt to make the dimension value combinations as realistic as possible.
- Never generate a tuple where the customer_persona is a business_owner and the customer_age is youth_under_18.

## Dimensions

inquiry_type:
- informational_ask
- complaint
- request_for_action_or_service

query_style_and_detail:
- concise_moderate_detail: [includes some details relevant to the `inq

Create a versioned prompt and save it to Braintrust.


In [12]:
tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")
try:
    tuples_gen_prompt.build(num_tuples_to_generate=20)
except Exception as e:
    bt_tuples_gen_prompt = bt_project.prompts.create(
        name="DimensionTuplesGenPrompt",
        slug="dimension-tuples-gen-prompt",
        description="Prompt for generating dimension tuples",
        model="claude-4-sonnet-20250514",
        messages=[{"role": "user", "content": TUPLES_GEN_PROMPT}],
        if_exists="replace",
    )

    bt_project.publish()
    tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")


With this in place we can actually engage our domain experts in the prompt building process via our [playgrounds](https://www.braintrust.dev/docs/guides/playground) using our eval specific AI assistant, [Loop](https://www.braintrust.dev/docs/guides/loop).

We'll prompt Loop to improve our "prompt" like this:

> "Help me improve this prompt to capture all the different dimensions a customer support inquiry may have for the city of Encinitas.
>
> The dimensions and their values should represent realistic aspects of a customer support request they may get"

We can then get our updated prompt, make changes to the structured output definition, and generate some data!


In [41]:
_p = tuples_gen_prompt.build(num_tuples_to_generate=20)
print(_p["messages"][0]["content"])

I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse range of inquiries citizens might submit. I have provided you with several dimensions that constitute the parts of such queries along with a list of possible values for each dimension.

## Instructions

Generate 20 unique combinations of dimension values based on the dimensions provided below. 
- Each combination should represent a different user scenario. 
- Ensure balanced coverage across all dimensions - don't over-represent any particular value or combination.
- Vary the query styles naturally.
- Attempt to make the dimension value combinations as realistic as possible.
- Never generate a tuple where the customer_persona is a business_owner and the customer_age is youth_under_18.

## Dimensions

inquiry_type:
- informational_ask
- complaint
- request_for_action_or_service
- report_or_notification
- permit_or_license_inquiry
- payment_or_billing_question

query_style_and_detai

In [19]:
class DimensionTuple(BaseModel):
    inquiry_type: str = Field(
        description="What kind of inquiry are they making (e.g. informational_ask, complaint, request_for_action_or_service, report_or_notification, permit_or_license_inquiry, payment_or_billing_question)"
    )
    query_style_and_detail: str = Field(
        description="The style and detail of the query (e.g. concise_moderate_detail, verbose_detailed_request, short_keywords_minimal_details)"
    )
    customer_persona: str = Field(
        description="Who is making the request (e.g. business_owner, citizen, property_owner, visitor_tourist, contractor_developer)",
    )
    customer_age: str = Field(
        description="The age of the person making the request (e.g. senior_over_65, youth_under_18, adult)",
    )
    urgency_level: str = Field(
        description="How urgent the customer perceives their issue to be (e.g. low_routine_inquiry, medium_timely_response_needed, high_urgent_immediate_attention)"
    )
    communication_tone: str = Field(
        description="The emotional tone of the inquiry (e.g. neutral_professional, frustrated_annoyed, polite_respectful, demanding_aggressive, confused_seeking_clarification)"
    )
    time_sensitivity: str = Field(
        description="When the issue occurred or needs resolution (e.g. ongoing_chronic_issue, recent_within_week, immediate_happening_now, future_planning_ahead)"
    )


class DimensionTuples(BaseModel):
    tuples: list[DimensionTuple]

### Step 3: Use an LLM to generate N unique combinations of these dimension values


In [20]:
def generate_synth_data_dimension_tuples(num_tuples: int = 20, model: str = "gpt-4o-mini", model_kwargs: dict = {}):
    """Generate a list of dimension tuples based on the provided prompt."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    prompt = tuples_gen_prompt.build(num_tuples_to_generate=num_tuples)

    rsp = oai_client.beta.chat.completions.parse(
        model=model,
        messages=prompt["messages"],
        response_format=DimensionTuples,
        **model_kwargs,
    )

    tuples_list: DimensionTuples = rsp.choices[0].message.parsed  # type: ignore

    unique_tuples = []
    seen = set()

    for tup in tuples_list.tuples:
        tuple_str = tup.model_dump_json()
        if tuple_str in seen:
            continue

        seen.add(tuple_str)
        unique_tuples.append(tup)

    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"synth_tuples_it_{timestamp}")
    for uniq_tup in unique_tuples:
        with bt_experiment.start_span(name="generate_dimension_tuples") as span:
            span.log(input=prompt["messages"], output=uniq_tup, metadata=dict(model=model, model_kwargs=model_kwargs))

    summary = bt_experiment.summarize(summarize_scores=False)
    return summary, rsp.choices[0].message.parsed

In [21]:
exp_summary, dim_tuples = generate_synth_data_dimension_tuples(20)

print(exp_summary)
print(dim_tuples)

Skipping git metadata. This is likely because the repository has not been published to a remote yet. Remote named 'origin' didn't exist
Retrying request after error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Sleeping for 0.5 seconds



See results for synth_tuples_it_20250718_1510 at https://www.braintrust.dev/app/braintrustdata.com/p/wayde-ai-evals-course-2025/experiments/synth_tuples_it_20250718_1510
tuples=[DimensionTuple(inquiry_type='informational_ask', query_style_and_detail='concise_moderate_detail', customer_persona='citizen', customer_age='adult', urgency_level='low_routine_inquiry', communication_tone='neutral_professional', time_sensitivity='future_planning_ahead'), DimensionTuple(inquiry_type='complaint', query_style_and_detail='verbose_detailed_request', customer_persona='citizen', customer_age='adult', urgency_level='high_urgent_immediate_attention', communication_tone='frustrated_annoyed', time_sensitivity='recent_within_week'), DimensionTuple(inquiry_type='request_for_action_or_service', query_style_and_detail='short_keywords_minimal_details', customer_persona='property_owner', customer_age='adult', urgency_level='medium_timely_response_needed', communication_tone='polite_respectful', time_sensitivit

Skipping git metadata. This is likely because the repository has not been published to a remote yet. Remote named 'origin' didn't exist
Retrying request after error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Sleeping for 0.5 seconds


### Step 4: Remove invalid dimension tuples (human/SME review)

We'll do this in the Braintrust UI


### Step 5" Demonstrate an approach to building + 20 queries with the help of a SME


In [23]:
dept_support_areas = json.load(open("./data/dept_support_areas.json"))

triage_targets = list(set([route["slug"] for route in dept_support_areas]))

print(len(triage_targets))
triage_targets

11


['homeless_solutions',
 'human_resources',
 'traffic_engineering',
 'san_diego_county_vector_control',
 'code_enforcement',
 'public_works',
 'general',
 'san_diego_humane_society',
 'san_dieguito_water_district',
 'fire_prevention',
 'parks_recreation']

In [24]:
for idx, triage_point in enumerate(dept_support_areas):
    if idx >= 2:
        break
    print(f"{triage_point['name']} ({triage_point['slug']})")
    print("- " + "\n- ".join(triage_point["support_areas"][:5]))
    print("-> " + "\n-> ".join(triage_point["examples"][:5]))
    print()

Public Works (public_works)
- Deceased animals
- Graffiti on public property
- Water leak (non-emergency)
- Streetlight outages
- Pothole repairs
-> I saw a dead raccoon on Melba Road today—can someone remove it?
-> There’s new graffiti on the sidewalk wall along South Coast Highway.
-> A small water leak is pooling near the fire hydrant on Encinitas Blvd.
-> The streetlight in front of 450 Quail Gardens Dr has been out for 3 nights.
-> There’s a deep pothole on Manchester Ave that's damaging cars.

Code Enforcement (code_enforcement)
- Abandoned vehicles
- Illegal signs or banners
- Noise complaints
- Trash or junk on private property
- Homeless overnight camping
-> There’s a broken-down RV parked on Citrus Ave for over two weeks.
-> Someone put up a political banner on private property with no permit.
-> Neighbors at 2 a.m. are playing loud music by the pool every weekend.
-> The vacant lot next to Sunset View has piles of old furniture.
-> A homeless encampment is forming under the 

In [25]:
triage_examples = {}
for triage_contact in dept_support_areas:
    triage_target = triage_contact["slug"]
    if triage_target not in triage_examples:
        triage_examples[triage_target] = {"name": triage_contact["name"], "support_areas": triage_contact["support_areas"], "examples": []}
    triage_examples[triage_target]["examples"].extend(triage_contact["examples"])

print(f"Total triage targets: {len(triage_examples)}")
print(f"Total examples: {sum(len(data['examples']) for data in triage_examples.values())}")
print(f"First 5 examples for '{list(triage_examples.keys())[0]}': {triage_examples[list(triage_examples.keys())[0]]['examples'][:5]}")

Total triage targets: 11
Total examples: 55
First 5 examples for 'public_works': ['I saw a dead raccoon on Melba Road today—can someone remove it?', 'There’s new graffiti on the sidewalk wall along South Coast Highway.', 'A small water leak is pooling near the fire hydrant on Encinitas Blvd.', 'The streetlight in front of 450 Quail Gardens Dr has been out for 3 nights.', "There’s a deep pothole on Manchester Ave that's damaging cars."]


### Step 6. Scale up 100 or more tuples + queries using an LLM

We can use the "BTQL Sandbox" to help us construct a query to get the "good" dimension tuples from our annotation exercise, and then use that query to get those records to build some synthetic queries


In [26]:
def get_valid_dimension_tuples():
    cursor = None
    while True:
        response = requests.post(
            "https://staging-api.braintrust.dev/btql",
            json={
                "query": dedent("""
                        select: output
                        from: experiment('36b63064-7f4e-46a4-9673-5533fa327dc5')
                        filter: scores."is_good" = 1
                """)
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        cursor = response_json.get("cursor")

        return [row["output"] for row in data]


valid_dim_tuples = get_valid_dimension_tuples()

print(len(valid_dim_tuples))
valid_dim_tuples[:5]

17


[{'communication_tone': 'polite_respectful',
  'customer_age': 'adult',
  'customer_persona': 'property_owner',
  'inquiry_type': 'report_or_notification',
  'query_style_and_detail': 'verbose_detailed_request',
  'time_sensitivity': 'immediate_happening_now',
  'urgency_level': 'high_urgent_immediate_attention'},
 {'communication_tone': 'neutral_professional',
  'customer_age': 'adult',
  'customer_persona': 'business_owner',
  'inquiry_type': 'informational_ask',
  'query_style_and_detail': 'short_keywords_minimal_details',
  'time_sensitivity': 'future_planning_ahead',
  'urgency_level': 'medium_timely_response_needed'},
 {'communication_tone': 'neutral_professional',
  'customer_age': 'adult',
  'customer_persona': 'contractor_developer',
  'inquiry_type': 'permit_or_license_inquiry',
  'query_style_and_detail': 'short_keywords_minimal_details',
  'time_sensitivity': 'immediate_happening_now',
  'urgency_level': 'low_routine_inquiry'},
 {'communication_tone': 'frustrated_annoyed',


Create a versioned prompt and save it to Braintrust.


In [29]:
SYNTH_DATA_GEN_PROMPT = """\
I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of realistic inquiries citizens that would appropriately be routed to this contact:

Contact: {{{contact_name}}}
Support Areas: {{{support_areas}}}

## Objective

Produce {{{num_queries_to_generate}}} additional natural language customer support requests using a sample of actual customer support requests 
below, but augmented to have one of the characteristics listed below:

==== Inquiry Characteristics =====
{{{dimension_tuple_json}}}

===== Example Inquiries =====
{{{example_inquiries}}}

## Instructions
To produce each inquiry, follow these steps:
1. Choose a random example from the list of example inquiries above.
2. Choose a random record from the list of inquiry characteristics above.
3. Augment the example inquiry to have the chosen characteristic of that record.
4. Return the augmented inquiry.

The queries should:
1. Sound like real users asking for assistance
2. Naturally incorporate all the dimension values
3. Vary in style and detail level
4. Be realistic and practical
5. Change any named entities (locations, addresses, street names, person names, etc...) to diversify the content
5. Include natural variations in typing style, such as:
   - Some queries in all lowercase
   - Some with random capitalization
   - Some with common typos
   - Some with missing punctuation
   - Some with extra spaces or missing spaces
   - Some with emojis or text speak

Here are examples of realistic query variations for a request to get a pothole fixed from 
an adult, business_owner, concise_moderate_detail:

Proper formatting:
- "There is a huge pothole on the corner of 123 Main St. It's been there for weeks affecting customers trying to park nearby"
- "Can someone please fix the pothole on 123 Main St."

All lowercase:
- "need someone to fix the pothole on 123 main st."
- "can a crew come out and fix the pothole on the corner of 123 main st. and grand ave."

Random caps:
- "NEED someone to fix the pothole on 123 main st. ASAP"
- "can a crew come out and fix the pothole on the corner of 123 main st. and grand ave??? It needs to happen NOW! Like today!"

Common typos:
- "need someone to fx the pothol on 123 main st. asap pleze!!!"
- "Can a crew come out and fx the pothole on the corner of 123 main st. & grand ave??? thx much!"

Missing punctuation:
- "need someone to fx the pothole on 123 main st asap plz"
- "Can a crew come out and fix the pothole on the corner of 123 main st & grand ave ... thx much"

With emojis/text speak:
- "the pothole needs to be fixed now on 123 main st. and grand ave! 🥗"
- "pls help get the pothole on 123 main st. & grand ave fixed thx"

Generate {{{num_queries_to_generate}}} unique queries,varying the text style naturally."""


print(
    (
        SYNTH_DATA_GEN_PROMPT.replace("{{{num_queries_to_generate}}}", "4")
        .replace("{{{dimension_tuple_json}}}", json.dumps(valid_dim_tuples[:4], indent=2))
        .replace("{{{example_inquiries}}}", "- " + "\n- ".join(triage_examples["public_works"]["examples"][:5]))
        .replace("{{{contact_name}}}", triage_examples["public_works"]["name"])
        .replace("{{{support_areas}}}", ", ".join(triage_examples["public_works"]["support_areas"]))
    )
)

I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of realistic inquiries citizens that would appropriately be routed to this contact:

Contact: Public Works
Support Areas: Deceased animals, Graffiti on public property, Water leak (non-emergency), Streetlight outages, Pothole repairs

## Objective

Produce 4 additional natural language customer support requests using a sample of actual customer support requests 
below, but augmented to have one of the characteristics listed below:

==== Inquiry Characteristics =====
[
  {
    "communication_tone": "polite_respectful",
    "customer_age": "adult",
    "customer_persona": "property_owner",
    "inquiry_type": "report_or_notification",
    "query_style_and_detail": "verbose_detailed_request",
    "time_sensitivity": "immediate_happening_now",
    "urgency_level": "high_urgent_immediate_attention"
  },
  {
    "communication_tone": "neutral_professional",
    "customer_age": "a

In [37]:
synth_query_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="synth-query-gen-prompt")
try:
    synth_query_gen_prompt.prompt
except Exception as e:
    bt_syth_query_gen_prompt = bt_project.prompts.create(
        name="SynthQueryGenPrompt",
        slug="synth-query-gen-prompt",
        description="Prompt for generating synthetic queries",
        model="claude-4-sonnet-20250514",
        messages=[{"role": "user", "content": SYNTH_DATA_GEN_PROMPT}],
        if_exists="replace",
    )

    bt_project.publish()
    synth_query_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="synth-query-gen-prompt")


In [38]:
_p = synth_query_gen_prompt.build(
    num_queries_to_generate=4,
    dimension_tuple_json=json.dumps(valid_dim_tuples[:4], indent=2),
    example_inquiries="- " + "\n- ".join(triage_examples["public_works"]["examples"][:5]),
    contact_name=triage_examples["public_works"]["name"],
    support_areas=", ".join(triage_examples["public_works"]["support_areas"]),
)

In [42]:
class QueryList(BaseModel):
    queries: list[str]

In [43]:
def generate_synth_queries(triage_slug: str, num_queries: int = 10, model: str = "gpt-4o-mini", model_kwargs: dict = {}) -> dict:
    random_dim_tuples = random.sample(valid_dim_tuples, min(num_queries, len(valid_dim_tuples)))

    prompt = synth_query_gen_prompt.build(
        num_queries_to_generate=num_queries,
        dimension_tuple_json=json.dumps(random_dim_tuples, indent=2),
        example_inquiries="- " + "\n- ".join(triage_examples[triage_slug]["examples"][:5]),
        contact_name=triage_examples[triage_slug]["name"],
        support_areas=", ".join(triage_examples[triage_slug]["support_areas"]),
    )

    rsp = oai_client.beta.chat.completions.parse(
        model=model,
        messages=prompt["messages"],
        response_format=QueryList,
        **model_kwargs,
    )

    query_list: QueryList = rsp.choices[0].message.parsed  # type: ignore

    return {
        "prompt": prompt["messages"],
        "triage_slug": triage_slug,
        "sampled_dim_tuples": random_dim_tuples,
        "handcoded_queries": triage_examples[triage_slug]["examples"],
        "synth_queries": query_list.queries,
    }

In [44]:
# rsp = generate_synth_queries("public_works", num_queries=10)

# for q in rsp["handcoded_queries"]:
#     print(q)

# print("-" * 100)

# for q in rsp["synth_queries"]:
#     print(q)

In [45]:
def generate_queries_parallel(triage_targets: list[str], num_queries: int = 10, model: str = "gpt-4o-mini", model_kwargs: dict = {}):
    """Generate queries in parallel for all dimension tuples."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")

    print(f"Generating {num_queries} queries each for {len(triage_targets)} triage targets...")
    # Run in parallel
    worker = partial(generate_synth_queries, num_queries=num_queries, model=model, model_kwargs=model_kwargs)
    responses = list(ThreadPoolExecutor(max_workers=MAX_WORKERS).map(worker, triage_targets))

    # Add query items
    all_queries = []
    for response in responses:
        prompt = response["prompt"]
        triage_slug = response["triage_slug"]

        queries = [{"prompt": prompt, "triage_slug": triage_slug, "query": q, "query_source": "synth"} for q in response["synth_queries"]]
        queries.extend(
            [{"prompt": prompt, "triage_slug": triage_slug, "query": q, "query_source": "handcoded"} for q in response["handcoded_queries"]]
        )

        all_queries.extend(queries)

    # Add to experiment
    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"add_queries_it_{timestamp}")
    query_id = 1

    for query_item in all_queries:
        qid = f"{timestamp}_{query_id:03d}"
        query_id += 1

        with bt_experiment.start_span(name="add_query") as span:
            span.log(
                input=query_item["prompt"],
                output=query_item["query"],
                metadata={
                    "id": qid,
                    "triage_slug": query_item["triage_slug"],
                    "query_source": query_item["query_source"],
                    "model": model,
                    "model_kwargs": model_kwargs,
                },
            )

    summary = bt_experiment.summarize(summarize_scores=False)
    return summary, queries

In [46]:
summary, queries = generate_queries_parallel(triage_targets, num_queries=10, model="gpt-4o-mini")
# summary, queries = generate_queries_parallel(triage_targets[:2], num_queries=10, model="gpt-4o-mini")

print(len(queries))
print(summary)

Generating 10 queries each for 11 triage targets...
15

See results for add_queries_it_20250718_1530 at https://www.braintrust.dev/app/braintrustdata.com/p/wayde-ai-evals-course-2025/experiments/add_queries_it_20250718_1530


### Step 7. Remove invalid queries (human/SME review)

We'll do this in the Braintrust UI


## Tasks

Lets use our "good" synthetic and handcoded queries to build a real-world customer support bot with two key pieces:

1. A web search retrieval task to get context w/r/t answering and routing requests
2. A retrieval synthesis task that takes the output from the retrieval tasks to finalize where the support should get routed, what information is needed to resolve the request, and an answer of some sort to provide the user

To make this happen, we'll define some tasks and [log](https://www.braintrust.dev/docs/guides/logs) them to Braintrust.


In [3]:
bt_logger = bt.init_logger(project=BT_PROJECT_NAME)

dept_support_areas = json.load(open("./data/dept_support_areas.json"))

triage_targets = list(set([route["slug"] for route in dept_support_areas]))

Let's use our old friend, the "BTQL Sandbox" to build a query to get only the queries marked as good from the human review above.


In [4]:
def get_test_data():
    cursor = None
    while True:
        response = requests.post(
            "https://staging-api.braintrust.dev/btql",
            json={
                "query": dedent("""
                        select: output, metadata
                        from: experiment('b715cf02-86f6-4471-bc10-cef617e128ec')
                        filter: scores."is_good" = 1
                """)
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        cursor = response_json.get("cursor")

        return [{"input": row["output"], "metadata": row["metadata"]} for row in data]


test_data = get_test_data()

print(len(test_data))
test_data[:5]

56


[{'input': 'An officer at the parking kiosk at Cottonwood Creek was rude to me.',
  'metadata': {'id': '20250716_1728_101',
   'model': 'gpt-4o-mini',
   'model_kwargs': {},
   'query_source': 'handcoded',
   'triage_slug': 'human_resources'}},
 {'input': 'Do you offer fire‑safe yard workshops or materials for residents?',
  'metadata': {'id': '20250716_1728_090',
   'model': 'gpt-4o-mini',
   'model_kwargs': {},
   'query_source': 'handcoded',
   'triage_slug': 'fire_prevention'}},
 {'input': 'Can I book a fire safety inspection for my new event space on Vulcan Ave?',
  'metadata': {'id': '20250716_1728_089',
   'model': 'gpt-4o-mini',
   'model_kwargs': {},
   'query_source': 'handcoded',
   'triage_slug': 'fire_prevention'}},
 {'input': 'Neighbors built a wood storage lean‑to less than 5\u202fft from their house.',
  'metadata': {'id': '20250716_1728_088',
   'model': 'gpt-4o-mini',
   'model_kwargs': {},
   'query_source': 'handcoded',
   'triage_slug': 'fire_prevention'}},
 {'inpu

### Retrieval


#### Prompt


In [5]:
WEB_SEARCH_SYSTEM_PROMPT = dedent("""\
    Your name is Sunny and you are in charge of answering question about the city of {{{city}}} and routing requests to the correct departments for triage.
    You are given access to a web search retrieval tool to use in an attempt to answer satisfy the user's inquiry.
    
    ## Instructions
    - Always include any relevant links in your response! Never respond with a generic link placeholder like '(Note: The specific registration link would be on the city's website)'     
    - If you cannot find the information you need or fulfill the request, make sure to tell the user that in the final answer
""")

In [6]:
bt_web_search_prompt = bt_project.prompts.create(
    name="WebSearchPrompt",
    slug="web-search-prompt",
    description="Prompt for retrieving context from the web",
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": WEB_SEARCH_SYSTEM_PROMPT},
        {"role": "user", "content": "{{{query}}}"},
    ],
    if_exists="replace",
)

bt_project.publish()

{'status': 'success'}

In [7]:
web_search_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="web-search-prompt")
web_search_prompt.build(city="Encinitas", query="What is the weather in Encinitas?")

{'model': 'claude-sonnet-4-20250514',
 'span_info': {'metadata': {'prompt': {'variables': {'city': 'Encinitas',
     'query': 'What is the weather in Encinitas?'},
    'id': '77c412bb-4403-4002-84e1-03c70b89bf2d',
    'project_id': 'b3adadab-154b-4474-ad2d-820167938213',
    'version': '1000195469107564590'}}},
 'messages': [{'content': "Your name is Sunny and you are in charge of answering question about the city of Encinitas and routing requests to the correct departments for triage.\nYou are given access to a web search retrieval tool to use in an attempt to answer satisfy the user's inquiry.\n\n## Instructions\n- Always include any relevant links in your response! Never respond with a generic link placeholder like '(Note: The specific registration link would be on the city's website)'     \n- If you cannot find the information you need or fulfill the request, make sure to tell the user that in the final answer\n",
   'role': 'system'},
  {'content': 'What is the weather in Encinita

#### Config


In [8]:
# limit the number of searches per request
N_SEARCHES = 1

# To limit the scope of the search to specific domains
RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS = [
    "encinitasca.gov",
    # "anc.apm.activecommunities.com/encinitasparksandrec",  # P&R activity registration
    # "/issuu.com/encinitasca.gov/",  # P&R activty guides
]

# To localize search results based on user's location
USER_LOCATION = {
    "type": "approximate",
    "city": "Encinitas",
    "region": "California",
    "country": "US",
    "timezone": "America/Los_Angeles",
}

#### Task


In [9]:
@bt.traced()
def ac_web_search(query: str, city: str, model="claude-sonnet-4-20250514", model_kwargs={}):
    """Search the web for the given query."""

    prompt = web_search_prompt.build(city=city, query=query)

    rsp = wrapped_ac_client.messages.create(
        model=model,
        system=prompt["messages"].pop(0)["content"],
        messages=prompt["messages"],
        tools=[
            {
                "type": "web_search_20250305",
                "name": "web_search",
                "max_uses": N_SEARCHES,
                "allowed_domains": RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS,
                "user_location": USER_LOCATION,
            }  # type: ignore
        ],
        **model_kwargs,
    )
    return rsp


In [10]:
_example = test_data[0]
_example

{'input': 'An officer at the parking kiosk at Cottonwood Creek was rude to me.',
 'metadata': {'id': '20250716_1728_101',
  'model': 'gpt-4o-mini',
  'model_kwargs': {},
  'query_source': 'handcoded',
  'triage_slug': 'human_resources'}}

In [11]:
# with bt_logger.start_span(name="web_search") as trace:
rsp = ac_web_search(
    _example["input"],
    "Encinitas",
    model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
)

rsp
# rsp.model_dump()

Message(id='msg_016TLPyuR3H8ZTM95eWeJdZi', content=[ThinkingBlock(signature='Er0ECkYIBRgCKkB/4X+mbyW8aM4KW0D9/5mJIA+ihBd5Y9k0YWcVHAt33yvVZK4mB1CqnVdYJ+BVaBmUQTKjw2F6bmxqMqzevsRoEgyY6KA7LlHzOy0e9AsaDIo9pMGkOL2v9/JYmCIwEqxwaFbr8v0cY2GRB5nexaP+kYucFlTixYEi/SAXDadWpCWtTMjJNXIsi/XK4vXxKqQD8nZlOhhfbKidWgYJ7Pk9oBiLLjZkbXZCnyjS7HK1Y91my6mxK0mvr0sZC1yfysC9uoT0v2bv3KfJbyrC2RM9wig24iBWDMLYVc/eHNY6fBNe2Z2A9Z0ghsRnJk4jz1JPzVOyWtRN3pdVWjntkKTCKOYVT0VKugEUVoPRxVUZoblTJz0jrZeHq5c4gqfzmbeaJzk+itzeYO+OXt8HIcAml49kcdedRpuDyw3gEPTnk6S1cCtsyDd7HioIwAcxVt3wCMP1eELHKrD0n/wHx9GbdBqksPf2mZD83zZEBj6POe8EWbKndKtZpfcpg/aQNE3ACGpkLmMhIsMxRcfu7IQ61pZLuhFrTlbh4tqgs69tuofHVSCQgste/ZQKVNGn8RvIref2pJqxRUC9yr0f4Wns+dhm+qk+h+kwOct6UFdxvm/qmMSAzGs3/y/whnLfF2aeprLM51NVOUKYXZZgUO/PUuXqwRoP1QKUlLcTxlojTmG4gfL3A1PzTw7Edg4y1nb4J+1RSyLstzkIlLwkQfrPoT5n9MQBk1f4ucCwxsa16u/jYabCGAE=', thinking="The user is reporting an incident about a rude officer at a parking kiosk at Cottonwood Creek in Encinitas. This is a complaint about city

#### Parse Utility


In [12]:
def parse_ac_web_search_response(message):
    """
    Parse a web search response message to extract searches, results, citations, and final answers.

    Args:
        message: The message object from Anthropic's web search tool

    Returns:
        dict: Contains 'searches', 'results', 'citations', and 'final_answer'
    """

    searches = []
    results = []
    citations = []
    final_answer_parts = []

    # Iterate through all content blocks
    for block in message.content:
        # Extract search queries
        if hasattr(block, "type") and block.type == "server_tool_use":
            if hasattr(block, "name") and block.name == "web_search":
                search_query = block.input.get("query", "No query found")
                searches.append({"id": block.id, "query": search_query, "input": block.input})

        # Extract search results
        elif hasattr(block, "type") and block.type == "web_search_tool_result":
            for result_block in block.content:
                if hasattr(result_block, "type") and result_block.type == "web_search_result":
                    results.append(
                        {
                            "title": result_block.title,
                            "url": result_block.url,
                            "encrypted_content": result_block.encrypted_content,
                            "page_age": getattr(result_block, "page_age", None),
                        }
                    )

        # Extract final answer text blocks (with or without citations)
        elif hasattr(block, "type") and block.type == "text":
            text_content = block.text
            block_citations = getattr(block, "citations", None)

            # Add to final answer
            final_answer_parts.append(text_content)

            # If this text block has citations, extract them
            if block_citations:
                for citation in block_citations:
                    citations.append(
                        {"cited_text": citation.cited_text, "title": citation.title, "url": citation.url, "type": citation.type}
                    )

    return {"searches": searches, "results": results, "citations": citations, "final_answer": "".join(final_answer_parts)}


In [13]:
# Test the parsing function with the existing response
parsed = parse_ac_web_search_response(rsp)

print("=== SEARCHES ===")
for i, search in enumerate(parsed["searches"], 1):
    print(f"{i}. Query: {search['query']}")
    print(f"   ID: {search['id']}")
    print()

print("=== RESULTS ===")
for i, result in enumerate(parsed["results"], 1):
    print(f"{i}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Page Age: {result['page_age']}")
    print()

print("=== CITATIONS ===")
for i, citation in enumerate(parsed["citations"], 1):
    print(f"{i}. Cited Text: {citation['cited_text'][:100]}...")
    print(f"   Title: {citation['title']}")
    print(f"   URL: {citation['url']}")
    print()

print("=== FINAL ANSWER ===")
print(parsed["final_answer"])  # [:500] + "..." if len(parsed["final_answer"]) > 500 else parsed["final_answer"])


=== SEARCHES ===
1. Query: Encinitas complaint parking enforcement officer
   ID: srvtoolu_01C8dx6Ux8otrNSzYqS1QKmc

=== RESULTS ===
1. Second Level Appeal | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/development-services/code-enforcement/parking-citations-appeals/second-level-appeal
   Page Age: None

2. Disabled Parking Violations | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/development-services/code-enforcement/parking-citations-appeals/disabled-parking-violations
   Page Age: None

3. ACE Parking - Downtown & Business District Parking Citations | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/development-services/code-enforcement/ace-parking-downtown-business-district-parking-citations
   Page Age: None

4. Filing a Complaint | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/development-services/code-enforcement/filing-a-complaint
   Page Age: None

5. Code

#### Task (v2)


In [14]:
@bt.traced()
def ac_web_search(query: str, city: str, model="claude-sonnet-4-20250514", model_kwargs={}):
    """Search the web for the given query."""

    prompt = web_search_prompt.build(city=city, query=query)

    rsp = wrapped_ac_client.messages.create(
        model=model,
        system=prompt["messages"].pop(0)["content"],
        messages=prompt["messages"],
        **model_kwargs,
        tools=[
            {
                "type": "web_search_20250305",
                "name": "web_search",
                "max_uses": N_SEARCHES,
                "allowed_domains": RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS,
                "user_location": USER_LOCATION,
            }  # type: ignore
        ],
    )

    # return rsp
    parsed = parse_ac_web_search_response(rsp)
    return {"citations": parsed["citations"], "final_answer": parsed["final_answer"]}

In [15]:
rsp = ac_web_search(
    _example["input"],
    "Encinitas",
    model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
)

rsp
# rsp.model_dump()

{'citations': [{'cited_text': 'Code Enforcement 505 S. Vulcan Ave. Encinitas, CA 92024 (760) 633-2685 Email ... Code Enforcement has the responsibility of ensuring compliance with t...',
   'title': 'Code Enforcement | City of Encinitas',
   'url': 'https://www.encinitasca.gov/i-want-to/requests-hub/code-enforcement-form-761',
   'type': 'web_search_result_location'},
  {'cited_text': 'Code Enforcement 505 S. Vulcan Ave. Encinitas, CA 92024 (760) 633-2685 Email · Code Enforcement Guide · Campaign Signs FAQ · Parking Citations & Appea...',
   'title': 'Code Enforcement | City of Encinitas',
   'url': 'https://www.encinitasca.gov/i-want-to/requests-hub/code-enforcement-form-761',
   'type': 'web_search_result_location'},
  {'cited_text': 'If you witness a violation, please notify Code Enforcement staff by completing an online complaint form. ',
   'title': 'Code Enforcement | City of Encinitas',
   'url': 'https://www.encinitasca.gov/i-want-to/requests-hub/code-enforcement-form-761',
   

### Retrieval Synthesiss


#### Prompt


In [16]:
for dept in dept_support_areas:
    print(dept["name"])
    print("- " + "\n- ".join(dept["support_areas"]))
    print("-> " + "\n-> ".join(dept["examples"]))
    print()

Public Works
- Deceased animals
- Graffiti on public property
- Water leak (non-emergency)
- Streetlight outages
- Pothole repairs
-> I saw a dead raccoon on Melba Road today—can someone remove it?
-> There’s new graffiti on the sidewalk wall along South Coast Highway.
-> A small water leak is pooling near the fire hydrant on Encinitas Blvd.
-> The streetlight in front of 450 Quail Gardens Dr has been out for 3 nights.
-> There’s a deep pothole on Manchester Ave that's damaging cars.

Code Enforcement
- Abandoned vehicles
- Illegal signs or banners
- Noise complaints
- Trash or junk on private property
- Homeless overnight camping
-> There’s a broken-down RV parked on Citrus Ave for over two weeks.
-> Someone put up a political banner on private property with no permit.
-> Neighbors at 2 a.m. are playing loud music by the pool every weekend.
-> The vacant lot next to Sunset View has piles of old furniture.
-> A homeless encampment is forming under the 101 overpass near Cardiff.

Parks 

In [17]:
triage_targets

['traffic_engineering',
 'general',
 'parks_recreation',
 'human_resources',
 'san_dieguito_water_district',
 'homeless_solutions',
 'public_works',
 'san_diego_county_vector_control',
 'fire_prevention',
 'san_diego_humane_society',
 'code_enforcement']

In [18]:
def build_triage_route_details() -> str:
    output = "=" * 10 + "\n"
    for dept in dept_support_areas:
        output += f"Triage Target: {dept['slug']}\n"
        output += f"Triage Name: {dept['name']}\n"
        output += f"Support Areas: {','.join(dept['support_areas'])}"
        output += "\n" + "=" * 10 + "\n"

    return output.strip()


In [19]:
print(build_triage_route_details())

Triage Target: public_works
Triage Name: Public Works
Support Areas: Deceased animals,Graffiti on public property,Water leak (non-emergency),Streetlight outages,Pothole repairs
Triage Target: code_enforcement
Triage Name: Code Enforcement
Support Areas: Abandoned vehicles,Illegal signs or banners,Noise complaints,Trash or junk on private property,Homeless overnight camping
Triage Target: parks_recreation
Triage Name: Parks & Recreation
Support Areas: Playground issues,Gate issues in parks,Graffiti in park sites,Trail maintenance,Trash or junk in park sites
Triage Target: homeless_solutions
Triage Name: Homeless Solutions
Support Areas: Homeless encampments in parks or beaches,Homeless encampments on private property,General inquiries about individuals experiencing homelessness,Reporting aggressive behavior,Need outreach services
Triage Target: fire_prevention
Triage Name: Fire Prevention
Support Areas: Fire hazard reporting,Concerns about brush clearance,Fire code violations,Inspection

In [20]:
SYNTHESIZE_RETRIEVAL_SYSTEM_PROMPT = f"""\
Your name is Sunny and you are an expert in providing responses to customer support responses for the city of {{{{city}}}} based on the original user inquiry, 
triage route descriptions, and the context provided.

## Instructions
- Provide a final answer to the user inquiry ONLY if it is answerable from the "Context" provided below.
- Using the user's inquiry and the "Context" provided below, choose the most specific triage target to route the user to based on the "Triage Options" defined below.
- If provided, make sure you include any relevant links, phone numbers, emails, or other information from the "Context" and how the user can use them to get the information they need.
- Use markdown formatting for the full response.
- Always direct complaints about personnel to "human_resources"

## Triage Options    
{build_triage_route_details()}

## Context
{{{{context}}}}"""

In [21]:
bt_retieval_synthesis_prompt = bt_project.prompts.create(
    name="WebSearchRetrievalSynthesis",
    slug="web-search-retrieval-synthesis",
    description="Prompt for synthesizing the retrieval results into a final answer",
    model="claude-4-sonnet-20250514",
    messages=[
        {"role": "system", "content": SYNTHESIZE_RETRIEVAL_SYSTEM_PROMPT},
        {"role": "user", "content": "User Inquiry: {{{user_inquiry}}}"},
    ],
    if_exists="replace",
)

bt_project.publish()

{'status': 'success'}

In [22]:
retrieval_synth_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="web-search-retrieval-synthesis")

_p = retrieval_synth_prompt.build(city="Encinitas", user_inquiry=_example["input"], context=rsp)
_p

{'model': 'claude-4-sonnet-20250514',
 'span_info': {'metadata': {'prompt': {'variables': {'city': 'Encinitas',
     'user_inquiry': 'An officer at the parking kiosk at Cottonwood Creek was rude to me.',
     'context': {'citations': [{'cited_text': 'Code Enforcement 505 S. Vulcan Ave. Encinitas, CA 92024 (760) 633-2685 Email ... Code Enforcement has the responsibility of ensuring compliance with t...',
        'title': 'Code Enforcement | City of Encinitas',
        'url': 'https://www.encinitasca.gov/i-want-to/requests-hub/code-enforcement-form-761',
        'type': 'web_search_result_location'},
       {'cited_text': 'Code Enforcement 505 S. Vulcan Ave. Encinitas, CA 92024 (760) 633-2685 Email · Code Enforcement Guide · Campaign Signs FAQ · Parking Citations & Appea...',
        'title': 'Code Enforcement | City of Encinitas',
        'url': 'https://www.encinitasca.gov/i-want-to/requests-hub/code-enforcement-form-761',
        'type': 'web_search_result_location'},
       {'cited_t

In [23]:
print(_p["messages"][0]["content"])

Your name is Sunny and you are an expert in providing responses to customer support responses for the city of Encinitas based on the original user inquiry, 
triage route descriptions, and the context provided.

## Instructions
- Provide a final answer to the user inquiry ONLY if it is answerable from the "Context" provided below.
- Using the user's inquiry and the "Context" provided below, choose the most specific triage target to route the user to based on the "Triage Options" defined below.
- If provided, make sure you include any relevant links, phone numbers, emails, or other information from the "Context" and how the user can use them to get the information they need.
- Use markdown formatting for the full response.
- Always direct complaints about personnel to "human_resources"

## Triage Options    
Triage Target: public_works
Triage Name: Public Works
Support Areas: Deceased animals,Graffiti on public property,Water leak (non-emergency),Streetlight outages,Pothole repairs
Triag

#### Structured Output Definition


In [24]:
class CustomerSupportResponse(BaseModel):
    """The response from the customer support agent."""

    def convert_str_to_triage_target(v: str) -> str:  # type: ignore
        """Ensure entity type is a valid enum."""
        if v in triage_targets:
            return v
        else:
            try:
                match, score = fuzzy_process.extractOne(v.upper(), [s.upper() for s in triage_targets])  # type: ignore
                return match if score >= 80 else "General"
            except ValueError:
                return "General"

    chain_of_thought: str = Field(
        ...,
        description="Explain your decision about whether this question is answerable and why you chose to route it to the triage contact you chose",
    )
    is_answerable_from_context: bool = Field(
        ..., description="Whether the given response can answer the question from the context provided"
    )
    requires_follow_up: bool = Field(
        ...,
        description="Whether the inquiry requires follow-up from the triage contact",
    )
    requires_location: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide their location to be resolved by the triage contact",
    )
    requires_photo_or_video: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide a photo or video to be resolved by the triage contact",
    )
    requires_user_contact_info: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide their contact information to be resolved by the triage contact",
    )
    full_response: str = Field(
        ...,
        description="The detailed answer to the question based on the context provided",
    )
    final_answer: str = Field(
        ...,
        description="A succinct and concise answer to the question based on the context provided",
    )
    route_to: Annotated[Literal[*triage_targets], BeforeValidator(convert_str_to_triage_target)]  # type: ignore


#### Task


In [25]:
class MODEL_VENDOR(str, enum.Enum):
    ANTHROPIC = "anthropic"
    OPENAI = "openai"

In [26]:
@bt.traced()
def synthesize_retrieval(
    user_inquiry: str,
    retrieval_response: str,
    city: str,
    vendor: MODEL_VENDOR = MODEL_VENDOR.ANTHROPIC,
    model="claude-sonnet-4-20250514",
    model_kwargs={},
) -> CustomerSupportResponse:
    """Synthesize the retrieval results into a final answer."""

    prompt = retrieval_synth_prompt.build(city=city, user_inquiry=user_inquiry, context=retrieval_response)

    rsp = None
    if vendor == MODEL_VENDOR.ANTHROPIC:
        tools = [
            {
                "name": "customer_support_response",
                "description": "Build CustomerSupportResponse object",
                "input_schema": CustomerSupportResponse.model_json_schema(),
            }
        ]
        rsp = wrapped_ac_client.messages.create(
            model=model,
            system=prompt["messages"].pop(0)["content"],  # type: ignore
            messages=prompt["messages"],
            tools=tools,
            tool_choice={"type": "tool", "name": "customer_support_response"},
            **model_kwargs,
        )
        return CustomerSupportResponse(**rsp.content[0].input)

    elif vendor == MODEL_VENDOR.OPENAI:
        result = wrapped_oai_client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=prompt["messages"],
            response_format=CustomerSupportResponse,
            **model_kwargs,
        )
        return result.choices[0].message.parsed

    else:
        raise ValueError(f"Invalid vendor: {vendor}")


In [27]:
parsed["final_answer"]

"I'm sorry to hear about your negative experience with the officer at the parking kiosk at Cottonwood Creek. This type of conduct is certainly not acceptable, and the city should be made aware of this issue.\n\nTo file a complaint about the officer's behavior, here are your options:\n\n**For Code Enforcement/Parking Officer Complaints:**\nContact Code Enforcement at 505 S. Vulcan Ave. Encinitas, CA 92024 or call (760) 633-2685, or you can email them directly.\n\n**Filing a Citizen Complaint:**\nYou can also file a complaint through the City's Code Enforcement complaint process. There's a Citizen Complaint Form available on the city's website at: https://www.encinitasca.gov/government/departments/development-services/code-enforcement/filing-a-complaint/citizen-complaint-form\n\n**Additional Contact Information:**\n- Phone: (760) 943-2299 (this number is used for parking citation inquiries but may also help route your complaint)\n- Address: City of Encinitas, Attention: Code Enforcement,

Let's try it with Anthropic


In [28]:
rsp = synthesize_retrieval(
    user_inquiry=_example["input"],
    retrieval_response=parsed["final_answer"],
    city="Encinitas",
    model_kwargs={"max_tokens": 1024},
)
rsp

CustomerSupportResponse(chain_of_thought='This inquiry involves a complaint about the conduct of a city officer (parking enforcement officer). The user is reporting rude behavior from an officer at the Cottonwood Creek parking kiosk. According to the instructions, complaints about personnel should always be directed to "human_resources". The context provided contains detailed information about how to file complaints about officer behavior, including specific contact information and forms, making this inquiry fully answerable from the context.', is_answerable_from_context=True, requires_follow_up=True, requires_location=False, requires_photo_or_video=False, requires_user_contact_info=False, full_response="I'm sorry to hear about your negative experience with the officer at the parking kiosk at Cottonwood Creek. This type of conduct is certainly not acceptable, and the city should be made aware of this issue.\n\nTo file a complaint about the officer's behavior, here are your options:\n\n

In [29]:
rsp.model_dump()

{'chain_of_thought': 'This inquiry involves a complaint about the conduct of a city officer (parking enforcement officer). The user is reporting rude behavior from an officer at the Cottonwood Creek parking kiosk. According to the instructions, complaints about personnel should always be directed to "human_resources". The context provided contains detailed information about how to file complaints about officer behavior, including specific contact information and forms, making this inquiry fully answerable from the context.',
 'is_answerable_from_context': True,
 'requires_follow_up': True,
 'requires_location': False,
 'requires_photo_or_video': False,
 'requires_user_contact_info': False,
 'full_response': "I'm sorry to hear about your negative experience with the officer at the parking kiosk at Cottonwood Creek. This type of conduct is certainly not acceptable, and the city should be made aware of this issue.\n\nTo file a complaint about the officer's behavior, here are your options:

Let's try it with OpenAI


In [30]:
rsp = synthesize_retrieval(
    user_inquiry=_example["input"],
    retrieval_response=parsed["final_answer"],
    city="Encinitas",
    vendor=MODEL_VENDOR.OPENAI,
    model="gpt-4o-mini",
    model_kwargs={},
)
rsp

CustomerSupportResponse(chain_of_thought="The inquiry relates to a complaint about the behavior of a city staff member, which falls under the jurisdiction of Code Enforcement. The context provides detailed guidance on how to file a complaint regarding an officer's behavior at the parking kiosk.", is_answerable_from_context=True, requires_follow_up=True, requires_location=False, requires_photo_or_video=False, requires_user_contact_info=True, full_response="I'm sorry to hear about your negative experience with the officer at the parking kiosk at Cottonwood Creek. Such conduct is certainly not acceptable, and the city should be made aware of this issue.\n\nTo file a complaint about the officer's behavior, you have a few options:\n\n1. **Contact Code Enforcement:**\n   - Address: 505 S. Vulcan Ave., Encinitas, CA 92024\n   - Phone: (760) 633-2685\n   - Email: [Code Enforcement Email] (Contact information can be found on their website)\n\n2. **Filing a Citizen Complaint:**\n   - Use the Cit

In [31]:
rsp.model_dump()

{'chain_of_thought': "The inquiry relates to a complaint about the behavior of a city staff member, which falls under the jurisdiction of Code Enforcement. The context provides detailed guidance on how to file a complaint regarding an officer's behavior at the parking kiosk.",
 'is_answerable_from_context': True,
 'requires_follow_up': True,
 'requires_location': False,
 'requires_photo_or_video': False,
 'requires_user_contact_info': True,
 'full_response': "I'm sorry to hear about your negative experience with the officer at the parking kiosk at Cottonwood Creek. Such conduct is certainly not acceptable, and the city should be made aware of this issue.\n\nTo file a complaint about the officer's behavior, you have a few options:\n\n1. **Contact Code Enforcement:**\n   - Address: 505 S. Vulcan Ave., Encinitas, CA 92024\n   - Phone: (760) 633-2685\n   - Email: [Code Enforcement Email] (Contact information can be found on their website)\n\n2. **Filing a Citizen Complaint:**\n   - Use the

### Customer Support Task

Let's put our entire workflow together under the `ask_customer_support` task


In [32]:
@bt.traced()
def ask_customer_support(
    user_inquiry: str,
    city: str,
    search_model: str = "claude-sonnet-4-20250514",
    synthesis_model: str = "gpt-4o",
    synthesis_vendor: MODEL_VENDOR = MODEL_VENDOR.OPENAI,
    search_model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    synthesis_model_kwargs: dict = {},
) -> CustomerSupportResponse:
    """Ask the customer support agent a question."""

    search_results = ac_web_search(user_inquiry, city, model=search_model, model_kwargs=search_model_kwargs)
    customer_support_resp = synthesize_retrieval(
        user_inquiry=user_inquiry,
        retrieval_response=search_results["final_answer"],
        city=city,
        vendor=synthesis_vendor,
        model=synthesis_model,
        model_kwargs=synthesis_model_kwargs,
    )

    return customer_support_resp

Let's test how this works with a few questions


In [33]:
len(test_data)

56

In [34]:
def process_single_request(example):
    """Process a single customer support request."""
    city = "Encinitas"

    with bt.start_span(name="request_support_task") as span:
        span.log(input=example["input"])
        customer_support_resp = ask_customer_support(
            user_inquiry=example["input"],
            city=city,
            search_model="claude-sonnet-4-20250514",
            synthesis_model="gpt-4o-mini",
            synthesis_vendor=MODEL_VENDOR.OPENAI,
            search_model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
            synthesis_model_kwargs={},
        )

        metadata = example.get("metadata", {})
        filtered_metadata = {
            "id": metadata.get("id"),
            "query_source": metadata.get("query_source"),
            "triage_slug": metadata.get("triage_slug"),
        }

        span.log(output=customer_support_resp, metadata=filtered_metadata)
        return customer_support_resp

In [35]:
# Run all test data in parallel
print(f"Processing {len(test_data)} customer support requests in parallel...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M")

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    results = list(executor.map(process_single_request, test_data[:5]))

print(f"Completed processing {len(results)} requests!")

Processing 56 customer support requests in parallel...
Completed processing 5 requests!


Once we have our traces logged we can move them into a "golden [dataset](https://www.braintrust.dev/docs/guides/datasets)" for the purpose of building evals on.

We'll do this in the Braintrust UI and then fetch it to use here ...


In [36]:
ds = bt.init_dataset(project=BT_PROJECT_NAME, name="initial_error_analysis")
rows = list(ds)[:5]

In [37]:
row = rows[0]
row

{'_pagination_key': 'p07528570527535661060',
 '_xact_id': '1000195469122582589',
 'created': '2025-07-18T23:38:41.580Z',
 'dataset_id': '6f23f7c5-79ce-413b-ad2f-5464221e0f14',
 'expected': {'chain_of_thought': 'The inquiry involves potential fire hazards related to sparks from a grill near dry vegetation. This is a fire prevention issue, and the provided context specifically outlines options for reporting fire hazards.',
  'final_answer': 'Please report the potential fire hazard with sparks from a grill to Encinitas Fire Prevention by using the MyEncinitas App, the web portal, or calling (760) 633-2820 for immediate attention.',
  'full_response': "Thank you for reporting this potential fire hazard. Since you've noticed sparks from a grill near dry vegetation, it is essential to address this promptly. Please report this fire hazard using the following options for the quickest response:\n\n1. **MyEncinitas App**: You can download the app from the Google Play or App Store to report direc

## Scoring

So we have our `task` and our `data`, all we need now to start building evals is a way to score how well our task did on each input in our dataset.

Braintrust, comes with a number of helpful scorers out of the box via our Autoevals library. Fun fact, you can use them outside of Braintrust as well. In addition to these scoring functions, we can define our own as demonstrated below where we configure a reference-based scorerer to see how well the task predicts the right contact to route the inquiry too (if available)


In [38]:
def is_correct_triage_contact(
    input: str | None = None,
    expected: dict | None = None,
    output: CustomerSupportResponse | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> int | None:
    if output and metadata and metadata.get("triage_slug"):
        return int(output.route_to == metadata.get("triage_slug"))

    if output and expected:
        return int(expected["route_to"] == output.route_to)

    return None

In [39]:
is_correct_triage_contact(
    input=row.get("input", ""),
    output=CustomerSupportResponse.model_validate(row.get("expected", {})),
    expected=row.get("expected", None),
    metadata=row.get("metadata", None),  # type: ignore
)

1

## Evals

We now have the 3 things required for run an offline eval, a.k.a. an [experiment](https://www.braintrust.dev/docs/guides/experiments)

**Data**: Handcoded and Synthetic customer support requests/inquiries \
**Task**: A customer support task that uses two subtasks to route and attempt to answer the request \
**Scorers**: An initial reference-based scorer that scores the response on whether it routed the request to the right contact


In [40]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")

exp_metadata = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
    "synthesis_vendor": MODEL_VENDOR.OPENAI,
    "search_model_kwargs": {"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    "synthesis_model_kwargs": {"temperature": 0.75},
}
await bt.EvalAsync(
    name=BT_PROJECT_NAME,
    experiment_name=f"initial_error_analysis_{timestamp}",
    data=rows,  # type: ignore
    task=partial(ask_customer_support, **exp_metadata),
    metadata=exp_metadata,
    scores=[is_correct_triage_contact],  # type: ignore
)

Skipping git metadata. This is likely because the repository has not been published to a remote yet. Remote named 'origin' didn't exist
Experiment initial_error_analysis_20250718_1639 is running at https://www.braintrust.dev/app/braintrustdata.com/p/wayde-ai-evals-course-2025/experiments/initial_error_analysis_20250718_1639
wayde-ai-evals-course-2025 [experiment_name=initial_error_analysis_20250718_1639] (data): 5it [00:00, 45789.34it/s]
wayde-ai-evals-course-2025 [experiment_name=initial_error_analysis_20250718_1639] (tasks): 100%|██████████| 5/5 [00:21<00:00,  4.39s/it]



initial_error_analysis_20250718_1639 compared to add_queries_it_20250718_1530:
80.00% 'is_correct_triage_contact' score

1752881973.00s start
1752881991.36s end
18.34s duration
18.32s llm_duration
13131.20tok prompt_tokens
914.60tok completion_tokens
14045.80tok total_tokens
0.05$ estimated_cost
0tok prompt_cached_tokens
0tok prompt_cache_creation_tokens

See results for initial_error_analysis_20250718_1639 at https://www.braintrust.dev/app/braintrustdata.com/p/wayde-ai-evals-course-2025/experiments/initial_error_analysis_20250718_1639


EvalResultWithSummary(summary="...", results=[...])

We can now start iterating on this failure mode.

We can do so in all kinds of ways. We can change the models and or generation kwargs, we can improve the prompts, and so forth.


In [41]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
what_changed = "fixed routing error in dataset"

exp_metadata = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
    "synthesis_vendor": MODEL_VENDOR.OPENAI,
    "search_model_kwargs": {"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    "synthesis_model_kwargs": {"temperature": 0.75},
}
await bt.EvalAsync(
    name=BT_PROJECT_NAME,
    experiment_name=f"initial_error_analysis_{timestamp}",
    data=rows,  # type: ignore
    task=partial(ask_customer_support, **exp_metadata),
    metadata={**exp_metadata, "what_changed": what_changed},
    scores=[is_correct_triage_contact],  # type: ignore
)

Skipping git metadata. This is likely because the repository has not been published to a remote yet. Remote named 'origin' didn't exist
Experiment initial_error_analysis_20250718_1642 is running at https://www.braintrust.dev/app/braintrustdata.com/p/wayde-ai-evals-course-2025/experiments/initial_error_analysis_20250718_1642
wayde-ai-evals-course-2025 [experiment_name=initial_error_analysis_20250718_1642] (data): 5it [00:00, 27813.69it/s]
wayde-ai-evals-course-2025 [experiment_name=initial_error_analysis_20250718_1642] (tasks): 100%|██████████| 5/5 [00:25<00:00,  5.14s/it]



initial_error_analysis_20250718_1642 compared to initial_error_analysis_20250718_1639:
80.00% (-) 'is_correct_triage_contact' score	(0 improvements, 0 regressions)

1752882130.22s start
1752882148.94s end
18.69s (+35.16%) 'duration'                    	(3 improvements, 2 regressions)
18.66s (+34.01%) 'llm_duration'                	(3 improvements, 2 regressions)
15782.60tok (+265140.00%) 'prompt_tokens'               	(1 improvements, 4 regressions)
944.80tok (+3020.00%) 'completion_tokens'           	(2 improvements, 3 regressions)
16727.40tok (+268160.00%) 'total_tokens'                	(1 improvements, 4 regressions)
0.06$ (+00.83%) 'estimated_cost'              	(2 improvements, 3 regressions)
0tok (-) 'prompt_cached_tokens'        	(0 improvements, 0 regressions)
0tok (-) 'prompt_cache_creation_tokens'	(0 improvements, 0 regressions)

See results for initial_error_analysis_20250718_1642 at https://www.braintrust.dev/app/braintrustdata.com/p/wayde-ai-evals-course-2025/experiments/

EvalResultWithSummary(summary="...", results=[...])

## Fin
