# A customer support bot - Part 2

An introduction to all things Braintrust via a speed-run through building evals for a customer support bot


This notebook picks up where we left off in the "01_generate_data" notebook.

As a reminder, evals require three things:

1. Data
2. One or more task functions
3. One or more scoring functions that tell us how well each row in our dataset did on that task

With our data in hand, we will define the tasks necessary for the customer support bot to function, a scorer to see how well it routed the requests for examples that include the target route, and finally run some evals.
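
To make those three ingredients concrete, here is a minimal plain-Python sketch of an eval loop. It does not use Braintrust at all; `route_request`, `exact_route_match`, and the two sample rows are hypothetical stand-ins for the real task, scorer, and data built in this notebook.

```python
from statistics import mean

# 1. Data: each row has an input and the expected route (hypothetical examples).
data = [
    {"input": "There is a deep pothole on Manchester Ave.", "expected": "public_works"},
    {"input": "My neighbors play loud music at 2 a.m.", "expected": "code_enforcement"},
]


# 2. Task: a toy keyword router standing in for the real LLM pipeline.
def route_request(text: str) -> str:
    return "public_works" if "pothole" in text.lower() else "general"


# 3. Scorer: 1.0 when the predicted route matches the expected one, else 0.0.
def exact_route_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


scores = [exact_route_match(route_request(row["input"]), row["expected"]) for row in data]
print(f"route accuracy: {mean(scores):.2f}")
```

An eval framework adds bookkeeping around this loop (tracing, versioning, comparison across runs), but the data/task/scorer shape stays the same.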

Let's start by importing the required libraries.


In [1]:
import enum
import json
import os
import random

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from functools import partial
from textwrap import dedent
from typing import Annotated, Literal

import anthropic as ac
import braintrust as bt
import openai as oai
import pandas as pd
import requests

from braintrust import wrap_anthropic, wrap_openai
from dotenv import load_dotenv
from pydantic import BaseModel, BeforeValidator, Field
from thefuzz import process as fuzzy_process

load_dotenv(override=True)



True

## Setup

Once you're signed up on [Braintrust](https://www.braintrust.dev/) and have set the appropriate API keys in your `.env` file, you're ready to go. For this particular demo, you'll need the following API keys:

- `BRAINTRUST_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`

With that in place, we can create our [Braintrust project](https://www.braintrust.dev/docs/guides/projects) using `braintrust.projects.create()`, which creates the project or returns it if it already exists.

We also define the usual Anthropic and OpenAI clients, along with a "wrapped" version of each. The wrapped clients automatically capture usage data (e.g., prompt tokens) in Braintrust when used within the context of a logger or experiment.


In [2]:
BT_PROJECT_NAME = "braintrust-intro"
MAX_WORKERS = 5

bt_project = bt.projects.create(name=BT_PROJECT_NAME)
bt_logger = bt.init_logger(project=BT_PROJECT_NAME)

ac_client = ac.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
oai_client = oai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

wrapped_ac_client = wrap_anthropic(ac_client)
wrapped_oai_client = wrap_openai(oai_client)


**Key takeaways**:

1. We use `braintrust.projects.create()` to create or fetch a Braintrust project.
2. We use `braintrust.init_logger` to initialize our logger.
3. We define both standard OpenAI and Anthropic clients, as well as wrapped versions of these clients that support automatic logging of their usage data.
4. Any time we execute a method decorated with `braintrust.traced` or run one of the wrapped clients, traces will be created in Braintrust.
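
As a rough mental model for what a tracing decorator does, here is a plain-Python sketch. It is not Braintrust's implementation: `traced_sketch` and `TRACE_LOG` are hypothetical names, and the real decorator does considerably more than append to a list.

```python
import functools

TRACE_LOG: list[dict] = []  # hypothetical in-memory stand-in for a tracing backend


def traced_sketch(fn):
    """Record each call's name, inputs, and output, then return the output unchanged."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        output = fn(*args, **kwargs)
        TRACE_LOG.append({"name": fn.__name__, "input": {"args": args, "kwargs": kwargs}, "output": output})
        return output

    return wrapper


@traced_sketch
def shout(text: str) -> str:
    return text.upper()


shout("hello")
```

The decorated function behaves exactly as before; the only difference is the record left behind for each call.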


## Tasks

**Objective**: Define functions that will take a customer support inquiry as input and then return a structured response that includes: a reasonable response/answer, the department the request should be routed to, and any information required from the user to complete the ask.

Let's use our "good" synthetic and hand-coded queries to build a real-world customer support bot with two main components:

1. A web search retrieval task to gather context for answering and routing requests.
2. A retrieval synthesis task that takes the output from the retrievals to determine where to route support, what information is needed to resolve the request, and to generate some form of response for the user.

To achieve this, we'll define some tasks and [log](https://www.braintrust.dev/docs/guides/logs) them to Braintrust.


In [3]:
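# Use a context manager so the file handle is closed after loading.
with open("./data/dept_support_areas.json") as f:
    dept_support_areas = json.load(f)

# De-duplicate the route slugs.
triage_targets = list({route["slug"] for route in dept_support_areas})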

Let's use our old friend, the "[BTQL](https://www.braintrust.dev/docs/reference/btql) Sandbox", to build a query that returns only the queries marked as good during human review.


In [4]:
def get_test_data():
    rows = []
    cursor = None
    while True:
        response = requests.post(
            "https://staging-api.braintrust.dev/btql",
            json={
                "query": dedent("""
                        select: output, metadata
                        from: experiment('2df12d7b-7c66-49ff-b7c7-87e95c027163')
                        filter: scores."is_good" = 1
                """)
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        rows.extend({"input": row["output"], "metadata": row["metadata"]} for row in data)

        # Keep paging until the API stops returning a cursor (or a page comes back empty).
        cursor = response_json.get("cursor")
        if not cursor or not data:
            break

    return rows


test_data = get_test_data()

print(len(test_data))
test_data[:2]

55


[{'input': 'My water bill doubled this month—can you explain the charges?',
  'metadata': {'id': '20250724_1103_101',
   'model': 'gpt-4o-mini',
   'model_kwargs': {},
   'query_source': 'handcoded',
   'triage_slug': 'san_dieguito_water_district'}},
 {'input': 'A homeless encampment is forming under the 101 overpass near Cardiff.',
  'metadata': {'human_review': '',
   'id': '20250724_1103_090',
   'model': 'gpt-4o-mini',
   'model_kwargs': {},
   'query_source': 'handcoded',
   'triage_slug': 'code_enforcement'}}]

**Key takeaways**:

1. Using BTQL we can query our datasets, experiments, and logs for whatever particular pieces of data we are interested in.
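
The BTQL response carries a `cursor` alongside `data`, which is the usual shape for cursor-based paging. The sketch below shows the general loop in isolation so it can be run offline; `iter_pages`, `fetch_page`, and the fake two-page response are hypothetical and do not call the Braintrust API.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield rows from every page until the response carries no cursor."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        rows = page.get("data", [])
        yield from rows
        cursor = page.get("cursor")
        if not cursor or not rows:
            break


# Hypothetical two-page response used only to exercise the loop offline.
_fake_pages = {
    None: {"data": [{"id": 1}, {"id": 2}], "cursor": "page-2"},
    "page-2": {"data": [{"id": 3}], "cursor": None},
}

all_rows = list(iter_pages(lambda cursor: _fake_pages[cursor]))
```

Passing the fetch step in as a function keeps the paging logic testable without network access.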


### Retrieval

The first sub-task we need is a retrieval task to gather the appropriate context for routing and responding to the customer support request.


#### Prompt

We will create a versioned prompt in the same way we did in the first notebook.


In [5]:
WEB_SEARCH_SYSTEM_PROMPT = dedent("""\
    Your name is Sunny and you are in charge of answering question about the city of {{{city}}} and routing requests to the correct departments for triage.
    You are given access to a web search retrieval tool to use in an attempt to answer satisfy the user's inquiry.
    
    ## Instructions
    - Always include any relevant links in your response! Never respond with a generic link placeholder like '(Note: The specific registration link would be on the city's website)'     
    - If you cannot find the information you need or fulfill the request, make sure to tell the user that in the final answer
""")

In [6]:
bt_web_search_prompt = bt_project.prompts.create(
    name="WebSearchPrompt",
    slug="web-search-prompt",
    description="Prompt for retrieving context from the web",
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": WEB_SEARCH_SYSTEM_PROMPT},
        {"role": "user", "content": "{{{query}}}"},
    ],
    if_exists="replace",
)

bt_project.publish()

{'status': 'success'}

In [7]:
web_search_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="web-search-prompt")
web_search_prompt.build(city="Encinitas", query="What is the weather in Encinitas?")

{'model': 'claude-sonnet-4-20250514',
 'span_info': {'metadata': {'prompt': {'variables': {'city': 'Encinitas',
     'query': 'What is the weather in Encinitas?'},
    'id': 'c9939b31-6749-47bd-bcf4-40f085a2405a',
    'project_id': '2b7e69b4-7ce8-4da6-b956-72a38eaab1ee',
    'version': '1000195501817803678'}}},
 'messages': [{'content': "Your name is Sunny and you are in charge of answering question about the city of Encinitas and routing requests to the correct departments for triage.\nYou are given access to a web search retrieval tool to use in an attempt to answer satisfy the user's inquiry.\n\n## Instructions\n- Always include any relevant links in your response! Never respond with a generic link placeholder like '(Note: The specific registration link would be on the city's website)'     \n- If you cannot find the information you need or fulfill the request, make sure to tell the user that in the final answer\n",
   'role': 'system'},
  {'content': 'What is the weather in Encinita

#### Task


We'll use Anthropic's built-in [web search tool](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/web-search-tool) for our initial build.

We'll configure it to only use this particular city's domain and also assume the user's location to be within the city.


In [8]:
# limit the number of searches per request
N_SEARCHES = 1

# To limit the scope of the search to specific domains
RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS = [
    "encinitasca.gov",
    # "anc.apm.activecommunities.com/encinitasparksandrec",  # P&R activity registration
    # "/issuu.com/encinitasca.gov/",  # P&R activity guides
]

# To localize search results based on user's location
USER_LOCATION = {
    "type": "approximate",
    "city": "Encinitas",
    "region": "California",
    "country": "US",
    "timezone": "America/Los_Angeles",
}
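
The allow-list is applied by the search tool itself. If you also want to double-check returned URLs on the client side (for example, inside a scorer), a small helper along these lines would work. `is_allowed` is a hypothetical function, not part of the Anthropic SDK, and `ALLOWED_DOMAINS` simply mirrors the list above.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["encinitasca.gov"]  # mirrors the allow-list above


def is_allowed(url: str, allowed_domains: list[str] = ALLOWED_DOMAINS) -> bool:
    """Return True when the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed_domains)


print(is_allowed("https://www.encinitasca.gov/community/pay-water-bill"))
```

Matching on the parsed hostname, rather than on a substring of the URL, avoids accepting look-alike hosts or paths that merely contain the domain name.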

In [9]:
@bt.traced()
def ac_web_search(query: str, city: str, model="claude-sonnet-4-20250514", model_kwargs={}):
    """Search the web for the given query."""

    prompt = web_search_prompt.build(city=city, query=query)

    rsp = wrapped_ac_client.messages.create(
        model=model,
        system=prompt["messages"].pop(0)["content"],
        messages=prompt["messages"],
        tools=[
            {
                "type": "web_search_20250305",
                "name": "web_search",
                "max_uses": N_SEARCHES,
                "allowed_domains": RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS,
                "user_location": USER_LOCATION,
            }  # type: ignore
        ],
        **model_kwargs,
    )
    return rsp


**Key takeaways**:

1. Because `ac_web_search()` is decorated with `braintrust.traced`, the inputs and outputs of this function will automatically be captured in Braintrust when we call it.
2. Because we are using `wrapped_ac_client`, calls to our LLM will be included in the trace as well (including their usage metrics).


In [10]:
_example = test_data[0]
# _example

In [11]:
# with bt_logger.start_span(name="web_search") as trace:
rsp = ac_web_search(
    _example["input"],
    "Encinitas",
    model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
)

# rsp
# rsp.model_dump()

#### Parse utility


Let's create a utility function to parse the web search tool's response into something more digestible. This will allow us to control exactly what bits we want to pass in as context to our synthesis task.


In [12]:
def parse_ac_web_search_response(message):
    """
    Parse a web search response message to extract searches, results, citations, and final answers.

    Args:
        message: The message object from Anthropic's web search tool

    Returns:
        dict: Contains 'searches', 'results', 'citations', and 'final_answer'
    """

    searches = []
    results = []
    citations = []
    final_answer_parts = []

    # Iterate through all content blocks
    for block in message.content:
        # Extract search queries
        if hasattr(block, "type") and block.type == "server_tool_use":
            if hasattr(block, "name") and block.name == "web_search":
                search_query = block.input.get("query", "No query found")
                searches.append({"id": block.id, "query": search_query, "input": block.input})

        # Extract search results
        elif hasattr(block, "type") and block.type == "web_search_tool_result":
            for result_block in block.content:
                if hasattr(result_block, "type") and result_block.type == "web_search_result":
                    results.append(
                        {
                            "title": result_block.title,
                            "url": result_block.url,
                            "encrypted_content": result_block.encrypted_content,
                            "page_age": getattr(result_block, "page_age", None),
                        }
                    )

        # Extract final answer text blocks (with or without citations)
        elif hasattr(block, "type") and block.type == "text":
            text_content = block.text
            block_citations = getattr(block, "citations", None)

            # Add to final answer
            final_answer_parts.append(text_content)

            # If this text block has citations, extract them
            if block_citations:
                for citation in block_citations:
                    citations.append(
                        {"cited_text": citation.cited_text, "title": citation.title, "url": citation.url, "type": citation.type}
                    )

    return {"searches": searches, "results": results, "citations": citations, "final_answer": "".join(final_answer_parts)}


In [13]:
# Test the parsing function with the existing response
parsed = parse_ac_web_search_response(rsp)

print("=== SEARCHES ===")
for i, search in enumerate(parsed["searches"], 1):
    print(f"{i}. Query: {search['query']}")
    print(f"   ID: {search['id']}")
    print()

print("=== RESULTS ===")
for i, result in enumerate(parsed["results"], 1):
    print(f"{i}. {result['title']}")
    print(f"   URL: {result['url']}")
    print(f"   Page Age: {result['page_age']}")
    print()

print("=== CITATIONS ===")
for i, citation in enumerate(parsed["citations"], 1):
    print(f"{i}. Cited Text: {citation['cited_text'][:100]}...")
    print(f"   Title: {citation['title']}")
    print(f"   URL: {citation['url']}")
    print()

print("=== FINAL ANSWER ===")
print(parsed["final_answer"])  # [:500] + "..." if len(parsed["final_answer"]) > 500 else parsed["final_answer"])


=== SEARCHES ===
1. Query: Encinitas water bill rates charges
   ID: srvtoolu_011jEHLifAJ1i1i1HbLsNich

=== RESULTS ===
1. Pay Water Bills | City of Encinitas
   URL: https://www.encinitasca.gov/community/pay-water-bill
   Page Age: None

2. Customer Care | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/utilities/san-dieguito-water-district/i-want-to
   Page Age: None

3. Wastewater Engineering & Sewer Rates | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/utilities/utilities-engineering-planning/wasterwater-engineering
   Page Age: None

4. Utilities | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/utilities
   Page Age: None

5. San Dieguito Water District Services & Requests | City of Encinitas
   URL: https://www.encinitasca.gov/government/departments/utilities/engineering-planning/services-requests
   Page Age: None

6. San Dieguito Water District | City of Encinitas
   URL: https://www.encin

#### Task (v2)

After review, we only need to pass in the `final_answer` as context to formulate a proper response. We'll update our task function to return just the parsed citations and final answer:


In [14]:
@bt.traced()
def ac_web_search(query: str, city: str, model="claude-sonnet-4-20250514", model_kwargs={}):
    """Search the web for the given query."""

    prompt = web_search_prompt.build(city=city, query=query)

    rsp = wrapped_ac_client.messages.create(
        model=model,
        system=prompt["messages"].pop(0)["content"],
        messages=prompt["messages"],
        **model_kwargs,
        tools=[
            {
                "type": "web_search_20250305",
                "name": "web_search",
                "max_uses": N_SEARCHES,
                "allowed_domains": RETRIEVAL_WEB_SEARCH_ALLOWED_DOMAINS,
                "user_location": USER_LOCATION,
            }  # type: ignore
        ],
    )

    # return rsp
    parsed = parse_ac_web_search_response(rsp)
    return {"citations": parsed["citations"], "final_answer": parsed["final_answer"]}

In [15]:
rsp = ac_web_search(
    _example["input"],
    "Encinitas",
    model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
)

rsp
# rsp.model_dump()

{'citations': [{'cited_text': 'Billing Cycle: The District bills potable (drinking) water customers on a bi-monthly basis (every two months) and are due and payable upon presentatio...',
   'title': 'Customer Care | City of Encinitas',
   'url': 'https://www.encinitasca.gov/government/departments/utilities/san-dieguito-water-district/i-want-to',
   'type': 'web_search_result_location'},
  {'cited_text': 'A 12% adjustment to District revenues will be reflected on all bills effective 7/1/25. ',
   'title': 'Customer Care | City of Encinitas',
   'url': 'https://www.encinitasca.gov/government/departments/utilities/san-dieguito-water-district/i-want-to',
   'type': 'web_search_result_location'},
  {'cited_text': 'Disputing Your Bill: If you think your bill is incorrect or to dispute any services or charges on a bill, you must conta',
   'title': 'Customer Care | City of Encinitas',
   'url': 'https://www.encinitasca.gov/government/departments/utilities/san-dieguito-water-district/i-want-to

### Retrieval synthesis

The second sub-task we need is a synthesis task that takes the retrieved context and attempts to provide an appropriate response, route the request to the correct contact, and finally identify any other information that would be needed to properly respond.


#### Prompt

We will create a versioned prompt as we did above.


In [16]:
for dept in dept_support_areas[:2]:
    print(dept["name"])
    print("- " + "\n- ".join(dept["support_areas"]))
    print("-> " + "\n-> ".join(dept["examples"]))
    print()

Public Works
- Deceased animals
- Graffiti on public property
- Water leak (non-emergency)
- Streetlight outages
- Pothole repairs
-> I saw a dead raccoon on Melba Road today—can someone remove it?
-> There’s new graffiti on the sidewalk wall along South Coast Highway.
-> A small water leak is pooling near the fire hydrant on Encinitas Blvd.
-> The streetlight in front of 450 Quail Gardens Dr has been out for 3 nights.
-> There’s a deep pothole on Manchester Ave that's damaging cars.

Code Enforcement
- Abandoned vehicles
- Illegal signs or banners
- Noise complaints
- Trash or junk on private property
- Homeless overnight camping
-> There’s a broken-down RV parked on Citrus Ave for over two weeks.
-> Someone put up a political banner on private property with no permit.
-> Neighbors at 2 a.m. are playing loud music by the pool every weekend.
-> The vacant lot next to Sunset View has piles of old furniture.
-> A homeless encampment is forming under the 101 overpass near Cardiff.



In [17]:
def build_triage_route_details() -> str:
    output = "=" * 10 + "\n"
    for dept in dept_support_areas:
        output += f"Triage Target: {dept['slug']}\n"
        output += f"Triage Name: {dept['name']}\n"
        output += f"Support Areas: {','.join(dept['support_areas'])}"
        output += "\n" + "=" * 10 + "\n"

    return output.strip()


In [18]:
print(build_triage_route_details()[:550])

Triage Target: public_works
Triage Name: Public Works
Support Areas: Deceased animals,Graffiti on public property,Water leak (non-emergency),Streetlight outages,Pothole repairs
Triage Target: code_enforcement
Triage Name: Code Enforcement
Support Areas: Abandoned vehicles,Illegal signs or banners,Noise complaints,Trash or junk on private property,Homeless overnight camping
Triage Target: parks_recreation
Triage Name: Parks & Recreation
Support Areas: Playground issues,Gate issues in parks,Graffiti in park sites,


In [19]:
SYNTHESIZE_RETRIEVAL_SYSTEM_PROMPT = f"""\
Your name is Sunny and you are an expert in providing responses to customer support responses for the city of {{{{city}}}} based on the original user inquiry, 
triage route descriptions, and the context provided.

## Instructions
- Provide a final answer to the user inquiry ONLY if it is answerable from the "Context" provided below.
- Using the user's inquiry and the "Context" provided below, choose the most specific triage target to route the user to based on the "Triage Options" defined below.
- If provided, make sure you include any relevant links, phone numbers, emails, or other information from the "Context" and how the user can use them to get the information they need.
- Use markdown formatting for the full response.
- Always direct complaints about personnel to "human_resources"

## Triage Options    
{build_triage_route_details()}

## Context
{{{{context}}}}"""
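
Because this prompt is an f-string, the quadruple braces are an escape: `{{{{city}}}}` is emitted as the literal placeholder `{{city}}`, to be filled in later when the prompt is built, while `{build_triage_route_details()}` is evaluated immediately. The sketch below shows the same behavior in isolation; `triage_options` is a hypothetical stand-in for the real route details.

```python
# Hypothetical stand-in for build_triage_route_details().
triage_options = "Triage Target: public_works"

template = f"""\
City: {{{{city}}}}
Options: {triage_options}
Context: {{{{context}}}}"""

print(template)
```

The result is a template with the triage options baked in and two placeholders left for render time.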

In [20]:
bt_retrieval_synthesis_prompt = bt_project.prompts.create(
    name="WebSearchRetrievalSynthesis",
    slug="web-search-retrieval-synthesis",
    description="Prompt for synthesizing the retrieval results into a final answer",
    model="claude-sonnet-4-20250514",
    messages=[
        {"role": "system", "content": SYNTHESIZE_RETRIEVAL_SYSTEM_PROMPT},
        {"role": "user", "content": "User Inquiry: {{{user_inquiry}}}"},
    ],
    if_exists="replace",
)

bt_project.publish()

{'status': 'success'}

In [21]:
retrieval_synth_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="web-search-retrieval-synthesis")

In [22]:
_p = retrieval_synth_prompt.build(city="Encinitas", user_inquiry=_example["input"], context="test")
print(_p["messages"][0]["content"])

Your name is Sunny and you are an expert in providing responses to customer support responses for the city of Encinitas based on the original user inquiry, 
triage route descriptions, and the context provided.

## Instructions
- Provide a final answer to the user inquiry ONLY if it is answerable from the "Context" provided below.
- Using the user's inquiry and the "Context" provided below, choose the most specific triage target to route the user to based on the "Triage Options" defined below.
- If provided, make sure you include any relevant links, phone numbers, emails, or other information from the "Context" and how the user can use them to get the information they need.
- Use markdown formatting for the full response.
- Always direct complaints about personnel to "human_resources"

## Triage Options    
Triage Target: public_works
Triage Name: Public Works
Support Areas: Deceased animals,Graffiti on public property,Water leak (non-emergency),Streetlight outages,Pothole repairs
Triag

#### Structured output definition

We structure the output as a Pydantic class so we can better get at and evaluate the different aspects of our AI pipeline.


In [23]:
class CustomerSupportResponse(BaseModel):
    """The response from the customer support agent."""

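    def convert_str_to_triage_target(v: str) -> str:  # type: ignore
        """Map a raw string to a known triage target, fuzzy-matching near misses."""
        if v in triage_targets:
            return v
        try:
            # Compare case-insensitively, but return the original slug so the Literal check passes.
            by_upper = {s.upper(): s for s in triage_targets}
            match, score = fuzzy_process.extractOne(v.upper(), list(by_upper))  # type: ignore
            # NOTE: "General" must itself be one of `triage_targets` to pass validation.
            return by_upper[match] if score >= 80 else "General"
        except (TypeError, ValueError):
            return "General"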

    chain_of_thought: str = Field(
        ...,
        description="Explain your decision about whether this question is answerable and why you chose to route it to the triage contact you chose",
    )
    is_answerable_from_context: bool = Field(
        ..., description="Whether the given response can answer the question from the context provided"
    )
    requires_follow_up: bool = Field(
        ...,
        description="Whether the inquiry requires follow-up from the triage contact",
    )
    requires_location: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide their location to be resolved by the triage contact",
    )
    requires_photo_or_video: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide a photo or video to be resolved by the triage contact",
    )
    requires_user_contact_info: bool = Field(
        ...,
        description="Whether the inquiry requires the user to provide their contact information to be resolved by the triage contact",
    )
    full_response: str = Field(
        ...,
        description="The detailed answer to the question based on the context provided",
    )
    final_answer: str = Field(
        ...,
        description="A succinct and concise answer to the question based on the context provided",
    )
    route_to: Annotated[Literal[*triage_targets], BeforeValidator(convert_str_to_triage_target)]  # type: ignore
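
The validator above leans on `thefuzz` to rescue near-miss route names. As an illustration of the same idea using only the standard library, here is a sketch built on `difflib` instead (a different matching algorithm, so scores are not comparable to `thefuzz`). `KNOWN_TARGETS`, `normalize_target`, the `"general"` fallback, and the `0.8` cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical subset of slugs; the notebook derives the real list from dept_support_areas.json.
KNOWN_TARGETS = ["public_works", "code_enforcement", "parks_recreation"]


def normalize_target(value: str, fallback: str = "general", cutoff: float = 0.8) -> str:
    """Return the closest known slug, or the fallback when nothing is similar enough."""
    by_lower = {target.lower(): target for target in KNOWN_TARGETS}
    candidate = value.strip().lower().replace(" ", "_")
    matches = difflib.get_close_matches(candidate, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else fallback


print(normalize_target("Public Works"))
print(normalize_target("code_enforcment"))
```

Whichever matcher is used, the important detail is to return the canonical slug rather than the normalized form that was compared, so the value still passes the `Literal` check.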


#### Task

We set things up in this task to allow teams to use either OpenAI or Anthropic models.


In [24]:
class MODEL_VENDOR(str, enum.Enum):
    ANTHROPIC = "anthropic"
    OPENAI = "openai"

In [25]:
@bt.traced()
def synthesize_retrieval(
    user_inquiry: str,
    retrieval_response: str,
    city: str,
    vendor: MODEL_VENDOR = MODEL_VENDOR.ANTHROPIC,
    model="claude-sonnet-4-20250514",
    model_kwargs={},
) -> CustomerSupportResponse:
    """Synthesize the retrieval results into a final answer."""

    prompt = retrieval_synth_prompt.build(city=city, user_inquiry=user_inquiry, context=retrieval_response)

    rsp = None
    if vendor == MODEL_VENDOR.ANTHROPIC:
        tools = [
            {
                "name": "customer_support_response",
                "description": "Build CustomerSupportResponse object",
                "input_schema": CustomerSupportResponse.model_json_schema(),
            }
        ]
        rsp = wrapped_ac_client.messages.create(
            model=model,
            system=prompt["messages"].pop(0)["content"],  # type: ignore
            messages=prompt["messages"],
            tools=tools,
            tool_choice={"type": "tool", "name": "customer_support_response"},
            **model_kwargs,
        )
        return CustomerSupportResponse(**rsp.content[0].input)

    elif vendor == MODEL_VENDOR.OPENAI:
        result = wrapped_oai_client.beta.chat.completions.parse(
            model=model,  # honor the caller's model instead of a hard-coded one
            messages=prompt["messages"],
            response_format=CustomerSupportResponse,
            **model_kwargs,
        )
        return result.choices[0].message.parsed

    else:
        raise ValueError(f"Invalid vendor: {vendor}")
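
The Anthropic branch reads `rsp.content[0]`, which assumes the tool call is the first content block in the response. A slightly more defensive sketch looks the block up by type and name. `first_tool_input` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for SDK content blocks so the example runs offline.

```python
from types import SimpleNamespace


def first_tool_input(content_blocks, tool_name: str) -> dict:
    """Return the input of the first tool_use block with the given name."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and getattr(block, "name", None) == tool_name:
            return block.input
    raise ValueError(f"No tool_use block named {tool_name!r} in response")


# Hypothetical stand-ins for SDK content blocks, used only to exercise the helper.
fake_content = [
    SimpleNamespace(type="text", text="Routing the request..."),
    SimpleNamespace(type="tool_use", name="customer_support_response", input={"route_to": "public_works"}),
]

tool_input = first_tool_input(fake_content, "customer_support_response")
```

Raising a clear error when no matching block exists makes a malformed response easier to diagnose than an `AttributeError` on the wrong block type.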


In [26]:
parsed["final_answer"]

"I understand your concern about your doubled water bill! Let me help explain some potential causes based on Encinitas water billing information.\n\nFirst, it's important to note that the District bills potable (drinking) water customers on a bi-monthly basis (every two months) and are due and payable upon presentation. This means you might be seeing charges for two months of usage at once.\n\nHere are some common reasons why your water bill might have doubled:\n\n**Billing Cycle:** Since Encinitas uses bi-monthly billing, if you're comparing to a previous monthly statement or there was a timing difference, this could explain the increase.\n\n**Leak Detection:** The search results mention leak detection services, which suggests this is a common issue. Undetected leaks can significantly increase water usage and costs.\n\n**Seasonal Usage:** Increased outdoor watering, especially during warmer months, can substantially raise your bill.\n\n**Rate Changes:** There may have been recent rate

Let's try it with Anthropic


In [27]:
rsp = synthesize_retrieval(
    user_inquiry=_example["input"],
    retrieval_response=parsed["final_answer"],
    city="Encinitas",
    model_kwargs={"max_tokens": 1024},
)
# rsp

In [28]:
rsp.model_dump()

{'chain_of_thought': "This question is about water billing, which falls under the San Dieguito Water District's purview based on the triage options. The context provided contains comprehensive information about water billing in Encinitas, including common reasons for bill increases, billing cycles, and contact information. I can provide a detailed answer from the context, but the user may need follow-up assistance to review their specific bill details and usage history with the water district directly.",
 'is_answerable_from_context': True,
 'requires_follow_up': True,
 'requires_location': False,
 'requires_photo_or_video': False,
 'requires_user_contact_info': False,
 'full_response': "I understand your concern about your doubled water bill! Let me help explain some potential causes based on Encinitas water billing information.\n\n**Important Note About Billing Cycle:**\nThe San Dieguito Water District bills potable (drinking) water customers on a **bi-monthly basis (every two months

Let's try it with OpenAI


In [29]:
rsp = synthesize_retrieval(
    user_inquiry=_example["input"],
    retrieval_response=parsed["final_answer"],
    city="Encinitas",
    vendor=MODEL_VENDOR.OPENAI,
    model="gpt-4o-mini",
    model_kwargs={},
)
# rsp

In [30]:
rsp.model_dump()

{'chain_of_thought': "The user's inquiry about a doubled water bill can be addressed by the context provided, which explains potential reasons for a water bill increase. The inquiry specifically falls under the support area of water billing questions, so it should be routed to the San Dieguito Water District.",
 'is_answerable_from_context': True,
 'requires_follow_up': True,
 'requires_location': False,
 'requires_photo_or_video': False,
 'requires_user_contact_info': False,
 'full_response': 'I understand your concern about your doubled water bill! Based on the Encinitas water billing information, there are several reasons why your bill might have increased:\n\n1. **Bi-monthly Billing:** The District bills potable water customers every two months, so you might be seeing charges for two months of usage at once.\n2. **Leak Detection:** Undetected leaks can significantly increase water usage and costs.\n3. **Seasonal Usage:** Increased outdoor watering during warmer months can raise you

### Customer support task

Let's put our entire workflow together under the `ask_customer_support` task.


In [31]:
@bt.traced()
def ask_customer_support(
    user_inquiry: str,
    city: str,
    search_model: str = "claude-sonnet-4-20250514",
    synthesis_model: str = "gpt-4o",
    synthesis_vendor: MODEL_VENDOR = MODEL_VENDOR.OPENAI,
    search_model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    synthesis_model_kwargs: dict = {},
) -> CustomerSupportResponse:
    """Ask the customer support agent a question."""

    search_results = ac_web_search(user_inquiry, city, model=search_model, model_kwargs=search_model_kwargs)
    customer_support_resp = synthesize_retrieval(
        user_inquiry=user_inquiry,
        retrieval_response=search_results["final_answer"],
        city=city,
        vendor=synthesis_vendor,
        model=synthesis_model,
        model_kwargs=synthesis_model_kwargs,
    )

    return customer_support_resp

Let's test how this works with a few questions


In [32]:
len(test_data)

55

In [33]:
def process_single_request(example):
    """Process a single customer support request."""
    city = "Encinitas"

    with bt.start_span(name="request_support_task") as span:
        span.log(input=example["input"])
        customer_support_resp = ask_customer_support(
            user_inquiry=example["input"],
            city=city,
            search_model="claude-sonnet-4-20250514",
            synthesis_model="gpt-4o-mini",
            synthesis_vendor=MODEL_VENDOR.OPENAI,
            search_model_kwargs={"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
            synthesis_model_kwargs={},
        )

        metadata = example.get("metadata", {})
        filtered_metadata = {
            "id": metadata.get("id"),
            "query_source": metadata.get("query_source"),
            "triage_slug": metadata.get("triage_slug"),
        }

        span.log(output=customer_support_resp, metadata=filtered_metadata)
        return customer_support_resp

**Key takeaways**:

1. We use `braintrust.start_span()` to construct the trace manually; this works the same way for logs as it does for experiments.
2. Manually constructing the trace gives us full control over what is logged and where.


In [34]:
# Run all test data in parallel
print(f"Processing {len(test_data)} customer support requests in parallel...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M")

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    results = list(executor.map(process_single_request, test_data))

print(f"Completed processing {len(results)} requests!")

Processing 55 customer support requests in parallel...
Completed processing 55 requests!
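
One thing to keep in mind with `executor.map()`: results come back in input order, but an exception raised by any single call is re-raised when its result is retrieved, which would abort the whole `list(...)` call. Below is a minimal sketch of one way to guard against that. It uses a stand-in task rather than the real `process_single_request`; the names `fake_request` and `safe_call` are hypothetical and not part of this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_request(example: dict) -> str:
    """Hypothetical stand-in for process_single_request; fails on empty input."""
    if not example["input"]:
        raise ValueError("empty inquiry")
    return example["input"].upper()


def safe_call(example: dict) -> dict:
    """Run the task and capture any exception instead of letting it propagate."""
    try:
        return {"ok": True, "output": fake_request(example)}
    except Exception as exc:  # record the failure so the batch keeps going
        return {"ok": False, "error": repr(exc)}


examples = [{"input": "pothole on main st"}, {"input": ""}, {"input": "noise complaint"}]

with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in input order, regardless of completion order
    safe_results = list(executor.map(safe_call, examples))

# Indices of the requests that failed, so they can be retried or inspected
failed = [i for i, r in enumerate(safe_results) if not r["ok"]]
```

Because order is preserved, `safe_results[i]` always corresponds to `examples[i]`, which makes it straightforward to retry only the failed inputs.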


## Curate a dataset of our traces for experimentation (offline evals)


Once we have our traces logged, we can move them into a "golden [dataset](https://www.braintrust.dev/docs/guides/datasets)" to build evals on. Creating a dataset from experiments or logs is as easy as selecting the rows you want to include and clicking "Add to dataset".

We'll do this in the Braintrust UI and then use this dataset in our initial experiments.

<img src="./data/curate-dataset.png" width="800"/>


**Key takeaways**:

1. Curate datasets from experiments and logs to run your evals on.
2. In your project configuration, you can define tags to mark traces that need review or have already been reviewed.


In [35]:
ds = bt.init_dataset(project=BT_PROJECT_NAME, name="initial_error_analysis")
# rows = list(ds)[:5]
rows = list(ds)

Retrying request after error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Sleeping for 0.5 seconds


**Key takeaways**:

1. Use `braintrust.init_dataset()` to fetch your data from Braintrust.
2. You can use the result of this method as the `data` for your evals (see below).
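
Before running evals on a fetched dataset, it can help to sanity-check what it contains. The sketch below does this on a few hypothetical rows shaped like the records in this notebook (`input`, `expected`, and `metadata` keys); the sample values, including the `query_source` labels and the route slug, are made up for illustration.

```python
from collections import Counter

# Hypothetical rows; in the notebook these would come from list(ds)
sample_rows = [
    {
        "input": "My neighbors play loud music at 2 a.m.",
        "expected": {"route_to": "code-enforcement"},
        "metadata": {"id": "a1", "query_source": "synthetic", "triage_slug": "code-enforcement"},
    },
    {
        "input": "When is street sweeping on my block?",
        "expected": {"route_to": "public-works"},
        "metadata": {"id": "a2", "query_source": "handcoded", "triage_slug": None},
    },
    {"input": "Hello?", "expected": None, "metadata": None},
]

# Count rows per query source, tolerating missing or empty metadata
by_source = Counter(
    (row.get("metadata") or {}).get("query_source") or "unknown" for row in sample_rows
)

# Count rows that carry a target route in their metadata
with_target = sum(
    1 for row in sample_rows if (row.get("metadata") or {}).get("triage_slug")
)
```

A quick summary like this shows how many rows carry a target route at all, which matters for any scorer that depends on one.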


In [36]:
print(len(rows))
row = rows[0]
row

55


{'_pagination_key': 'p07530714983148879926',
 '_xact_id': '1000195501844378445',
 'created': '2025-07-24T18:20:16.715Z',
 'dataset_id': '234585eb-8164-4909-973a-77c0c2f4491f',
 'expected': {'chain_of_thought': "The user's inquiry is about a noise issue caused by neighbors playing loud music at late hours. This type of complaint falls under the domain of Code Enforcement in the City of Encinitas, which addresses noise complaints. The context discusses how to file a noise complaint and provides links to do so online or contact numbers for immediate assistance, making it directly answerable from the context.",
  'final_answer': "For the noise issue with your neighbors playing loud music at 2 a.m., you can file a complaint through the City of Encinitas' Code Enforcement. Fill out the [Citizen's Complaint Form](https://www.encinitasca.gov/government/departments/development-services/code-enforcement/filing-a-complaint/citizen-complaint-form) or use the [Code Enforcement Form](https://www.enc

## Scoring

**Objective**: Define function(s) that will return a score for a specific outcome we care about.

We have our `task` and our `data`; all we need now to start building evals is a way to score how well our task did on each input in our dataset.

Braintrust comes with several helpful scorers out of the box via our [Autoevals library](https://www.braintrust.dev/docs/reference/autoevals). Fun fact: you can use them outside of Braintrust as well. In addition to these scoring functions, we can define our own, as demonstrated below, where we configure a reference-based scorer to see how well the task predicts the right contact to route the inquiry to (but only if the target route is specified).


In [37]:
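def is_correct_triage_contact(
    input: str | None = None,
    expected: dict | None = None,
    output: CustomerSupportResponse | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> int | None:
    """Return 1 if the response routed to the target contact, 0 if not, and None if there is no target."""
    # Prefer the target route recorded in the row's metadata
    if output and metadata and metadata.get("triage_slug"):
        return int(output.route_to == metadata.get("triage_slug"))

    # Otherwise fall back to the route in the expected response
    if output and expected:
        return int(expected["route_to"] == output.route_to)

    # Nothing to compare against, so leave the row unscored
    return None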

In [38]:
is_correct_triage_contact(
    input=row.get("input", ""),
    output=CustomerSupportResponse.model_validate(row.get("expected", {})),
    expected=row.get("expected", None),
    metadata=row.get("metadata", None),  # type: ignore
)

1

**Key takeaways**:

1. Scoring functions accept the following arguments: input, expected, output, and metadata.
2. You can build custom scoring functions like we did above or use any of the scoring functions provided in our Autoevals library.
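
To make the scorer's branching concrete, here is a self-contained sketch that exercises each path. `FakeResponse` is a hypothetical stand-in for `CustomerSupportResponse`, `score_route` mirrors the logic of `is_correct_triage_contact` for illustration only, and the route slugs are made up.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for CustomerSupportResponse; only route_to matters here."""

    route_to: str


def score_route(output=None, expected=None, metadata=None, **kwargs):
    """Mirror of the routing scorer's branching, for illustration only."""
    # 1) A target route in metadata takes precedence
    if output and metadata and metadata.get("triage_slug"):
        return int(output.route_to == metadata["triage_slug"])
    # 2) Otherwise compare against the expected response
    if output and expected:
        return int(expected["route_to"] == output.route_to)
    # 3) No reference available, so return no score
    return None


resp = FakeResponse(route_to="code-enforcement")

from_metadata = score_route(output=resp, metadata={"triage_slug": "code-enforcement"})
from_expected = score_route(output=resp, expected={"route_to": "public-works"})
unscored = score_route(output=resp)
```

Here `from_metadata` is a match, `from_expected` is a mismatch, and `unscored` is `None` because there is nothing to compare against.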


## Evals

We now have the three things required to run an offline eval, a.k.a. an [experiment](https://www.braintrust.dev/docs/guides/experiments):

**Data**: Hand-coded and synthetic customer support requests/inquiries

**Task**: A customer support task that uses two subtasks to route and attempt to answer the request

**Scorers**: An initial reference-based scorer that scores the response on whether it routed the request to the right contact


In [39]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")

exp_metadata = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
    "synthesis_vendor": MODEL_VENDOR.OPENAI,
    "search_model_kwargs": {"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    "synthesis_model_kwargs": {"temperature": 0.75},
}
await bt.EvalAsync(
    name=BT_PROJECT_NAME,
    experiment_name=f"initial_error_analysis_{timestamp}",
    data=rows,  # type: ignore
    task=partial(ask_customer_support, **exp_metadata),
    metadata=exp_metadata,
    scores=[is_correct_triage_contact],  # type: ignore
)

Experiment initial_error_analysis_20250724_1121 is running at https://www.braintrust.dev/app/aie-course-2025/p/braintrust-intro/experiments/initial_error_analysis_20250724_1121
braintrust-intro [experiment_name=initial_error_analysis_20250724_1121] (data): 55it [00:00, 131221.11it/s]
braintrust-intro [experiment_name=initial_error_analysis_20250724_1121] (tasks): 100%|██████████| 55/55 [02:04<00:00,  2.27s/it]  



initial_error_analysis_20250724_1121 compared to add_queries_it_20250724_1103:
76.36% 'is_correct_triage_contact' score

1753381272.24s start
1753381382.45s end
69.46s duration
20.94s llm_duration
12551.55tok prompt_tokens
954.93tok completion_tokens
13506.47tok total_tokens
0.05$ estimated_cost
0tok prompt_cached_tokens
0tok prompt_cache_creation_tokens

See results for initial_error_analysis_20250724_1121 at https://www.braintrust.dev/app/aie-course-2025/p/braintrust-intro/experiments/initial_error_analysis_20250724_1121


EvalResultWithSummary(summary="...", results=[...])

We can now start iterating on improving the routing. This can be done in various ways. For example:

1. We can change the models and/or generation kwargs.
2. We can improve the prompts and context provided to our models.
3. We can make corrections to our dataset.
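
For the first option, one way to keep model and kwarg changes organized is to generate the experiment configurations up front. This is a minimal sketch, not something this notebook does: the sweep values are hypothetical, and each resulting dict is meant to play the role of `exp_metadata` in an experiment.

```python
from itertools import product

# Settings shared by every experiment (values taken from this notebook)
base_config = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
}

# Hypothetical values to sweep; choose whatever is relevant to your task
synthesis_models = ["gpt-4o-mini", "gpt-4o"]
temperatures = [0.0, 0.75]

# One config per (model, temperature) combination
configs = [
    {**base_config, "synthesis_model": model, "synthesis_model_kwargs": {"temperature": temp}}
    for model, temp in product(synthesis_models, temperatures)
]
```

Logging each config as experiment metadata makes it easier to see later which change produced which score.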


In [40]:
ds = bt.init_dataset(project=BT_PROJECT_NAME, name="initial_error_analysis")
rows = list(ds)  # [:5]

Retrying request after error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Sleeping for 0.5 seconds


In [41]:
print(len(rows))

55


In [42]:
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
what_changed = "fixed incorrect triage_slug in the 'golden' dataset"

exp_metadata = {
    "city": "Encinitas",
    "search_model": "claude-sonnet-4-20250514",
    "synthesis_model": "gpt-4o-mini",
    "synthesis_vendor": MODEL_VENDOR.OPENAI,
    "search_model_kwargs": {"max_tokens": 2048, "thinking": {"type": "enabled", "budget_tokens": 1024}},
    "synthesis_model_kwargs": {"temperature": 0.75},
}
await bt.EvalAsync(
    name=BT_PROJECT_NAME,
    experiment_name=f"initial_error_analysis_{timestamp}",
    data=rows,  # type: ignore
    task=partial(ask_customer_support, **exp_metadata),
    metadata={**exp_metadata, "what_changed": what_changed},
    scores=[is_correct_triage_contact],  # type: ignore
)

Experiment initial_error_analysis_20250724_1132 is running at https://www.braintrust.dev/app/aie-course-2025/p/braintrust-intro/experiments/initial_error_analysis_20250724_1132
braintrust-intro [experiment_name=initial_error_analysis_20250724_1132] (data): 55it [00:00, 156716.52it/s]
braintrust-intro [experiment_name=initial_error_analysis_20250724_1132] (tasks): 100%|██████████| 55/55 [06:30<00:00,  7.10s/it]  
Retrying request after error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Sleeping for 0.5 seconds



initial_error_analysis_20250724_1132 compared to initial_error_analysis_20250724_1121:
85.45% (+09.09%) 'is_correct_triage_contact' score	(8 improvements, 3 regressions)

1753381952.44s start
1753382070.01s end
76.29s (+682.42%) 'duration'                    	(19 improvements, 36 regressions)
26.84s (+589.84%) 'llm_duration'                	(22 improvements, 33 regressions)
12479.85tok (-7169.09%) 'prompt_tokens'               	(26 improvements, 29 regressions)
972.04tok (+1710.91%) 'completion_tokens'           	(22 improvements, 33 regressions)
13451.89tok (-5458.18%) 'total_tokens'                	(26 improvements, 29 regressions)
0.05$ (+00.00%) 'estimated_cost'              	(26 improvements, 29 regressions)
0tok (-) 'prompt_cached_tokens'        	(0 improvements, 0 regressions)
0tok (-) 'prompt_cache_creation_tokens'	(0 improvements, 0 regressions)

See results for initial_error_analysis_20250724_1132 at https://www.braintrust.dev/app/aie-course-2025/p/braintrust-intro/experimen

EvalResultWithSummary(summary="...", results=[...])
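
The reported scores can be checked with a little arithmetic. Assuming every one of the 55 rows received a score (none were left unscored), 76.36% corresponds to 42 correct routes; that count is inferred from the percentage, not reported by Braintrust. Eight improvements and three regressions then give a net gain of five rows:

```python
n_rows = 55


def pct(correct: int, total: int) -> float:
    """Percentage of correct rows, rounded to two decimals like the summary output."""
    return round(100 * correct / total, 2)


# Inferred from the first experiment's 76.36% score (an assumption, see above)
first_correct = 42

# 8 rows flipped to correct and 3 flipped to incorrect in the second run
second_correct = first_correct + 8 - 3

first_score = pct(first_correct, n_rows)
second_score = pct(second_correct, n_rows)
delta = round(second_score - first_score, 2)
```

This gives 47 correct rows and a score of 85.45%, a difference of 9.09 percentage points, which matches the comparison summary above.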

## Next Steps

Keep iterating! Appropriate next steps include:

- Review more synthetic queries and run our evals on more data.
- Build a reference-free scorer, an LLM-as-Judge, for example, to score routing in both offline and online evaluations.
- Work with SMEs to identify other failure modes and build evals for them.


## Fin
