# A customer support bot - Part 1

An introduction to all things Braintrust via a speed-run through building evals for a customer support bot.


In this demo, we aim to enhance a city's customer support application by developing an AI workflow to streamline the request process and ensure it is routed to the correct department.

We propose a simple UI where the user can describe their issue and an AI pipeline that will take that description, route the request to the appropriate contact, prompt the UI for any required information, and provide an answer or informational response if warranted.

To build evals for such a pipeline, we need three things:

1. Data
2. One or more task functions
3. One or more scoring functions that tell us how well each row in our dataset did on that task

In this notebook, we will work on the data piece.

Let's start by importing the required libraries.


In [1]:
import enum
import json
import os
import random

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from functools import partial
from textwrap import dedent
from typing import Annotated, Literal

import anthropic as ac
import braintrust as bt
import openai as oai
import pandas as pd
import requests

from braintrust import wrap_anthropic, wrap_openai
from dotenv import load_dotenv
from pydantic import BaseModel, BeforeValidator, Field
from thefuzz import process as fuzzy_process

load_dotenv(override=True)

  from .autonotebook import tqdm as notebook_tqdm


True

## Setup

Once you're signed up on [Braintrust](https://www.braintrust.dev/) and have set the appropriate API keys in your `.env` file, you're ready to go. For this particular demo, you'll need the following API keys:

- `BRAINTRUST_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`

With that in place, we can create our [Braintrust project](https://www.braintrust.dev/docs/guides/projects) using `braintrust.projects.create()`. This method will return the project if it already exists, or else create it the first time we attempt so send any artifacts it's way.

We also define Anthropic and OpenAI clients like usual, as well as a "wrapped" version of each client. These "wrapped" versions automatically capture usage data (e.g., prompt tokens, etc) in Braintrust when used within the context of a logger or experiment.


In [3]:
BT_PROJECT_NAME = "braintrust-intro"
MAX_WORKERS = 5

bt_project = bt.projects.create(name=BT_PROJECT_NAME)

ac_client = ac.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
oai_client = oai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

wrapped_ac_client = wrap_anthropic(ac_client)
wrapped_oai_client = wrap_openai(oai_client)


**Key takeaways**:

1. We use `braintrust.projects.create()` to create or fetch a Braintrust project.
2. We define both standard OpenAI and Anthropic clients, as well as wrapped versions of these clients that support automatic logging of their usage data.


## Data


**Objective**: Curate around 100 diverse and realistic customer support requests.

> "Every error analysis starts with _traces_: the full sequence of inputs, outputs, and actions taken by the pipeline for a given input"

The recommendation is to start with ~ 100 traces of diverse user intents, with preference given to real-world data where possible. In this case, we don't have access to such data and so we'll follow the AIE methodology for creating synthetic data. We'll braintrustify this workflow a bit along the way :)


### Step 1: Define dimensions

> "First, before prompting anything, we define key dimensions of the query space... [to] help us systematically vary different aspects of a user’s request"


In [4]:
class DimensionTuple(BaseModel):
    inquiry_type: str = Field(
        description="What kind of inquiry are they making (e.g. informational_ask, complaint,request_for_action_or_service)"
    )
    query_style_and_detail: str = Field(
        description="The style and detail of the query (e.g. short_keywords_minimal_detail, concise_moderate_detail, verbose_detailed_request)"
    )
    customer_persona: str = Field(
        description="Who is making the request (e.g. business_owner, citizen)",
    )
    customer_age: str = Field(
        description="The age of the person making the request (e.g. senior_over_65, youth_under_18, adult)",
    )


class DimensionTuples(BaseModel):
    tuples: list[DimensionTuple]

### Step 2: Define dimension values


In [5]:
TUPLES_GEN_PROMPT = """\
I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of inquiries citizens might submit. I have provided you with several dimensions and that constitute the parts of
such a queries along with a list of possible values for each dimension.

## Instructions

Generate {{{num_tuples_to_generate}}} unique combinations of dimension values for based on the dimensions provided below. 
- Each combination should represent a different user scenario. 
- Ensure balanced coverage across all dimensions - don't over-represent any particular value or combination.
- Vary the query styles naturally.
- Attempt to make the dimension value combinations as realistic as possible.
- Never generate a tuple where the customer_persona is a business_owner and the customer_age is youth_under_18.

## Dimensions

inquiry_type:
- informational_ask
- complaint
- request_for_action_or_service

query_style_and_detail:
- concise_moderate_detail: [includes some details relevant to the `inquiry_topic`, e.g., location, names, etc.]
- verbose_detailed_request: [includes specific details relevant to the `inquiry_topic`, e.g., location, names, etc.]
- short_keywords_minimal_details

customer_persona: Who is making the request
- business_owner
- citizen

customer_age: The age of the person making the request
- senior_over_65
- youth_under_18
- adult

Generate {{{num_tuples_to_generate}}} unique dimension tuples following these patterns. Remember to maintain balanced diversity across all dimensions."""


In [6]:
# print(TUPLES_GEN_PROMPT.replace("{{{num_tuples_to_generate}}}", "10"))

Create a [versioned prompt](https://www.braintrust.dev/docs/guides/functions/prompts) and save it to Braintrust.


In [7]:
tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")
try:
    tuples_gen_prompt.build(num_tuples_to_generate=20)
except Exception as e:
    bt_tuples_gen_prompt = bt_project.prompts.create(
        name="DimensionTuplesGenPrompt",
        slug="dimension-tuples-gen-prompt",
        description="Prompt for generating dimension tuples",
        model="claude-4-sonnet-20250514",
        messages=[{"role": "user", "content": TUPLES_GEN_PROMPT}],
        if_exists="replace",
    )

    bt_project.publish()
    tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")


**Key takeaways**:

1. We can create and manage versions of prompts using the SDK or through the UI.
2. We can retrieve any prompt version via the SDK for use in our code (the latest version is returned by default).


With this in place, we can engage our domain experts in the prompt-building process via our [playgrounds](https://www.braintrust.dev/docs/guides/playground) and use our eval-specific AI assistant, [Loop](https://www.braintrust.dev/docs/guides/loop).

We'll prompt Loop to improve our "prompt" like this:

> "Help me improve this prompt to capture all the different dimensions a customer support inquiry may have for the city of Encinitas.
>
> The dimensions and their values should represent realistic aspects of a customer support request they may get"

<img src="./data/dim-tuples-gen-playground-loop.png" width="800"/>

We can then get our updated prompt, make changes to the structured output definition, and generate some data!

**Key takeaways**:

1. Besides optimizing our prompts, Loop can also enhance our scorers and assist in curating datasets for evaluations.


Fetch our improved prompt and update our Pydantic model:


In [8]:
tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")

In [9]:
_p = tuples_gen_prompt.build(num_tuples_to_generate=20)
print(_p["messages"][0]["content"])

I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse range of inquiries citizens might submit. I have provided you with several dimensions that constitute the parts of such queries along with a list of possible values for each dimension.

## Instructions

Generate 20 unique combinations of dimension values based on the dimensions provided below. 
- Each combination should represent a different user scenario. 
- Ensure balanced coverage across all dimensions - don't over-represent any particular value or combination.
- Vary the query styles naturally.
- Attempt to make the dimension value combinations as realistic as possible.
- Never generate a tuple where the customer_persona is a business_owner and the customer_age is youth_under_18.

## Dimensions

inquiry_type:
- informational_ask
- complaint
- request_for_action_or_service
- permit_application
- service_status_check

urgency_level:
- emergency: [immediate safety concern or ser

In [10]:
class DimensionTuple(BaseModel):
    inquiry_type: str = Field(
        description="What kind of inquiry are they making (e.g. informational_ask, complaint, request_for_action_or_service, permit_application, service_status_check)"
    )
    urgency_level: str = Field(
        description="The urgency level of the request (e.g. emergency, urgent, routine, follow_up)",
    )
    query_style_and_detail: str = Field(
        description="The style and detail of the query (e.g. concise_moderate_detail, verbose_detailed_request, short_keywords_minimal_details)"
    )
    customer_persona: str = Field(
        description="Who is making the request (e.g. resident, business_owner, visitor, property_owner)",
    )
    customer_age: str = Field(
        description="The age of the person making the request (e.g. senior_over_65, youth_under_18, adult_18_64)",
    )
    communication_preference: str = Field(
        description="Preferred communication method (e.g. phone_call, email, online_portal, in_person_visit, social_media)"
    )
    time_sensitivity: str = Field(
        description="How quickly a response is needed (e.g. same_day_response_needed, within_week_response_ok, no_specific_timeline)"
    )


class DimensionTuples(BaseModel):
    tuples: list[DimensionTuple]

### Step 3: Use an LLM to generate N unique combinations of these dimension values


In [11]:
def generate_synth_data_dimension_tuples(num_tuples: int = 20, model: str = "gpt-4o-mini", model_kwargs: dict = {}):
    """Generate a list of dimension tuples based on the provided prompt."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    prompt = tuples_gen_prompt.build(num_tuples_to_generate=num_tuples)

    rsp = oai_client.beta.chat.completions.parse(
        model=model,
        messages=prompt["messages"],
        response_format=DimensionTuples,
        **model_kwargs,
    )

    tuples_list: DimensionTuples = rsp.choices[0].message.parsed  # type: ignore

    unique_tuples = []
    seen = set()

    for tup in tuples_list.tuples:
        tuple_str = tup.model_dump_json()
        if tuple_str in seen:
            continue

        seen.add(tuple_str)
        unique_tuples.append(tup)

    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"synth_tuples_it_{timestamp}")
    for uniq_tup in unique_tuples:
        with bt_experiment.start_span(name="generate_dimension_tuples") as span:
            span.log(input=prompt["messages"], output=uniq_tup, metadata=dict(model=model, model_kwargs=model_kwargs))

    summary = bt_experiment.summarize(summarize_scores=False)
    return summary, rsp.choices[0].message.parsed

**Key Takeaways**:

1. We use `braintrust.init` to manually create a new experiment.
2. We generate a trace in the form of a single span, adding information for `input`, `output`, and `metadata`.
3. We obtain the experiment summary via `braintrust.summarize` to review the experiment results.


In [12]:
exp_summary, dim_tuples = generate_synth_data_dimension_tuples(20)

print(exp_summary)
print(dim_tuples)


See results for synth_tuples_it_20250724_1057 at https://www.braintrust.dev/app/aie-course-2025/p/braintrust-intro/experiments/synth_tuples_it_20250724_1057
tuples=[DimensionTuple(inquiry_type='informational_ask', urgency_level='routine', query_style_and_detail='concise_moderate_detail', customer_persona='resident', customer_age='adult_18_64', communication_preference='email', time_sensitivity='within_week_response_ok'), DimensionTuple(inquiry_type='complaint', urgency_level='urgent', query_style_and_detail='verbose_detailed_request', customer_persona='visitor', customer_age='adult_18_64', communication_preference='phone_call', time_sensitivity='same_day_response_needed'), DimensionTuple(inquiry_type='request_for_action_or_service', urgency_level='emergency', query_style_and_detail='concise_moderate_detail', customer_persona='resident', customer_age='adult_18_64', communication_preference='in_person_visit', time_sensitivity='same_day_response_needed'), DimensionTuple(inquiry_type='per

### Step 4: Remove invalid dimension tuples (human/SME review)

We'll do this in the Braintrust UI with [human review](https://www.braintrust.dev/docs/guides/human-review).


<img src="./data/dim-tuples-human-review.png" width="800"/>

**Key takeaways**:

1. To set up scorers for human review, navigate to "Configuration" > "Human Review".
2. Free-form scorers can be configured to record data in the metadata.
3. Human review scorers can be used in BTQL queries.


### Step 5: Demonstrate an approach to building + 20 queries with the help of a SME


Working with an SME and ChatGPT, we were able to capture all of the contacts, customer support requests get routed, and a description of the types of inquiries each gets routed their way. We even curated 5 example support requests for each of these contacts.


In [13]:
dept_support_areas = json.load(open("./data/dept_support_areas.json"))

triage_targets = list(set([route["slug"] for route in dept_support_areas]))

print(len(triage_targets))
triage_targets

11


['parks_recreation',
 'general',
 'fire_prevention',
 'human_resources',
 'traffic_engineering',
 'code_enforcement',
 'san_dieguito_water_district',
 'public_works',
 'san_diego_humane_society',
 'homeless_solutions',
 'san_diego_county_vector_control']

In [14]:
for idx, triage_point in enumerate(dept_support_areas):
    if idx >= 2:
        break
    print(f"{triage_point['name']} ({triage_point['slug']})")
    print("- " + "\n- ".join(triage_point["support_areas"][:5]))
    print("-> " + "\n-> ".join(triage_point["examples"][:5]))
    print()

Public Works (public_works)
- Deceased animals
- Graffiti on public property
- Water leak (non-emergency)
- Streetlight outages
- Pothole repairs
-> I saw a dead raccoon on Melba Road today—can someone remove it?
-> There’s new graffiti on the sidewalk wall along South Coast Highway.
-> A small water leak is pooling near the fire hydrant on Encinitas Blvd.
-> The streetlight in front of 450 Quail Gardens Dr has been out for 3 nights.
-> There’s a deep pothole on Manchester Ave that's damaging cars.

Code Enforcement (code_enforcement)
- Abandoned vehicles
- Illegal signs or banners
- Noise complaints
- Trash or junk on private property
- Homeless overnight camping
-> There’s a broken-down RV parked on Citrus Ave for over two weeks.
-> Someone put up a political banner on private property with no permit.
-> Neighbors at 2 a.m. are playing loud music by the pool every weekend.
-> The vacant lot next to Sunset View has piles of old furniture.
-> A homeless encampment is forming under the 

In [15]:
triage_examples = {}
for triage_contact in dept_support_areas:
    triage_target = triage_contact["slug"]
    if triage_target not in triage_examples:
        triage_examples[triage_target] = {"name": triage_contact["name"], "support_areas": triage_contact["support_areas"], "examples": []}
    triage_examples[triage_target]["examples"].extend(triage_contact["examples"])

print(f"Total triage targets: {len(triage_examples)}")
print(f"Total examples: {sum(len(data['examples']) for data in triage_examples.values())}")
print(f"First 5 examples for '{list(triage_examples.keys())[0]}': {triage_examples[list(triage_examples.keys())[0]]['examples'][:5]}")

Total triage targets: 11
Total examples: 55
First 5 examples for 'public_works': ['I saw a dead raccoon on Melba Road today—can someone remove it?', 'There’s new graffiti on the sidewalk wall along South Coast Highway.', 'A small water leak is pooling near the fire hydrant on Encinitas Blvd.', 'The streetlight in front of 450 Quail Gardens Dr has been out for 3 nights.', "There’s a deep pothole on Manchester Ave that's damaging cars."]


### Step 6. Scale up 100 or more tuples + queries using an LLM

We can use the "[BTQL](https://www.braintrust.dev/docs/reference/btql) Sandbox" to help us construct a query to get the "good" dimension tuples from our annotation exercise, and then use that query to get those records to build some synthetic queries.


In [16]:
def get_valid_dimension_tuples():
    cursor = None
    while True:
        response = requests.post(
            "https://staging-api.braintrust.dev/btql",
            json={
                "query": dedent("""
                        select: output
                        from: experiment('6006d3ae-ccea-48bf-b7ba-d4d27b18e271')
                        filter: scores."is_good" = 1
                """)
                + (f" | cursor: '{cursor}'" if cursor else ""),
                "use_brainstore": True,
                "brainstore_realtime": True,  # Include the latest realtime data, but a bit slower.
            },
            headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
        )
        response.raise_for_status()
        response_json = response.json()
        data = response_json.get("data", [])
        cursor = response_json.get("cursor")

        return [row["output"] for row in data]


valid_dim_tuples = get_valid_dimension_tuples()

print(len(valid_dim_tuples))
valid_dim_tuples[:2]

18


[{'communication_preference': 'online_portal',
  'customer_age': 'adult_18_64',
  'customer_persona': 'visitor',
  'inquiry_type': 'permit_application',
  'query_style_and_detail': 'short_keywords_minimal_details',
  'time_sensitivity': 'same_day_response_needed',
  'urgency_level': 'urgent'},
 {'communication_preference': 'email',
  'customer_age': 'adult_18_64',
  'customer_persona': 'resident',
  'inquiry_type': 'request_for_action_or_service',
  'query_style_and_detail': 'verbose_detailed_request',
  'time_sensitivity': 'no_specific_timeline',
  'urgency_level': 'routine'}]

**Key takeaways**:

1. We can use BTQL to query our logs, experiments, and datasets.
2. Use the "BTQL sandbox" to build and test your queries before putting them in code.


Create a versioned prompt and save it to Braintrust.


In [17]:
SYNTH_DATA_GEN_PROMPT = """\
I am designing a customer support chatbot for the city of Encinitas and I want to test it against a diverse
range of realistic inquiries citizens that would appropriately be routed to this contact:

Contact: {{{contact_name}}}
Support Areas: {{{support_areas}}}

## Objective

Produce {{{num_queries_to_generate}}} additional natural language customer support requests using a sample of actual customer support requests 
below, but augmented to have one of the characteristics listed below:

==== Inquiry Characteristics =====
{{{dimension_tuple_json}}}

===== Example Inquiries =====
{{{example_inquiries}}}

## Instructions
To produce each inquiry, follow these steps:
1. Choose a random example from the list of example inquiries above.
2. Choose a random record from the list of inquiry characteristics above.
3. Augment the example inquiry to have the chosen characteristic of that record.
4. Return the augmented inquiry.

The queries should:
1. Sound like real users asking for assistance
2. Naturally incorporate all the dimension values
3. Vary in style and detail level
4. Be realistic and practical
5. Change any named entities (locations, addresses, street names, person names, etc...) to diversify the content
5. Include natural variations in typing style, such as:
   - Some queries in all lowercase
   - Some with random capitalization
   - Some with common typos
   - Some with missing punctuation
   - Some with extra spaces or missing spaces
   - Some with emojis or text speak

Here are examples of realistic query variations for a request to get a pothole fixed from 
an adult, business_owner, concise_moderate_detail:

Proper formatting:
- "There is a huge pothole on the corner of 123 Main St. It's been there for weeks affecting customers trying to park nearby"
- "Can someone please fix the pothole on 123 Main St."

All lowercase:
- "need someone to fix the pothole on 123 main st."
- "can a crew come out and fix the pothole on the corner of 123 main st. and grand ave."

Random caps:
- "NEED someone to fix the pothole on 123 main st. ASAP"
- "can a crew come out and fix the pothole on the corner of 123 main st. and grand ave??? It needs to happen NOW! Like today!"

Common typos:
- "need someone to fx the pothol on 123 main st. asap pleze!!!"
- "Can a crew come out and fx the pothole on the corner of 123 main st. & grand ave??? thx much!"

Missing punctuation:
- "need someone to fx the pothole on 123 main st asap plz"
- "Can a crew come out and fix the pothole on the corner of 123 main st & grand ave ... thx much"

With emojis/text speak:
- "the pothole needs to be fixed now on 123 main st. and grand ave! 🥗"
- "pls help get the pothole on 123 main st. & grand ave fixed thx"

Generate {{{num_queries_to_generate}}} unique queries,varying the text style naturally."""


# print(
#     (
#         SYNTH_DATA_GEN_PROMPT.replace("{{{num_queries_to_generate}}}", "4")
#         .replace("{{{dimension_tuple_json}}}", json.dumps(valid_dim_tuples[:4], indent=2))
#         .replace("{{{example_inquiries}}}", "- " + "\n- ".join(triage_examples["public_works"]["examples"][:5]))
#         .replace("{{{contact_name}}}", triage_examples["public_works"]["name"])
#         .replace("{{{support_areas}}}", ", ".join(triage_examples["public_works"]["support_areas"]))
#     )
# )

In [18]:
synth_query_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="synth-query-gen-prompt")
try:
    synth_query_gen_prompt.prompt
except Exception as e:
    bt_syth_query_gen_prompt = bt_project.prompts.create(
        name="SynthQueryGenPrompt",
        slug="synth-query-gen-prompt",
        description="Prompt for generating synthetic queries",
        model="claude-4-sonnet-20250514",
        messages=[{"role": "user", "content": SYNTH_DATA_GEN_PROMPT}],
        if_exists="replace",
    )

    bt_project.publish()
    synth_query_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="synth-query-gen-prompt")


In [19]:
# _p = synth_query_gen_prompt.build(
#     num_queries_to_generate=4,
#     dimension_tuple_json=json.dumps(valid_dim_tuples[:4], indent=2),
#     example_inquiries="- " + "\n- ".join(triage_examples["public_works"]["examples"][:5]),
#     contact_name=triage_examples["public_works"]["name"],
#     support_areas=", ".join(triage_examples["public_works"]["support_areas"]),
# )

# _p

Generate synthetic data we can later run through our customer support bot:


In [20]:
class QueryList(BaseModel):
    queries: list[str]

In [21]:
def generate_synth_queries(triage_slug: str, num_queries: int = 10, model: str = "gpt-4o-mini", model_kwargs: dict = {}) -> dict:
    random_dim_tuples = random.sample(valid_dim_tuples, min(num_queries, len(valid_dim_tuples)))

    prompt = synth_query_gen_prompt.build(
        num_queries_to_generate=num_queries,
        dimension_tuple_json=json.dumps(random_dim_tuples, indent=2),
        example_inquiries="- " + "\n- ".join(triage_examples[triage_slug]["examples"][:5]),
        contact_name=triage_examples[triage_slug]["name"],
        support_areas=", ".join(triage_examples[triage_slug]["support_areas"]),
    )

    rsp = oai_client.beta.chat.completions.parse(
        model=model,
        messages=prompt["messages"],
        response_format=QueryList,
        **model_kwargs,
    )

    query_list: QueryList = rsp.choices[0].message.parsed  # type: ignore

    return {
        "prompt": prompt["messages"],
        "triage_slug": triage_slug,
        "sampled_dim_tuples": random_dim_tuples,
        "handcoded_queries": triage_examples[triage_slug]["examples"],
        "synth_queries": query_list.queries,
    }

In [22]:
# rsp = generate_synth_queries("public_works", num_queries=10)

# for q in rsp["handcoded_queries"]:
#     print(q)
# print("-" * 100)
# for q in rsp["synth_queries"]:
#     print(q)

In [23]:
def generate_queries_parallel(triage_targets: list[str], num_queries: int = 10, model: str = "gpt-4o-mini", model_kwargs: dict = {}):
    """Generate queries in parallel for all dimension tuples."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")

    print(f"Generating {num_queries} queries each for {len(triage_targets)} triage targets...")
    # Run in parallel
    worker = partial(generate_synth_queries, num_queries=num_queries, model=model, model_kwargs=model_kwargs)
    responses = list(ThreadPoolExecutor(max_workers=MAX_WORKERS).map(worker, triage_targets))

    # Add query items
    all_queries = []
    for response in responses:
        prompt = response["prompt"]
        triage_slug = response["triage_slug"]

        queries = [{"prompt": prompt, "triage_slug": triage_slug, "query": q, "query_source": "synth"} for q in response["synth_queries"]]
        queries.extend(
            [{"prompt": prompt, "triage_slug": triage_slug, "query": q, "query_source": "handcoded"} for q in response["handcoded_queries"]]
        )

        all_queries.extend(queries)

    # Add to experiment
    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"add_queries_it_{timestamp}")
    query_id = 1

    for query_item in all_queries:
        qid = f"{timestamp}_{query_id:03d}"
        query_id += 1

        with bt_experiment.start_span(name="add_query") as span:
            span.log(
                input=query_item["prompt"],
                output=query_item["query"],
                metadata={
                    "id": qid,
                    "triage_slug": query_item["triage_slug"],
                    "query_source": query_item["query_source"],
                    "model": model,
                    "model_kwargs": model_kwargs,
                },
            )

    summary = bt_experiment.summarize(summarize_scores=False)
    return summary, queries

**Key takeaways**:

1. Use `braintrust.init()` to manually create a new experiment.
2. Add a distinct trace as a single span, including information for `input`, `output`, and `metadata`.
3. Retrieve the experiment summary with `braintrust.summarize()` to review the experiment results..


In [24]:
summary, queries = generate_queries_parallel(triage_targets, num_queries=10, model="gpt-4o-mini")
# summary, queries = generate_queries_parallel(triage_targets[:2], num_queries=10, model="gpt-4o-mini")

print(len(queries))
print(summary)

Generating 10 queries each for 11 triage targets...
15

See results for add_queries_it_20250724_1103 at https://www.braintrust.dev/app/aie-course-2025/p/braintrust-intro/experiments/add_queries_it_20250724_1103


### Step 7. Remove invalid queries (human/SME review)

We'll do this in the Braintrust UI


<img src="./data/synth-queries-human-review.png" width="800"/>

**Key takeaways**:

1. We can apply filters and groups to better look at the data we are interested in.


## Next Steps

In the "02_tasks_and_evals" notebook we will do the following:

- Define task(s) for our evals
- Curate a dataset to use for running experiments
- Define scorer(s)
- Run evals


## Fin
