In [1]:
pip install braintrust autoevals pydantic


In [49]:
import json
from datetime import datetime
from typing import Any, Dict, List, Optional

import braintrust as bt
from braintrust import wrap_litellm
from dotenv import load_dotenv
from litellm import completion
from pydantic import BaseModel, Field

# Load environment variables from a local .env file, overriding any already set
load_dotenv(override=True)

MODEL_NAME = "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"

In [50]:
BT_PROJECT_NAME = "customer-service-agent-eval"
MAX_WORKERS = 5

bt_project = bt.projects.create(name=BT_PROJECT_NAME)

---

# Understanding the Problem

## Instructions:
1. **Read the objective carefully**: We need ~100 diverse, realistic customer support requests
2. **Key insight**: "Naively generated queries tend to be generic, repetitive, and fail to capture real usage patterns"
3. **Solution**: Use a systematic approach with defined dimensions

## Workshop Discussion Points:
- Why is synthetic data important when real data isn't available?
- What makes a query "realistic" vs "generic"?
- How can we ensure diversity in our generated data?

# Define Evaluation Dimensions

A dimension is a way to categorize one aspect of a user query. Each dimension represents ONE axis of variation. For our example customer service chatbot, dimensions could include:

- **Feature**: The task or enquiry the user wants to perform, e.g., order cancellation
- **Persona**: The type of client, e.g., first-time buyer or existing buyer
- **Scenario**: How clearly the user states their intent, e.g., concise or verbose

## Instructions:
1. **Study each dimension carefully**:
   - **Intent**: What the user wants to accomplish
   - **Complexity**: How difficult the query is to handle
   - **Persona**: What type of user is making the request
   - **Language Style**: How the user communicates

2. **Critical principle**: "Choose dimensions that describe where the AI application is likely to fail"

3. **Review the Pydantic models**: Notice how we structure our data for validation

## Workshop Activity:
What other dimensions might be relevant for your specific use case?
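
To get a feel for how large the space spanned by these dimensions is, the sketch below enumerates every combination with `itertools.product`. The value lists are copied from the tuple-generation prompt used later in this notebook; the names `DIMENSIONS` and `all_combinations` are illustrative and not part of the workshop code.

```python
from itertools import product

# Dimension values as listed in the tuple-generation prompt in this notebook.
DIMENSIONS = {
    "intent": [
        "product_inquiry", "order_status_check", "request_for_action_or_service",
        "return_request", "cancel_order", "technical_support",
        "account_management", "general_info",
    ],
    "complexity": ["simple", "multi-turn", "ambiguous"],
    "persona": ["new_customer", "repeat_customer", "frustrated_customer", "loyalty_member"],
    "language_style": ["formal", "informal", "contains_slang", "includes_typos"],
}

# Build every possible combination, one dict per tuple.
all_combinations = [
    dict(zip(DIMENSIONS, values)) for values in product(*DIMENSIONS.values())
]

print(len(all_combinations))  # 8 * 3 * 4 * 4 = 384
```

With 384 possible combinations and a target of roughly 100 queries, the sample has to be selective, which is why the prompt asks for balanced coverage instead of exhaustive enumeration.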

# Dimension Tuple Generation Function

## Instructions:
1. **Examine the prompt structure**: Notice how we:
   - Provide clear instructions for balanced coverage
   - Include realistic constraints (e.g., new customers rarely have returns)
   - Request specific numbers of combinations

2. **Understand the parallel processing**: We make multiple calls and deduplicate results

3. **Run this cell**: It defines the function but doesn't execute it yet

## Key Concept:
Good prompt engineering includes constraints and examples to guide the LLM toward realistic outputs.
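
The idea of making several calls and deduplicating the merged results can be sketched as follows. This is a minimal illustration, not the workshop implementation: `fake_generate` is a hypothetical stand-in for one LLM request, and the generation function shown later in this notebook issues one request per invocation.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # same value the notebook configures
STYLES = ["formal", "informal", "contains_slang", "includes_typos"]


def fake_generate(batch_id: int) -> list[dict]:
    """Hypothetical stand-in for one LLM call; overlaps across batches on purpose."""
    shared = [
        {"intent": "return_request", "complexity": "multi-turn",
         "persona": "frustrated_customer", "language_style": "informal"},
        {"intent": "general_info", "complexity": "simple",
         "persona": "new_customer", "language_style": "formal"},
    ]
    extra = {"intent": "cancel_order", "complexity": "ambiguous",
             "persona": "repeat_customer",
             "language_style": STYLES[batch_id % len(STYLES)]}
    return shared + [extra]


def dedupe(tuples: list[dict]) -> list[dict]:
    """Drop repeated tuples while preserving first-seen order."""
    seen = set()
    unique = []
    for tup in tuples:
        key = tuple(sorted(tup.items()))  # hashable, order-independent key
        if key not in seen:
            seen.add(key)
            unique.append(tup)
    return unique


# Fan out three "calls" in parallel, then merge and deduplicate.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    batches = list(pool.map(fake_generate, range(3)))

all_tuples = [tup for batch in batches for tup in batch]
unique_tuples = dedupe(all_tuples)
print(f"Generated {len(all_tuples)} total tuples, {len(unique_tuples)} unique")
```

Threads fit here because each real call would spend most of its time waiting on the network; `pool.map` also returns batches in submission order, which keeps the deduplicated output deterministic.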

In [51]:
# Define Pydantic models for structured output

class DimensionTuple(BaseModel):
    intent: str = Field(
        description="The user's primary goal or task (e.g., product_inquiry, order_status_check, return_request, technical_support, account_management, general_info)."
    )
    complexity: str = Field(
        description="The difficulty and structure of the query (e.g., simple, multi-turn, ambiguous)."
    )
    persona: str = Field(
        description="The type of user based on their behavior or relationship with the store (e.g., new_customer, repeat_customer, frustrated_customer, loyalty_member)."
    )
    language_style: str = Field(
        description="Linguistic characteristics of the query (e.g., formal, informal, contains_slang, includes_typos, verbose, concise)."
    )

class DimensionTuples(BaseModel):
    tuples: List[DimensionTuple]

class DimensionTuplesList(BaseModel):
    tuples: List[DimensionTuple]


In [52]:
TUPLES_GEN_PROMPT="""\
        I am designing a customer support chatbot for a retail company and need to generate a diverse set of synthetic test data to evaluate its performance. I've provided you with the key dimensions that make up a customer's inquiry, along with a list of possible values for each.
        
        ## Instructions
        
        Generate {{{num_tuples_to_generate}}} unique combinations of dimension values based on the dimensions provided below.
        
        * Each combination should represent a distinct customer support scenario.
        * Ensure **balanced coverage** across all dimensions; avoid over-representing any single value or combination.
        * The generated tuples should be as realistic and varied as possible. For example, a frustrated customer is likely to use informal language and ask a complex question about a return.
        * Never generate a tuple where the persona is 'new_customer' and the intent is 'return_request' or 'order_status_check' unless the complexity is multi-turn to simulate a scenario where they are new to this process.
        
        ## Dimensions
        
        * **intent**: What kind of inquiry are they making?
            * product_inquiry
            * order_status_check
            * request_for_action_or_service
            * return_request
            * cancel_order
            * technical_support
            * account_management
            * general_info
        * **complexity**: The difficulty and structure of the query.
            * simple
            * multi-turn
            * ambiguous
        * **persona**: The type of user making the request
            * new_customer
            * repeat_customer
            * frustrated_customer
            * loyalty_member
        * **language_style**: The linguistic characteristics of the query
            * formal
            * informal
            * contains_slang
            * includes_typos
        
        Generate {{{num_tuples_to_generate}}} unique dimension tuples.
    """

In [53]:
def get_or_create_tuples_prompt():
    try:
        # Try to load existing prompt
        prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")
        # Validate it works
        prompt.build(num_tuples_to_generate=20)
        return prompt
    except Exception:
        # Create new prompt if loading/building fails
        bt_project.prompts.create(
            name="DimensionTuplesGenPrompt",
            slug="dimension-tuples-gen-prompt", 
            description="Prompt for generating dimension tuples",
            model="claude-4-sonnet-20250514",
            messages=[{"role": "user", "content": TUPLES_GEN_PROMPT}],
            if_exists="replace",
        )
        bt_project.publish()
        return bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")

tuples_gen_prompt = get_or_create_tuples_prompt()

In [54]:
# Fetch the existing prompt
tuples_gen_prompt = bt.load_prompt(project=BT_PROJECT_NAME, slug="dimension-tuples-gen-prompt")


In [55]:
tuples_gen_prompt

<braintrust.logger.Prompt at 0x7fbedeb23ec0>

In [56]:
_p = tuples_gen_prompt.build(num_tuples_to_generate=20)
print(_p["messages"])

[{'content': "        I am designing a customer support chatbot for a retail company and need to generate a diverse set of synthetic test data to evaluate its performance. I've provided you with the key dimensions that make up a customer's inquiry, along with a list of possible values for each.\n        \n        ## Instructions\n        \n        Generate 20 unique combinations of dimension values based on the dimensions provided below.\n        \n        * Each combination should represent a distinct customer support scenario.\n        * Ensure **balanced coverage** across all dimensions; avoid over-representing any single value or combination.\n        * The generated tuples should be as realistic and varied as possible. For example, a frustrated customer is likely to use informal language and ask a complex question about a return.\n        * Never generate a tuple where the persona is 'new_customer' and the intent is 'return_request' or 'order_status_check' unless the complexity is

![title](images/braintrust_playgrounds.png)

---

# Generate Dimension Combinations

## Instructions:
1. **Execute this cell**: It will generate diverse dimension combinations
2. **Watch the output**: You should see parallel generation happening
3. **Review the results**: Examine the generated tuples for:
   - Realistic combinations
   - Balanced coverage across dimensions
   - Absence of impossible scenarios

## Expected Output:
- "Generated X total tuples, Y unique"
- A list of DimensionTuple objects with varied combinations
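
Checking for impossible scenarios does not have to be done by eye. The sketch below encodes the `new_customer` constraint stated in the generation prompt as a plain Python predicate. It works on dictionaries (for example, the result of calling `model_dump()` on each `DimensionTuple`); the function name and the sample data are illustrative assumptions.

```python
def violates_new_customer_rule(tup: dict) -> bool:
    """Return True if the tuple breaks the prompt's new_customer constraint.

    The prompt forbids new_customer + return_request/order_status_check
    unless the complexity is multi-turn.
    """
    return (
        tup["persona"] == "new_customer"
        and tup["intent"] in {"return_request", "order_status_check"}
        and tup["complexity"] != "multi-turn"
    )


# Illustrative candidates: the first one violates the rule, the others do not.
candidates = [
    {"intent": "return_request", "complexity": "simple",
     "persona": "new_customer", "language_style": "formal"},
    {"intent": "return_request", "complexity": "multi-turn",
     "persona": "new_customer", "language_style": "informal"},
    {"intent": "order_status_check", "complexity": "simple",
     "persona": "repeat_customer", "language_style": "includes_typos"},
]

flagged = [tup for tup in candidates if violates_new_customer_rule(tup)]
print(f"{len(flagged)} of {len(candidates)} tuples violate the constraint")
```

A check like this is useful because a "Never" instruction in a prompt is a request, not a guarantee; filtering after generation catches the cases where the model ignores it.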

---

In [69]:
def generate_synth_data_dimension_tuples(num_tuples: int = 20, model_kwargs: Optional[dict] = None):
    """Generate dimension tuples from the prompt, deduplicate them, and log them to Braintrust."""
    model_kwargs = model_kwargs or {}  # avoid sharing a mutable default between calls
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    prompt = tuples_gen_prompt.build(num_tuples_to_generate=num_tuples)
    rsp = completion(
        model=MODEL_NAME,
        messages=prompt["messages"],
        response_format=DimensionTuples,
        **model_kwargs,
    )

    # Parse JSON content and validate with Pydantic
    content = rsp.choices[0].message.content
    tuples_dict = json.loads(content)
    tuples_list: DimensionTuples = DimensionTuples(**tuples_dict)

    # Deduplicate while preserving order, using the JSON form as the key
    unique_tuples = []
    seen = set()
    for tup in tuples_list.tuples:
        tuple_str = tup.model_dump_json()
        if tuple_str in seen:
            continue
        seen.add(tuple_str)
        unique_tuples.append(tup)
    print(f"Generated {len(tuples_list.tuples)} total tuples, {len(unique_tuples)} unique")

    # Log one span per unique tuple to a new, timestamped experiment
    bt_experiment = bt.init(project=BT_PROJECT_NAME, experiment=f"synth_tuples_it_{timestamp}")
    for uniq_tup in unique_tuples:
        with bt_experiment.start_span(name="generate_dimension_tuples") as span:
            span.log(input=prompt["messages"], output=uniq_tup.model_dump(), metadata=dict(model_kwargs=model_kwargs))

    summary = bt_experiment.summarize(summarize_scores=False)
    # Return the deduplicated tuples, not the raw model output
    return summary, DimensionTuples(tuples=unique_tuples)


In [70]:
generated_data=generate_synth_data_dimension_tuples(num_tuples=2)

In [71]:
generated_data

(ExperimentSummary(project_name='customer-service-agent-eval', project_id='8a70e092-c540-4db6-bb9d-2fa3da5d1e2e', experiment_id='714b8c07-9b0c-415d-ad93-c552212492fa', experiment_name='synth_tuples_it_20250826_0052', project_url='https://www.braintrust.dev/app/aiopsdream/p/customer-service-agent-eval', experiment_url='https://www.braintrust.dev/app/aiopsdream/p/customer-service-agent-eval/experiments/synth_tuples_it_20250826_0052', comparison_experiment_name=None, scores={}, metrics={}),
 DimensionTuples(tuples=[DimensionTuple(intent='technical_support', complexity='multi-turn', persona='frustrated_customer', language_style='informal'), DimensionTuple(intent='request_for_action_or_service', complexity='simple', persona='loyalty_member', language_style='formal')]))

---

# Review Generated Tuples

## Instructions:
1. **Examine the output**: Look at the variety of combinations generated
2. **Quality check**: Verify that combinations make sense (e.g., frustrated customers with complex queries)
3. **Note the balance**: See how different dimensions are represented
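
One way to check the balance is to count how often each value appears per dimension. The sketch below does this with `collections.Counter` on the two tuples shown in the output above, written as plain dictionaries; `coverage_report` is a hypothetical helper, not part of the workshop code.

```python
from collections import Counter


def coverage_report(tuples: list[dict]) -> dict[str, Counter]:
    """Count how often each value occurs, grouped by dimension."""
    report: dict[str, Counter] = {}
    for tup in tuples:
        for dimension, value in tup.items():
            report.setdefault(dimension, Counter())[value] += 1
    return report


# The two tuples from the sample run above, as plain dictionaries.
sample = [
    {"intent": "technical_support", "complexity": "multi-turn",
     "persona": "frustrated_customer", "language_style": "informal"},
    {"intent": "request_for_action_or_service", "complexity": "simple",
     "persona": "loyalty_member", "language_style": "formal"},
]

report = coverage_report(sample)
for dimension, counts in report.items():
    print(dimension, dict(counts))
```

On a larger run, a value that dominates its dimension or never appears at all is a signal to adjust the prompt or generate additional batches.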

## Workshop Discussion:
- Which combinations seem most realistic?
- Are there any combinations that seem problematic?
- How does this compare to manually brainstorming scenarios?

## Key Takeaways
- We use `braintrust.init` to manually create a new experiment.
- We log one span per unique tuple, each recording the input, output, and metadata.
- We obtain the experiment summary by calling `summarize()` on the experiment object to review the results.

---