# Generating Synthetic Traces for Customer Support Agent  Evaluation

## Workshop Overview

In this workshop, you'll learn how to systematically generate high-quality synthetic data for evaluating a customer support chatbot. We'll follow a structured approach that moves beyond naive prompt generation to create realistic, diverse customer queries.

## Learning Objectives

By the end of this workshop, you will:
- Learn how to define evaluation dimensions for your specific use case
- Generate diverse dimension combinations using LLMs
- Create realistic natural language queries from structured dimensions
- Build a curated dataset for LLM evaluation

---

# Environment Setup

## Instructions:
1. **Install Required Dependencies**: Run this cell to install the necessary Python packages for the workshop

In [25]:
!pip install boto3 pydantic litellm

Collecting litellm
  Downloading litellm-1.75.9-py3-none-any.whl.metadata (41 kB)
Collecting click (from litellm)
  Downloading click-8.2.1-py3-none-any.whl.metadata (2.5 kB)
Collecting httpx>=0.23.0 (from litellm)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jsonschema<5.0.0,>=4.22.0 (from litellm)
  Downloading jsonschema-4.25.1-py3-none-any.whl.metadata (7.6 kB)
Collecting openai>=1.99.5 (from litellm)
  Downloading openai-1.100.2-py3-none-any.whl.metadata (29 kB)
Collecting python-dotenv>=0.2.0 (from litellm)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting tokenizers (from litellm)
  Downloading tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting anyio (from httpx>=0.23.0->litellm)
  Downloading anyio-4.10.0-py3-none-any.whl.metadata (4.0 kB)
Collecting httpcore==1.* (from httpx>=0.23.0->litellm)
  Downloading httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)
Collecting h11

# Import Libraries and Configuration

## Instructions:
1. **Review the imports**: Notice we're importing tools for:
   - AWS Bedrock integration (`boto3`)
   - Data validation (`pydantic`)
   - LLM calls (`litellm`)
   - Parallel processing and progress tracking

In [124]:
import boto3
import json
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any
from litellm import completion
import time
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import pandas as pd

MODEL_NAME="bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
NUM_TUPLES_TO_GENERATE = 10  # Generate more tuples than needed to ensure diversity
NUM_QUERIES_PER_TUPLE = 5    # Generate multiple queries per tuple
OUTPUT_CSV_PATH = Path("bootstrap_data/synthetic_queries_for_analysis.csv")
MAX_WORKERS = 5  # Number of parallel LLM calls

### LLM calls Utility function

In [115]:
def call_llm(messages: List[Dict[str, str]], response_format: Any) -> Any:
    """Make a single LLM call with retries."""
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = completion(
                model=MODEL_NAME,
                messages=messages,
                response_format=response_format
            )
            return response_format(**json.loads(response.choices[0].message.content))
        except Exception as e:
            if attempt == max_retries - 1:
                raise e
            time.sleep(1)  # Wait before retry

---

# Understanding the Problem

## Instructions:
1. **Read the objective carefully**: We need ~100 diverse, realistic customer support requests
2. **Key insight**: "Naively generated queries tend to be generic, repetitive, and fail to capture real usage patterns"
3. **Solution**: Use a systematic approach with defined dimensions

## Workshop Discussion Points:
- Why is synthetic data important when real data isn't available?
- What makes a query "realistic" vs "generic"?
- How can we ensure diversity in our generated data?

# Define Evaluation Dimensions

A dimension is a way to categorize different parts of a user query. Each dimension represents ONE axis of variation. In our example customer service chatbot.

Feature: What task or enquiry the user wants to perform e.g order cancelation
Persona: What type of client e.g first time buyer, existing buyer
scenario: How clear is the intent specified from the user e.g concise or verbose

## Instructions:
1. **Study each dimension carefully**:
   - **Intent**: What the user wants to accomplish
   - **Complexity**: How difficult the query is to handle
   - **Persona**: What type of user is making the request
   - **Language Style**: How the user communicates

2. **Critical principle**: "Choose dimensions that describe where the AI application is likely to fail"

3. **Review the Pydantic models**: Notice how we structure our data for validation

## Workshop Activity:
What other dimensions might be relevant for your specific use case?

# Dimension Tuple Generation Function

## Instructions:
1. **Examine the prompt structure**: Notice how we:
   - Provide clear instructions for balanced coverage
   - Include realistic constraints (e.g., new customers rarely have returns)
   - Request specific numbers of combinations

2. **Understand the parallel processing**: We make multiple calls and deduplicate results

3. **Run this cell**: It defines the function but doesn't execute it yet

## Key Concept:
Good prompt engineering includes constraints and examples to guide the LLM toward realistic outputs.

In [116]:
#### Define Pydantic Models for structured Output

class DimensionTuple(BaseModel):
    intent: str = Field(
        description="The user's primary goal or task (e.g., product_inquiry, order_status_check, return_request, technical_support, account_management, general_info)."
    )
    complexity: str = Field(
        description="The difficulty and structure of the query (e.g., simple, multi-turn, ambiguous)."
    )
    persona: str = Field(
        description="The type of user based on their behavior or relationship with the store (e.g., new_customer, repeat_customer, frustrated_customer, loyalty_member)."
    )
    language_style: str = Field(
        description="Linguistic characteristics of the query (e.g., formal, informal, contains_slang, includes_typos, verbose, concise)."
    )

class DimensionTuples(BaseModel):
    tuples: List[DimensionTuple]

class DimensionTuplesList(BaseModel):
    tuples: List[DimensionTuple]


##

In [117]:
def generate_key_dimensions(num_tuples_to_generate) -> List[DimensionTuple]:
    """Generate diverse dimension tuples."""
    prompt="""\
        I am designing a customer support chatbot for a retail company and need to generate a diverse set of synthetic test data to evaluate its performance. I've provided you with the key dimensions that make up a customer's inquiry, along with a list of possible values for each.
        
        ## Instructions
        
        Generate {{{num_tuples_to_generate}}} unique combinations of dimension values based on the dimensions provided below.
        
        * Each combination should represent a distinct customer support scenario.
        * Ensure **balanced coverage** across all dimensions; avoid over-representing any single value or combination.
        * The generated tuples should be as realistic and varied as possible. For example, a frustrated customer is likely to use informal language and ask a complex question about a return.
        * Never generate a tuple where the persona is 'new_customer' and the intent is 'return_request' or 'order_status_check' unless the complexity is multi-turn to simulate a scenario where they are new to this process.
        
        ## Dimensions
        
        * **intent**: What kind of inquiry are they making?
            * product_inquiry
            * order_status_check
            * request_for_action_or_service
            * return_request
            * cancel_order
            * technical_support
            * account_management
            * general_info
        * **complexity**: The difficulty and structure of the query.
            * simple
            * multi-turn
            * ambiguous
        * **persona**: The type of user making the request
            * new_customer
            * repeat_customer
            * frustrated_customer
            * loyalty_member
        * **language_style**: The linguistic characteristics of the query
            * formal
            * informal
            * contains_slang
            * includes_typos
        
        Generate {{{num_tuples_to_generate}}} unique dimension tuples.
    """
    final_prompt = prompt.replace("{{{num_tuples_to_generate}}}", str(num_tuples_to_generate))

    messages = [{"role": "user", "content": final_prompt}]
    
    try:
        print("Generating dimension tuples in parallel...")
        with ThreadPoolExecutor(max_workers=2) as executor:
            # Submit five generation tasks using a loop
            futures = []
            for _ in range(5):
                futures.append(executor.submit(call_llm, messages, DimensionTuplesList))
            
            # Wait for all to complete and collect results
            responses = []
            for future in futures:
                responses.append(future.result())
        
        # Combine tuples and remove duplicates
        all_tuples = []
        for response in responses:
            all_tuples.extend(response.tuples)
        unique_tuples = []
        seen = set()
        
        for tup in all_tuples:
            # Convert tuple to a comparable string representation
            tuple_str = tup.model_dump_json()
            if tuple_str not in seen:
                seen.add(tuple_str)
                unique_tuples.append(tup)
        
        print(f"Generated {len(all_tuples)} total tuples, {len(unique_tuples)} unique")
        return unique_tuples
    except Exception as e:
        print(f"Error generating dimension tuples: {e}")
        return []


---

# Generate Dimension Combinations

## Instructions:
1. **Execute this cell**: It will generate diverse dimension combinations
2. **Watch the output**: You should see parallel generation happening
3. **Review the results**: Examine the generated tuples for:
   - Realistic combinations
   - Balanced coverage across dimensions
   - Absence of impossible scenarios

## Expected Output:
- "Generated X total tuples, Y unique"
- A list of DimensionTuple objects with varied combinations

---

In [118]:
dimension_tuples = generate_key_dimensions(num_tuples_to_generate=5)

Generating dimension tuples in parallel...
Generated 25 total tuples, 20 unique


---

# Review Generated Tuples

## Instructions:
1. **Examine the output**: Look at the variety of combinations generated
2. **Quality check**: Verify that combinations make sense (e.g., frustrated customers with complex queries)
3. **Note the balance**: See how different dimensions are represented

## Workshop Discussion:
- Which combinations seem most realistic?
- Are there any combinations that seem problematic?
- How does this compare to manually brainstorming scenarios?

---

In [119]:
dimension_tuples

[DimensionTuple(intent='product_inquiry', complexity='simple', persona='repeat_customer', language_style='formal'),
 DimensionTuple(intent='request_for_action_or_service', complexity='multi-turn', persona='loyalty_member', language_style='informal'),
 DimensionTuple(intent='technical_support', complexity='ambiguous', persona='frustrated_customer', language_style='contains_slang'),
 DimensionTuple(intent='return_request', complexity='multi-turn', persona='repeat_customer', language_style='includes_typos'),
 DimensionTuple(intent='account_management', complexity='simple', persona='new_customer', language_style='formal'),
 DimensionTuple(intent='technical_support', complexity='multi-turn', persona='frustrated_customer', language_style='informal'),
 DimensionTuple(intent='account_management', complexity='ambiguous', persona='loyalty_member', language_style='includes_typos'),
 DimensionTuple(intent='request_for_action_or_service', complexity='multi-turn', persona='new_customer', language_st

# Query Generation Functions

## Instructions:
1. **Study the prompt template**: Notice how we:
   - Inject the dimension tuple into the prompt
   - Provide specific formatting instructions
   - Include examples of realistic variations
   - Request natural, conversational language

2. **Understand the parallel processing setup**: We'll generate multiple queries per tuple simultaneously

3. **Run this cell below**: This defines the functions but doesn't execute generation yet

## Key Concept:
Converting structured dimensions to natural language requires careful prompt engineering with examples and constraints.

---

In [120]:
class QueriesList(BaseModel):
    queries: list[str]

def generate_queries(dimension_tuple: DimensionTuple) -> List[str]:
    """Generate natural language queries for a given dimension tuple using Bedrock."""
    
    # Use json.dumps for a clean, readable JSON representation in the prompt
    dimension_json = json.dumps(dimension_tuple.model_dump(), indent=2)

    prompt = f"""
            You are tasked with generating natural language queries for an online apparel retailer chatbot. Your goal is to create realistic, varied queries that match specific characteristics. Here's what you need to know:
        
        First, here is the dimension tuple that defines the characteristics for this set of queries:
        <dimension_tuple>
        {{DIMENSION_TUPLE}}
        </dimension_tuple>
        
        Now, here are the dimensions for query generation:
        <dimension_json>
        {{DIMENSION_JSON}}
        </dimension_json>
        
        Follow these instructions to generate the queries:
        
        1. Generate exactly {{NUM_QUERIES_PER_TUPLE}} unique queries that perfectly match the specified dimensions in the dimension tuple.
        
        2. Focus on realism: The queries should sound like something a real person would type into a chatbot. They should be natural and conversational.
        
        3. Incorporate all dimensions: The language and content of each query must naturally reflect all five dimensions (intent, complexity, persona, language_style, and failure_scenario) as specified in the dimension tuple.
        
        4. Vary the style: Within the specified 'language_style', introduce natural variations such as:
           - Common misspellings
           - Missing punctuation
           - Different capitalization
           - Use of emojis
           - Text-speak (e.g., 'thx', 'pls', 'gonna')
        
        5. Be concise and direct: Each query should be to the point and reflect how a real user would interact with a chatbot.
        
        Here are some examples of realistic variations to guide you:
        
        - For a "Frustrated Customer" with "Technical Support" intent and "Incomplete Info" scenario:
          "my discount code aint working its stupid"
          "why is the site crashing"
          "i need help with the cart but its broken"
        
        - For a "New Customer" with "Product Inquiry" intent, "Simple" complexity, and "Formal" language style:
          "Could you please tell me about the sizing for the women's jackets?"
          "Hello, I have a question regarding the material of the new shirt."
        
        Your output should be a simple list of the {{NUM_QUERIES_PER_TUPLE}} generated queries, nothing more. Do not number the queries or add any additional text. Each query should be on a new line.
        
        <output>
        [Insert your {{NUM_QUERIES_PER_TUPLE}} generated queries here, one per line]
        </output>
    """

    messages = [{"role": "user", "content": prompt}]
    
    try:
        # The call to the LLM would be handled here
        response = call_llm(messages, QueriesList)
        return response.queries
    except Exception as e:
        print(f"Error generating queries for tuple: {e}")
        return []


---

# Execute Query Generation

## Instructions:
1. **Run this cell**: It will generate natural language queries for each dimension tuple
2. **Monitor progress**: You'll see a progress bar showing query generation
3. **Handle rate limits**: The system includes automatic retries for rate limit errors

## Expected Behavior:
- Progress bar showing generation status
- Some rate limit warnings (normal with parallel processing)
- Successful generation of multiple queries per tuple

## Troubleshooting:
Rate limit errors are expected with parallel processing. So we only generating 20 query samples.

---

In [121]:
class QueryWithDimensions(BaseModel):
    id: str
    query: str
    dimension_tuple: DimensionTuple
    is_realistic_and_kept: int = 1
    notes_for_filtering: str = ""

def generate_queries_parallel(dimension_tuples: List[DimensionTuple]) -> List[QueryWithDimensions]:
    """Generate queries in parallel for all dimension tuples."""
    all_queries = []
    query_id = 1
    
    print(f"Generating {NUM_QUERIES_PER_TUPLE} queries each for {len(dimension_tuples)} dimension tuples...")
    
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submit all query generation tasks
        future_to_tuple = {
            executor.submit(generate_queries_for_tuple, dim_tuple): i 
            for i, dim_tuple in enumerate(dimension_tuples)
        }
        
        # Process completed generations as they finish
        with tqdm(total=len(dimension_tuples), desc="Generating Queries") as pbar:
            for future in as_completed(future_to_tuple):
                tuple_idx = future_to_tuple[future]
                try:
                    queries = future.result()
                    if queries:
                        for query in queries:
                            all_queries.append(QueryWithDimensions(
                                id=f"SYN{query_id:03d}",
                                query=query,
                                dimension_tuple=dimension_tuples[tuple_idx]
                            ))
                            query_id += 1
                    pbar.update(1)
                except Exception as e:
                    print(f"Tuple {tuple_idx + 1} generated an exception: {e}")
                    pbar.update(1)
    
    return all_queries
#### Execute to generate queries by passing tuples generated earlier
queries = generate_queries_parallel(dimension_tuples=dimension_tuples)


Generating 5 queries each for 20 dimension tuples...


Generating Queries:  30%|███       | 6/20 [00:14<00:39,  2.82s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  35%|███▌      | 7/20 [00:16<00:34,  2.62s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  45%|████▌     | 9/20 [00:18<00:19,  1.76s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  50%|█████     | 10/20 [00:20<00:18,  1.89s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  55%|█████▌    | 11/20 [00:21<00:14,  1.56s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  60%|██████    | 12/20 [00:21<00:09,  1.20s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  65%|██████▌   | 13/20 [00:22<00:07,  1.10s/it]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  70%|███████   | 14/20 [00:22<00:04,  1.23it/s]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  75%|███████▌  | 15/20 [00:23<00:04,  1.24it/s]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries:  80%|████████  | 16/20 [00:24<00:03,  1.06it/s]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}


Generating Queries:  85%|████████▌ | 17/20 [00:25<00:02,  1.44it/s]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}


Generating Queries:  90%|█████████ | 18/20 [00:25<00:01,  1.36it/s]


[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Error generating queries for tuple: litellm.RateLimitError: BedrockException - {"message":"Too many requests, please wait before trying again."}

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.



Generating Queries: 100%|██████████| 20/20 [00:37<00:00,  1.86s/it]

Error generating queries for tuple: 1 validation error for QueriesList
queries
  Input should be a valid list [type=list_type, input_value='[\n"Hey, do you have any... much after washing"\n]', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/list_type





# Review Generated Queries

## Instructions:
1. **Examine the output**: Look at the variety and realism of generated queries
2. **Quality assessment**: Check if queries match their dimension specifications:
   - Do frustrated customers sound frustrated?
   - Are complex queries actually complex?
   - Do the language styles match the specifications?

3. **Count your results**: Verify you have the expected number of queries

## Workshop Discussion:
- Which queries seem most realistic?
- How do these compare to queries you might write manually?
- What patterns do you notice across different dimension combinations?

---

In [122]:
queries

[QueryWithDimensions(id='SYN001', query='hi can u help me find some cute new summer outfits?', dimension_tuple=DimensionTuple(intent='technical_support', complexity='ambiguous', persona='frustrated_customer', language_style='contains_slang'), is_realistic_and_kept=1, notes_for_filtering=''),
 QueryWithDimensions(id='SYN002', query='what are your current sales on dresses looking for something flowy', dimension_tuple=DimensionTuple(intent='technical_support', complexity='ambiguous', persona='frustrated_customer', language_style='contains_slang'), is_realistic_and_kept=1, notes_for_filtering=''),
 QueryWithDimensions(id='SYN003', query='howdy looking for a nice tank top and shorts outfit for vacation', dimension_tuple=DimensionTuple(intent='technical_support', complexity='ambiguous', persona='frustrated_customer', language_style='contains_slang'), is_realistic_and_kept=1, notes_for_filtering=''),
 QueryWithDimensions(id='SYN004', query='heyy any recommendations for rompers or jumpsuits ve

## Save Queries to CSV

In [126]:
def save_queries_to_csv(queries: List[QueryWithDimensions]):
    """Save generated queries to CSV using pandas."""
    import pandas as pd
    if not queries:
        print("No queries to save.")
        return

    # Convert to DataFrame
    df = pd.DataFrame([
        {
            'id': q.id,
            'query': q.query,
            'dimension_tuple_json': q.dimension_tuple.model_dump_json(),
            'is_realistic_and_kept': q.is_realistic_and_kept,
            'notes_for_filtering': q.notes_for_filtering
        }
        for q in queries
    ])
    
    # Save to CSV
    df.to_csv(OUTPUT_CSV_PATH, index=False)
    print(f"Saved {len(queries)} queries to {OUTPUT_CSV_PATH}")

save_queries_to_csv(queries=queries)

Saved 59 queries to bootstrap_data/synthetic_queries_for_analysis.csv


---

# Next Steps (Beyond This Workshop)

After completing this workshop, you would typically:

1. **Manual Curation**: Review and filter the generated queries
2. **SME Review**: Have domain experts validate the realistic scenarios
3. **Dataset Expansion**: Generate additional queries to reach your target size (~100)
4. **Quality Scoring**: Add human ratings for query realism and difficulty
5. **Response Generation**: Use your customer support system to generate responses
6. **Evaluation Setup**: Create scoring functions to evaluate response 


# Key Takeaways

1. **Systematic > Ad-hoc**: Structured dimension-based generation produces better data than simple prompting
2. **Dimensions Matter**: Choose dimensions based on where your system is likely to fail
3. **Parallel Processing**: Use concurrent API calls to speed up generation (with rate limit handling)
4. **Quality Control**: Always include human review in your synthetic data pipeline
5. **Iterative Improvement**: Refine your dimensions and prompts based on output quality

---
