## Dataset Generation - Automated Test Case Creation for Agent Evaluation

This tutorial demonstrates automated test case generation for agent evaluation. You'll learn how to generate diverse, high-quality test datasets using the DatasetGenerator API, including saving and loading datasets for reuse across evaluation runs.

### What You'll Learn
- Generate test cases from scratch using topics
- Generate contextual test cases from agent tools and APIs
- Update existing datasets with edge cases and corner scenarios
- Save datasets to JSON files for reuse
- Load datasets from JSON files
- Use auto-rubric generation for evaluators
- Apply topic planning for diverse test coverage

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:------------------------------------------------------------------------------|
| Tutorial type       | Intermediate - Automated dataset generation with persistence                  |
| Tutorial components | Multi-agent system, DatasetGenerator, dataset persistence                     |
| Tutorial vertical   | Agent Evaluation                                                              |
| Example complexity  | Medium                                                                        |
| SDK used            | Strands Agents, Strands Evals                                                 |

### Understanding Dataset Generation

Dataset generation automates test case creation for evaluating AI agents. Instead of manually writing test cases, you use AI to generate diverse, comprehensive test scenarios.

#### Why Use Dataset Generation?

| Manual Creation | Automated Generation |
|:----------------|:---------------------|
| Time-consuming | Generate 10-100+ cases in minutes |
| Limited coverage | Diverse topic coverage via topic planning |
| Hard to anticipate edge cases | Adds edge cases on request via `update_current_dataset_async()` |
| Requires domain expertise | Generates evaluation rubrics automatically |

#### When to Use Dataset Generation

Use dataset generation when you need:
- Comprehensive coverage across multiple scenarios
- Rapid prototyping during development
- Domain-specific testing based on agent tools
- Regression testing as your agent evolves
- Continuous evaluation across different agent versions

#### Three Generation Strategies

| Strategy | Method | Best For |
|:---------|:-------|:---------|
| From Scratch | `from_scratch_async()` | New agents, broad coverage, exploratory testing |
| From Context | `from_context_async()` | Testing specific tools, API integration scenarios |
| Update Existing | `update_current_dataset_async()` | Adding edge cases, iterative improvement |

#### Dataset Persistence

| Operation | Method | Use Case |
|:----------|:-------|:---------|
| Save | `dataset.to_file('name.json')` | Preserve for reuse, version control |
| Load | `Dataset.from_file('name.json')` | Consistent evaluation, team sharing |
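
To make the save/load round trip concrete, here is a minimal sketch that uses only the standard library. `SketchCase`, `save_cases`, and `load_cases` are hypothetical stand-ins, not part of Strands Evals, and the JSON layout written by `to_file()` may well differ from the one shown here.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class SketchCase:
    """Hypothetical stand-in for a test case; not the strands_evals Case class."""
    name: str
    input: str
    expected_output: str


def save_cases(cases: list[SketchCase], path: Path) -> None:
    # Serialize each case to a plain dict and write pretty-printed JSON.
    path.write_text(json.dumps([asdict(c) for c in cases], indent=2), encoding="utf-8")


def load_cases(path: Path) -> list[SketchCase]:
    # Rebuild the case objects from the stored dicts.
    return [SketchCase(**item) for item in json.loads(path.read_text(encoding="utf-8"))]


original = [
    SketchCase(
        name="roi_check",
        input="Should we invest $500K in a chatbot?",
        expected_output="A recommendation with ROI reasoning",
    )
]

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "sketch_dataset.json"
    save_cases(original, file_path)
    restored = load_cases(file_path)

# The restored cases compare equal to the originals.
print(restored == original)
```

The point of the round trip is that a saved dataset yields the same cases on every load, which is what makes results comparable across evaluation runs.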

### Environment Setup

Configure AWS region and model settings for this tutorial.

In [None]:
import boto3

# AWS Configuration
session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'
DEFAULT_MODEL = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

### Setup and Imports

Import all necessary libraries for agent creation, dataset generation, and evaluation.

In [None]:
# Standard imports
import asyncio
from typing import Dict, List, Any

# Strands imports
from strands import Agent, tool
from strands.multiagent import GraphBuilder

# Strands Evals imports
from strands_evals import Dataset, Case
from strands_evals.generators import DatasetGenerator
from strands_evals.evaluators import OutputEvaluator

# Display utilities
from IPython.display import Markdown, display

### Multi-Agent System with Parallel Execution

We'll create a multi-agent decision-making system with parallel execution. A financial advisor runs first, a technical architect and a market researcher then analyze the problem in parallel, and a risk analyst synthesizes their output into a final recommendation.

#### Agent Code

The following multi-agent code demonstrates parallel execution with memory branching, adapted from graph agent patterns in strands-samples.

In [None]:
# Multi-agent code adapted from: /strands-samples/01-tutorials/02-multi-agent-systems/03-graph-agent/

# Create specialized agents for parallel analysis
financial_advisor = Agent(
    name="financial_advisor",
    system_prompt="You are a financial advisor focused on cost-benefit analysis, budget implications, and ROI calculations. Provide concise financial assessment.",
    model=DEFAULT_MODEL
)

technical_architect = Agent(
    name="technical_architect",
    system_prompt="You are a technical architect who evaluates feasibility, implementation challenges, and technical risks. Provide concise technical assessment.",
    model=DEFAULT_MODEL
)

market_researcher = Agent(
    name="market_researcher",
    system_prompt="You are a market researcher who analyzes market conditions, user needs, and competitive landscape. Provide concise market assessment.",
    model=DEFAULT_MODEL
)

risk_analyst = Agent(
    name="risk_analyst",
    system_prompt="You are a risk analyst who synthesizes input from finance, technical, and market experts to identify potential risks, mitigation strategies, and provide a final recommendation.",
    model=DEFAULT_MODEL
)

# Build the agent graph with parallel execution
builder = GraphBuilder()

# Add nodes
builder.add_node(financial_advisor, "finance_expert")
builder.add_node(technical_architect, "tech_expert")
builder.add_node(market_researcher, "market_expert")
builder.add_node(risk_analyst, "risk_analyst")

# Add edges - parallel execution pattern
# Finance expert feeds into both tech and market experts
builder.add_edge("finance_expert", "tech_expert")
builder.add_edge("finance_expert", "market_expert")

# Both tech and market experts feed into risk analyst
builder.add_edge("tech_expert", "risk_analyst")
builder.add_edge("market_expert", "risk_analyst")

# Set entry point
builder.set_entry_point("finance_expert")

# Build the graph
decision_graph = builder.build()
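
To see why this edge layout allows the two middle agents to run side by side, the following sketch groups the nodes into dependency "waves" with a plain topological sort (Kahn's algorithm). It is an illustration only: `execution_waves` is a hypothetical helper, and it makes no claim about how Strands schedules graph nodes internally.

```python
from collections import defaultdict

# Same edges as the graph above, written as (source, destination) pairs.
edges = [
    ("finance_expert", "tech_expert"),
    ("finance_expert", "market_expert"),
    ("tech_expert", "risk_analyst"),
    ("market_expert", "risk_analyst"),
]


def execution_waves(edges):
    """Group nodes into waves; every node in a wave has all of its dependencies met."""
    indegree = defaultdict(int)
    children = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    # Start with the nodes that depend on nothing.
    wave = sorted(n for n in nodes if indegree[n] == 0)
    waves = []
    while wave:
        waves.append(wave)
        next_wave = []
        for node in wave:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_wave.append(child)
        wave = sorted(next_wave)
    return waves


waves = execution_waves(edges)
for number, wave in enumerate(waves, 1):
    print(number, wave)
```

The second wave contains both `tech_expert` and `market_expert`, which is the parallel step; `risk_analyst` lands in the last wave because it waits on both.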

### Test the Multi-Agent System

Before generating datasets, let's verify the multi-agent system works correctly.

In [None]:
# Test the multi-agent system
test_query = "Should we invest $500K in developing an AI-powered customer service chatbot?"
result = decision_graph(test_query)

# Show execution flow
print("\nExecution Order:")
for node in result.execution_order:
    print(f"  - {node.node_id}")

### Strategy 1: Generate Dataset from Scratch

The `from_scratch_async()` method generates test cases from a list of topics. This strategy is ideal when you want to ensure diverse coverage across multiple domains or scenarios.

#### Key Features
- **Topic-based generation**: Specify topics to ensure comprehensive coverage
- **Auto-rubric generation**: Automatically creates evaluation rubrics
- **Difficulty distribution**: Generates easy, medium, and hard test cases
- **Dataset persistence**: Save to JSON for reuse

In [None]:
# Initialize dataset generator
generator = DatasetGenerator(
    input_type=str,
    output_type=str,
    include_expected_output=True,
    model=DEFAULT_MODEL
)

In [None]:
# Generate dataset from scratch with topics
topics = [
    "technology investments",
    "business process automation",
    "market expansion strategies"
]

# Generate dataset
scratch_dataset = await generator.from_scratch_async(
    topics=topics,
    task_description="Multi-agent decision system that provides recommendations on business investments and strategies",
    num_cases=9,
    evaluator=OutputEvaluator
)
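
When choosing `num_cases` for a topic list, it can help to check how the count would divide across topics. The sketch below assumes an even split, which is only a planning aid: this tutorial does not state how `DatasetGenerator` actually allocates cases, and `split_cases` is a hypothetical helper.

```python
def split_cases(num_cases: int, topics: list[str]) -> dict[str, int]:
    """Spread num_cases across topics as evenly as possible (assumed allocation)."""
    if not topics:
        raise ValueError("at least one topic is required")
    base, extra = divmod(num_cases, len(topics))
    # The first `extra` topics receive one additional case.
    return {
        topic: base + (1 if index < extra else 0)
        for index, topic in enumerate(topics)
    }


plan = split_cases(
    9,
    [
        "technology investments",
        "business process automation",
        "market expansion strategies",
    ],
)
for topic, count in plan.items():
    print(f"{topic}: {count}")
```

With 9 cases and 3 topics the split is exact; a count that is not a multiple of the topic number leaves some topics with one case more than others.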

#### Save Dataset to JSON

Save the generated dataset to a JSON file for reuse in future evaluation runs.

In [None]:
# Save dataset to JSON file
scratch_dataset.to_file('scratch_dataset.json')
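
Because generation calls a model and takes time, a common pattern is to regenerate only when no saved file exists. The sketch below shows that pattern with plain JSON and a stand-in generator; `load_or_generate` and `fake_generate` are hypothetical names, not Strands Evals functions.

```python
import json
import tempfile
from pathlib import Path


def load_or_generate(path: Path, generate) -> list[dict]:
    """Return cached cases if the file exists; otherwise generate and save them."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    cases = generate()
    path.write_text(json.dumps(cases, indent=2), encoding="utf-8")
    return cases


calls = []


def fake_generate() -> list[dict]:
    # Hypothetical stand-in for an expensive, model-backed generation call.
    calls.append(1)
    return [{"name": "case_1", "input": "Should we automate invoicing?"}]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "scratch_dataset.json"
    first = load_or_generate(cache_file, fake_generate)   # generates and saves
    second = load_or_generate(cache_file, fake_generate)  # reads the saved file

print(len(calls), first == second)
```

In the notebook, the same shape would wrap the `from_scratch_async()` call, with `to_file()` and `Dataset.from_file()` taking the place of the JSON helpers.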

#### Preview Generated Test Cases

In [None]:
# Display sample test cases
for i, case in enumerate(scratch_dataset.cases[:3], 1):
    case_info = f"""
**Case {i}: {case.name}**
**Input**: {case.input}
**Expected Output**: {case.expected_output}
    """
    display(Markdown(case_info))

### Strategy 2: Generate Dataset from Context

The `from_context_async()` method generates test cases based on your agent's specific context, such as tool definitions, APIs, or documentation. This ensures test cases are relevant to your agent's actual capabilities.

#### Key Features
- **Context-aware generation**: Uses agent tools and APIs to create relevant tests
- **Topic planning**: Optionally expand context into diverse topics
- **Tool-specific testing**: Generates tests that exercise specific tools
- **Auto-rubric generation**: Creates rubrics aligned with context

In [None]:
# Define agent context (tools and capabilities)
agent_context = """
Multi-agent decision system with the following capabilities:

Agents:
- Financial Advisor: Analyzes costs, ROI, budget impact, financial risks
- Technical Architect: Evaluates technical feasibility, implementation complexity, architecture risks
- Market Researcher: Assesses market demand, competition, user needs, market timing
- Risk Analyst: Synthesizes all inputs to provide final recommendation

Decision Flow:
1. Financial analysis runs first
2. Technical and market analysis run in parallel
3. Risk analyst synthesizes all perspectives

Output Format:
- Financial assessment
- Technical assessment
- Market assessment
- Final risk analysis and recommendation
"""

# Generate dataset with topic planning for diversity
context_dataset = await generator.from_context_async(
    context=agent_context,
    task_description="Multi-agent system that evaluates business decisions across financial, technical, and market dimensions",
    num_cases=12,
    num_topics=4,  # Topic planning: expand into 4 diverse topics
    evaluator=OutputEvaluator
)

#### Save Context-Based Dataset

In [None]:
# Save context-based dataset to JSON
context_dataset.to_file('context_dataset.json')

#### Preview Context-Based Test Cases

In [None]:
# Display sample test cases
for i, case in enumerate(context_dataset.cases[:3], 1):
    case_info = f"""
**Case {i}: {case.name}**
**Input**: {case.input}
**Expected Output**: {case.expected_output[:200]}...
    """
    display(Markdown(case_info))
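
The preview above appends `...` even when the expected output is shorter than 200 characters, and slicing would fail if a case had no expected output. A small helper, sketched here under the hypothetical name `preview`, handles both situations.

```python
def preview(text, limit: int = 200) -> str:
    """Shorten text to `limit` characters, adding an ellipsis only when truncated."""
    if text is None:
        return "(none)"
    text = str(text)
    return text if len(text) <= limit else text[:limit] + "..."


print(preview("Recommend a phased rollout."))
print(len(preview("x" * 250)))
print(preview(None))
```

Inside the f-string, `{preview(case.expected_output)}` could then replace the slice.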

### Strategy 3: Update Existing Dataset with Edge Cases

The `update_current_dataset_async()` method extends an existing dataset by adding new test cases. This is ideal for iteratively improving test coverage by adding edge cases, corner scenarios, or addressing gaps discovered in production.

#### Key Features
- **Incremental improvement**: Add tests without starting from scratch
- **Edge case coverage**: Focus on corner cases and failure scenarios
- **Dataset continuity**: Preserves existing tests while adding new ones
- **Rubric updates**: Optionally update evaluation rubrics

#### Load Existing Dataset from JSON

First, let's load one of our previously saved datasets to demonstrate the update workflow.

In [None]:
# Load existing dataset from JSON
loaded_dataset = Dataset.from_file('scratch_dataset.json')

In [None]:
# Update dataset with edge cases
edge_case_context = """
Add edge cases and challenging scenarios:
- Conflicting financial and technical assessments
- High-risk, high-reward decisions
- Decisions with missing or incomplete information
- Time-sensitive decisions requiring rapid analysis
- Decisions involving ethical considerations
- Scenarios where experts disagree
"""

# Update dataset by adding edge cases
updated_dataset = await generator.update_current_dataset_async(
    source_dataset=loaded_dataset,
    task_description="Multi-agent decision system handling complex and edge case scenarios",
    num_cases=6,
    context=edge_case_context,
    add_new_cases=True
)

print(f"\nOriginal dataset: {len(loaded_dataset.cases)} cases")
print(f"Updated dataset: {len(updated_dataset.cases)} cases")
print(f"New cases added: {len(updated_dataset.cases) - len(loaded_dataset.cases)}")
print(f"\nUpdated rubric: {updated_dataset.evaluator.rubric}")
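
Comparing lengths shows how many cases were added, but not which ones. One way to identify them is to compare case names, sketched below with plain dictionaries. This assumes case names are unique within a dataset, which the tutorial does not guarantee; `new_case_names` is a hypothetical helper.

```python
def new_case_names(original: list[dict], updated: list[dict]) -> list[str]:
    """Return names present in `updated` but not in `original`, in updated order."""
    seen = {case["name"] for case in original}
    return [case["name"] for case in updated if case["name"] not in seen]


# Stand-in data; real cases would come from the loaded and updated datasets.
original_cases = [{"name": "tech_investment_1"}, {"name": "automation_1"}]
updated_cases = original_cases + [
    {"name": "conflicting_assessments"},
    {"name": "missing_information"},
]

added = new_case_names(original_cases, updated_cases)
print(added)
```

This avoids relying on new cases always being appended at the end of the list.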

#### Save Updated Dataset

In [None]:
# Save updated dataset to JSON
updated_dataset.to_file('updated_dataset.json')
print(f"Saved {len(updated_dataset.cases)} test cases to updated_dataset.json")

#### Preview Edge Cases

In [None]:
# Display newly added edge cases (last 3 cases)
new_cases = updated_dataset.cases[-3:]
for i, case in enumerate(new_cases, 1):
    case_info = f"""
**Edge Case {i}: {case.name}**
**Input**: {case.input}
**Expected Output**: {case.expected_output[:200]}...
    """
    display(Markdown(case_info))

### Run Evaluation with Generated Dataset

Now let's evaluate our multi-agent system using one of the generated datasets.

In [None]:
# Define agent task function
def agent_task(case: Case) -> str:
    """
    Execute the multi-agent decision system with the given case input.
    """
    result = decision_graph(case.input)
    return str(result)
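
A single failing case can interrupt a long evaluation run if the task function raises. Whether `run_evaluations()` catches such errors is not covered in this tutorial, so the sketch below wraps a task defensively. `safe_task` and `flaky_task` are hypothetical names used for illustration.

```python
from typing import Any, Callable


def safe_task(task: Callable[[Any], str]) -> Callable[[Any], str]:
    """Wrap a task so exceptions become an error string instead of stopping the run."""
    def wrapper(case: Any) -> str:
        try:
            return task(case)
        except Exception as exc:
            # Keep the error visible in the output so the evaluator can score it.
            return f"ERROR: {type(exc).__name__}: {exc}"
    return wrapper


def flaky_task(case_input: str) -> str:
    # Stand-in for a task function that can fail on some inputs.
    if not case_input:
        raise ValueError("empty input")
    return f"Recommendation for: {case_input}"


guarded = safe_task(flaky_task)
print(guarded("Expand into the EU market?"))
print(guarded(""))
```

Returning the error as text means a failure shows up as a low-scoring case in the report rather than as a missing result.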

In [None]:
# Use first 3 cases for demonstration
eval_dataset = Dataset(
    cases=context_dataset.cases[:3],
    evaluator=context_dataset.evaluator
)

report = eval_dataset.run_evaluations(agent_task)

### Evaluation Results

Display the evaluation results, which were scored against the auto-generated rubric.

In [None]:
# Display evaluation report
report.run_display()

### Dataset Persistence Workflow

The summary below recaps the three datasets generated in this tutorial and shows how to load a saved dataset in a later evaluation session.

In [None]:
# Summary of dataset persistence workflow
workflow_summary = """
## Dataset Persistence Workflow Summary

### Generated Datasets

**1. scratch_dataset.json**
- Strategy: from_scratch_async()
- Topics: technology investments, business process automation, market expansion strategies
- Test cases: 9
- Use case: Broad coverage testing

**2. context_dataset.json**
- Strategy: from_context_async()
- Context: Multi-agent capabilities and decision flow
- Test cases: 12 (with topic planning)
- Use case: Context-aware testing

**3. updated_dataset.json**
- Strategy: update_current_dataset_async()
- Source: scratch_dataset.json + edge cases
- Test cases: 15 (original 9 + 6 new)
- Use case: Iterative improvement with edge cases

### Loading Datasets

```python
# Load any saved dataset
dataset = Dataset.from_file('dataset_name.json')

# Run evaluation
report = dataset.run_evaluations(agent_task)
```

### Benefits

- **Consistency**: Use the same test suite across agent versions
- **Collaboration**: Share datasets with team members
- **Version Control**: Track dataset changes over time
- **Regression Testing**: Ensure new changes don't break existing functionality
- **CI/CD Integration**: Automate evaluation in deployment pipelines
"""

display(Markdown(workflow_summary))
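
For the CI/CD use mentioned in the summary, a pipeline typically needs a pass/fail decision from the evaluation scores. The sketch below shows one possible gate over a plain list of scores between 0 and 1. How scores are read from the evaluation report is not shown in this tutorial, and the thresholds and the `passes_gate` name are illustrative assumptions.

```python
def passes_gate(scores: list[float], min_average: float = 0.8, min_score: float = 0.5) -> bool:
    """Pass only if the average is high enough and no single case scores too low."""
    if not scores:
        # An empty run should never count as a pass.
        return False
    average = sum(scores) / len(scores)
    return average >= min_average and min(scores) >= min_score


print(passes_gate([0.9, 0.85, 0.8]))
print(passes_gate([1.0, 1.0, 0.4]))
print(passes_gate([]))
```

Checking the minimum as well as the average keeps one badly failing case from being hidden by strong results elsewhere.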

## Best Practices for Dataset Generation

### Choosing the Right Strategy

| Strategy | Use When |
|:---------|:---------|
| `from_scratch_async()` | Starting new project, need broad coverage, no detailed context yet |
| `from_context_async()` | Have well-defined tools/APIs, need tests matching actual capabilities |
| `update_current_dataset_async()` | Improving existing dataset, discovered gaps, adding edge cases |

### Key Recommendations

| Area | Recommendation |
|:-----|:---------------|
| Topic Planning | Set `num_topics` between 3 and 6 for diverse coverage |
| Persistence | Save datasets when generation is slow or the dataset will be reused |
| Auto-rubrics | Best with default evaluators; use manual rubrics for specific requirements |
| Iteration | Start broad → add context → refine with edge cases → evaluate → iterate |

## Summary

You've learned how to generate and persist evaluation datasets using Strands Evals. You now understand:

- How to generate test cases from scratch using topics with `from_scratch_async()`
- How to generate contextual test cases from agent capabilities with `from_context_async()`
- How to update existing datasets with edge cases using `update_current_dataset_async()`
- How to save datasets to JSON files with `dataset.to_file()`
- How to load datasets from JSON files with `Dataset.from_file()`
- How to use auto-rubric generation for evaluators
- How to apply topic planning for diverse test coverage
- Best practices for choosing generation strategies

Dataset generation enables you to create comprehensive, diverse test suites quickly, and dataset persistence ensures you can reuse these tests consistently across evaluation runs. This forms the foundation for robust, continuous agent evaluation workflows.