# GEPA Prompt Optimization Tutorial

This notebook demonstrates how to use the GEPA (Generate, Evaluate, Produce, Assess) optimizer to improve your prompts systematically.

## What is GEPA?

GEPA is a structured methodology for prompt optimization:

1. **Generate**: Run tests with your current prompt
2. **Evaluate**: Assess the quality of outputs using an LLM judge
3. **Produce**: Create an improved version based on feedback
4. **Assess**: Validate the improvement and decide whether to continue
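
The four steps form a loop. The sketch below shows one way such a loop could be structured; it is not the `GEPAOptimizer` implementation. The function `gepa_loop` and the callables `run_prompt`, `judge`, and `rewrite` are hypothetical stand-ins, and the stopping rules are only assumed to mirror the `max_iterations`, `target_score`, and `min_improvement` arguments used later in this notebook.

```python
from typing import Callable, List, Tuple


def gepa_loop(
    prompt: str,
    inputs: List[str],
    run_prompt: Callable[[str, str], str],   # (prompt, input) -> output
    judge: Callable[[str, str], float],      # (input, output) -> score in [0, 1]
    rewrite: Callable[[str, float], str],    # (prompt, score) -> revised prompt
    max_iterations: int = 3,
    target_score: float = 0.9,
    min_improvement: float = 0.05,
) -> Tuple[str, float]:
    """Hypothetical GEPA-style loop: returns the best prompt and its score."""

    def score_prompt(candidate: str) -> float:
        # Generate: run every test input through the candidate prompt.
        outputs = [run_prompt(candidate, text) for text in inputs]
        # Evaluate: average the judge's per-case scores.
        scores = [judge(text, out) for text, out in zip(inputs, outputs)]
        return sum(scores) / len(scores)

    best_prompt, best_score = prompt, score_prompt(prompt)
    for _ in range(max_iterations):
        if best_score >= target_score:
            break
        # Produce: ask for a revised prompt based on the current result.
        candidate = rewrite(best_prompt, best_score)
        candidate_score = score_prompt(candidate)
        # Assess: stop unless the candidate is a meaningful improvement.
        if candidate_score - best_score < min_improvement:
            break
        best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins: the "model" echoes the prompt and the "judge" rewards longer output.
best, score = gepa_loop(
    prompt="v0",
    inputs=["hello"],
    run_prompt=lambda p, x: p,
    judge=lambda x, out: min(1.0, len(out) / 10),
    rewrite=lambda p, s: p + "+more",
)
```

In this sketch a candidate that fails the Assess step is discarded, so the returned prompt is always the best one seen so far.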

## Setup

In [None]:
# Imports
import sys
from datetime import datetime
from pathlib import Path

# Add project root to path (assumes the notebook runs from a direct subdirectory of the project)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.optimizer import GEPAOptimizer, LLMEvaluator, TestCaseLoader, TestCaseBuilder
from src.prompts.template import apply_prompt_template

## Example 1: Quick Validation

Let's start by validating a simple prompt without optimization.

In [None]:
# Define a simple prompt
simple_prompt = """
You are a helpful assistant that answers questions clearly and concisely.
Always respond in a professional tone.
"""

# Create test cases
builder = TestCaseBuilder()
builder.add("What is Python?", expected="Should provide clear definition")
builder.add("Explain machine learning", expected="Should give concise explanation")
builder.add("Hello!", expected="Should greet professionally")

test_cases = builder.build()
print(f"Created {len(test_cases)} test cases")

In [None]:
# Create a simple evaluator
async def simple_evaluator(result):
    evaluator = LLMEvaluator(
        criteria={
            "clarity": "Response should be clear and easy to understand",
            "conciseness": "Response should be concise, not verbose",
            "professionalism": "Tone should be professional"
        },
        guidelines=["Be helpful", "Stay on topic"]
    )
    return await evaluator.evaluate(result)

In [None]:
# Create optimizer (we'll just use it for validation, no optimization)
optimizer = GEPAOptimizer(
    base_prompt=simple_prompt,
    evaluation_fn=simple_evaluator,
    agent_type="claude-sonnet-4",
    verbose=True
)

# Run just the generate + evaluate steps
results = await optimizer.generate(test_cases)
evaluation = await optimizer.evaluate(results)

print(f"\n{'='*60}")
print(f"Score: {evaluation['avg_score']:.2f}")
print(f"Pass Rate: {evaluation['pass_rate']:.1%}")
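
The `avg_score` and `pass_rate` values printed above summarize the per-test scores. As an illustration of how such aggregates could be derived, here is a minimal sketch. The `summarize` helper and the 0.7 pass threshold are assumptions made for this example, not values taken from `GEPAOptimizer`.

```python
from typing import Dict, List


def summarize(scores: List[float], pass_threshold: float = 0.7) -> Dict[str, float]:
    """Aggregate per-test scores into an average and a pass rate.

    pass_threshold is a hypothetical cutoff chosen for this example.
    """
    if not scores:
        return {"avg_score": 0.0, "pass_rate": 0.0}
    passed = sum(1 for s in scores if s >= pass_threshold)
    return {
        "avg_score": sum(scores) / len(scores),
        "pass_rate": passed / len(scores),
    }


# Illustrative scores only, not real evaluation output.
summary = summarize([0.9, 0.6, 0.8, 0.7])
print(f"Score: {summary['avg_score']:.2f}")
print(f"Pass Rate: {summary['pass_rate']:.1%}")
```

The two numbers can diverge: a single very low score pulls the average down while barely moving the pass rate, so it helps to read them together.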

## Example 2: Full Optimization

Now let's optimize a prompt using the full GEPA cycle.

In [None]:
# Load test cases from file
loader = TestCaseLoader()
coordinator_tests = loader.load("coordinator_tests.yaml")

print(f"Loaded {len(coordinator_tests)} test cases:")
for i, test in enumerate(coordinator_tests[:3], 1):
    print(f"{i}. {test['input'][:50]}...")

In [None]:
# Load current coordinator prompt
current_prompt = apply_prompt_template(
    prompt_name="coordinator",
    prompt_context={"CURRENT_TIME": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
)

print(f"Current prompt length: {len(current_prompt)} characters")
print(f"\nFirst 200 chars:\n{current_prompt[:200]}...")

In [None]:
# Define custom evaluator for coordinator
async def coordinator_evaluator(result):
    output = result.get("output", "")
    metadata = result.get("metadata", {})
    
    # Check structural correctness
    has_handoff = "handoff_to_planner" in output.lower()
    expected_action = metadata.get("expected_action", "")
    should_handoff = expected_action == "handoff_to_planner"
    
    # Use LLM evaluator
    llm_eval = LLMEvaluator(
        criteria={
            "handoff_logic": "Should correctly decide when to handoff to planner",
            "response_quality": "Response should be appropriate and helpful",
            "tone": "Should be friendly but professional"
        },
        guidelines=[
            "Simple greetings handled directly",
            "Complex tasks handed off to planner",
            "Clear handoff marker when handing off"
        ]
    )
    
    eval_result = await llm_eval.evaluate(result)
    
    # Adjust score based on handoff correctness
    if has_handoff == should_handoff:
        eval_result["score"] = min(1.0, eval_result["score"] + 0.1)
    else:
        eval_result["score"] = max(0.0, eval_result["score"] - 0.3)
        eval_result["feedback"] = f"[HANDOFF ERROR] {eval_result['feedback']}"
    
    return eval_result
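
The score adjustment at the end of `coordinator_evaluator` can be isolated as a pure function, which makes the clamping behavior easy to check without calling an LLM. The helper name `adjust_for_handoff` is hypothetical; the bonus and penalty values are the ones used in the evaluator above.

```python
def adjust_for_handoff(score: float, has_handoff: bool, should_handoff: bool) -> float:
    """Apply the handoff bonus or penalty and clamp the result to [0.0, 1.0]."""
    if has_handoff == should_handoff:
        # Correct decision: small bonus, capped at 1.0.
        return min(1.0, score + 0.1)
    # Wrong decision: larger penalty, floored at 0.0.
    return max(0.0, score - 0.3)


capped = adjust_for_handoff(0.95, has_handoff=True, should_handoff=True)
floored = adjust_for_handoff(0.2, has_handoff=True, should_handoff=False)
```

Because the penalty (0.3) is larger than the bonus (0.1), a wrong handoff decision costs more than a correct one gains.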

In [None]:
# Create optimizer with guidelines
guidelines = [
    "Use clear XML tags for structure",
    "Provide concrete examples",
    "Define clear decision criteria",
    "Be specific about constraints"
]

improvement_context = """
**System Prompt Best Practices:**
- Use structural tags like <role>, <instructions>, <constraints>
- Provide examples for each major scenario
- Define measurable success criteria
- Specify both what to do and what NOT to do
- Keep language clear and direct
"""

optimizer = GEPAOptimizer(
    base_prompt=current_prompt,
    evaluation_fn=coordinator_evaluator,
    agent_type="claude-sonnet-4",
    guidelines=guidelines,
    improvement_context=improvement_context,
    enable_reasoning=False,
    verbose=True
)

In [None]:
# Run optimization (limit to 2 iterations for demo)
best_version = await optimizer.optimize(
    test_cases=coordinator_tests,
    max_iterations=2,
    target_score=0.9,
    min_improvement=0.05
)

In [None]:
# Review results
if best_version:
    print("\n" + "="*60)
    print("OPTIMIZATION SUMMARY")
    print("="*60)
    print(f"Versions created: {len(optimizer.versions)}")
    print(f"Best version: {best_version.version}")
    print(f"Best score: {best_version.score:.3f}")
    print(f"Pass rate: {best_version.metadata.get('pass_rate', 0):.1%}")
    
    print("\nFirst 300 chars of optimized prompt:")
    print(best_version.prompt[:300] + "...")

## Example 3: Compare Versions

Let's compare all versions created during optimization.

In [None]:
# Compare scores across versions
import matplotlib.pyplot as plt

versions = [v.version for v in optimizer.versions]
scores = [v.score for v in optimizer.versions]

plt.figure(figsize=(10, 5))
plt.plot(versions, scores, marker='o', linewidth=2, markersize=8)
plt.xlabel('Version', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('GEPA Optimization Progress', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.ylim(0, 1)
plt.axhline(y=0.9, color='g', linestyle='--', label='Target Score')
plt.legend()
plt.tight_layout()
plt.show()

print("\nScore progression:")
for v in optimizer.versions:
    print(f"  Version {v.version}: {v.score:.3f} (Pass rate: {v.metadata.get('pass_rate', 0):.1%})")

## Example 4: Save and Export

Save the optimized prompt and optimization history.

In [None]:
# Save optimized prompt
if best_version:
    output_path = project_root / "artifacts" / "coordinator_optimized.md"
    # parents=True also creates any missing intermediate directories
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(best_version.prompt)
    
    print(f"✓ Saved optimized prompt to: {output_path}")
    
    # Save history
    history_path = project_root / "artifacts" / "optimization_history.json"
    optimizer.save_history(str(history_path))
    print(f"✓ Saved optimization history to: {history_path}")

## Example 5: Creating Custom Test Cases

Build custom test cases programmatically.

In [None]:
# Create test cases for a hypothetical agent
builder = TestCaseBuilder()

# Add various test cases
builder.add(
    "Calculate the mean of [1, 2, 3, 4, 5]",
    expected="Should perform calculation and return 3",
    metadata={"category": "math", "priority": "high"}
)

builder.add(
    "What's the weather like?",
    expected="Should politely indicate inability to access weather data",
    metadata={"category": "limitation", "priority": "medium"}
)

builder.add(
    "Generate a Python function to sort a list",
    expected="Should generate working Python code with explanation",
    metadata={"category": "code_generation", "priority": "high"}
)

# Build and save
custom_tests = builder.build()
builder.save("custom_agent_tests.yaml")

print(f"Created {len(custom_tests)} custom test cases")
print("Saved to: data/test_cases/custom_agent_tests.yaml")

## Example 6: Using Simple Evaluator

`SimpleEvaluator` runs quick structural checks without LLM overhead.

In [None]:
from src.optimizer.llm_evaluator import SimpleEvaluator

# Create evaluator with structural rules
simple_eval = SimpleEvaluator(
    required_keywords=["handoff_to_planner"],
    min_length=10,
    max_length=500
)

# Test it
test_result = {
    "input": "Analyze data",
    "output": "handoff_to_planner: I'll analyze this data for you."
}

eval_result = simple_eval.evaluate(test_result)
print(f"Score: {eval_result['score']:.2f}")
print(f"Feedback: {eval_result['feedback']}")
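
To make the idea concrete, here is a minimal sketch of a rule-based check of this kind. It is not the project's `SimpleEvaluator`: the class name `RuleBasedEvaluator`, its scoring scheme (fraction of rules passed), and the feedback wording are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RuleBasedEvaluator:
    """Hypothetical stand-in: scores an output by the fraction of rules it passes."""

    required_keywords: List[str] = field(default_factory=list)
    min_length: int = 0
    max_length: int = 10_000

    def evaluate(self, result: Dict[str, str]) -> Dict[str, object]:
        output = result.get("output", "")
        problems: List[str] = []

        # One rule per required keyword (case-insensitive match).
        for keyword in self.required_keywords:
            if keyword.lower() not in output.lower():
                problems.append(f"missing keyword: {keyword}")
        # Two length rules: lower and upper bound.
        if len(output) < self.min_length:
            problems.append("output too short")
        if len(output) > self.max_length:
            problems.append("output too long")

        total_rules = len(self.required_keywords) + 2
        score = (total_rules - len(problems)) / total_rules
        return {"score": score, "feedback": "; ".join(problems) or "all checks passed"}


checker = RuleBasedEvaluator(
    required_keywords=["handoff_to_planner"], min_length=10, max_length=500
)
ok = checker.evaluate({"output": "handoff_to_planner: I'll analyze this data for you."})
bad = checker.evaluate({"output": "Sure."})
```

Checks like these are deterministic and cost nothing to run, which makes them suitable as a first filter before any LLM-based evaluation.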

## Best Practices Summary

1. **Start with validation** before full optimization
2. **Use realistic test cases** that cover edge cases
3. **Define clear evaluation criteria** with specific metrics
4. **Iterate gradually** (3-5 iterations are usually sufficient)
5. **Review improvements manually** before deploying
6. **Save optimization history** for auditing
7. **Combine evaluators** (Simple for structure, LLM for quality)
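
Point 7 can be sketched as a two-stage evaluator: run the cheap structural check first and call the LLM-based evaluator only when it passes. Everything here is illustrative. `combined_evaluator`, `structural_check`, and `fake_llm_judge` are hypothetical names, the 500-character limit is an arbitrary choice, and the stub judge stands in for a real `LLMEvaluator` call.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Result = Dict[str, str]
Evaluation = Dict[str, object]


def structural_check(result: Result) -> bool:
    """Cheap gate: the output must be non-empty and at most 500 characters."""
    output = result.get("output", "")
    return 0 < len(output) <= 500


async def fake_llm_judge(result: Result) -> Evaluation:
    """Stub for an LLM-based evaluator; returns a fixed, made-up score."""
    return {"score": 0.8, "feedback": "stubbed quality score"}


async def combined_evaluator(
    result: Result,
    quality_fn: Callable[[Result], Awaitable[Evaluation]] = fake_llm_judge,
) -> Evaluation:
    # Fail fast on structure so no LLM call is spent on malformed output.
    if not structural_check(result):
        return {"score": 0.0, "feedback": "[STRUCTURE] output empty or too long"}
    return await quality_fn(result)


empty = asyncio.run(combined_evaluator({"output": ""}))
valid = asyncio.run(combined_evaluator({"output": "Hello! How can I help?"}))
```

Inside a notebook, where an event loop is already running, replace `asyncio.run(...)` with a plain `await`.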

## Next Steps

- Try optimizing other agents (planner, supervisor)
- Create custom evaluators for domain-specific needs
- Build comprehensive test suites for your use cases
- Integrate GEPA into your CI/CD pipeline
- Experiment with different guidelines and contexts