# GEPA Configuration Deep Dive

This notebook explores the configuration options for GEPA optimization runs in detail. You'll learn how to:

1. Configure optimization targets
2. Set up seed prompts for each target
3. Balance template vs rubric scoring weights
4. Configure GEPA algorithm parameters
5. Set up data splitting strategies
6. Enable LLM-powered feedback generation

---

## Setup

In [None]:
import sys
from pathlib import Path

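# Make the local source tree importable: assumes the notebook runs three levels below the repo root
sys.path.insert(0, str(Path.cwd().parent.parent.parent / "src"))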

from karenina import Benchmark
from karenina.schemas import ModelConfig
from karenina.integrations.gepa import (
    OptimizationConfig,
    OptimizationTarget,
)

# Load the benchmark used in the examples (adjust the path to your local checkpoint)
benchmark_path = Path.home() / "Projects/karenina-monorepo/local_data/data/checkpoints/aime_2025.jsonld"
benchmark = Benchmark.load(benchmark_path)
print(f"Loaded: {benchmark.name} ({len(benchmark.get_question_ids())} questions)")

---

## OptimizationTarget

GEPA can optimize three types of text components in the verification pipeline:

| Target | Description | Use Case |
|--------|-------------|----------|
| `ANSWERING_SYSTEM_PROMPT` | System prompt for the model being evaluated | Improve answer quality |
| `PARSING_INSTRUCTIONS` | Instructions for the judge LLM | Improve parsing accuracy |
| `MCP_TOOL_DESCRIPTIONS` | Descriptions for MCP tools | Guide tool selection |

In [None]:
# All available targets
print("Available OptimizationTarget values:\n")
for target in OptimizationTarget:
    print(f"  {target.name}")
    print(f"    Value: {target.value}")
    print()

### Single Target Optimization

The most common setup optimizes only the answering system prompt.

In [None]:
# Optimize only the answering prompt
single_target_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    seed_answering_prompt="You are a helpful assistant.",
)

print(f"Targets: {single_target_config.targets}")
print(f"Seed prompt: {single_target_config.seed_answering_prompt}")

### Multi-Target Optimization

Optimize several components in the same run, for example when both answer generation and answer parsing contribute to failures.

In [None]:
# Optimize both answering and parsing
multi_target_config = OptimizationConfig(
    targets=[
        OptimizationTarget.ANSWERING_SYSTEM_PROMPT,
        OptimizationTarget.PARSING_INSTRUCTIONS,
    ],
    seed_answering_prompt="You are a math expert. Show your reasoning.",
    seed_parsing_instructions="Extract the integer answer from the response. Look for the final answer.",
)

print("Multi-target configuration:")
print(f"  Targets: {[t.value for t in multi_target_config.targets]}")
print(f"  Answering seed: {multi_target_config.seed_answering_prompt[:50]}...")
print(f"  Parsing seed: {multi_target_config.seed_parsing_instructions[:50]}...")

### MCP Tool Description Optimization

For benchmarks that use MCP tools, optimize the tool descriptions.

In [None]:
# Configure MCP tool optimization
mcp_config = OptimizationConfig(
    targets=[OptimizationTarget.MCP_TOOL_DESCRIPTIONS],
    seed_mcp_tool_descriptions={
        "calculator": "A calculator for basic arithmetic operations.",
        "wolfram_alpha": "Query Wolfram Alpha for mathematical computations.",
        "python_repl": "Execute Python code to solve problems.",
    },
)

print("MCP tool descriptions:")
for tool, desc in mcp_config.seed_mcp_tool_descriptions.items():
    print(f"  {tool}: {desc}")

---

## Seed Prompts

Seed prompts are the starting point for optimization. GEPA evolves these through mutation and reflection.

### Default Seeds

If you don't provide seeds, defaults are used:

In [None]:
# Config without explicit seeds - defaults are applied
auto_seed_config = OptimizationConfig(
    targets=[
        OptimizationTarget.ANSWERING_SYSTEM_PROMPT,
        OptimizationTarget.PARSING_INSTRUCTIONS,
    ],
    # No seeds provided - defaults will be used
)

print("Default seeds:")
print(f"  Answering: {auto_seed_config.seed_answering_prompt}")
print(f"  Parsing: {auto_seed_config.seed_parsing_instructions}")

### Domain-Specific Seeds

For best results, provide domain-specific seed prompts:

In [None]:
# AIME-specific seed prompt
aime_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    seed_answering_prompt="""
You are an expert competition mathematician solving AIME (American Invitational Mathematics Examination) problems.

Key guidelines:
1. AIME answers are always integers from 0 to 999
2. Show your complete reasoning step by step
3. Verify your answer by checking edge cases
4. Box your final integer answer

Solve the following problem:
""".strip(),
)

print("AIME-specific seed:")
print(aime_config.seed_answering_prompt)

### Getting the Seed Candidate

Use `get_seed_candidate()` to build the initial candidate dict for GEPA:

In [None]:
# Build seed candidate for GEPA
config = OptimizationConfig(
    targets=[
        OptimizationTarget.ANSWERING_SYSTEM_PROMPT,
        OptimizationTarget.PARSING_INSTRUCTIONS,
    ],
    seed_answering_prompt="Solve this math problem step by step.",
    seed_parsing_instructions="Extract the final integer answer.",
)

seed_candidate = config.get_seed_candidate()

print("Seed candidate dict:")
for key, value in seed_candidate.items():
    # Truncate long seeds to a 50-character preview
    preview = f"{value[:50]}..." if len(value) > 50 else value
    print(f"  {key}: {preview}")
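
A candidate is a mapping from component name to the text being optimized. The sketch below shows one way such a mapping could be assembled from the configured targets. The `build_seed_candidate` helper and the key names are assumptions made for illustration; they are not karenina's actual `get_seed_candidate()` implementation.

```python
def build_seed_candidate(targets, seeds):
    """Return a {component_name: seed_text} dict for the selected targets.

    Hypothetical helper: the key names are illustrative only.
    """
    missing = [t for t in targets if t not in seeds]
    if missing:
        raise ValueError(f"No seed text for targets: {missing}")
    # Keep the order in which targets were requested
    return {t: seeds[t] for t in targets}


seeds = {
    "answering_system_prompt": "Solve this math problem step by step.",
    "parsing_instructions": "Extract the final integer answer.",
}
candidate = build_seed_candidate(
    ["answering_system_prompt", "parsing_instructions"], seeds
)
for name, text in candidate.items():
    print(f"{name}: {text}")
```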

---

## Scoring Weights

Control how template correctness and rubric quality are balanced in the optimization score.

**Formula**: `score = template_weight * template_score + rubric_weight * rubric_score`
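
To make the formula concrete, here is a small standalone sketch that applies it to a few score pairs. The `combined_score` function and its tolerance check are assumptions for illustration; only the weighted-sum formula and the sum-to-1.0 requirement come from this notebook.

```python
import math


def combined_score(template_score, rubric_score,
                   template_weight=0.7, rubric_weight=0.3):
    """Weighted sum of template correctness and rubric quality."""
    # The tolerance is an assumption; the notebook only states the sum must be 1.0
    if not math.isclose(template_weight + rubric_weight, 1.0, abs_tol=1e-6):
        raise ValueError("template_weight and rubric_weight must sum to 1.0")
    return template_weight * template_score + rubric_weight * rubric_score


# Correct answer (template_score=1.0) with a middling rubric score of 0.5
print(combined_score(1.0, 0.5))            # 0.7 / 0.3 weights
print(combined_score(1.0, 0.5, 1.0, 0.0))  # correctness-only weights
# Wrong answer (template_score=0.0) with a strong rubric score of 0.9
print(combined_score(0.0, 0.9, 0.3, 0.7))  # quality-focused weights
```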

In [None]:
# Default weights (70% correctness, 30% quality)
default_weights = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    template_weight=0.7,
    rubric_weight=0.3,
)

print(f"Default: template={default_weights.template_weight}, rubric={default_weights.rubric_weight}")

In [None]:
# Correctness-only (for factual benchmarks like AIME)
correctness_only = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    template_weight=1.0,
    rubric_weight=0.0,
)

print(f"Correctness-only: template={correctness_only.template_weight}, rubric={correctness_only.rubric_weight}")

In [None]:
# Quality-focused (for open-ended tasks)
quality_focused = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    template_weight=0.3,
    rubric_weight=0.7,
)

print(f"Quality-focused: template={quality_focused.template_weight}, rubric={quality_focused.rubric_weight}")

In [None]:
# Weights must sum to 1.0 - this will raise an error
try:
    invalid_weights = OptimizationConfig(
        targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
        template_weight=0.5,
        rubric_weight=0.3,  # Sum = 0.8, not 1.0
    )
except ValueError as e:
    print(f"Validation error: {e}")

---

## GEPA Algorithm Parameters

Configure the GEPA optimization algorithm:

In [None]:
# Full GEPA parameter configuration
gepa_params_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    
    # Reflection model for prompt mutation
    reflection_model="anthropic/claude-haiku-4-5",  # LiteLLM format
    
    # Optimization budget (number of evaluations)
    max_metric_calls=150,
    
    # Candidate selection strategy
    candidate_selection_strategy="pareto",  # "pareto", "current_best", "epsilon_greedy"
)

print("GEPA parameters:")
print(f"  Reflection model: {gepa_params_config.reflection_model}")
print(f"  Max metric calls: {gepa_params_config.max_metric_calls}")
print(f"  Selection strategy: {gepa_params_config.candidate_selection_strategy}")

### Candidate Selection Strategies

| Strategy | Description | Best For |
|----------|-------------|---------|
| `pareto` | Multi-objective Pareto optimization | Multi-model benchmarks |
| `current_best` | Greedily select highest scorer | Single-objective tasks |
| `epsilon_greedy` | Explore vs exploit tradeoff | Avoiding local optima |
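
The three strategies in the table can be sketched on toy data as follows. This is a simplified illustration, not GEPA's implementation: the candidate names, the per-example scores, the dominance-based Pareto front with a uniform pick, and the `epsilon` parameter are all assumptions.

```python
import random

# Hypothetical scores: one value per validation example for each candidate
scores = {
    "seed":     [1.0, 0.0, 0.0, 0.0],
    "mutant_a": [1.0, 1.0, 1.0, 0.0],
    "mutant_b": [0.0, 0.0, 0.0, 1.0],
    "mutant_c": [0.0, 1.0, 0.0, 0.0],
}


def mean(values):
    return sum(values) / len(values)


def select_current_best(scores):
    """Greedy: pick the candidate with the highest mean score."""
    return max(scores, key=lambda name: mean(scores[name]))


def select_epsilon_greedy(scores, epsilon, rng):
    """Explore a random candidate with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))
    return select_current_best(scores)


def dominates(a, b):
    """a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Candidates that no other candidate dominates."""
    return [
        name for name, own in scores.items()
        if not any(dominates(other, own)
                   for other_name, other in scores.items() if other_name != name)
    ]


def select_pareto(scores, rng):
    """Pick uniformly from the Pareto front (a simplification)."""
    return rng.choice(pareto_front(scores))


rng = random.Random(0)
print("current_best:", select_current_best(scores))
print("pareto front:", pareto_front(scores))
print("epsilon_greedy:", select_epsilon_greedy(scores, epsilon=0.2, rng=rng))
```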

In [None]:
# Multi-model Pareto optimization
pareto_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    candidate_selection_strategy="pareto",
)
print(f"Pareto: {pareto_config.candidate_selection_strategy}")

# Greedy best selection
greedy_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    candidate_selection_strategy="current_best",
)
print(f"Greedy: {greedy_config.candidate_selection_strategy}")

# Epsilon-greedy exploration
explore_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    candidate_selection_strategy="epsilon_greedy",
)
print(f"Explore: {explore_config.candidate_selection_strategy}")

---

## Data Splitting Configuration

Configure how the benchmark is split into train/val/test sets.

### Ratio-Based Splitting

In [None]:
# Default 80/20 split
default_split = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    train_ratio=0.8,
    val_ratio=0.2,
)
print(f"Default: train={default_split.train_ratio}, val={default_split.val_ratio}")

# With test set
with_test = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    split_seed=42,  # For reproducibility
)
print(f"With test: train={with_test.train_ratio}, val={with_test.val_ratio}, test={with_test.test_ratio}")
print(f"  Seed: {with_test.split_seed}")
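
For intuition, a seeded ratio split can be sketched in plain Python as below. The `split_question_ids` function, its rounding rule, and the choice to give the remainder to the last split are assumptions for illustration; karenina's own splitting logic may differ.

```python
import random


def split_question_ids(question_ids, train_ratio, val_ratio,
                       test_ratio=None, seed=None):
    """Shuffle IDs with a seeded RNG and cut them into train/val/test lists."""
    total = train_ratio + val_ratio + (test_ratio or 0.0)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"Split ratios must sum to 1.0, got {total}")

    ids = list(question_ids)
    random.Random(seed).shuffle(ids)  # the same seed gives the same order

    n_train = round(len(ids) * train_ratio)
    if test_ratio is None:
        # Two-way split: everything after the training slice is validation
        return ids[:n_train], ids[n_train:], []
    n_val = round(len(ids) * val_ratio)
    # Three-way split: the test set receives whatever remains
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


ids = [f"q{i:02d}" for i in range(30)]
train, val, test = split_question_ids(ids, 0.7, 0.15, 0.15, seed=42)
print(len(train), len(val), len(test))
```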

In [None]:
# Ratios must sum to 1.0 - this will raise an error
try:
    invalid_ratios = OptimizationConfig(
        targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
        train_ratio=0.5,
        val_ratio=0.3,
        test_ratio=0.1,  # Sum = 0.9, not 1.0
    )
except ValueError as e:
    print(f"Validation error: {e}")

### Explicit Question ID Lists

For precise control, specify exact question IDs:

In [None]:
# Get question IDs from benchmark
all_ids = benchmark.get_question_ids()

# Explicit ID-based splitting
explicit_split = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    train_question_ids=all_ids[:20],  # First 20 for training
    val_question_ids=all_ids[20:26],  # Next 6 for validation
    test_question_ids=all_ids[26:],   # Remaining questions for testing (4 in a 30-question benchmark)
)

print("Explicit split:")
print(f"  Train: {len(explicit_split.train_question_ids)} questions")
print(f"  Val: {len(explicit_split.val_question_ids)} questions")
print(f"  Test: {len(explicit_split.test_question_ids)} questions")
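
When you build the ID lists by hand, it is worth confirming that no question appears in more than one split. This notebook does not state whether `OptimizationConfig` checks for overlap, so the hypothetical `find_split_overlaps` helper below is a standalone sanity check under that assumption.

```python
def find_split_overlaps(train_ids, val_ids, test_ids=()):
    """Return {(split_a, split_b): [shared IDs]} for every overlapping pair."""
    splits = {"train": set(train_ids), "val": set(val_ids), "test": set(test_ids)}
    names = list(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = splits[first] & splits[second]
            if shared:
                overlaps[(first, second)] = sorted(shared)
    return overlaps


example_ids = [f"q{i:02d}" for i in range(30)]

# Clean split: slices that do not overlap
print(find_split_overlaps(example_ids[:20], example_ids[20:26], example_ids[26:]))

# Off-by-one slice: q19 lands in both train and val
print(find_split_overlaps(example_ids[:20], example_ids[19:26], example_ids[26:]))
```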

---

## Feedback Generation

Enable LLM-powered feedback for richer diagnostics during optimization.

In [None]:
# Configure feedback model
feedback_model = ModelConfig(
    id="feedback-haiku",
    model_provider="anthropic",
    model_name="claude-haiku-4-5",
    temperature=0.7,
    interface="langchain",
)

feedback_config = OptimizationConfig(
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    
    # Enable LLM feedback
    feedback_model=feedback_model,
    
    # Enable differential analysis (compare successful vs failed traces)
    enable_differential_analysis=True,
)

print("Feedback configuration:")
print(f"  Model: {feedback_config.feedback_model.model_name}")
print(f"  Differential analysis: {feedback_config.enable_differential_analysis}")
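
Differential analysis contrasts successful and failed traces. One way to picture it is a prompt that places both groups side by side for the feedback model. The `Trace` record, the `build_differential_prompt` helper, and the prompt wording below are all hypothetical; they illustrate the idea and do not reproduce karenina's internals.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical record of one evaluated question."""
    question_id: str
    response: str
    passed: bool


def build_differential_prompt(traces, max_per_group=2):
    """Contrast passing and failing traces in a single feedback prompt."""
    passed = [t for t in traces if t.passed][:max_per_group]
    failed = [t for t in traces if not t.passed][:max_per_group]
    if not passed or not failed:
        return None  # nothing to contrast without both groups

    lines = [
        "Compare the successful and failed responses below.",
        "Explain what the failed responses are missing.",
        "",
        "Successful:",
    ]
    lines += [f"- [{t.question_id}] {t.response}" for t in passed]
    lines += ["", "Failed:"]
    lines += [f"- [{t.question_id}] {t.response}" for t in failed]
    return "\n".join(lines)


traces = [
    Trace("q1", "Checked both cases. Final answer: 42", True),
    Trace("q2", "The answer is probably around 17.", False),
]
print(build_differential_prompt(traces))
```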

---

## Complete Configuration Example

Putting it all together for the AIME benchmark:

In [None]:
# Complete AIME optimization configuration
complete_config = OptimizationConfig(
    # What to optimize
    targets=[
        OptimizationTarget.ANSWERING_SYSTEM_PROMPT,
        OptimizationTarget.PARSING_INSTRUCTIONS,
    ],
    
    # Domain-specific seeds
    seed_answering_prompt="""
You are an expert competition mathematician solving AIME problems.
AIME answers are always integers from 0 to 999.
Show complete step-by-step reasoning and box your final answer.
""".strip(),
    seed_parsing_instructions="""
Extract the final integer answer from the response.
Look for boxed answers or the last integer mentioned.
Return only the integer value (0-999).
""".strip(),
    
    # Scoring: correctness-focused for AIME
    template_weight=1.0,
    rubric_weight=0.0,
    
    # Data splitting
    train_ratio=0.7,
    val_ratio=0.2,
    test_ratio=0.1,
    split_seed=42,
    
    # GEPA parameters
    reflection_model="anthropic/claude-haiku-4-5",
    max_metric_calls=100,
    candidate_selection_strategy="pareto",
    
    # Feedback (optional)
    feedback_model=feedback_model,
    enable_differential_analysis=True,
)

print("Complete AIME Configuration:")
print(f"  Targets: {[t.value for t in complete_config.targets]}")
print(f"  Weights: template={complete_config.template_weight}, rubric={complete_config.rubric_weight}")
print(f"  Split: train={complete_config.train_ratio}, val={complete_config.val_ratio}, test={complete_config.test_ratio}")
print(f"  Reflection model: {complete_config.reflection_model}")
print(f"  Max calls: {complete_config.max_metric_calls}")
print(f"  Selection: {complete_config.candidate_selection_strategy}")
print(f"  Feedback enabled: {complete_config.feedback_model is not None}")

---

## Summary

| Parameter | Description | Default |
|-----------|-------------|---------|
| `targets` | What to optimize | Required |
| `seed_*` | Initial prompts | Built-in defaults |
| `template_weight` | Weight for correctness | 0.7 |
| `rubric_weight` | Weight for quality | 0.3 |
| `reflection_model` | LLM for mutations | `openai/gpt-4o` |
| `max_metric_calls` | Evaluation budget | 150 |
| `candidate_selection_strategy` | How to pick candidates | `pareto` |
| `train_ratio` | Training set fraction | 0.8 |
| `val_ratio` | Validation set fraction | 0.2 |
| `test_ratio` | Test set fraction | None |
| `split_seed` | Random seed for splitting | None |
| `feedback_model` | LLM for diagnostics | None |
| `enable_differential_analysis` | Compare success/failure | True |

## Next Steps

- [03_data_splitting.ipynb](03_data_splitting.ipynb) - Advanced splitting strategies
- [04_scoring_deep_dive.ipynb](04_scoring_deep_dive.ipynb) - Understanding score computation