# Full GEPA Optimization Workflow

This notebook walks through an end-to-end GEPA optimization workflow on the AIME 2025 benchmark. The steps are:

1. Setup and imports
2. Loading and exploring the benchmark
3. Configuring optimization
4. Splitting data into train/val/test
5. Creating the verification config
6. Running the baseline evaluation
7. Simulating the optimization loop (with manual prompt iterations)
8. Evaluating on the held-out test set
9. Tracking results
10. Exporting optimized prompts

---

## Step 1: Setup and Imports

In [None]:
import sys
import tempfile
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent.parent.parent / "src"))

# Core Karenina imports
from karenina import Benchmark

# GEPA integration imports
from karenina.integrations.gepa import (
    GEPA_AVAILABLE,
    OptimizationConfig,
    OptimizationRun,
    OptimizationTarget,
    OptimizationTracker,
    compute_improvement,
    compute_single_score,
    export_prompts_json,
    export_to_preset,
    split_benchmark,
)
from karenina.schemas import ModelConfig, VerificationConfig

# Create temp directory for outputs
OUTPUT_DIR = Path(tempfile.mkdtemp(prefix="gepa_workflow_"))

print(f"GEPA available: {GEPA_AVAILABLE}")
print(f"Output directory: {OUTPUT_DIR}")

---

## Step 2: Load and Explore Benchmark

In [None]:
# Load the AIME 2025 benchmark
benchmark_path = Path.home() / "Projects/karenina-monorepo/local_data/data/checkpoints/aime_2025.jsonld"
benchmark = Benchmark.load(benchmark_path)

print(f"Benchmark: {benchmark.name}")
print(f"Description: {benchmark.description}")
print(f"Total questions: {len(benchmark.get_question_ids())}")

In [None]:
# Explore a sample question
question_ids = benchmark.get_question_ids()
sample_q = benchmark.get_question(question_ids[0])

print("Sample AIME problem:")
print(f"  Question: {sample_q['question'][:100]}...")
print(f"  Answer: {sample_q['raw_answer']}")

---

## Step 3: Configure Optimization
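The configuration in the cell below sets `template_weight=1.0` and `rubric_weight=0.0`, so only template correctness counts toward the score. As a rough illustration of how two such weights could fold into a single number, here is a minimal sketch that assumes a normalized weighted average. The helper `weighted_score` is hypothetical and is not part of Karenina; `compute_single_score` may combine the components differently.

```python
def weighted_score(
    template_score: float,
    rubric_score: float,
    template_weight: float = 1.0,
    rubric_weight: float = 0.0,
) -> float:
    """Hypothetical helper: normalized weighted average of two score components."""
    total_weight = template_weight + rubric_weight
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    # Assumption: both component scores lie in [0, 1], so the result does too
    weighted_sum = template_weight * template_score + rubric_weight * rubric_score
    return weighted_sum / total_weight
```

With the AIME settings the rubric term drops out, so the combined score equals the template score.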

In [None]:
# Define the seed prompt (starting point for optimization)
SEED_PROMPT = """You are a helpful math assistant. 
Solve the problem and provide the final answer."""

# Create optimization config
opt_config = OptimizationConfig(
    # What to optimize
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    # Seed prompt
    seed_answering_prompt=SEED_PROMPT,
    # Scoring: correctness only for AIME
    template_weight=1.0,
    rubric_weight=0.0,
    # Data splitting
    train_ratio=0.7,
    val_ratio=0.2,
    test_ratio=0.1,
    split_seed=42,
    # GEPA parameters
    reflection_model="anthropic/claude-haiku-4-5",
    max_metric_calls=100,
)

print("Optimization Configuration:")
print(f"  Targets: {[t.value for t in opt_config.targets]}")
print(f"  Scoring: template={opt_config.template_weight}, rubric={opt_config.rubric_weight}")
print(f"  Split: train={opt_config.train_ratio}, val={opt_config.val_ratio}, test={opt_config.test_ratio}")

---

## Step 4: Split Data
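The cell below splits the benchmark by ratio with a fixed seed so the partition is reproducible. A minimal sketch of that idea follows, using only the standard library. The function `split_ids` is hypothetical and only illustrates a seeded shuffle-and-slice; the actual `split_benchmark` implementation may differ (for example in rounding or stratification).

```python
import random


def split_ids(ids, train_ratio=0.7, val_ratio=0.2, test_ratio=0.1, seed=42):
    """Hypothetical helper: deterministic train/val/test split of a list of IDs."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")

    shuffled = list(ids)
    # A dedicated Random instance keeps the split independent of global RNG state
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * train_ratio)
    # Clamp so train + val never exceeds the total; test takes the remainder
    n_val = min(round(n * val_ratio), n - n_train)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Calling the function twice with the same seed returns the same three lists, which is what makes later comparisons between prompts meaningful.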

In [None]:
# Split the benchmark
split = split_benchmark(
    benchmark,
    train_ratio=opt_config.train_ratio,
    val_ratio=opt_config.val_ratio,
    test_ratio=opt_config.test_ratio,
    seed=opt_config.split_seed,
)

print("Data Split:")
print(f"  {split.summary()}")
print(f"\nTrain set: {len(split.train)} questions (for optimization)")
print(f"Val set: {len(split.val)} questions (for candidate selection)")
print(f"Test set: {len(split.test)} questions (held out for final eval)")

---

## Step 5: Create Verification Config

In [None]:
def create_verification_config(system_prompt: str) -> VerificationConfig:
    """Create a verification config with the given system prompt."""
    return VerificationConfig(
        answering_models=[
            ModelConfig(
                id="claude-haiku",
                model_provider="anthropic",
                model_name="claude-haiku-4-5",
                temperature=0.0,
                interface="langchain",
                system_prompt=system_prompt,
            )
        ],
        parsing_models=[
            ModelConfig(
                id="parser",
                model_provider="anthropic",
                model_name="claude-haiku-4-5",
                temperature=0.0,
                interface="langchain",
            )
        ],
        evaluation_mode="template_only",
        replicate_count=1,
    )


print("Verification config factory created")

---

## Step 6: Run Baseline Evaluation
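The cells below score the seed prompt on a 5-question train sample and a 3-question val sample. With samples this small, a single question moves the val score by a third, so differences between prompts should be read with caution. One way to quantify that uncertainty is a Wilson score interval on the pass rate. The helper `wilson_interval` below is a hypothetical addition for illustration and is not part of Karenina or GEPA.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Hypothetical helper: Wilson score interval for a pass rate (z=1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    p = passed / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, `wilson_interval(2, 3)` gives an interval that spans well over half of the 0 to 1 range, which is why scaling up the sample sizes matters before drawing conclusions.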

In [None]:
def evaluate_prompt(prompt: str, question_ids: list, description: str = "") -> float:
    """Evaluate a system prompt on the given questions and return the mean score."""
    config = create_verification_config(prompt)
    results = benchmark.run_verification(config, question_ids=question_ids)

    # Score each result with the weights from the optimization config
    scores = [
        compute_single_score(
            r,
            template_weight=opt_config.template_weight,
            rubric_weight=opt_config.rubric_weight,
        )
        for r in results.results
    ]
    # Guard against an empty result set to avoid division by zero
    avg_score = sum(scores) / len(scores) if scores else 0.0

    # Count results with a nonzero score as passed
    passed = sum(1 for s in scores if s > 0)
    print(f"{description}: {avg_score:.2%} ({passed}/{len(scores)} passed)")

    return avg_score

In [None]:
# Evaluate the baseline on small train and val samples (subsets for speed)
train_sample_ids = split.train_ids[:5]
val_sample_ids = split.val_ids[:3]

print("Evaluating baseline prompt...")
print(f"  Train sample: {len(train_sample_ids)} questions")
print(f"  Val sample: {len(val_sample_ids)} questions")
print()

baseline_train = evaluate_prompt(SEED_PROMPT, train_sample_ids, "Baseline (train)")
baseline_val = evaluate_prompt(SEED_PROMPT, val_sample_ids, "Baseline (val)")

---

## Step 7: Optimization Loop (Simulated)

In a real GEPA run, the reflection LLM (`reflection_model` in the config) generates revised prompts automatically. Here we simulate that process by evaluating a fixed sequence of hand-written prompts, each adding more task-specific guidance than the last.
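The propose-and-evaluate cycle described above can be sketched as a simple loop. This is a deliberately simplified greedy version, not GEPA's actual search strategy. The names `optimize_prompt`, `evaluate`, and `propose` are hypothetical, and `propose` stands in for the reflection LLM.

```python
from typing import Callable


def optimize_prompt(
    seed: str,
    evaluate: Callable[[str], float],
    propose: Callable[[str, float], str],
    max_metric_calls: int,
) -> tuple[str, float, list[tuple[str, float]]]:
    """Hypothetical greedy loop: keep a candidate only if it beats the current best."""
    best_prompt, best_score = seed, evaluate(seed)
    # Assumption: one metric call = one full evaluation of a prompt
    calls = 1
    history = [(seed, best_score)]

    while calls < max_metric_calls:
        # In a real run this step would ask the reflection model for a revision
        candidate = propose(best_prompt, best_score)
        score = evaluate(candidate)
        calls += 1
        history.append((candidate, score))
        if score > best_score:
            best_prompt, best_score = candidate, score

    return best_prompt, best_score, history
```

The manual loop in the following cells plays the role of `propose` by reading the next entry from a fixed list instead of calling a model.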

In [None]:
# Simulated prompt evolution (what GEPA would generate)
PROMPT_GENERATIONS = [
    # Generation 0: Baseline
    """You are a helpful math assistant. 
Solve the problem and provide the final answer.""",
    # Generation 1: Add AIME context
    """You are a math assistant solving AIME problems.
AIME answers are integers from 0 to 999.
Solve step by step and give the final integer answer.""",
    # Generation 2: Add structure
    """You are an expert mathematician solving AIME competition problems.

Important:
- AIME answers are ALWAYS integers from 0 to 999
- Show your complete reasoning
- State your final answer clearly

Solve the problem:""",
    # Generation 3: Add verification step
    """You are an expert competition mathematician specializing in AIME problems.

CRITICAL GUIDELINES:
1. AIME answers are ALWAYS integers from 0 to 999
2. Show complete step-by-step reasoning
3. Verify your answer by checking edge cases
4. State your final answer as: "The answer is [N]"

Solve systematically:""",
]

In [None]:
# Run optimization loop
print("=" * 60)
print("OPTIMIZATION LOOP")
print("=" * 60)

generation_results = []

for gen, prompt in enumerate(PROMPT_GENERATIONS):
    print(f"\n--- Generation {gen} ---")
    print(f"Prompt: {prompt[:80]}...")

    train_score = evaluate_prompt(prompt, train_sample_ids, f"  Gen {gen} (train)")
    val_score = evaluate_prompt(prompt, val_sample_ids, f"  Gen {gen} (val)")

    improvement = compute_improvement(baseline_val, val_score)
    print(f"  Improvement vs baseline: {improvement:+.2%}")

    generation_results.append(
        {
            "generation": gen,
            "prompt": prompt,
            "train_score": train_score,
            "val_score": val_score,
            "improvement": improvement,
        }
    )

In [None]:
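# Find the best generation by validation score.
# max() returns the first maximal element, so ties go to the earliest generation.
best_result = max(generation_results, key=lambda x: x["val_score"])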

print("\n" + "=" * 60)
print("OPTIMIZATION COMPLETE")
print("=" * 60)
print(f"\nBest generation: {best_result['generation']}")
print(f"Best val score: {best_result['val_score']:.2%}")
print(f"Improvement: {best_result['improvement']:+.2%}")
print("\nBest prompt:")
print(best_result["prompt"])

---

## Step 8: Final Test Evaluation
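The cell below reports a test improvement via `compute_improvement`. This notebook does not show whether that function returns an absolute or a relative difference, so the sketch below spells out both for comparison. The helpers `absolute_improvement` and `relative_improvement` are hypothetical and are not the Karenina implementation.

```python
import math


def absolute_improvement(baseline: float, optimized: float) -> float:
    """Hypothetical helper: difference in score, in score units."""
    return optimized - baseline


def relative_improvement(baseline: float, optimized: float) -> float:
    """Hypothetical helper: difference as a fraction of the baseline score."""
    if baseline == 0:
        # A zero baseline makes the ratio undefined; report inf for any gain
        return math.inf if optimized > 0 else 0.0
    return (optimized - baseline) / baseline
```

The zero-baseline case matters on a hard benchmark with a small test set, where the seed prompt can easily score 0%.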

In [None]:
# Evaluate on held-out test set
print("Evaluating on held-out test set...")
print(f"  Test questions: {len(split.test_ids)}")
print()

# Baseline on test
baseline_test = evaluate_prompt(SEED_PROMPT, split.test_ids, "Baseline (test)")

# Best prompt on test
best_test = evaluate_prompt(best_result["prompt"], split.test_ids, "Optimized (test)")

test_improvement = compute_improvement(baseline_test, best_test)
print(f"\nTest improvement: {test_improvement:+.2%}")

---

## Step 9: Track Results
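The cell below logs the run to a `.db` file through `OptimizationTracker`. To make the idea of a run history concrete, here is a minimal sketch that assumes a SQLite backing store. The schema and the functions `log_run` and `best_run` are hypothetical; the real tracker's storage format and API may differ.

```python
import json
import sqlite3
import uuid

# Hypothetical schema: a few indexed columns plus the full run as JSON
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id TEXT PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    val_score REAL,
    test_score REAL,
    payload TEXT NOT NULL
)
"""


def log_run(conn: sqlite3.Connection, run: dict) -> str:
    """Store one run and return its generated ID."""
    conn.execute(SCHEMA)
    run_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
        (run_id, run["benchmark_name"], run.get("val_score"),
         run.get("test_score"), json.dumps(run)),
    )
    conn.commit()
    return run_id


def best_run(conn: sqlite3.Connection, benchmark_name: str):
    """Return the stored run with the highest val score, or None if there is none."""
    conn.execute(SCHEMA)
    row = conn.execute(
        "SELECT payload FROM runs WHERE benchmark_name = ? "
        "ORDER BY val_score DESC LIMIT 1",
        (benchmark_name,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Selecting by val score rather than test score keeps the test set out of any model-selection decision.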

In [None]:
# Create tracker and log the run
tracker = OptimizationTracker(OUTPUT_DIR / "optimization_history.db")

run = OptimizationRun(
    benchmark_name=benchmark.name,
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT.value],
    seed_prompts={"answering_system_prompt": SEED_PROMPT},
    optimized_prompts={"answering_system_prompt": best_result["prompt"]},
    train_score=best_result["train_score"],
    val_score=best_result["val_score"],
    test_score=best_test,
    improvement=best_result["improvement"],
    reflection_model=opt_config.reflection_model,
    metric_calls=len(PROMPT_GENERATIONS) * (len(train_sample_ids) + len(val_sample_ids)),
    best_generation=best_result["generation"],
    total_generations=len(PROMPT_GENERATIONS),
)

run_id = tracker.log_run(run)

print(f"Logged optimization run: {run_id}")
print(f"  Train: {run.train_score:.2%}")
print(f"  Val: {run.val_score:.2%}")
print(f"  Test: {run.test_score:.2%}")
print(f"  Improvement: {run.improvement:+.2%}")

---

## Step 10: Export Results
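The cells below write the optimized prompt to a preset file and to a lightweight prompts file. As a sketch of what a lightweight export and its round trip might look like, the example below writes prompts and metadata to JSON and reads the prompts back. The file layout (`"prompts"` and `"metadata"` keys) and the functions `save_prompts` and `load_prompts` are assumptions for illustration, not the format produced by `export_prompts_json`.

```python
import json
from pathlib import Path


def save_prompts(prompts: dict, metadata: dict, output_path: Path) -> Path:
    """Hypothetical helper: write prompts and metadata to a JSON file."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumed layout: two top-level keys, one for prompts and one for metadata
    payload = {"prompts": prompts, "metadata": metadata}
    output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return output_path


def load_prompts(path: Path) -> dict:
    """Hypothetical helper: read the prompts back from a file written above."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload["prompts"]
```

Keeping scores in the metadata alongside the prompt text makes it possible to tell later which run produced a given prompt.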

In [None]:
# Export as verification preset
optimized_prompts = {"answering_system_prompt": best_result["prompt"]}
base_config = create_verification_config(SEED_PROMPT)

preset_path = export_to_preset(
    optimized_prompts=optimized_prompts,
    base_config=base_config,
    output_path=OUTPUT_DIR / "aime_optimized_preset.json",
)

print(f"Exported preset: {preset_path}")

In [None]:
# Export as lightweight prompts file
prompts_path = export_prompts_json(
    optimized_prompts=optimized_prompts,
    metadata={
        "benchmark": benchmark.name,
        "train_score": run.train_score,
        "val_score": run.val_score,
        "test_score": run.test_score,
        "improvement": run.improvement,
        "best_generation": run.best_generation,
        "total_generations": run.total_generations,
    },
    output_path=OUTPUT_DIR / "optimized_prompts.json",
)

print(f"Exported prompts: {prompts_path}")

---

## Summary

In [None]:
print("=" * 60)
print("WORKFLOW SUMMARY")
print("=" * 60)

print(f"""
Benchmark: {benchmark.name}
Questions: {len(benchmark.get_question_ids())}

Data Split:
  Train: {len(split.train)} questions
  Val: {len(split.val)} questions  
  Test: {len(split.test)} questions

Optimization:
  Generations: {len(PROMPT_GENERATIONS)}
  Best generation: {best_result["generation"]}
  
Results:
  Baseline (val): {baseline_val:.2%}
  Optimized (val): {best_result["val_score"]:.2%}
  Improvement: {best_result["improvement"]:+.2%}
  
Test Set (held out):
  Baseline: {baseline_test:.2%}
  Optimized: {best_test:.2%}
  Improvement: {test_improvement:+.2%}

Outputs:
  Preset: {preset_path.name}
  Prompts: {prompts_path.name}
  History: optimization_history.db
""")

In [None]:
# List all output files
print("Output files:")
for f in OUTPUT_DIR.iterdir():
    print(f"  {f.name}: {f.stat().st_size} bytes")

---

## Next Steps

1. **Production usage**: Use the exported preset with `karenina verify --preset`
2. **Iterate further**: Run more optimization with different targets
3. **Multi-model**: Add more answering models for Pareto optimization
4. **Scale up**: Use the full train and val splits (the benchmark has 30 questions in total) instead of the 5- and 3-question samples used above

### Using the Preset

```bash
karenina verify aime_2025.jsonld --preset aime_optimized_preset.json --output results.json
```

### Cleanup

In [None]:
# Optional: Clean up (uncomment to delete temp files)
# import shutil
# shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
# print(f"Cleaned up: {OUTPUT_DIR}")

print(f"\nOutput directory preserved at: {OUTPUT_DIR}")
print("Uncomment the lines above and re-run this cell to delete it.")