# GEPA Integration Quick Start

This notebook introduces the GEPA (Genetic-Pareto) integration in Karenina. GEPA is a framework for automatically optimizing the prompts and instructions used in LLM evaluation pipelines.

## What is GEPA?

GEPA combines evolutionary search with LLM reflection to iteratively improve text components such as:
- **Answering system prompts**: The system prompt given to models being evaluated
- **Parsing instructions**: Instructions for the judge LLM that extracts structured data
- **MCP tool descriptions**: Descriptions that guide model tool selection
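
To make the idea concrete, here is a deliberately simplified sketch of a reflect-and-select loop. It is not GEPA's actual algorithm, which among other things keeps more than a single best candidate. The names `KEYWORDS`, `toy_score`, `toy_reflect`, and `evolve` are hypothetical. The metric is a keyword check standing in for benchmark accuracy, and the reflection step is a plain function standing in for an LLM call.

```python
from typing import Callable, Tuple

# Hypothetical keywords a "good" prompt should mention (toy metric only).
KEYWORDS = ("step by step", "integer", "verify")


def toy_score(prompt: str) -> float:
    """Fraction of KEYWORDS present in the prompt (stand-in for benchmark accuracy)."""
    return sum(k in prompt for k in KEYWORDS) / len(KEYWORDS)


def toy_reflect(prompt: str) -> str:
    """Stand-in for LLM reflection: propose a prompt that adds the first missing keyword."""
    for k in KEYWORDS:
        if k not in prompt:
            return f"{prompt} Remember: {k}."
    return prompt


def evolve(
    seed: str,
    score: Callable[[str], float],
    reflect: Callable[[str], str],
    max_metric_calls: int,
) -> Tuple[str, float]:
    """Greedy loop: propose a candidate, score it, keep it only if it improves."""
    best, best_score = seed, score(seed)
    calls = 1  # the seed evaluation counts against the budget
    while calls < max_metric_calls and best_score < 1.0:
        candidate = reflect(best)
        candidate_score = score(candidate)
        calls += 1
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


best_prompt, best_score = evolve(
    "Solve the problem step by step.", toy_score, toy_reflect, max_metric_calls=10
)
print(best_prompt)
print(best_score)
```

The budget argument plays the same role as `max_metric_calls` in the configuration later in this notebook: each candidate evaluation consumes one call, so a small budget caps how many proposals can be tried.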

## Prerequisites

```bash
# Install karenina with GEPA support (quotes keep shells like zsh from expanding the brackets)
pip install "karenina[gepa]"

# Or install GEPA separately
pip install gepa

# Set your API key
export ANTHROPIC_API_KEY="your-api-key"
```

---

## Setup and Imports

In [None]:
import sys
from pathlib import Path

# Add karenina to path (for development)
sys.path.insert(0, str(Path.cwd().parent.parent.parent / "src"))

# Core imports
from karenina import Benchmark

# GEPA integration imports
from karenina.integrations.gepa import (
    GEPA_AVAILABLE,
    OptimizationConfig,
    OptimizationTarget,
    OptimizationTracker,
    compute_single_score,
    export_to_preset,
    split_benchmark,
)
from karenina.schemas import ModelConfig, VerificationConfig

print(f"GEPA available: {GEPA_AVAILABLE}")

---

## Load the AIME 2025 Benchmark

We'll use the AIME 2025 benchmark, which contains 30 math problems with integer answers (0-999).

In [None]:
# Load the benchmark (adjust this path to match your local checkout)
benchmark_path = Path.home() / "Projects/karenina-monorepo/local_data/data/checkpoints/aime_2025.jsonld"
benchmark = Benchmark.load(benchmark_path)

print(f"Benchmark: {benchmark.name}")
print(f"Description: {benchmark.description}")
print(f"Questions: {len(benchmark.get_question_ids())}")

---

## Quick Overview of Key Components

### 1. OptimizationTarget

Specifies what text components can be optimized:

In [None]:
# Available optimization targets
for target in OptimizationTarget:
    print(f"{target.name}: {target.value}")

### 2. OptimizationConfig

Configuration for a GEPA optimization run:

In [None]:
# Create an optimization config
opt_config = OptimizationConfig(
    # What to optimize
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
    # Initial seed prompt
    seed_answering_prompt="You are a helpful math assistant. Solve the problem step by step.",
    # Scoring weights (must sum to 1.0)
    template_weight=0.7,  # Weight for correctness
    rubric_weight=0.3,  # Weight for quality traits
    # Data splitting
    train_ratio=0.7,
    val_ratio=0.2,
    test_ratio=0.1,
    split_seed=42,
    # GEPA parameters
    reflection_model="anthropic/claude-haiku-4-5",
    max_metric_calls=50,
)

print("Optimization Config:")
print(f"  Targets: {[t.value for t in opt_config.targets]}")
print(f"  Weights: template={opt_config.template_weight}, rubric={opt_config.rubric_weight}")
print(f"  Split: train={opt_config.train_ratio}, val={opt_config.val_ratio}, test={opt_config.test_ratio}")
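
The two weights suggest a weighted sum of a correctness score and a quality score. The sketch below shows that arithmetic under the assumption that both component scores lie in [0, 1]. `combine_scores` is a hypothetical helper written for illustration, not the implementation of `compute_single_score`.

```python
import math


def combine_scores(
    template_score: float,
    rubric_score: float,
    template_weight: float = 0.7,
    rubric_weight: float = 0.3,
) -> float:
    """Weighted sum of two component scores; the weights must sum to 1.0."""
    if not math.isclose(template_weight + rubric_weight, 1.0):
        raise ValueError("template_weight and rubric_weight must sum to 1.0")
    return template_weight * template_score + rubric_weight * rubric_score


# A correct answer (template score 1.0) with half of the rubric traits satisfied.
print(combine_scores(template_score=1.0, rubric_score=0.5))

# Template-only scoring: the rubric contributes nothing.
print(combine_scores(1.0, 0.0, template_weight=1.0, rubric_weight=0.0))
```

With the 0.7/0.3 weights used above, a correct answer that satisfies no rubric traits would still score 0.7 under this assumption, so correctness dominates the combined score.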

### 3. Benchmark Splitting

Split the benchmark into train/val/test sets:

In [None]:
# Split the benchmark
split = split_benchmark(
    benchmark,
    train_ratio=0.7,
    val_ratio=0.2,
    test_ratio=0.1,
    seed=42,
)

print(split.summary())
print(f"\nTrain question IDs: {split.train_ids[:3]}...")
print(f"Val question IDs: {split.val_ids[:2]}...")
print(f"Test question IDs: {split.test_ids[:1] if split.test_ids else 'None'}...")
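
For intuition about what a seeded ratio split does, here is a minimal standalone sketch. It is an assumption about the general technique, not the code behind `split_benchmark`; `split_ids` and the placeholder question IDs are hypothetical. For 30 questions and a 0.7/0.2/0.1 split it yields 21/6/3 questions.

```python
import random
from typing import List, Sequence, Tuple


def split_ids(
    ids: Sequence[str],
    train_ratio: float = 0.7,
    val_ratio: float = 0.2,
    test_ratio: float = 0.1,
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle IDs with a fixed seed, then slice into train/val/test."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # local RNG: does not touch global state
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the test set takes the remainder
    return train, val, test


ids = [f"q{i:02d}" for i in range(30)]  # placeholder IDs, not real question IDs
train, val, test = split_ids(ids)
print(len(train), len(val), len(test))
```

Fixing the seed makes the partition reproducible: calling the function again with the same IDs and seed returns the same three lists, which is what allows scores from different optimization runs to be compared.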

### 4. KareninaDataInst

Each split contains `KareninaDataInst` objects, the GEPA-compatible representation of questions:

In [None]:
# Inspect a data instance
sample_inst = split.train[0]

print(f"Question ID: {sample_inst.question_id[:50]}...")
print(f"Question: {sample_inst.question_text[:100]}...")
print(f"Answer: {sample_inst.raw_answer}")
print(f"Has template: {len(sample_inst.template_code) > 0}")
print(f"Has rubric: {sample_inst.rubric is not None}")

---

## Run Verification (Baseline)

Before optimization, let's run verification to establish a baseline score:

In [None]:
# Create verification config with Claude Haiku
verification_config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a helpful math assistant. Solve the problem step by step and provide the final integer answer.",
        )
    ],
    parsing_models=[
        ModelConfig(
            id="claude-haiku-parser",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
        )
    ],
    evaluation_mode="template_only",
    replicate_count=1,
)

print("Verification config created")
print(f"  Answering model: {verification_config.answering_models[0].model_name}")
print(f"  Parsing model: {verification_config.parsing_models[0].model_name}")

In [None]:
# Run verification on a subset (the first 5 training questions) to get a baseline
sample_ids = split.train_ids[:5]

print(f"Running verification on {len(sample_ids)} questions...")
results = benchmark.run_verification(verification_config, question_ids=sample_ids)

# Score each result on template correctness only, matching evaluation_mode="template_only"
scores = []
for result in results:
    score = compute_single_score(result, template_weight=1.0, rubric_weight=0.0)
    scores.append(score)
    status = "✓" if score > 0 else "✗"
    print(f"  {status} {result.metadata.question_id[:40]}... Score: {score:.2f}")

baseline_score = sum(scores) / len(scores) if scores else 0.0
print(f"\nBaseline Score: {baseline_score:.2%}")

---

## Optimization Tracking

Track optimization runs for reproducibility and comparison:

In [None]:
import tempfile

from karenina.integrations.gepa import OptimizationRun

# Create a tracker (using temp directory for this demo)
temp_dir = Path(tempfile.mkdtemp())
tracker = OptimizationTracker(temp_dir / "optimization_history.db")

# Log a sample run (simulating what would happen after optimization)
sample_run = OptimizationRun(
    benchmark_name="AIME 2025",
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT.value],
    seed_prompts={"answering_system_prompt": "You are a helpful math assistant."},
    optimized_prompts={
        "answering_system_prompt": "You are an expert competition mathematician. Solve AIME problems systematically."
    },
    train_score=0.75,
    val_score=0.70,
    test_score=0.68,
    improvement=0.15,  # 15% improvement
    reflection_model="anthropic/claude-haiku-4-5",
    metric_calls=50,
    best_generation=8,
    total_generations=10,
)

run_id = tracker.log_run(sample_run)
print(f"Logged run: {run_id}")

# Retrieve the run
retrieved = tracker.get_run(run_id)
print("\nRetrieved run:")
print(f"  Benchmark: {retrieved.benchmark_name}")
print(f"  Val Score: {retrieved.val_score:.2%}")
print(f"  Improvement: {retrieved.improvement:.2%}")
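
The `.db` file name suggests a database-backed store. As a rough illustration of the log-then-retrieve pattern, the sketch below persists runs in SQLite as JSON payloads. This is an assumption for illustration only: the table layout and the helpers `open_store`, `log_run`, and `get_run` are hypothetical and do not describe how `OptimizationTracker` stores its data.

```python
import json
import sqlite3
import uuid
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tiny run store with one JSON payload per run."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def log_run(conn: sqlite3.Connection, run: dict) -> str:
    """Insert a run and return its generated ID."""
    run_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO runs (run_id, payload) VALUES (?, ?)", (run_id, json.dumps(run))
    )
    conn.commit()
    return run_id


def get_run(conn: sqlite3.Connection, run_id: str) -> Optional[dict]:
    """Fetch a run by ID, or None if it does not exist."""
    row = conn.execute(
        "SELECT payload FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()  # in-memory database for the demo
demo_id = log_run(conn, {"benchmark_name": "AIME 2025", "val_score": 0.70, "improvement": 0.15})
print(get_run(conn, demo_id))
```

Storing the seed and optimized prompts alongside the scores, as the `OptimizationRun` above does, is what makes a logged run reproducible later.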

---

## Export Optimized Prompts

Export optimized prompts as a Karenina preset for reuse:

In [None]:
from karenina.integrations.gepa import export_prompts_json

# Export as a verification preset
optimized_prompts = {
    "answering_system_prompt": "You are an expert competition mathematician specialized in AIME problems. Always show your work step by step and box the final integer answer."
}

preset_path = export_to_preset(
    optimized_prompts=optimized_prompts,
    base_config=verification_config,
    output_path=temp_dir / "optimized_preset.json",
)

print(f"Exported preset to: {preset_path}")

# Also export as lightweight JSON
prompts_path = export_prompts_json(
    optimized_prompts=optimized_prompts,
    metadata={
        "benchmark": "AIME 2025",
        "improvement": 0.15,
        "train_score": 0.75,
        "val_score": 0.70,
    },
    output_path=temp_dir / "optimized_prompts.json",
)

print(f"Exported prompts to: {prompts_path}")
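
Before reusing an exported file, it can help to check what it actually contains. The sketch below reads a JSON file and lists its top-level keys without assuming any particular schema. `top_level_keys` is a hypothetical helper, and the demo file with `prompts` and `metadata` keys is a made-up stand-in, not the format written by `export_prompts_json`.

```python
import json
import tempfile
from pathlib import Path


def top_level_keys(path: Path) -> list:
    """Return the sorted top-level keys of a JSON object file (empty list otherwise)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return sorted(data) if isinstance(data, dict) else []


# Demo on a stand-in file; this layout is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_prompts.json"
    demo_payload = {
        "prompts": {"answering_system_prompt": "You are an expert competition mathematician."},
        "metadata": {"benchmark": "AIME 2025"},
    }
    demo_path.write_text(json.dumps(demo_payload), encoding="utf-8")
    demo_keys = top_level_keys(demo_path)

print(demo_keys)
```

Pointing `top_level_keys` at `prompts_path` or `preset_path` from the cell above would show the real structure of the exported files.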

---

## Summary

In this quickstart, we covered:

1. **Loading a benchmark** - Using the AIME 2025 benchmark with 30 math problems
2. **OptimizationTarget** - The three types of text components that can be optimized
3. **OptimizationConfig** - Configuration for GEPA optimization runs
4. **Benchmark splitting** - Creating train/val/test sets for optimization
5. **Running verification** - Establishing a baseline score
6. **OptimizationTracker** - Persisting optimization runs for reproducibility
7. **Exporting results** - Saving optimized prompts as presets

## Next Steps

- [02_configuration.ipynb](02_configuration.ipynb) - Deep dive into OptimizationConfig
- [03_data_splitting.ipynb](03_data_splitting.ipynb) - Advanced splitting strategies
- [04_scoring_deep_dive.ipynb](04_scoring_deep_dive.ipynb) - Understanding score computation
- [05_karenina_adapter.ipynb](05_karenina_adapter.ipynb) - Using the KareninaAdapter
- [09_full_optimization_workflow.ipynb](09_full_optimization_workflow.ipynb) - Complete end-to-end example