# KareninaAdapter: GEPA Integration

The `KareninaAdapter` is the core integration class that bridges GEPA's optimization framework with Karenina's verification pipeline. This notebook covers:

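1. Adapter initialization and configuration
2. The `evaluate()` method for scoring candidates
3. Capturing execution trajectories
4. Multi-model evaluation
5. Building reflective datasets for GEPA
6. Configuring an LLM feedback generator
7. Where the adapter fits in a GEPA optimization loop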

---

## Setup

In [None]:
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent.parent.parent / "src"))

from karenina import Benchmark
from karenina.schemas import VerificationConfig, ModelConfig
from karenina.integrations.gepa import (
    GEPA_AVAILABLE,
    OptimizationTarget,
    split_benchmark,
)

# Check if GEPA is available
print(f"GEPA available: {GEPA_AVAILABLE}")

if GEPA_AVAILABLE:
    from karenina.integrations.gepa import KareninaAdapter
else:
    print("Note: GEPA not installed. Install with: pip install gepa")
    print("This notebook will show the API but won't execute GEPA-specific code.")

In [None]:
# Load benchmark
benchmark_path = Path.home() / "Projects/karenina-monorepo/local_data/data/checkpoints/aime_2025.jsonld"
benchmark = Benchmark.load(benchmark_path)

# Create a data split
split = split_benchmark(
    benchmark,
    train_ratio=0.7,
    val_ratio=0.2,
    test_ratio=0.1,
    seed=42,
)

print(f"Loaded: {benchmark.name}")
print(split.summary())
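
To make the ratio and seed arguments concrete, here is a minimal sketch of a seeded ratio split over a plain list. `ratio_split` is a hypothetical helper written for illustration; it is not `split_benchmark`'s implementation, which may stratify or round differently.

```python
import random

def ratio_split(items, train_ratio=0.7, val_ratio=0.2, test_ratio=0.1, seed=42):
    """Shuffle `items` with a fixed seed, then slice into train/val/test lists."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG: global random state is untouched
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test

train, val, test = ratio_split(range(30))
print(len(train), len(val), len(test))
```

Reusing the same seed reproduces the same partition, which keeps scores comparable across optimization runs.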

---

## Creating a VerificationConfig

The adapter needs a base `VerificationConfig` that it will modify with optimized components.
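
One way to picture "modify with optimized components" is a copy-and-override step that leaves the base config untouched. The sketch below uses hypothetical stand-in dataclasses (`ModelStub`, `ConfigStub`) and a hypothetical `inject_candidate` helper; how the adapter actually applies a candidate is not shown in this notebook.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelStub:
    """Hypothetical stand-in for a model config; only the fields used here."""
    id: str
    system_prompt: str = ""

@dataclass(frozen=True)
class ConfigStub:
    """Hypothetical stand-in for a verification config."""
    answering_models: tuple = ()

def inject_candidate(base, candidate):
    """Return a copy of `base` whose answering models use the candidate prompt."""
    prompt = candidate.get("answering_system_prompt")
    if prompt is None:
        return base  # nothing to override
    models = tuple(replace(m, system_prompt=prompt) for m in base.answering_models)
    return replace(base, answering_models=models)

base = ConfigStub(answering_models=(ModelStub("haiku", "You are a helpful assistant."),))
patched = inject_candidate(base, {"answering_system_prompt": "You are an AIME expert."})
print(patched.answering_models[0].system_prompt)
print(base.answering_models[0].system_prompt)  # the base config is unchanged
```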

In [None]:
# Base verification config
base_config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="claude-haiku",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
            # This prompt will be replaced by GEPA during optimization
            system_prompt="You are a helpful assistant.",
        ),
    ],
    parsing_models=[
        ModelConfig(
            id="claude-haiku-parser",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
        )
    ],
    evaluation_mode="template_only",
    replicate_count=1,
)

print("Base config created")
print(f"  Answering model: {base_config.answering_models[0].model_name}")
print(f"  Parsing model: {base_config.parsing_models[0].model_name}")

---

## Initializing KareninaAdapter

The adapter connects GEPA to Karenina's verification pipeline.

In [None]:
if GEPA_AVAILABLE:
    # Create the adapter
    adapter = KareninaAdapter(
        benchmark=benchmark,
        base_config=base_config,
        targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
        template_weight=1.0,  # AIME: correctness only
        rubric_weight=0.0,
        feedback_model_config=None,  # Optional: for LLM feedback
        enable_differential_analysis=True,
    )
    
    print("KareninaAdapter initialized:")
    print(f"  Targets: {adapter.targets}")
    print(f"  Template weight: {adapter.template_weight}")
    print(f"  Rubric weight: {adapter.rubric_weight}")
else:
    print("Skipping adapter creation (GEPA not installed)")

### Adapter Parameters

| Parameter | Description |
|-----------|-------------|
| `benchmark` | Karenina Benchmark with questions |
| `base_config` | VerificationConfig to modify |
| `targets` | Which components to optimize |
| `template_weight` | Weight for correctness (0-1) |
| `rubric_weight` | Weight for quality (0-1) |
| `feedback_model_config` | Optional ModelConfig for LLM feedback |
| `enable_differential_analysis` | Compare success vs failure traces |
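
The two weights suggest a blended score. The sketch below shows one plausible combination, a weighted average normalized by the sum of the weights. This formula and the `combined_score` name are assumptions for illustration; the adapter's actual scoring rule is not documented in this notebook.

```python
def combined_score(template_passed, rubric_score, template_weight=1.0, rubric_weight=0.0):
    """Weighted average of correctness (0 or 1) and rubric quality (0.0-1.0)."""
    total = template_weight + rubric_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score must lie in [0, 1]")
    correctness = 1.0 if template_passed else 0.0
    return (template_weight * correctness + rubric_weight * rubric_score) / total

# Correctness-only weighting, as used for AIME in this notebook
print(combined_score(True, 0.4))
# Equal weighting: a wrong answer still earns partial credit for quality
print(combined_score(False, 0.9, template_weight=0.5, rubric_weight=0.5))
```

With `template_weight=1.0` and `rubric_weight=0.0`, the rubric term drops out and the score reduces to pass/fail.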

---

## The evaluate() Method

`evaluate()` is the core method of the GEPA adapter interface. It scores a candidate (a dict mapping component names to prompt text) on a batch of questions.

In [None]:
if GEPA_AVAILABLE:
    # Define a candidate (the prompts to evaluate)
    candidate = {
        "answering_system_prompt": """
You are an expert competition mathematician solving AIME problems.

IMPORTANT:
- AIME answers are ALWAYS integers from 0 to 999
- Show your complete step-by-step reasoning
- Verify your answer before submitting
- State your final answer clearly at the end
""".strip()
    }
    
    print("Candidate prompt:")
    print(candidate["answering_system_prompt"])

In [None]:
if GEPA_AVAILABLE:
    # Evaluate on a small batch (first 3 training questions)
    batch = split.train[:3]
    
    print(f"Evaluating on {len(batch)} questions...")
    
    eval_result = adapter.evaluate(
        batch=batch,
        candidate=candidate,
        capture_traces=False,  # Don't capture trajectories yet
    )
    
    print("\nEvaluation results:")
    print(f"  Outputs: {len(eval_result.outputs)} results")
    print(f"  Scores: {eval_result.scores}")
    # Guard against an empty batch to avoid ZeroDivisionError
    if eval_result.scores:
        print(f"  Average score: {sum(eval_result.scores) / len(eval_result.scores):.2%}")

### Understanding EvaluationBatch

The `evaluate()` method returns an `EvaluationBatch` with:

| Field | Type | Description |
|-------|------|-------------|
| `outputs` | list[dict] | VerificationResults per question (keyed by model) |
| `scores` | list[float] | Score per question (0.0-1.0) |
| `trajectories` | list[KareninaTrajectory] | Execution traces (if capture_traces=True) |
| `objective_scores` | list[dict] | Per-model scores for Pareto optimization |
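
For readers who want to experiment without running the pipeline, the shape of the return value can be mirrored with a small dataclass. `EvaluationBatchSketch` and its `mean_score` helper are hypothetical; only the four field names come from the table above.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional

@dataclass
class EvaluationBatchSketch:
    """Hypothetical mirror of the four fields listed in the table."""
    outputs: list
    scores: list
    trajectories: Optional[list] = None       # filled only when traces are captured
    objective_scores: Optional[list] = None   # one {model: score} dict per question

    def mean_score(self) -> float:
        """Average score, or 0.0 for an empty batch (avoids dividing by zero)."""
        return fmean(self.scores) if self.scores else 0.0

batch_sketch = EvaluationBatchSketch(
    outputs=[{"claude-haiku": "result-1"}, {"claude-haiku": "result-2"}],
    scores=[1.0, 0.0],
)
print(batch_sketch.mean_score())
```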

In [None]:
if GEPA_AVAILABLE:
    # Inspect the outputs
    for i, (output, score) in enumerate(zip(eval_result.outputs, eval_result.scores)):
        print(f"\nQuestion {i+1}:")
        for model_name, result in output.items():
            passed = result.template.verify_result if result.template else False
            status = "PASS" if passed else "FAIL"
            print(f"  {model_name}: {status} (score: {score:.2f})")

---

## Capturing Trajectories

Trajectories are detailed execution traces used for GEPA's reflective feedback.
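
As a rough mental model, a trajectory can be thought of as a record of one model's attempt at one question, plus helpers that summarize it. The `TrajectorySketch` class below is a hypothetical stand-in, not Karenina's `KareninaTrajectory`; in particular, the pass rule and the feedback keys are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrajectorySketch:
    """Hypothetical stand-in for a trajectory; not Karenina's actual class."""
    question_text: str
    model_name: str
    score: float
    raw_llm_response: Optional[str] = None
    parsing_error: Optional[str] = None
    failed_fields: list = field(default_factory=list)

    def passed(self, threshold: float = 1.0) -> bool:
        """Assumed rule: no parsing error and a score at or above the threshold."""
        return self.parsing_error is None and self.score >= threshold

    def to_feedback_dict(self) -> dict:
        """Flatten the trace into plain values a reflection prompt can embed."""
        return {
            "question": self.question_text,
            "model": self.model_name,
            "passed": self.passed(),
            "response": self.raw_llm_response or "",
            "parsing_error": self.parsing_error,
            "failed_fields": list(self.failed_fields),
        }

traj_sketch = TrajectorySketch(
    question_text="Find the remainder when 2^10 is divided by 7.",
    model_name="claude-haiku-4-5",
    score=0.0,
    raw_llm_response="The answer is 4.",
    failed_fields=["answer"],
)
print(traj_sketch.to_feedback_dict())
```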

In [None]:
if GEPA_AVAILABLE:
    # Evaluate with trajectory capture
    eval_with_traces = adapter.evaluate(
        batch=split.train[:2],  # Just 2 questions
        candidate=candidate,
        capture_traces=True,  # Capture trajectories
    )
    
    print(f"Captured {len(eval_with_traces.trajectories)} trajectories")

In [None]:
if GEPA_AVAILABLE and eval_with_traces.trajectories:
    # Inspect a trajectory
    traj = eval_with_traces.trajectories[0]
    
    print("KareninaTrajectory fields:")
    print(f"  Question: {traj.data_inst.question_text[:60]}...")
    print(f"  Model: {traj.model_name}")
    print(f"  Score: {traj.score:.2f}")
    print(f"  Passed: {traj.passed()}")
    print(f"  Raw response: {traj.raw_llm_response[:100] if traj.raw_llm_response else 'None'}...")
    print(f"  Parsing error: {traj.parsing_error}")
    print(f"  Failed fields: {traj.failed_fields}")
    print(f"  Rubric scores: {traj.rubric_scores}")

In [None]:
if GEPA_AVAILABLE and eval_with_traces.trajectories:
    # Convert trajectory to feedback dict (for GEPA)
    feedback_dict = traj.to_feedback_dict()
    
    print("Feedback dict for GEPA:")
    for key, value in feedback_dict.items():
        if isinstance(value, str) and len(value) > 100:
            print(f"  {key}: {value[:100]}...")
        else:
            print(f"  {key}: {value}")

---

## Multi-Model Evaluation

When the base config lists several answering models, the adapter evaluates the candidate against each of them and reports per-model objective scores.
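
Per-model scores matter because a prompt that helps one model can hurt another. The sketch below shows the standard Pareto-dominance test over per-model scores. `dominates`, `pareto_front`, and the example numbers are illustrative assumptions; this is not GEPA's internal selection code.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Made-up per-model scores for three candidate prompts
candidates = {
    "prompt_a": {"haiku": 0.6, "sonnet": 0.9},
    "prompt_b": {"haiku": 0.8, "sonnet": 0.7},
    "prompt_c": {"haiku": 0.5, "sonnet": 0.7},
}
print(pareto_front(candidates))
```

Here `prompt_c` is dropped because `prompt_a` beats it on both models, while `prompt_a` and `prompt_b` each win on one model and both stay on the front.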

In [None]:
# Create config with multiple answering models
multi_model_config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="haiku",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a helpful assistant.",
        ),
        ModelConfig(
            id="sonnet",
            model_provider="anthropic",
            model_name="claude-sonnet-4-5",
            temperature=0.0,
            interface="langchain",
            system_prompt="You are a helpful assistant.",
        ),
    ],
    parsing_models=[
        ModelConfig(
            id="parser",
            model_provider="anthropic",
            model_name="claude-haiku-4-5",
            temperature=0.0,
            interface="langchain",
        )
    ],
    evaluation_mode="template_only",
    replicate_count=1,
)

print("Multi-model config:")
for model in multi_model_config.answering_models:
    print(f"  - {model.id}: {model.model_name}")

In [None]:
if GEPA_AVAILABLE:
    # Create multi-model adapter
    multi_adapter = KareninaAdapter(
        benchmark=benchmark,
        base_config=multi_model_config,
        targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
        template_weight=1.0,
        rubric_weight=0.0,
    )
    
    # Evaluate
    multi_eval = multi_adapter.evaluate(
        batch=split.train[:2],
        candidate=candidate,
        capture_traces=True,
    )
    
    print("Multi-model evaluation:")
    print(f"  Total trajectories: {len(multi_eval.trajectories)}")
    print(f"  Objective scores: {multi_eval.objective_scores}")

---

## Building Reflective Datasets

The `make_reflective_dataset()` method builds feedback for GEPA's reflection LLM.
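
The result maps each component name to a list of examples with `Inputs`, `Generated Outputs`, and `Feedback` keys. A minimal sketch of assembling that structure from plain records follows; `build_reflective_dataset`, the record fields, and the feedback wording are hypothetical and only illustrate the shape.

```python
def build_reflective_dataset(records, components):
    """Turn evaluation records into {component: [example, ...]} for reflection."""
    examples = []
    for rec in records:
        if rec["passed"]:
            feedback = "Correct. The final answer matched the expected value."
        else:
            feedback = f"Incorrect. Expected {rec['expected']}, but the response did not match."
        examples.append({
            "Inputs": {"question": rec["question"]},
            "Generated Outputs": rec["response"],
            "Feedback": feedback,
        })
    # Each component being updated receives its own copy of the example list
    return {component: list(examples) for component in components}

records = [
    {"question": "What is 6 * 7?", "response": "42", "expected": "42", "passed": True},
    {"question": "What is 9 * 9?", "response": "80", "expected": "81", "passed": False},
]
dataset = build_reflective_dataset(records, ["answering_system_prompt"])
print(dataset["answering_system_prompt"][1]["Feedback"])
```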

In [None]:
if GEPA_AVAILABLE:
    # Build reflective dataset from evaluation
    reflective_data = adapter.make_reflective_dataset(
        candidate=candidate,
        eval_batch=eval_with_traces,
        components_to_update=["answering_system_prompt"],
    )
    
    print("Reflective dataset:")
    for component, examples in reflective_data.items():
        print(f"\n  Component: {component}")
        print(f"  Examples: {len(examples)}")
        
        if examples:
            ex = examples[0]
            print("\n  Sample example:")
            print(f"    Inputs: {list(ex.get('Inputs', {}).keys())}")
            # str() keeps the slice safe if a value is not already a string
            print(f"    Generated Outputs: {str(ex.get('Generated Outputs', ''))[:80]}...")
            print(f"    Feedback: {str(ex.get('Feedback', ''))[:100]}...")

---

## With LLM Feedback Generator

For richer feedback, pass a `ModelConfig` as `feedback_model_config` so the adapter can use an LLM feedback generator.

In [None]:
if GEPA_AVAILABLE:
    # Create adapter with LLM feedback
    feedback_model = ModelConfig(
        id="feedback-haiku",
        model_provider="anthropic",
        model_name="claude-haiku-4-5",
        temperature=0.7,
        interface="langchain",
    )
    
    adapter_with_feedback = KareninaAdapter(
        benchmark=benchmark,
        base_config=base_config,
        targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
        template_weight=1.0,
        rubric_weight=0.0,
        feedback_model_config=feedback_model,
        enable_differential_analysis=True,
    )
    
    print("Adapter with LLM feedback:")
    print(f"  Feedback generator: {adapter_with_feedback.feedback_generator is not None}")
    print(f"  Differential analysis: {adapter_with_feedback.enable_differential_analysis}")
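
Differential analysis is described above only as comparing success and failure traces. One simple reading is to group runs by question and keep the questions that have both outcomes, so a reflection step can contrast them. The `differential_pairs` helper and the record fields below are assumptions for illustration, not the adapter's implementation.

```python
from collections import defaultdict

def differential_pairs(runs):
    """Group responses by question and keep questions with both a pass and a fail."""
    by_question = defaultdict(lambda: {"passed": [], "failed": []})
    for run in runs:
        bucket = "passed" if run["passed"] else "failed"
        by_question[run["question_id"]][bucket].append(run["response"])
    # Only mixed-outcome questions offer a success/failure contrast
    return {q: g for q, g in by_question.items() if g["passed"] and g["failed"]}

runs = [
    {"question_id": "q1", "passed": True, "response": "Checked both cases: 204"},
    {"question_id": "q1", "passed": False, "response": "Skipped a case: 196"},
    {"question_id": "q2", "passed": False, "response": "No final answer stated"},
]
print(differential_pairs(runs))
```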

---

## Integration with GEPA

Here's how the adapter is used in a full GEPA optimization loop (conceptual):

In [None]:
# Conceptual GEPA optimization loop
print("""
# Full GEPA integration (conceptual)

import gepa

# 1. Create adapter
adapter = KareninaAdapter(
    benchmark=benchmark,
    base_config=base_config,
    targets=[OptimizationTarget.ANSWERING_SYSTEM_PROMPT],
)

# 2. Define seed candidate
seed_candidate = {
    "answering_system_prompt": "You are a math expert."
}

# 3. Run GEPA optimization
result = gepa.optimize(
    seed_candidate=seed_candidate,
    trainset=split.train,
    valset=split.val,
    adapter=adapter,
    max_metric_calls=100,
    reflection_lm="anthropic/claude-haiku-4-5",
)

# 4. Get optimized prompts
optimized = result.best_candidate
print(f"Optimized prompt: {optimized['answering_system_prompt']}")
print(f"Best validation score: {result.val_aggregate_scores[result.best_idx]:.2%}")
""")

---

## Summary

| Method | Purpose |
|--------|--------|
| `__init__()` | Initialize adapter with benchmark and config |
| `evaluate()` | Score a candidate on a batch of questions |
| `make_reflective_dataset()` | Build feedback for GEPA reflection |

| EvaluationBatch Field | Description |
|----------------------|-------------|
| `outputs` | VerificationResults per question |
| `scores` | Score per question (0.0-1.0) |
| `trajectories` | Execution traces for reflection |
| `objective_scores` | Per-model scores for Pareto |

## Next Steps

- [06_feedback_generation.ipynb](06_feedback_generation.ipynb) - LLM-powered feedback generation