# GEPA: Optimize Anything using ASI (additional side information)

GEPA is a text evolution engine: given a target metric, GEPA can efficiently search for the right parameters (numerical, textual, or code) to improve that metric. In this way, GEPA can optimize essentially _anything_ that has a textual representation. In this post, we build on this insight to present GEPA's optimize-anything API, which uses the reflective capabilities of LLMs to optimize anything representable as text. Crucially, GEPA can use any additional information available from the optimization environment simply by serializing it into text.

## Agent Optimization: Evolving DSPy Programs for ARC-AGI

Our second example pushes GEPA further: optimizing not just prompts or hyperparameters, but the *entire structure* of an AI agent. We'll evolve a DSPy program to solve ARC-AGI tasks — a challenging benchmark requiring visual reasoning and pattern recognition.

**The task**: Given input-output matrix pairs as training examples, produce the correct output for test inputs.
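
To make the task format concrete, here is a minimal sketch of a toy task in the same shape as the dataset records used below (`train` and `test` lists of `input`/`output` grids). The task itself and the `mirror_rows` rule are hypothetical, chosen only for illustration; real ARC-AGI tasks are far harder.

```python
# Hypothetical toy task in the ARC style: each grid is a list of rows of ints.
# The hidden rule (chosen for illustration) is "mirror each row left to right".
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[4, 5, 6]], "output": [[6, 5, 4]]},
    ],
    "test": [
        {"input": [[7, 8], [9, 0]], "output": [[8, 7], [0, 9]]},
    ],
}


def mirror_rows(grid):
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


# A rule is plausible only if it reproduces every training output.
rule_fits_training = all(
    mirror_rows(pair["input"]) == pair["output"] for pair in toy_task["train"]
)

# Apply the rule to the test inputs to produce predictions.
predictions = [mirror_rows(pair["input"]) for pair in toy_task["test"]]
print(rule_fits_training, predictions)
```

A solver sees only the training pairs and the test inputs; the test outputs are held back for scoring.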

**What GEPA optimizes**: The entire DSPy program source code — signatures, modules, control flow, and prompting strategies.

**Result**: GEPA improves Gemini-2.5-Pro's performance from **44% to 49.5%** by discovering an elaborate 5-step reasoning pipeline with self-refinement.


### Setting up the dataset


In [10]:
from datasets import load_dataset
import random
import dspy

ds = load_dataset("dataartist/arc-agi")

def format_dataset(data):
    return [
        dspy.Example(
            training_examples=ex["train"],
            test_inputs=[x["input"] for x in ex["test"]],
            test_outputs=[x["output"] for x in ex["test"]],
        ).with_inputs("training_examples", "test_inputs")
        for ex in data
    ]

full_train = format_dataset(ds["training"])
test_set = format_dataset(ds["evaluation"])

random.Random(0).shuffle(full_train)
split_idx = len(full_train) // 2
train_set, val_set = full_train[:split_idx], full_train[split_idx:]

print(f"Train set: {len(train_set)}")
print(f"Val set: {len(val_set)}")
print(f"Test set: {len(test_set)}")

Train set: 200
Val set: 200
Test set: 400


In [None]:
# Use small subsets for the rest of this walkthrough.
train_set = train_set[:20]
val_set = val_set[:20]
test_set = test_set[:10]

### The seed candidate

We start with a simple Chain-of-Thought program — just a single DSPy module:


In [9]:
seed_candidate = """import dspy
from typing import List
import pydantic

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    training_examples: List[TrainingExample] = dspy.InputField(description="Input and output examples demonstrating the task to be performed.")
    test_inputs: List[MATRIX] = dspy.InputField(description="Input matrices to be solved following the task described in the training examples.")
    test_outputs: List[MATRIX] = dspy.OutputField(description="Output matrices corresponding to the test inputs.")

program = dspy.ChainOfThought(SolveTaskSignature)"""

### DSPy adapter

In [14]:
import os

from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter
from examples.arc_agi.main import metric_fn

# Create LMs
task_lm = dspy.LM(
    model="openai/gpt-5",
    temperature=1.0,
    max_tokens=32000,
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# Create adapter
adapter = DspyAdapter(
    task_lm=task_lm,
    metric_fn=metric_fn,
    num_threads=64,
    reflection_lm="openai/gpt-5",
)

### Fitness function

The fitness function compiles and runs the DSPy program, comparing outputs against ground truth. Crucially, it provides detailed feedback about *what went wrong*:


In [15]:
def format_error_results(program, batch, error_msg):
    """Create error results for all examples in batch."""
    results = []
    for example in batch:
        log = {"error": error_msg, "program": program}
        side_info = {
            "input": example,
            "error": error_msg,
        }
        results.append((0.0, log, side_info))
    return results

def format_results(program, trajectories):
    """Create one result tuple per trajectory."""
    results = []
    for traj in trajectories:
        metric_result = traj.get("score")
        score = metric_result.get("score")
        feedback = metric_result.get("feedback")
        prediction = traj.get("prediction")
        model_answer = prediction.get("test_outputs")

        log = {
            "program": program,
            "model_answer": model_answer,
            "score": score,
        }

        side_info = {
            "input": traj.get("example"),
            "reasoning": prediction.get("reasoning"),
            "feedback": feedback,
            "output": model_answer,
        }

        results.append((score, log, side_info))

    return results

def fitness_fn(candidate, batch):
    """Evaluate candidate program on batch and return results."""
    program = candidate["program"]

    try:
        eval_batch = adapter.evaluate(batch, candidate, capture_traces=True)
    except Exception as e:
        print(f"Error evaluating candidate: {e}")
        return format_error_results(program, batch, str(e))

    # Program error
    if not isinstance(eval_batch.trajectories, list):
        error_msg = f"All examples failed. Program error: {str(eval_batch.trajectories)}"
        return format_error_results(program, batch, error_msg)

    # Process evaluations with no errors
    return format_results(program, eval_batch.trajectories)
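
The `metric_fn` imported from `examples.arc_agi.main` is not shown in this post. As a rough sketch of the kind of metric that can feed `format_results`, the function below returns a dict with `score` and `feedback` keys, since those are the keys read above. The name `sketch_metric`, its arguments, and the feedback wording are assumptions for illustration, not the actual implementation, and it assumes predictions arrive as lists of lists of ints.

```python
def row_lengths(grid):
    """Describe a grid's shape as the list of its row lengths."""
    return [len(row) for row in grid]


def sketch_metric(gold_outputs, predicted_outputs):
    """Score predicted grids against gold grids and explain each mismatch."""
    if len(predicted_outputs) != len(gold_outputs):
        return {
            "score": 0.0,
            "feedback": (
                f"Expected {len(gold_outputs)} output grids, "
                f"got {len(predicted_outputs)}."
            ),
        }

    notes = []
    correct = 0
    for i, (gold, pred) in enumerate(zip(gold_outputs, predicted_outputs)):
        if pred == gold:
            correct += 1
        elif row_lengths(pred) != row_lengths(gold):
            notes.append(
                f"Grid {i}: row lengths {row_lengths(pred)} "
                f"should be {row_lengths(gold)}."
            )
        else:
            # Same shape, so cell-by-cell comparison is safe.
            wrong = [
                (r, c)
                for r, row in enumerate(gold)
                for c, value in enumerate(row)
                if pred[r][c] != value
            ]
            notes.append(f"Grid {i}: {len(wrong)} wrong cells, first at {wrong[0]}.")

    score = correct / len(gold_outputs) if gold_outputs else 0.0
    feedback = " ".join(notes) if notes else "All grids correct."
    return {"score": score, "feedback": feedback}


print(sketch_metric([[[1, 2], [3, 4]]], [[[1, 2], [3, 5]]]))
```

The feedback string is what carries the side information: a shape mismatch and a single wrong cell both score 0 for that grid, but they point the reflection step at very different fixes.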

### Running GEPA optimization


In [None]:
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)

gepa_config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=100,
        track_best_outputs=True,
    ),
    reflection=ReflectionConfig(
        reflection_minibatch_size=3,
        skip_perfect_score=False,
        reflection_lm="openai/gpt-5",
    )
)

result = optimize_anything(
    seed_candidate={"program": seed_candidate},
    fitness_fn=fitness_fn,
    dataset=train_set,
    valset=val_set,
    config=gepa_config,
)

Average Metric: 8.00 / 8 (100.0%):  80%|████████  | 8/10 [01:22<00:18,  9.14s/it] 

### What GEPA discovered

After optimization, GEPA evolved the simple ChainOfThought into an elaborate 5-step pipeline:

1. **Hypothesize Rule**: Ask LLM to deduce a natural language transformation rule from training examples
2. **Generate Code**: Ask LLM to implement the rule as a Python function
3. **Validate on Training**: Run the code on all training examples, collecting feedback on failures
4. **Refine if Needed**: If validation fails, ask LLM to fix the code using gathered feedback
5. **Execute on Test**: Run the refined code on test inputs

Remarkably, **GEPA discovered reflective self-refinement** — having the LLM check and fix its own code before producing final outputs.
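
The five steps above can be sketched in plain Python. This is not the program GEPA produced (that one is a DSPy program, printed below); it is a minimal illustration of the control flow, with a scripted stand-in for the LLM. The names `solve`, `validate`, `run_code`, and `make_scripted_llm`, the prompt wording, and the `max_refinements` limit are all hypothetical.

```python
def run_code(source, grid):
    """Execute generated source, which must define `transform(grid)`."""
    namespace = {}
    exec(source, namespace)  # Unsafe for untrusted code; sandbox in practice.
    return namespace["transform"](grid)


def validate(source, train_pairs):
    """Step 3: run the code on every training pair and collect failures."""
    failures = []
    for i, (grid, expected) in enumerate(train_pairs):
        try:
            got = run_code(source, grid)
        except Exception as exc:
            failures.append(f"Example {i}: raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"Example {i}: expected {expected}, got {got}")
    return failures


def solve(train_pairs, test_inputs, llm, max_refinements=2):
    # Step 1: hypothesize a natural language rule.
    rule = llm(f"Describe the rule mapping inputs to outputs: {train_pairs}")
    # Step 2: turn the rule into code.
    source = llm(f"Write Python `transform(grid)` implementing: {rule}")
    for _ in range(max_refinements):
        failures = validate(source, train_pairs)  # Step 3
        if not failures:
            break
        # Step 4: ask for a fix, passing the gathered feedback.
        source = llm(
            "Fix this code.\nCode:\n" + source + "\nFailures:\n" + "\n".join(failures)
        )
    # Step 5: run whatever code we ended up with on the test inputs.
    return [run_code(source, grid) for grid in test_inputs]


def make_scripted_llm(responses):
    """Stand-in for a real LLM: returns canned responses in order."""
    remaining = iter(responses)
    return lambda prompt: next(remaining)


stub_llm = make_scripted_llm([
    "Mirror each row left to right.",
    "def transform(grid):\n    return grid",  # Buggy first attempt
    "def transform(grid):\n    return [row[::-1] for row in grid]",  # Fixed
])
train_pairs = [([[1, 0], [2, 3]], [[0, 1], [3, 2]])]
outputs = solve(train_pairs, [[[7, 8]]], stub_llm)
print(outputs)
```

In this run the first attempt fails validation on the training pair, so the failure message is sent back and the second attempt is used for the test input. Note that step 5 runs even if refinement never succeeds, so a real pipeline would also need to handle code that still raises.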


In [None]:
# View the evolved program
print(result.best_candidate["program"][:2000])  # First 2000 chars

## Key Takeaways

The `optimize_anything` API demonstrates GEPA's power as a general-purpose text evolution engine:

1. **Unified interface**: Whether you're optimizing prompts, code, or agent architectures, the API is the same — just define your fitness function with rich `side_info`.

2. **Side information is key**: The more diagnostic information you provide, the better GEPA's LLM-based reflection can understand failures and propose targeted improvements.

3. **Beyond scalar optimization**: Traditional optimizers only see scores. GEPA sees error messages, execution traces, and domain-specific feedback — enabling it to optimize complex artifacts that would be impossible to search blindly.

4. **Emergent capabilities**: GEPA can discover sophisticated strategies (like self-refinement in the ARC-AGI example) that weren't explicitly programmed — they emerge from the optimization process itself.

Try `optimize_anything` on your own optimization problems. If you can express your system's parameters as text and compute a score with diagnostic feedback, GEPA can optimize it.
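
As a starting point, here is a minimal sketch of a fitness function for a toy problem outside DSPy. It mirrors the return shape used by `fitness_fn` above, one `(score, log, side_info)` tuple per example, but the problem (fitting an arithmetic expression in `x`), the `expr` key, and the scoring formula are hypothetical choices for illustration. It does not call GEPA.

```python
def toy_fitness_fn(candidate, batch):
    """Score a text candidate (an arithmetic expression in x) on each example."""
    expr = candidate["expr"]
    results = []
    for x, target in batch:
        try:
            # Evaluate with no builtins; only `x` is visible to the expression.
            value = eval(expr, {"__builtins__": {}}, {"x": x})
            error = abs(value - target)
        except Exception as exc:
            # Failures score 0 but still report what went wrong.
            side_info = {"input": x, "error": f"{type(exc).__name__}: {exc}"}
            results.append((0.0, {"expr": expr}, side_info))
            continue

        score = 1.0 / (1.0 + error)  # 1.0 when exact, falling toward 0
        side_info = {
            "input": x,
            "output": value,
            "feedback": f"Expected {target}, got {value} (absolute error {error}).",
        }
        results.append((score, {"expr": expr, "output": value}, side_info))
    return results


batch = [(1, 3), (2, 5)]  # (x, target) pairs consistent with 2 * x + 1
for score, _log, side_info in toy_fitness_fn({"expr": "2 * x"}, batch):
    print(score, side_info)
```

The score alone says only that `2 * x` is imperfect; the feedback shows it is off by exactly 1 on every example, which is the kind of detail a reflection step can act on.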
