# GEPA: Optimize Anything using ASI (additional side information)

GEPA is a text evolution engine: given a target metric, GEPA efficiently searches for the parameters (numerical, textual, or code) that improve that metric. As a result, GEPA can optimize essentially _anything_ that has a textual representation. In this post, we build on this insight to present GEPA's optimize-anything API, which uses the reflective capabilities of LLMs to optimize anything representable as text. Crucially, GEPA can use any additional information available from the optimization environment simply by serializing it into text.

## The optimize_anything API

At its core, the API is remarkably simple. You provide just two things:

1. **A seed candidate** — your starting point, represented as a dictionary mapping parameter names to their values. 
2. **A fitness function** — tells GEPA how good each candidate is. The fitness function also returns any additional information available from the environment about the evaluated candidate, like compiler error messages, that can guide the optimization.

That's it. GEPA handles the rest — selecting candidates, reflecting on failures, proposing improvements, and tracking the optimization trajectory, finally returning the optimized parameters.

### The Fitness Function: Your Optimization Signal

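The fitness function is where you define *what* you're optimizing for. It takes a candidate and a batch of data instances, and returns one `(score, output, side_info)` tuple per instance: a numeric score (higher is better), the raw output, and a dictionary of diagnostic information: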

```python
def fitness_fn(candidate: dict[str, str], batch: list[DataInst]) -> list[tuple[float, Any, dict]]:
    results = []
    for instance in batch:
        # Run your system with the candidate parameters
        output = run_my_system(candidate, instance)
        
        # Compute a score (higher is better)
        score = compute_score(output, instance)
        
        # Collect diagnostic info for LLM reflection
        side_info = {
            "input": instance["input"],
            "output": output,
            "expected": instance["expected"],
            "error_analysis": analyze_errors(output, instance)
        }
        
        results.append((score, output, side_info))
    return results
```

The key piece is `side_info`: it is the diagnostic text that GEPA's reflection step reads alongside the score.
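To make the return contract concrete, here is a toy sketch that can be run without GEPA installed. It treats a single textual parameter as a number and scores it against a per-instance target. The names `toy_fitness_fn`, the `"x"` parameter, the `"target"` field, and the penalty value are illustrative assumptions, not part of the GEPA API.

```python
from typing import Any

FAILURE_PENALTY = -1e6  # assumed penalty for candidates that cannot be evaluated


def toy_fitness_fn(candidate: dict[str, str], batch: list[dict]) -> list[tuple[float, Any, dict]]:
    """Return one (score, output, side_info) tuple per instance in the batch."""
    results = []
    for instance in batch:
        side_info = {"input": instance, "raw_x": candidate["x"]}
        try:
            # Parameters arrive as text, so parse before use
            x = float(candidate["x"])
        except ValueError as exc:
            # Report the failure in side_info instead of raising
            side_info["error"] = f"could not parse x: {exc}"
            results.append((FAILURE_PENALTY, None, side_info))
            continue

        signed_error = x - instance["target"]
        score = -(signed_error ** 2)  # higher is better, 0 is a perfect match
        side_info["signed_error"] = signed_error
        if signed_error > 0:
            side_info["hint"] = "x is too large"
        elif signed_error < 0:
            side_info["hint"] = "x is too small"
        else:
            side_info["hint"] = "exact"
        results.append((score, x, side_info))
    return results


results = toy_fitness_fn({"x": "3.0"}, [{"target": 1.0}])
```

Here the score alone says only that the candidate is off by some amount, while the `hint` and `signed_error` entries say in which direction, which is the kind of detail a reflecting LLM can act on.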


### The Power of Side Information

The `side_info` dictionary is where GEPA shines. Unlike traditional optimization that only sees a scalar score, GEPA's LLM-based reflection can understand *why* a candidate performed poorly:

- **Error messages**: Compiler errors, runtime exceptions, validation failures
- **Execution traces**: What the candidate actually did vs. what was expected
- **Partial results**: Which sub-tasks succeeded, which failed
- **Domain-specific feedback**: Any signal that helps explain performance

The more informative your `side_info`, the better GEPA can reason about improvements. This is what enables GEPA to optimize complex artifacts like code and agent architectures — not just tweak numbers.
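As one way to produce the "partial results" kind of feedback listed above, the sketch below runs a set of named checks against an output and records which passed and why the others failed. The helper name `run_checks` and the example checks are hypothetical and only illustrate the idea; they are not part of GEPA.

```python
from typing import Any, Callable


def run_checks(output: Any, checks: dict[str, Callable[[Any], bool]]) -> tuple[float, dict]:
    """Run each named check and return (fraction passed, side_info)."""
    passed = []
    failed = {}
    for name, check in checks.items():
        try:
            if check(output):
                passed.append(name)
            else:
                failed[name] = "check returned False"
        except Exception as exc:
            # A crashing check is reported as a failure with its reason
            failed[name] = f"{type(exc).__name__}: {exc}"
    score = len(passed) / len(checks) if checks else 0.0
    return score, {"passed": passed, "failed": failed}


checks = {
    "is_sorted": lambda xs: xs == sorted(xs),
    "non_empty": lambda xs: len(xs) > 0,
    "first_is_int": lambda xs: isinstance(xs[0], int),
}
score, side_info = run_checks([3, 1, 2], checks)
```

The returned dictionary can be merged into `side_info` so that reflection sees which sub-tasks failed rather than only the aggregate score.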


## Example 1: Code Optimization — Evolving Optimization Algorithms

Our first example demonstrates GEPA's ability to evolve code. We'll optimize Python code that minimizes blackbox functions from the [evalset benchmark](https://github.com/sigopt/evalset/tree/main) — a collection of challenging optimization test functions (Ackley, Rosenbrock, Rastrigin, etc.).

**The task**: Given a blackbox function, write code that finds its minimum. The code can use any optimization library (Optuna, scipy, etc.) and must return the best `x` found.

**What GEPA optimizes**: The Python code itself — its structure, algorithm choice, hyperparameters, and implementation details.


### Setting up the dataset

Each data instance is a blackbox optimization problem with bounds, dimension, and problem characteristics:


```python
from examples.polynomial.evalset import problems

# Create dataset from benchmark problems
dataset = []
for problem_name, problem in problems.items():
    dataset.append({
        "problem_name": problem_name,
        "problem_description": f"""Blackbox optimization problem.
        Minimize a function that takes a numpy array of shape ({problem.dim},) and returns a scalar.
        Bounds: {problem.bounds}""",
        "dim": problem.dim,
        "bounds": problem.bounds,
    })
```

### The seed candidate

We start with a trivial baseline — code that just guesses the center of the search space:


```python
seed_candidate = {
    "code": """
import numpy as np

def solve(dim):
    # A trivial baseline: guess the center of the search space
    x = [0.5] * dim
    y = evaluator.evaluate(np.array(x))
    print("y:", y)
    return x

if __name__ == "__main__":
    global x
    x = solve(dim)
"""
}
```


### The fitness function

The fitness function executes the candidate code in a sandboxed environment, captures the result, and returns rich diagnostic information:


```python
import numpy as np
from typing import Any, Sequence

def execute_code(code: str, global_vars: dict, timeout: int = 30) -> dict:
    """Execute code in a sandboxed environment with timeout."""
    # Implementation handles: stdout/stderr capture, timeout, exception handling
    # Returns: {"output": str, "logs": str, "results": dict, "error": str}
    ...

def fitness_fn(candidate: dict[str, str], batch: Sequence[Any]) -> list[tuple[float, Any, dict]]:
    """Evaluate optimization code on a batch of blackbox problems."""
    code = candidate["code"]
    results = []

    for problem_instance in batch:
        problem = problems[problem_instance["problem_name"]]

        # NOTE: `evaluator` is not defined in this snippet. It is assumed to be
        # created elsewhere as an object that wraps the current problem's
        # blackbox function and counts calls in `local_evaluation_calls`.

        # Execute the candidate code with problem context
        execution = execute_code(
            code,
            global_vars={"dim": problem.dim, "evaluator": evaluator},
            timeout=30
        )

        # Compute score: negative function value (higher is better)
        if "x" in execution["results"] and execution["error"] == "":
            x = np.array(execution["results"]["x"])
            score = -problem.do_evaluate(x)  # Negate because we minimize
        else:
            score = -99999  # Penalize failed executions

        # Rich side_info for LLM reflection
        side_info = {
            "scores": {"score": score},
            "Input": {"problem_description": problem_instance["problem_description"]},
            "code_side_info": {
                "X": execution["results"].get("x", "not found"),
                "Prints": execution["output"],       # Captured stdout
                "Logs": execution["logs"],           # Captured stderr
                "Error": execution["error"],         # Any exceptions
                "Num evaluation calls": evaluator.local_evaluation_calls,
            },
        }

        results.append((score, {"code": code, **side_info}, side_info))

    return results
```


Notice how `side_info` captures everything the LLM needs to understand *why* the code failed or succeeded: error messages, print output, the actual result found, and evaluation budget used.
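The body of `execute_code` and the construction of `evaluator` are not shown in this post. As a rough illustration of what they might look like, the sketch below uses only the standard library. Both `execute_code_sketch` and `CountingEvaluator` are hypothetical stand-ins, and the sketch is deliberately simplified: it is not a sandbox and it does not enforce a timeout, so it should not be used on untrusted code.

```python
import contextlib
import io
import traceback


class CountingEvaluator:
    """Hypothetical stand-in for `evaluator`: wraps a function and counts calls."""

    def __init__(self, fn):
        self._fn = fn
        self.local_evaluation_calls = 0

    def evaluate(self, x):
        self.local_evaluation_calls += 1
        return self._fn(x)


def execute_code_sketch(code: str, global_vars: dict) -> dict:
    """Run `code` with exec(), capturing stdout, stderr, and any exception.

    Illustration only: no sandboxing and no timeout.
    """
    # Setting __name__ lets an `if __name__ == "__main__":` block run
    namespace = {"__name__": "__main__", **global_vars}
    stdout, stderr = io.StringIO(), io.StringIO()
    error = ""
    with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
        try:
            exec(code, namespace)
        except Exception:
            error = traceback.format_exc()
    # Keep what the code defined at top level, minus injected inputs and dunders
    results = {
        key: value
        for key, value in namespace.items()
        if not key.startswith("__") and key not in global_vars
    }
    return {
        "output": stdout.getvalue(),
        "logs": stderr.getvalue(),
        "results": results,
        "error": error,
    }


# Usage with a toy objective (sum of squares) in place of a benchmark problem
ev = CountingEvaluator(lambda x: sum(v * v for v in x))
code = 'x = [0.5] * dim\ny = evaluator.evaluate(x)\nprint("y:", y)'
out = execute_code_sketch(code, {"dim": 2, "evaluator": ev})
failed = execute_code_sketch("1 / 0", {})
```

The returned dictionary has the same four keys the fitness function above reads (`output`, `logs`, `results`, `error`), and a failing candidate yields a traceback string in `error` instead of crashing the evaluation loop.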

### Running GEPA optimization


```python
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    dataset=dataset,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=1000,
            track_best_outputs=True,
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-4o",  # LLM for proposing improvements
            reflection_minibatch_size=3,     # Problems shown per reflection
        ),
    ),
)

# Access the optimized code
print(result.best_candidate["code"])
```


GEPA evolves the code from a trivial baseline into sophisticated optimization strategies — discovering the use of libraries like Optuna, implementing proper bounds handling, and tuning algorithm hyperparameters.


## Example 2: Agent Optimization — Evolving DSPy Programs for ARC-AGI

Our second example pushes GEPA further: optimizing not just prompts or hyperparameters, but the *entire structure* of an AI agent. We'll evolve a DSPy program to solve ARC-AGI tasks — a challenging benchmark requiring visual reasoning and pattern recognition.

**The task**: Given input-output matrix pairs as training examples, produce the correct output for test inputs.

**What GEPA optimizes**: The entire DSPy program source code — signatures, modules, control flow, and prompting strategies.

**Result**: GEPA improves Gemini-2.5-Pro's performance from **44% to 49.5%** by discovering an elaborate 5-step reasoning pipeline with self-refinement.


### Setting up the dataset


```python
from datasets import load_dataset
import dspy

ds = load_dataset("dataartist/arc-agi")

# Each example has training pairs and test inputs/outputs
dataset = [
    dspy.Example(
        training_examples=ex["train"],
        test_inputs=[x["input"] for x in ex["test"]],
        test_outputs=[x["output"] for x in ex["test"]],
    ).with_inputs("training_examples", "test_inputs")
    for ex in ds["training"]
]
```


### The seed candidate

We start with a simple Chain-of-Thought program — just a single DSPy module:


```python
seed_program = """import dspy
from typing import List
import pydantic

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    training_examples: List[TrainingExample] = dspy.InputField(
        description="Input and output examples demonstrating the task."
    )
    test_inputs: List[MATRIX] = dspy.InputField(
        description="Input matrices to be solved."
    )
    test_outputs: List[MATRIX] = dspy.OutputField(
        description="Output matrices corresponding to the test inputs."
    )

program = dspy.ChainOfThought(SolveTaskSignature)
"""

seed_candidate = {"program": seed_program}
```


### The fitness function

The fitness function compiles and runs the DSPy program, comparing outputs against ground truth. Crucially, it provides detailed feedback about *what went wrong*:


```python
from typing import Any, Sequence

task_lm = dspy.LM(model="gemini/gemini-2.5-pro", max_tokens=32000)

def validate_matrix(pred_matrix, gold_matrix):
    """Check if prediction matches gold, returning (is_valid, feedback)."""
    if not isinstance(pred_matrix, list):
        return False, f"Expected List[List[int]], got {type(pred_matrix)}"

    if len(pred_matrix) != len(gold_matrix):
        return False, f"Wrong dimensions: {len(pred_matrix)} rows vs {len(gold_matrix)} expected"

    wrong_indices = []
    for i, (pred_row, gold_row) in enumerate(zip(pred_matrix, gold_matrix)):
        # zip() silently truncates, so check each row's shape explicitly;
        # otherwise a prediction with short rows could pass as correct
        if not isinstance(pred_row, list) or len(pred_row) != len(gold_row):
            return False, f"Wrong dimensions in row {i}: expected {len(gold_row)} columns"
        for j, (pred_val, gold_val) in enumerate(zip(pred_row, gold_row)):
            if pred_val != gold_val:
                wrong_indices.append((i, j))

    if wrong_indices:
        return False, f"Incorrect values at indices: {wrong_indices[:10]}... Correct: {gold_matrix}"
    return True, "Correct!"

def fitness_fn(candidate: dict, batch: Sequence, **kwargs) -> list[tuple[float, Any, dict]]:
    """Evaluate a DSPy program on ARC-AGI tasks."""
    program_src = candidate["program"]
    results = []

    for example in batch:
        # Compile and run the program
        try:
            # The candidate source is expected to define a global named `program`
            exec(program_src, globals())
            with dspy.context(lm=task_lm):
                pred = program(
                    training_examples=example.training_examples,
                    test_inputs=example.test_inputs
                )
            pred_outputs = pred.test_outputs
            error = None
        except Exception as e:
            pred_outputs = []
            error = str(e)

        # Score each test output; missing predictions count as incorrect
        # because the score divides by the number of expected outputs
        feedbacks = []
        correct = 0
        for i, (pred_out, gold_out) in enumerate(
            zip(pred_outputs, example.test_outputs)
        ):
            valid, feedback = validate_matrix(pred_out, gold_out)
            correct += int(valid)
            feedbacks.append(f"Test {i}: {feedback}")

        score = correct / len(example.test_outputs) if example.test_outputs else 0

        # Rich side_info enables targeted reflection
        side_info = {
            "scores": {"accuracy": score},
            "TrainingExamples": str(example.training_examples)[:500],
            "TestInputs": str(example.test_inputs)[:500],
            "PredictedOutputs": str(pred_outputs)[:500],
            "ExpectedOutputs": str(example.test_outputs)[:500],
            "Feedback": "\n".join(feedbacks),
            "Error": error,
        }

        results.append((score, pred_outputs, side_info))

    return results
```


### Running GEPA optimization


```python
reflection_lm = dspy.LM(model="gemini/gemini-2.5-pro", max_tokens=32000)

result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    dataset=dataset,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=4000,
            track_best_outputs=True,
        ),
        reflection=ReflectionConfig(
            reflection_lm=lambda x: reflection_lm(x)[0],
            reflection_minibatch_size=3,
        ),
    ),
)
```


### What GEPA discovered

After optimization, GEPA evolved the simple ChainOfThought into an elaborate 5-step pipeline:

1. **Hypothesize Rule**: Ask LLM to deduce a natural language transformation rule from training examples
2. **Generate Code**: Ask LLM to implement the rule as a Python function
3. **Validate on Training**: Run the code on all training examples, collecting feedback on failures
4. **Refine if Needed**: If validation fails, ask LLM to fix the code using gathered feedback
5. **Execute on Test**: Run the refined code on test inputs

Remarkably, **GEPA discovered reflective self-refinement** — having the LLM check and fix its own code before producing final outputs.
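The control flow of the five steps above can be sketched in plain Python. This is not the program GEPA produced: the LLM calls are replaced by caller-supplied callables, and the names `validate_on_training`, `solve_with_refinement`, `propose`, `refine`, and `max_rounds` are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Matrix = List[List[int]]
Transform = Callable[[Matrix], Matrix]


def validate_on_training(fn: Transform, pairs: List[Tuple[Matrix, Matrix]]) -> List[str]:
    """Step 3: run the transform on every training pair and collect failure feedback."""
    feedback = []
    for i, (inp, expected) in enumerate(pairs):
        try:
            got = fn(inp)
        except Exception as exc:
            feedback.append(f"Example {i}: raised {type(exc).__name__}: {exc}")
            continue
        if got != expected:
            feedback.append(f"Example {i}: expected {expected}, got {got}")
    return feedback


def solve_with_refinement(
    propose: Callable[[], Transform],                     # steps 1-2 (stand-in for LLM calls)
    refine: Callable[[Transform, List[str]], Transform],  # step 4 (stand-in for an LLM call)
    pairs: List[Tuple[Matrix, Matrix]],
    test_inputs: List[Matrix],
    max_rounds: int = 2,
) -> List[Matrix]:
    fn = propose()
    for _ in range(max_rounds):
        feedback = validate_on_training(fn, pairs)
        if not feedback:
            break  # every training example passes, no refinement needed
        fn = refine(fn, feedback)
    # Step 5: apply the (possibly refined) transform to the test inputs
    return [fn(m) for m in test_inputs]


# Toy usage: the true rule is "reverse each row"; the first proposal is the identity
pairs = [([[1, 2]], [[2, 1]])]
identity = lambda m: m
flip_rows = lambda m: [row[::-1] for row in m]
outputs = solve_with_refinement(lambda: identity, lambda fn, fb: flip_rows, pairs, [[[3, 4, 5]]])
```

In the toy run the identity proposal fails validation on the training pair, the refinement step swaps in the row-reversing transform, and that transform is then applied to the test input.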


```python
# View the evolved program
print(result.best_candidate["program"][:2000])  # First 2000 chars
```


## Key Takeaways

The `optimize_anything` API demonstrates GEPA's power as a general-purpose text evolution engine:

1. **Unified interface**: Whether you're optimizing prompts, code, or agent architectures, the API is the same — just define your fitness function with rich `side_info`.

2. **Side information is key**: The more diagnostic information you provide, the better GEPA's LLM-based reflection can understand failures and propose targeted improvements.

3. **Beyond scalar optimization**: Traditional optimizers only see scores. GEPA sees error messages, execution traces, and domain-specific feedback — enabling it to optimize complex artifacts that would be impossible to search blindly.

4. **Emergent capabilities**: GEPA can discover sophisticated strategies (like self-refinement in the ARC-AGI example) that weren't explicitly programmed — they emerge from the optimization process itself.

Try `optimize_anything` on your own optimization problems. If you can express your system's parameters as text and compute a score with diagnostic feedback, GEPA can optimize it.
