# GEPA: Optimize Anything using ASI (additional side information)

GEPA is a text evolution engine: given a fitness function, it efficiently searches for parameters (numerical, textual, or code) that improve a black-box fitness score. Because GEPA operates on text, it can optimize essentially _anything_ that has a textual representation. The `optimize_anything` API builds on this insight: it uses the reflective capabilities of LLMs to propose improvements, and it can use any additional information available from the optimization environment, as long as that information can be serialized into text.

In this post, we walk through four uses of GEPA: optimizing the parameters of a blackbox polynomial optimization problem, optimizing the prompt of an agent for math tasks, optimizing the harness of the agent itself, and discovering efficient circle packing algorithms.

## The optimize_anything API

At its core, the API takes three inputs, one of them optional:

1. **A seed candidate**: your starting point, represented as a dictionary mapping parameter names to their values.
2. **A fitness function**: tells GEPA how good each candidate is. It also returns any additional information the environment provides about the evaluated candidate, such as compiler error messages, that can guide the optimization.
3. **(Optional) A dataset**: useful when the task consists of multiple related problem instances, either because the optimized parameters must generalize to unseen data points or because several closely related problems are solved together (such as circle packing with different numbers of circles).

That's it. GEPA handles the rest — selecting candidates, reflecting on failures, proposing improvements, and tracking the optimization trajectory, finally returning the optimized parameters.
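To make this division of labor concrete, here is a toy sketch of the kind of loop described above. It is not GEPA's implementation: `toy_optimize`, `propose`, and the greedy acceptance rule are hypothetical stand-ins, and the proposer is a plain function where GEPA would use an LLM reflecting on `side_info`.

```python
from typing import Any, Callable

Candidate = dict[str, str]
FitnessFn = Callable[[Candidate], tuple[float, Any, dict]]
ProposeFn = Callable[[Candidate, dict], Candidate]


def toy_optimize(seed: Candidate, fitness_fn: FitnessFn,
                 propose: ProposeFn, budget: int = 20) -> tuple[Candidate, list[float]]:
    """Greedy loop: evaluate, propose from feedback, keep strict improvements."""
    best = seed
    best_score, _, best_info = fitness_fn(best)
    trajectory = [best_score]
    for _ in range(budget):
        child = propose(best, best_info)      # stand-in for LLM reflection
        score, _, info = fitness_fn(child)
        if score > best_score:                # accept only if the score improves
            best, best_score, best_info = child, score, info
        trajectory.append(best_score)         # track the best score over time
    return best, trajectory


# Hypothetical task: find the integer (stored as text) closest to 7.
def fitness_fn(candidate: Candidate) -> tuple[float, Any, dict]:
    value = int(candidate["x"])
    hint = "too low" if value < 7 else "too high" if value > 7 else "exact"
    return -abs(value - 7), value, {"hint": hint}


def propose(candidate: Candidate, side_info: dict) -> Candidate:
    # Uses the textual hint, not just the score, to decide the next step
    step = {"too low": 1, "too high": -1, "exact": 0}[side_info["hint"]]
    return {"x": str(int(candidate["x"]) + step)}


best, trajectory = toy_optimize({"x": "0"}, fitness_fn, propose, budget=10)
print(best, trajectory[-1])
```

The proposer never sees the target value directly. It only sees the hint in `side_info`, which is the role side information plays in the rest of this post.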

### The Fitness Function: Your Optimization Signal

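The fitness function is where you define *what* you're optimizing for. In its simplest form, it takes a candidate and a single data instance and returns a score, the system's output, and diagnostic information. The worked examples later in this post use a batched variant that takes a sequence of instances and returns one such tuple per instance. The single-instance form looks like this: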

```python
from typing import Any

DataInst = dict[str, Any]  # one data instance; the exact shape is up to your task

def fitness_fn(candidate: dict[str, str], example: DataInst) -> tuple[float, Any, dict]:
    # run_my_system, compute_score, and analyze_errors are placeholders you supply

    # Run your system with the candidate parameters
    output = run_my_system(candidate, example)

    # Compute a score (higher is better)
    score = compute_score(output, example)

    # Collect diagnostic info for LLM reflection
    side_info = {
        "input": example["input"],
        "output": output,
        "expected": example["expected"],
        "error_analysis": analyze_errors(output, example),
    }

    return score, output, side_info
```
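As a concrete, runnable instance of this template, the sketch below fills in the placeholders with a toy task. The task (joining words with a separator), the `separator` parameter, and the helper body are all made up for illustration. Only the return shape `(score, output, side_info)` comes from the template above.

```python
from typing import Any


def run_my_system(candidate: dict[str, str], example: dict[str, Any]) -> str:
    # Toy "system": join the input words using the candidate's separator
    return candidate["separator"].join(example["input"])


def fitness_fn(candidate: dict[str, str], example: dict[str, Any]) -> tuple[float, Any, dict]:
    output = run_my_system(candidate, example)
    score = 1.0 if output == example["expected"] else 0.0
    side_info = {
        "input": example["input"],
        "output": output,
        "expected": example["expected"],
        # Plain-language explanation an LLM can act on
        "error_analysis": "match" if score == 1.0
        else f"expected {example['expected']!r}, got {output!r}",
    }
    return score, output, side_info


example = {"input": ["a", "b", "c"], "expected": "a-b-c"}
score, output, side_info = fitness_fn({"separator": ","}, example)
print(score, side_info["error_analysis"])
```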

The magic happens in `side_info` — this is GEPA's secret weapon.


### The Power of Side Information

The `side_info` dictionary is where GEPA shines. Unlike traditional optimization that only sees a scalar score, GEPA's LLM-based reflection can understand *why* a candidate performed poorly:

- **Error messages**: Compiler errors, runtime exceptions, validation failures
- **Execution traces**: What the candidate actually did vs. what was expected
- **Partial results**: Which sub-tasks succeeded, which failed
- **Domain-specific feedback**: Any signal that helps explain performance

The more informative your `side_info`, the better GEPA can reason about improvements. The keys within `side_info` are entirely flexible: you can include any contextual metadata that helps an LLM interpret performance and propose meaningful changes. This is what enables GEPA to optimize complex artifacts like code and agent architectures, not just tweak numbers.
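Since side information reaches the LLM as text, it can help to see what a serialized `side_info` might look like. The helper below is a hypothetical sketch, not part of GEPA: `render_side_info` and its `max_chars` truncation are assumptions about one reasonable way to turn a dictionary into a readable report.

```python
import json
from typing import Any


def render_side_info(side_info: dict[str, Any], max_chars: int = 500) -> str:
    """Serialize a side_info dict into sectioned text, truncating long values."""
    sections = []
    for key, value in side_info.items():
        # Strings pass through unchanged; everything else is rendered as JSON
        text = value if isinstance(value, str) else json.dumps(value, indent=2, default=str)
        if len(text) > max_chars:
            text = text[:max_chars] + "... [truncated]"
        sections.append(f"## {key}\n{text}")
    return "\n\n".join(sections)


report = render_side_info({
    "score": 0.0,
    "error": "NameError: name 'instance' is not defined",
    "partial_results": {"parse": "ok", "execute": "failed"},
})
print(report)
```

Truncation matters in practice because logs and traces can be long, and an overlong report leaves less room for the model to reason about the parts that matter.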

In this notebook, we provide minimal, actionable examples that demonstrate the flexibility of the GEPA API:

1. Optimizing code that minimizes a black-box polynomial function.
2. Optimizing an LLM prompt for a math dataset.
3. Optimizing an agent architecture for the ARC-AGI benchmark.
4. Discovering circle packing algorithms.

## Example 1: Code Optimization — Evolving Optimization Algorithms

Our first example demonstrates GEPA's ability to evolve code. We'll optimize Python code that minimizes blackbox functions from the [evalset benchmark](https://github.com/sigopt/evalset/tree/main) — a collection of challenging optimization test functions (Ackley, Rosenbrock, Rastrigin, etc.).

**The task**: Given a blackbox function, write code that finds its minimum. The code can use any optimization library (Optuna, scipy, etc.) and must return the best `x` found.

**What GEPA optimizes**: The Python code itself — its structure, algorithm choice, hyperparameters, and implementation details.

![Polynomial Graph](./assets/blog/polynomial.png)

### Setting up the dataset

Each data instance is a blackbox optimization problem with bounds, dimension, and problem characteristics:


```python
from examples.polynomial.evalset import problems

problem_name = "ackley"
problem = problems[problem_name]
dataset = [{
    "problem_name": problem_name,  # looked up again inside the fitness function
    "problem_description": f"""Blackbox optimization problem.
    Minimize a function that takes a numpy array of shape ({problem.dim},) and returns a scalar.
    Bounds: {problem.bounds}""",
    "dim": problem.dim,
    "bounds": problem.bounds,
}]
```

### The seed candidate

We start with a trivial baseline — code that just guesses the center of the search space:


```python
seed_candidate = """
import numpy as np

def solve(dim):
    # A trivial baseline: guess the center of the search space
    x = [0.5] * dim
    y = evaluator.evaluate(np.array(x))
    print("y:", y)
    return x

if __name__ == "__main__":
    global x
    x = solve(dim)
"""
```

### The fitness function

The fitness function executes the candidate code in a sandboxed environment, captures the result, and returns rich diagnostic information:


```python
import numpy as np
from typing import Any, Sequence

def execute_code(code: str, global_vars: dict, timeout: int = 30) -> dict:
    """Execute code in a sandboxed environment with timeout."""
    # Implementation handles: stdout/stderr capture, timeout, exception handling
    # Returns: {"output": str, "logs": str, "results": dict, "error": str}
    ...

def fitness_fn(candidate: dict[str, str], batch: Sequence[Any]) -> list[tuple[float, Any, dict]]:
    """Evaluate optimization code on a batch of blackbox problems."""
    code = candidate["code"]
    results = []

    for problem_instance in batch:
        problem = problems[problem_instance["problem_name"]]

        # NOTE: `evaluator` is assumed to be defined elsewhere (not shown here):
        # an object wrapping `problem` that exposes `evaluate(x)` and counts
        # calls in `local_evaluation_calls`.

        # Execute the candidate code with problem context
        execution = execute_code(
            code,
            global_vars={"dim": problem.dim, "evaluator": evaluator},
            timeout=30
        )

        # Compute score: negative function value (higher is better)
        if "x" in execution["results"] and execution["error"] == "":
            x = np.array(execution["results"]["x"])
            score = -problem.do_evaluate(x)  # Negate because we minimize
        else:
            score = -99999  # Penalize failed executions

        # Rich side_info for LLM reflection
        side_info = {
            "scores": {"score": score},
            "Input": {"problem_description": problem_instance["problem_description"]},
            "code_side_info": {
                "X": execution["results"].get("x", "not found"),
                "Prints": execution["output"],       # Captured stdout
                "Logs": execution["logs"],           # Captured stderr
                "Error": execution["error"],         # Any exceptions
                "Num evaluation calls": evaluator.local_evaluation_calls,
            },
        }

        results.append((score, {"code": code, **side_info}, side_info))

    return results
```


Notice how `side_info` captures everything the LLM needs to understand *why* the code failed or succeeded: error messages, print output, the actual result found, and evaluation budget used.
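The body of `execute_code` is elided above. For readers who want something to experiment with, the sketch below shows one way such a helper could capture stdout, stderr, resulting variables, and exceptions. It is an assumption-laden stand-in, not the repository's implementation: `run_candidate_code` is a hypothetical name, it enforces no timeout, and `exec` provides no real sandboxing.

```python
import contextlib
import io
import traceback


def run_candidate_code(code: str, global_vars: dict) -> dict:
    """Run code with exec, capturing prints, stderr, new globals, and errors.

    NOTE: exec() is not a security boundary, and no timeout is enforced here.
    """
    stdout, stderr = io.StringIO(), io.StringIO()
    # Setting __name__ lets an `if __name__ == "__main__":` block run
    namespace = {"__name__": "__main__", **global_vars}
    error = ""
    with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
        try:
            exec(code, namespace)
        except Exception:
            error = traceback.format_exc()
    # Report only names the candidate itself defined
    results = {k: v for k, v in namespace.items()
               if not k.startswith("__") and k not in global_vars}
    return {"output": stdout.getvalue(), "logs": stderr.getvalue(),
            "results": results, "error": error}


ok = run_candidate_code("x = [0.5] * dim\nprint('done')", {"dim": 3})
bad = run_candidate_code("x = undefined_name", {"dim": 3})
print(ok["results"], bad["error"].splitlines()[-1])
```

A production version would run the candidate in a separate process so that a timeout and resource limits can actually be enforced.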

### Running GEPA optimization


```python
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

result = optimize_anything(
    seed_candidate={"code": seed_candidate},  # fitness_fn reads candidate["code"]
    fitness_fn=fitness_fn,
    dataset=dataset,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=1000,
            track_best_outputs=True,
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-4o",  # LLM for proposing improvements
            reflection_minibatch_size=3,     # Problems shown per reflection
        ),
    ),
)

# Access the optimized code
print(result.best_candidate["code"])
```


GEPA evolves the code from a trivial baseline into sophisticated optimization strategies — discovering the use of libraries like Optuna, implementing proper bounds handling, and tuning algorithm hyperparameters.
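To give a feel for the first step up from the seed, here is a hand-written sketch of a slightly better strategy: bounded random search under a fixed evaluation budget. This is not an output of GEPA and not taken from the evalset code. The `ackley` definition is the standard textbook form, and `random_search`, the bounds, and the budget are illustrative assumptions.

```python
import math
import random


def ackley(x: list[float]) -> float:
    """Standard Ackley function; global minimum of 0 at the origin."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    mean_cos = sum(math.cos(2 * math.pi * v) for v in x) / n
    return -20 * math.exp(-0.2 * math.sqrt(mean_sq)) - math.exp(mean_cos) + 20 + math.e


def random_search(f, bounds, budget=200, seed=0):
    """Sample uniformly inside the bounds and keep the best point seen."""
    rng = random.Random(seed)
    best_x = [(lo + hi) / 2 for lo, hi in bounds]  # start from the center guess
    best_y = f(best_x)
    for _ in range(budget - 1):                    # one evaluation already spent
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


bounds = [(-10.0, 30.0)] * 3  # hypothetical, deliberately off-center bounds
center = [(lo + hi) / 2 for lo, hi in bounds]
best_x, best_y = random_search(ackley, bounds)
print(f"center: {ackley(center):.3f}, random search: {best_y:.3f}")
```

Because the search starts from the center guess and only accepts improvements, it can never do worse than the seed's strategy on the same objective.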


## Example 2: Prompt Optimization — AIME

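Our second example optimizes the instruction prompt of a DSPy math-solving module on AIME problems.

**Result**: accuracy improves from **46.67%** (baseline) to **53.33%** (optimized), a gain of 6.66 percentage points.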

![AIME Graph](./assets/blog/aime_optimization_progress.png)

### Setting up the dataset and the seed candidate

We use a split of AIME problems for training, validation, and testing:


```python
from examples.math.dataset import load_math_dataset

trainset, valset, testset = load_math_dataset()
```

```
Loaded 45 training examples
Loaded 45 validation examples
Loaded 30 test examples
```


```python
import dspy
import os

# Define the language model
api_key = os.environ.get("OPENAI_API_KEY")
lm = dspy.LM("gpt-4.1-mini", api_key=api_key, temperature=1.0, max_tokens=32000)
dspy.configure(lm=lm)

# Let's optimize the prompt of the following dspy reasoning module.
class MathSolverSignature(dspy.Signature):
    input = dspy.InputField(desc="The math problem to solve.")
    answer = dspy.OutputField(desc="The final numerical answer.")

predictor = dspy.ChainOfThought(MathSolverSignature)

# This is the initial prompt that we will optimize.
SEED_PROMPT = """Solve the math problem carefully. Break down the steps and provide the final answer as a single number."""
```

### The fitness function

The fitness function runs the predictor on each example and collects detailed feedback about the reasoning process:


```python
from typing import Any, Sequence
from examples.math.main import math_metric

def fitness_fn(candidate: dict[str, str], batch: Sequence[Any]) -> list[tuple[float, Any, dict]]:
    predictor.predict.signature.instructions = candidate["prompt"]

    evaluator = dspy.Evaluate(
        devset=list(batch),
        metric=math_metric,
        num_threads=16,
        display_progress=True,
    )
    eval_result = evaluator(predictor)

    results = []

    for example, prediction, metric_result in eval_result.results:
        score = metric_result.score
        feedback = metric_result.feedback

        artifact = {
            "prompt": candidate["prompt"],
            "answer": prediction.answer,
        }

        side_info = {
            "Input": example.input,
            "Output": prediction.answer,
            "Reasoning": prediction.reasoning,
            "ExecutionFeedback": feedback,
        }

        results.append((score, artifact, side_info))

    return results
```
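The fitness function relies on `math_metric` returning both a score and textual feedback. Its implementation is not shown in this post, so the sketch below is only a guess at the general shape of such a metric: `MetricResult` and `toy_math_metric` are hypothetical names, and comparing answers as integers is an assumption that suits AIME-style answers.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float
    feedback: str


def toy_math_metric(expected: str, predicted: str) -> MetricResult:
    """Compare answers as integers and explain the outcome in words."""
    try:
        got = int(predicted.strip())
    except ValueError:
        # Unparseable output is itself useful feedback for prompt optimization
        return MetricResult(0.0, f"Could not parse {predicted!r} as an integer. "
                                 f"The correct answer is {expected}.")
    if got == int(expected):
        return MetricResult(1.0, f"Correct. The answer is {expected}.")
    return MetricResult(0.0, f"Incorrect: got {got}, but the correct answer is {expected}.")


print(toy_math_metric("204", "205").feedback)
print(toy_math_metric("204", "x = 204").feedback)
```

Feedback that distinguishes a formatting failure from a wrong answer gives the reflection step something specific to fix in the prompt.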

Notice how `side_info` includes the model's reasoning trace, which helps GEPA understand *how* the model approached each problem, not just whether it got the right answer.

### Running GEPA optimization


```python
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)

gepa_config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=800,
        track_best_outputs=True,
    ),
    reflection=ReflectionConfig(
        reflection_minibatch_size=3,
        skip_perfect_score=False,
        reflection_lm="openai/gpt-5",
    )
)

result = optimize_anything(
    seed_candidate={"prompt": SEED_PROMPT},
    fitness_fn=fitness_fn,
    dataset=trainset,
    valset=valset,
    config=gepa_config,
)
```

## Example 3: Agent Optimization — Evolving DSPy Programs for ARC-AGI

Our third example pushes GEPA further: optimizing not just prompts or hyperparameters, but the *entire structure* of an AI agent. We'll evolve a DSPy program to solve ARC-AGI tasks, a challenging benchmark requiring visual reasoning and pattern recognition.

**The task**: Given input-output matrix pairs as training examples, produce the correct output for test inputs.

**What GEPA optimizes**: The entire DSPy program source code — signatures, modules, control flow, and prompting strategies.

**Result**: GEPA improves Gemini-2.5-Pro's performance from **44% to 49.5%** by discovering an elaborate 5-step reasoning pipeline with self-refinement.

![ARC AGI Graph](./assets/blog/arc_agi_optimization_progress.png)

### Setting up the dataset


```python
from examples.arc_agi.data import load_arc_agi_dataset

train_set, val_set, test_set = load_arc_agi_dataset()
```

### The seed candidate

We start with a simple Chain-of-Thought program — just a single DSPy module:


```python
seed_candidate = """import dspy
from typing import List
import pydantic

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    training_examples: List[TrainingExample] = dspy.InputField(description="Input and output examples demonstrating the task to be performed.")
    test_inputs: List[MATRIX] = dspy.InputField(description="Input matrices to be solved following the task described in the training examples.")
    test_outputs: List[MATRIX] = dspy.OutputField(description="Output matrices corresponding to the test inputs.")

program = dspy.ChainOfThought(SolveTaskSignature)"""
```

```python
import dspy
import os

from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter
from examples.arc_agi.main import metric_fn

# Create LMs
task_lm = dspy.LM(
    model="openai/gpt-5",
    temperature=1.0,
    max_tokens=32000,
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# Create adapter
adapter = DspyAdapter(
    task_lm=task_lm,
    metric_fn=metric_fn,
    num_threads=64,
    reflection_lm="openai/gpt-5",
)
```

### The fitness function

The fitness function compiles and runs the DSPy program, comparing outputs against ground truth. Crucially, it provides detailed feedback about *what went wrong*:


```python
def format_error_results(program, batch, error_msg):
    """Create error results for all examples in batch."""
    results = []
    for example in batch:
        log = {"error": error_msg, "program": program}
        side_info = {
            "input": example,
            "error": error_msg,
        }
        results.append((0.0, log, side_info))
    return results

def format_results(program, trajectories):
    """Create one result item per trajectory."""
    results = []
    for traj in trajectories:
        metric_result = traj.get("score")
        score = metric_result.get("score")
        feedback = metric_result.get("feedback")
        prediction = traj.get("prediction")
        model_answer = prediction.get("test_outputs")

        log = {
            "program": program,
            "model_answer": model_answer,
            "score": score,
        }

        side_info = {
            "input": traj.get("example"),
            "reasoning": prediction.get("reasoning"),
            "feedback": feedback,
            "output": model_answer,
        }

        results.append((score, log, side_info))

    return results

def fitness_fn(candidate, batch):
    """Evaluate candidate program on batch and return results."""
    program = candidate["program"]

    try:
        eval_batch = adapter.evaluate(batch, candidate, capture_traces=True)
    except Exception as e:
        print(f"Error evaluating candidate: {e}")
        return format_error_results(program, batch, str(e))

    # Program error
    if not isinstance(eval_batch.trajectories, list):
        error_msg = f"All examples failed. Program error: {str(eval_batch.trajectories)}"
        return format_error_results(program, batch, error_msg)

    # Process evaluations with no errors
    return format_results(program, eval_batch.trajectories)
```
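The `feedback` field comes from `metric_fn`, whose body is not shown here. As an illustration of what detailed feedback for grid tasks could look like, the sketch below compares a predicted grid against the expected one. `grid_feedback` is a hypothetical helper written for this post, not the function used in the experiment.

```python
Matrix = list[list[int]]


def describe_shape(m: Matrix) -> str:
    """Report rows x columns, using the first row for the column count."""
    return f"{len(m)}x{len(m[0]) if m else 0}"


def grid_feedback(predicted: Matrix, expected: Matrix) -> tuple[float, str]:
    """Return (score, feedback): exact match scores 1.0, anything else 0.0."""
    same_shape = len(predicted) == len(expected) and all(
        len(p) == len(e) for p, e in zip(predicted, expected)
    )
    if not same_shape:
        return 0.0, (f"Shape mismatch: predicted {describe_shape(predicted)}, "
                     f"expected {describe_shape(expected)}.")
    # List the coordinates where the prediction differs from the target
    wrong = [(r, c) for r, row in enumerate(expected)
             for c, v in enumerate(row) if predicted[r][c] != v]
    if not wrong:
        return 1.0, "Exact match."
    return 0.0, f"{len(wrong)} cell(s) differ; first mismatches at (row, col): {wrong[:5]}."


print(grid_feedback([[1, 2], [3, 0]], [[1, 2], [3, 4]]))
```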

### Running GEPA optimization


```python
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)

gepa_config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=100,
        track_best_outputs=True,
    ),
    reflection=ReflectionConfig(
        reflection_minibatch_size=3,
        skip_perfect_score=False,
        reflection_lm="openai/gpt-5",
    )
)

result = optimize_anything(
    seed_candidate={"program": seed_candidate},
    fitness_fn=fitness_fn,
    dataset=train_set,
    valset=val_set,
    config=gepa_config,
)
```

### What GEPA discovered

After optimization, GEPA evolved the simple ChainOfThought into an elaborate 5-step pipeline:

1. **Hypothesize Rule**: Ask LLM to deduce a natural language transformation rule from training examples
2. **Generate Code**: Ask LLM to implement the rule as a Python function
3. **Validate on Training**: Run the code on all training examples, collecting feedback on failures
4. **Refine if Needed**: If validation fails, ask LLM to fix the code using gathered feedback
5. **Execute on Test**: Run the refined code on test inputs

Remarkably, **GEPA discovered reflective self-refinement** — having the LLM check and fix its own code before producing final outputs.
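The control flow of those five steps can be sketched in a few lines. This is a simplified illustration, not the evolved DSPy program: `solve_with_refinement` is a hypothetical name, the LLM calls are replaced by plain callables (`hypothesize`, `generate`), and the fake implementations below exist only to make the example runnable.

```python
from typing import Callable

Matrix = list[list[int]]
Example = tuple[Matrix, Matrix]  # (input grid, expected output grid)
Transform = Callable[[Matrix], Matrix]


def solve_with_refinement(
    train: list[Example],
    test_inputs: list[Matrix],
    hypothesize: Callable[[list[Example]], str],
    generate: Callable[[str, list[str]], Transform],
    max_refinements: int = 2,
) -> list[Matrix]:
    rule = hypothesize(train)                    # 1. hypothesize a rule
    transform = generate(rule, [])               # 2. generate code for the rule
    for attempt in range(max_refinements + 1):
        feedback = []                            # 3. validate on training examples
        for i, (grid, expected) in enumerate(train):
            got = transform(grid)
            if got != expected:
                feedback.append(f"example {i}: expected {expected}, got {got}")
        if not feedback or attempt == max_refinements:
            break
        transform = generate(rule, feedback)     # 4. refine using the feedback
    return [transform(grid) for grid in test_inputs]  # 5. execute on test inputs


# Fake "LLM" steps: the first draft is wrong, the refined one is right.
def fake_hypothesize(train: list[Example]) -> str:
    return "reverse each row"


def fake_generate(rule: str, feedback: list[str]) -> Transform:
    if not feedback:
        return lambda m: [row[:] for row in m]   # buggy first draft: identity
    return lambda m: [row[::-1] for row in m]    # fixed after seeing failures


train = [([[1, 2]], [[2, 1]])]
outputs = solve_with_refinement(train, [[[3, 4, 5]]], fake_hypothesize, fake_generate)
print(outputs)
```

The training examples double as a test suite here: the generated transformation is only trusted on the test inputs once it reproduces the known outputs or the refinement budget runs out.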


```python
# View the evolved program
print(result.best_candidate["program"][:2000])  # First 2000 chars
```


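## Example 4: Circle Packing

Our final example uses GEPA to discover circle packing algorithms. The seed candidate below places circles on a simple grid.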

![Circle Packing 21](./assets/blog/circle_packing/circle_packing_21.png)

```python
seed_candidate = """
import numpy as np

def pack_circles(n_circles: int) -> list[tuple[float, float]]:
    '''Pack n unit circles, returning their center positions.'''
    positions = []
    for i in range(n_circles):
        # Place circles in a simple grid pattern
        row = i // int(np.sqrt(n_circles) + 1)
        col = i % int(np.sqrt(n_circles) + 1)
        x = col * 2.1  # Slight spacing to avoid overlap
        y = row * 2.1
        positions.append((x, y))
    return positions
"""
```


![Circle Packing 26](./assets/blog/circle_packing/circle_packing_26.png)

![Circle Packing 32](./assets/blog/circle_packing/circle_packing_32.png)
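The fitness function for this example is not shown, so the following is only a sketch of how a packing of unit circles could be scored with useful side information. The objective used here (the side of the smallest axis-aligned square containing all circles), the overlap penalty, and the name `score_packing` are all assumptions for illustration and may differ from the setup behind the figures above.

```python
import math


def score_packing(centers: list[tuple[float, float]], radius: float = 1.0):
    """Score a non-empty packing: smaller bounding square is better, overlaps are penalized."""
    # Two circles overlap when their centers are closer than one diameter
    overlaps = [
        (i, j)
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
        if math.dist(centers[i], centers[j]) < 2 * radius - 1e-9
    ]
    xs = [x for x, _ in centers]
    ys = [y for _, y in centers]
    # Extent of the centers plus one radius of margin on each side
    side = max(max(xs) - min(xs), max(ys) - min(ys)) + 2 * radius
    score = -side if not overlaps else -1e6  # higher is better
    side_info = {
        "n_circles": len(centers),
        "bounding_square_side": side,
        "overlapping_pairs": overlaps,
    }
    return score, side_info


print(score_packing([(0.0, 0.0), (2.0, 0.0)]))
print(score_packing([(0.0, 0.0), (1.0, 0.0)]))
```

Reporting which pairs overlap, rather than only a penalty score, is the kind of side information that lets reflection target the specific placement bug.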

## Key Takeaways

The `optimize_anything` API demonstrates GEPA's power as a general-purpose text evolution engine:

1. **Unified interface**: Whether you're optimizing prompts, code, or agent architectures, the API is the same — just define your fitness function with rich `side_info`.

2. **Side information is key**: The more diagnostic information you provide, the better GEPA's LLM-based reflection can understand failures and propose targeted improvements.

3. **Beyond scalar optimization**: Traditional optimizers only see scores. GEPA sees error messages, execution traces, and domain-specific feedback — enabling it to optimize complex artifacts that would be impossible to search blindly.

4. **Emergent capabilities**: GEPA can discover sophisticated strategies (like self-refinement in the ARC-AGI example) that weren't explicitly programmed — they emerge from the optimization process itself.

Try `optimize_anything` on your own optimization problems. If you can express your system's parameters as text and compute a score with diagnostic feedback, GEPA can optimize it.
