# GEPA: Optimize Anything with LLMs

LLM-based optimization algorithms like GEPA, OpenEvolve, ShinkaEvolve, and AlphaEvolve have shown real promise. But their focus has been fragmented: GEPA focused on evolving LLM prompts, while others targeted scientific and algorithmic discovery. Today, we're announcing GEPA's new `optimize_anything` API—an open-source library you can plug into any scenario to optimize text of any kind: prompts, code, agents, etc.

With this API, GEPA becomes a general-purpose text evolution engine. Given a target metric, GEPA efficiently searches for the right parameters to improve that metric. It uses LLMs to generate proposals and leverages side information from the optimization environment to guide the search. This means GEPA can optimize essentially *anything* with a textual representation.

To show what this looks like in practice, we provide four examples:
- Mathematical optimization. GEPA can outperform Optuna on Evalset.
- Prompt evolution. We evolve a prompt for GPT-4.1 Mini on AIME 2025, improving accuracy from 46.67% to 53.33%.
- Agent program evolution. We evolve the full agent program for GPT-5 on ARC-AGI, boosting scores from 55.6% to 60.5%.
- Algorithmic discovery. We tackle a classic combinatorial optimization problem, Circle Packing, reaching 99.999% of AlphaEvolve's and ShinkaEvolve's solutions and 100.064% of OpenEvolve's.

## The optimize_anything API

At its core, the API is remarkably simple. You provide just two things:

1. **A seed candidate** — your starting point, represented as a dictionary mapping parameter names to their values. 
2. **A fitness function** — tells GEPA how good each candidate is. The fitness function also returns any additional information available from the environment about the evaluated candidate, like compiler error messages, that can guide the optimization.

That's it. GEPA handles the rest — selecting candidates, reflecting on failures, proposing improvements, and tracking the optimization trajectory, finally returning the optimized parameters.
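To make the evaluate, reflect, propose, and track cycle concrete, here is a minimal sketch of a greedy loop over text candidates. It is not GEPA's implementation: `toy_optimize`, the number-guessing task, and the rule-based `propose` function (a stand-in for the LLM reflection step) are all hypothetical names made up for illustration.

```python
from typing import Any, Callable

Candidate = dict[str, str]
FitnessFn = Callable[[Candidate], tuple[float, Any, dict]]
ProposeFn = Callable[[Candidate, dict], Candidate]


def toy_optimize(seed: Candidate, fitness_fn: FitnessFn, propose: ProposeFn, budget: int):
    """Greedy evaluate -> reflect -> propose loop (illustrative only)."""
    best = seed
    best_score, _, best_info = fitness_fn(seed)
    history = [best_score]
    for _ in range(budget - 1):
        child = propose(best, best_info)      # an LLM call in a real system
        score, _, info = fitness_fn(child)
        if score > best_score:                # keep only strict improvements
            best, best_score, best_info = child, score, info
        history.append(best_score)            # track the best score over time
    return best, best_score, history


# Hypothetical task: evolve the text of a number toward a hidden target.
TARGET = 42


def fitness_fn(candidate: Candidate) -> tuple[float, Any, dict]:
    value = int(candidate["guess"])
    if value < TARGET:
        hint = "too low"
    elif value > TARGET:
        hint = "too high"
    else:
        hint = "exact"
    return -abs(value - TARGET), value, {"hint": hint}


def propose(candidate: Candidate, side_info: dict) -> Candidate:
    # Stand-in for reflection: read the hint and move one step toward the target.
    step = {"too low": 1, "too high": -1, "exact": 0}[side_info["hint"]]
    return {"guess": str(int(candidate["guess"]) + step)}


best, best_score, history = toy_optimize({"guess": "38"}, fitness_fn, propose, budget=10)
print(best, best_score)
```

The proposer only makes progress because the fitness function returns a hint alongside the score, which is the role side information plays in the real system.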

### The Fitness Function: Your Optimization Signal

The fitness function is where you define *what* you're optimizing for. It takes a candidate and a data instance, returning scores and diagnostic information:

```python
def fitness_fn(candidate: dict[str, str], instance: DataInst) -> tuple[float, Any, dict]:
    # Run your system with the candidate parameters
    output = run_my_system(candidate, instance)
    
    # Compute a score (higher is better)
    score = compute_score(output, instance)
    
    # Collect diagnostic info for LLM reflection
    side_info = {
        "input": instance,
        "output": output,
        "expected": instance["expected"],
        "error_analysis": analyze_errors(output)
    }
    return (score, output, side_info)
```

### The Power of Side Information

The `side_info` dictionary is what lets GEPA go beyond score-only search. Unlike traditional optimization that only sees a scalar score, GEPA's LLM-based reflection can read it to understand *why* a candidate performed poorly:

- **Error messages**: Compiler errors, runtime exceptions, validation failures
- **Execution traces**: What the candidate actually did vs. what was expected
- **Partial results**: Which sub-tasks succeeded, which failed
- **Domain-specific feedback**: Any signal that helps explain performance

You have complete control over what goes into `side_info`. The more informative it is, the better GEPA can reason about improvements. This is what lets GEPA optimize complex artifacts like code and agent architectures rather than just tweak numbers.
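As an illustration of the kinds of signals listed above, the following sketch runs a candidate's source code and records its printed output and any exception as side information. It is an assumption-laden example, not GEPA's evaluator: `run_candidate` is a hypothetical helper, the 1.0/0.0 score is a placeholder, and plain `exec` is not a sandbox.

```python
import contextlib
import io
import traceback


def run_candidate(code: str, func_name: str, *args) -> tuple[float, object, dict]:
    """Execute candidate source, call one function, and record what happened."""
    stdout = io.StringIO()
    side_info = {"stdout": "", "error": None}
    result = None
    try:
        namespace: dict = {}
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)             # NOTE: plain exec is not a sandbox
            result = namespace[func_name](*args)
        score = 1.0                           # placeholder: "ran without error"
    except Exception:
        # Syntax errors and runtime errors both end up here as readable text.
        side_info["error"] = traceback.format_exc(limit=2)
        score = 0.0
    side_info["stdout"] = stdout.getvalue()
    return score, result, side_info


good = "def f(x):\n    print('called with', x)\n    return x * 2\n"
bad = "def f(x):\n    return x / 0\n"

score_ok, out_ok, info_ok = run_candidate(good, "f", 21)
score_bad, out_bad, info_bad = run_candidate(bad, "f", 21)
print(info_bad["error"])
```

A reflection model that sees the traceback text can target the failing line directly, whereas a bare score of 0.0 says nothing about the cause.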

## Example 1: Mathematical optimization

We first demonstrate GEPA's ability to evolve search code that minimizes blackbox functions from the [evalset benchmark](https://github.com/sigopt/evalset/tree/main), a collection of challenging optimization test functions (Ackley, Rosenbrock, Rastrigin, etc.) used as a benchmark in the [Optuna paper](https://arxiv.org/pdf/1907.10902).

**The task**: Given a blackbox function, write code that finds its minimum. The code can use any optimization library (Optuna, scipy, etc.) and returns the best `x`.

**What GEPA optimizes**: The Python code itself — its structure, algorithm choice, hyperparameters, and implementation details.

The figure below shows GEPA competing with Optuna. Starting from minimal baseline code, GEPA initially underperforms. However, it progressively discovers more effective optimization strategies, eventually finding better solutions than Optuna in later stages. 

<img src="./assets/blog/polynomial_optimization_normalized.png" width="50%">

Here is the code to reproduce this experiment: [link].

While Optuna requires users to select sampling algorithms and techniques for advanced use cases, GEPA frees you from such decisions. By simply defining a baseline code template and the fitness function, GEPA automatically evolves the search code—from experimenting with high-level search strategies to fine-tuning hyperparameters. 

Now, let's walk through a simple example of optimizing code on a single problem from evalset.


### Setting up the dataset

Here, the dataset is a single blackbox optimization problem with bounds, dimension, and problem characteristics:


```python
from examples.polynomial.evalset import sample_problem

dataset = [sample_problem]
dataset
```

### The seed candidate

We start with a trivial baseline that randomly samples a solution. The function signature exposes the problem structure: `dim` and `bounds` define the search space, `objective_function` lets the solver evaluate and compare candidates before returning its best guess, and `prev_best_x` provides the best solution found so far (if any).


```python
seed_candidate = """
import numpy as np

def solve(dim, bounds, objective_function, prev_best_x):
    bounds_arr = np.array(bounds)
    x = np.random.uniform(bounds_arr[:, 0], bounds_arr[:, 1])
    y = objective_function(x)
    return x
"""
```

### The fitness function

The fitness function executes the candidate code in a sandboxed environment, captures the result, and returns rich diagnostic information:



```python
from typing import Any
from examples.new_polynomial.evaluator import execute_code, compute_score

# TODO: add best_side_info to the fitness function
def fitness_fn(candidate: dict[str, str], problem: Any, best_side_info: dict) -> tuple[float, Any, dict]:
    code = candidate["code"]
    execution = execute_code(code, 300, {
        "dim": problem["dim"],
        "bounds": problem["bounds"],
        "objective_function": problem["objective_function"],
        "prev_best_x": best_side_info["X"],
    })
    score = compute_score(execution)

    side_info = {
        "scores": {"score": score},
        "Input": {"problem_description": problem["problem_description"]},
        "code_side_info": {
            "X": execution["results"].get("x", "not found"),
            "Prints": execution["output"],       # Captured stdout
            "Logs": execution["logs"],           # Captured stderr
            "Error": execution["error"],         # Any exceptions
        },
    }

    return score, {"code": code, **side_info}, side_info
```

Notice how `side_info` captures everything the LLM needs to understand *why* the code failed or succeeded: error messages, print output, and the result found.
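`compute_score` is imported from the example's evaluator and its definition is not shown here. As a rough sketch of what such a scorer could look like, the function below maps a minimization result to a higher-is-better score. Everything in it is an assumption for illustration: `score_minimization` and `FAILURE_SCORE` are hypothetical names, and the penalty for invalid answers is arbitrary.

```python
import math

FAILURE_SCORE = -1e9  # assumed penalty for crashes or invalid answers


def score_minimization(x, objective_function, bounds) -> float:
    """Map a minimization result to a higher-is-better score."""
    if x is None or len(x) != len(bounds):
        return FAILURE_SCORE                  # missing or wrongly sized answer
    if any(not (lo <= xi <= hi) for xi, (lo, hi) in zip(x, bounds)):
        return FAILURE_SCORE                  # out-of-bounds (or NaN) coordinates
    y = objective_function(x)
    if not math.isfinite(y):
        return FAILURE_SCORE                  # objective returned inf or NaN
    return -y                                 # lower objective -> higher score


def sphere(x):
    return sum(xi * xi for xi in x)


bounds = [(-5.0, 5.0), (-5.0, 5.0)]
print(score_minimization([1.0, 2.0], sphere, bounds))
```

Negating the objective keeps the "higher is better" convention used throughout, and a fixed failure score gives broken candidates a well-defined place in the ranking.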

### Running GEPA optimization


```python
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

result = optimize_anything(
    seed_candidate={"code": seed_candidate},  # fitness_fn reads candidate["code"]
    fitness_fn=fitness_fn,
    dataset=dataset,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=100,  # tweak the number
            track_best_outputs=True,
            cache_evaluation=True,  # TODO: add
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-5.1",
            # Problems shown per reflection; this example has a single problem, so we set it to 1.
            reflection_minibatch_size=1,
        ),
    ),
)

# Access the optimized code
print(result.best_candidate["code"])
```

```python
from examples.new_polynomial.best_program import program

print(program)
```

```python
# EVOLVE-BLOCK-START
# MUTATION APPLIED: Added zero-vector warm-start, ridge linear surrogate directional probes, and a budget-limited Nelder–Mead subspace finisher
# RATIONALE: Zero seed targets typical polynomial optima near origin; ridge-linear steps cheaply exploit global gradient hints from archive; Nelder–Mead subspace can squeeze extra improvements near the end without heavy modeling

import numpy as np
import os
import json
import math
import time


def solve(dim, total_evaluation_budgets, bounds):
    # Bounds and helpers
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    span = ub - lb
    mid = (lb + ub) / 2.0

    # Anytime-valid best
    best_x = mid.copy()
    best_y = -np.inf
    evals = 0
    budgets = int(total_evaluation_budgets)

    rng = np.random.default_rng()

    # Reflection to bounds (mirror) for arrays of points
    def reflect_to_bounds(X):
        X = np.asarray(X, dtype=float)
        if X.ndi
# ... (output truncated)
```

**The evolved solution:** GEPA discovered a hybrid optimizer that combines adaptive evolutionary search with surrogate-assisted trust-region methods—automatically escalating from cheap linear models to richer quadratic approximations as the search stalls.

**Why it works:** Rather than relying on a fixed algorithm, GEPA learned to dynamically balance exploration (orthogonalized sampling, Cauchy jumps) and exploitation (gradient probes, Nelder-Mead) based on observed progress—a strategy no human specified, but one that outperforms hand-tuned baselines.

## Example 2: Prompt Optimization

In our [GEPA paper](link) at ICLR 2025, we showed that GEPA outperforms the previous state-of-the-art optimizer, MIPROv2, by over 10%—and even beats GRPO using only 2% of the rollouts across four tasks.

<!-- <img src="./assets/blog/gepa_aime.png" width="70%"> -->

In this tutorial, we evolve a prompt for GPT-4.1 Mini by optimizing it on AIME 2022-2024 problems and testing it on AIME 2025.

<img src="./assets/blog/aime_best_comparison.png" width="70%">

```python
from examples.math.dataset import load_math_dataset

trainset, valset, testset = load_math_dataset()
```

```
Loaded 45 training examples
Loaded 45 validation examples
Loaded 30 test examples
```


```python
import dspy
import os

# Use GPT-4.1-mini as the language model to solve the math problems.
lm = dspy.LM("gpt-4.1-mini", api_key=os.environ.get("OPENAI_API_KEY"), temperature=1.0, max_tokens=32000)
dspy.configure(lm=lm)

# Define a simple base prompt that we will optimize.
SEED_PROMPT = """Solve the math problem carefully. Break down the steps and provide the final answer as a single number."""
```

```python
from typing import Any

from gepa.optimize_anything import SideInfo

from examples.math.main import run_llm, math_metric

def fitness_fn(candidate: dict[str, str], example: Any) -> tuple[float, Any, SideInfo]:
    prediction = run_llm(example, candidate["prompt"])
    metric_result = math_metric(example, prediction)
    score = metric_result.score
    feedback = metric_result.feedback

    output = {
        "prompt": candidate["prompt"],
        "answer": prediction.answer,
        "score": score,
    }

    side_info = {
        "Input": example.input,
        "Output": prediction.answer,
        "Reasoning": getattr(prediction, "reasoning", ""),
        "ExecutionFeedback": feedback,
    }

    return (score, output, side_info)
```
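`math_metric` is imported from the example code and not shown here. The sketch below shows one plausible shape for such a metric: an exact-match check on integer answers (AIME answers are integers) that returns both a score and a textual feedback string. `MetricResult` and `exact_match_metric` are hypothetical names, and the feedback wording is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float
    feedback: str


def exact_match_metric(expected: str, predicted: str) -> MetricResult:
    """Score 1.0 when the predicted answer equals the expected integer."""
    try:
        predicted_value = int(str(predicted).strip())
    except ValueError:
        # Unparseable answers get a score of 0 plus an explanation.
        return MetricResult(
            0.0,
            f"Could not parse '{predicted}' as an integer. The expected answer is {expected}.",
        )
    if predicted_value == int(expected):
        return MetricResult(1.0, f"Correct. The answer is {expected}.")
    return MetricResult(
        0.0,
        f"Incorrect. You answered {predicted_value}, but the expected answer is {expected}.",
    )


print(exact_match_metric("204", "210").feedback)
```

Returning the feedback string alongside the score is what lets it flow into `side_info`, so the reflection model sees why an answer was rejected and not only that it was.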

```python
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)

gepa_config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=800,
        track_best_outputs=True,
    ),
    reflection=ReflectionConfig(
        reflection_minibatch_size=3,
        skip_perfect_score=False,
        reflection_lm="openai/gpt-5.1",
    )
)

result = optimize_anything(
    seed_candidate={"prompt": SEED_PROMPT},
    fitness_fn=fitness_fn,
    dataset=trainset,
    valset=valset,
    config=gepa_config,
)

best_prompt = result.best_candidate["prompt"]
best_prompt
```

```python
from examples.math.main import evaluate_on_dataset

# Baseline Evaluation
print("\nEvaluating Baseline (Initial Prompt)...")
baseline_score = evaluate_on_dataset(SEED_PROMPT, testset)

# Optimized Evaluation
print("\nEvaluating Best Optimized Program...")
best_prompt = result.best_candidate["prompt"]
print(f"Best Prompt Found:\n{best_prompt}")

optimized_score = evaluate_on_dataset(best_prompt, testset)

print(f"Baseline Score: {baseline_score:.2%}")
print(f"Optimized Score: {optimized_score:.2%}")
print(f"Improvement: {optimized_score - baseline_score:.2%}")
```

```python
from examples.math.best_program import prompt

print(prompt)
```

```
Solve from first principles with explicit checks. Requirements:

1) Model precisely:
- Define all objects, variables, and constraints algebraically/combinatorially.
- Choose one counting model (labeled vs indistinguishable) and stay consistent. For combinatorics, either label and divide at the end OR keep indistinguishable throughout—do not mix.
- For number-theory/decimal/ratio problems, state factorizations and gcd/lcm relations explicitly.

2) Mapping/Counting rigor:
- When mapping elements between sets (e.g., m ↦ m/gcd(m,N)), prove injectivity/surjectivity or otherwise handle overlaps via inclusion–exclusion. Do not assume unions over divisors are disjoint without proof.
- When computing a probability, ensure numerator and denominator are counts from the same sample space.
- Keep all computations exact (fractions/radicals/modular arithmetic); avoid decimals unless terminating.

3) Geometry workflow:
- Draw and name a diagram (mentally or on paper). List candidate theorems: power o
... (output truncated)
```


## Example 3: Agent Optimization — Evolving DSPy Programs for ARC-AGI

Our third example pushes GEPA further: optimizing not just prompts or hyperparameters, but the *entire structure* of an AI agent. We'll evolve a DSPy program to solve ARC-AGI tasks — a challenging benchmark requiring visual reasoning and pattern recognition.

**The task**: Given input-output matrix pairs as training examples, produce the correct output for test inputs.

**What GEPA optimizes**: The entire DSPy program source code — signatures, modules, control flow, and prompting strategies.

**Result**: GEPA improves GPT-5's performance from **55.6% to 60.5%** by discovering an elaborate 5-step reasoning pipeline with self-refinement.

<!-- ![ARC AGI Graph](./assets/blog/arc_agi_optimization_progress.png) -->
<img src="./assets/blog/arc_agi_best_comparison.png" width="50%">

### Setting up the dataset


```python
from examples.arc_agi.data import load_arc_agi_dataset

train_set, val_set, test_set = load_arc_agi_dataset()
```

```
Train set: 200
Val set: 200
Test set: 400
```


### The seed candidate

We start with a simple Chain-of-Thought program — just a single DSPy module:


```python
seed_candidate = """import dspy
from typing import List
import pydantic

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    training_examples: List[TrainingExample] = dspy.InputField(description="Input and output examples demonstrating the task to be performed.")
    test_inputs: List[MATRIX] = dspy.InputField(description="Input matrices to be solved following the task described in the training examples.")
    test_outputs: List[MATRIX] = dspy.OutputField(description="Output matrices corresponding to the test inputs.")

program = dspy.ChainOfThought(SolveTaskSignature)"""
```

```python
import dspy
import os

from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter
from examples.arc_agi.main import metric_fn

# Create LMs
task_lm = dspy.LM(
    model="openai/gpt-5",
    temperature=1.0,
    max_tokens=32000,
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# Create adapter
adapter = DspyAdapter(
    task_lm=task_lm,
    metric_fn=metric_fn,
    num_threads=64,
    reflection_lm="openai/gpt-5",
)
```

### The fitness function

The fitness function compiles and runs the DSPy program, comparing outputs against ground truth. Crucially, it provides detailed feedback about *what went wrong*:


```python
def fitness_fn(candidate, example):
    program = candidate["program"]
    print("Example: ", type(example))

    try:
        evaluation_results = adapter.evaluate([example], candidate, capture_traces=True)
    except Exception as e:
        side_info = {
            "input": example,
            "error": str(e),
            "program": program
        }
        return (0.0, side_info, side_info)

    # Program error
    if not isinstance(evaluation_results.trajectories, list) or len(evaluation_results.trajectories) == 0:
        print("Error: ")
        print(evaluation_results.trajectories)
        side_info = {
            "input": example,
            "error": f"All examples failed. Program error: {str(evaluation_results.trajectories)}",
            "program": program
        }
        return (0.0, side_info, side_info)

    # Process evaluations with no program errors
    trajectory = evaluation_results.trajectories[0]
    metric_result = trajectory.get("score")
    score = metric_result.get("score")
    feedback = metric_result.get("feedback")
    prediction = trajectory.get("prediction")

    side_info = {
        "input": example,
        "reasoning": prediction.get("reasoning"),
        "feedback": feedback,
        "output": prediction.get("test_outputs"),
    }

    return (score, side_info, side_info)
```

### Running GEPA optimization


```python
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)
from examples.arc_agi.prompt import REFLECTION_PROMPT

gepa_config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=4000,
        track_best_outputs=True,
        parallel=True,
        max_workers=64,
    ),
    reflection=ReflectionConfig(
        reflection_minibatch_size=3,
        reflection_lm="openai/gpt-5",
        reflection_prompt_template=REFLECTION_PROMPT,
    )
)

result = optimize_anything(
    seed_candidate={"program": seed_candidate},
    fitness_fn=fitness_fn,
    dataset=train_set,
    valset=val_set,
    config=gepa_config,
)
```

```python
from examples.arc_agi.best_program import program

print(program)
```

```python
import dspy
from typing import List, Optional, Any, Dict, Tuple, Callable
import pydantic
import re
import traceback
import copy

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    """
    Solve ARC-style grid transformations by learning a function from examples.

    Inputs:
    - training_examples: A list of (input, output) grid pairs that demonstrate the task. Grids are integer matrices.
    - test_inputs: Grids to transform using the learned task.

    Output:
    - test_outputs: Exact grids corresponding to each test input.

    Approach:
    - Induce a general, deterministic transformation as Python code: def transform(grid: List[List[int]]) -> List[List[int]].
    - Common patterns:
      1) Separator rows/columns: entire rows/cols of a single color often partition the grid; keep separators unchanged.
      2) Block-wise aggregation: when grid is partitioned into kxk blocks by 
# ... (output truncated)
```

### What GEPA discovered
The evolved program implements a code synthesis loop with self-repair. Rather than prompting the LLM to directly output grid transformations, it asks the model to write a `transform(grid)` function, then verifies that function against the task's demonstration pairs before applying it to the held-out test input. If verification fails, it feeds specific mismatches back as hints and retries—up to five attempts.

A few innovations stand out:

- **Verification-driven self-debugging.** The agent doesn't trust its initial output. It executes the generated code on demonstration inputs, diffs against expected outputs, and uses failure diagnostics (e.g., "expected 3 at (2,4), got 0") to refine the next attempt. This behavior emerged from evolution, not manual design.

- **Domain priors in the prompt.** The synthesis prompt explicitly encodes common ARC patterns: separator detection, block aggregation, mask algebra, noise cleanup. GEPA discovered that enumerating these priors helps the underlying model generalize across tasks.

- **Graceful degradation.** If all attempts fail, the program returns an identity transform. It never crashes, even when it cannot solve a task.

GEPA arrived at this design through search rather than manual engineering.
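The verify-and-retry pattern described above can be sketched in a few lines. This is not the evolved program, which is only partially shown: `diff_grids`, `solve_with_repair`, and the `stub_generator` that stands in for the LLM call are hypothetical names, and the sketch assumes the generated source defines a function named `transform`.

```python
Grid = list[list[int]]


def diff_grids(expected: Grid, actual: Grid) -> list[str]:
    """Describe cell-level mismatches between two grids."""
    if [len(row) for row in expected] != [len(row) for row in actual]:
        return ["shape mismatch"]
    return [
        f"expected {e} at ({r},{c}), got {a}"
        for r, (erow, arow) in enumerate(zip(expected, actual))
        for c, (e, a) in enumerate(zip(erow, arow))
        if e != a
    ]


def solve_with_repair(demos, test_input: Grid, generate_code, max_attempts: int = 5) -> Grid:
    """Generate transform code, verify it on the demos, and retry with hints."""
    hints: list[str] = []
    for _ in range(max_attempts):
        source = generate_code(demos, hints)  # an LLM call in a real agent
        try:
            namespace: dict = {}
            exec(source, namespace)           # NOTE: plain exec is not a sandbox
            transform = namespace["transform"]
            hints = [msg for x, y in demos for msg in diff_grids(y, transform(x))]
            if not hints:                     # every demonstration pair reproduced
                return transform(test_input)
        except Exception as exc:
            hints = [f"{type(exc).__name__}: {exc}"]
    return test_input                         # fallback: identity transform


def stub_generator(demos, hints):
    # Stand-in for the model: wrong on the first try, corrected once hints arrive.
    if not hints:
        return "def transform(grid):\n    return grid\n"
    return "def transform(grid):\n    return [row[::-1] for row in grid]\n"


demos = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
result = solve_with_repair(demos, [[3, 0, 0]], stub_generator)
print(result)
```

Keeping the whole attempt inside one `try` block means a crash in generated code becomes the next hint instead of an unhandled exception, which matches the graceful-degradation behavior described above.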

```python
# View the evolved program
print(result.best_candidate["program"][:2000])  # First 2000 chars
```

## Example 4: Circle Packing

Circle packing is a classic benchmark used by ShinkaEvolve, OpenEvolve, and AlphaEvolve.
Here, we show that GEPA can also be used for algorithmic discovery on this problem.

<img src="./assets/blog/circle_packing/circle_packing_26_comparison.png" width="80%">

GEPA finds a world-record-level solution: its sum of radii reaches 99.999% of the AlphaEvolve and ShinkaEvolve results and 100.064% of the OpenEvolve result.

<!-- <img src="./assets/blog/circle_packing/circle_packing_21.png" width="50%">

<img src="./assets/blog/circle_packing/circle_packing_26.png" width="50%">

<img src="./assets/blog/circle_packing/circle_packing_32.png" width="50%"> -->

<!-- ### Batch mode (Evolving a search code for 13 instances)

num_circles = [7, 13, 19, 21, 22, 26, 29, 31, etc.]

<img src="./assets/blog/circle_packing/gepa_vs_shinka.png" width="50%">

We take Shinka as a baseline and run the same gpt5.1 for a batch mode. 
We can see that more data instances -> GEPA save computes while Shinka performs a full validation. -->

```python
import numpy as np

# best circles
best_circles = [
    [0.08468730125813197, 0.08468730125813197, 0.08468730125813197],
    [0.8889367079091394, 0.1110632920908606, 0.1110632920908606],
    [0.08468730125813231, 0.9153126987418675, 0.08468730125813231],
    [0.8889367079091401, 0.8889367079091401, 0.11106329209085986],
    [0.2735316496509986, 0.10527607855641154, 0.10527607855641154],
    [0.48251901284944326, 0.10371709930579165, 0.10371709930579165],
    [0.682251639361118, 0.09615849835819791, 0.09615849835819791],
    [0.2735316496509994, 0.8947239214435884, 0.10527607855641163],
    [0.4825190128494443, 0.8962829006942082, 0.10371709930579176],
    [0.6822516393611192, 0.9038415016418019, 0.09615849835819812],
    [0.13236009424048484, 0.2964345012443294, 0.13236009424048484],
    [0.07826927088831243, 0.49999999999999845, 0.07826927088831243],
    [0.13236009424048586, 0.7035654987556688, 0.13236009424048586],
    [0.9075441882905443, 0.31372998221417503, 0.09245581170945572],
    [0.9061808044177737, 0.5000000000000004, 0.09381919558222629],
    [0.907544188290544, 0.686270017785826, 0.09245581170945605],
    [0.2697939589756203, 0.500000000000001, 0.11325541719900234],
    [0.38112506495598114, 0.7008876686603106, 0.11641928880622215],
    [0.38112506495598014, 0.2991123313396901, 0.11641928880622275],
    [0.7632491637451031, 0.7594985921321933, 0.06935728482284345],
    [0.5965286484003857, 0.7268072190485636, 0.10053814216358756],
    [0.742587601069024, 0.40428027755541857, 0.09571972244458213],
    [0.7425876010690241, 0.5957197224445819, 0.09571972244458193],
    [0.5965286484003852, 0.273192780951437, 0.1005381421635882],
    [0.5325196677311786, 0.5000000000000006, 0.1351282835717167],
    [0.7632491637451023, 0.24050140786780766, 0.06935728482284145],
]

# Each row is (x, y, r); the objective is the sum of the radii.
best_circles_sum = np.array(best_circles)[:, 2].sum()
best_circles_sum
```

```
2.635977394754397
```

Within just 150 evaluations, GEPA found a solution whose sum of radii is 99.9999880668% of AlphaEvolve's, 99.9997836004% of ShinkaEvolve's, and 100.063963765% of OpenEvolve's.
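A sum of radii only counts if the packing is valid. As a hedged sketch, the check below assumes the usual setup for this benchmark (circles given as `(x, y, r)` triples inside the unit square, with no two circles overlapping) and reports violations as text that could be passed along as side information. `packing_violations` and the tolerance value are hypothetical choices, not part of GEPA.

```python
import math
from itertools import combinations


def packing_violations(circles, tol: float = 1e-9) -> list[str]:
    """Return readable constraint violations for (x, y, r) circles in the unit square."""
    problems = []
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            problems.append(f"circle {i} has non-positive radius {r}")
        # Smallest signed distance from the circle's edge to any side of the square.
        if min(x - r, y - r, 1 - x - r, 1 - y - r) < -tol:
            problems.append(f"circle {i} leaves the unit square")
    for (i, a), (j, b) in combinations(enumerate(circles), 2):
        # Negative gap means the two circles overlap.
        gap = math.hypot(a[0] - b[0], a[1] - b[1]) - (a[2] + b[2])
        if gap < -tol:
            problems.append(f"circles {i} and {j} overlap by {-gap:.3g}")
    return problems


valid = [(0.25, 0.25, 0.25), (0.75, 0.75, 0.25)]
overlapping = [(0.25, 0.25, 0.25), (0.5, 0.25, 0.25)]
print(packing_violations(valid))
print(packing_violations(overlapping))
```

A small tolerance is needed because optimized packings have circles that touch exactly, so floating-point round-off can produce tiny negative gaps.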

## Key Takeaways

The `optimize_anything` API demonstrates GEPA's power as a general-purpose text evolution engine:

1. **Unified interface**: Whether you're optimizing prompts, code, or agent architectures, the API is the same — just define your fitness function with rich `side_info`.

2. **Side information is key**: The more diagnostic information you provide, the better GEPA's LLM-based reflection can understand failures and propose targeted improvements.

3. **Beyond scalar optimization**: Traditional optimizers only see scores. GEPA sees error messages, execution traces, and domain-specific feedback — enabling it to optimize complex artifacts that would be impossible to search blindly.

4. **Emergent capabilities**: GEPA can discover sophisticated strategies (like self-refinement in the ARC-AGI example) that weren't explicitly programmed — they emerge from the optimization process itself.

Try `optimize_anything` on your own optimization problems. If you can express your system's parameters as text and compute a score with diagnostic feedback, GEPA can optimize it.
