# `optimize_anything`: A Universal API for Text-Based Optimization

**TL;DR**: We introduce `optimize_anything`, a single, declarative API that uses LLMs as intelligent proposers to optimize *anything* representable as text—code, prompts, configurations, agent architectures. The key insight: if it can be serialized to text, an LLM can reason about it and propose improvements. The secret sauce? **A**uxiliary **S**ide **I**nformation (ASI).

---

## Key Takeaways

1. **Unified Interface**: Whether you're optimizing prompts, code, hyperparameters, or agent architectures, the API is the same: just provide a `seed_candidate` (your starting point) and a `fitness_fn` (how well a candidate performs).

2. **The Convex Hull of Optimization**: `optimize_anything` is designed to be the "convex hull" of all text-based optimization problems. Different libraries optimize different things (Optuna for hyperparameters, evolutionary strategies for algorithms, gradient descent for neural networks). We unify them under one abstraction.

3. **Side Information is Key**: Unlike traditional optimizers that only see scalar scores, GEPA (the engine behind `optimize_anything`) uses LLM-based reflection to understand *why* a candidate performed poorly, drawing on rich diagnostic information: error messages, execution traces, partial results.

4. **Emergent Capabilities**: GEPA can discover sophisticated strategies (like self-refinement) that weren't explicitly programmed—they emerge from the optimization process itself.

---

## Results Summary

| Domain | Task | Baseline | Optimized | Improvement |
|--------|------|----------|-----------|-------------|
| **Mathematical Optimization** | EvalSet Benchmark | Optuna TPE | GEPA | Outperforms Optuna |
| **Prompt Engineering** | AIME 2025 (GPT-4.1 Mini) | 46.67% | 60.00% | +13.3% absolute |
| **Agent Evolution** | ARC-AGI (GPT-5) | 55.6% | 60.5% | +4.9% absolute. Discovers a sophisticated 6-step agent. |
| **Algorithmic Discovery** | Circle Packing (N=26) | 0.9798 | 2.6359 | Exceeds AlphaEvolve, ShinkaEvolve, OpenEvolve |

---

## Outline

1. **[The Landscape of Optimization (The "Old" Way)](#section-1)** — The fragmented world of optimization libraries
2. **[The Unifying Abstraction: `optimize_anything`](#section-2)** — One API to rule them all
3. **[The Secret Weapon: Auxiliary Side Information (ASI)](#section-3)** — Why GEPA outperforms traditional optimizers
4. **[Example 1: Mathematical Optimization](#section-4)** — Evolving code to beat Optuna on EvalSet
5. **[Example 2: Prompt Engineering](#section-5)** — Optimizing prompts for AIME 2025
6. **[Example 3: Agent Program Evolution](#section-6)** — Evolving DSPy programs for ARC-AGI
7. **[Example 4: Algorithmic Discovery](#section-7)** — Circle packing that matches AlphaEvolve
8. **[How It Works Under the Hood](#section-8)** — The GEPA engine
9. **[Conclusion](#section-9)** — From imperative to declarative optimization

---

<a id="section-1"></a>
# 1. The Landscape of Optimization (The "Old" Way)

The world of optimization is **fragmented**. Each problem domain has its own specialized library with its own API, paradigm, and learning curve. Let's look at the major categories:

### Hyperparameter/Black-Box Optimization (Optuna)

For hyperparameter tuning, you use Bayesian optimization or Tree-structured Parzen Estimators (TPE):

In [None]:
import optuna

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 5)
    
    # Train model and return validation score
    model = build_model(lr, n_layers)
    return train_and_evaluate(model)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

### Mathematical Optimization (SciPy)

For continuous optimization of mathematical functions, you use classical algorithms like L-BFGS-B or SLSQP:

In [None]:
from scipy.optimize import minimize

def rosenbrock(x):
    return sum(100*(x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

result = minimize(
    rosenbrock, 
    x0=[0, 0, 0, 0],
    method='L-BFGS-B',
    bounds=[(-5, 5)] * 4
)

### Evolutionary Algorithms (DEAP)

For evolving programs or complex structures, you use genetic algorithms:

In [None]:
import random

from deap import base, creator, tools, algorithms

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, 100)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", eval_func)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

population = toolbox.population(n=300)
algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2, ngen=40)

### The Problem: Fragmentation

**A user needs to learn 3 different paradigms to solve 3 different optimization problems.**

| Library | Domain | Paradigm | What You Must Know |
|---------|--------|----------|-------------------|
| Optuna | Hyperparameters | Bayesian/TPE | Samplers, pruners, search space definition |
| SciPy | Mathematical Functions | Classical Methods | Algorithm selection (L-BFGS, SLSQP, etc.) |
| DEAP | Evolutionary | Genetic Algorithms | Crossover, mutation, selection operators |

Each library has:
- **Different APIs and abstractions** — you can't just swap one for another
- **Different optimization strategies** hard-coded into the implementation
- **Different assumptions** about what can be optimized (differentiable? discrete? continuous?)

### The Insight: Text is the Universal Representation

Here's the key insight: **if something can be represented as text, an LLM can reason about it and propose improvements**.

- **Code** is text → LLMs can write and improve code
- **Prompts** are text → LLMs can refine instructions
- **Configurations** are text → LLMs can tune JSON/YAML
- **Agent architectures** are text → LLMs can evolve program structure

What if we had **one API** that could optimize all of them—by leveraging the LLM's ability to understand and generate text?
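To make the "configurations are text" point concrete, here is a minimal sketch of round-tripping a hyperparameter config through a text candidate. The helper names (`config_to_candidate`, `candidate_to_config`) and the `"config"` key are illustrative assumptions, not part of the `optimize_anything` API.

```python
import json

def config_to_candidate(config: dict) -> dict[str, str]:
    """Serialize a config dict into a text candidate (hypothetical helper)."""
    return {"config": json.dumps(config, indent=2, sort_keys=True)}

def candidate_to_config(candidate: dict[str, str]) -> dict:
    """Parse the text candidate back into a config dict (hypothetical helper)."""
    return json.loads(candidate["config"])

config = {"lr": 0.001, "n_layers": 3}
candidate = config_to_candidate(config)   # this text is what an LLM would read and edit
restored = candidate_to_config(candidate) # this dict is what your training code would consume
print(candidate["config"])
```

Anything that survives this kind of round trip can, in principle, be handed to an LLM as a string and parsed back before evaluation.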


---

<a id="section-2"></a>
# 2. The Unifying Abstraction: `optimize_anything`

We introduce `optimize_anything`—a single entry point for optimizing any text-representable artifact. It's designed to be the **"Convex Hull"** of all optimization problems: every point in the space of text-based optimization can be reached through this API.

## The API Signature

The API is intentionally minimal. You need only two things:
1. **A seed candidate** — your starting point
2. **A fitness function** — how to measure success

In [None]:
from gepa.optimize_anything import optimize_anything, GEPAConfig

def optimize_anything(
    # === REQUIRED ===
    seed_candidate: dict[str, str],           # Your starting point (text parameters to optimize)
    fitness_fn: FitnessFn,                    # How to measure success
    
    # === OPTIONAL: Data ===
    dataset: list[DataInst] | None = None,   # Examples to optimize on (for example, multiple related tasks)
    valset: list[DataInst] | None = None,    # Held-out set for ensuring generalization if required
    
    # === OPTIONAL: Natural Language Guidance ===
    objective: str | None = None,            # What you're trying to achieve (e.g. "Find a prompt that maximizes accuracy")
    background: str | None = None,           # Domain knowledge, constraints, strategies (e.g. notes on the framework the candidate is written in)
    
    # === OPTIONAL: Fine-Grained Control ===
    config: GEPAConfig | None = None,        # Engine, reflection, tracking settings
) -> GEPAResult:
    """
    Optimize any parameterized system using evolutionary algorithms with LLM-based reflection.
    
    Returns:
        GEPAResult containing best_candidate, optimization history, and metrics.
    """
    ...

## The Philosophy: Declare, Don't Implement

With `optimize_anything`, the user **declares** the optimization problem:

| You Provide | Example | Purpose |
|-------------|---------|---------|
| `seed_candidate` | `{"prompt": "Solve this math problem:"}` | Your starting point |
| `fitness_fn` | Returns (score, output, side_info) | How to measure success |
| `dataset` (optional) | List of test cases | Multi-instance generalization |
| `objective` (optional) | "Find a prompt that maximizes accuracy" | Natural language guidance |
| `background` (optional) | "Solutions must handle edge cases" | Domain knowledge |

GEPA handles the **how**: proposing mutations, reflecting on failures, selecting candidates, and tracking the optimization trajectory.
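To give a feel for the propose, evaluate, and select cycle, here is a toy sketch. It is not GEPA's actual algorithm: the LLM proposer is replaced by a random stub (`propose`), selection is plain greedy hill climbing, and every name below is hypothetical.

```python
import random

TARGET = "hello"  # toy objective: recover this string

def fitness_fn(candidate):
    """Score = number of matching characters; side_info says which positions are wrong."""
    text = candidate["text"]
    wrong = [i for i, (a, b) in enumerate(zip(text, TARGET)) if a != b]
    return float(len(TARGET) - len(wrong)), text, {"Mismatched positions": wrong}

def propose(parent, side_info, rng):
    """Stand-in for the LLM proposer: rewrite one position flagged in side_info."""
    wrong = side_info["Mismatched positions"]
    if not wrong:
        return dict(parent)
    chars = list(parent["text"])
    chars[rng.choice(wrong)] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return {"text": "".join(chars)}

def toy_optimize(seed, fitness_fn, propose, n_iters=200, rng_seed=0):
    """Greedy loop: keep a proposal only if it strictly improves the score."""
    rng = random.Random(rng_seed)
    best = seed
    best_score, _, best_info = fitness_fn(best)
    history = [best_score]
    for _ in range(n_iters):
        child = propose(best, best_info, rng)
        score, _, info = fitness_fn(child)
        if score > best_score:
            best, best_score, best_info = child, score, info
        history.append(best_score)
    return best, best_score, history

best, best_score, history = toy_optimize({"text": "xxxxx"}, fitness_fn, propose)
print(best, best_score)
```

Even in this toy, the proposer only touches positions that `side_info` flags as wrong, which is the same idea the real system relies on at a much larger scale.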

## Two Modes of Operation

### Per-Instance Mode (with `dataset`)

For problems where you want parameters that **generalize** across examples:
- **Prompt optimization**: The same prompt should work on many math problems
- **Agent architecture search**: The same agent should solve many tasks

```python
# dataset is a list of examples
result = optimize_anything(
    seed_candidate={"prompt": "Solve:"},
    fitness_fn=evaluate_prompt,
    dataset=math_problems,  # ← Optimize across these
    valset=held_out_problems,  # ← Test generalization
)
```

### Single-Instance Mode (without `dataset`)

For problems defined by a **single optimization target**:
- **Circle packing**: Maximize sum of radii for N circles
- **Code evolution**: Minimize a mathematical function

```python
# dataset=None triggers single-instance mode
result = optimize_anything(
    seed_candidate={"code": "def solve(): ..."},
    fitness_fn=evaluate_code,
    dataset=None,  # ← Single optimization target
)
```
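The difference between the two modes can be sketched as follows. This is an assumption about the general shape only: the averaging rule and the names `evaluate_candidate` and `toy_fitness` are hypothetical, and GEPA's internal aggregation may differ.

```python
from statistics import mean

def toy_fitness(candidate, example=None):
    """Hypothetical fitness function that supports both modes."""
    k = int(candidate["k"])
    if example is None:
        # Single-instance mode: one fixed target, here maximizing -(k - 4)^2
        return -float((k - 4) ** 2), k, {}
    # Per-instance mode: does multiplying the input by k give the expected output?
    out = example["input"] * k
    return float(out == example["expected"]), out, {"Output": out}

def evaluate_candidate(candidate, fitness_fn, dataset=None):
    """Return one score: a single call without a dataset, the mean score with one."""
    if dataset is None:
        return fitness_fn(candidate, None)[0]
    return mean(fitness_fn(candidate, ex)[0] for ex in dataset)

dataset = [
    {"input": 1, "expected": 2},
    {"input": 3, "expected": 6},
    {"input": 5, "expected": 11},
]
per_instance = evaluate_candidate({"k": "2"}, toy_fitness, dataset)  # 2 of 3 examples match
single = evaluate_candidate({"k": "2"}, toy_fitness)                 # -(2 - 4)^2
print(per_instance, single)
```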

## The Fitness Function: Your Optimization Signal

The fitness function is where you define *what* you're optimizing for:

In [None]:
from typing import Any

def fitness_fn(
    candidate: dict[str, str],  # The parameters being optimized
    example: Any | None = None  # A single data instance (None for single-instance mode)
) -> tuple[float, Any, dict]:
    """
    Returns:
        score: Higher is better
        output: The actual output produced (for tracking)
        side_info: Diagnostic information for LLM reflection
    """
    # Run your system with the candidate parameters
    output = run_my_system(candidate, example)
    
    # Compute a score (higher is better)
    score = compute_score(output, example)
    
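    # Collect diagnostic info for LLM reflection
    # (in single-instance mode `example` is None, so guard the lookups)
    side_info = {
        "Input": example["input"] if example is not None else None,
        "Output": output,
        "Expected": example["expected"] if example is not None else None,
        "Error": get_error_message(output),
    }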
    
    return score, output, side_info

---

<a id="section-3"></a>
# 3. The Secret Weapon: Auxiliary Side Information (ASI)

The `side_info` dictionary is GEPA's secret weapon—we call it **ASI** (**A**uxiliary **S**ide **I**nformation). 

> *While the AI community debates when we'll achieve ASI (Artificial Superintelligence), you can achieve **your** ASI today—just return rich diagnostic information from your fitness function.*

## Why ASI Matters

Traditional optimizers only see a **scalar score**:

```
Candidate A → Score: 0.73  (Why did it fail? No idea.)
Candidate B → Score: 0.85  (What made it better? Unknown.)
```

GEPA's LLM-based reflection can understand **why** a candidate performed the way it did:

```
Candidate A → Score: 0.73
  side_info: {
    "Error": "Circle 3 and Circle 7 overlap by 0.02 units",
    "Boundary violations": ["Circle 12 extends past x=1.0"],
    "Best score achieved": 2.847
  }
```

Now the LLM knows *exactly* what to fix.
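As a rough illustration of how such diagnostics could be produced, the sketch below checks a circle packing in the unit square and turns each violation into a readable message. The function name, the `(x, y, r)` tuple layout, and the choice to score invalid packings as 0.0 are all assumptions made for this example.

```python
import math

def circle_diagnostics(circles, tol=1e-9):
    """circles: list of (x, y, r) tuples, expected to fit inside the unit square."""
    overlaps, boundary = [], []
    for i, (x, y, r) in enumerate(circles):
        # Boundary check: the circle must lie fully inside [0, 1] x [0, 1]
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            boundary.append(f"Circle {i} extends outside the unit square")
        # Pairwise overlap check against every later circle
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            gap = math.hypot(x - xj, y - yj) - (r + rj)
            if gap < -tol:
                overlaps.append(f"Circle {i} and Circle {j} overlap by {-gap:.3f} units")
    total = sum(r for _, _, r in circles)
    valid = not overlaps and not boundary
    side_info = {
        "Overlaps": overlaps,
        "Boundary violations": boundary,
        "Sum of radii": total,
    }
    return (total if valid else 0.0), side_info

score, side_info = circle_diagnostics([(0.25, 0.25, 0.25), (0.6, 0.25, 0.25), (0.9, 0.9, 0.2)])
print(score, side_info)
```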

## What to Include in ASI

| Information Type | Example | Purpose |
|-----------------|---------|----------|
| **Error messages** | `"SyntaxError: invalid syntax on line 42"` | Helps LLM fix code bugs |
| **Execution traces** | `"Called API 3x, timeout on 3rd call"` | Helps LLM understand behavior |
| **Partial results** | `"3/5 test cases passed"` | Helps LLM identify failure patterns |
| **Expected vs Actual** | `"Expected: [1,2,3], Got: [1,2,4]"` | Helps LLM understand what went wrong |
| **Domain feedback** | `"Circles overlap at positions (0.5, 0.3)"` | Helps LLM make domain-aware improvements |
| **Reasoning traces** | `"Model's chain-of-thought: ..."` | Helps LLM understand failure modes |

## The ASI Design Principle

**Be generous with information.** Include anything that would help a human expert understand why the candidate succeeded or failed. The LLM will use this to make targeted, intelligent improvements rather than random mutations.

```python
# Good ASI
side_info = {
    "Input": problem_description,
    "Output": model_output,
    "Expected": correct_answer,
    "Reasoning": model_reasoning_trace,
    "Error": "Division by zero on line 15",
    "Partial scores": {"accuracy": 0.8, "efficiency": 0.3},
}

# Bad ASI (not enough information)
side_info = {"score": 0.73}  # LLM can't help with just this!
```
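For code candidates, much of this information can be gathered mechanically. Below is a minimal sketch that runs a snippet, captures what it printed, and records the traceback if it raised. The helper name `run_and_collect` is hypothetical, and plain `exec` is used only for brevity: it is not a sandbox and should not be run on untrusted code.

```python
import contextlib
import io
import traceback

def run_and_collect(code: str) -> dict:
    """Execute code and return its stdout plus any traceback as side information."""
    stdout = io.StringIO()
    error = ""
    namespace = {}
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # illustration only: no isolation, no timeout
    except Exception:
        error = traceback.format_exc()
    return {"Prints": stdout.getvalue(), "Error": error}

info = run_and_collect("print('starting')\nresult = 1 / 0\n")
print(info["Error"])
```

The traceback tells the reflecting LLM which line failed and why, which a bare score of zero would not.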

---

<a id="section-4"></a>
# 4. Example 1: Mathematical Optimization — Beating Optuna

**Result: GEPA outperforms Optuna on the EvalSet benchmark.**

This example demonstrates how `optimize_anything` can evolve **code** that implements optimization algorithms—essentially using LLMs to discover optimization strategies automatically.

## The Challenge

Optuna is the industry standard for black-box optimization. But using Optuna effectively requires:
- Choosing sampling algorithms (TPE, CMA-ES, Random, etc.)
- Defining search spaces manually
- Tuning algorithm-specific hyperparameters
- Deep knowledge of optimization theory

**What if we could just write code that finds minima, and let GEPA evolve the strategy?**

## The Task

- **Given**: A black-box function `objective_function(x) → float` with bounds
- **Find**: Python code that discovers the minimum

**What GEPA optimizes**: The Python code itself—algorithm choice, implementation, hyperparameters, heuristics.

<img src="./assets/blog/mathematical_optimization.png" width="60%">

*GEPA starts below Optuna but progressively discovers better strategies, eventually surpassing it.*

## Setting Up the Problem

We use the [EvalSet benchmark](https://github.com/sigopt/evalset)—a collection of challenging optimization test functions (Ackley, Rosenbrock, Rastrigin, etc.).
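For readers unfamiliar with these test functions, here is the textbook form of Ackley in NumPy. This is a sketch of the standard definition; the EvalSet implementation may shift, scale, or otherwise modify it, so treat the exact values as an assumption.

```python
import numpy as np

def ackley(x) -> float:
    """Standard Ackley function: many local minima, global minimum of 0 at the origin."""
    x = np.asarray(x, dtype=float)
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.mean(x ** 2)))
    term2 = -np.exp(np.mean(np.cos(2.0 * np.pi * x)))
    return float(term1 + term2 + 20.0 + np.e)

print(ackley(np.zeros(11)))  # approximately 0 (up to floating-point error)
print(ackley(np.ones(11)))   # strictly positive
```

The cosine term creates a dense grid of local minima, which is why naive random sampling tends to get stuck far from the optimum.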

In [3]:

# # Each problem is a black-box function with bounds and dimension
# dataset = [{
#     "problem_description": """Blackbox optimization problem.
#     Minimize a function that takes a numpy array of shape (11,) and returns a scalar.
#     Bounds: [(-10, 30)] * 11
#     The function is unknown - you can only call objective_function(x) to evaluate.
#     """,
#     "dim": 11,
#     "bounds": [(-10, 30)] * 11,
# }]

from examples.polynomial.evalset.problems import problems, problem_configs

problem_index = 0
problems[problem_index], problem_configs[problem_index]

(Ackley(11), {'name': 'Ackley', 'dim': 11, 'int': None, 'res': None})

## The Seed Candidate

We start with a trivial baseline—random sampling:

In [8]:
seed_candidate = {
    "code": """import numpy as np

def solve(objective_function, config, prev_best_x=None):
    bounds = np.array(config['bounds'])
    x = np.random.uniform(bounds[:, 0], bounds[:, 1])
    y = objective_function(x)
    return x
"""
}

## The Fitness Function

The fitness function executes the code in a sandbox and captures rich diagnostic information:

In [6]:
from typing import Any
import numpy as np
import json
from pathlib import Path

from gepa.optimize_anything import SideInfo
from gepa.utils.code_execution import execute_code as _execute_code, ExecutionMode


class FitnessEvaluator:
    """Fitness evaluator for GEPA blackbox optimization."""

    def __init__(
        self,
        problem_index: int,
        timeout: int = 300,
        evaluation_budget: int = 100,
        log_dir: str | None = None,
        seed: int = 0,
    ):
        self.problem_index = problem_index
        self.timeout = timeout
        self.evaluation_budget = evaluation_budget
        self.log_dir = Path(log_dir) if log_dir else None
        self.seed = seed

        # State tracking for warm-start (minimization: lower is better)
        self.evaluation_history = []
        self.best_score = float("inf")
        self.best_x = None

    def evaluate(self, candidate: dict[str, str], **kwargs) -> tuple[float, Any, SideInfo]:
        """Evaluate code candidate on a single problem."""
        code = candidate["code"]
        function = problems[self.problem_index]
        problem_config = problem_configs[self.problem_index]

        # Track state for this candidate
        eval_count = 0
        best_candidate_score = float("inf")
        errors = []

        def objective_function(x):
            nonlocal eval_count, best_candidate_score
            if eval_count >= self.evaluation_budget:
                raise ValueError(f"Evaluation budget exceeded: {eval_count} >= {self.evaluation_budget}")
            eval_count += 1

            score = function.do_evaluate(np.array(x))

            if score < best_candidate_score:
                best_candidate_score = score
            if score < self.best_score:
                self.best_score = score
                self.best_x = np.array(x).copy()

            self.evaluation_history.append({
                "score": score,
                "best_score": self.best_score,
            })
            return score

        # Execute code
        result = _execute_code(
            code=code,
            timeout=self.timeout,
            mode=ExecutionMode.IN_PROCESS,
            entry_point="solve",
            entry_point_kwargs={
                "objective_function": objective_function,
                "config": {"bounds": function.bounds, "dim": function.dim, "budget": self.evaluation_budget},
                "prev_best_x": self.best_x,
            },
            seed=self.seed,
        )

        x = result.variables.get("__return__")
        stdout = self._truncate(result.stdout)
        stderr = self._truncate(result.stderr)

        if result.error:
            errors.append(result.error)
        if result.traceback and result.traceback not in (result.error or ""):
            errors.append(result.traceback)
        if x is None or not isinstance(x, np.ndarray):
            errors.append("Code did not return a valid numpy array.")
        if eval_count == 0:
            errors.append("No objective_function calls were made.")

        # Best score found by this candidate (stays inf if no evaluation succeeded)
        score = best_candidate_score
        print(f"Best score from {eval_count} calls: {score}")

        side_info = {
            "score": score,
            "Input": problem_config["name"],
            "Prints": stdout,
            "Logs": stderr,
            "Error": "\n".join(errors) if errors else "",
        }

        output = {
            **side_info,
            "code": code,
            "X": " ".join(map(str, x.ravel())) if isinstance(x, np.ndarray) else "not found",
        }

        self.save()
        gepa_score = -score if score < float("inf") else -1e9
        return (gepa_score, output, side_info)

    def save(self, verbose: bool = False):
        """Save evaluation history to JSON."""
        if not self.log_dir:
            return
        self.log_dir.mkdir(parents=True, exist_ok=True)
        filename = self.log_dir / "evaluation_history.json"
        try:
            with open(filename, "w") as f:
                json.dump(self.evaluation_history, f, indent=2, default=lambda o: o.tolist() if isinstance(o, np.ndarray) else o)
            if verbose:
                print(f"Saved to {filename}")
        except Exception as e:
            print(f"Warning: Failed to save: {e}")
            
    def _truncate(self, text: str, limit: int = 4000) -> str:
        """Truncate text to avoid token limits."""
        if len(text) <= limit:
            return text
        half = limit // 2
        return text[:half] + "\n...[truncated]...\n" + text[-half:]


total_evaluation_budgets = 8000
num_proposals = 15
evaluation_budget_per_proposal = total_evaluation_budgets // num_proposals
seconds_per_trial = 2
timeout_per_candidate = evaluation_budget_per_proposal * seconds_per_trial

# Create evaluator
evaluator = FitnessEvaluator(
    problem_index=problem_index,
    timeout=timeout_per_candidate,
    evaluation_budget=evaluation_budget_per_proposal,
    seed=0
)
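The budget arithmetic above works out as follows. This is just the same calculation spelled out, with the local variable names chosen for the example.

```python
total_budget = 8000
num_proposals = 15
seconds_per_trial = 2

per_proposal = total_budget // num_proposals          # floor division
unused = total_budget - per_proposal * num_proposals  # evaluations lost to rounding
timeout = per_proposal * seconds_per_trial            # seconds allowed per candidate

print(per_proposal, unused, timeout)  # 533 5 1066
```

So each proposed program may call `objective_function` at most 533 times and gets roughly 18 minutes of wall-clock time before it is cut off.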

## Running GEPA Optimization

In [None]:
from examples.polynomial.prompt import BACKGROUND
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

gepa_config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=num_proposals,
        track_best_outputs=True,
        cache_evaluation=True,
    ),
    reflection=ReflectionConfig(
        reflection_lm="openai/gpt-5",
        reflection_minibatch_size=1,
    )
)

result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=evaluator.evaluate,
    config=gepa_config,
    objective="Evolve code that minimizes a blackbox objective function.",
    background=BACKGROUND,
)

print("Optimized code:")
print(result.best_candidate["code"])

Best score from 1 calls: 21.109047957197003
Iteration 0: Base program full valset score: -21.109047957197003 over 1 / 1 examples
Iteration 1: Selected program 0 score: -21.109047957197003
Best score from 1 calls: 21.109047957197003



## What GEPA Discovered

GEPA evolved the trivial random sampler into a sophisticated optimization strategy. Here's a snippet from the evolved code:

In [None]:
# Evolved by GEPA - combines multiple strategies:

evolved_code = """
import numpy as np

def solve(dim, total_evaluation_budgets, bounds):
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    span = ub - lb
    mid = (lb + ub) / 2.0
    
    # 1. Smart seeding: Halton sequence + LHS for diversity
    def halton(n, d):
        # Van der Corput sequence implementation
        ...
    
    # 2. Zero-vector warm-start (often near polynomial optima)
    zero_vec = np.zeros(dim)
    if np.all(zero_vec >= lb) and np.all(zero_vec <= ub):
        seeds.append(zero_vec)
    
    # 3. CMA-ES inspired evolution strategy
    sigma = 0.20  # Adaptive step size
    ...
    
    # 4. Local refinement with coordinate descent
    def local_refine(max_evals):
        ...
    
    return best_x
"""

# GEPA discovered: Halton sequences, zero-vector seeding,
# CMA-ES-style evolution, and local refinement—all without being told!
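For context on the first strategy, here is a textbook Halton sequence built from the van der Corput construction. This is a generic reference sketch, not the code GEPA evolved (which is elided above); the function names and the fixed prime list are assumptions for the example.

```python
import numpy as np

PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31)  # one base per dimension, up to 11 dims

def van_der_corput(index: int, base: int) -> float:
    """Radical inverse of `index` in `base`: mirror its digits around the radix point."""
    result, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        result += digit / denom
    return result

def halton(n_points: int, bounds) -> np.ndarray:
    """Return n_points low-discrepancy samples scaled into the given bounds."""
    if len(bounds) > len(PRIMES):
        raise ValueError("Not enough prime bases for this many dimensions")
    lb = np.array([b[0] for b in bounds], dtype=float)
    ub = np.array([b[1] for b in bounds], dtype=float)
    unit = np.array([
        [van_der_corput(i, PRIMES[d]) for d in range(len(bounds))]
        for i in range(1, n_points + 1)  # start at 1 to skip the all-zeros point
    ])
    return lb + unit * (ub - lb)

points = halton(4, [(-10, 30)] * 2)
print(points)
```

Compared with uniform random sampling, these points cover the search box more evenly, which matters when each candidate gets only a few hundred evaluations.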

### Key Takeaway

**GEPA vs. Traditional Optimization**:

| Aspect | Optuna | GEPA |
|--------|--------|------|
| Algorithm selection | Manual (TPE, CMA-ES, etc.) | Automatic (evolved) |
| Hyperparameter tuning | Required | Evolved |
| Domain knowledge needed | High | Low |
| What user provides | Search space + sampler config | Baseline code + fitness function |
| What gets optimized | Parameter values | The optimization algorithm itself |

While Optuna requires users to select algorithms and tune hyperparameters, GEPA automatically **discovers optimization strategies** by evolving code. The user just provides the problem and a baseline—GEPA evolves Halton sequences, surrogate models, local refinement, and more.

---

<a id="section-5"></a>
# 5. Example 2: Prompt Engineering — AIME 2025

**Result: GEPA improves GPT-4.1 Mini's accuracy from 46.67% to 53.33% on AIME 2025.**

This example demonstrates how `optimize_anything` can evolve **prompts**—the natural language instructions that guide LLM behavior.

## The Challenge

Prompt engineering is often done through **trial and error**:
1. Write a prompt
2. Test on a few examples
3. Manually tweak based on intuition
4. Repeat until it "feels right"

This is slow, doesn't scale, and doesn't guarantee you've found the best prompt.

## The Task

- **Given**: A dataset of AIME math competition problems
- **Find**: A system prompt that maximizes GPT-4.1 Mini's accuracy

**What GEPA optimizes**: The instruction prompt—what guidance to give the model.

<img src="./assets/blog/aime_best_progress.png" width="80%">

## Setting Up the Problem

In [11]:
import dspy
import os

# Configure the language model
api_key = os.environ.get("OPENAI_API_KEY")
lm = dspy.LM("gpt-4.1-mini", api_key=api_key, temperature=1.0, max_tokens=32000)
dspy.configure(lm=lm)

# Load AIME dataset splits
from examples.math.dataset import load_math_dataset
trainset, valset, testset = load_math_dataset()

print(f"Training: {len(trainset)} problems")
print(f"Validation: {len(valset)} problems")
print(f"Test: {len(testset)} problems")



Loaded 45 training examples
Loaded 45 validation examples
Loaded 30 test examples
Training: 45 problems
Validation: 45 problems
Test: 30 problems


## The DSPy Module

We use DSPy's `ChainOfThought` for step-by-step reasoning:

In [14]:
class MathSolverSignature(dspy.Signature):
    """Solve a math competition problem."""
    input = dspy.InputField(desc="The math problem to solve.")
    answer = dspy.OutputField(desc="The final numerical answer.")

predictor = dspy.ChainOfThought(MathSolverSignature)

def run_llm(example, prompt: str):
    """Run the LLM on a single example with the given prompt."""
    predictor.predict.signature.instructions = prompt
    return predictor(input=example.input)

## The Seed Candidate

In [15]:
seed_candidate = {
    "prompt": "Solve the math problem carefully. Break down the steps and provide the final answer as a single number."
}

## The Fitness Function

The fitness function runs the predictor and collects detailed feedback:

In [16]:
def math_metric(example, prediction):
    """Compute score and detailed feedback for math problems."""
    correct_answer, written_solution = int(example.answer), getattr(example, "solution", "")
    solution_suffix = f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems" if written_solution else ""

    try:
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        feedback_text = f"The final answer must be a valid integer and nothing else. You responded with '{prediction.answer}', which couldn't be parsed as a python integer. Please ensure your answer is a valid integer without any additional text or formatting. The correct answer is '{correct_answer}'.{solution_suffix}{' and ensure your final answer is a valid integer.' if written_solution else ''}"
        return dspy.Prediction(score=0.0, feedback=feedback_text)

    score = float(correct_answer == llm_answer)
    status = "correct" if score == 1.0 else "incorrect"
    feedback_text = f"Your answer is {status}. The correct answer is '{correct_answer}'.{solution_suffix}"
    return dspy.Prediction(score=score, feedback=feedback_text)


def fitness_fn(candidate: dict[str, str], example) -> tuple[float, Any, SideInfo]:
    """Fitness function for GEPA optimization with single example evaluation."""
    prediction = run_llm(example, candidate["prompt"])
    metric_result = math_metric(example, prediction)
    score = metric_result.score
    feedback = metric_result.feedback

    output = {
        "prompt": candidate["prompt"],
        "answer": prediction.answer,
        "score": score,
    }

    side_info = {
        "Input": example.input,
        "Output": prediction.answer,
        "Reasoning": getattr(prediction, "reasoning", ""),
        "ExecutionFeedback": feedback,
    }

    return (score, output, side_info)
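The scoring logic can be tried without DSPy or an API key. The sketch below is a simplified stand-in for `math_metric`: the function name `score_answer` is hypothetical, and the feedback strings are shortened versions of the ones above.

```python
def score_answer(raw_answer, correct_answer: int) -> tuple[float, str]:
    """Return (score, feedback) for one answer; unparseable answers score 0."""
    try:
        parsed = int(str(raw_answer).strip())
    except ValueError:
        return 0.0, (
            f"Could not parse '{raw_answer}' as an integer. "
            f"The correct answer is '{correct_answer}'."
        )
    status = "correct" if parsed == correct_answer else "incorrect"
    return float(parsed == correct_answer), (
        f"Your answer is {status}. The correct answer is '{correct_answer}'."
    )

print(score_answer("204", 204))
print(score_answer("x = 204", 204))
print(score_answer("17", 204))
```

Note that a well-reasoned response with a badly formatted final answer still scores zero, and the feedback says so explicitly. That distinction is what lets the reflection step fix formatting instructions separately from reasoning errors.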

## Running GEPA Optimization

Note: We use `valset` for generalization testing—GEPA optimizes on `trainset` but tracks performance on held-out `valset`.

In [None]:
from gepa.optimize_anything import (
    optimize_anything,
    GEPAConfig,
    EngineConfig,
    ReflectionConfig,
)

result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    dataset=trainset,   # Optimize on training set
    valset=valset,      # Track generalization on validation set
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=800,
            track_best_outputs=True,
            parallel=True,      
            max_workers=32,
            cache_evaluation=True,
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-5",
            reflection_minibatch_size=3,  # Show 3 problems per reflection
        ),
    ),
)

print("\nOptimized prompt:")
print(result.best_candidate["prompt"])

Iteration 0: Base program full valset score: 0.4666666666666667 over 45 / 45 examples
Iteration 1: Selected program 0 score: 0.4666666666666667


## The Optimized Prompt

GEPA discovered a detailed, structured prompt with domain-specific strategies:

In [None]:
# This prompt was EVOLVED by GEPA, not written by a human!
# Starting from a simple "Solve carefully and provide the answer" prompt,
# GEPA discovered domain-specific strategies through reflection.

optimized_prompt = """
Solve from first principles with explicit checks. Requirements:

1) Model precisely:
- Define all objects, variables, and constraints algebraically/combinatorially.
- Choose one counting model (labeled vs indistinguishable) and stay consistent.
  For combinatorics, either label and divide at the end OR keep indistinguishable
  throughout—do not mix.
- For number-theory/decimal/ratio problems, state factorizations and gcd/lcm 
  relations explicitly.

2) Mapping/Counting rigor:
- When mapping elements between sets (e.g., m ↦ m/gcd(m,N)), prove 
  injectivity/surjectivity or handle overlaps via inclusion–exclusion.
- When computing probability, ensure numerator and denominator are counts 
  from the same sample space.
- Keep all computations exact (fractions/radicals/modular arithmetic); 
  avoid decimals unless terminating.

3) Geometry workflow:
- Draw and name a diagram (mentally or on paper). List candidate theorems: 
  power of a point, radical axis, homothety, similar triangles, cyclicity.
- Identify perpendiculars to tangents through centers; use midpoint/radical-axis 
  facts for intersecting circles and common tangents.
- Prefer exact relations (e.g., MP·MQ = (tangent length)^2) over coordinate guesses.

4) Sanity checks and diagnostics:
- If an assumption yields a contradiction (e.g., negative squared length), 
  discard and rebuild the setup.
- For combinatorics/NT counts, validate with a smaller analog (e.g., replace 
  9999 by 9 or 99) to detect double-counting before scaling up.
- For expressions of the form m√n, reduce n to be squarefree.
- Perform at least one independent cross-check (alternative derivation, 
  structural identity, modular check, or small-n analog).

5) Output:
- Extract exactly what is asked (e.g., remainder, perimeter, m+n). 
- Provide the final answer only as a single number with no extra text.
"""

# 🎯 What GEPA discovered:
# - Domain-specific heuristics for different math areas (geometry, combinatorics, NT)
# - Structured problem-solving workflow
# - Sanity checks and validation strategies
# - Explicit handling of common failure modes (mixing counting models, etc.)
#
# A human prompt engineer might take hours to discover these strategies!

### Key Takeaway

By including the model's **reasoning trace** in `side_info`, GEPA can understand *how* the model approaches problems—not just whether it got the answer right. This enables:

1. **Targeted improvements**: Fix specific reasoning errors, not random prompt tweaks
2. **Domain-specific strategies**: The prompt evolved to include geometry workflows, combinatorics rules, etc.
3. **Sanity checks**: GEPA discovered that asking for validation prevents common errors

The evolved prompt contains strategies that a human prompt engineer might take hours to discover through manual iteration.
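
As a minimal sketch of the pattern described above, the fitness function below attaches the model's reasoning trace to `side_info` alongside the score. The `run_model` interface, the `"prompt"` key, and the two-element return value are illustrative assumptions, not the exact API used in this post; the stub model exists only so the snippet runs offline.

```python
from typing import Callable

# Hypothetical model interface: takes (prompt, question) and returns a dict
# with "reasoning" and "answer" keys. Swap in a real LLM call as needed.
ModelFn = Callable[[str, str], dict]


def make_fitness_fn(run_model: ModelFn):
    """Build a fitness function that reports the reasoning trace, not just the score."""

    def fitness_fn(candidate: dict, example: dict):
        result = run_model(candidate["prompt"], example["question"])
        correct = result["answer"].strip() == example["answer"].strip()
        side_info = {
            "question": example["question"],
            "reasoning": result["reasoning"],  # how the model reached its answer
            "model_answer": result["answer"],
            "expected_answer": example["answer"],
        }
        return (1.0 if correct else 0.0), side_info

    return fitness_fn


def stub_model(prompt: str, question: str) -> dict:
    """Offline stand-in for an LLM that always makes the same counting mistake."""
    return {
        "reasoning": "Counted ordered pairs instead of unordered ones.",
        "answer": "12",
    }


fitness_fn = make_fitness_fn(stub_model)
score, side_info = fitness_fn(
    {"prompt": "Solve the problem."},
    {"question": "How many unordered pairs can be chosen from 4 items?", "answer": "6"},
)
print(score, side_info["reasoning"])
```

With only the score, a reflection step would see `0.0`; with the trace, it can also see that the failure came from mixing up ordered and unordered counting.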

---

<a id="section-6"></a>
# 6. Example 3: Agent Program Evolution — ARC-AGI

**Result: GEPA improves GPT-5's performance from 55.6% to 60.5% on ARC-AGI.**

This is the most ambitious example: optimizing not just prompts, but **entire agent architectures**—the DSPy program that defines how an LLM reasons about problems.

## The Challenge

[ARC-AGI](https://arcprize.org/) (Abstraction and Reasoning Corpus) is a benchmark designed to test general intelligence:
- Each task shows input-output grid transformation examples
- The agent must infer the transformation rule and apply it to test inputs
- Tasks require **visual reasoning, pattern recognition, and abstraction**

Hand-designing agent architectures for such tasks is extremely difficult.
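
To make the task format concrete, here is a toy example in the ARC style. The grids and the two candidate rules are made up for illustration and are not taken from the benchmark; the point is only that a rule counts as a valid explanation when it reproduces every training output exactly.

```python
from typing import Callable, List

Grid = List[List[int]]

# A toy task (invented for illustration): every output is the input
# mirrored left-to-right.
train_pairs = [
    {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
    {"input": [[3, 4], [0, 5]], "output": [[4, 3], [5, 0]]},
]
test_input: Grid = [[7, 0, 8]]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def explains_all(rule: Callable[[Grid], Grid], pairs) -> bool:
    """A rule is plausible only if it reproduces every training output exactly."""
    return all(rule(p["input"]) == p["output"] for p in pairs)


candidate_rules = {"flip_horizontal": flip_horizontal, "transpose": transpose}
consistent = [
    name for name, rule in candidate_rules.items() if explains_all(rule, train_pairs)
]
test_output = flip_horizontal(test_input)

print(consistent)   # ['flip_horizontal']
print(test_output)  # [[8, 0, 7]]
```

Real ARC-AGI tasks are far less tidy: the space of possible rules is open-ended, which is why the agent has to infer the rule rather than pick it from a fixed list.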

## The Task

**Given**: A dataset of ARC-AGI grid transformation tasks
**Find**: A DSPy program (agent architecture) that maximizes accuracy

**What GEPA optimizes**: The entire DSPy program—signatures, modules, control flow, and prompting strategies.

<img src="./assets/blog/arc_agi_best_comparison.png" width="60%">

In [7]:
from examples.arc_agi.data import load_data

trainset, valset, testset = load_data()

Train set: 200
Val set: 200
Test set: 400


## The Seed Candidate

We start with a minimal Chain-of-Thought agent:

In [1]:
seed_candidate = {
    "program": """
import dspy
from typing import List
import pydantic

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    training_examples: List[TrainingExample] = dspy.InputField(description="Input and output examples demonstrating the task to be performed.")
    test_inputs: List[MATRIX] = dspy.InputField(description="Input matrices to be solved following the task described in the training examples.")
    test_outputs: List[MATRIX] = dspy.OutputField(description="Output matrices corresponding to the test inputs.")

program = dspy.ChainOfThought(SolveTaskSignature)
"""
}
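
Because the candidate is Python source stored as a string, evaluating it means turning that string into a runnable object. How `DspyAdapter` does this internally is not shown in this post; the sketch below is one plausible approach under that caveat. `load_program` and its `entry_point` parameter are hypothetical names, and the toy sources avoid DSPy so the snippet runs without any dependencies.

```python
import traceback


def load_program(source: str, entry_point: str = "program"):
    """Execute candidate source and return (object, error); exactly one is None."""
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
    except Exception:
        # Keep only the last traceback line, e.g. "SyntaxError: ...", as a
        # compact diagnostic that can be passed along as side information.
        return None, traceback.format_exc().strip().splitlines()[-1]
    if entry_point not in namespace:
        return None, f"Candidate source does not define `{entry_point}`."
    return namespace[entry_point], None


good_source = "def program(grid):\n    return [row[::-1] for row in grid]\n"
bad_source = "def program(grid)\n    return grid\n"  # missing colon

good_program, good_error = load_program(good_source)
bad_program, bad_error = load_program(bad_source)

print(good_program([[1, 2, 3]]), good_error)
print(bad_program, bad_error)
```

Note that `exec` runs arbitrary code, so LLM-proposed programs should be executed in a sandboxed or otherwise isolated environment.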

## The Fitness Function

The fitness function compiles and executes the DSPy program, capturing detailed error information:

In [6]:
import os
import random

import dspy

from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter
from examples.arc_agi.main import metric_fn

seed = 0
api_key = os.environ.get("OPENAI_API_KEY")

task_lm = dspy.LM(
    model="openai/gpt-4.1-mini",
    temperature=1.0,
    max_tokens=32000,
    api_key=api_key,
    seed=seed,
    cache=False,
)

adapter = DspyAdapter(
    task_lm=task_lm,
    metric_fn=metric_fn,
    num_threads=64,
    reflection_lm="openai/gpt-5",
    rng=random.Random(seed),
)

def fitness_fn(candidate, example):
    """Score one candidate program on one example and collect diagnostics."""
    program = candidate["program"]

    # Run the candidate; if evaluation itself raises, report the exception text.
    try:
        evaluation_results = adapter.evaluate(
            [example], candidate, capture_traces=True
        )
    except Exception as e:
        side_info = {"input": example, "error": str(e), "program": program}
        return (0.0, side_info, side_info)

    # Program error: the adapter returned no usable trajectories.
    if (
        not isinstance(evaluation_results.trajectories, list)
        or len(evaluation_results.trajectories) == 0
    ):
        print("Error: ")
        print(evaluation_results.trajectories)
        side_info = {
            "input": example,
            "error": f"All examples failed. Program error: {evaluation_results.trajectories}",
            "program": program,
        }
        return (0.0, side_info, side_info)

    # No program errors: one example was evaluated, so read the first trajectory.
    trajectory = evaluation_results.trajectories[0]
    metric_result = trajectory.get("score")
    score = metric_result.get("score")
    feedback = metric_result.get("feedback")
    prediction = trajectory.get("prediction")

    # Expose the reasoning trace and metric feedback to the reflection step.
    side_info = {
        "input": example,
        "reasoning": prediction.get("reasoning"),
        "feedback": feedback,
        "output": prediction.get("test_outputs"),
    }

    return (score, side_info, side_info)
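
The fitness function reads a `score` and a `feedback` field from the metric result, but the imported `metric_fn` itself is not shown here. As a hypothetical stand-in (not the actual implementation in `examples.arc_agi.main`), a grid metric with that shape might score the fraction of test grids reproduced exactly and describe each mismatch in words:

```python
from typing import List

Grid = List[List[int]]


def grid_metric(expected: List[Grid], predicted: List[Grid]) -> dict:
    """Score = fraction of test grids reproduced exactly, plus textual feedback."""
    notes = []
    n_correct = 0
    for i, want in enumerate(expected):
        got = predicted[i] if i < len(predicted) else None
        if got == want:
            n_correct += 1
        elif got is None:
            notes.append(f"Test {i}: no output produced.")
        elif len(got) != len(want) or any(len(g) != len(w) for g, w in zip(got, want)):
            notes.append(f"Test {i}: wrong shape, expected {len(want)} row(s).")
        else:
            # Same shape, so list every cell that differs from the target.
            wrong = [
                (r, c)
                for r, row in enumerate(want)
                for c, value in enumerate(row)
                if got[r][c] != value
            ]
            notes.append(f"Test {i}: {len(wrong)} wrong cell(s), first at {wrong[0]}.")
    score = n_correct / len(expected) if expected else 0.0
    return {"score": score, "feedback": " ".join(notes) or "All outputs correct."}


result = grid_metric(
    expected=[[[1, 2], [3, 4]], [[5]]],
    predicted=[[[1, 2], [3, 0]], [[5]]],
)
print(result)
```

Feedback of this kind tells the reflection model whether a candidate failed on shape, on a handful of cells, or by producing nothing at all, which a bare 0/1 score cannot.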

## Running GEPA Optimization

In [None]:
from examples.arc_agi.prompt import BACKGROUND
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)

result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    dataset=trainset,
    valset=valset,
    objective="Evolve a DSPy program that solves ARC-AGI grid transformation tasks.",
    background=BACKGROUND,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=4000,
            track_best_outputs=True,
            use_cloudpickle=True,
            parallel=True,
            max_workers=64,
            cache_evaluation=True,
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-5",
            reflection_minibatch_size=3,
        ),
    ),
)




Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:20<00:00, 20.44s/it]

2026/01/22 11:49:42 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:06<00:00,  6.64s/it]

2026/01/22 11:49:42 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:13<00:00, 13.86s/it]

2026/01/22 11:49:43 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:25<00:00, 25.66s/it]

2026/01/22 11:49:43 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:13<00:00, 13.66s/it]

2026/01/22 11:49:43 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:14<00:00, 14.52s/it]

2026/01/22 11:49:43 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:20<00:00, 20.46s/it]

2026/01/22 11:49:43 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:16<00:00, 16.38s/it]

2026/01/22 11:49:44 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:06<00:00,  6.01s/it]

2026/01/22 11:49:44 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:12<00:00, 12.58s/it]

2026/01/22 11:49:44 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:10<00:00, 10.20s/it]

2026/01/22 11:49:44 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:12<00:00, 12.77s/it]

2026/01/22 11:49:45 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:10<00:00, 10.34s/it]

2026/01/22 11:49:45 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:23<00:00, 23.04s/it]

2026/01/22 11:49:45 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:23<00:00, 23.02s/it]

2026/01/22 11:49:45 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:20<00:00, 20.25s/it]

2026/01/22 11:49:45 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:43<00:00, 43.82s/it]

2026/01/22 11:49:46 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:10<00:00, 10.96s/it]

2026/01/22 11:49:46 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 1.00 / 1 (100.0%): 100%|██████████| 1/1 [00:08<00:00,  8.50s/it]

2026/01/22 11:49:47 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 1 (100.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:14<00:00, 14.74s/it]

2026/01/22 11:49:47 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:19<00:00, 19.91s/it]

2026/01/22 11:49:47 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:16<00:00, 16.93s/it]

2026/01/22 11:49:47 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:10<00:00, 10.53s/it]

2026/01/22 11:49:47 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:19<00:00, 19.90s/it]

2026/01/22 11:49:47 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:15<00:00, 15.51s/it]

2026/01/22 11:49:48 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)




[A
Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:27<00:00, 27.91s/it]

2026/01/22 11:49:48 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:14<00:00, 14.79s/it]

2026/01/22 11:49:48 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:17<00:00, 17.19s/it]

2026/01/22 11:49:48 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:15<00:00, 15.40s/it]

2026/01/22 11:49:49 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:16<00:00, 16.05s/it]

2026/01/22 11:49:49 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:47<00:00, 47.47s/it]

2026/01/22 11:49:51 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:13<00:00, 13.79s/it]

2026/01/22 11:49:51 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:49<00:00, 49.44s/it]

2026/01/22 11:49:53 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:21<00:00, 21.50s/it]

2026/01/22 11:49:58 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:29<00:00, 29.20s/it]

2026/01/22 11:49:58 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:24<00:00, 24.04s/it]

2026/01/22 11:50:00 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:36<00:00, 36.43s/it]

2026/01/22 11:50:00 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 1.00 / 1 (100.0%): 100%|██████████| 1/1 [00:26<00:00, 26.76s/it]

2026/01/22 11:50:02 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 1 (100.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:40<00:00, 40.39s/it]

2026/01/22 11:50:07 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:34<00:00, 34.77s/it]

2026/01/22 11:50:09 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)



Average Metric: 0.00 / 1 (0.0%): 100%|██████████| 1/1 [00:56<00:00, 56.11s/it]

2026/01/22 11:50:09 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 1 (0.0%)





## What GEPA Discovered

GEPA evolved the simple ChainOfThought into a sophisticated 5-step code synthesis pipeline:

In [None]:
# Evolved by GEPA - A code synthesis agent with self-refinement
# This program was DISCOVERED through optimization, not hand-written!

evolved_program = """
import dspy
from typing import List, Optional, Callable
import copy
import re

class SynthesizeTransform(dspy.Signature):
    '''
    Write valid Python code that defines: def transform(grid) -> grid
    
    Requirements:
    - Use only built-in Python (lists/loops/dicts/sets); no imports.
    - Must be general to similar-sized grids; do NOT hardcode indices.
    - Preserve separator rows/columns if present.
    '''
    training_examples: List[TrainingExample] = dspy.InputField()
    hint: str = dspy.InputField(desc="Feedback on prior failures.")
    code: str = dspy.OutputField(desc="Python code defining transform().")

class CodeSynthesisSolver(dspy.Module):
    def __init__(self, attempts=5):
        self.codegen = dspy.ChainOfThought(SynthesizeTransform)
        self.attempts = attempts
    
    def _verify_on_training(self, fn, training_examples):
        '''Validate transform matches ALL training outputs exactly.'''
        for idx, ex in enumerate(training_examples):
            pred = fn(copy.deepcopy(ex.input))
            if pred != ex.output:
                return False, f"Mismatch on example {idx}"
        return True, None
    
    def forward(self, training_examples, test_inputs):
        hint = "Infer a general rule from ALL training pairs."
        
        for attempt in range(self.attempts):
            # Step 1: Generate code hypothesis
            pred = self.codegen(training_examples=training_examples, hint=hint)
            code = extract_code_block(pred.code)
            
            # Step 2: Load and compile
            fn, load_error = load_transform_func(code)
            if fn is None:
                hint = f"Attempt {attempt} failed to compile: {load_error}"
                continue
            
            # Step 3: Validate on ALL training examples
            ok, validation_error = self._verify_on_training(fn, training_examples)
            
            if ok:
                # Step 4: Execute on test inputs
                outputs = [fn(copy.deepcopy(g)) for g in test_inputs]
                return dspy.Prediction(test_outputs=outputs)
            else:
                # Step 5: Self-refine based on error feedback
                hint = f"Attempt {attempt} incorrect: {validation_error}. Fix it."
        
        # Fallback: identity transform
        return dspy.Prediction(test_outputs=[copy.deepcopy(g) for g in test_inputs])

program = CodeSynthesisSolver(attempts=5)
"""

# 🎯 GEPA discovered SELF-REFINEMENT!
# 
# The evolved agent automatically:
# 1. Hypothesizes a transformation rule from examples
# 2. Generates Python code implementing it
# 3. Validates the code on ALL training examples
# 4. Self-refines if validation fails (up to 5 attempts)
# 5. Only executes on test inputs after validation passes
#
# This strategy was NOT programmed - it EMERGED from optimization!

### Key Takeaway: Emergent Self-Refinement

GEPA discovered **self-refinement**—having the LLM validate and fix its own code before producing outputs. This is remarkable because:

1. **Not programmed**: Self-refinement emerged from optimization, not from human design
2. **Sophisticated strategy**: The agent now verifies on training before applying to test
3. **Multi-attempt recovery**: Up to 5 attempts in total, with each retry guided by targeted error feedback
4. **Code synthesis**: Instead of direct prediction, the agent writes executable code

This demonstrates GEPA's ability to discover **complex reasoning pipelines** that humans might not think to design.
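
The evolved program calls two helpers, `extract_code_block` and `load_transform_func`, whose bodies are not shown in the listing. The following is a minimal sketch of what such helpers might look like, not the implementation GEPA actually used: the regex, the error-message format, and the use of a bare `exec` are all assumptions made for illustration, and `exec` should only ever run trusted or sandboxed code.

```python
import re
from typing import Callable, Optional, Tuple

FENCE = "`" * 3  # a Markdown code fence, built this way so the snippet stays fence-safe


def extract_code_block(text: str) -> str:
    """Return the body of the first fenced code block, or the stripped text if none exists."""
    pattern = FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return (match.group(1) if match else text).strip()


def load_transform_func(code: str) -> Tuple[Optional[Callable], Optional[str]]:
    """Execute `code`; return (transform, None) on success or (None, error message)."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # assumption: the code is trusted or runs in a sandbox
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    fn = namespace.get("transform")
    if not callable(fn):
        return None, "code does not define a callable transform(grid)"
    return fn, None


# Hypothetical LLM reply containing a fenced code block
reply = (
    "Here you go:\n" + FENCE + "python\n"
    "def transform(grid):\n"
    "    return [row[::-1] for row in grid]\n" + FENCE
)
fn, error = load_transform_func(extract_code_block(reply))
print(error, fn([[1, 2], [3, 4]]))
```

Returning the error as a string rather than raising is what lets the solver feed the failure back into the next attempt's `hint`.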

---

<a id="section-7"></a>
# 7. Example 4: Algorithmic Discovery — Circle Packing

**Result: GEPA matches or exceeds AlphaEvolve, ShinkaEvolve, and OpenEvolve on circle packing.**

This example demonstrates **algorithmic discovery**: evolving code to attack a well-known, computationally hard optimization problem.

## The Challenge

Circle packing is a classic problem with real-world applications (chip layout, material cutting, logistics):
- Pack N non-overlapping circles inside a unit square [0,1] × [0,1]
- Maximize the sum of all radii
- The problem is non-convex, and closely related packing problems are **NP-hard**; no polynomial-time algorithm is known
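
The constraints above can be checked mechanically. The sketch below is a hypothetical validator, not the one shipped in `examples.circle_packing`: the function name, the tolerance, and the wording of the violation messages are assumptions. It shows the kind of diagnostic text that can later be passed to the optimizer as side information.

```python
import math
from typing import Dict, List, Sequence, Tuple

Circle = Tuple[float, float, float]  # (center_x, center_y, radius)


def validate_packing(circles: Sequence[Circle], tol: float = 1e-9) -> Dict[str, object]:
    """Check the unit-square and non-overlap constraints and report each violation."""
    violations: List[str] = []
    for i, (x, y, r) in enumerate(circles):
        if r < 0:
            violations.append(f"Circle {i} has negative radius {r}")
        if x - r < -tol or y - r < -tol or x + r > 1 + tol or y + r > 1 + tol:
            violations.append(f"Circle {i} extends outside the unit square")
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            (xi, yi, ri), (xj, yj, rj) = circles[i], circles[j]
            gap = math.hypot(xi - xj, yi - yj) - (ri + rj)
            if gap < -tol:
                violations.append(f"Circles {i} and {j} overlap by {-gap:.4f}")
    valid = not violations
    # An invalid packing scores zero so that constraint violations are never rewarded.
    score = sum(c[2] for c in circles) if valid else 0.0
    return {"valid": valid, "sum_radii": score, "violations": violations}


good = validate_packing([(0.25, 0.25, 0.25), (0.75, 0.75, 0.25)])
bad = validate_packing([(0.5, 0.5, 0.3), (0.6, 0.5, 0.3)])
print(good["sum_radii"], bad["violations"])
```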

Recent work from DeepMind (AlphaEvolve) and open-source efforts (ShinkaEvolve, OpenEvolve) has used LLMs to evolve packing algorithms.

## The Task

**Given**: The number of circles N (e.g., N=26)

**Find**: Python code that computes optimal circle placements

**What GEPA optimizes**: The packing algorithm code—placement strategies, local optimization, constraint handling.

<img src="./assets/blog/circle_packing_annotated2.png">

## Setting Up the Problem

Circle packing is a single-instance optimization problem—no dataset needed.

In [14]:
num_circles = 26
objective = f"Optimize circle packing code and refiner prompt to maximize sum of circle radii within a unit square for N={num_circles} circles."

## The Seed Candidate

In [12]:
from examples.circle_packing.llms import SEED_REFINEMENT_PROMPT


seed_candidate = {
    "code": '''
import numpy as np

def main(timeout, current_best_solution):
    """
    Circle packing optimization.

    Args:
        timeout: Time budget in seconds
        current_best_solution: Previous best circles array (n, 3) or None

    Returns:
        dict with 'circles' (n, 3) array and 'all_scores' list
    """
    n = 26

    # Use current_best_solution if provided, otherwise start fresh
    if current_best_solution is not None:
        circles = current_best_solution.copy()
    else:
        # Simple initial placement
        centers = np.zeros((n, 2))

        # Center circle
        centers[0] = [0.5, 0.5]

        # Ring of 8 around center
        for i in range(min(8, n - 1)):
            angle = 2 * np.pi * i / 8
            centers[i + 1] = [0.5 + 0.3 * np.cos(angle), 0.5 + 0.3 * np.sin(angle)]

        # Outer ring for remaining
        if n > 9:
            remaining = n - 9
            for i in range(remaining):
                angle = 2 * np.pi * i / remaining
                centers[i + 9] = [0.5 + 0.7 * np.cos(angle), 0.5 + 0.7 * np.sin(angle)]

        centers = np.clip(centers, 0.01, 0.99)
        radii = compute_max_radii(centers)
        circles = np.hstack([centers, radii.reshape(-1, 1)])

    score = float(np.sum(circles[:, 2]))
    return {'circles': circles, 'all_scores': [score]}


def compute_max_radii(centers):
    """Compute maximum radii that don't overlap and stay in unit square."""
    n = centers.shape[0]
    radii = np.ones(n)

    # Limit by distance to borders
    for i in range(n):
        x, y = centers[i]
        radii[i] = min(x, y, 1 - x, 1 - y)

    # Limit by distance to other circles
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.sqrt(np.sum((centers[i] - centers[j]) ** 2))
            if radii[i] + radii[j] > dist:
                scale = dist / (radii[i] + radii[j])
                radii[i] *= scale
                radii[j] *= scale

    return radii
''', 
    "refiner_prompt": SEED_REFINEMENT_PROMPT,
}

### Refiner

In [6]:
import dspy
import os

class RefinerSignature(dspy.Signature):
    """Refine the code based on its evaluation results by fixing the errors and improving the performance."""

    refiner_prompt = dspy.InputField(desc="Instructions for how to refine the code")
    code_to_improve = dspy.InputField(desc="Code to improve")
    code_results = dspy.InputField(
        desc="Evaluation results of the code to improve by fixing the errors and improving the performance"
    )
    refined_code = dspy.OutputField(
        desc="Next iteration of improved code based on the evaluation results"
    )

refiner_predictor = dspy.Predict(RefinerSignature)

refiner_lm = dspy.LM(
    "openai/gpt-5.1",
    temperature=1.0,
    max_tokens=32000,
    api_key=os.environ.get("OPENAI_API_KEY"),
    cache=True,
)

## The Fitness Function

The fitness function validates constraints and returns detailed violation information:

In [16]:
from examples.circle_packing.utils import execute_code
from examples.circle_packing.main import refine_code, StateTracker

state_tracker = StateTracker()
timeout = 600

def compute_multiple_metrics(
    global_best_score: float, all_scores: list[float]
) -> dict[str, float]:
    candidate_best_score = max(all_scores)
    alpha_fixed = 0.1
    ema_fixed = all_scores[0]
    for s in all_scores[1:]:
        ema_fixed = alpha_fixed * s + (1 - alpha_fixed) * ema_fixed
    alpha_adaptive = 2.0 / (len(all_scores) + 1)
    ema_adaptive = all_scores[0]
    for s in all_scores[1:]:
        ema_adaptive = alpha_adaptive * s + (1 - alpha_adaptive) * ema_adaptive

    return {
        "max_score": max(all_scores),
        "mean_score": sum(all_scores) / len(all_scores),
        "ema_score_fixed": ema_fixed,
        "ema_score_adaptive": ema_adaptive,
        "score_improvement_from_previous_best": candidate_best_score
        - global_best_score,
    }

def fitness_fn(candidate: dict[str, str], *args, **kwargs):
    """
    Evaluate the code candidate and the refiner prompt for the single circle-packing
    instance, and return (best_score, output, side_info).
    """
    code_candidate = candidate["code"]

    # Code candidate evaluation
    global_best_score, global_best_solution = state_tracker.get_best_solution()
    cache_key = code_candidate
    code_candidate_cache = state_tracker.get(cache_key)

    if code_candidate_cache is not None:
        code_score, code_side_info = code_candidate_cache
    else:
        code_result = execute_code(code_candidate, timeout, global_best_solution)
        circles = None

        if code_result["success"]:
            circles = code_result["result"]["circles"]
            all_scores = code_result["result"]["all_scores"]
            code_score = code_result["result"]["validation_details"]["sum_radii"]
            code_side_info = {
                "scores": compute_multiple_metrics(global_best_score, all_scores),
                "Code": code_candidate,
                "Circles": circles,
                "Global best circles at the time of evaluation": global_best_solution,
                "Stdout": code_result["stdout"],
            }
        else:
            code_score = 0.0
            code_side_info = {
                "scores": {"sum_radii": 0.0},
                "Code": code_candidate,
                "Error": code_result["error"],
                "Traceback": code_result.get("traceback", ""),
                "Stdout": code_result["stdout"],
                "Validation Details": code_result.get("validation_details"),
            }

        # Cache after computing values
        state_tracker.set(
            cache_key,
            (code_score, code_side_info),
            score=code_score,
            solution=circles,
            artifact={
                "code": code_candidate,
                "arg_current_best_solution": global_best_solution,
                "validation details": code_result.get("validation_details"),
            },
        )

    print("Code candidate side info:")
    print(code_side_info)

    # Refiner prompt evaluation
    # Now that we have the code's results, a cache key can be formed as (prompt, code, best_solution).
    # The refiner receives the code, its score, and its side info.
    print("Refining code...")

    refiner_prompt_candidate = candidate["refiner_prompt"]
    global_best_score, global_best_solution = state_tracker.get_best_solution()

    # Refine code for this problem
    (
        refiner_score,
        refiner_code,
        refiner_side_info,
    ) = refine_code(
        code=code_candidate,
        code_score=code_score,
        code_side_info=code_side_info,
        refiner_prompt=refiner_prompt_candidate,
        refiner_predictor=refiner_predictor,
        refiner_lm=refiner_lm,
        timeout=timeout,
        state_tracker=state_tracker,
    )

    if refiner_score > code_score:
        best_score = refiner_score
        best_code = refiner_code
        best_circles = refiner_side_info.get("Circles", None)
    else:
        best_score = code_score
        best_code = code_candidate
        best_circles = code_side_info.get("Circles", None)

    if best_circles is not None:
        best_circles = best_circles.tolist()

    output = {
        "best_score": best_score,
        "best_code": best_code,
        "best_circles": best_circles,
        "code_candidate": code_candidate,
        "code_score": code_score,
        "refiner_prompt": refiner_prompt_candidate,
        "refiner_code": refiner_code,
        "refiner_score": refiner_score,
    }

    side_info = {
        "scores": {
            "best_score_from_code_and_refiner": max(code_score, refiner_score),
            "initial_code": code_score,
            "refiner_prompt": refiner_score,
        },
        "Input": {
            "Timeout (s)": timeout,
        },
        "code_specific_info": code_side_info,
        "refiner_prompt_specific_info": refiner_side_info,
    }

    return (best_score, output, side_info)
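
The fitness function relies on `StateTracker` for two things: caching evaluations by candidate and remembering the best packing found so far. Its implementation lives in `examples.circle_packing.main` and is not shown here. The class below is a hypothetical stand-in that mirrors only the three calls used above (`get`, `set`, `get_best_solution`); the internal fields and the initial best score of `0.0` are assumptions.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple


class SimpleStateTracker:
    """Hypothetical stand-in: an evaluation cache plus a running global best."""

    def __init__(self) -> None:
        self._cache: Dict[Hashable, Any] = {}
        self._best_score: float = 0.0
        self._best_solution: Optional[Any] = None
        self.artifacts: List[Any] = []

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value for `key`, or None if it was never evaluated."""
        return self._cache.get(key)

    def get_best_solution(self) -> Tuple[float, Optional[Any]]:
        """Return (best score so far, solution that achieved it)."""
        return self._best_score, self._best_solution

    def set(self, key: Hashable, value: Any, *, score: float,
            solution: Optional[Any] = None, artifact: Optional[Any] = None) -> None:
        """Cache `value` and promote `solution` to global best if it scores higher."""
        self._cache[key] = value
        if artifact is not None:
            self.artifacts.append(artifact)
        if solution is not None and score > self._best_score:
            self._best_score, self._best_solution = score, solution


tracker = SimpleStateTracker()
tracker.set("code_v1", (0.98, {"Stdout": ""}), score=0.98, solution=[[0.5, 0.5, 0.1]])
tracker.set("code_v2", (0.0, {"Error": "IndexError"}), score=0.0)  # a failed run
print(tracker.get_best_solution())
```

Because the best solution is passed back into each candidate through `current_best_solution`, later candidates can warm-start from earlier progress instead of packing from scratch.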


## Running GEPA Optimization

In [17]:
from gepa.optimize_anything import (
    EngineConfig,
    GEPAConfig,
    ReflectionConfig,
    optimize_anything,
)
from examples.circle_packing.llms import CIRCLE_PACKING_BACKGROUND, SEED_REFINEMENT_PROMPT


result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    objective=objective,
    background=CIRCLE_PACKING_BACKGROUND,
    config=GEPAConfig(
        engine=EngineConfig(
            max_metric_calls=200,
            track_best_outputs=True,
            frontier_type="objective",
            cache_evaluation=True,
        ),
        reflection=ReflectionConfig(
            reflection_lm="openai/gpt-5",
            reflection_minibatch_size=1,
        ),
    ),
)

New best solution found: 0.9798
Logging state...
Best score: 0.9798

## What GEPA Discovered

GEPA evolved the simple ring-based baseline into a **sophisticated multi-strategy optimizer**. Here's what the evolved code includes:

### Strategies Discovered by GEPA

| Strategy | Description | How It Helps |
|----------|-------------|--------------|
| **Halton sequences** | Quasi-random initialization | Better initial coverage than random |
| **Zero-vector seeding** | Start from origin | Often near polynomial optima |
| **CMA-ES-style evolution** | Covariance matrix adaptation | Adapts search direction to landscape |
| **Quadratic surrogate models** | Local function approximation | Efficient local optimization |
| **Coordinate descent** | Per-dimension refinement | Fine-tunes individual coordinates |
| **Nelder-Mead subspace** | Simplex method in active dimensions | Exploits important variables |
| **Ridge-linear probes** | Gradient estimation from archive | Uses history for direction hints |

### Results

The evolved code reaches a sum of radii of 2.6359 for N=26 (up from 0.9798 for the seed), which **matches or exceeds** published results from:
- **AlphaEvolve** (DeepMind)
- **ShinkaEvolve**
- **OpenEvolve**

All without any human optimization expertise—just the problem definition and a baseline!

### Key Takeaway

GEPA automatically discovered advanced optimization strategies (Halton sequences, CMA-ES, surrogate models) that typically require expert knowledge to implement. The user only needed to:
1. Define the problem (pack circles)
2. Provide a naive baseline (concentric-ring placement)
3. Return informative `side_info` (violations, scores)

---

<a id="section-8"></a>
# 8. How It Works Under the Hood

GEPA (Genetic-Pareto) operates through a loop of **evaluation**, **selection**, **reflection**, and **proposal**:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              GEPA LOOP                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐                                                          │
│   │  EVALUATE    │  Run fitness_fn on candidates                            │
│   │              │  → Collect scores AND side_info                          │
│   └──────┬───────┘                                                          │
│          │                                                                  │
│          ▼                                                                  │
│   ┌──────────────┐                                                          │
│   │   SELECT     │  Choose candidates for mutation                          │
│   │              │  → Pareto selection across objectives/instances          │
│   │              │  → Epsilon-greedy exploration                            │
│   └──────┬───────┘                                                          │
│          │                                                                  │
│          ▼                                                                  │
│   ┌──────────────┐                                                          │
│   │   REFLECT    │  LLM analyzes evaluation results                         │
│   │              │  → "Why did this candidate fail?"                        │
│   │              │  → Uses side_info to understand failure modes            │
│   └──────┬───────┘                                                          │
│          │                                                                  │
│          ▼                                                                  │
│   ┌──────────────┐                                                          │
│   │   PROPOSE    │  LLM generates improved candidates                       │
│   │              │  → Targeted mutations based on reflection                │
│   │              │  → Preserves successful behaviors                        │
│   └──────┬───────┘                                                          │
│          │                                                                  │
│          └──────────────────► REPEAT until stopping condition               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
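
To make the loop concrete, here is a deliberately tiny sketch of it. It is not GEPA's engine: selection is plain greedy rather than Pareto-based, and the "LLM" is replaced by a deterministic `toy_propose` function. All names here (`toy_optimize`, `toy_fitness_fn`, `toy_propose`) are hypothetical. What it does show is the role of side information: the proposer reads the signed error, not just the score, and steers toward the target.

```python
from typing import Any, Callable, Dict, Tuple

Evaluation = Tuple[float, Dict[str, Any]]  # (score, side_info)


def toy_optimize(
    seed: str,
    fitness_fn: Callable[[str], Evaluation],
    propose: Callable[[str, Dict[str, Any]], str],
    budget: int,
) -> Tuple[str, float]:
    """Greedy evaluate -> reflect/propose -> evaluate -> select loop over text candidates."""
    best = seed
    best_score, best_info = fitness_fn(best)      # EVALUATE the seed
    for _ in range(budget):
        child = propose(best, best_info)          # REFLECT + PROPOSE (stand-in for the LLM)
        score, info = fitness_fn(child)           # EVALUATE the proposal
        if score > best_score:                    # SELECT (greedy, unlike GEPA's Pareto step)
            best, best_score, best_info = child, score, info
    return best, best_score


def toy_fitness_fn(candidate: str) -> Evaluation:
    """Score a numeric string by closeness to 42; expose the signed error as side info."""
    error = 42 - int(candidate)
    return -abs(error), {"error": error}


def toy_propose(candidate: str, side_info: Dict[str, Any]) -> str:
    """Use the signed error to move roughly halfway toward the target."""
    error = side_info["error"]
    step = error // 2 if abs(error) > 1 else error
    return str(int(candidate) + step)


print(toy_optimize("0", toy_fitness_fn, toy_propose, budget=10))
```

A proposer that saw only the scalar score would not know which direction to move; the signed error plays the role that tracebacks and violation messages play in the real examples.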

## Key Components

### 1. Pareto Frontier
GEPA maintains a **Pareto frontier** of candidates that are optimal on different subsets of the data:
- **Multi-objective**: Some candidates optimize for accuracy, others for speed
- **Instance-level**: Some candidates excel on certain problem types
- **Diversity**: The frontier preserves diverse strategies for exploration
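
An instance-level frontier can be sketched in a few lines. This is a simplified illustration under stated assumptions, not GEPA's selection code: it keeps every candidate that ties for best on at least one instance and samples parents in proportion to how many instances they win. The function names and the weighting rule are assumptions.

```python
import random
from typing import Dict, List


def frontier_wins(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Map each frontier candidate to the number of instances where it ties for best."""
    num_instances = len(next(iter(scores.values())))
    wins: Dict[str, int] = {}
    for i in range(num_instances):
        best = max(per_instance[i] for per_instance in scores.values())
        for name, per_instance in scores.items():
            if per_instance[i] == best:
                wins[name] = wins.get(name, 0) + 1
    return wins


def sample_parent(scores: Dict[str, List[float]], rng: random.Random) -> str:
    """Pick a parent from the frontier, weighted by its number of instance wins."""
    wins = frontier_wins(scores)
    names = list(wins)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


# Per-instance scores for three hypothetical candidates on three instances
scores = {
    "A": [1.0, 0.0, 0.0],   # a specialist: best only on instance 0
    "B": [0.0, 1.0, 1.0],   # best on instances 1 and 2
    "C": [0.5, 0.5, 0.5],   # decent average, but never the best anywhere
}
print(frontier_wins(scores))
print(sample_parent(scores, random.Random(0)))
```

Note that candidate `C` has a higher mean score than `A` yet is excluded, while the specialist `A` survives. Keeping specialists is what preserves diverse strategies for later mutation.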

### 2. Reflective Mutation
Unlike **random mutation** in traditional evolutionary algorithms, GEPA uses LLMs to make **targeted improvements**:

| Traditional EA | GEPA |
|---------------|------|
| Random bit flips | LLM analyzes failure modes |
| Blind crossover | LLM preserves working patterns |
| Requires many generations | Sample-efficient |
| No domain knowledge | Uses side_info for context |

### 3. Side Information Flow
The `side_info` returned by your fitness function powers the reflection:

```python
# What the LLM sees during reflection (illustrative, not real output):
"""
Current candidate: {code: "def solve(x): ..."}

Evaluation results on 3 examples:
  Example 1: Score 0.8
    Input: "Pack 26 circles"
    Output: circles array
    Error: "Circles 3 and 7 overlap"
    
  Example 2: Score 0.0  
    Input: "Pack 26 circles"
    Error: "IndexError on line 42"
    
  Example 3: Score 1.0
    Input: "Pack 26 circles"
    Output: Valid packing with sum_radii=2.63

Propose an improved version that fixes these issues.
"""
```
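
One way such a reflection message could be assembled from a candidate and its `side_info` is sketched below. The exact template GEPA uses is not shown in this post, so the layout, the function name `format_reflection_prompt`, and the shape of the `evaluations` records are assumptions.

```python
from typing import Any, Dict, List


def format_reflection_prompt(
    candidate: Dict[str, str], evaluations: List[Dict[str, Any]]
) -> str:
    """Render a candidate and its per-example side_info as plain text for the reflection LLM."""
    lines = ["Current candidate:"]
    for name, text in candidate.items():
        lines.append(f"  [{name}]")
        lines.extend(f"    {row}" for row in text.splitlines())
    lines.append("")
    lines.append(f"Evaluation results on {len(evaluations)} example(s):")
    for i, evaluation in enumerate(evaluations, start=1):
        lines.append(f"  Example {i}: Score {evaluation['score']}")
        for key, value in evaluation["side_info"].items():
            lines.append(f"    {key}: {value}")
    lines.append("")
    lines.append("Propose an improved version that fixes these issues.")
    return "\n".join(lines)


prompt = format_reflection_prompt(
    {"code": "def solve(x):\n    return x"},
    [
        {"score": 0.8, "side_info": {"Error": "Circles 3 and 7 overlap"}},
        {"score": 0.0, "side_info": {"Error": "IndexError on line 42"}},
    ],
)
print(prompt)
```

Because every key in `side_info` is rendered verbatim, descriptive key names such as `Error` or `Traceback` end up doing part of the prompting work.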

---

<a id="section-9"></a>
# 9. Conclusion: From Imperative to Declarative Optimization

We are witnessing a **paradigm shift** in optimization—from imperative implementations to declarative specifications:

| **Old Paradigm** | **New Paradigm with `optimize_anything`** |
|------------------|------------------------------------------|
| Imperative: specify *how* to optimize | Declarative: specify *what* to optimize |
| Different libraries for different problems | **One API for everything** |
| Mathematically-specific algorithms | Language-driven proposal generation |
| Scalar fitness only | **Rich diagnostic information (ASI)** |
| Random mutations | **Targeted, reflective mutations** |
| Expert knowledge required | LLM brings domain knowledge |

## The `optimize_anything` Vision

**If it can be represented as text, it can be optimized.**

| Domain | What You Optimize | Example |
|--------|-------------------|---------|
| **Code** | Algorithms, implementations | Black-box optimization code |
| **Prompts** | Instructions, examples | System prompts for math problems |
| **Agent Architectures** | Program structure, control flow | DSPy programs for ARC-AGI |
| **Configurations** | Hyperparameters, settings | JSON/YAML configs |
| **Data Structures** | Schemas, templates | API specifications |

## Why This Matters

1. **Democratization**: You don't need a PhD in optimization to solve hard problems
2. **Generalization**: One framework, infinite applications
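3. **Sample Efficiency**: Reflection on diagnostic feedback lets each proposal target a known failure, so fewer evaluations are wasted than with random mutation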
4. **Emergent Capabilities**: GEPA discovers strategies you wouldn't think of

## Getting Started

```bash
pip install gepa
```

In [None]:
from gepa.optimize_anything import optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig

# 1. Define your seed candidate (starting point)
seed_candidate = {
    "my_param": "initial value"  # Can be code, prompt, config, etc.
}

# 2. Define your fitness function (how to measure success)
def fitness_fn(candidate, example=None):
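    # Run your system with the candidate.
    # (run_my_system, compute_score, get_error_message, and analyze_performance
    # are placeholders for your own code.)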
    output = run_my_system(candidate["my_param"], example)
    
    # Compute score (higher is better)
    score = compute_score(output, example)
    
    # Collect rich diagnostic information (ASI)
    side_info = {
        "Input": example,
        "Output": output,
        "Expected": example.get("answer") if example else None,
        "Error": get_error_message(output),
        "Feedback": analyze_performance(output),
    }
    
    return score, output, side_info

# 3. Run optimization
result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    dataset=my_examples,  # Optional: for multi-instance mode
    objective="Find a parameter that maximizes performance",  # Optional: guidance
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=100),
        reflection=ReflectionConfig(reflection_lm="openai/gpt-4o"),
    ),
)

# 4. Use the optimized result
print("Best candidate:", result.best_candidate)
print("Best score:", result.best_score)

## Summary: What We Showed

| Example | What We Optimized | Key Insight |
|---------|-------------------|-------------|
| **Mathematical Optimization** | Python code for black-box optimization | GEPA discovers algorithms automatically |
| **Prompt Engineering** | System prompts for math problems | LLM reflection finds domain-specific strategies |
| **Agent Evolution** | DSPy programs for ARC-AGI | Self-refinement emerged without being programmed |
| **Algorithmic Discovery** | Circle packing algorithms | Exceeds reported results from AlphaEvolve, ShinkaEvolve, and OpenEvolve |

## Key Takeaways

1. **Unified Interface**: One API for prompts, code, configs, and agent architectures

2. **Side Information (ASI) is Key**: The more diagnostic information you provide, the better GEPA can reason about improvements

3. **Beyond Scalar Optimization**: Traditional optimizers only see scores; GEPA sees error messages, execution traces, and domain-specific feedback

4. **Emergent Capabilities**: Sophisticated strategies (like self-refinement in ARC-AGI) emerge without explicit programming

5. **The Convex Hull**: `optimize_anything` is designed to cover all text-based optimization problems under one abstraction
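
As a sketch of how error messages and execution traces might be gathered into `side_info`, the helper below wraps any callable and records the exception and traceback when it fails. `run_with_side_info` and `safe_divide` are hypothetical names used for illustration and are not part of GEPA; the dictionary keys simply mirror those used in the Getting Started template.

```python
import traceback


def run_with_side_info(fn, *args):
    """Call fn(*args); return (output, side_info) with error details captured."""
    side_info = {"Input": args, "Output": None, "Error": None, "Traceback": None}
    try:
        output = fn(*args)
        side_info["Output"] = output
    except Exception as exc:
        # Keep both a short message and the full trace for the reflection LM.
        output = None
        side_info["Error"] = f"{type(exc).__name__}: {exc}"
        side_info["Traceback"] = traceback.format_exc()
    return output, side_info


def safe_divide(a, b):
    # Hypothetical system under test; fails when b == 0.
    return a / b


ok_output, ok_info = run_with_side_info(safe_divide, 6, 3)
bad_output, bad_info = run_with_side_info(safe_divide, 1, 0)
print(ok_info["Output"], bad_info["Error"])
```

Recording the exception type next to the message gives the reflection step more to work with than a bare score of zero.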

---

## Try It Yourself

**If you can express your system's parameters as text and compute a score with diagnostic feedback, GEPA can optimize it.**

```bash
pip install gepa
```

```python
from gepa.optimize_anything import optimize_anything

result = optimize_anything(
    seed_candidate={"your_param": "your_value"},
    fitness_fn=your_fitness_function,
)
```
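
The snippet above leaves `your_fitness_function` undefined. As a minimal sketch, assuming the `(score, output, side_info)` return convention from the Getting Started template, a fitness function for a JSON config stored as text might look like the following. The `TARGET` settings and the relative-error score are invented for illustration; a real fitness function would run your system instead.

```python
import json

# Hypothetical "ideal" settings, used only to make the sketch self-contained.
TARGET = {"learning_rate": 0.01, "batch_size": 32}


def your_fitness_function(candidate, example=None):
    """Score a JSON config held as text; 1.0 is a perfect match, 0.0 is unusable."""
    text = candidate["your_param"]
    try:
        config = json.loads(text)
        lr = float(config["learning_rate"])
        bs = float(config["batch_size"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        # Malformed candidates get the lowest score plus a readable reason.
        side_info = {"Input": text, "Error": f"{type(exc).__name__}: {exc}"}
        return 0.0, None, side_info

    # Sum of relative errors against the target; smaller is better.
    distance = (
        abs(lr - TARGET["learning_rate"]) / TARGET["learning_rate"]
        + abs(bs - TARGET["batch_size"]) / TARGET["batch_size"]
    )
    score = 1.0 / (1.0 + distance)
    side_info = {
        "Input": text,
        "Parsed": config,
        "Relative error": distance,
        "Error": None,
    }
    return score, config, side_info


score, output, info = your_fitness_function(
    {"your_param": '{"learning_rate": 0.1, "batch_size": 32}'}
)
print(score, info["Relative error"])
```

Returning the parse error as text, rather than raising, lets a malformed candidate still produce feedback the proposer can act on.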

---

*GEPA is open-source. Star us on [GitHub](https://github.com/stanfordnlp/gepa)!*

---

## Appendix: Full Code Examples

The complete, runnable code for all examples in this post can be found in the `examples/` directory:

- `examples/new_polynomial/` — Mathematical optimization (EvalSet)
- `examples/math/` — Prompt engineering (AIME 2025)
- `examples/arc_agi/` — Agent program evolution (ARC-AGI)
- `examples/circle_packing/` — Algorithmic discovery (Circle Packing)

---

## Minimal Working Example: Optimize a Sorting Function

Here's a complete, runnable example that optimizes a Python sorting function:

In [None]:
"""
Minimal working example: Optimize a sorting function
This evolves Python code that sorts a list of numbers.
"""
import time
from gepa.optimize_anything import optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig

# 1. SEED CANDIDATE: A naive bubble sort implementation
seed_candidate = {
    "code": """
def sort_list(arr):
    '''Sort a list of numbers in ascending order.'''
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr
"""
}

# 2. DATASET: Test cases to optimize on
dataset = [
    {"input": [64, 34, 25, 12, 22, 11, 90], "expected": [11, 12, 22, 25, 34, 64, 90]},
    {"input": [5, 1, 4, 2, 8], "expected": [1, 2, 4, 5, 8]},
    {"input": [3, 3, 1, 2, 1], "expected": [1, 1, 2, 3, 3]},
    {"input": list(range(100, 0, -1)), "expected": list(range(1, 101))},  # Worst case
]

# 3. FITNESS FUNCTION: Measure correctness and speed
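def fitness_fn(candidate, example):
    code = candidate["code"]
    result = None  # Stays None if the candidate fails before producing output

    try:
        # Execute the candidate in its own namespace so its definitions
        # do not leak into (or get shadowed by) this module's globals.
        # Note: exec runs arbitrary code; sandbox untrusted candidates.
        namespace = {}
        exec(code, namespace)
        sort_list = namespace["sort_list"]

        # Time the execution on a copy so the dataset is not mutated
        start = time.perf_counter()
        result = sort_list(example["input"].copy())
        elapsed = time.perf_counter() - start

        # Check correctness
        correct = result == example["expected"]
        score = 1.0 if correct else 0.0

        # Bonus for speed (if correct)
        if correct and elapsed < 0.001:
            score += 0.1

        # Rich side_info for LLM reflection
        side_info = {
            "Input": example["input"],
            "Output": result,
            "Expected": example["expected"],
            "Correct": correct,
            "Time (ms)": elapsed * 1000,
            "Error": None,
        }

    except Exception as e:
        # Covers syntax errors, a missing sort_list, and runtime failures
        score = 0.0
        side_info = {
            "Input": example["input"],
            "Error": f"{type(e).__name__}: {e}",
            "Code": code,
        }

    return score, {"code": code, "result": result}, side_info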

# 4. RUN OPTIMIZATION
result = optimize_anything(
    seed_candidate=seed_candidate,
    fitness_fn=fitness_fn,
    dataset=dataset,
    objective="Optimize the sorting function for correctness and speed.",
    background="Consider algorithms like quicksort, mergesort, or heapsort.",
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=50),
        reflection=ReflectionConfig(reflection_lm="openai/gpt-4o-mini"),
    ),
)

# 5. USE THE RESULT
print("=" * 60)
print("OPTIMIZED CODE:")
print("=" * 60)
print(result.best_candidate["code"])
print(f"\nBest score: {result.best_score}")

### What This Example Demonstrates

1. **Seed Candidate**: We start with a naive O(n²) bubble sort
2. **Dataset**: Four test cases including a worst-case reversed list
3. **Fitness Function**:
   - Returns a correctness score (0 or 1), plus a 0.1 bonus when a correct run finishes in under 1 ms
   - Returns **rich side_info** including input, output, timing, and errors
4. **Optimization**: GEPA evolves the code in search of faster algorithms
5. **Result**: The optimized candidate may replace bubble sort with an O(n log n) algorithm such as quicksort or mergesort

The key is the `side_info` dictionary—it tells GEPA exactly what went wrong so it can make targeted improvements.
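
One caveat: the speed bonus in the fitness function rests on a single wall-clock measurement, which can be noisy for sub-millisecond runs. A possible refinement, sketched here with a hypothetical `median_runtime_ms` helper that is not part of GEPA, is to time several runs on fresh copies of the input and report the median.

```python
import statistics
import time


def median_runtime_ms(fn, data, repeats=5):
    """Median wall-clock time of fn on a fresh copy of data, in milliseconds."""
    samples = []
    for _ in range(repeats):
        fresh = list(data)  # In-place sorts must not see pre-sorted input
        start = time.perf_counter()
        fn(fresh)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def bubble_sort(arr):
    """The seed algorithm from the example, used here as a timing subject."""
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr


reversed_input = list(range(100, 0, -1))
slow_ms = median_runtime_ms(bubble_sort, reversed_input)
fast_ms = median_runtime_ms(sorted, reversed_input)
print(f"bubble sort: {slow_ms:.4f} ms, built-in sorted: {fast_ms:.4f} ms")
```

Reporting the median in `side_info` instead of a single sample should make the timing feedback less sensitive to scheduler jitter, at the cost of a few extra runs per evaluation.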

---

## When to Use `optimize_anything`

### Best Use Cases

| Problem Type | Example | Why GEPA Excels |
|--------------|---------|-----------------|
| **Prompt Engineering** | System prompts, few-shot examples | LLM understands language nuances |
| **Code Evolution** | Algorithm design, bug fixes | LLM can read and write code |
| **Agent Architecture** | DSPy programs, reasoning pipelines | LLM can propose structural changes |
| **Configuration Tuning** | JSON/YAML configs | LLM understands parameter relationships |
| **Template Optimization** | Email templates, API specs | LLM understands domain context |

### When Traditional Methods May Be Better

| Problem Type | Better Alternative | Why |
|--------------|-------------------|-----|
| **Neural Network Training** | PyTorch + SGD | Gradient information is crucial |
| **Convex Optimization** | SciPy, CVXPY | Mathematical structure exploitable |
| **Combinatorial (small scale)** | OR-Tools, SAT solvers | Exact methods available |

### The Rule of Thumb

**Use `optimize_anything` when:**
1. The artifact being optimized can be meaningfully represented as text
2. You can provide informative feedback about why candidates fail
3. Domain knowledge would help but isn't easily encoded as math
4. The search space is too complex for grid/random search

---

*Questions? Issues? Contributions welcome at [github.com/stanfordnlp/gepa](https://github.com/stanfordnlp/gepa)*