EvoReplay
What Do Evolutionary Coding Agents Evolve?

Nico Pelleriti · Sree Harsha Nelaturu · Zhanke Zhou · Zongze Li · Max Zimmer · Bo Han · Sebastian Pokutta

Overview

EvoReplay is the post-run analysis suite that accompanies our paper "What Do Evolutionary Coding Agents Evolve?". It takes the raw traces produced by an evolutionary code-search run (the population of candidate programs, their parent links, prompts, scores, and per-iteration metrics) and turns them into the static measurements, cycling detections, counterfactual replays, and LLM-judged edit-taxonomy labels we report in the paper.

The companion dataset of traces lives on the Hugging Face Hub: https://huggingface.co/datasets/ZIB-IOL/EvoTrace

Key Features

📐 Static analysis — lines of code, hyperparameter counts, lineage depth, and best-so-far trajectories per run
🔁 Cycling detection — line-level recycling of removed code, with structural-only and tuning-only modes
🏷️ Edit taxonomy — LLM-as-judge labelling of every parent → child diff into nine categories, with a hand-labelled gold set and inter-rater agreement tooling
🎯 Agentic Bayesian-optimisation tuning — an LLM proposes tunable knobs + intervals on a frozen program, scikit-optimize searches over them
🎬 Breakthrough replay — re-run the prompts that caused best-so-far updates under different models / context strategies
🐍 Python + C++ literal extractors so the same pipeline works on both supported trace languages

Supported Run Layouts

evo_replay operates on a <run_dir>/ and auto-detects which of two layouts it is reading. Both produce the same analyses.

Refined (preferred — produced by scripts/refine_outputs.py):

<run_dir>/
    meta.json
    run_config.yaml             (canonical 3 backends only)
    programs.jsonl              one row per unique program; canonical fields
                                (incl. solution_sha256, prompts_sha256)
    iterations.jsonl
    iter_scalars.jsonl
    blobs/<sha[:2]>/<sha>.{txt,json}    content-addressed code & prompts
    best/, logs/, analysis/     (canonical 3 backends; symlinks or copies)

Raw (legacy search-framework output):

<run_dir>/
    run_config.yaml
    run_info.json
    checkpoints/checkpoint_<N>/programs/<uuid>.json
    best/
    logs/

core.checkpoints.load_programs(run_dir) returns the same {pid: program_record} dict from either layout. For the refined layout it dereferences the content-addressed blobs and re-injects them as program["solution"] / program["prompts"], so downstream code does not need to know which layout it is reading.
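As a rough illustration of the dereferencing step for the refined layout, the sketch below resolves blob hashes into inline fields. It is not the code behind core.checkpoints.load_programs: the function names, the `id` key, and the mapping of solutions to .txt blobs and prompts to .json blobs are assumptions made for this example.

```python
import json
from pathlib import Path


def read_blob(run_dir: Path, sha: str, suffix: str):
    """Read blobs/<sha[:2]>/<sha><suffix>; parse .json blobs, return text otherwise."""
    path = run_dir / "blobs" / sha[:2] / f"{sha}{suffix}"
    text = path.read_text(encoding="utf-8")
    return json.loads(text) if suffix == ".json" else text


def load_refined_programs(run_dir, id_key="id"):
    """Return {pid: record} with blobs re-injected as 'solution' / 'prompts'.

    Assumptions (hypothetical): every programs.jsonl row carries `id_key`,
    solutions live in .txt blobs and prompts in .json blobs.
    """
    run_dir = Path(run_dir)
    programs = {}
    with open(run_dir / "programs.jsonl", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("solution_sha256"):
                record["solution"] = read_blob(run_dir, record["solution_sha256"], ".txt")
            if record.get("prompts_sha256"):
                record["prompts"] = read_blob(run_dir, record["prompts_sha256"], ".json")
            programs[record[id_key]] = record
    return programs
```

Because identical code or prompts hash to the same blob, a program shared by several rows is stored once on disk and only expanded at load time.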

Setup

# Clone
git clone https://github.com/ZIB-IOL/EvoReplay.git
cd EvoReplay

# Install (uses uv: https://docs.astral.sh/uv/)
uv sync

The breakthrough_replay/ module additionally depends on the underlying evolutionary-search framework that produced the traces. Install it from your local checkout:

uv pip install -e /path/to/search-framework

LLM endpoint

agentic_tuning/ and breakthrough_replay/ need an OpenAI-compatible endpoint:

export OPENAI_API_KEY=...
export EVO_REPLAY_API_BASE=https://your-endpoint/v1

Both also accept --api-base on the command line.
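One plausible way to combine the two configuration paths is sketched below. The precedence (command-line flag over environment variable) and the helper name resolve_api_base are assumptions, not a description of how the modules actually resolve the endpoint.

```python
import argparse
import os


def resolve_api_base(argv=None, env=None):
    """Return the endpoint URL, preferring --api-base over EVO_REPLAY_API_BASE.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--api-base", default=None)
    args, _unknown = parser.parse_known_args(argv)  # ignore unrelated flags
    api_base = args.api_base or env.get("EVO_REPLAY_API_BASE")
    if not api_base:
        raise SystemExit("Set EVO_REPLAY_API_BASE or pass --api-base")
    return api_base.rstrip("/")
```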

Pointing at the dataset

The example scripts and gold-set builder expect a local checkout of the companion dataset:

# Either fetch with the HF CLI:
huggingface-cli download ZIB-IOL/EvoTrace --repo-type dataset --local-dir ./evo_trace_anon

# ...or with git-lfs:
git clone https://huggingface.co/datasets/ZIB-IOL/EvoTrace evo_trace_anon

# Then tell EvoReplay where it lives (or pass --trace-root):
export EVO_TRACE_ROOT="$(pwd)/evo_trace_anon"

Usage

1. Static analysis (LOC, hyperparameter counts, lineage)

uv run python -m evo_replay.static.run_static <run_dir>

Auto-detects language (Python / C++) from run_config.yaml, falling back to the first programs.jsonl row, then best/best_program.{cpp,py}. Outputs land under <run_dir>/analysis/.
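The fallback chain can be pictured with the following sketch. It only mirrors the order described above; the `language` key in run_config.yaml and in programs.jsonl rows is hypothetical, and the YAML file is scanned with a regular expression to keep the example dependency-free.

```python
import json
import re
from pathlib import Path

_ALIASES = {"py": "python", "python": "python", "cpp": "cpp", "c++": "cpp"}


def detect_language(run_dir):
    """Return 'python' or 'cpp', or None if no source of truth is found."""
    run_dir = Path(run_dir)

    # 1. run_config.yaml (assumed: a `language: <name>` line)
    config = run_dir / "run_config.yaml"
    if config.is_file():
        match = re.search(r"^\s*language:\s*['\"]?([\w+]+)",
                          config.read_text(encoding="utf-8"), re.M)
        if match and match.group(1).lower() in _ALIASES:
            return _ALIASES[match.group(1).lower()]

    # 2. first row of programs.jsonl (assumed: a `language` field)
    programs = run_dir / "programs.jsonl"
    if programs.is_file():
        with open(programs, encoding="utf-8") as fh:
            first = fh.readline().strip()
        if first:
            lang = str(json.loads(first).get("language", "")).lower()
            if lang in _ALIASES:
                return _ALIASES[lang]

    # 3. extension of best/best_program.{cpp,py}
    for ext, lang in ((".cpp", "cpp"), (".py", "python")):
        if (run_dir / "best" / f"best_program{ext}").is_file():
            return lang
    return None
```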

2. Cycling detection

# Raw cycling
uv run python -m evo_replay.cycling.detect_cycling <run_dir> \
    --csv <run_dir>/analysis/cycles_raw.csv

# Structural-only (strips numeric-tuning churn)
uv run python -m evo_replay.cycling.detect_cycling <run_dir> \
    --collapse-numbers --exclude-hyperparams \
    --csv <run_dir>/analysis/cycles_structural.csv

# Per-edit composition (pure-tuning vs structural)
uv run python -m evo_replay.cycling.classify_edits <run_dir>
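To make the idea of line-level cycling concrete, here is a minimal sketch of the technique: walk a lineage from root to leaf, remember which lines disappear, and flag any line that later reappears. It is a simplification rather than the detector shipped in cycling/: lines are compared as sets (duplicates are ignored), and the collapse_numbers option only approximates what --collapse-numbers might do.

```python
import re

# Numeric literals that are not part of an identifier (e.g. the 1 in x1 is kept).
_NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")


def normalise(line, collapse_numbers=False):
    """Strip whitespace and optionally replace numeric literals with 'N'."""
    line = line.strip()
    return _NUMBER.sub("N", line) if collapse_numbers else line


def detect_cycles(lineage, collapse_numbers=False):
    """Find lines that were removed along a lineage and later re-added.

    `lineage` is a list of source strings ordered root -> leaf.
    Returns (removed_at, readded_at, line) tuples of generation indices.
    """
    removed = {}  # line -> generation index at which it disappeared
    cycles = []
    prev = set()
    for gen, source in enumerate(lineage):
        cur = {normalise(l, collapse_numbers) for l in source.splitlines()}
        cur.discard("")
        if gen > 0:
            for line in sorted(cur - prev):  # lines added in this generation
                if line in removed:
                    cycles.append((removed.pop(line), gen, line))
            for line in prev - cur:  # lines removed in this generation
                removed[line] = gen
        prev = cur
    return cycles
```

With the lineage `["lr = 0.1", "lr = 0.2", "lr = 0.1"]`, the raw mode reports one cycle, while collapse_numbers=True reports none because all three lines normalise to `lr = N`. This is the sense in which a structural-only mode can strip numeric-tuning churn.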

3. Edit-taxonomy classification

An LLM judge classifies every parent → child diff into one of the nine taxonomy categories. The on-disk cache is content-addressed, so re-runs and parents shared across runs do not incur repeated judge calls:

# Score the judge against the hand-labelled gold set
uv run python -m evo_replay.edit_taxonomy.gold score

# Classify every edit in a run
uv run python -m evo_replay.edit_taxonomy.run_classify <run_dir>

The shipped wrapper scripts/classify_edits.sh runs the gold check followed by three representative runs from the dataset.
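A content-addressed judge cache of the kind mentioned above could be sketched as follows. The key derivation, the on-disk layout, and the names cache_key and classify_cached are assumptions for illustration; the shipped cache may hash different inputs.

```python
import hashlib
import json
from pathlib import Path


def cache_key(parent_src, child_src, judge_model, rubric_version):
    """Hash everything that determines the judge's answer (assumed inputs)."""
    payload = json.dumps(
        {"parent": parent_src, "child": child_src,
         "model": judge_model, "rubric": rubric_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify_cached(cache_dir, parent_src, child_src, judge,
                    judge_model="judge-model", rubric_version="v1"):
    """Return the cached label if present, otherwise call `judge` and store it."""
    key = cache_key(parent_src, child_src, judge_model, rubric_version)
    path = Path(cache_dir) / key[:2] / f"{key}.json"
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    label = judge(parent_src, child_src)  # the expensive LLM call
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(label), encoding="utf-8")
    return label
```

Keying on content rather than on run or program IDs is what lets two runs that share the same parent and child code reuse one judgement.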

4. Agentic Bayesian-optimisation tuning

uv run python -m evo_replay.agentic_tuning.run_bo \
    --run-dir <run_dir> --program-id best \
    --evaluator <path_to_evaluator.py> \
    --calls 24 --initial-points 8 \
    --propose-model deepseek/deepseek-reasoner \
    --api-base "$EVO_REPLAY_API_BASE"

Pipeline: load a program from a run, ask an LLM to propose tunable hparams + intervals, rewrite the source as PARAMS = {...} + literal substitutions, then run skopt.gp_minimize against the evaluator.
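The rewrite-then-search steps might look roughly like the sketch below. The proposal format, both function names, and the convention that the evaluator returns a value to minimise are assumptions; only skopt.gp_minimize and skopt.space.Real are real scikit-optimize calls.

```python
import re


def rewrite_with_params(source, proposals):
    """Replace proposed literals with PARAMS[...] lookups and prepend PARAMS.

    `proposals` maps a knob name to (literal_text, low, high); each literal
    is assumed to occur exactly once in `source` (hypothetical format).
    """
    defaults = {}
    for name, (literal, _low, _high) in proposals.items():
        pattern = r"(?<![\w.])" + re.escape(literal) + r"(?![\w.])"
        source, count = re.subn(pattern, f"PARAMS[{name!r}]", source)
        if count != 1:
            raise ValueError(f"{name}: expected one occurrence of {literal}, found {count}")
        defaults[name] = float(literal)
    return f"PARAMS = {defaults!r}\n" + source


def tune(template, proposals, evaluate, calls=24, initial_points=8):
    """Minimise evaluate(source) over the proposed intervals."""
    from skopt import gp_minimize  # requires scikit-optimize
    from skopt.space import Real

    names = list(proposals)
    space = [Real(proposals[n][1], proposals[n][2], name=n) for n in names]
    body = template.split("\n", 1)[1]  # drop the default PARAMS line

    def objective(values):
        params = {n: float(v) for n, v in zip(names, values)}
        # Negate inside `evaluate` if the underlying score should be maximised.
        return float(evaluate(f"PARAMS = {params!r}\n" + body))

    result = gp_minimize(objective, space, n_calls=calls,
                         n_initial_points=initial_points, random_state=0)
    return dict(zip(names, result.x)), result.fun
```

For example, `rewrite_with_params("lr = 0.1\n", {"lr": ("0.1", 0.001, 1.0)})` yields a program whose first line is `PARAMS = {'lr': 0.1}` and whose assignment reads `lr = PARAMS['lr']`, so the search only has to regenerate the first line for each trial.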

Aggregate ceilings across an experiment dir:

uv run python -m evo_replay.agentic_tuning.aggregate_bo <experiment_dir>

5. Breakthrough replay (needs the search framework)

uv run python -m evo_replay.breakthrough_replay.run_replay <run_dir> \
    --top-events 3 \
    --models "model-a,model-b" \
    --prompts "exact,strict_diff,no_history,no_other_context" \
    --repeats 1 --attempts 3
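For readers unfamiliar with the notion of a best-so-far event, the sketch below extracts such events from a chronological score list and keeps the largest ones. Ranking by improvement size is an assumption made for this example; the source does not say how --top-events orders events, and the function name is hypothetical.

```python
def breakthrough_events(scores, top=3, maximise=True):
    """Return the `top` best-so-far updates with the largest improvement.

    `scores` is a list of (iteration, score) pairs in chronological order.
    Each event is (iteration, previous_best, new_best).
    """
    events = []
    best = None
    for iteration, score in scores:
        improved = best is None or (score > best if maximise else score < best)
        if improved:
            if best is not None:  # the first score is the baseline, not an update
                events.append((iteration, best, score))
            best = score
    # Assumed ranking: largest absolute gain first.
    events.sort(key=lambda e: abs(e[2] - e[1]), reverse=True)
    return events[:top]
```

A replay would then re-issue the prompt recorded at each selected iteration under a different model or context strategy and compare the resulting scores.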

Module Layout

| Folder | Purpose |
| --- | --- |
| `core/` | Shared utilities: program loading, lineage walks, literal extractors |
| `static/` | LOC, hyperparameter counts, best-program lineage depth, paper figures |
| `cycling/` | Line-level cycling detection (raw + structural-only) + plots |
| `edit_taxonomy/` | Nine-category LLM-judge classifier + gold set + agreement tooling |
| `agentic_tuning/` | LLM-proposed Bayesian-optimisation tuning of hyperparameters |
| `breakthrough_replay/` | Replay best-so-far events under different models / prompts |

Tests

uv sync --extra dev
uv run pytest -v

Most of the suite covers the rubric, judge, gold set, agreement, and BO rewriter without network or filesystem fixtures. The end-to-end smoke tests, which run the static and cycling pipelines against a real run directory, are gated on EVO_REPLAY_TEST_RUN_DIR:

EVO_REPLAY_TEST_RUN_DIR=/path/to/some/run_dir uv run pytest -v

Dataset

The companion dataset of evolutionary code-search traces is published on the Hugging Face Hub as ZIB-IOL/EvoTrace. It contains the runs analysed in the paper across multiple search backends, benchmark domains, and model configurations. See the dataset card for layout and licence details.

Citation

If you use EvoReplay or the EvoTrace dataset in your research, please cite:

@misc{pelleriti2026evolutionarycodingagentsevolve,
      title={What Do Evolutionary Coding Agents Evolve?}, 
      author={Nico Pelleriti and Sree Harsha Nelaturu and Zhanke Zhou and Zongze Li and Max Zimmer and Bo Han and Sebastian Pokutta},
      year={2026},
      eprint={2605.20086},
      archivePrefix={arXiv},
      primaryClass={cs.NE},
      url={https://arxiv.org/abs/2605.20086}, 
}

License

This project is licensed under the Apache License 2.0 — see the LICENSE file for details.
