Autonomous, explainable AI R&D for model architecture evolution.
Evolyth turns a machine-learning research problem into an evolutionary loop: propose a mutation, run the experiment, review the result, preserve the evidence, and queue the next best move.
Quickstart · Why · How it works · RP contract · Cloud GPU · Dashboard
Modern AI coding agents can write surprisingly good experimental code. The missing piece is not code generation; it is scientific discipline.
A model-improvement loop becomes useful only when each step is:
- bounded: the agent can change the experiment, but only inside an explicit research-problem boundary;
- measured: every candidate is executed and reduced to comparable metrics;
- auditable: the exact model, context, stdout/stderr, metrics, review, and next hypotheses are preserved;
- resumable: failures do not erase progress, and the next mutation queue remains durable;
- portable: the same loop can run on a laptop for smoke tests or on a GPU-backed cloud runner for real training.
Evolyth is a compact implementation of that idea. It is designed for model architecture research, but the architecture is intentionally generic: any project that can expose a small filesystem contract can be evolved.
Evolyth treats AI-assisted research as an orchestrated evolution graph/tree, not a chat session.
A human writes a Research Problem (RP): the goal, the training/evaluation script, and the mutable model file. Evolyth then creates isolated workspaces for each run, asks a mutation worker to make one bounded change, executes the result, extracts metrics, asks a reviewer to interpret what happened, stores the full artifact trail, and uses that evidence to choose the next step.
The key design decision is that the AI agent is not the system of record. The arena is. Agents are replaceable workers. The arena keeps the history, leaderboard, Pareto front, queue, lineage, and artifacts.
That makes Evolyth closer to a small autonomous lab than to a one-shot code generator.
For demonstration we have created a test project, tiny-cifar.
Diagram 1: Evolyth separates the research problem, AI workers, executors, artifact store, queue, API, and dashboard.
The architecture has four layers:
- Research Problem boundary — the RP folder defines what the system is allowed to see and mutate.
- Orchestration core — parent selection, workspace creation, mutation, execution, extraction, review, queueing, and registration.
- Execution backends — local subprocess execution for quick iteration, and Cloud Run GPU execution for real experiments.
- Observability surfaces — CLI, JSONL artifacts, leaderboard, Pareto front, API, and live dashboard.
This separation is the reason the loop stays debuggable. A failed reviewer does not invalidate a completed training run. A failed training run still produces artifacts. A noisy agent can be replaced without changing the executor or store.
Diagram 2: Pareto frontier.
At a high level, one evolve step is:
select parent or queued mutation
→ create isolated workspace
→ build compact context packet
→ ask mutation agent for one bounded edit
→ validate allowed file changes
→ execute the research problem
→ extract metrics into a run record
→ review parent vs child
→ snapshot artifacts and register the run
→ enqueue recommended next mutations
The loop is deliberately conservative, and that restraint is what keeps it stable. Evolyth is not trying to let an agent rewrite an entire repository. It is trying to produce many small, comparable, evidence-backed experiments.
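Sketched as Python, one step looks roughly like this (every name here is an illustrative stand-in, not Evolyth's actual API):

```python
# Illustrative sketch of one evolve step. The arena, agent, executor, and
# reviewer interfaces are hypothetical stand-ins, not Evolyth's real API.
def evolve_step(arena, rp, agent, executor, reviewer):
    parent = arena.select_parent() or arena.queue.pop()    # parent run or queued idea
    workspace = arena.make_workspace(rp, parent)           # isolated copy of the RP
    context = arena.build_context(rp, parent)              # compact context packet
    mutation = agent.mutate(context, workspace)            # one bounded edit
    workspace.validate_changes(allowed=[rp.mutable_file])  # reject out-of-bounds edits
    result = executor.run(workspace)                       # local or Cloud Run
    record = arena.extract_metrics(result)                 # normalized run record
    review = reviewer.review(parent, record, context)      # parent vs child verdict
    arena.register(record, mutation, review)               # artifacts + lineage
    arena.queue.extend(review.recommended_next_mutations)  # durable next ideas
    return record
```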
Evolyth's primary input is a Research Problem folder.
Minimum RP structure:
```
my_rp/
  goal_prompt.md   # objective and constraints for the model research task
  train_eval.py    # executable training/evaluation entrypoint
  model.py         # mutable model file, edited by the mutation agent
```
The RP should write metrics and events under the run artifact directory. When the executor launches a run, it sets:
```
ACR_RUN_ID=<run_id>
ACR_ARTIFACT_DIR=<arena>/runs/<run_id>
PYTHONUNBUFFERED=1
```
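A minimal `train_eval.py` skeleton that honors this contract might look like the following; the metric names are illustrative, and only the environment variables and artifact file names come from the contract above:

```python
import json
import os
from pathlib import Path

# The executor provides these; fall back to a local directory for ad-hoc runs.
run_id = os.environ.get("ACR_RUN_ID", "local")
artifact_dir = Path(os.environ.get("ACR_ARTIFACT_DIR", "artifacts"))
artifact_dir.mkdir(parents=True, exist_ok=True)

# ... train and evaluate the model defined in model.py ...
metrics = {"val_accuracy": 0.417, "params": 48778, "latency_ms": 12.88}

# metrics.json is what the extractor reduces into a run record.
(artifact_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))

# events.jsonl can carry per-step training events for later inspection.
with (artifact_dir / "events.jsonl").open("a") as f:
    f.write(json.dumps({"run_id": run_id, "event": "done"}) + "\n")
```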
A typical run directory looks like this:
```
.arena/runs/run_000001/
  model.py
  goal_prompt.md
  metrics.json
  events.jsonl
  run_summary.md
  stdout.txt
  stderr.txt
  manifest.json
  context.md
  mutation.json
  ds_review.json
```
This artifact-first contract is what lets Evolyth recover from partial failures and inspect every experiment after the fact.
Evolyth has been exercised on the tiny-cifar research problem as a compact proof of the full autonomous loop: propose a bounded architecture mutation, execute it, extract comparable metrics, review the evidence, and queue the next move.
This is not presented as a full CIFAR-10 benchmark claim. It is an arena evidence pack: 16 registered runs with lineage, metrics, mutation summaries, reviewer observations, and next beliefs preserved as JSONL artifacts.
Across the recorded tiny-cifar run history, Evolyth moved from the original tiny depthwise-residual CNN baseline to a clean 6-layer ConvMixer-style incumbent. The best-score candidate improved the multi-objective score by 20.3%, improved validation accuracy by 4.2 percentage points, reduced parameter count by 47.1%, and reduced serialized model size by 44.2% versus the baseline.
| Role | Run | Score | Val acc | Params | Model KB | Latency ms | Train s |
|---|---|---|---|---|---|---|---|
| Baseline | run_000001 | 0.2391 | 37.5% | 92,194 | 398.2 | 11.69 | 18.9 |
| Best score | run_000014 | 0.2876 | 41.7% | 48,778 | 222.2 | 12.88 | 51.1 |
| Lowest latency | run_000003 | 0.2534 | 37.8% | 31,978 | 154.9 | 10.06 | 34.4 |
| Latency-optimised branch | run_000015 | 0.2827 | 41.2% | 48,778 | 222.9 | 12.20 | 33.0 |
- The winning branch was not bigger. The best-score model used 48,778 parameters versus 92,194 in the baseline.
- The strongest repeated incumbent was a clean ConvMixer variant. Runs `run_000008`, `run_000012`, and `run_000014` all recovered the same 41.7% validation accuracy pattern after reverting failed branches.
- Negative evidence was preserved rather than hidden. Attention gates, extra depth, wider kernels, DropPath, and lower dropout all regressed under the short training budget.
- Pareto tracking matters. The top score is not the only useful candidate; the low-latency and latency-optimised branches remain valuable for deployment-oriented follow-up.
| Front | Runs |
|---|---|
| Accuracy/params | run_000003, run_000008, run_000012, run_000014 |
| Accuracy/params/latency | run_000003, run_000016, run_000015, run_000014 |
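For reference, fronts like these can be computed with a plain dominance check. The sketch below orients every objective so that higher is better; the accuracy and latency values come from the tables above, except the latency of `run_000002`, which is made up for the example:

```python
def dominates(a, b, objectives):
    """True if run `a` is at least as good as `b` on every objective
    and strictly better on at least one (higher is better)."""
    return (all(a[o] >= b[o] for o in objectives)
            and any(a[o] > b[o] for o in objectives))

def pareto_front(runs, objectives):
    # Keep runs that no other run dominates.
    return [r for r in runs
            if not any(dominates(other, r, objectives)
                       for other in runs if other is not r)]

# Accuracy is maximized; latency is negated so that higher is better.
runs = [
    {"id": "run_000003", "val_acc": 0.378, "neg_latency": -10.06},
    {"id": "run_000014", "val_acc": 0.417, "neg_latency": -12.88},
    {"id": "run_000002", "val_acc": 0.358, "neg_latency": -12.00},  # latency illustrative
]
print([r["id"] for r in pareto_front(runs, ["val_acc", "neg_latency"])])
# -> ['run_000003', 'run_000014']; run_000002 is dominated by run_000003.
```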
| Rejected hypothesis | Run | Evidence |
|---|---|---|
| SE in original DWBlocks | run_000002 | -1.7 pp accuracy, -7.9% score vs parent |
| 8-layer ConvMixer depth increase | run_000005 | -1.4 pp accuracy, -6.5% score vs parent |
| SE in ConvMixer layers | run_000013 | -2.9 pp accuracy vs parent |
| DropPath regularization | run_000011 | -2.3 pp accuracy vs parent |
| Lower dropout 0.2→0.1 | run_000016 | -2.9 pp accuracy vs parent |
These failures make the result more credible: Evolyth is not just generating code; it is maintaining a falsifiable experiment trail. The arena records which ideas did not work under the current budget and why they should be avoided or revisited under different conditions.
The tiny-cifar run demonstrates the system-level promise of Evolyth: it can preserve an auditable research loop, recover from poor mutations, identify efficient candidates, and turn model-search history into reusable scientific evidence. That is the practical difference between an AI coding agent that writes one model and an autonomous R&D arena that improves, remembers, and explains a sequence of experiments.
Run a multi-step architecture search loop where a coding agent proposes changes, the executor tests them, and a reviewer decides whether to continue, branch, or abandon an idea.
Each run is preserved under .arena/runs/<run_id>/ with the model snapshot, metrics, events, stdout/stderr, manifest, mutation metadata, reviewer notes, and context.
Use Claude Code, a custom external command, or a no-op agent. The mutation and review interfaces are JSON contracts, so you can plug in other LLMs or deterministic tools.
Start with local smoke tests. Switch to Google Cloud Run GPU jobs with the same run and evolve commands when experiments need acceleration.
Use the NiceGUI dashboard to watch the arena evolve, or query the small API for leaderboards, Pareto front, queue state, lineage, and search.
The full setup requires a GCP account; details are here
```bash
git clone https://github.com/akaliutau/evolyth
cd evolyth
conda create -n evolyth python=3.12 -y
conda activate evolyth
pip install -r requirements.txt
curl -fsSL https://claude.ai/install.sh | bash
```
Then configure the API key with a `.env` file.
You can also run the loop without Claude using the no-op or heuristic modes shown below.
```bash
python cli.py --arena .arena init --rp examples/tiny_rp

python cli.py --arena .arena run \
  --rp examples/tiny_rp \
  --smoke \
  --mutation-type baseline \
  --mutation-summary "initial smoke baseline"

python cli.py --arena .arena leaderboard
python cli.py --arena .arena pareto
python cli.py --arena .arena context --rp examples/tiny_rp
```
Smoke-test the full loop without Claude:
```bash
python cli.py --arena .arena evolve \
  --rp examples/tiny_rp \
  --steps 2 \
  --agent noop \
  --reviewer heuristic \
  --smoke
```
Run Claude Code as the mutation worker:
```bash
python cli.py --arena .arena evolve \
  --rp examples/tiny_rp \
  --steps 5 \
  --agent claude-code \
  --reviewer heuristic \
  --smoke
```
Run Claude Code for both mutation and review:
```bash
python cli.py --arena .arena evolve \
  --rp examples/tiny_rp \
  --steps 5 \
  --agent claude-code \
  --reviewer claude-code \
  --smoke
```
For full training, remove `--smoke` and pass RP-specific arguments after `--`:
```bash
python cli.py --arena .arena evolve \
  --rp examples/tiny_rp \
  --steps 10 \
  --agent claude-code \
  --reviewer heuristic \
  -- --epochs 2 --max-steps 200
```
Evolyth chooses a parent from the run store or consumes a queued mutation idea. The selector balances score, Pareto membership, exploration pressure, run status, and generation depth.
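A selector in that spirit could combine those signals as a weighted priority; the weights and field names below are illustrative, not Evolyth's actual heuristic:

```python
import random

def parent_priority(run, on_pareto_front, best_score):
    """Hypothetical selection score mixing the signals named above."""
    score_term = run["score"] / best_score if best_score else 0.0
    pareto_bonus = 0.2 if on_pareto_front else 0.0
    status_penalty = 0.5 if run["status"] != "success" else 0.0
    depth_penalty = 0.02 * run["generation"]   # prefer shallower branches slightly
    exploration = 0.1 * random.random()        # keep some exploration pressure
    return score_term + pareto_bonus + exploration - status_penalty - depth_penalty
```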
Each candidate gets a fresh workspace copied from the RP and, when relevant, the parent model. This avoids cross-run contamination.
The system builds a compact context packet from the research goal, current model, parent metrics, leaderboard, queue, and prior beliefs. The mutation worker receives enough context to make a targeted edit without seeing unrelated repository state.
The mutation agent is asked for one bounded change. In the default safety mode, only the configured mutable file is allowed to change.
The candidate is executed locally or in Cloud Run. The executor captures stdout, stderr, metrics, and artifacts.
Metrics are converted into a normalized run record: status, score, accuracy, loss, parameters, model size, latency, mutation summary, parent id, and generation.
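Such a record is easy to picture as a small dataclass. The fields mirror the list above, but this is a sketch, not the project's actual RunRecord type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunRecord:
    run_id: str
    status: str                    # e.g. "success", "failed", "timeout"
    score: Optional[float]         # multi-objective score; None when the run failed
    val_accuracy: Optional[float]
    loss: Optional[float]
    params: Optional[int]
    model_bytes: Optional[int]
    latency_ms: Optional[float]
    mutation_summary: str
    parent_id: Optional[str]       # lineage edge; None for the baseline
    generation: int
```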
A heuristic or LLM reviewer compares the child to its parent and writes a data-scientist-style review: whether the mutation was valid, whether it improved the branch, what was learned, and which mutations should be tried next.
The run is registered in the arena, lineage is updated, the leaderboard and Pareto front become queryable, and new mutation ideas are added to the durable queue.
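A durable queue of that kind can be as small as an append-only JSONL log that is replayed on startup; the on-disk format in this sketch is hypothetical:

```python
import json
from pathlib import Path

class DurableQueue:
    """Append-only JSONL queue: state survives a crash because it is
    reconstructed from the log on startup. Items are dicts with "id"
    and an optional "priority"."""

    def __init__(self, path):
        self.path = Path(path)
        self.items = []
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                event = json.loads(line)
                if event["op"] == "push":
                    self.items.append(event["item"])
                else:  # "pop"
                    self.items = [i for i in self.items if i["id"] != event["id"]]

    def _log(self, event):
        with self.path.open("a") as f:
            f.write(json.dumps(event) + "\n")

    def push(self, item):
        self._log({"op": "push", "item": item})
        self.items.append(item)

    def pop(self):
        # Highest-priority idea first.
        item = max(self.items, key=lambda i: i.get("priority", 0))
        self._log({"op": "pop", "id": item["id"]})
        self.items.remove(item)
        return item
```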
Start the dashboard:
```bash
python cli.py --arena .arena demo-ui --port 8080
```
The dashboard is intentionally read-only. It polls the arena files and visualizes the experiment graph without becoming another source of truth.
It shows:
- branch constellation with parent-child edges;
- latest run pulse, best-run highlight, and Pareto orbit;
- best score, run counts, queue state, and failure state;
- score skyline over time;
- selected run metrics such as accuracy, loss, params, bytes, latency, train time, dry-run status, AI cost, and evolution time;
- reviewer cockpit with recommendation, confidence, observation, next belief, and suggested mutations;
- leaderboard, queue, and agent/executor/reviewer event stream.
Evolyth can execute the same RP through Google Cloud Run jobs by using the cloud-run executor.
```bash
python cli.py --arena .arena evolve \
  --rp /path/to/tiny-cifar \
  --steps 5 \
  --agent claude-code \
  --reviewer heuristic \
  --executor cloud-run \
  --cloud-spec /path/to/tiny-cifar/cloud_runner.yaml \
  -- --dataset cifar10 --epochs 2 --max-steps 200
```
The Cloud Run executor calls gcp_cloud_runner/application_cloud_runner.py, passes the Evolyth run id through to the cloud job, appends RP arguments to the configured runtime command, and syncs outputs back into .arena/runs/<run_id>/.
The cloud runner separates image deployment from per-run source execution.
- The reusable runner image is built only when the base image or Python dependencies change.
- Each experiment packs selected source files from the RP, uploads them to GCS, and starts a fresh Cloud Run job.
- The job downloads the source bundle and optional dataset, runs the configured command, uploads artifacts, and syncs them locally.
- The source bundle and job can be cleaned up automatically after successful runs.
This keeps experiment launches fast while preserving a clean, reproducible execution boundary.
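The include/exclude semantics of the files section in the spec below can be approximated with standard-library globbing; this is a sketch of the packing step, not the runner's actual implementation:

```python
import glob
import tarfile
from pathlib import Path

def pack_sources(rp_dir, include, exclude, out="source.tar.gz"):
    """Collect files matching the include globs, drop exclude matches,
    and write a tarball ready for upload to the configured GCS prefix."""
    rp = Path(rp_dir)

    def expand(patterns):
        return {Path(p) for pat in patterns
                for p in glob.glob(str(rp / pat), recursive=True)}

    files = {p for p in expand(include) - expand(exclude) if p.is_file()}
    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(files):
            tar.add(path, arcname=path.relative_to(rp))
    return out

# Mirrors the files section of the spec below (illustrative usage):
pack_sources(".", include=["train.py", "src/**", "configs/**", "requirements.txt"],
             exclude=["data/**", "artifacts/**", "checkpoints/**", "*.pt", "*.pth"])
```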
```yaml
name: torch-train
project_id: ${PROJECT_ID}
region: ${REGION}
bucket: ${BUCKET_NAME}
artifact_repo: ${AR_REPO}
service_account: ${SA_EMAIL}

image:
  name: acr-torch-runner
  tag: latest
  build:
    base_image: pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
    requirements: requirements.txt

files:
  include: ["train.py", "src/**", "configs/**", "requirements.txt"]
  required: ["train.py"]
  exclude: ["data/**", "artifacts/**", "checkpoints/**", "*.pt", "*.pth"]
  hashes: {}

source:
  gcs_prefix: gs://${BUCKET_NAME}/acr-sources/torch-train

runtime:
  command: ["python", "train.py"]
  workdir: /workspace/app

dataset:
  uri: gs://${BUCKET_NAME}/datasets/example-dataset/data.tar.gz
  container_dir: /workspace/dataset
  mode: auto
  unpack: auto

cloud_run:
  gpu: 1
  gpu_type: nvidia-l4
  cpu: 4
  memory: 16Gi
  task_timeout: 3600s

artifacts:
  container_dir: /workspace/artifacts
  gcs_prefix: gs://${BUCKET_NAME}/training-runs/torch-train
```
Evolyth can use any mutation or review tool that speaks JSON over stdin/stdout.
Mutation request (stdin):
```json
{
  "rp_path": ".../.arena/workspaces/run_000001",
  "mutable_file": "model.py",
  "context": "compact evolution context",
  "current_model": "full model.py contents"
}
```
The worker may edit the workspace directly or return a complete replacement for model.py.

Mutation response (stdout):
```json
{
  "mutation_type": "safe_refinement",
  "mutation_summary": "one sentence",
  "hypothesis": "why this should help",
  "changed_files": ["model.py"],
  "model_py": "full replacement contents, optional"
}
```
Review request (stdin):
```json
{
  "parent": {},
  "child": {},
  "context": "..."
}
```
Review response (stdout):
```json
{
  "valid": true,
  "is_improvement": true,
  "branch_recommendation": "continue",
  "observation": "what happened",
  "next_belief": "what this suggests",
  "recommended_next_mutations": [
    {
      "mutation_type": "safe_refinement",
      "description": "bounded next idea",
      "expected_benefit": "why",
      "priority": 0.7
    }
  ]
}
```
This contract makes the AI layer replaceable. Claude Code is one adapter, not a hard dependency of the architecture.
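Because the contract is plain JSON over stdin/stdout, a custom mutation worker can be a very short script. This sketch returns a deterministic provenance-comment mutation and assumes only the request/response fields shown above:

```python
#!/usr/bin/env python3
"""Minimal external mutation worker: reads the request JSON on stdin,
writes the response JSON to stdout. Deterministic, for smoke-testing."""
import json
import sys

request = json.load(sys.stdin)
model_src = request["current_model"]

# A trivial bounded "mutation": annotate the file without changing behavior.
mutated = "# evolyth external worker touched this file\n" + model_src

json.dump({
    "mutation_type": "safe_refinement",
    "mutation_summary": "add a provenance comment (no-op smoke mutation)",
    "hypothesis": "exercises the external-command adapter end to end",
    "changed_files": [request["mutable_file"]],
    "model_py": mutated,
}, sys.stdout)
```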
```bash
# Validate an RP and initialize arena state
python cli.py --arena .arena init --rp examples/tiny_rp

# Execute and register one run
python cli.py --arena .arena run --rp examples/tiny_rp --smoke

# Run autonomous evolution
python cli.py --arena .arena evolve --rp examples/tiny_rp --steps 5 --agent claude-code --reviewer heuristic

# List queued mutation ideas
python cli.py --arena .arena queue

# Show best runs
python cli.py --arena .arena leaderboard

# Show Pareto front
python cli.py --arena .arena pareto

# Print context packet for a parent
python cli.py --arena .arena context --rp examples/tiny_rp --parent-id run_000001

# Search prior runs
python cli.py --arena .arena search depthwise

# Serve API
python cli.py --arena .arena serve --port 8000

# Serve dashboard
python cli.py --arena .arena demo-ui --port 8080
```
Start the API:
```bash
python cli.py --arena .arena serve --port 8000
```
Endpoints:
```
GET  /leaderboard
GET  /pareto
GET  /queue
GET  /runs/{run_id}
GET  /runs/{run_id}/lineage
GET  /search?q=depthwise
POST /runs/register
```
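A quick way to explore these endpoints from Python, assuming the API is serving on localhost:8000 as above (the response schema is whatever the store returns, so the sketch just prints raw JSON):

```python
import json
from urllib.request import urlopen

base = "http://localhost:8000"

# Pull the current leaderboard, Pareto front, and a search result.
leaderboard = json.load(urlopen(f"{base}/leaderboard"))
pareto = json.load(urlopen(f"{base}/pareto"))
hits = json.load(urlopen(f"{base}/search?q=depthwise"))

# Print a slice of the raw JSON to inspect the shape of the responses.
print(json.dumps(leaderboard, indent=2)[:500])
```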
```
cli.py                   command-line entry point
core/rp.py               Research Problem contract loader
core/workspace.py        isolated workspaces + single-file validation
core/agent.py            MutationAgent, Claude Code, external command, no-op adapters
core/orchestrator.py     select → mutate → execute → review → queue loop
core/executor.py         Executor interface, LocalExecutor, CloudRunExecutor
core/store.py            run registry, leaderboard, lineage, search storage
core/run_store.py        filesystem artifact snapshots and manifests
core/extractor.py        metrics.json → RunRecord
core/pareto.py           Pareto front utilities
core/context_builder.py  compact context packet builder
core/selection.py        parent-priority heuristic
core/queue.py            durable priority queue for mutation ideas
core/review.py           reviewer interface and adapters
core/api.py              FastAPI wrapper
core/ui_demo.py          read-only NiceGUI dashboard
gcp_cloud_runner/        reusable Cloud Run GPU job runner
examples/                sample RPs and agent integrations
```
LLM context is transient. The arena is persistent. Every run is preserved as data and artifacts so the system can resume, inspect, and branch.
Mutation and review agents can be Claude Code, external commands, or deterministic heuristics. Their outputs are recorded, but the orchestrator owns the lifecycle.
A good autonomous research loop needs comparable deltas. Evolyth encourages one bounded mutation per run and validates allowed file changes by default.
The training/evaluation script is the objective source of truth. The reviewer can interpret results, but metrics determine the run record.
A failed run, timeout, parse error, or inconclusive review should leave artifacts behind. In autonomous R&D, reliability comes from preserving partial evidence, not pretending every step succeeds.
Evolyth is a good fit when you have:
- a model architecture or training pipeline with a clear evaluation script;
- a score function that can compare candidates (see the sketch after this list);
- a narrow file or module that can be safely mutated;
- enough experiment budget to test many small variants;
- a need to inspect why one branch improved or regressed.
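As an illustration of the score-function point, a multi-objective score can fold competing metrics into one scalar. The weights and normalizers below are arbitrary, not Evolyth's actual scoring policy:

```python
def score(val_accuracy, params, latency_ms,
          acc_weight=1.0, params_weight=0.15, latency_weight=0.1):
    """Hypothetical scalar score: reward accuracy, penalize size and latency.
    The normalizers (1e5 params, 20 ms) are arbitrary reference scales."""
    return (acc_weight * val_accuracy
            - params_weight * (params / 1e5)
            - latency_weight * (latency_ms / 20.0))

# Illustrative comparison using the tiny-cifar table values: the smaller,
# more accurate candidate scores higher than the baseline under this policy.
print(score(0.375, 92_194, 11.69))   # baseline
print(score(0.417, 48_778, 12.88))   # best-score candidate
```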
Evolyth is useful for generating and triaging candidates; promising results can then be rerun under stricter experimental controls.
- richer cost accounting for coding, review, and cloud compute;
- first-class experiment budgets and stopping criteria;
- stronger lineage visualization and branch comparison;
- pluggable multi-objective scoring policies;
- replay mode for deterministic audits;
- distributed queue workers;
- native support for more cloud/GPU backends;
- benchmark packs for common model-search tasks.
Contributions are welcome, especially around:
- new executor backends;
- additional reviewer strategies;
- stronger safety validation for mutable files;
- better dashboard views;
- example Research Problems;
- reproducibility and experiment-analysis tooling.
Before opening a large PR, start with an issue describing the research workflow or reliability problem you want to improve.
Evolyth is open-source software distributed under the MIT License.
See the LICENSE file for more details.
Built exclusively for the Claude Code hackathon organized by Anthropic.
Evolyth - The Engine of Discovery.




