ALPHA — DO NOT USE. Currently testing the loop types against real problems. Example executions will be added for each of the 12 types; autoeval moves out of alpha once all are validated.
A Claude Code skill that transforms an optimization problem into a fully scaffolded autonomous experiment loop. Describe what you want to optimize, and autoeval builds the eval suite, seed implementation, monitoring dashboard, and loop runner — everything you need to kick off an overnight experiment.
/plugin marketplace add abdielou/autoeval
/plugin install autoeval@abdielou-autoeval
/autoeval optimize my bin packing heuristic for minimum waste
/autoeval improve my classifier's accuracy on the test suite
/autoeval --auto tune my prompt to maximize extraction accuracy
autoeval walks you through defining the problem, designing the scoring function, and building the eval suite. Then it scaffolds the loop and tells you how to run it.
The --auto flag makes the later phases run autonomously after the interactive metric design is locked in.

| File | What it does |
|---|---|
| `program.md` | Meta-agent directive — goal, edit surface, iteration protocol, constraints |
| `evals/` | Test cases, scoring functions, eval runner |
| `run-loop.py` | Launches claude sessions with auto-restart and timeout |
| `monitor.py` | Live dashboard at localhost:8080 |
| `steering.md` | Guide the agent mid-run without stopping the loop |
| Seed harness | Minimal baseline implementation with marked edit surface |
Terminal 1 — start the loop:

    cd <output-dir>
    python run-loop.py

Sessions auto-restart every N iterations with fresh context. Ctrl+C once kills the current session; Ctrl+C again within 3 seconds stops the loop entirely.
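The exact script is generated per project, but the supervisor's shape is roughly this. A minimal sketch, assuming the claude CLI is launched in non-interactive mode with a per-session timeout (the shipped run-loop.py will differ in its details and prompt):

```python
import subprocess
import sys
import time

SESSION_TIMEOUT = 30 * 60   # per-session hard timeout in seconds (placeholder value)
last_interrupt = 0.0

while True:
    try:
        # Each session starts with fresh context; program.md carries the standing directive.
        subprocess.run(
            ["claude", "-p", "Read program.md and run the next batch of iterations."],
            timeout=SESSION_TIMEOUT,
        )
    except subprocess.TimeoutExpired:
        print("Session hit the timeout; restarting with fresh context.")
    except KeyboardInterrupt:
        # First Ctrl+C ends the current session; a second within 3 seconds stops the loop.
        now = time.time()
        if now - last_interrupt < 3:
            sys.exit("Loop stopped.")
        last_interrupt = now
        print("Session killed. Ctrl+C again within 3 seconds to stop the loop.")
```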
Terminal 2 — monitor progress:

    cd <output-dir>
    python monitor.py

Live chart showing score over iterations, component scores (toggleable), failed experiments as red X markers, hypothesis on hover, and summary stats. Auto-refreshes every 30 seconds.
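The generated dashboard is richer than this, but the mechanism is easy to picture. A minimal sketch, assuming a hypothetical history.jsonl of per-iteration scores (the real file layout and chart are different):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<html><head><meta http-equiv="refresh" content="30"></head>
<body><h1>autoeval progress</h1><pre>{rows}</pre></body></html>"""

def load_history():
    # Hypothetical history.jsonl: one JSON object per iteration with "iteration" and "score".
    try:
        with open("history.jsonl") as f:
            return [json.loads(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        rows = "\n".join(f"{r['iteration']:>4}  {r['score']:.3f}" for r in load_history())
        body = PAGE.format(rows=rows or "no iterations yet").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8080), Handler).serve_forever()
```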
The loop uses two models to balance cost and quality:
- Runner (sonnet) — handles the iteration cycle: editing code, running evals, committing
- Deep reasoning (opus) — called one-shot for hypothesis generation and exploration rounds
Opus-level creativity for the "think hard" steps at ~10% of the cost.
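One way to picture the split: the runner session does the routine editing and eval runs itself, and shells out to the stronger model only at the hypothesis step. A minimal sketch, assuming the claude CLI's -p (print) and --model flags; the invocation autoeval actually scaffolds may differ:

```python
import subprocess

def generate_hypothesis(failure_summaries: str) -> str:
    # One-shot call to the deep-reasoning model; no session state is kept.
    prompt = (
        "You are the deep-reasoning step of an optimization loop.\n"
        "Past failed experiments:\n" + failure_summaries + "\n"
        "Propose one new hypothesis that avoids repeating these failures."
    )
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", "opus"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```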
Guide the loop agent mid-run without stopping it. Append entries to steering.md tagged with the commit hash they apply after:

    ## after 7f5ad6f
    Stop trying to optimize the greedy heuristic. The biggest gain is switching
    to a scoring-based approach. Look at the best-fit-decreasing literature.

    ## after a1b2c3d
    The classifier is plateauing at 96.2%. The bottleneck is feature engineering,
    not model choice. Try polynomial features on columns 3-7.

The agent reads this before each iteration and follows it. Old entries are automatically irrelevant once HEAD moves past them.
For permanent changes, edit program.md — takes effect at the next session restart.
Most optimization loops silently discard failed attempts. autoeval preserves them.
When an experiment fails, before reverting the code, the agent saves the full diff, hypothesis, score breakdown, and failure analysis to failed_experiments/. The monitoring dashboard shows a Failed Experiments table with the last 20 failures. Before each new hypothesis, the deep reasoning model sees past failures and avoids repeating them.
Failed iterations become institutional knowledge the loop builds up over time.
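The kind of record this implies is straightforward to picture. An illustrative sketch follows; the field names and file naming are assumptions, not the exact on-disk format autoeval writes:

```python
import json
import pathlib
import subprocess
import time

def record_failure(hypothesis: str, score_breakdown: dict, analysis: str) -> None:
    # Capture the working-tree diff before the revert throws it away.
    diff = subprocess.run(["git", "diff", "HEAD"], capture_output=True, text=True).stdout
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hypothesis": hypothesis,
        "score_breakdown": score_breakdown,
        "failure_analysis": analysis,
        "diff": diff,
    }
    out = pathlib.Path("failed_experiments")
    out.mkdir(exist_ok=True)
    (out / f"{int(time.time())}.json").write_text(json.dumps(record, indent=2))
```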
autoeval classifies your problem against 12 optimization loop types:

| Training | Agent Harness | Generative Output | Algorithm Performance |
|---|---|---|---|
| Retrieval & Ranking | Pipeline Optimization | Simulation Calibration | Strategy & Decision |
| Adversarial | Data Curation | Control Systems | Interface Optimization |

Each type defines what the agent edits and how it's scored. See loop-types.md for details.
The eval suite goes through integrity validation before the loop starts:
- Self-referential detection — catches evals that "grade their own homework"
- Discriminating power — verifies the eval can tell good output from bad
- Ceiling reachability — confirms the theoretical best score is actually achievable
- User review — plain-language walkthrough of what's scored, what's not, and how the agent could game it
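As a concrete example of what the discriminating-power check above is testing, a check of roughly this shape (names and threshold are illustrative assumptions, not autoeval's API) passes only if the scoring function cleanly separates known-good from known-bad outputs:

```python
def has_discriminating_power(score, known_good_outputs, known_bad_outputs,
                             min_gap: float = 0.1) -> bool:
    # score() maps an output to a float; every known-good output
    # should clearly beat every known-bad one.
    worst_good = min(score(o) for o in known_good_outputs)
    best_bad = max(score(o) for o in known_bad_outputs)
    return worst_good - best_bad >= min_gap
```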
autoresearch | autoagent | AIDE | OpenEvolve | AI-Scientist
MIT