Evaluate the whole stack, not just the model. Retort applies statistical Design of Experiments (DoE) to measure how AI coding agents actually perform across the variables that decide a real project — programming language × model version × tooling — on the tasks you care about. Every run is scored for whether it provably implements the spec, plus how fast, how expensive, and how clean.
Why not just read a leaderboard? Sites like llm-stats.com compare many models across many benchmarks — but they hold the stack constant and ignore programming language, the surrounding tooling, and time taken. They can't tell you whether Opus 4.8 is worth 4× the cost of 4.6 in Rust, how reliably each model gets a Go MCP server completely right, or how long any of it takes. Retort answers exactly that: point it at your languages, models, and tasks (or your own codebase) and it finds the leading stack variant for your problem.
- Factorial / fractional-factorial designs over
language × model × tooling(and any factors you add), generated automatically — run the full grid or a quarter-fraction. - Isolated playpens — each run gets a fresh workspace; the agent (
claude -p) implements the task, then the code is built and tested in place. - Scoring that checks the spec, not just the vibes. Eight built-in scorers (code quality, test coverage, defect rate, maintainability, idiomaticity, token efficiency, …) plus a conformance gate:
- Mechanical gate — if the tests don't run, the run fails (no proof = no pass).
- Spec gate — a second-opinion LLM eval (the judge defaults to the latest Claude model, tracking new releases) checks the code against a pinned requirement checklist and records
requirement_coverage; a run passes only if it implements the whole spec. (Single-pass LLM grading proved too noisy — haiku swung 0.33↔1.0 on identical code — so the gate uses a fixed checklist + a stronger judge + a two-attempt "second opinion" to kill false failures.) The eval self-checks:reevaluatepreflights the judge and errors instead of silently grading nothing.
retort diagnose— re-tests every failed run's archive and classifies it TOOLING (a scorer false-failure thatrescorerecovers) vs GENUINE (a real model/spec failure), with the cause. So you never have to hand-investigate a failure.- Cross-experiment master database —
retort aggregaterolls every experiment into one tidymaster.db/master.csv. - ANOVA + effects, live
retort monitor(shows in-flight runs across parallel shards), resumable sharded runs,cost_limit_usd.
This repo is the result of running it: ten experiments across two tasks and eight languages (Go, Python, Clojure, Rust, Java, TypeScript, Erlang, Elixir), four Claude models (Sonnet, Opus 4.6 / 4.7 / 4.8) plus Opus-4.8 fast mode, the next-tier Claude Fable 5, a Gemini cross-agent scaffold, and a prompt / test-methodology study (BDD / TDD / ATDD).
Retort has a couple of environment gotchas (Python ≥ 3.11; OApackage is a C++ extension built with cmake). The fastest install is to let Claude Code handle them — point it at a directory and ask:
$ cd Documents/GitHub
$ claude
> clone and install https://github.com/adrianco/retort here
⏺ Done. Retort is cloned and installed, and all tests pass.
- Installed cmake (Homebrew) — needed to build OApackage (C++ extension, no wheel here).
- Created a virtualenv with Python 3.12 at retort/.venv.
- Ran pip install -e ".[dev,test]" — built retort + oapackage from source.
git clone https://github.com/adrianco/retort.git
cd retort
pip install -e ".[dev,test]" # Python deps + builds OApackage (needs cmake)
retort --help # CLI loads → deps OK| Also needed | Why |
|---|---|
| Python 3.11+, C/C++ toolchain + cmake | runtime + building OApackage |
claude CLI, authenticated |
the agent runner shells out to claude -p … |
| Per-language toolchains | the scorer builds, tests, and lints the generated code — see the table below |
bd (beads) CLI |
only if a factor uses tooling: beads |
gemini CLI / omp (oh-my-pi) CLI |
only to run non-Claude agents — Google Gemini, or local/other models via oh-my-pi; see Comparing coding agents |
.devcontainer/ provisions all of this for Codespaces / Dev Containers (authenticate claude once).
You only need the toolchains for the languages you actually list as language
factor levels in workspace.yaml. The scorer shells out to each language's
real build/test/lint tools, so they must be on PATH — if a tool is missing
the run fails its mechanical gate (tests can't run = no pass). Install exactly
what you use:
| Language | Tools the scorer runs | macOS (Homebrew) | Debian/Ubuntu |
|---|---|---|---|
| python | pytest, coverage, ruff |
(bundled via pip install -e ".[dev,test]") |
(bundled via the pip extras) |
| typescript | node ≥20 + npm (npx pulls jest/vitest, tsc, eslint per project) |
brew install node |
apt install nodejs npm (or NodeSource for ≥20) |
| go | go test -cover, go vet |
brew install go |
apt install golang-go |
| rust | cargo test, cargo clippy |
brew install rustup-init && rustup-init -y |
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh |
| then add the linter: | rustup component add clippy |
rustup component add clippy |
|
| java | mvn test, jacoco, mvn compile (JDK 17+) |
brew install openjdk maven |
apt install default-jdk maven |
| clojure | clojure -M:test and lein test, cloverage, clj-kondo |
brew install clojure/tools/clojure leiningen borkdude/brew/clj-kondo |
clojure CLI + lein via the official scripts; clj-kondo from its releases |
| erlang | rebar3 eunit and rebar3 ct, rebar3 compile |
brew install erlang rebar3 |
apt install erlang rebar3 |
| elixir | mix test, mix compile --all-warnings (pulls Erlang/OTP) |
brew install elixir |
apt install elixir |
⚠️ Clojure needs both the Clojure CLI (clojure/clj) and Leiningen (lein) — agents pick either adeps.ednor aproject.cljlayout, and the scorer runs whichever the generated project uses. Erlang needsrebar3(for both EUnit and Common Test suites); Elixir needsmix(ships with Elixir). Java/Clojure/Erlang/Elixir all need a JDK/BEAM onPATH. A missing one of these is the single most common "why did every run of language X fail?" — verify withlein test,rebar3 --version,mix --version, etc. before launching an experiment.
You don't hand-write workspace.yaml or do the factorial math. Open claude in the repo and describe the experiment; it designs the matrix, checks prerequisites, estimates cost, confirms the decisions that matter, and runs it — every experiment here was built this way:
> compare opus 4.6, 4.7 and 4.8 across six languages on the brazil-bench task
⏺ That's a 3 model × 6 language × 2 tooling = 36-cell factorial, 3 replicates.
Estimated real API spend + hours of wall-clock. Confirm a few choices:
· full factorial or a quarter-fraction? · keep beads tooling or drop it? · run now, or set up and stop?
Claude then writes the workspace + design, installs toolchains, runs the cells (resuming across usage-limit windows, retrying failures, flagging cost), and reports — watch live with retort monitor <experiment>. You can also drive the CLI directly (retort init/run/monitor/report/aggregate).
Task sources. A task is what the agent builds (task.yaml + optional validate.py). tasks/registry.yaml indexes tasks by name → a canonical source: bundled in this repo (bundled://) or hosted in a git/GitHub repo (github://). List them with retort tasks list, and reference one by bare name (--task brazil-bench) or explicit URI. See tasks/README.md to add your own.
📝 For the narrative walkthrough of these findings — the reliability-vs-cost story, fast mode, the BEAM languages, and the measurement bugs along the way — see the companion model blog (updated as new models arrive). The prompt blog covers the separate question of whether the prompt — specifically the prescribed test methodology (BDD / TDD / ATDD vs none) — moves reliability.
The headline metric is pass-proportion: with N replicates of a stack, the fraction whose runs fully implement the spec (requirement_coverage == 1.0, a gate pass). Read it as the probability that a single run of that stack comes out completely correct — 3/3 → 1.00, 2/3 → 0.66, 1/3 → 0.33. A single sub-1.0 run is a fail.
Aggregated per model per task (larger samples → robust):
| Model | Brazil MCP (hard) | REST-API (easy) | Speed¹ | Cost/run¹ |
|---|---|---|---|---|
| opus-4.6 | 0.47 | 0.59 | 309 s | $1.30 |
| sonnet | 0.50 | 0.63 | 440 s | $1.10 |
| opus-4.7 | 0.85 | 1.00 | 774 s | $4.92 |
| opus-4.8 | 1.00 | 1.00 | 1035 s | $5.54 |
| opus-4.8-fast² | 1.00 | 1.00 | 887 s | $8.72 |
| fable-5³ | 1.00 | 1.00 | 1039 s | $8.98 |
¹ Brazil task. Pass-proportion = fraction of that model's runs that fully implement the spec.
² Fast mode (/fast), 4 languages (clojure/go/python/rust). Cost is at fast mode's 2× per-token rate (announcement) — see Fast mode.
³ Claude Fable 5 (claude-fable-5), same 4 languages. A distinct model a tier above Opus 4.8, priced at the same $10/$50 per Mtok rate as fast mode (2× Opus 4.8's standard rate); the CLI prices it natively. See exp-10 results.
- Newer is more reliable — markedly so on hard tasks. Opus-4.8 produces a completely-correct result 100% of the time on both tasks; 4.7 is 85% / 100%. The cheaper models (4.6, Sonnet) get the hard task completely right only ~half the time — they're a coin-flip.
- You pay steeply for that reliability. On the hard task Opus-4.8 is ~3× slower and ~4× pricier than 4.6 / Sonnet.
- Opus-4.7 is the value-reliability sweet spot — near-4.8 reliability for less, and tied with 4.8 on the easy task, where paying for 4.8 buys nothing.
- Fast mode is the same reliability at the highest price. Opus-4.8 fast matches 4.8's 1.00/1.00 and shaves wall-clock, but its 2× per-token rate makes it one of the costliest rows here ($8.72/run on the hard task) — you're buying latency, not value.
- A tier above 4.8 buys no extra reliability either. Claude Fable 5 — a distinct model above Opus 4.8, at the same $10/$50 rate as fast mode — also lands at 1.00 / 1.00, matching 4.8 exactly. But where 4.8 is already perfect there is no headroom to buy: Fable 5 is the priciest and slowest option on the hard task ($8.98, 1039 s), with no measurable reliability gain. Paying up a tier is pure overhead until a task is hard enough that 4.8 itself drops below 1.00 — neither task here reaches that.
- On easy tasks, almost anything works, so the cheapest reliable model wins (often 4.7 or even 4.6).
- It's a reliability-vs-cost decision, and it's task-dependent — precisely what a leaderboard can't tell you.
Best (model, tooling) per language, ranked pass-proportion → test coverage → speed → cost → code quality. Pass shown as passes/replicates.
REST-API CRUD (n = 3 per cell — robust): most stacks reach full coverage, so the ranking is decided by the speed tiebreaker (and cost only after that).
| Language | Leading stack | Pass | TestCov | Speed | Cost |
|---|---|---|---|---|---|
| clojure | opus-4.7 / none | 3/3 | 1.00 | 188 s | $0.92 |
| go | opus-4.8 / beads | 3/3 | 0.71 | 161 s | $0.72 |
| java | opus-4.7 / none | 3/3 | 1.00 | 168 s | $0.83 |
| python | opus-4.7 / none | 3/3 | 1.00 | 84 s | $0.50 |
| rust | opus-4.8-fast / none | 3/3 | 1.00 | 135 s | $1.06 |
| typescript | opus-4.8 / none | 3/3 | 0.97 | 119 s | $0.47 |
| erlang | opus-4.8 / none | 3/3 | 1.00 | 345 s | $1.35 |
| elixir | opus-4.8 / none | 3/3 | 1.00 | 207 s | $0.85 |
⚠️ Fast mode and the speed-before-cost ordering. Because the ranking weights speed above cost, fast mode is the ranked winner for Rust — but only by an 8-second margin (135 s vs 143 s foropus-4.6/beads, the runner-up) at more than 2× the price ($1.06 vs $0.48, at fast mode's 2× rate). Fast mode is also the close runner-up for Clojure and Go: always a touch faster, always pricier. If you weight cost at all, prefer the non-fast pick — fast mode's speed edge on routine work rarely justifies double the bill. (See Fast mode.)
Brazil MCP (hard task; per-cell replication is thinner, so treat the model-level result above as the firmer guide): the only model that is reliable across every language here is opus-4.8 (1.00) — at the cost/speed premium shown. The cheaper models succeed on some languages and fail on others, which is the whole point of measuring per-language rather than trusting one rank. Fast mode is not a leading pick here: on the hard task it matched 4.8's 1.00 reliability but, at the 2× rate, cost roughly double (~$8.70 vs ~$4.85/run on the shared languages) without being reliably faster — speeding up token output doesn't help a reasoning-bound task. Regular opus-4.8 dominates fast mode on Brazil.
Aggregated across all models and tooling for each language on each task (Pass = pass-proportion = probability of a completely-correct run):
| Language | Task | n | Pass | CodeQual | TestCov | Speed (s) | Cost ($) |
|---|---|---|---|---|---|---|---|
| clojure | Brazil MCP (hard) | 12 | 0.75 | 0.83 | 1.00 | 715 | 3.51 |
| clojure | REST-API (easy) | 21 | 0.62 | 0.75 | 0.90 | 302 | 1.10 |
| go | Brazil MCP (hard) | 13 | 0.69 | 1.00 | 0.58 | 773 | 4.35 |
| go | REST-API (easy) | 20 | 1.00 | 1.00 | 0.67 | 142 | 0.61 |
| java | Brazil MCP (hard) | 10 | 0.80 | 1.00 | 1.00 | 784 | 4.03 |
| java | REST-API (easy) | 23 | 0.52 | 1.00 | 1.00 | 208 | 0.78 |
| python | Brazil MCP (hard) | 14 | 0.86 | 0.73 | 0.90 | 638 | 3.30 |
| python | REST-API (easy) | 20 | 0.90 | 0.65 | 0.80 | 97 | 0.43 |
| rust | Brazil MCP (hard) | 10 | 0.50 | 0.83 | 0.93 | 717 | 3.97 |
| rust | REST-API (easy) | 23 | 0.96 | 0.83 | 1.00 | 169 | 0.60 |
| typescript | Brazil MCP (hard) | 12 | 0.92 | 0.61 | 0.82 | 617 | 3.31 |
| typescript | REST-API (easy) | 20 | 1.00 | 0.73 | 0.89 | 168 | 0.56 |
| erlang | REST-API (easy) | 6 | 1.00 | 1.00 | 1.00 | 349 | 1.49 |
| elixir | REST-API (easy) | 6 | 1.00 | 1.00 | 1.00 | 271 | 1.32 |
Reliability swings hard by both axes: Rust is near-perfect on the easy task (0.96) but a coin-flip on the hard one (0.50); Java runs the other way (0.80 hard / 0.52 easy); TypeScript and Python are strong on both. Code quality, by contrast, is steady within a language across tasks (consistent with the ANOVA below) — Go and Java stay at 1.00 regardless. There is no single "best language"; it depends on the job.
The two BEAM languages (Erlang, Elixir — exp-8, opus-4.7/4.8 on the REST API) are a clean addition: 1.00 on every measure — pass-proportion, test coverage, and code quality — making them the most consistently solid stacks on the easy task. Elixir on opus-4.8 was the cheapest/fastest of the pair (207 s, $0.85/run).
Opus-4.8 has a fast mode (the /fast toggle — same model, faster token output). Crucially, it is billed at 2× the standard per-token rate — $10/$50 vs $5/$25 per million input/output tokens, per the Opus 4.8 announcement. (The Claude CLI's reported total_cost_usd computes at the standard rate, so retort now scales fast-mode runs by 2× to record the cost you're actually billed — the figures below are corrected.) Experiment 7 ran it on the same languages as the regular-4.8 baseline (exp-5/6), both tasks:
| Task | Language | Fast 4.8 (speed / cost) | Regular 4.8 (speed / cost) | Pass (both) |
|---|---|---|---|---|
| REST-API (easy) | clojure | 208 s / $1.37 | 508 s / $1.92 | 1.00 |
| REST-API (easy) | go | 140 s / $1.17 | 147 s / $0.66 | 1.00 |
| REST-API (easy) | python | 90 s / $0.74 | 122 s / $0.50 | 1.00 |
| REST-API (easy) | rust | 135 s / $1.06 | 185 s / $0.71 | 1.00 |
| Brazil (hard) | clojure | 712 s / $6.18 | 941 s / $4.58 | 1.00 |
| Brazil (hard) | go | 959 s / $9.90 | 867 s / $4.59 | 1.00 |
| Brazil (hard) | python | 967 s / $9.91 | 899 s / $5.10 | 1.00 |
| Brazil (hard) | rust | 909 s / $8.90 | 1081 s / $6.09 | 1.00 |
- Reliability is identical — every fast cell holds pass-proportion 1.00, same as regular 4.8. Fast mode costs you nothing in correctness.
- But it is more expensive, not cheaper. At 2× per token it runs ~50–75% pricier than regular 4.8 on most easy-task languages (the lone exception, clojure, is an artifact of an outlier 508 s regular run), and roughly 2× the cost on the hard task.
- And the speedup only shows up on easy work. On the REST API fast mode is ~20–40% faster in wall-clock; on the hard, reasoning-bound task it's not reliably faster at all (Go and Python fast runs were actually slower than regular) — because the bottleneck is the model thinking, not emitting tokens.
- Takeaway: fast mode buys latency, not savings. It's worth the 2× premium only when wall-clock turnaround on routine work matters more than the bill. On hard tasks you pay double for no speed gain — don't.
Every experiment above held the prompt constant. Experiment-13 varies it — the
prescribed test methodology — on a methodology-neutral fork of the hard task
(BDD stripped from the repo, so the discipline comes only from the prompt):
language[go, python] × model[sonnet, opus-4.8-fast] × prompt[neutral, TDD, ATDD],
3 replicates. Pass-proportion (requirement_coverage == 1.0):
| model | language | neutral | TDD | ATDD |
|---|---|---|---|---|
| opus-4.8-fast | go | 1.00 | 1.00 | 1.00 |
| opus-4.8-fast | python | 1.00 | 1.00 | 1.00 |
| sonnet | go | 1.00 | 1.00 | 0.33 |
| sonnet | python | 1.00 | 1.00 | 1.00 |
- Prescribing a methodology barely moves reliability on a task the model already understands — 11 of 12 cells pass regardless. The lone drop is ATDD on the weakest stack (sonnet + go): ATDD front-loads the most work (executable acceptance tests through the public interface first), and the cheaper model on the stricter language occasionally didn't finish the spec.
- The methodology shows up in what tests get written, not whether it ships. ATDD yields lower unit-statement coverage (acceptance-test focused) than TDD/neutral, yet still meets the spec everywhere except sonnet/go. Pick a methodology for the tests it leaves behind, not for a reliability boost.
- Cost tracks the model, not the methodology. Full write-up: prompt blog · exp-13 results. (BDD, the fourth arm, needs its baselines re-scored on the same footing before a fair comparison — the recommended follow-up.)
The point of a designed experiment is that you can decompose the variance — for each response, how much is explained by language vs. model vs. tooling, and is it significant. Type-II ANOVA on the balanced experiments (cost/duration log-transformed, since they scale multiplicatively) gives a strikingly clean separation of concerns:
| Response | Dominant factor (share of variance) | What it means |
|---|---|---|
| code_quality | language ≈ 94–96% (p < 10⁻⁴⁰) · model ~0% (n.s.) | Quality is the language's, not the model's. Java/Go/Rust score high whoever writes the code. |
| test_coverage | language ≈ 92–95% (p < 10⁻¹⁵) · model ~0% | Same story — the language (and its test ecosystem) dominates. |
| duration | task ≈ 75%; then model on a fixed hard task (37% in exp-5); language ~6% | The task sets the clock; on hard tasks the newer model is the one that's slower. |
| cost | task ≈ 82%; tooling +10% (p < 0.001); language ~4% | The task sets the bill; beads tooling measurably adds cost. |
| requirement_coverage | model (borderline, p ≈ 0.06); ceiling on easy task | The only metric where the model choice shows up — reliability is what you're buying with a newer model. |
The headline ANOVA insight: language governs code quality and tests, task governs cost and time, and the model mostly governs spec-reliability (and, on hard tasks, speed). Picking a newer model to "write better code" is largely wasted — it writes more reliably, not more cleanly, and it costs you time and money to do so. (beads tooling shows up in exactly one place — extra cost and time — with no quality or coverage payoff, which is why it was dropped from the later experiments.)
Reproduce with retort report effects --db <experiment>/retort.db --metric <response>.
The prompt as a factor (explored — experiment-13). The experiments above held the instruction constant; exp-13 varies the prescribed test methodology (prompt is a first-class factor — named strategies in prompts/<level>.md). The result: on a task the model already understands, the methodology barely moves reliability — it changes what tests get written (ATDD trades unit coverage for acceptance coverage) more than whether the run ships. See Does the prompt matter? above and the prompt blog. Prompt strategy beyond test methodology (terse vs. detailed, worked examples) remains a one-line addition to the grid for future study.
Each row links to its full per-cell results table (every language × model × tooling, with pass-proportion, speed, cost, and quality, generated from master.db).
| # | Task | Models | Covered | Results table | Headline (clean data) |
|---|---|---|---|---|---|
| 1 | REST-API | Opus-4.6, Sonnet | 56 | results → | Both ~0.6 reliable; Java/Go/Rust strongest; cheap but not certain |
| 2 | Brazil | Opus-4.6, Sonnet | 22 | results → | Hard task exposes them — only ~half of runs fully correct |
| 3 | Brazil | Opus-4.6, 4.7 | 7 | results → | 4.7 more reliable but 3× slower, 5.5× pricier |
| 4 | Brazil | Opus-4.8 | 6 | results → | First 4.8 data: fully correct, but slowest/priciest |
| 5 | Brazil | Opus-4.7, 4.8 | 36 | results → | 4.8 = 1.00 pass vs 4.7 = 0.85, +47% time/cost |
| 6 | REST-API | Opus-4.7, 4.8 | 71 | results → | Both 1.00 — 4.7 the better value, 4.8 is overkill |
| 7 | Brazil + REST-API | Opus-4.8 fast | 24 | results → | Fast mode = 1.00 pass on both, but 2× per-token price — buys speed, not savings |
| 8 | REST-API | Opus-4.7, 4.8 (Erlang+Elixir) | 12 | results → | Both BEAM languages 1.00 on every measure |
| 10 | Brazil + REST-API | Claude Fable 5 | 24 | results → | A tier above 4.8: 1.00 pass on both, but ~2× cost / slowest — no reliability to buy where 4.8 is already 1.00 |
| 11 | REST-API | Gemini (gemini-2.5-pro) vs claude-code |
— | scaffold → | First cross-agent study. Harness validated end-to-end against the live Gemini CLI; runs pending free-tier capacity |
| 13 | Brazil (neutral fork) | Sonnet, Opus-4.8-fast × prompt[neutral/TDD/ATDD] | 36 | results → | Prompt / test-methodology study: methodology barely moves reliability (11/12 cells pass); ATDD trades unit coverage for acceptance coverage |
The combined dataset across all experiments with scored runs is in master.csv (and master.db), rebuildable with retort aggregate --out master.db --csv master.csv; it includes the 24 Fable 5 runs (experiment-10) and the 36 prompt-methodology runs (experiment-13). (Experiment-11 is a ready-to-run cross-agent scaffold — no result rows yet; see Comparing coding agents.)
All run data — per-run source, tests, scores, and the spec-eval output — is committed under experiment-N/runs/, combined in master.db / master.csv (retort aggregate).
Methodology notes. Of ~300 archived runs, 234 are completed runs with a reproducible requirement_coverage (the rest failed the tests-gate or are shard duplicates). The spec gate reads a pinned REQUIREMENTS.json per task (constant denominator) and judges with a strong second-opinion model — the judge now defaults to the latest Claude (earlier runs in this dataset used opus-4.6; exp-13 used opus-4.8). Cross-experiment model means mix language/tooling sets, so per-model conclusions lean on the larger within-task samples.
retort maturity scores every stack (a unique language × model × tooling × task combination) into a lifecycle phase — production / trial / screening / candidate — from a composite of replicate agreement, completion rate, reliability level, and replicate coverage. It's the "which stack should I actually use?" view. Across all 103 stacks in the combined data (maturity-report.txt, headline metric requirement_coverage):
| Phase | Stacks | What it means |
|---|---|---|
| production (≥0.85) | 67 | Reliable + reproducible — ship it |
| trial (0.65–0.85) | 18 | Promising, needs more evidence |
| screening (0.40–0.65) | 12 | Inconsistent — only on easy tasks |
| candidate (<0.40) | 6 | Avoid |
Two things fall out of the ranking:
- Every new stack reached production (12/12): all four fast-mode language cells on both tasks, and all four Erlang/Elixir cells, scored 1.00 maturity.
- The whole immature tail is the hard task — and overwhelmingly the hard task with
beadstooling. On Brazil,tooling=nonestacks average 0.88 maturity (18 production);tooling=beadsstacks average just 0.54 (only 2 production). Even Opus-4.8 drops to candidate on Brazil oncebeadsis bolted on. The tooling doesn't just add cost (see ANOVA) — on a hard task it actively destabilizes the run. That's the quantified reasonbeadswas dropped from the later experiments.
Regenerate with retort maturity --db <db> --metric requirement_coverage.
Roughly 60 of the archived runs are not completed-with-coverage. The strict gate is deliberate — if the tests don't run, the run fails — but it's worth separating harness measurement bugs (our fault, now fixed) from genuine model failures (the real signal). Each new experiment surfaced measurement bugs precisely because it exercised code paths the earlier ones never did:
- Elixir false-failures (harness). Every Elixir run initially scored
test_coverage=0and failed the gate — but the agents had written valid Elixir (a sample archive runs 17 tests, 0 failures). The scorer used the deprecatedmix do deps.get, testcomma syntax, removed in recent Elixir. Fixed tomix test; all 6 Elixir runs then passed at 1.00. A model that looked like it failed had actually succeeded. - Missing cost on the newest runs (harness). Experiments 7 & 8 recorded duration but
$0.00cost. The OMP-harness change (PR #6) routed the cost parser by agent name but dropped theunknown → claude-codefallback the command builder has, so for cells that didn't pin an agent, Claude ran and billed but its cost JSON was discarded. Fixed + regression-tested; re-run with cost intact. - Re-eval found zero runs (harness). The tooling-free designs (exp-7/8 vary only language × model) tripped a matcher that did
tooling = NULLin SQL — never true — soreevaluatesilently graded nothing. Fixed toIS NULL. - Fast-mode cost under-reported 2× (harness). Fast mode bills at double the standard per-token rate (announcement), but the CLI's
total_cost_usdreports the standard-rate figure (confirmed by probe). retort now applies the 2× multiplier for fast-mode runs — without it, the fast-mode cost comparison was wrong in fast's favour (it's a premium, not a saving). - A rerun harness that recorded its own failure as the model's (harness). An overnight pass tried to re-run the
beads-tooling false-failures in experiments 1/2/5 under the fixed harness. The rerun harness never launched the model — every cell came back in ~1–4 s with $0 cost and all-zero scores — yet it overwrote the previously-good runs with those instant failures (experiment-5 dropped from 36 to 18 completed; experiment-1 lost 3). The DBs were restored from.pre-rerun.baksnapshots; no cell actually changed state. The tell was the same one as always: a genuine failure burns minutes of model time, a harness failure fails instantly for $0. (Full breakdown in exp-10 results → Rerun outcomes.) - ATDD cross-package + python-deps false-failures (harness). The prompt-methodology study (exp-13) ran acceptance tests that drive the system through its public interface — and surfaced two more scorer blind spots.
go test -coverwithout-coverpkgscores an acceptance test in one package that exercises its siblings at 0% (the entire ATDD pattern), and python coverage ran the barepytestscript without the project's deps / withoutpython -m, so collection failed. Seven runs the gate marked "tests did not run" had actually built and passed at 77–96% coverage. Fixed (-coverpkg+ a-count=1profile total for Go; a project-deps venv +python -m pytestfor Python) and regression-tested; all 36 exp-13 runs then completed. This is exactly the classretort diagnosenow catches automatically (re-test each failure → tooling-vs-genuine), and thereevaluatehealth-check refuses to report success when its judge silently graded nothing. - Genuine failures (signal). The real failures cluster exactly where the data says they should: the hard task with cheaper models or
beadstooling, and the one ATDD × sonnet/go corner above. A handful of Erlang runs also flaked the tests-gate on first attempt and passed on--retry-failed— ordinary non-determinism, not a model limitation.
The lesson cuts both ways: a strict "tests must run" gate is essential to avoid scoring vibes — but you have to be sure a failure is the model's and not the harness's. Every measurement bug is fixed and covered by tests, and retort diagnose + the reevaluate health-check now make the tooling-vs-genuine call for you.
Every command is retort <command> [options]; add --help to any of them for the authoritative, version-specific list. Global: retort --version, retort --help. Most analysis/reporting commands take --db <experiment>/retort.db and --format text|json (some also csv/html) with -o/--output to write a file instead of stdout.
| Command | What it does | Key options |
|---|---|---|
init NAME |
Create a workspace dir: config template, visibility-aware .gitignore, and an initialized SQLite DB. |
--visibility public|private (default private = fail-closed, artifacts local-only); --force to overwrite. |
tasks list / tasks show NAME |
List registered tasks and their canonical source URIs (bundled:// for in-repo tasks, github:// for hosted ones), or show one task's source + description. |
--format text|json. |
design generate |
Generate a fractional-factorial design matrix (CSV) for a phase. Reads factors from --config or a JSON {factor: [levels]} on stdin; honors design.fraction. |
--phase screening|characterization (req); --config; -o CSV out. |
report aliasing |
Show the confounding structure of a fractional design — which effects are aliased and thus not independently estimable. | --phase (screening = Res III, characterization = Res IV); --max-order 1|2|3; --config or factors on stdin. |
intake |
Ingest a new factor level (e.g. a newly shipped model) and D-optimally augment the existing design with the minimum new runs. | --factor, --level (req); --phase; --nrestarts (optimizer restarts); -o. |
| Command | What it does | Key options |
|---|---|---|
run |
The core loop: generate design → provision isolated playpens → run claude -p per cell → build/test/score → store in the DB. |
--phase (req); --config; --task; --replicates; --design <csv> (run an exact, hand-trimmed matrix); --resume (skip completed cells); --retry-failed (with --resume, re-attempt cells that only ever failed); --shard INDEX/TOTAL (deterministic slice for parallel runners on a shared DB); --dry-run. |
monitor [TARGET] |
Live progress of a run DB: completed/remaining, per-cell coverage, cost + token totals, throughput, ETA. Safe to point at a DB being actively written. | TARGET = experiment dir or .db; --watch/--once; --interval; --total; --json. |
| Command | What it does | Key options |
|---|---|---|
evaluate [RUN_DIRS…] |
Run the evaluate-run skill over run archives (manual/retroactive grading, or after updating the skill). | --experiment-dir (bulk-eval all runs); --force; --workers (default 4). |
reevaluate |
Re-grade archived runs with the second-opinion spec eval, persisting requirement_coverage into the DB. Self-checks (preflights the judge; errors instead of silently grading nothing; reports matched/orphaned). Non-destructive (status unchanged), resumable (skips already-graded unless --force). |
--experiment-dir (req); --eval-model (default unset → the CLI's latest model, tracking new releases; pass an id to pin); --workers; --force. |
rescore |
Re-score archived runs with the current scorers (after fixing/upgrading one) and write corrected metrics back to the DB + scores.json. A run whose tests now run (test_coverage > 0) flips to completed. |
--experiment-dir (req); --only-failed; --metrics (subset, no gate); --workers; --dry-run. |
diagnose |
Deep-analyse every failed run: re-test its archive and classify TOOLING (scorer false-failure — rescore recovers it) vs GENUINE (real model/spec failure), with the cause. Read-only. |
--experiment-dir (req); --as-json. |
promote STACK_ID |
Evaluate a promotion gate and report whether a stack passes from one lifecycle phase to the next. | --from, --to (req); --evidence '{"p_value":0.05}'; --config (gate thresholds). |
| Command | What it does | Key options |
|---|---|---|
analyze |
Type-II ANOVA per response metric on a CSV: which factors have significant effects. Log-transform by default (multiplicative model for cost/time/tokens). | --data (req); -r/--responses (req, repeatable); -f/--factors; --interactions; --transform log|none; --significance; --residuals; --predict (estimate unrun cells + 95% CI). |
report effects |
Main effects + interaction effects (mean response per factor level / level-pair) for a design matrix. | --db, --matrix-id, --metric (all req); --format text|json|csv|html. |
report pareto |
Pareto-optimal stacks across multiple objectives (quality vs cost vs speed…). Prefix a metric with - to minimize. |
--data (req); --metric (req, repeatable, - to minimize); --group-by (default language,model,tooling). |
maturity |
Score each stack's maturity (replicate agreement + completion rate + headline level + coverage → production/trial/screening/candidate). The "which stack to use" report. | --db (req); --metric (headline, default code_quality — use requirement_coverage for reliability); --stack (filter). |
report wardley |
Wardley-map overlay placing each stack on the evolution axis (Genesis → Custom → Product → Commodity) from its lifecycle phase. | --db (req); --format. |
report dashboard |
One-screen workspace overview: active experiments, lifecycle states, budget usage, recent promotions. | --db (req); --format. |
report compare |
Run the compare-runs skill to contrast evaluated runs across factor dimensions → comparison.md. |
--experiment-dir; --group-by; -o. |
report web |
Static HTML report (sortable per-stack maturity table + run drill-down). Respects experiment.visibility (redacts in private mode). |
--db (req); --config; --out; --title. |
| Command | What it does | Key options |
|---|---|---|
aggregate |
Roll every experiment-*/retort.db into one tidy wide runs table (one row/run, a column per metric). Rebuilt from scratch — re-run after a reevaluation pass. |
--experiments-dir (default .); --out (default master.db); --csv. |
export csv |
Flatten one DB's experiment_runs + run_results into the wide CSV that analyze/pareto consume. |
--db (req); -o; --include-failed. |
export merge |
Union multiple per-experiment CSVs into one (each input label=path.csv), tagging rows by source — for cross-experiment ANOVA. |
INPUTS… (req); --tag-column (default experiment); -o. |
| Command | What it does | Key options |
|---|---|---|
visibility-check |
Audit which workspace artifacts would be published vs kept local per experiment.visibility; exits non-zero if a private workspace would leak a sensitive path. |
--config. |
plugin list |
List installed scorers/runners and what they contribute. | --format. |
plugin show NAME |
Detail for one scorer or runner (e.g. build_time, docker). |
— |
Feature-complete for single-agent claude-code experiments with the LocalRunner. Implemented: LocalRunner, all scorers + the conformance spec gate, factorial/fractional design generation (incl. prompt as a factor), ANOVA + effects, SQLite storage + cross-experiment aggregate/reevaluate (with a judge-tooling health-check), rescore + diagnose for failure recovery and tooling-vs-genuine classification, resumable sharded runs, retort monitor (live in-flight view across shards), cost_limit_usd, OMP local-agent profiles, and a Gemini CLI harness (agent becomes a factor — compare claude-code vs gemini head-to-head; see below). Not yet: DockerRunner (skeleton), the intake/scheduler paths.
The agent is the same variable as the model — it isn't a separate factor. The harness follows from the model id: a gemini-* model runs via Google's Gemini CLI, every Claude id via claude-code. So you just list the models you want in the model factor and the right agent is selected per cell:
factors:
model: { levels: [claude-opus-4-8, gemini-2.5-pro] } # agent follows the model
language: { levels: [go, python, rust, typescript] }retort analyze then decomposes how much of quality/reliability/cost is the model/agent versus the language and task. The gemini harness needs Google's Gemini CLI on PATH and a Gemini auth method (GEMINI_API_KEY, ADC, or a free OAuth login) in the environment. The CLI reports tokens but not a dollar cost, so retort derives cost from GEMINI_PRICING in local_runner.py (base-tier rates — verify against current Google pricing). The spec-gate judge stays on Claude (the latest model by default; pin one with reevaluate --eval-model <id>) so an independent model grades every agent fairly.
A local/self-hosted model whose name doesn't imply its harness can't be inferred, so it's routed by an explicit profile that overrides the model rule. The omp harness drives oh-my-pi (omp) — a terminal coding agent that natively supports local backends (Ollama, LM Studio, llama.cpp, vLLM) as well as cloud providers. This is how you put a local model in the grid (claude-code runs Claude; omp runs whatever local/other model you point it at).
Install omp (one of):
brew install can1357/tap/omp # macOS / Linux (Homebrew)
curl -fsSL https://omp.sh/install | sh # macOS / Linux (script)
bun install -g @oh-my-pi/pi-coding-agent # any platform with Bun ≥ 1.3.14Serve a local model with Ollama, then declare it to omp. This path is verified end-to-end (experiment-12: Qwen2.5-Coder-7B on bookshop/Go).
# ⚠️ Use the CASK, not the formula. `brew install ollama` (formula) ships
# WITHOUT its inference runner (llama-server) — every call then 500s with
# "llama-server binary not found". The cask bundles the runner:
brew install --cask ollama && open -a Ollama # starts the server on :11434
ollama pull qwen2.5-coder:7bomp's built-in Ollama integration launches its own llama-server and is brittle against modern Ollama installs. The reliable wiring is a custom openai-completions provider in ~/.omp/agent/models.yml pointed at Ollama's OpenAI-compatible endpoint (use a provider name without "ollama" in it so omp doesn't reroute to its launcher):
# ~/.omp/agent/models.yml
providers:
lmlocal:
baseUrl: http://localhost:11434/v1
apiKey: ollama # ignored by Ollama; any literal works
api: openai-completions
auth: apiKey
models:
- id: qwen2.5-coder:7b # the id Ollama serves; sent on the wire
name: Qwen2.5 Coder 7B (local)
input: [text]
contextWindow: 32768
maxTokens: 8192
cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 } # local = freeVerify with a one-shot before launching a run, then point a retort profile at the provider/id:
omp -p --no-session --mode json --model lmlocal/qwen2.5-coder:7b "reply ok" # should print usage + cost:0playpen:
local_agents:
qwen-local: { harness: omp, model: lmlocal/qwen2.5-coder:7b }
factors:
agent: { levels: [qwen-local, claude-code] } # explicit override for non-inferable local modelsretort invokes omp -p --no-session --mode json --model <model> … and parses its JSON usage events; local runs record $0 (or a hardware-cost estimate if local_inference_cost is configured). omp also supports LM Studio, llama.cpp, and vLLM the same way — see the provider docs. Note on local tool-calling: experiment-12 (two local models, both fail, $0) shows a usable local coding agent needs two things, and small models miss one each. qwen2.5-coder:7b fails on tool-call format — it emits the intended call as bare JSON, which Ollama returns in content (not tool_calls) on both /v1 and native /api/chat, despite a tools capability — so omp (which executes only structured tool_calls) runs nothing. llama3.2:3b fixes that (its tool calls serialize; the integration executes them end-to-end) but lacks the agentic capability to drive the real task. So you need a model whose tool calls Ollama can structure and that's capable enough to drive a multi-step loop (try qwen2.5:7b-instruct, llama3.1:8b, mistral-nemo). See experiment-12.
Adding another cloud agent is the same three-part adapter: a command branch, a usage parser, and one LocalHarness literal (plus a model-prefix rule in _harness_for_model if the new agent's models should auto-route).