An internal benchmark that measures the Mandrel framework's value-add — what its scaffolding buys you versus what it costs — and tracks that verdict across framework versions and model generations.
mandrel-bench is a consumer of Mandrel, not part of it. It installs a
pinned, published mandrel version, materializes the framework via
mandrel sync, then drives Mandrel's own /plan→/deliver pipeline (and a
bare-model control) over a set of scenarios, scores each run across five
dimensions, and emits a value-add report. Because the harness is held fixed
while the mandrel dependency version varies, it answers the standing
question: "is Mandrel still worth its tax at the current frontier?"
The dependency is one-directional:
mandrel-benchdepends onmandrel;mandrelnever depends onmandrel-bench. If it measures the framework, it lives here; if it is the framework, it lives in mandrel.
Benchmarking the framework from inside the mandrel dev repo has two flaws this repo exists to fix:
- It wouldn't test the real consumer contract. Mandrel ships as an npm
package that consumers install and
mandrel syncinto.agents/. A benchmark run here goes through that exact path, so it measures Mandrel the way real projects actually use it. - It would confound harness-version with framework-version. To compare
mandrel@1.70vsmandrel@1.71, you must hold the measuring instrument constant and vary only the thing measured. Here the harness is fixed and the framework-under-test is a single pinned dependency (dependencies.mandrelinpackage.json) — bump it to benchmark a new version.
Every dimension answers one of two questions: what does the scaffolding buy? (value) or what does it charge? (cost). The deliverable is the value/cost frontier — never a single collapsed score (that invites Goodhart gaming).
| Side | Dimension | Question it answers | Primary signal |
|---|---|---|---|
| Value | Quality | Is the delivered software correct & on-intent? | frozen per-scenario acceptance suite + acceptance-eval cross-check |
| Value | Planning fidelity | Did the plan match the work actually required? | decomposition accuracy, re-plan count, plan-vs-actual drift |
| Value | Autonomy | How little human intervention did it need? | HITL stops, agent::blocked, manual rescues |
| Cost | Efficiency | What did it cost absolutely? | wall-clock, tokens, dispatches |
| Cost | Overhead ratio | Ceremony tax vs. shippable output | ceremony ÷ codegen (tokens & time) |
Variance is the reporting method, not a sixth dimension: every score is reported as a distribution across N runs with a computed noise-band, and a Mandrel-vs-control delta is only called real when it clears that band.
- Difficulty monotonicity — across the scenario ladder, Efficiency must rise and Overhead ratio must fall as difficulty increases. A violation is a calibration warning (the instrument may be insensitive, or a scenario mis-graded).
- Overhead floor — the hello-world Mandrel-arm cost minus the control-arm cost estimates Mandrel's fixed ceremony tax on near-zero work — the most direct "ceremony-lite path for trivial scopes" signal. Feeds the report's Recommended improvements section.
For each (scenario × arm × run):
- Provision an ephemeral throwaway clone of a scenario's sandbox repo
(
bench/driver/sandbox.js) under the OS temp dir. The control arm has its.agents/stripped so the bare model gets no scaffolding. - Run a headless Claude Code session (
bench/driver/run-session.js→claude -p --output-format json). The mandrel arm drives/planthen/deliver(real authoring — never pre-staged plans); the control arm gets the bare task. The JSON envelope yields the real usage/cost actuals — the only cost source, measured identically for both arms. - Collect (
bench/collect/normalize.js) the run's lifecycle telemetry (temp/epic-<id>/lifecycle.ndjson+ per-Storysignals.ndjson, written by/deliver) plus the cost envelope into one per-run record conforming tobench/schemas/scorecard.schema.json. - Score (
bench/score/) the five dimensions, the Mandrel-vs-control differential, the noise-band, and the cross-scenario derived metrics. - Report & persist (
bench/report/) the value-add report (distributions, deltas, scaling view, Recommended improvements), append the stamped scorecard (model + framework-version + env) to the longitudinal store underresults/, and surface cross-run deltas. - Tear down the ephemeral workspace (teardown is path-containment-guarded so it can only ever delete the throwaway clone).
The model is pinned and recorded on every scorecard; comparisons are only ever like-model to like-model — this is not a model benchmark.
mandrel-bench/
├── bench/
│ ├── metrics/ # five-dimension formulas (README.md) + variance/noise-band
│ ├── schemas/ # scorecard.schema.json (the per-run record contract)
│ ├── driver/ # claude -p run launcher + ephemeral sandbox lifecycle + unattended.md
│ ├── scenarios/ # hello-world/ + crud-db/ defs, frozen oracles, acceptance-eval adapter
│ ├── collect/ # lifecycle + signals + cost-envelope → normalized per-run record
│ ├── score/ # dimensions + Mandrel-vs-control differential + derived metrics
│ ├── report/ # value-add report renderer + stamped persistence + cross-run compare
│ └── fixtures/ # sample scorecard + sample lifecycle ndjson (test fixtures)
├── tests/bench/ # node:test suites mirroring bench/ (pure-logic units)
├── results/ # committed longitudinal scorecard store (the over-time record)
├── docs/ # architecture.md (run model) + decisions.md (rationale)
└── package.json # pins `mandrel` — the version under test
Wired and exercised end to end. The full component set (metrics model +
scorecard schema, scenarios + frozen oracles, run driver + sandbox lifecycle,
collector, scoring + control differential + derived metrics, report +
persistence + cross-run compare) is tied together by a top-level orchestrator
(bench/run.js), a framework-under-test overlay (bench/driver/overlay.js),
and an app-runner (bench/driver/app-runner.js), all with node:test unit
suites.
The first benchmark result has landed — hello-world, both arms, N=1 on
mandrel@1.70.0 / claude-opus-4-8. The mandrel arm drove /plan→/deliver
fully headless and unattended against a throwaway mandrel-bench-sandbox repo;
see results/ for the scorecards and the value-add report.
Done this cycle:
- A top-level run orchestrator (
bench/run.js) that loopsN × scenarios × arms, runs overlay → driver → app-runner → collect → score → report, and writes toresults/. - The driver's "framework under test" is the installed
mandrelversion:overlay.jscopies this repo's materialized.agents/(+node_modules) into the mandrel-arm clone and repoints it at the sandbox repo. - An app-runner that starts the delivered app and probes it for the frozen Quality oracle.
- The first live N=1 smoke result, persisted to
results/.
Still open (deferred, separately planned):
- Scale to N≈8–10 and add the
crud-dbrung for a statistically meaningful verdict (the N=1 result is non-inferential — seeresults/). - CI for this repo (run the unit suites; the full benchmark is a periodic, manually-triggered capability report, not a per-PR gate).
- A first-class
/planheadless flag and an auto-merge gate that does not block a clean trivial run (both surfaced as findings by the first result).
The unit suites under tests/bench/ run standalone via npm test; the
scenario/acceptance-eval pieces additionally use the materialized .agents/.
npm install # pulls the pinned `mandrel` version under test
npm test # run the harness unit suites
# Full capability run (real claude -p sessions against a sandbox repo):
BENCH_SANDBOX_REPO_URL=https://github.com/<owner>/<sandbox>.git \
BENCH_SANDBOX_OWNER=<owner> BENCH_SANDBOX_REPO=<sandbox> \
BENCH_EPIC_ID=<seed-epic-in-sandbox> \
BENCH_ARMS=control,mandrel BENCH_SCENARIOS=hello-world BENCH_N=1 \
npm run benchTo benchmark a different framework version, bump dependencies.mandrel in
package.json, re-run npm install && npx mandrel sync, and re-run the
benchmark; the new scorecards append to results/ stamped with that version,
and the cross-run comparison surfaces the deltas.
- Mandrel — the framework under test.
docs/architecture.md— technical architecture (run model, components, data flow, security).docs/decisions.md— the decision log and rationale.results/— the scorecard store and value-add reports.
- Node
>=22.22.1 <25.npm installpulls the pinnedmandreland activates the Husky hooks (via thepreparescript). - Lint / format:
npm run lint(Biome + markdownlint),npm run format(Biome write). - Test:
npm test(node:test). The pure-logic units run standalone; the scenario / acceptance-eval suites additionally neednpx mandrel syncto materialize.agents/for the pinned version. - Hooks (Husky):
pre-commit→ lint-staged,commit-msg→ commitlint (Conventional Commits),pre-push→npm test. - Releases: release-please
versions + changelogs on
main(tagsvX.Y.Z); not published to npm. - CI: GitHub Actions —
lint+teston every PR tomain.
MIT