mandrel-bench

An internal benchmark that measures the Mandrel framework's value-add — what its scaffolding buys you versus what it costs — and tracks that verdict across framework versions and model generations.

mandrel-bench is a consumer of Mandrel, not part of it. It installs a pinned, published mandrel version, materializes the framework via mandrel sync, then drives Mandrel's own /plan→/deliver pipeline (and a bare-model control) over a set of scenarios, scores each run across five dimensions, and emits a value-add report. Because the harness is held fixed while the mandrel dependency version varies, it answers the standing question: "is Mandrel still worth its tax at the current frontier?"

The dependency is one-directional: mandrel-bench depends on mandrel; mandrel never depends on mandrel-bench. If it measures the framework, it lives here; if it is the framework, it lives in mandrel.

Why a separate repo

Benchmarking the framework from inside the mandrel dev repo has two flaws this repo exists to fix:

It wouldn't test the real consumer contract. Mandrel ships as an npm package that consumers install and mandrel sync into .agents/. A benchmark run here goes through that exact path, so it measures Mandrel the way real projects actually use it.
It would confound harness-version with framework-version. To compare mandrel@1.70 vs mandrel@1.71, you must hold the measuring instrument constant and vary only the thing measured. Here the harness is fixed and the framework-under-test is a single pinned dependency (dependencies.mandrel in package.json) — bump it to benchmark a new version.

What it measures

Every dimension answers one of two questions: what does the scaffolding buy? (value) or what does it charge? (cost). The deliverable is the value/cost frontier — never a single collapsed score (that invites Goodhart gaming).

Side	Dimension	Question it answers	Primary signal
Value	Quality	Is the delivered software correct & on-intent?	frozen per-scenario acceptance suite + `acceptance-eval` cross-check
Value	Planning fidelity	Did the plan match the work actually required?	decomposition accuracy, re-plan count, plan-vs-actual drift
Value	Autonomy	How little human intervention did it need?	HITL stops, `agent::blocked`, manual rescues
Cost	Efficiency	What did it cost absolutely?	wall-clock, tokens, dispatches
Cost	Overhead ratio	Ceremony tax vs. shippable output	ceremony ÷ codegen (tokens & time)

Variance is the reporting method, not a sixth dimension: every score is reported as a distribution across N runs with a computed noise-band, and a Mandrel-vs-control delta is only called real when it clears that band.

Cross-scenario derived metrics (relationships, not per-run scores)

Difficulty monotonicity — across the scenario ladder, Efficiency must rise and Overhead ratio must fall as difficulty increases. A violation is a calibration warning (the instrument may be insensitive, or a scenario mis-graded).
Overhead floor — the hello-world Mandrel-arm cost minus the control-arm cost estimates Mandrel's fixed ceremony tax on near-zero work — the most direct "ceremony-lite path for trivial scopes" signal. Feeds the report's Recommended improvements section.

How it works

For each (scenario × arm × run):

Provision an ephemeral throwaway clone of a scenario's sandbox repo (bench/driver/sandbox.js) under the OS temp dir. The control arm has its .agents/ stripped so the bare model gets no scaffolding.
Run a headless Claude Code session (bench/driver/run-session.js → claude -p --output-format json). The mandrel arm drives /plan then /deliver (real authoring — never pre-staged plans); the control arm gets the bare task. The JSON envelope yields the real usage/cost actuals — the only cost source, measured identically for both arms.
Collect (bench/collect/normalize.js) the run's lifecycle telemetry (temp/epic-<id>/lifecycle.ndjson + per-Story signals.ndjson, written by /deliver) plus the cost envelope into one per-run record conforming to bench/schemas/scorecard.schema.json.
Score (bench/score/) the five dimensions, the Mandrel-vs-control differential, the noise-band, and the cross-scenario derived metrics.
Report & persist (bench/report/) the value-add report (distributions, deltas, scaling view, Recommended improvements), append the stamped scorecard (model + framework-version + env) to the longitudinal store under results/, and surface cross-run deltas.
Tear down the ephemeral workspace (teardown is path-containment-guarded so it can only ever delete the throwaway clone).

The model is pinned and recorded on every scorecard; comparisons are only ever like-model to like-model — this is not a model benchmark.

Repository layout

mandrel-bench/
├── bench/
│   ├── metrics/      # five-dimension formulas (README.md) + variance/noise-band
│   ├── schemas/      # scorecard.schema.json (the per-run record contract)
│   ├── driver/       # claude -p run launcher + ephemeral sandbox lifecycle + unattended.md
│   ├── scenarios/    # hello-world/ + crud-db/ defs, frozen oracles, acceptance-eval adapter
│   ├── collect/      # lifecycle + signals + cost-envelope → normalized per-run record
│   ├── score/        # dimensions + Mandrel-vs-control differential + derived metrics
│   ├── report/       # value-add report renderer + stamped persistence + cross-run compare
│   └── fixtures/     # sample scorecard + sample lifecycle ndjson (test fixtures)
├── tests/bench/      # node:test suites mirroring bench/ (pure-logic units)
├── results/          # committed longitudinal scorecard store (the over-time record)
├── docs/             # architecture.md (run model) + decisions.md (rationale)
└── package.json      # pins `mandrel` — the version under test

Status

Wired and exercised end to end. The full component set (metrics model + scorecard schema, scenarios + frozen oracles, run driver + sandbox lifecycle, collector, scoring + control differential + derived metrics, report + persistence + cross-run compare) is tied together by a top-level orchestrator (bench/run.js), a framework-under-test overlay (bench/driver/overlay.js), and an app-runner (bench/driver/app-runner.js), all with node:test unit suites.

The first benchmark result has landed — hello-world, both arms, N=1 on mandrel@1.70.0 / claude-opus-4-8. The mandrel arm drove /plan→/deliver fully headless and unattended against a throwaway mandrel-bench-sandbox repo; see results/ for the scorecards and the value-add report.

Done this cycle:

A top-level run orchestrator (bench/run.js) that loops N × scenarios × arms, runs overlay → driver → app-runner → collect → score → report, and writes to results/.
The driver's "framework under test" is the installed mandrel version: overlay.js copies this repo's materialized .agents/ (+ node_modules) into the mandrel-arm clone and repoints it at the sandbox repo.
An app-runner that starts the delivered app and probes it for the frozen Quality oracle.
The first live N=1 smoke result, persisted to results/.

Still open (deferred, separately planned):

Scale to N≈8–10 and add the crud-db rung for a statistically meaningful verdict (the N=1 result is non-inferential — see results/).
CI for this repo (run the unit suites; the full benchmark is a periodic, manually-triggered capability report, not a per-PR gate).
A first-class /plan headless flag and an auto-merge gate that does not block a clean trivial run (both surfaced as findings by the first result).

The unit suites under tests/bench/ run standalone via npm test; the scenario/acceptance-eval pieces additionally use the materialized .agents/.

Running

npm install            # pulls the pinned `mandrel` version under test
npm test               # run the harness unit suites

# Full capability run (real claude -p sessions against a sandbox repo):
BENCH_SANDBOX_REPO_URL=https://github.com/<owner>/<sandbox>.git \
BENCH_SANDBOX_OWNER=<owner> BENCH_SANDBOX_REPO=<sandbox> \
BENCH_EPIC_ID=<seed-epic-in-sandbox> \
BENCH_ARMS=control,mandrel BENCH_SCENARIOS=hello-world BENCH_N=1 \
npm run bench

To benchmark a different framework version, bump dependencies.mandrel in package.json, re-run npm install && npx mandrel sync, and re-run the benchmark; the new scorecards append to results/ stamped with that version, and the cross-run comparison surfaces the deltas.

Development

Node >=22.22.1 <25. npm install pulls the pinned mandrel and activates the Husky hooks (via the prepare script).
Lint / format: npm run lint (Biome + markdownlint), npm run format (Biome write).
Test: npm test (node:test). The pure-logic units run standalone; the scenario / acceptance-eval suites additionally need npx mandrel sync to materialize .agents/ for the pinned version.
Hooks (Husky): pre-commit → lint-staged, commit-msg → commitlint (Conventional Commits), pre-push → npm test.
Releases: release-please versions + changelogs on main (tags vX.Y.Z); not published to npm.
CI: GitHub Actions — lint + test on every PR to main.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.agents		.agents
.claude		.claude
.github/workflows		.github/workflows
.husky		.husky
baselines		baselines
bench		bench
docs		docs
results		results
tests/bench		tests/bench
.agentrc.json		.agentrc.json
.gitignore		.gitignore
.markdownlint.json		.markdownlint.json
.release-please-manifest.json		.release-please-manifest.json
CLAUDE.md		CLAUDE.md
README.md		README.md
biome.json		biome.json
commitlint.config.js		commitlint.config.js
package-lock.json		package-lock.json
package.json		package.json
release-please-config.json		release-please-config.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mandrel-bench

Why a separate repo

What it measures

Cross-scenario derived metrics (relationships, not per-run scores)

How it works

Repository layout

Status

Running

See also

Development

License

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mandrel-bench

Why a separate repo

What it measures

Cross-scenario derived metrics (relationships, not per-run scores)

How it works

Repository layout

Status

Running

See also

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages