Skip to content

dsj1984/mandrel-bench

Repository files navigation

mandrel-bench

An internal benchmark that measures the Mandrel framework's value-add — what its scaffolding buys you versus what it costs — and tracks that verdict across framework versions and model generations.

mandrel-bench is a consumer of Mandrel, not part of it. It installs a pinned, published mandrel version, materializes the framework via mandrel sync, then drives Mandrel's own /plan/deliver pipeline (and a bare-model control) over a set of scenarios, scores each run across five dimensions, and emits a value-add report. Because the harness is held fixed while the mandrel dependency version varies, it answers the standing question: "is Mandrel still worth its tax at the current frontier?"

The dependency is one-directional: mandrel-bench depends on mandrel; mandrel never depends on mandrel-bench. If it measures the framework, it lives here; if it is the framework, it lives in mandrel.


Why a separate repo

Benchmarking the framework from inside the mandrel dev repo has two flaws this repo exists to fix:

  1. It wouldn't test the real consumer contract. Mandrel ships as an npm package that consumers install and mandrel sync into .agents/. A benchmark run here goes through that exact path, so it measures Mandrel the way real projects actually use it.
  2. It would confound harness-version with framework-version. To compare mandrel@1.70 vs mandrel@1.71, you must hold the measuring instrument constant and vary only the thing measured. Here the harness is fixed and the framework-under-test is a single pinned dependency (dependencies.mandrel in package.json) — bump it to benchmark a new version.

What it measures

Every dimension answers one of two questions: what does the scaffolding buy? (value) or what does it charge? (cost). The deliverable is the value/cost frontier — never a single collapsed score (that invites Goodhart gaming).

Side Dimension Question it answers Primary signal
Value Quality Is the delivered software correct & on-intent? frozen per-scenario acceptance suite + acceptance-eval cross-check
Value Planning fidelity Did the plan match the work actually required? decomposition accuracy, re-plan count, plan-vs-actual drift
Value Autonomy How little human intervention did it need? HITL stops, agent::blocked, manual rescues
Cost Efficiency What did it cost absolutely? wall-clock, tokens, dispatches
Cost Overhead ratio Ceremony tax vs. shippable output ceremony ÷ codegen (tokens & time)

Variance is the reporting method, not a sixth dimension: every score is reported as a distribution across N runs with a computed noise-band, and a Mandrel-vs-control delta is only called real when it clears that band.

Cross-scenario derived metrics (relationships, not per-run scores)

  • Difficulty monotonicity — across the scenario ladder, Efficiency must rise and Overhead ratio must fall as difficulty increases. A violation is a calibration warning (the instrument may be insensitive, or a scenario mis-graded).
  • Overhead floor — the hello-world Mandrel-arm cost minus the control-arm cost estimates Mandrel's fixed ceremony tax on near-zero work — the most direct "ceremony-lite path for trivial scopes" signal. Feeds the report's Recommended improvements section.

How it works

For each (scenario × arm × run):

  1. Provision an ephemeral throwaway clone of a scenario's sandbox repo (bench/driver/sandbox.js) under the OS temp dir. The control arm has its .agents/ stripped so the bare model gets no scaffolding.
  2. Run a headless Claude Code session (bench/driver/run-session.jsclaude -p --output-format json). The mandrel arm drives /plan then /deliver (real authoring — never pre-staged plans); the control arm gets the bare task. The JSON envelope yields the real usage/cost actuals — the only cost source, measured identically for both arms.
  3. Collect (bench/collect/normalize.js) the run's lifecycle telemetry (temp/epic-<id>/lifecycle.ndjson + per-Story signals.ndjson, written by /deliver) plus the cost envelope into one per-run record conforming to bench/schemas/scorecard.schema.json.
  4. Score (bench/score/) the five dimensions, the Mandrel-vs-control differential, the noise-band, and the cross-scenario derived metrics.
  5. Report & persist (bench/report/) the value-add report (distributions, deltas, scaling view, Recommended improvements), append the stamped scorecard (model + framework-version + env) to the longitudinal store under results/, and surface cross-run deltas.
  6. Tear down the ephemeral workspace (teardown is path-containment-guarded so it can only ever delete the throwaway clone).

The model is pinned and recorded on every scorecard; comparisons are only ever like-model to like-model — this is not a model benchmark.


Repository layout

mandrel-bench/
├── bench/
│   ├── metrics/      # five-dimension formulas (README.md) + variance/noise-band
│   ├── schemas/      # scorecard.schema.json (the per-run record contract)
│   ├── driver/       # claude -p run launcher + ephemeral sandbox lifecycle + unattended.md
│   ├── scenarios/    # hello-world/ + crud-db/ defs, frozen oracles, acceptance-eval adapter
│   ├── collect/      # lifecycle + signals + cost-envelope → normalized per-run record
│   ├── score/        # dimensions + Mandrel-vs-control differential + derived metrics
│   ├── report/       # value-add report renderer + stamped persistence + cross-run compare
│   └── fixtures/     # sample scorecard + sample lifecycle ndjson (test fixtures)
├── tests/bench/      # node:test suites mirroring bench/ (pure-logic units)
├── results/          # committed longitudinal scorecard store (the over-time record)
├── docs/             # architecture.md (run model) + decisions.md (rationale)
└── package.json      # pins `mandrel` — the version under test

Status

Wired and exercised end to end. The full component set (metrics model + scorecard schema, scenarios + frozen oracles, run driver + sandbox lifecycle, collector, scoring + control differential + derived metrics, report + persistence + cross-run compare) is tied together by a top-level orchestrator (bench/run.js), a framework-under-test overlay (bench/driver/overlay.js), and an app-runner (bench/driver/app-runner.js), all with node:test unit suites.

The first benchmark result has landed — hello-world, both arms, N=1 on mandrel@1.70.0 / claude-opus-4-8. The mandrel arm drove /plan/deliver fully headless and unattended against a throwaway mandrel-bench-sandbox repo; see results/ for the scorecards and the value-add report.

Done this cycle:

  • A top-level run orchestrator (bench/run.js) that loops N × scenarios × arms, runs overlay → driver → app-runner → collect → score → report, and writes to results/.
  • The driver's "framework under test" is the installed mandrel version: overlay.js copies this repo's materialized .agents/ (+ node_modules) into the mandrel-arm clone and repoints it at the sandbox repo.
  • An app-runner that starts the delivered app and probes it for the frozen Quality oracle.
  • The first live N=1 smoke result, persisted to results/.

Still open (deferred, separately planned):

  • Scale to N≈8–10 and add the crud-db rung for a statistically meaningful verdict (the N=1 result is non-inferential — see results/).
  • CI for this repo (run the unit suites; the full benchmark is a periodic, manually-triggered capability report, not a per-PR gate).
  • A first-class /plan headless flag and an auto-merge gate that does not block a clean trivial run (both surfaced as findings by the first result).

The unit suites under tests/bench/ run standalone via npm test; the scenario/acceptance-eval pieces additionally use the materialized .agents/.


Running

npm install            # pulls the pinned `mandrel` version under test
npm test               # run the harness unit suites

# Full capability run (real claude -p sessions against a sandbox repo):
BENCH_SANDBOX_REPO_URL=https://github.com/<owner>/<sandbox>.git \
BENCH_SANDBOX_OWNER=<owner> BENCH_SANDBOX_REPO=<sandbox> \
BENCH_EPIC_ID=<seed-epic-in-sandbox> \
BENCH_ARMS=control,mandrel BENCH_SCENARIOS=hello-world BENCH_N=1 \
npm run bench

To benchmark a different framework version, bump dependencies.mandrel in package.json, re-run npm install && npx mandrel sync, and re-run the benchmark; the new scorecards append to results/ stamped with that version, and the cross-run comparison surfaces the deltas.


See also

Development

  • Node >=22.22.1 <25. npm install pulls the pinned mandrel and activates the Husky hooks (via the prepare script).
  • Lint / format: npm run lint (Biome + markdownlint), npm run format (Biome write).
  • Test: npm test (node:test). The pure-logic units run standalone; the scenario / acceptance-eval suites additionally need npx mandrel sync to materialize .agents/ for the pinned version.
  • Hooks (Husky): pre-commit → lint-staged, commit-msg → commitlint (Conventional Commits), pre-pushnpm test.
  • Releases: release-please versions + changelogs on main (tags vX.Y.Z); not published to npm.
  • CI: GitHub Actions — lint + test on every PR to main.

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors