Open harness for running, measuring, and visualizing agent benchmarks.
Adapters for AutomationBench (Zapier), τ-bench (Sierra), LeRobot (Hugging Face), WorkArena (ServiceNow), SWE-bench, and more — plus benchmarks we author end-to-end (S4Bench for SAP, more coming).
## Where things live
- `scenebench` — harness + content + scenecast extensions: bench fixtures, Python adapters, evaluators, reporters, and the typed vendor implementations (Gmail, Salesforce, Slack, …) for every domain its benchmarks cover. Vendor types ship as scenecast extensions (`extends: ["email/mailbox"]`, etc.).
- `scenecast` — the API + canonical shapes: `defineAsset`, `defineView`, multi-format renderers, abstract primitives (Email, Message, Contact, Event, Task, Document). The lean dependency benchmarks build on.
- `scene-otel` — the wire format: `scene.set` events, hashing, the generic scrubber.

If you only need types or rendering, install `scenecast` — its npm package is the lean entry point. `scenebench` is for benchmark runners and is best used by cloning the git repo (which carries fixtures and tasks not shipped to npm).
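As a sketch of how vendor types are authored: the block below models the `defineAsset` + `extends` pattern with a local stand-in helper, since scenecast's real signatures may differ. All field names here are illustrative assumptions, not the shipped Gmail type.

```typescript
// Local stand-in for scenecast's defineAsset (assumption: the real API
// likely differs; this only models the shape described above).
interface AssetDef {
  key: string;                  // vendor type name
  extends: string[];            // canonical shapes this type refines
  fields: Record<string, "string" | "string[]" | "boolean">;
}

function defineAsset(def: AssetDef): AssetDef {
  if (def.extends.length === 0) {
    throw new Error("vendor types must extend at least one canonical");
  }
  return def;
}

// A Gmail mailbox refines the "email/mailbox" canonical with vendor fields.
const gmail = defineAsset({
  key: "gmail",
  extends: ["email/mailbox"],
  fields: { threadId: "string", labels: "string[]", starred: "boolean" },
});

console.log(gmail.extends[0]); // → "email/mailbox"
```

Because every vendor type declares its canonicals, a consumer can treat anything extending `email/mailbox` as the abstract Email shape without knowing the vendor.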
- Run any benchmark with scene-otel instrumentation. Adapters wrap each benchmark's native rollout loop; every step emits `scene.set` snapshots automatically.
- Step-level metrics nobody else publishes. Belief-vs-truth alignment, drift counts, first-unsatisfiable-step, intent resolution rate. Beyond pass/fail.
- Cross-benchmark normalization. Same metrics across AutomationBench, τ-bench, LeRobot. Compare belief accuracy of a model on software vs robotics tasks.
- JSONL output you can scrub. Drops directly into the scene-otel static scrubber for turn-by-turn replay.
- Markdown leaderboards for READMEs, PRs, Slack updates.
```
scenebench/
├── src/                    TypeScript core (npm-shipped)
│   ├── parse.ts                JSONL → BenchRun
│   ├── evaluators/             metrics over BenchRuns
│   │   └── belief_truth.ts
│   ├── reporters/              BenchRun + metrics → markdown / leaderboard / JSONL
│   └── types.ts
├── types/                  vendor type declarations (npm-shipped, à la carte)
│   ├── gmail.ts                authored via `defineAsset` from scenecast, extends "email/mailbox"
│   ├── slack.ts                extends "message/channel"
│   ├── salesforce.ts           extends "contact/list"
│   ├── google_calendar.ts      extends "event/list"
│   ├── jira.ts · github.ts     extends "task/list"
│   ├── notion.ts               extends "document/single"
│   ├── google_sheets.ts · stripe.ts · airtable.ts
│   └── index.ts                vendorTypes registry
├── adapters/               per-benchmark wrappers (git-only, not shipped to npm)
│   └── automationbench/        ✓ shipped
│       ├── scene.py                Python port of scene-otel's wire format
│       ├── instrument.py           Verifiers env subclass that emits scene events
│       ├── run.py                  CLI: pick tasks, run model, dump JSONL
│       ├── schemas/                49 JSON Schemas synced from Zapier's Pydantic models
│       ├── tasks/                  806 task definitions (initial_state + prompt + assertions)
│       └── scripts/                sync + fixture-build scripts
├── viewer/example-traces/  bundled JSONL fixtures (git-only)
├── benches/                benchmarks WE author end-to-end (git-only)
│   └── (s4 / sfdc / d365 — coming, each with its own types/)
└── examples/
```
Adapters wrap external benchmarks (we don't own their tasks). Benches are ones we author end-to-end (tasks + fixtures + rubric). Vendor types are written using scenecast's `defineAsset` / `defineView` API and declare `extends: ["email/mailbox"]` etc. — so the canonicals in scenecast (Email, Message, Contact, Event, Task, Document) unify them at the abstract level.
The npm package ships only `src/` and `types/`. Heavy data (806 AB tasks, 49 schemas, JSONL fixtures) lives in the git repo for benchmark runners — git clone, don't npm install. For genuinely large datasets we expect to publish to HuggingFace Datasets / S3 and reference them from `benches/<bench>/DATA.md`.
The AutomationBench adapter wraps Sierra's Verifiers env. It requires a local clone of their repo and an Anthropic or OpenAI API key.
```sh
# clone the AB repo somewhere; point AB_ROOT at it
export AB_ROOT=/path/to/AutomationBench

# from scenebench/adapters/automationbench/
PYTHONPATH=. $AB_ROOT/.venv/bin/python run.py 3
# → ../../viewer/example-traces/automationbench-real-*.jsonl
```

This runs 3 tasks against claude-haiku-4-5 and produces JSONL traces with `scene.set` snapshots per step.
To re-sync schemas from Zapier's models and rebuild the bundled fixtures:

```sh
AB_ROOT=/path/to/AutomationBench \
  $AB_ROOT/.venv/bin/python adapters/automationbench/scripts/sync-automationbench.py

bun adapters/automationbench/scripts/build-ab-fixtures.ts
```

Parse and evaluate the resulting traces from TypeScript:

```ts
import { parseJsonl, evaluators, reporters } from "scenebench";

const runs = parseJsonl("trace.jsonl", "automationbench");
for (const run of runs) {
  const metrics = evaluators.evaluateRun(run);
  console.log(reporters.toMarkdown([{ run, metrics }]));
}
```

Output:
| task | reward | tokens | events | intents | drifts | duration |
|---|---|---|---|---|---|---|
| simple.email_sf_contact_phone_update | 1.00 | 6191+780 | 6 | 3 | 0 | 7600ms |
The `belief_truth` evaluator emits:

| Metric | Meaning |
|---|---|
| `intent_rate` | fraction of tool calls where an intent was declared |
| `intent_resolution_rate` | fraction of intents that got resolved by an actual outcome |
| `drift_count` | intents whose value diverged from the actual outcome |
| `first_drift_step` | step index where the first drift occurred (or `undefined`) |
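These metrics fall out naturally from pairing intent events with actual outcomes. The sketch below is illustrative only: the event shape and the pairing rule (matching on asset key and step) are assumptions, not scenebench's actual internals.

```typescript
// Illustrative belief-vs-truth computation over an assumed event shape.
type SceneEvent =
  | { kind: "intent"; step: number; key: string; value: string }   // what the agent believes it did
  | { kind: "actual"; step: number; key: string; value: string };  // what really happened

function beliefTruth(events: SceneEvent[], toolCalls: number) {
  const intents = events.filter((e) => e.kind === "intent");
  const actuals = events.filter((e) => e.kind === "actual");
  let resolved = 0;
  let drifts = 0;
  let firstDriftStep: number | undefined;

  for (const intent of intents) {
    // Pair each intent with the actual outcome at the same key and step.
    const actual = actuals.find((a) => a.key === intent.key && a.step === intent.step);
    if (!actual) continue;
    resolved += 1;
    if (actual.value !== intent.value) {
      drifts += 1;
      if (firstDriftStep === undefined) firstDriftStep = intent.step;
    }
  }

  return {
    intent_rate: toolCalls ? intents.length / toolCalls : 0,
    intent_resolution_rate: intents.length ? resolved / intents.length : 0,
    drift_count: drifts,
    first_drift_step: firstDriftStep,
  };
}

// Three tool calls; two declared intents; the second drifted at step 1.
const m = beliefTruth(
  [
    { kind: "intent", step: 0, key: "crm/contact", value: "phone=555" },
    { kind: "actual", step: 0, key: "crm/contact", value: "phone=555" },
    { kind: "intent", step: 1, key: "email/draft", value: "sent" },
    { kind: "actual", step: 1, key: "email/draft", value: "bounced" },
  ],
  3,
);
console.log(m.drift_count, m.first_drift_step); // → 1 1
```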
Plus baseline counts:

| Metric | Meaning |
|---|---|
| `events_total` | total scene events emitted |
| `events_intent` / `events_actual` | breakdown by kind |
| `scene_keys_distinct` | unique asset keys touched in the run |
More evaluators planned: milestone tracking, inflection-step detection, cost-per-belief-correctness.
| Adapter | Status | Notes |
|---|---|---|
| AutomationBench | ✓ shipped | wraps Sierra Verifiers' StatefulToolEnv; emits per-tool intent + actual |
| τ-bench | planned | same shape — Sierra's customer-service benchmark |
| LeRobot | planned | episode-level scene events from manipulation/navigation datasets |
| WorkArena | planned | ServiceNow workflow tasks |
| SWE-bench | planned | code agents — file system + test results as scene state |
### v0.0.1 (current)
- ✅ TypeScript core: types, parser, evaluators, markdown reporter
- ✅ AutomationBench adapter (Python)
- ✅ Belief-vs-truth + basic stats evaluators
### Coming next
- Milestone evaluator — assertion-aware scoring (PENDING / SATISFIED / BROKEN / UNSATISFIABLE) with first-unsatisfiable-step detection
- Leaderboard reporter — per-model rollup across many runs, exportable as JSON / web
- τ-bench adapter — second adapter validates the pattern
- S4Bench — first benchmark we author end-to-end (SAP S/4HANA workflows)
- CLI: `npx scenebench run --benchmark X --model Y` one-liner
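The planned milestone evaluator can be pictured as a small state machine over a task's assertions. The sketch below is a guess at that design based only on the roadmap wording: the state names come from the bullet above; everything else (function name, observation shape) is assumed.

```typescript
// Assumed milestone lifecycle: PENDING → SATISFIED / BROKEN / UNSATISFIABLE.
type MilestoneState = "PENDING" | "SATISFIED" | "BROKEN" | "UNSATISFIABLE";

interface Milestone {
  id: string;
  state: MilestoneState;
}

// Apply per-step state observations in order; report the first step at which
// any milestone becomes UNSATISFIABLE (it can never be satisfied afterwards).
function firstUnsatisfiableStep(
  milestones: Milestone[],
  observations: { step: number; id: string; next: MilestoneState }[],
): number | undefined {
  const byId = new Map<string, Milestone>(
    milestones.map((m): [string, Milestone] => [m.id, m]),
  );
  for (const o of observations) {
    const m = byId.get(o.id);
    if (!m) continue;
    m.state = o.next;
    if (o.next === "UNSATISFIABLE") return o.step;
  }
  return undefined;
}

const step = firstUnsatisfiableStep(
  [
    { id: "contact_updated", state: "PENDING" },
    { id: "email_sent", state: "PENDING" },
  ],
  [
    { step: 2, id: "contact_updated", next: "SATISFIED" },
    { step: 5, id: "email_sent", next: "UNSATISFIABLE" }, // e.g. required draft was deleted
  ],
);
console.log(step); // → 5
```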
MIT. See LICENSE.
- `scene-otel` — wire format for scene events, with the static scrubber that visualizes scenebench output
- `scenecast` — typed asset shapes + views; gives every benchmark a consistent visual language
- `agent-otel` — generic OTel router for agent telemetry
- `autocompile` — observes repeated runs, compiles invariants to code