OpenBench

Open benchmarking framework for coding-agent CLIs.

Same tasks. Same environment. Comparable results.

Live Report — interactive results with per-task token breakdown

What's New

2026-04-13 — SWE-bench Verified integration: claude passes 3/3 Hard tasks (Django, Sympy)
2026-04-13 — Token tracking with input/cached/output breakdown for all 4 agents
2026-04-12 — First real benchmark run: claude, omc, codex, omx on 5 Practical tasks
2026-04-12 — Benchmark rules spec finalized: hidden tests, 5-run repetitions, per-difficulty reporting

Latest Results

Practical — 5 Easy tasks (all agents 5/5 pass)

Agent	Type	Time	Tokens (in)	Tokens (out)
claude	Claude Code (native)	93s	834k	2.4k
omc	Claude + OMC plugins	116s	880k	2.8k
codex	Codex CLI (native)	192s	767k	7.6k
omx	Codex + OMX wrapper	279s	1.9M	10.4k

SWE-bench — 3 Hard tasks (real GitHub issues)

Agent	Pass	Time	Tasks
claude	3/3	377s	django-17087, django-16493, sympy-24213
omc	2/3	271s	pending re-run with patch fix

Wrapper overhead (Practical suite): OMC adds ~25% wall-clock time over native claude (116s vs 93s), and OMX adds ~45% over native codex (279s vs 192s). Input token counts are normalized to include cached tokens. See the live report for the full breakdown.
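
The overhead percentages can be re-derived from the Time column of the Practical table. The snippet below is only a worked calculation; the dictionary and function names are illustrative and are not part of OpenBench.

```python
# Wall-clock times (seconds) copied from the Practical results table.
PRACTICAL_TIME_S = {"claude": 93, "omc": 116, "codex": 192, "omx": 279}

# Each wrapper is compared against the native CLI it builds on.
BASELINE = {"omc": "claude", "omx": "codex"}


def overhead_pct(wrapper: str) -> float:
    """Extra wall-clock time of a wrapper, as a percentage of its baseline."""
    base = PRACTICAL_TIME_S[BASELINE[wrapper]]
    return (PRACTICAL_TIME_S[wrapper] - base) / base * 100


for name, base_name in BASELINE.items():
    print(f"{name}: +{overhead_pct(name):.1f}% vs {base_name}")
# omc: +24.7% vs claude
# omx: +45.3% vs codex
```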

Benchmark Suites

Suite	Tier	Tasks	What it tests
Practical	0	5 self-contained fixtures	Bug fixes, feature additions, refactoring — fast CI validation
SWE-bench	1	50 curated SWE-bench Verified issues	Production-grade coding on Django, Sympy, scikit-learn, matplotlib, Flask, Requests
Runtime	—	3 CLI metrics	Startup latency, peak memory, binary size

About SWE-bench

SWE-bench (ICLR 2024 Oral) is a benchmark for evaluating LLMs on real-world GitHub issues. Given a codebase and an issue description, the model generates a patch to resolve the problem.

OpenBench integrates SWE-bench Verified — a human-validated subset of 500 problems confirmed solvable by real software engineers (report). We use the official Docker-based evaluation harness for reproducible per-instance evaluation, while adding agent CLI comparison (time, tokens, cache) on top of the standard pass/fail grading.

Agents

Agent	CLI	What it is
`claude`	`claude -p`	Claude Code native — no plugins, no hooks
`omc`	`claude -p`	oh-my-claudecode — Claude Code + OMC orchestration layer
`codex`	`codex exec`	OpenAI Codex CLI native
`omx`	`omx exec`	oh-my-codex — Codex CLI + OMX wrapper
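
To make the table concrete, here is a minimal sketch of how an agent name could map to a command line. `AGENT_CLI` and `build_argv` are hypothetical names, not OpenBench's adapter API, and passing the prompt as the final positional argument is an assumption.

```python
import shlex

# CLI prefixes copied from the Agents table.
AGENT_CLI = {
    "claude": "claude -p",
    "omc": "claude -p",
    "codex": "codex exec",
    "omx": "omx exec",
}


def build_argv(agent: str, prompt: str) -> list[str]:
    """Return the argv for one non-interactive agent invocation (assumed shape)."""
    try:
        prefix = AGENT_CLI[agent]
    except KeyError:
        raise ValueError(f"unknown agent: {agent!r}") from None
    # Assumption: the task prompt is appended as a single positional argument.
    return shlex.split(prefix) + [prompt]


print(build_argv("codex", "Fix the failing test"))
# ['codex', 'exec', 'Fix the failing test']
```

Note that `claude` and `omc` share the same CLI prefix, so the difference between them presumably comes from the environment (plugins and hooks) rather than from the command itself.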

Quick Start

# Install
uv sync --python 3.11

# Check environment
openbench doctor

# Run practical benchmarks
openbench run --agent claude --suite practical
openbench run --agent codex --suite practical

# Run SWE-bench (requires dataset + Docker)
openbench fetch swe-bench
openbench run --agent claude --suite swe-bench

# Generate report
openbench report --format html --input results/<run-id>

Resource requirements for SWE-bench: Docker (images 2-8 GB each), ~50 GB disk for 50 instances, 5-15 min per task, API costs vary by agent.
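
As a rough budget, the per-task figure above can be scaled to a full run. The sketch below assumes sequential execution, that "per task" means one run of one instance, and that the 5-repetition rule from the Benchmark Rules applies; none of these are confirmed by the text, so treat the result as an order-of-magnitude estimate.

```python
INSTANCES = 50          # curated SWE-bench Verified issues in the suite
RUNS_PER_TASK = 5       # assumption: repetitions from the Benchmark Rules
MIN_PER_TASK = (5, 15)  # stated range, in minutes


def total_hours(minutes_per_task: float,
                instances: int = INSTANCES,
                runs: int = RUNS_PER_TASK) -> float:
    """Wall-clock hours for one agent, assuming tasks run one after another."""
    return instances * runs * minutes_per_task / 60


low, high = (total_hours(m) for m in MIN_PER_TASK)
print(f"{low:.1f} to {high:.1f} hours per agent")
# 20.8 to 62.5 hours per agent
```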

CLI Reference

Command	Description
`openbench doctor`	Check environment readiness (agents, tools, Docker)
`openbench list agents`	List registered agents (`claude`, `omc`, `codex`, `omx`)
`openbench list suites`	List available suites (`practical`, `runtime`, `swe-bench`)
`openbench run --agent <a> --suite <s>`	Execute benchmark run
`openbench run --agent <a> --suite <s> --runs 5`	Execute with repeated runs (5 repetitions in this example)
`openbench fetch swe-bench`	Download SWE-bench Verified dataset (500 instances)
`openbench report --format html --input <dir>`	Generate static HTML report
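
The commands in this table can be scripted to benchmark every agent on the same suite. The following is a sketch that uses only the flags listed above; `run_command` and `run_all` are illustrative helper names, and the default dry-run mode only prints the commands instead of executing them.

```python
import subprocess

AGENTS = ["claude", "omc", "codex", "omx"]


def run_command(agent: str, suite: str, runs: int = 5) -> list[str]:
    """Build the argv for one `openbench run` invocation."""
    return ["openbench", "run", "--agent", agent,
            "--suite", suite, "--runs", str(runs)]


def run_all(suite: str = "practical", dry_run: bool = True) -> list[list[str]]:
    """Run (or, in dry-run mode, just print) the same suite for every agent."""
    commands = [run_command(agent, suite) for agent in AGENTS]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all()
```

Call `run_all(dry_run=False)` once `openbench doctor` reports that the agents and tools are available.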

Benchmark Rules

Full specification: docs/specs/benchmark-rules.md

Rule	Detail
Prompting	Neutral task description, no agent-specific optimization
Tests	Hidden — applied only during evaluation, never shown to agents
Repetitions	5 runs per task to reduce run-to-run variance
Metrics	Pass@1, Pass@5, Pass@5 strict — per difficulty level
Token tracking	Input (normalized), cached, output — comparable across providers
Code quality	Not scored — pass/fail by automated tests only (no LLM judge)
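
The three metrics are not defined in this README, so the sketch below rests on assumed definitions: Pass@1 is the mean per-run pass rate, Pass@5 counts a task as solved if any of its 5 runs passes, and Pass@5 strict requires all 5 runs to pass. The authoritative definitions are in docs/specs/benchmark-rules.md; the function and task names here are hypothetical.

```python
def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-task run outcomes into the three metrics (assumed definitions)."""
    n_tasks = len(results)
    return {
        # Mean fraction of passing runs, averaged over tasks.
        "pass@1": sum(sum(runs) / len(runs) for runs in results.values()) / n_tasks,
        # Fraction of tasks where at least one run passed.
        "pass@5": sum(any(runs) for runs in results.values()) / n_tasks,
        # Fraction of tasks where every run passed.
        "pass@5 strict": sum(all(runs) for runs in results.values()) / n_tasks,
    }


# Hypothetical outcomes for three tasks, five runs each.
example = {
    "task-a": [True, True, True, True, True],
    "task-b": [True, False, True, True, False],
    "task-c": [False, False, False, False, False],
}

for metric, value in summarize(example).items():
    print(f"{metric}: {value:.2f}")
# pass@1: 0.53
# pass@5: 0.67
# pass@5 strict: 0.33
```

In this example `task-b` counts toward Pass@5 but not toward Pass@5 strict, which is the kind of flaky behaviour the strict variant is meant to expose. To report per difficulty level, the same function can be applied to the tasks of each level separately.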

How it compares to other benchmarks

	OpenBench	SWE-bench	Aider Polyglot
Measures	Agent CLI frameworks	Model capability	Model code generation
Tasks	Practical + SWE-bench	Real GitHub issues	Exercism problems (6 languages)
Compares	Claude Code vs Codex vs wrappers	Agents on fixed scaffold	Raw model output
Tracks	Correctness + time + tokens + cache	Correctness only	Correctness only
Environment	Same prompt, hidden tests, Docker	Docker per-instance	Docker sandbox

Architecture

openbench/
├── agents/              # Agent adapters (claude, omc, codex, omx)
│   ├── claude_native.py # Claude Code -p (native)
│   ├── omc.py           # oh-my-claudecode (with OMC hooks)
│   ├── codex_native.py  # Codex CLI exec (native)
│   └── omx.py           # oh-my-codex (with OMX wrapper)
├── suites/
│   ├── runtime/         # CLI startup / memory / binary benchmarks
│   ├── practical/       # Self-contained fixture tasks (Tier 0)
│   └── swebench/        # SWE-bench Verified integration (Tier 1)
├── metrics/
│   ├── statistics.py    # Pass@1, Pass@5, Pass@5 strict computation
│   └── store.py         # Result persistence (JSON per-agent per-suite)
├── reporters/
│   ├── html_reporter.py # Static HTML report with tabs + collapsible details
│   ├── parser.py        # JSON artifact → report model
│   └── models.py        # Report data models
├── runner.py            # Execution engine (task loop, workspace management)
└── cli.py               # Click CLI (run, report, fetch, doctor, list)

Requirements

Python 3.11+ with uv
hyperfine, GNU time, du — for runtime suite
docker, git — for SWE-bench suite
Agent CLIs: claude and/or codex/omx

Development

uv sync --python 3.11 --extra dev
uv run ruff check .
uv run python -m pytest -q    # 61 tests

Tests use fixture-backed shim agents — no live API calls needed for CI.

Acknowledgments

SWE-bench by Princeton NLP — the benchmark dataset and Docker evaluation harness that powers our Tier 1 suite
claw-bench — predecessor project focused on CLI runtime overhead; OpenBench continues and expands this work into task-effectiveness benchmarking

License

MIT — see LICENSE.
