Open benchmarking framework for coding-agent CLIs.
Same tasks. Same environment. Comparable results.
Live Report — interactive results with per-task token breakdown
- 2026-04-13 — SWE-bench Verified integration: claude passes 3/3 Hard tasks (Django, Sympy)
- 2026-04-13 — Token tracking with input/cached/output breakdown for all 4 agents
- 2026-04-12 — First real benchmark run: claude, omc, codex, omx on 5 Practical tasks
- 2026-04-12 — Benchmark rules spec finalized: hidden tests, 5-run repetitions, per-difficulty reporting
| Agent | Type | Time | Tokens (in) | Tokens (out) |
|---|---|---|---|---|
| claude | Claude Code (native) | 93s | 834k | 2.4k |
| omc | Claude + OMC plugins | 116s | 880k | 2.8k |
| codex | Codex CLI (native) | 192s | 767k | 7.6k |
| omx | Codex + OMX wrapper | 279s | 1.9M | 10.4k |
| Agent | Pass | Time | Tasks |
|---|---|---|---|
| claude | 3/3 | 377s | django-17087, django-16493, sympy-24213 |
| omc | 2/3 | 271s | pending re-run with patch fix |
Wrapper overhead: OMC adds ~25% time vs native claude. OMX adds ~45% vs native codex. Token counts are normalized (input includes cached). See live report for full breakdown.
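The overhead percentages follow directly from the timing table; a quick sketch of the arithmetic:

```python
def overhead_pct(wrapper_seconds: float, native_seconds: float) -> float:
    """Relative slowdown of a wrapper vs. its native CLI, in percent."""
    return (wrapper_seconds / native_seconds - 1) * 100

# Figures from the Practical-suite table above:
print(round(overhead_pct(116, 93)))   # omc vs claude -> 25
print(round(overhead_pct(279, 192)))  # omx vs codex  -> 45
```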
| Suite | Tier | Tasks | What it tests |
|---|---|---|---|
| Practical | 0 | 5 self-contained fixtures | Bug fixes, feature additions, refactoring — fast CI validation |
| SWE-bench | 1 | 50 curated SWE-bench Verified issues | Production-grade coding on Django, Sympy, scikit-learn, matplotlib, Flask, Requests |
| Runtime | — | 3 CLI metrics | Startup latency, peak memory, binary size |
SWE-bench (ICLR 2024 Oral) is a benchmark for evaluating LLMs on real-world GitHub issues. Given a codebase and an issue description, the model generates a patch to resolve the problem.
OpenBench integrates SWE-bench Verified — a human-validated subset of 500 problems confirmed solvable by real software engineers (report). We use the official Docker-based evaluation harness for reproducible per-instance evaluation, while adding agent CLI comparison (time, tokens, cache) on top of the standard pass/fail grading.
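The pass/fail grading follows the standard SWE-bench scheme: an instance is resolved only if the issue's originally failing tests now pass and previously passing tests do not regress. A minimal sketch — the `FAIL_TO_PASS`/`PASS_TO_PASS` field names come from the SWE-bench dataset, while the `test_results` mapping is a hypothetical stand-in for harness output:

```python
def is_resolved(instance: dict, test_results: dict[str, bool]) -> bool:
    """SWE-bench-style grading: all FAIL_TO_PASS tests must now pass,
    and every PASS_TO_PASS test must still pass."""
    fixed = all(test_results.get(t, False) for t in instance["FAIL_TO_PASS"])
    stable = all(test_results.get(t, False) for t in instance["PASS_TO_PASS"])
    return fixed and stable

inst = {"FAIL_TO_PASS": ["test_bug"], "PASS_TO_PASS": ["test_ok"]}
print(is_resolved(inst, {"test_bug": True, "test_ok": True}))   # True
print(is_resolved(inst, {"test_bug": True, "test_ok": False}))  # False: regression
```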
| Agent | CLI | What it is |
|---|---|---|
| claude | `claude -p` | Claude Code native — no plugins, no hooks |
| omc | `claude -p` | oh-my-claudecode — Claude Code + OMC orchestration layer |
| codex | `codex exec` | OpenAI Codex CLI native |
| omx | `omx exec` | oh-my-codex — Codex CLI + OMX wrapper |
```shell
# Install
uv sync --python 3.11

# Check environment
openbench doctor

# Run practical benchmarks
openbench run --agent claude --suite practical
openbench run --agent codex --suite practical

# Run SWE-bench (requires dataset + Docker)
openbench fetch swe-bench
openbench run --agent claude --suite swe-bench

# Generate report
openbench report --format html --input results/<run-id>
```

Resource requirements for SWE-bench: Docker (images 2-8 GB each), ~50 GB disk for 50 instances, 5-15 min per task, API costs vary by agent.
| Command | Description |
|---|---|
| `openbench doctor` | Check environment readiness (agents, tools, Docker) |
| `openbench list agents` | List registered agents (claude, omc, codex, omx) |
| `openbench list suites` | List available suites (practical, runtime, swe-bench) |
| `openbench run --agent <a> --suite <s>` | Execute benchmark run |
| `openbench run --agent <a> --suite <s> --runs 5` | Execute with N repetitions |
| `openbench fetch swe-bench` | Download SWE-bench Verified dataset (500 instances) |
| `openbench report --format html --input <dir>` | Generate static HTML report |
Full specification: docs/specs/benchmark-rules.md
| Rule | Detail |
|---|---|
| Prompting | Neutral task description, no agent-specific optimization |
| Tests | Hidden — applied only during evaluation, never shown to agents |
| Repetitions | 5 runs per task for statistical significance |
| Metrics | Pass@1, Pass@5, Pass@5 strict — per difficulty level |
| Token tracking | Input (normalized), cached, output — comparable across providers |
| Code quality | Not scored — pass/fail by automated tests only (no LLM judge) |
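With 5 repetitions per task, the three metrics can be computed roughly as below. This is a sketch under assumed definitions (Pass@1 scores the first run, Pass@5 credits a task if any run passed, and the strict variant requires all runs to pass); the project's actual computation lives in `metrics/statistics.py`.

```python
def pass_metrics(runs_per_task: list[list[bool]]) -> dict[str, float]:
    """runs_per_task[i] holds the pass/fail outcome of each repetition of task i."""
    n = len(runs_per_task)
    return {
        "pass@1": sum(r[0] for r in runs_per_task) / n,
        "pass@5": sum(any(r) for r in runs_per_task) / n,
        "pass@5_strict": sum(all(r) for r in runs_per_task) / n,
    }

runs = [[True] * 5, [False, True, True, False, True], [False] * 5]
print(pass_metrics(runs))  # pass@1 = 1/3, pass@5 = 2/3, pass@5_strict = 1/3
```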
| | OpenBench | SWE-bench | Aider Polyglot |
|---|---|---|---|
| Measures | Agent CLI frameworks | Model capability | Model code generation |
| Tasks | Practical + SWE-bench | Real GitHub issues | Exercism problems (6 languages) |
| Compares | Claude Code vs Codex vs wrappers | Agents on fixed scaffold | Raw model output |
| Tracks | Correctness + time + tokens + cache | Correctness only | Correctness only |
| Environment | Same prompt, hidden tests, Docker | Docker per-instance | Docker sandbox |
```
openbench/
├── agents/              # Agent adapters (claude, omc, codex, omx)
│   ├── claude_native.py # Claude Code -p (native)
│   ├── omc.py           # oh-my-claudecode (with OMC hooks)
│   ├── codex_native.py  # Codex CLI exec (native)
│   └── omx.py           # oh-my-codex (with OMX wrapper)
├── suites/
│   ├── runtime/         # CLI startup / memory / binary benchmarks
│   ├── practical/       # Self-contained fixture tasks (Tier 0)
│   └── swebench/        # SWE-bench Verified integration (Tier 1)
├── metrics/
│   ├── statistics.py    # Pass@1, Pass@5, Pass@5 strict computation
│   └── store.py         # Result persistence (JSON per-agent per-suite)
├── reporters/
│   ├── html_reporter.py # Static HTML report with tabs + collapsible details
│   ├── parser.py        # JSON artifact → report model
│   └── models.py        # Report data models
├── runner.py            # Execution engine (task loop, workspace management)
└── cli.py               # Click CLI (run, report, fetch, doctor, list)
```
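The layout above implies one JSON artifact per agent/suite pair. A hedged sketch of what such a record could look like — the field names here are purely illustrative, not the actual `store.py` schema:

```python
import json

# Hypothetical per-run result record; the real schema lives in metrics/store.py.
record = {
    "agent": "claude",
    "suite": "practical",
    "task": "fixture-01",   # illustrative task id
    "run": 1,
    "passed": True,
    "wall_time_s": 93.0,
    "tokens": {"input": 834_000, "cached": 0, "output": 2_400},
}

serialized = json.dumps(record, indent=2)
print(json.loads(serialized)["tokens"]["output"])  # 2400
```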
- Python 3.11+ with `uv`
- `hyperfine`, GNU `time`, `du` — for runtime suite
- `docker`, `git` — for SWE-bench suite
- Agent CLIs: `claude` and/or `codex`/`omx`
```shell
uv sync --python 3.11 --extra dev
uv run ruff check .
uv run python -m pytest -q  # 61 tests
```

Tests use fixture-backed shim agents — no live API calls needed for CI.
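A fixture-backed shim can be as simple as an adapter that replays a canned transcript instead of shelling out to a live CLI. The interface below is assumed for illustration, not OpenBench's actual adapter API:

```python
class ShimAgent:
    """Replays canned output so suite and reporter tests run without API calls."""

    def __init__(self, canned_patch: str, tokens_in: int = 0, tokens_out: int = 0):
        self.canned_patch = canned_patch
        self.tokens_in = tokens_in
        self.tokens_out = tokens_out

    def run(self, task_prompt: str) -> dict:
        # The prompt is ignored on purpose: determinism is the point in CI.
        return {
            "patch": self.canned_patch,
            "tokens": {"input": self.tokens_in, "output": self.tokens_out},
        }

agent = ShimAgent("diff --git a/f.py b/f.py", tokens_in=10, tokens_out=3)
print(agent.run("fix the bug")["tokens"]["output"])  # 3
```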
- SWE-bench by Princeton NLP — the benchmark dataset and Docker evaluation harness that powers our Tier 1 suite
- claw-bench — predecessor project focused on CLI runtime overhead; OpenBench continues and expands this work into task-effectiveness benchmarking
MIT — see LICENSE.