Minecraft is the measuring instrument, not the point. MineBench drops a language model into a live Minecraft world with nothing but a set of tool calls, gives it a goal ("make a stone pickaxe from scratch"), and watches what it actually does, then scores it on six transferable dimensions of agentic skill. Same world, same tasks, any model. The result is a deterministic capability profile you can trust, not a vibe check.
Why Minecraft? Long-horizon goals, deep crafting dependencies, an open world that changes under the agent, and verifiable end-states (you either hold the pickaxe or you don't). It is the perfect proving ground for planning, tool use, and recovery: the skills that matter for any agent.
A live, self-hosting dashboard turns every run into a deterministic scorecard, a head-to-head leaderboard, and a step-by-step replay of the agent's reasoning.
Models are ranked by success rate and average score, then broken down across the six agentic dimensions, so you can see where one model beats another, not just that it does.
Every task-vs-model cell is colour-coded by outcome and clickable to drill into the run. Below it, a sortable history of every run with score, progress, steps, and errors.
Watch the agent think in real time. Every tool call carries a one-sentence rationale, so you see the model plan backward from the goal, gather wood, and recover when a drop is left on the ground.
| | |
|---|---|
| 🎯 A capability profile, not one number | Six deterministic dimensions instead of a single win/lose scalar, so you learn what a model is good at. |
| 🧮 Deterministic & unbiased scoring | No LLM judge. No elapsed-time bias (that just measures latency). Every score is a pure function of the run trace. |
| 🪪 Verified outcomes, never self-reported | Success is detected by the harness from real world state; the model saying "done" proves nothing. |
| Any model, one interface | Azure OpenAI deployments and 20+ models via the GitHub Copilot API, swapped with a single flag. |
| ⚔️ Head-to-head mode | Run two models in isolated same-seed worlds (or one shared world) and compare capability profiles side by side. |
| 📺 Live, zero-setup dashboard | Auto-launches the Minecraft server, streams the agent's thoughts, and renders the leaderboard, all from one command. |
Each is a deterministic function of the run trace. A dimension is null (excluded from the
average) when a run never exercised it, which keeps the benchmark unbiased.
| Dimension | What it measures |
|---|---|
| Completion | How far down the task's dependency chain the agent got (milestone progress). |
| Planning | Did it pursue prerequisites before dependents? (no premature attempts) |
| Tool use | Valid actions that respect preconditions (1 − self-inflicted errors). |
| Adaptation | After a self-caused failure, does the next action change? (not looping) |
| Robustness | Recovery after an external disturbance (e.g. another bot grabs your resource). |
| Efficiency | Productive-action ratio: not duration, not raw step count. |
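The null-exclusion rule can be illustrated with a small sketch. The property names, the 0 to 1 score scale, and the `averageScore` helper are assumptions made for this example, not MineBench's actual scorer API.

```javascript
// Hypothetical property names for the six dimensions in the table above.
const DIMENSIONS = ['completion', 'planning', 'toolUse', 'adaptation', 'robustness', 'efficiency'];

// Average only the dimensions the run exercised. A null dimension is
// skipped instead of being counted as zero.
function averageScore(profile) {
  const scored = DIMENSIONS
    .map((name) => profile[name])
    .filter((value) => value !== null && value !== undefined);
  if (scored.length === 0) return null; // nothing was exercised
  const total = scored.reduce((sum, value) => sum + value, 0);
  return total / scored.length;
}

// Example profile (assumed 0..1 scale per dimension).
const profile = {
  completion: 1.0,
  planning: 0.5,
  toolUse: 1.0,
  adaptation: null, // no self-caused failure occurred in this run
  robustness: null, // no external disturbance occurred in this run
  efficiency: 0.5,
};

console.log(averageScore(profile)); // 0.75
```

Treating the two null dimensions as zero would drag the same run down to 0.5, which is the bias the null rule avoids.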
Errors are treated as diagnostics, not blunt penalties. A failed tool call might be exploration, or the world changing under the agent, so MineBench only ever penalises looping (repeating an action that changed nothing), never honest, isolated failures.
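One way to tell looping apart from an isolated failure is sketched below. The trace step shape (`tool`, `args`, `changedWorld`) and the `countLoopingSteps` helper are hypothetical; the real scorer may define looping differently.

```javascript
// Hypothetical trace step shape: { tool, args, changedWorld }.
function actionKey(step) {
  return `${step.tool}:${JSON.stringify(step.args)}`;
}

// Count steps that repeat the immediately preceding action after that
// action changed nothing. A single isolated failure is never counted.
function countLoopingSteps(trace) {
  let loops = 0;
  for (let i = 1; i < trace.length; i++) {
    const prev = trace[i - 1];
    const curr = trace[i];
    if (!prev.changedWorld && actionKey(prev) === actionKey(curr)) {
      loops += 1;
    }
  }
  return loops;
}

const trace = [
  { tool: 'mine', args: { block: 'stone' }, changedWorld: false },   // isolated failure
  { tool: 'craft', args: { item: 'wooden_pickaxe' }, changedWorld: true },
  { tool: 'mine', args: { block: 'stone' }, changedWorld: true },
  { tool: 'craft', args: { item: 'furnace' }, changedWorld: false }, // first failure
  { tool: 'craft', args: { item: 'furnace' }, changedWorld: false }, // loop
  { tool: 'craft', args: { item: 'furnace' }, changedWorld: false }, // loop
];

console.log(countLoopingSteps(trace)); // 2
```

The failed `mine` at the start is followed by a different action, so it counts as exploration; only the two repeated `craft` calls are flagged.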

                                        ┌─────────────────────────────┐
    Minecraft world ──observation──▶│    AGENT (swappable LLM)    │
         ▲                          │   brain · skills · models   │
         │                          └──────────────┬──────────────┘
         │                                         │ one tool call / step
         │                                         ▼
    ┌────┴───────────┐   execute    ┌─────────────────────────────┐
    │    HARNESS     │◀─────────────│ TOOL: mine / craft / move / │
    │  runner · env  │   + verify   │       smelt / look_around…  │
    └────┬───────────┘              └─────────────────────────────┘
         │ trace
         ▼
    ┌────────────────┐    score     ┌─────────────────────────────┐
    │    SCORING     │─────────────▶│ DASHBOARD (live + history)  │
    │  scorer · DAG  │   profile    │ leaderboard · matrix · trace│
    └────────────────┘              └─────────────────────────────┘

The loop, every step: buildObservation → agent.act → executeAction → record → checkSuccess.
The agent perceives the world only through a structured JSON observation (position, inventory,
surroundings, a coordinate radar of nearby resources), never raw pixels, and acts only through
tool calls. The harness owns success detection; the scorer turns the trace into a capability
profile; the dashboard renders it.
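The per-step loop can be sketched roughly as follows. The function names come from the loop description above, but their signatures, the `runEpisode` wrapper, and the toy world are assumptions so the example runs without a Minecraft server.

```javascript
// Sketch of the loop: buildObservation -> agent.act -> executeAction -> record -> checkSuccess.
// All signatures here are assumed for illustration.
async function runEpisode({ world, agent, task, buildObservation, executeAction, checkSuccess }) {
  const trace = [];
  for (let step = 0; step < task.max_steps; step++) {
    const observation = buildObservation(world);        // structured state, never pixels
    const action = await agent.act(observation);        // exactly one tool call per step
    const result = await executeAction(world, action);  // the harness executes the tool
    trace.push({ step, action, result });               // record for the scorer
    if (checkSuccess(world, task)) {
      return { success: true, trace };                  // verified from world state
    }
  }
  return { success: false, trace };                     // step budget exhausted
}

// Toy stand-ins: a fake world and an agent that always mines oak logs.
const deps = {
  world: { inventory: {} },
  agent: { act: async () => ({ tool: 'mine', args: { block: 'oak_log' } }) },
  task: { max_steps: 5 },
  buildObservation: (w) => ({ inventory: { ...w.inventory } }),
  executeAction: async (w, action) => {
    const item = action.args.block;
    w.inventory[item] = (w.inventory[item] || 0) + 1;
    return { ok: true };
  },
  checkSuccess: (w) => (w.inventory.oak_log || 0) >= 3,
};

runEpisode(deps).then(({ success, trace }) => {
  console.log(success, trace.length); // true 3
});
```

Keeping `checkSuccess` outside the agent is the point of the design: the loop ends when the world says so, not when the model does.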

    agent/              The decision-maker
      brain.js          System prompt + conversation loop (turns an observation into one action)
      skills.js         In-world tools (mine, craft, smelt, place, look_around, read_data…) + navigation
      observation.js    Turns the live world into the structured state the model perceives
      models/           Swappable model adapters (Azure OpenAI + GitHub Copilot)
    harness/            The runner: applies a task's setup, drives the loop, detects success from world state
    scoring/            The judge: milestone DAG → six-dimension capability profile (deterministic)
    dashboard/          Live + historical web UI (leaderboard, task×model matrix, step-by-step replay)
    tasks/              Benchmark tasks as JSON (goal, setup, success spec, milestone graph)

A task owns its goal, world setup, a verifiable success spec, and a milestone dependency graph (a DAG, so any valid solution path scores fairly via backward entailment). Adding a benchmark is a data change, not a code change; an example task definition appears at the end of this document.
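Backward entailment over a milestone DAG might look like the sketch below: reaching a milestone implies every prerequisite was reached, whether or not the harness saw it happen. The graph encoding, the milestone ids, and the "fraction of milestones reached" formula are assumptions for illustration only.

```javascript
// Hypothetical milestone graph: each milestone lists its direct prerequisites.
const milestones = {
  oak_log: [],
  planks: ['oak_log'],
  sticks: ['planks'],
  crafting_table: ['planks'],
  wooden_pickaxe: ['sticks', 'crafting_table'],
  cobblestone: ['wooden_pickaxe'],
  stone_pickaxe: ['cobblestone', 'sticks', 'crafting_table'],
};

// Backward entailment: an observed milestone implies all of its ancestors.
function entailedMilestones(graph, observed) {
  const reached = new Set();
  const stack = [...observed];
  while (stack.length > 0) {
    const id = stack.pop();
    if (reached.has(id)) continue;
    reached.add(id);
    stack.push(...(graph[id] || []));
  }
  return reached;
}

// Assumed completion formula: reached milestones / total milestones.
function completion(graph, observed) {
  return entailedMilestones(graph, observed).size / Object.keys(graph).length;
}

// Only cobblestone was directly observed; its five ancestors are entailed.
const reached = entailedMilestones(milestones, ['cobblestone']);
console.log(reached.size, 'of', Object.keys(milestones).length); // 6 of 7
```

Because credit flows backward from what was observed, an agent that takes an unusual but valid route to cobblestone earns the same completion credit as one that follows the expected order.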
Prerequisites: Node.js, a Java Minecraft server jar (auto-managed), and model credentials:
`AZURE_OPENAI_*` in `.env` for Azure deployments, or `COPILOT_TOKEN` for Copilot models.
The dashboard auto-starts the Minecraft server, lets you pick a task + model, and streams the run
live. No manual server setup, no /op by hand.

    npm install
    npm run dashboard   # → http://localhost:8099, click Start

    # Single run (server auto-starts; world is reused next time)
    npm run bench -- --task stone_pickaxe --model copilot/gpt-5.4 --verbose
    # Free-form ad-hoc goal (no auto-scoring)
    npm run bench -- --goal "Mine 3 oak_log" --model gpt-4o
    # Head-to-head: two models, isolated same-seed worlds
    npm run bench -- --task stone_pickaxe --model-a copilot/gpt-5.4 --model-b copilot/claude-opus-4.8

    npm test   # deterministic scorer self-tests (no network, no live bot)

Full command reference and flags: commands.md · Architecture deep-dive: context.md.
Tasks span a difficulty ramp from a one-resource smoke test to a deep, multi-stage tech tree, each chosen to stress a different agentic muscle.
| Difficulty | Task | Tests |
|---|---|---|
| 1 | `gather_wood` | Navigation + find/mine a single resource (smoke test) |
| 1 | `obtain_beef` / `obtain_chicken` / `obtain_mutton` | Mob perception + combat |
| 2 | `kill_bot_a` / `kill_bot_b` | PvP duel (scored from the server's real death packet) |
| 3 | `make_bed` | Multi-resource gathering + crafting |
| 3 | `stone_pickaxe` | Long-horizon tool-tier reasoning + crafting dependencies |
| 5 | `iron_pickaxe` | Very long-horizon: mining, furnace, smelting, multi-stage crafting |
| 6 | `gold_ingot` | Deep tech-tree resource pipeline |
Every task is auto-scored against its success spec: success is read from real world state, never from the model's own claim.
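A success check of this kind could be sketched as below. The spec shape mirrors the `"success": { "inventory": { ... } }` field of the task definition in this README; the inventory format (an array of `{ name, count }` stacks, similar to what a mineflayer bot's inventory items expose) and the standalone `checkSuccess` signature are assumptions.

```javascript
// Compare a task's success spec against the real inventory.
// successSpec:      { inventory: { <item name>: <minimum count> } }
// inventoryStacks:  assumed array of { name, count } item stacks.
function checkSuccess(successSpec, inventoryStacks) {
  // Sum counts per item name, since an item may be split across stacks.
  const counts = {};
  for (const { name, count } of inventoryStacks) {
    counts[name] = (counts[name] || 0) + count;
  }
  // Every required item must be present in at least the required quantity.
  return Object.entries(successSpec.inventory || {})
    .every(([item, required]) => (counts[item] || 0) >= required);
}

const spec = { inventory: { stone_pickaxe: 1 } };

// The model may claim it is done, but the inventory only holds cobblestone.
console.log(checkSuccess(spec, [{ name: 'cobblestone', count: 3 }]));   // false
console.log(checkSuccess(spec, [{ name: 'stone_pickaxe', count: 1 }])); // true
```

Nothing in the check reads the model's output, so a premature "done" cannot change the result.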
Node.js · mineflayer (Minecraft bot protocol) · mineflayer-pathfinder (navigation) · minecraft-data (recipes/loot ground truth) · Azure OpenAI + GitHub Copilot model APIs · zero-dependency vanilla-JS dashboard.
MineBench treats Minecraft as a proxy for general agentic capability. The six dimensions (planning, tool use, adaptation, robustness, efficiency, completion) are exactly the skills a model needs to operate autonomously anywhere. As models improve, the tasks deepen; the measuring instrument stays honest, deterministic, and transparent.
Built for the hackathon. Minecraft is just the instrument; the score is the story. 🟩



Example task definition (`stone_pickaxe`):

    {
      "id": "stone_pickaxe",
      "goal": "Make a stone_pickaxe from scratch.",
      "difficulty": 3,
      "max_steps": 60,
      "success": { "inventory": { "stone_pickaxe": 1 } },
      "milestones": [ /* wood → planks → sticks → table → wooden pickaxe → cobblestone → stone pickaxe */ ]
    }