bodanp/MineBench


🟩 MineBench

A Minecraft benchmark for general agentic capability in LLMs

Minecraft is the measuring instrument, not the point. MineBench drops a language model into a live Minecraft world with nothing but a set of tool calls, gives it a goal ("make a stone pickaxe from scratch"), and watches what it actually does. It then scores the run on six transferable dimensions of agentic skill. Same world, same tasks, any model. The result is a deterministic capability profile you can trust, not a vibe check.

Why Minecraft? Long-horizon goals, deep crafting dependencies, an open world that changes under the agent, and verifiable end-states (you either hold the pickaxe or you don't). That makes it a strong proving ground for planning, tool use, and recovery: the skills that matter for any agent.


📊 See it in action

A live, self-hosting dashboard turns every run into a deterministic scorecard, a head-to-head leaderboard, and a step-by-step replay of the agent's reasoning.

Leaderboard & capability profile

Models are ranked by success rate and average score, then broken down across the six agentic dimensions, so you can see where one model beats another, not just that it does.

[Screenshot: MineBench dashboard showing the overview, leaderboard, and per-model capability profile]
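As a rough illustration of the ranking described above, the sketch below aggregates runs per model, sorts by success rate, and breaks ties on average score. The run record fields (`model`, `success`, `score`), the function name `rankModels`, and the tie-break order are assumptions made for this example, not MineBench's actual schema.

```javascript
// Hypothetical run records: { model, success, score }. Field names are assumptions.
function rankModels(runs) {
  const byModel = new Map();
  for (const run of runs) {
    const agg = byModel.get(run.model) ?? { model: run.model, runs: 0, wins: 0, scoreSum: 0 };
    agg.runs += 1;
    agg.wins += run.success ? 1 : 0;
    agg.scoreSum += run.score;
    byModel.set(run.model, agg);
  }
  return [...byModel.values()]
    .map((a) => ({
      model: a.model,
      successRate: a.wins / a.runs,
      avgScore: a.scoreSum / a.runs,
    }))
    // Higher success rate first; average score only breaks ties (assumed order).
    .sort((x, y) => y.successRate - x.successRate || y.avgScore - x.avgScore);
}

const board = rankModels([
  { model: 'model-a', success: true, score: 0.9 },
  { model: 'model-a', success: false, score: 0.5 },
  { model: 'model-b', success: true, score: 0.8 },
  { model: 'model-b', success: true, score: 0.6 },
]);
console.log(board.map((row) => row.model)); // [ 'model-b', 'model-a' ]
```

Here both models have a similar average score, so the ordering is decided by success rate alone.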

Task × model matrix & full run history

Every task-vs-model cell is colour-coded by outcome and clickable to drill into the run. Below it, a sortable history of every run with score, progress, steps, and errors.

[Screenshot: MineBench dashboard showing the task × model matrix and run history]

Live agent reasoning

Watch the agent think in real time. Every tool call carries a one-sentence rationale, so you see the model plan backward from the goal, gather wood, and recover when a drop is left on the ground.

[Screenshot: MineBench live dashboard showing the agent's step-by-step thoughts and actions]


⭐ What makes MineBench different

🎯 A capability profile, not one number. Six deterministic dimensions instead of a single win/lose scalar, so you learn what a model is good at.
🧮 Deterministic & unbiased scoring. No LLM judge. No elapsed-time bias (that just measures latency). Every score is a pure function of the run trace.
🪪 Verified outcomes, never self-reported. Success is detected by the harness from real world state; the model saying "done" proves nothing.
🌐 Any model, one interface. Azure OpenAI deployments and 20+ models via the GitHub Copilot API, swapped with a single flag.
⚔️ Head-to-head mode. Run two models in isolated same-seed worlds (or one shared world) and compare capability profiles side by side.
📺 Live, zero-setup dashboard. Auto-launches the Minecraft server, streams the agent's thoughts, and renders the leaderboard, all from one command.

🧠 The six capability dimensions

Each is a deterministic function of the run trace. A dimension is null (excluded from the average) when a run never exercised it, which keeps the benchmark unbiased.

| Dimension | What it measures |
| --- | --- |
| Completion | How far down the task's dependency chain the agent got (milestone progress). |
| Planning | Did it pursue prerequisites before dependents? (no premature attempts) |
| Tool use | Valid actions that respect preconditions (1 − self-inflicted errors). |
| Adaptation | After a self-caused failure, does the next action change? (not looping) |
| Robustness | Recovery after an external disturbance (e.g. another bot grabs your resource). |
| Efficiency | Productive-action ratio: not duration, not raw step count. |
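To make the null handling concrete, here is a minimal sketch of averaging a profile while skipping dimensions a run never exercised. It assumes each dimension is a number in [0, 1] or `null` and that the overall score is an unweighted mean; the README does not confirm the weighting, and `averageProfile` and the property names are hypothetical.

```javascript
// Sketch only: assumes an unweighted mean over the dimensions that were exercised.
function averageProfile(profile) {
  const exercised = Object.values(profile).filter((value) => value !== null);
  if (exercised.length === 0) return null; // nothing was measured in this run
  return exercised.reduce((sum, value) => sum + value, 0) / exercised.length;
}

const exampleProfile = {
  completion: 1,
  planning: 1,
  toolUse: 0.5,
  adaptation: 0.5,
  robustness: null, // no external disturbance occurred, so it is excluded
  efficiency: 0.75,
};
console.log(averageProfile(exampleProfile)); // 0.75
```

Treating the missing dimension as 0 instead would drag this run down to 0.625 for something the run never had a chance to demonstrate.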

Errors are treated as diagnostics, not blunt penalties. A failed tool call might be exploration, or the world changing under the agent, so MineBench only ever penalises looping (repeating an action that changed nothing), never honest, isolated failures.
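One way to read the adaptation and looping rules is sketched below: after each self-caused failure, check whether the very next action differs from the failed one. The trace entry shape (`tool`, `args`, `ok`, `selfCaused`) and the function name `adaptationScore` are assumptions for illustration, not the scorer's real data model.

```javascript
// Hypothetical trace entries: { tool, args, ok, selfCaused }.
function adaptationScore(trace) {
  let failures = 0;
  let changed = 0;
  for (let i = 0; i < trace.length - 1; i++) {
    const step = trace[i];
    if (step.ok || !step.selfCaused) continue; // only self-caused failures count
    failures += 1;
    const next = trace[i + 1];
    const repeated =
      next.tool === step.tool &&
      JSON.stringify(next.args) === JSON.stringify(step.args);
    if (!repeated) changed += 1; // the agent tried something different
  }
  // null means the dimension was never exercised (no self-caused failure with a follow-up).
  return failures === 0 ? null : changed / failures;
}

const exampleTrace = [
  { tool: 'craft', args: { item: 'stick' }, ok: false, selfCaused: true }, // no planks yet
  { tool: 'craft', args: { item: 'stick' }, ok: false, selfCaused: true }, // same call again: looping
  { tool: 'craft', args: { item: 'oak_planks' }, ok: true, selfCaused: false },
  { tool: 'craft', args: { item: 'stick' }, ok: true, selfCaused: false },
];
console.log(adaptationScore(exampleTrace)); // 0.5
```

The first failure is followed by an identical call (looping), the second by a different one (adaptation), so half of the failures led to a change.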


🏗️ How it works

                                    ┌─────────────────────────────┐
   Minecraft world  ──observation──▶│  AGENT (swappable LLM)      │
        ▲                           │  brain · skills · models    │
        │                           └──────────────┬──────────────┘
        │                                          │ one tool call / step
        │                                          ▼
   ┌────┴───────────┐   execute    ┌──────────────────────────────┐
   │  HARNESS       │◀─────────────│  TOOL: mine / craft / move / │
   │  runner · env  │   + verify   │        smelt / look_around…  │
   └────┬───────────┘              └──────────────────────────────┘
        │ trace
        ▼
   ┌────────────────┐    score     ┌──────────────────────────────┐
   │  SCORING       │─────────────▶│  DASHBOARD (live + history)  │
   │  scorer · DAG  │   profile    │  leaderboard · matrix · trace│
   └────────────────┘              └──────────────────────────────┘

The loop, every step: buildObservation → agent.act → executeAction → record → checkSuccess. The agent perceives the world only through a structured JSON observation (position, inventory, surroundings, a coordinate radar of nearby resources), never raw pixels, and acts only through tool calls. The harness owns success detection; the scorer turns the trace into a capability profile; the dashboard renders it.
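The loop above can be sketched roughly as follows. The function names come from the README, but their signatures, the toy in-memory world, and the scripted stand-in agent are all assumptions made so the example runs without a Minecraft server or a model.

```javascript
// Sketch of the per-step loop; signatures are assumed, not taken from the harness.
async function runEpisode({ world, agent, task }) {
  const trace = [];
  for (let step = 1; step <= task.max_steps; step++) {
    const observation = buildObservation(world);
    const action = await agent.act(observation); // exactly one tool call per step
    const result = executeAction(world, action);
    trace.push({ step, action, result }); // record
    if (checkSuccess(world, task)) return { success: true, trace };
  }
  return { success: false, trace };
}

// Toy observation: the real one also carries surroundings and a resource radar.
function buildObservation(world) {
  return { position: world.position, inventory: { ...world.inventory } };
}

// Toy executor that only understands a "mine" tool.
function executeAction(world, action) {
  if (action.tool !== 'mine') return { ok: false, error: `unknown tool: ${action.tool}` };
  const item = action.args.block;
  world.inventory[item] = (world.inventory[item] ?? 0) + 1;
  return { ok: true };
}

// Success is read from world state, never from anything the agent says.
function checkSuccess(world, task) {
  return Object.entries(task.success.inventory)
    .every(([item, count]) => (world.inventory[item] ?? 0) >= count);
}

// Hypothetical stand-in for a model adapter: always mines oak_log.
const scriptedAgent = {
  async act(_observation) {
    return { tool: 'mine', args: { block: 'oak_log' }, rationale: 'Need wood first.' };
  },
};

const demoWorld = { position: { x: 0, y: 64, z: 0 }, inventory: {} };
const demoTask = { max_steps: 10, success: { inventory: { oak_log: 3 } } };
runEpisode({ world: demoWorld, agent: scriptedAgent, task: demoTask })
  .then(({ success, trace }) => console.log(success, trace.length)); // true 3
```

Keeping `checkSuccess` outside the agent is the design point: the agent only proposes actions, and the harness alone decides when the episode has been won.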

Project layout

agent/        The decision-maker
  brain.js        System prompt + conversation loop (turns an observation into one action)
  skills.js       In-world tools (mine, craft, smelt, place, look_around, read_data…) + navigation
  observation.js  Turns the live world into the structured state the model perceives
  models/         Swappable model adapters (Azure OpenAI + GitHub Copilot)
harness/      The runner: applies a task's setup, drives the loop, detects success from world state
scoring/      The judge: milestone DAG → six-dimension capability profile (deterministic)
dashboard/    Live + historical web UI (leaderboard, task×model matrix, step-by-step replay)
tasks/        Benchmark tasks as JSON (goal, setup, success spec, milestone graph)

Tasks are declarative JSON

A task owns its goal, world setup, a verifiable success spec, and a milestone dependency graph (a DAG, so any valid solution path scores fairly via backward entailment). Adding a benchmark is a data change, not a code change:

{
  "id": "stone_pickaxe",
  "goal": "Make a stone_pickaxe from scratch.",
  "difficulty": 3,
  "max_steps": 60,
  "success": { "inventory": { "stone_pickaxe": 1 } },
  "milestones": [ /* wood → planks → sticks → table → wooden pickaxe → cobblestone → stone pickaxe */ ]
}
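The milestone schema is elided above, so the following is only a sketch of how completion over a milestone DAG might work under one plausible reading of "backward entailment": reaching a milestone also credits every prerequisite behind it, whichever path the agent took. The `{ id, requires }` shape, the dependency edges, and the `completion` function are hypothetical.

```javascript
// Hypothetical milestone shape { id, requires: [...] }; the real schema is not shown in the README.
const exampleMilestones = [
  { id: 'wood', requires: [] },
  { id: 'planks', requires: ['wood'] },
  { id: 'sticks', requires: ['planks'] },
  { id: 'table', requires: ['planks'] },
  { id: 'wooden_pickaxe', requires: ['sticks', 'table'] },
  { id: 'cobblestone', requires: ['wooden_pickaxe'] },
  { id: 'stone_pickaxe', requires: ['cobblestone', 'sticks'] },
];

// Completion = fraction of milestones credited, where an observed milestone
// also credits all of its ancestors in the DAG (assumed meaning of backward entailment).
function completion(milestones, observed) {
  const byId = new Map(milestones.map((m) => [m.id, m]));
  const credited = new Set();
  const visit = (id) => {
    if (credited.has(id) || !byId.has(id)) return;
    credited.add(id);
    byId.get(id).requires.forEach(visit);
  };
  observed.forEach(visit);
  return credited.size / milestones.length;
}

// Holding a wooden pickaxe implies wood, planks, sticks, and a table: 5 of 7 milestones.
console.log(completion(exampleMilestones, ['wooden_pickaxe']) === 5 / 7); // true
```

With this reading, an agent that was handed planks by the task setup and never chopped a tree still gets credit for the wood milestone once it holds something that required wood.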

🚀 Quick start

Prerequisites: Node.js, a Java Minecraft server jar (auto-managed), and model credentials: AZURE_OPENAI_* in .env for Azure deployments, or COPILOT_TOKEN for Copilot models.

Launch the dashboard (recommended)

The dashboard auto-starts the Minecraft server, lets you pick a task + model, and streams the run live. No manual server setup, no /op by hand.

npm install
npm run dashboard          # → http://localhost:8099, click Start

Or run from the CLI

# Single run (server auto-starts; world is reused next time)
npm run bench -- --task stone_pickaxe --model copilot/gpt-5.4 --verbose

# Free-form ad-hoc goal (no auto-scoring)
npm run bench -- --goal "Mine 3 oak_log" --model gpt-4o

# Head-to-head: two models, isolated same-seed worlds
npm run bench -- --task stone_pickaxe --model-a copilot/gpt-5.4 --model-b copilot/claude-opus-4.8

Run the tests

npm test                   # deterministic scorer self-tests (no network, no live bot)

📖 Full command reference and flags: commands.md · Architecture deep-dive: context.md.


🎮 The benchmark suite

Tasks span a difficulty ramp from a one-resource smoke test to a deep, multi-stage tech tree, each chosen to stress a different agentic muscle.

| Difficulty | Task | Tests |
| --- | --- | --- |
| 1 | gather_wood | Navigation + find/mine a single resource (smoke test) |
| 1 | obtain_beef / obtain_chicken / obtain_mutton | Mob perception + combat |
| 2 | kill_bot_a / kill_bot_b | PvP duel (scored from the server's real death packet) |
| 3 | make_bed | Multi-resource gathering + crafting |
| 3 | stone_pickaxe | Long-horizon tool-tier reasoning + crafting dependencies |
| 5 | iron_pickaxe | Very long-horizon: mining, furnace, smelting, multi-stage crafting |
| 6 | gold_ingot | Deep tech-tree resource pipeline |

Every task is auto-scored against its success spec: success is read from real world state, never from the model's own claim.
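A minimal sketch of that separation between claimed and verified success, assuming an inventory-style success spec like the one in the stone_pickaxe task. The function `verifyOutcome`, its return shape, and the `worldState` object are hypothetical; other spec types, such as the death packet used for the PvP tasks, are not covered here.

```javascript
// Sketch: the verdict depends only on observed state; the model's claim is recorded, not trusted.
function verifyOutcome(spec, worldState, modelClaimsDone) {
  const held = worldState.inventory ?? {};
  const verified = Object.entries(spec.inventory ?? {})
    .every(([item, count]) => (held[item] ?? 0) >= count);
  return { verified, claimed: modelClaimsDone, falseClaim: modelClaimsDone && !verified };
}

const stonePickaxeSpec = { inventory: { stone_pickaxe: 1 } };

// The model says "done" but only holds a wooden pickaxe.
console.log(verifyOutcome(stonePickaxeSpec, { inventory: { wooden_pickaxe: 1 } }, true));
// { verified: false, claimed: true, falseClaim: true }
```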


🛠️ Tech stack

Node.js · mineflayer (Minecraft bot protocol) · mineflayer-pathfinder (navigation) · minecraft-data (recipes/loot ground truth) · Azure OpenAI + GitHub Copilot model APIs · zero-dependency vanilla-JS dashboard.


🔭 Vision

MineBench treats Minecraft as a proxy for general agentic capability. The six dimensions (planning, tool use, adaptation, robustness, efficiency, completion) are exactly the skills a model needs to operate autonomously anywhere. As models improve, the tasks deepen; the measuring instrument stays honest, deterministic, and transparent.

Built for the hackathon. Minecraft is just the instrument; the score is the story. 🟩
