*gpt-5-mini playing on Monkey Lane*
A benchmark for evaluating LLM agents by having them play Bloons Tower Defense 5.
The agent sees screenshots of the game, reads cash/lives/round via OCR, and uses tools to place towers, upgrade them, and start rounds.
| Model | Runs | Best Round | Avg Round | Avg Towers | Avg Tokens |
|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 7 | 65 | 44.0 | 5 | 1.6M |
| openai/gpt-5-mini | 3 | 65 | 49.3 | 4 | 2.8M |
| anthropic/claude-sonnet-4.6 | 1 | 59 | 59.0 | 13 | 2.5M |
| Qwen3.5 27B | 5 | 43 | 37.8 | 6 | 1.0M |
| openai/gpt-5-nano | 3 | 40 | 25.7 | 2 | 683K |
| Qwen3.5 35B A3B | 3 | 37 | 32.0 | 4 | 1.3M |
| Qwen3.5 9B | 2 | 30 | 30.0 | 6 | 1.2M |
Best run per model:

- **google/gemini-3-flash-preview** — Round 65. Towers: #1 ninja_monkey (280.0,350.0) [4/2], #2 bomb_tower (250.0,235.0) [4/2], #3 ninja_monkey (200.0,235.0) [4/2], #4 monkey_apprentice (150.0,320.0) [4/2], #5 bomb_tower (280.0,390.0) [2/4], #6 ninja_monkey (715.0,240.0) [4/2], #7 super_monkey (400.0,145.0) [2/2], #8 bomb_tower (675.0,240.0) [2/4]
- **openai/gpt-5-mini** — Round 65. No tower data available.
- **anthropic/claude-sonnet-4.6** — Round 59. Towers: #1 dart_monkey (420.0,145.0) [2/3], #2 dart_monkey (585.0,145.0) [2/3], #3 dart_monkey (150.0,320.0) [2/3], #4 dart_monkey (715.0,240.0) [2/3], #5 dart_monkey (280.0,350.0) [2/3], #6 bomb_tower (500.0,430.0) [3/2], #7 ninja_monkey (350.0,145.0) [4/2], #8 tack_shooter (260.0,235.0) [4/2], #9 bomb_tower (480.0,235.0) [2/4], #10 sniper_monkey (400.0,350.0) [2/2], #11 ice_tower (470.0,145.0) [2/4], #12 ninja_monkey (530.0,145.0) [4/2], #13 super_monkey (100.0,430.0) [2/2]
- **Qwen3.5 27B** — Round 43. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 dart_monkey (150.0,320.0) [2/3], #3 glue_gunner (240.0,235.0) [2/1], #4 tack_shooter (260.0,145.0) [3/2], #5 ice_tower (715.0,240.0) [2/2], #6 bomb_tower (400.0,145.0) [2/4], #7 ninja_monkey (500.0,145.0) [2/4], #8 spike_factory (600.0,145.0) [0/0]
- **openai/gpt-5-nano** — Round 40. Towers: #1 dart_monkey (150.0,320.0) [2/3], #3 dart_monkey (210.0,235.0) [2/3], #4 spike_factory (260.0,150.0) [3/2], #5 dart_monkey (280.0,350.0) [3/2], #6 dart_monkey (430.0,260.0) [4/2]
- **Qwen3.5 35B A3B** — Round 37. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 tack_shooter (150.0,320.0) [3/2], #4 monkey_village (200.0,235.0) [2/1], #5 ninja_monkey (585.0,145.0) [3/2]
- **Qwen3.5 9B** — Round 30. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 dart_monkey (150.0,320.0) [2/3], #3 dart_monkey (715.0,240.0) [3/2], #4 dart_monkey (350.0,150.0) [3/2], #5 dart_monkey (500.0,400.0) [3/2], #6 dart_monkey (600.0,300.0) [3/0]
- Run an agent: `python scripts/run_agent.py --model <your-model>`
- A submission file is auto-generated in `results/submissions/`
- Fork the repo, commit your submission file, and open a PR
```bash
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
python -m playwright install chromium
```

Create a `.env` file with your OpenRouter API key:

```bash
echo "OPENROUTER_API_KEY=sk-or-your-key-here" > .env
```

Download the game SWF:

```bash
curl -L -o ./game/btd5.swf "https://github.com/SJRNoodles/Flash-Game-Archive/raw/master/bloons-tower-defense-5.swf"
```

Run the game:

```bash
python scripts/run_agent.py --model openai/gpt-5-nano

# Or, inject your own save file
python scripts/run_agent.py --model openai/gpt-5-nano --saves saves/my-save.json

# Extended thinking (for models that support it, e.g. Gemini)
python scripts/run_agent.py --model google/gemini-3-flash-preview --reasoning high
```

Reasoning effort: Models that support extended thinking (e.g. Gemini) can use `--reasoning low|medium|high`. Higher effort produces more verbose thinking traces but costs more tokens. Leaderboard submissions must use `--reasoning low` (the default) to ensure fair comparisons.
Saves are base64-encoded SOL files written to localStorage before the Flash VM starts (via deferred Ruffle loading).
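A minimal sketch of how such an injection can work with Playwright, assuming a save already exported as raw SOL bytes. The `localStorage` key name (`btd5_save`) is hypothetical; the real key depends on how Ruffle maps SOL paths to storage keys in this harness.

```python
import base64
import json

def save_injection_script(sol_bytes: bytes, key: str = "btd5_save") -> str:
    """Build a JS snippet that writes a base64-encoded SOL save into
    localStorage. The key name is a placeholder, not the harness's
    actual mapping."""
    b64 = base64.b64encode(sol_bytes).decode("ascii")
    return f"localStorage.setItem({json.dumps(key)}, {json.dumps(b64)});"

# With Playwright (sync API), the snippet must run before Ruffle loads:
#   page.add_init_script(save_injection_script(open("save.sol", "rb").read()))
```

`add_init_script` runs the snippet on every navigation before any page script, which is why the deferred Ruffle loading matters: the save has to be in `localStorage` before the Flash VM starts.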
Play manually or test the harness:
```bash
python scripts/run_mcp.py --cli
```

| Tool | Description |
|---|---|
| `observe` | Screenshot the current game state |
| `place_tower` | Place a tower at (x, y) on the map |
| `upgrade_tower` | Upgrade a tower along path 1 or 2 |
| `sell_tower` | Sell a tower for cash back |
| `set_target` | Set targeting: first / last / close / strong |
| `start_round` | Start the next round (fast-forward, 7s wait) |
| `status` | Show placed towers, cash, lives, round |
| `list_towers` | List all towers with costs and upgrade paths |
| `click` | Raw click at (x, y) — escape hatch for stuck UI |
| `send_key` | Press a key (e.g. Escape to cancel placement) |
| `wait` | Wait N milliseconds |
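To illustrate how a tool call from the LLM gets routed to one of the tools above, here is a minimal dispatcher sketch. The tool names mirror the table, but the function bodies are stubs, not the harness's real implementations.

```python
import json

def place_tower(tower: str, x: float, y: float) -> str:
    # Stub: the real tool clicks through the game UI via Playwright.
    return f"placed {tower} at ({x}, {y})"

def start_round() -> str:
    # Stub: the real tool presses GO and waits for the round to run.
    return "round started"

TOOLS = {"place_tower": place_tower, "start_round": start_round}

def dispatch(call_json: str) -> str:
    """Decode an LLM tool call of the form {"name": ..., "arguments": {...}}
    and invoke the matching tool."""
    call = json.loads(call_json)
    return TOOLS[call["name"]](**call.get("arguments", {}))
```

Example: `dispatch('{"name": "place_tower", "arguments": {"tower": "dart_monkey", "x": 280.0, "y": 350.0}}')` returns the placement confirmation string that would be fed back to the model.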
Ruffle emulates Flash, Playwright drives Chromium, EasyOCR reads the HUD, and the agent talks to OpenRouter for LLM inference.
- A local HTTP server (port 8890) serves the Ruffle wrapper + SWF
- Playwright launches Chromium, navigates to the game, and auto-clicks through menus to round 1
- The agent loop: screenshot → OCR for game state → LLM decides actions → execute tools → repeat
- Between rounds, the agent places/upgrades towers, then calls `start_round` to begin
- The harness remembers tower positions and gives upgrade-path info to the LLM, working around the limits of raw computer use and avoiding desyncs
- OCR polls for the GO button to detect round completion, and for GAME OVER to detect loss
- Context distillation kicks in when the conversation gets long, summarizing history to stay within token limits
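The agent loop described above can be sketched as follows. The function arguments are stand-ins for the harness's screenshot, OCR, LLM, and tool-execution layers, not its real API.

```python
def agent_loop(observe, read_hud, ask_llm, execute, max_steps=100):
    """screenshot -> OCR -> LLM decides -> execute tools -> repeat."""
    history = []
    for _ in range(max_steps):
        frame = observe()                 # screenshot of the game
        state = read_hud(frame)           # OCR: cash / lives / round
        if state.get("game_over"):        # GAME OVER detected -> stop
            break
        action = ask_llm(history, state)  # model picks the next tool call
        history.append((state, action))
        execute(action)                   # run the tool, then loop
    return history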