# BloonsBench

*BloonsBench gameplay: gpt-5-mini playing on Monkey Lane*

A benchmark for evaluating LLM agents by having them play Bloons Tower Defense 5.

The agent sees screenshots of the game, reads cash/lives/round via OCR, and uses tools to place towers, upgrade them, and start rounds.

## Leaderboard

| Model | Runs | Best Round | Avg Round | Avg Towers | Avg Tokens |
|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 7 | 65 | 44.0 | 5 | 1.6M |
| openai/gpt-5-mini | 3 | 65 | 49.3 | 4 | 2.8M |
| anthropic/claude-sonnet-4.6 | 1 | 59 | 59.0 | 13 | 2.5M |
| Qwen3.5 27B | 5 | 43 | 37.8 | 6 | 1.0M |
| openai/gpt-5-nano | 3 | 40 | 25.7 | 2 | 683K |
| Qwen3.5 35B A3B | 3 | 37 | 32.0 | 4 | 1.3M |
| Qwen3.5 9B | 2 | 30 | 30.0 | 6 | 1.2M |

## Best Runs

- **google/gemini-3-flash-preview** — Round 65. Towers: #1 ninja_monkey (280.0,350.0) [4/2], #2 bomb_tower (250.0,235.0) [4/2], #3 ninja_monkey (200.0,235.0) [4/2], #4 monkey_apprentice (150.0,320.0) [4/2], #5 bomb_tower (280.0,390.0) [2/4], #6 ninja_monkey (715.0,240.0) [4/2], #7 super_monkey (400.0,145.0) [2/2], #8 bomb_tower (675.0,240.0) [2/4]
- **openai/gpt-5-mini** — Round 65. No tower data available.
- **anthropic/claude-sonnet-4.6** — Round 59. Towers: #1 dart_monkey (420.0,145.0) [2/3], #2 dart_monkey (585.0,145.0) [2/3], #3 dart_monkey (150.0,320.0) [2/3], #4 dart_monkey (715.0,240.0) [2/3], #5 dart_monkey (280.0,350.0) [2/3], #6 bomb_tower (500.0,430.0) [3/2], #7 ninja_monkey (350.0,145.0) [4/2], #8 tack_shooter (260.0,235.0) [4/2], #9 bomb_tower (480.0,235.0) [2/4], #10 sniper_monkey (400.0,350.0) [2/2], #11 ice_tower (470.0,145.0) [2/4], #12 ninja_monkey (530.0,145.0) [4/2], #13 super_monkey (100.0,430.0) [2/2]
- **Qwen3.5 27B** — Round 43. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 dart_monkey (150.0,320.0) [2/3], #3 glue_gunner (240.0,235.0) [2/1], #4 tack_shooter (260.0,145.0) [3/2], #5 ice_tower (715.0,240.0) [2/2], #6 bomb_tower (400.0,145.0) [2/4], #7 ninja_monkey (500.0,145.0) [2/4], #8 spike_factory (600.0,145.0) [0/0]
- **openai/gpt-5-nano** — Round 40. Towers: #1 dart_monkey (150.0,320.0) [2/3], #3 dart_monkey (210.0,235.0) [2/3], #4 spike_factory (260.0,150.0) [3/2], #5 dart_monkey (280.0,350.0) [3/2], #6 dart_monkey (430.0,260.0) [4/2]
- **Qwen3.5 35B A3B** — Round 37. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 tack_shooter (150.0,320.0) [3/2], #4 monkey_village (200.0,235.0) [2/1], #5 ninja_monkey (585.0,145.0) [3/2]
- **Qwen3.5 9B** — Round 30. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 dart_monkey (150.0,320.0) [2/3], #3 dart_monkey (715.0,240.0) [3/2], #4 dart_monkey (350.0,150.0) [3/2], #5 dart_monkey (500.0,400.0) [3/2], #6 dart_monkey (600.0,300.0) [3/0]

## Submit Your Results

1. Run an agent: `python scripts/run_agent.py --model <your-model>`
2. A submission file is auto-generated in `results/submissions/`
3. Fork the repo, commit your submission file, and open a PR

## Setup

```shell
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
python -m playwright install chromium
```

Create a `.env` file with your OpenRouter API key:

```shell
echo "OPENROUTER_API_KEY=sk-or-your-key-here" > .env
```

Download the game SWF:

```shell
curl -L -o ./game/btd5.swf "https://github.com/SJRNoodles/Flash-Game-Archive/raw/master/bloons-tower-defense-5.swf"
```


## Run

```shell
python scripts/run_agent.py --model openai/gpt-5-nano

# Or, inject your own save file
python scripts/run_agent.py --model openai/gpt-5-nano --saves saves/my-save.json

# Extended thinking (for models that support it, e.g. Gemini)
python scripts/run_agent.py --model google/gemini-3-flash-preview --reasoning high
```

**Reasoning effort:** Models that support extended thinking (e.g. Gemini) can use `--reasoning low|medium|high`. Higher effort produces more verbose thinking traces but costs more tokens. Leaderboard submissions must use `--reasoning low` (the default) to ensure fair comparisons.
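The flag presumably maps onto OpenRouter's `reasoning` request field; a hedged sketch of the request body (the exact pass-through is an assumption, not taken from this repo's code):

```python
def build_request(model: str, messages: list, effort: str = "low") -> dict:
    """Assemble an OpenRouter chat-completions body. The reasoning.effort
    field requests extended thinking on supporting models (assumption:
    --reasoning maps straight onto it)."""
    return {
        "model": model,
        "messages": messages,
        "reasoning": {"effort": effort},
    }

body = build_request(
    "google/gemini-3-flash-preview",
    [{"role": "user", "content": "Place a tower."}],
    effort="high",
)
print(body["reasoning"]["effort"])  # → high
```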

Saves are base64-encoded SOL files written to `localStorage` before the Flash VM starts (via deferred Ruffle loading).
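A minimal sketch of that injection step, assuming a single save file and an illustrative `localStorage` key name (the real harness's key and layout may differ):

```python
import base64
from pathlib import Path

def sol_to_localstorage_js(sol_path: str, key: str = "btd5.sol") -> str:
    """Base64-encode a Flash SOL save and emit the JS snippet that plants
    it in localStorage before Ruffle boots (key name is illustrative)."""
    encoded = base64.b64encode(Path(sol_path).read_bytes()).decode("ascii")
    return f'localStorage.setItem("{key}", "{encoded}");'
```

With Playwright, a snippet like this would typically be passed to `page.add_init_script(...)` so it runs before any of the page's own scripts, i.e. before Ruffle reads the save.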

## Interactive CLI

Play manually or test the harness:

```shell
python scripts/run_mcp.py --cli
```

## Agent tools

| Tool | Description |
|---|---|
| `observe` | Screenshot the current game state |
| `place_tower` | Place a tower at (x, y) on the map |
| `upgrade_tower` | Upgrade a tower along path 1 or 2 |
| `sell_tower` | Sell a tower for cash back |
| `set_target` | Set targeting: first / last / close / strong |
| `start_round` | Start the next round (fast-forward, 7s wait) |
| `status` | Show placed towers, cash, lives, round |
| `list_towers` | List all towers with costs and upgrade paths |
| `click` | Raw click at (x, y) — escape hatch for stuck UI |
| `send_key` | Press a key (e.g. Escape to cancel placement) |
| `wait` | Wait N milliseconds |
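Tools like these map naturally onto OpenAI-style function schemas sent through OpenRouter; a sketch of one entry (only the tool name and description come from the table above, the parameter names are assumptions):

```python
import json

# Illustrative function schema for place_tower; everything beyond the
# name and description is an assumption, not the project's actual schema.
PLACE_TOWER = {
    "type": "function",
    "function": {
        "name": "place_tower",
        "description": "Place a tower at (x, y) on the map",
        "parameters": {
            "type": "object",
            "properties": {
                "tower": {"type": "string", "description": "e.g. ninja_monkey"},
                "x": {"type": "number"},
                "y": {"type": "number"},
            },
            "required": ["tower", "x", "y"],
        },
    },
}

print(json.dumps(PLACE_TOWER["function"]["name"]))  # → "place_tower"
```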

## How it works

Ruffle emulates Flash, Playwright drives Chromium, EasyOCR reads the HUD, and the agent talks to OpenRouter for LLM inference.
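The OCR output is raw text, not structured state; a minimal sketch of turning it into cash/lives/round values (the `parse_hud` helper, its regexes, and the sample string are illustrative, not the project's code):

```python
import re

def parse_hud(ocr_text: str) -> dict:
    """Pull cash/lives/round out of raw OCR text (illustrative only)."""
    state = {}
    for key, pattern in [
        ("cash", r"\$\s*([\d,]+)"),       # e.g. "$650"
        ("lives", r"([\d,]+)\s*lives?"),  # e.g. "200 lives"
        ("round", r"round\s*(\d+)"),      # e.g. "Round 12"
    ]:
        m = re.search(pattern, ocr_text, re.IGNORECASE)
        if m:
            state[key] = int(m.group(1).replace(",", ""))
    return state

print(parse_hud("$650  200 lives  Round 12"))
# → {'cash': 650, 'lives': 200, 'round': 12}
```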

  1. A local HTTP server (port 8890) serves the Ruffle wrapper + SWF
  2. Playwright launches Chromium, navigates to the game, and auto-clicks through menus to round 1
  3. The agent loop: screenshot → OCR for game state → LLM decides actions → execute tools → repeat
  4. Between rounds, the agent places/upgrades towers; then calls start_round to begin
  5. The harness tracks tower positions and feeds upgrade-path info back to the LLM, working around computer-use limitations and avoiding desyncs
  6. OCR polls for the GO button to detect round completion, and for GAME OVER to detect loss
  7. Context distillation kicks in when the conversation gets long, summarizing history to stay within token limits
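The loop in step 3 can be sketched as follows; every callable here is a stand-in for the harness's real tools, not its actual API:

```python
def agent_loop(observe, read_hud, ask_llm, execute, max_steps=100):
    """Skeleton of the screenshot → OCR → LLM → tool-execution loop.
    All four callables are placeholders supplied by the caller."""
    history = []
    for _ in range(max_steps):
        shot = observe()                          # screenshot current frame
        state = read_hud(shot)                    # OCR cash/lives/round
        if state.get("game_over"):
            break
        actions = ask_llm(history, shot, state)   # model picks tool calls
        for action in actions:
            history.append(execute(action))       # run tools, log results
    return history

# Stubbed run: the "game" ends on the third observation.
frames = iter([{"round": 1}, {"round": 2}, {"game_over": True}])
log = agent_loop(
    observe=lambda: "png-bytes",
    read_hud=lambda shot: next(frames),
    ask_llm=lambda hist, shot, st: [("start_round", {})],
    execute=lambda a: a,
)
print(len(log))  # → 2  (two rounds of actions before game over)
```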
