*gpt-5-mini playing on Monkey Lane*
A benchmark for evaluating LLM agents by having them play Bloons Tower Defense 5.
The agent sees screenshots of the game, reads cash/lives/round via OCR, and uses tools to place towers, upgrade them, and start rounds.
| Model | Runs | Best Round | Avg Round | Avg Towers | Avg Tokens |
|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 7 | 65 | 44.0 | 5 | 1.6M |
| openai/gpt-5-mini | 3 | 65 | 49.3 | 4 | 2.8M |
| anthropic/claude-sonnet-4.6 | 1 | 59 | 59.0 | 13 | 2.5M |
| Qwen3.5 27B | 5 | 43 | 37.8 | 6 | 1.0M |
| openai/gpt-5-nano | 3 | 40 | 25.7 | 2 | 683K |
| Qwen3.5 35B A3B | 3 | 37 | 32.0 | 4 | 1.3M |
| Qwen3.5 9B | 2 | 30 | 30.0 | 6 | 1.2M |
Best run per model:

- **google/gemini-3-flash-preview** — Round 65. Towers: #1 ninja_monkey (280.0,350.0) [4/2], #2 bomb_tower (250.0,235.0) [4/2], #3 ninja_monkey (200.0,235.0) [4/2], #4 monkey_apprentice (150.0,320.0) [4/2], #5 bomb_tower (280.0,390.0) [2/4], #6 ninja_monkey (715.0,240.0) [4/2], #7 super_monkey (400.0,145.0) [2/2], #8 bomb_tower (675.0,240.0) [2/4]
- **openai/gpt-5-mini** — Round 65. No tower data available.
- **anthropic/claude-sonnet-4.6** — Round 59. Towers: #1 dart_monkey (420.0,145.0) [2/3], #2 dart_monkey (585.0,145.0) [2/3], #3 dart_monkey (150.0,320.0) [2/3], #4 dart_monkey (715.0,240.0) [2/3], #5 dart_monkey (280.0,350.0) [2/3], #6 bomb_tower (500.0,430.0) [3/2], #7 ninja_monkey (350.0,145.0) [4/2], #8 tack_shooter (260.0,235.0) [4/2], #9 bomb_tower (480.0,235.0) [2/4], #10 sniper_monkey (400.0,350.0) [2/2], #11 ice_tower (470.0,145.0) [2/4], #12 ninja_monkey (530.0,145.0) [4/2], #13 super_monkey (100.0,430.0) [2/2]
- **Qwen3.5 27B** — Round 43. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 dart_monkey (150.0,320.0) [2/3], #3 glue_gunner (240.0,235.0) [2/1], #4 tack_shooter (260.0,145.0) [3/2], #5 ice_tower (715.0,240.0) [2/2], #6 bomb_tower (400.0,145.0) [2/4], #7 ninja_monkey (500.0,145.0) [2/4], #8 spike_factory (600.0,145.0) [0/0]
- **openai/gpt-5-nano** — Round 40. Towers: #1 dart_monkey (150.0,320.0) [2/3], #3 dart_monkey (210.0,235.0) [2/3], #4 spike_factory (260.0,150.0) [3/2], #5 dart_monkey (280.0,350.0) [3/2], #6 dart_monkey (430.0,260.0) [4/2]
- **Qwen3.5 35B A3B** — Round 37. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 tack_shooter (150.0,320.0) [3/2], #4 monkey_village (200.0,235.0) [2/1], #5 ninja_monkey (585.0,145.0) [3/2]
- **Qwen3.5 9B** — Round 30. Towers: #1 dart_monkey (280.0,350.0) [2/3], #2 dart_monkey (150.0,320.0) [2/3], #3 dart_monkey (715.0,240.0) [3/2], #4 dart_monkey (350.0,150.0) [3/2], #5 dart_monkey (500.0,400.0) [3/2], #6 dart_monkey (600.0,300.0) [3/0]
- Run an agent: `python scripts/run_agent.py --model <your-model>`
- A submission file is auto-generated in `results/submissions/`
- Fork the repo, commit your submission file, and open a PR
```bash
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
python -m playwright install chromium
```

Create a `.env` file with your OpenRouter API key:

```bash
echo "OPENROUTER_API_KEY=sk-or-your-key-here" > .env
```

Download the game SWF:

```bash
curl -L -o ./game/btd5.swf "https://github.com/SJRNoodles/Flash-Game-Archive/raw/master/bloons-tower-defense-5.swf"
```

Run the game:

```bash
python scripts/run_agent.py --model openai/gpt-5-nano

# Or, inject your own save file
python scripts/run_agent.py --model openai/gpt-5-nano --saves saves/my-save.json

# Extended thinking (for models that support it, e.g. Gemini)
python scripts/run_agent.py --model google/gemini-3-flash-preview --reasoning high
```

Reasoning effort: Models that support extended thinking (e.g. Gemini) can use `--reasoning low|medium|high`. Higher effort produces more verbose thinking traces but costs more tokens. Leaderboard submissions must use `--reasoning low` (the default) to ensure fair comparisons.
Saves are base64-encoded SOL files written to localStorage before the Flash VM starts (via deferred Ruffle loading).
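A minimal sketch of how such an injection can work with Playwright, assuming a save already exported as raw SOL bytes. The `localStorage` key name (`btd5_save`) is hypothetical; the real key depends on how Ruffle maps SOL paths to storage keys in this harness.

```python
import base64
import json

def save_injection_script(sol_bytes: bytes, key: str = "btd5_save") -> str:
    """Build a JS snippet that writes a base64-encoded SOL save into
    localStorage. The key name is a placeholder, not the harness's
    actual mapping."""
    b64 = base64.b64encode(sol_bytes).decode("ascii")
    return f"localStorage.setItem({json.dumps(key)}, {json.dumps(b64)});"

# With Playwright (sync API), the snippet must run before Ruffle loads:
#   page.add_init_script(save_injection_script(open("save.sol", "rb").read()))
```

`add_init_script` runs the snippet on every navigation before any page script, which is why the deferred Ruffle loading matters: the save has to be in `localStorage` before the Flash VM starts.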
Play manually or test the harness:
```bash
python scripts/run_mcp.py --cli
```

| Tool | Description |
|---|---|
| `observe` | Screenshot the current game state |
| `place_tower` | Place a tower at (x, y) on the map |
| `upgrade_tower` | Upgrade a tower along path 1 or 2 |
| `sell_tower` | Sell a tower for cash back |
| `set_target` | Set targeting: first / last / close / strong |
| `start_round` | Start the next round (fast-forward, 7s wait) |
| `status` | Show placed towers, cash, lives, round |
| `list_towers` | List all towers with costs and upgrade paths |
| `click` | Raw click at (x, y) — escape hatch for stuck UI |
| `send_key` | Press a key (e.g. Escape to cancel placement) |
| `wait` | Wait N milliseconds |
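To illustrate how a tool call from the LLM gets routed to one of the tools above, here is a minimal dispatcher sketch. The tool names mirror the table, but the function bodies are stubs, not the harness's real implementations.

```python
import json

def place_tower(tower: str, x: float, y: float) -> str:
    # Stub: the real tool clicks through the game UI via Playwright.
    return f"placed {tower} at ({x}, {y})"

def start_round() -> str:
    # Stub: the real tool presses GO and waits for the round to run.
    return "round started"

TOOLS = {"place_tower": place_tower, "start_round": start_round}

def dispatch(call_json: str) -> str:
    """Decode an LLM tool call of the form {"name": ..., "arguments": {...}}
    and invoke the matching tool."""
    call = json.loads(call_json)
    return TOOLS[call["name"]](**call.get("arguments", {}))
```

Example: `dispatch('{"name": "place_tower", "arguments": {"tower": "dart_monkey", "x": 280.0, "y": 350.0}}')` returns the placement confirmation string that would be fed back to the model.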
Ruffle emulates Flash, Playwright drives Chromium, EasyOCR reads the HUD, and the agent talks to OpenRouter for LLM inference.
- A local HTTP server (port 8890) serves the Ruffle wrapper + SWF
- Playwright launches Chromium, navigates to the game, and auto-clicks through menus to round 1
- The agent loop: screenshot → OCR for game state → LLM decides actions → execute tools → repeat
- Between rounds, the agent places/upgrades towers, then calls `start_round` to begin
- The harness remembers tower positions and gives upgrade-path info to the LLM, working around the limits of raw computer use and avoiding desyncs
- OCR polls for the GO button to detect round completion, and for GAME OVER to detect loss
- Context distillation kicks in when the conversation gets long, summarizing history to stay within token limits
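The agent loop described above can be sketched as follows. The function arguments are stand-ins for the harness's screenshot, OCR, LLM, and tool-execution layers, not its real API.

```python
def agent_loop(observe, read_hud, ask_llm, execute, max_steps=100):
    """screenshot -> OCR -> LLM decides -> execute tools -> repeat."""
    history = []
    for _ in range(max_steps):
        frame = observe()                 # screenshot of the game
        state = read_hud(frame)           # OCR: cash / lives / round
        if state.get("game_over"):        # GAME OVER detected -> stop
            break
        action = ask_llm(history, state)  # model picks the next tool call
        history.append((state, action))
        execute(action)                   # run the tool, then loop
    return history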