HLL: Interactive CAPTCHA Benchmark for Multimodal Agents

HLL is a CAPTCHA benchmark for evaluating multimodal GUI agents in an end-to-end interaction loop. It provides a local CAPTCHA web server, configurable webpage distraction, dynamic interaction validation, and MobileAgent-based evaluation scripts.

What Is Included

CAPTCHA web server: python -m web.captcha_page.
Single MobileAgent demo: run_captcha_demo.py.
Multi-model batch evaluation: run_captcha_eval.py and eval_captcha.sh.
Configuration templates: config.yaml.example and .env.example.

Benchmark Task Families

HLL follows the ten base task families used in the paper. Difficulty, distraction, and dynamic interaction validation are evaluation dimensions layered on top of these families; they are not counted as separate task families.

Paper task family	CLI type	Interaction
Text transcription	`text`	Read a visual code and enter it correctly
Natural-image sequence selection	`click_real`	Click targets in the required order over natural-image content
Icon sequence selection	`click_icon`	Click symbolic targets in the specified sequence
Slider alignment	`slider_real`	Drag a slider until a visual gap is aligned
Jigsaw alignment	`puzzle_real`	Drag a missing component to its correct geometric position
Missing-patch selection	`missing_piece`	Select the local patch that completes an image
Tile restoration	`pic_puzzle`	Restore an image by swapping misplaced tiles
Board reconfiguration	`five_line`	Move a board piece until a valid five-in-a-row state is formed
Category-guided image selection	`image_grid`	Select all grid images matching a semantic category
Logic-and-arithmetic interaction	`logic_arithmetic`	Solve arithmetic, symbolic, or interface-mediated reasoning challenges

Dynamic validation uses the corresponding _dynamic CLI suffix where implemented, for example slider_real_dynamic, click_real_dynamic, image_grid_dynamic, and logic_arithmetic_dynamic. Internal smoke-test and debugging types are kept out of the benchmark task list.
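The suffix convention above can be sketched as a small lookup. This is only an illustration: `BASE_TYPES` is copied from the table, `KNOWN_DYNAMIC_BASES` lists only the four dynamic variants this README names, and the helper `dynamic_type` is hypothetical rather than part of the HLL code base. Other families may or may not have a dynamic variant.

```python
# Base CLI types copied from the task-family table in this README.
BASE_TYPES = [
    "text", "click_real", "click_icon", "slider_real", "puzzle_real",
    "missing_piece", "pic_puzzle", "five_line", "image_grid", "logic_arithmetic",
]

# Assumption: only the dynamic variants explicitly named in this README are listed.
KNOWN_DYNAMIC_BASES = {"slider_real", "click_real", "image_grid", "logic_arithmetic"}


def dynamic_type(base: str) -> str:
    """Return the `_dynamic` CLI type for a base type (hypothetical helper)."""
    if base not in BASE_TYPES:
        raise ValueError(f"unknown base CAPTCHA type: {base!r}")
    if base not in KNOWN_DYNAMIC_BASES:
        raise ValueError(f"no dynamic variant is documented for {base!r}")
    return f"{base}_dynamic"


print(dynamic_type("slider_real"))  # slider_real_dynamic
```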

Installation

For the CAPTCHA web server only:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements-server.txt

For MobileAgent demos and batch evaluation:

pip install -r requirements.txt

MobileAgent evaluation requires an Android emulator or device, working ADB, and an OpenAI-compatible model API.

Configuration

Copy the configuration template:

cp config.yaml.example config.yaml

Common fields:

env:
  adb_path: "/path/to/android-sdk/platform-tools/adb"
  API_url: "https://api.openai.com/v1"
  token: ""
  caption_model: "gpt-4o"
  reflect_model: "gpt-4o"
  judge_model: "gpt-4o"

captcha:
  unsplash:
    access_key: ""

If env.API_url or env.token is left empty, the evaluation scripts fall back to the OPENAI_API_URL and OPENAI_API_KEY environment variables. PEXELS_API_KEY can supply remote images for missing_piece; UNSPLASH_ACCESS_KEY or captcha.unsplash.access_key can supply remote images for pic_puzzle and image_grid.
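The fallback described above could look roughly like the following. This is a minimal sketch, not the scripts' actual logic: `resolve_api_settings` is a hypothetical helper, the config is passed as an already-parsed dictionary, and it assumes each field falls back independently when empty.

```python
import os
from typing import Any, Mapping, Optional, Tuple


def resolve_api_settings(
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_url, token), preferring config values over environment variables."""
    environ = os.environ if environ is None else environ
    env_section = config.get("env") or {}
    # Assumption: an empty or missing field falls back on its own,
    # independently of the other field.
    api_url = env_section.get("API_url") or environ.get("OPENAI_API_URL", "")
    token = env_section.get("token") or environ.get("OPENAI_API_KEY", "")
    return api_url, token


config = {"env": {"API_url": "https://api.openai.com/v1", "token": ""}}
print(resolve_api_settings(config, {"OPENAI_API_KEY": "sk-example"}))
```

Here the URL comes from the config and the placeholder token `sk-example` comes from the supplied environment mapping.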

Quick Start

Run a local CAPTCHA page:

python -m web.captcha_page --type text --port 8000 --no-debug

Open http://localhost:8000.
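When the page is opened by a script rather than by hand, it can help to wait until the server is listening. The sketch below polls the URL using only the standard library; `wait_for_server` is a hypothetical helper and not part of HLL, and treating any HTTP response as "ready" is an assumption.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_sec: float = 10.0, interval_sec: float = 0.5) -> bool:
    """Poll `url` until it answers over HTTP or `timeout_sec` elapses."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_sec + 1.0):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is listening.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not ready yet, retry.
            pass
        time.sleep(interval_sec)
    return False
```

For example, `wait_for_server("http://localhost:8000")` returns True once the page started above responds.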

Useful examples:

python -m web.captcha_page --type slider_real --difficulty hard --port 8000 --no-debug
python -m web.captcha_page --type text --distraction-level 2 --port 8000 --no-debug
python -m web.captcha_page --type image_grid_dynamic --port 8000 --no-debug

Debug mode is available for local development:

python -m web.captcha_page --type text --port 8000 --debug
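The commands above can also be assembled programmatically, for instance to start several server variants from a test harness. This is a sketch using only the flags shown in this README; `captcha_server_command` is a hypothetical helper, and the flag semantics are assumed to match the examples.

```python
import sys
from typing import List, Optional


def captcha_server_command(
    captcha_type: str,
    port: int = 8000,
    difficulty: Optional[str] = None,
    distraction_level: Optional[int] = None,
    debug: bool = False,
) -> List[str]:
    """Build the argv list for `python -m web.captcha_page` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "web.captcha_page",
           "--type", captcha_type, "--port", str(port)]
    if difficulty is not None:
        cmd += ["--difficulty", difficulty]
    if distraction_level is not None:
        cmd += ["--distraction-level", str(distraction_level)]
    cmd.append("--debug" if debug else "--no-debug")
    return cmd


print(" ".join(captcha_server_command("slider_real", difficulty="hard")[1:]))
```

The resulting list can be passed to `subprocess.Popen` from the repository root so that the `web` package is importable.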

Agent Evaluation

Single MobileAgent demo:

python run_captcha_demo.py \
  --config config.yaml \
  --type slider_real \
  --attempts 3 \
  --distraction-level 1

Batch evaluation:

python run_captcha_eval.py \
  --config config.yaml \
  --models gpt-4o \
  --captcha-types text click_real click_icon slider_real puzzle_real missing_piece pic_puzzle five_line image_grid logic_arithmetic \
  --samples-per-type 5 \
  --attempts 1 \
  --max-steps 15 \
  --sample-timeout-sec 1200 \
  --output-dir eval_outputs/hll_repro_demo
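To gauge the size of the batch run above, the sample count and a rough worst-case duration can be estimated. This is a back-of-the-envelope sketch under stated assumptions: samples run sequentially, and --sample-timeout-sec bounds each sample as a whole. Neither is confirmed by this README, and `eval_budget` is a hypothetical helper.

```python
def eval_budget(n_models: int, n_types: int, samples_per_type: int,
                sample_timeout_sec: int) -> tuple:
    """Return (total_samples, worst_case_seconds) for a batch run."""
    total_samples = n_models * n_types * samples_per_type
    # Assumption: sequential execution, with the timeout applied per sample.
    worst_case_sec = total_samples * sample_timeout_sec
    return total_samples, worst_case_sec


# One model, ten task families, five samples each, 1200 s timeout.
total, worst = eval_budget(1, 10, 5, 1200)
print(total, f"{worst / 3600:.1f} h")  # 50 16.7 h
```

In practice most samples are likely to finish well before the timeout, so this figure is an upper bound rather than an expected runtime.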

Script example:

bash eval_captcha.sh gpt-4o

Project Map

HLL/
├── web/                         # CAPTCHA web server and task implementations
│   ├── captcha_page.py          # HTTP server entry point
│   └── captcha/                 # CAPTCHA implementations
├── MobileAgent/                 # Android GUI-agent evaluation loop
├── run_captcha_demo.py          # Single closed-loop demo
├── run_captcha_eval.py          # Batch evaluation entry point
├── eval_captcha.sh              # Batch evaluation example script
├── config.yaml.example          # Configuration template
├── requirements-server.txt      # Minimal web-server dependencies
└── requirements.txt             # Full evaluation dependencies

Evaluation Output

run_captcha_eval.py writes results to the directory passed with --output-dir, including aggregated results, per-sample metadata, agent traces, and screenshots.
