HLL: Interactive CAPTCHA Benchmark for Multimodal Agents

Chinese README

HLL is a CAPTCHA benchmark for evaluating multimodal GUI agents in an end-to-end interaction loop. It provides a local CAPTCHA web server, configurable webpage distraction, dynamic interaction validation, and MobileAgent-based evaluation scripts.

What Is Included

  • CAPTCHA web server: python -m web.captcha_page.
  • Single MobileAgent demo: run_captcha_demo.py.
  • Multi-model batch evaluation: run_captcha_eval.py and eval_captcha.sh.
  • Configuration templates: config.yaml.example and .env.example.

Benchmark Task Families

HLL follows the ten base task families used in the paper. Difficulty, distraction, and dynamic interaction validation are evaluation dimensions layered on top of these families; they are not counted as separate task families.

| Paper task family | CLI type | Interaction |
| --- | --- | --- |
| Text transcription | text | Read a visual code and enter it correctly |
| Natural-image sequence selection | click_real | Click targets in the required order over natural-image content |
| Icon sequence selection | click_icon | Click symbolic targets in the specified sequence |
| Slider alignment | slider_real | Drag a slider until a visual gap is aligned |
| Jigsaw alignment | puzzle_real | Drag a missing component to its correct geometric position |
| Missing-patch selection | missing_piece | Select the local patch that completes an image |
| Tile restoration | pic_puzzle | Restore an image by swapping misplaced tiles |
| Board reconfiguration | five_line | Move a board piece until a valid five-in-a-row state is formed |
| Category-guided image selection | image_grid | Select all grid images matching a semantic category |
| Logic-and-arithmetic interaction | logic_arithmetic | Solve arithmetic, symbolic, or interface-mediated reasoning challenges |

Dynamic validation uses the corresponding _dynamic CLI suffix where implemented, for example slider_real_dynamic, click_real_dynamic, image_grid_dynamic, and logic_arithmetic_dynamic. Internal smoke-test and debugging types are kept out of the benchmark task list.
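The naming rule above can be sketched in Python. This is an illustrative helper, not code from the repository: `BASE_TYPES`, `DOCUMENTED_DYNAMIC`, `dynamic_variant`, and `is_documented_dynamic` are hypothetical names. Only the four dynamic types named in this README are treated as documented; whether other base types have a dynamic variant is not stated here.

```python
# Hypothetical helper: derives "_dynamic" CLI type names from the ten base types.
# The base type names come from this README; the helper itself is an assumption.
BASE_TYPES = (
    "text", "click_real", "click_icon", "slider_real", "puzzle_real",
    "missing_piece", "pic_puzzle", "five_line", "image_grid", "logic_arithmetic",
)

# Dynamic variants that this README names explicitly. Others may or may not exist.
DOCUMENTED_DYNAMIC = frozenset({
    "slider_real_dynamic",
    "click_real_dynamic",
    "image_grid_dynamic",
    "logic_arithmetic_dynamic",
})


def dynamic_variant(base_type: str) -> str:
    """Return the CLI type name with the _dynamic suffix appended."""
    if base_type not in BASE_TYPES:
        raise ValueError(f"unknown base CAPTCHA type: {base_type!r}")
    return f"{base_type}_dynamic"


def is_documented_dynamic(cli_type: str) -> bool:
    """True only for dynamic types that this README mentions by name."""
    return cli_type in DOCUMENTED_DYNAMIC


if __name__ == "__main__":
    for base in BASE_TYPES:
        name = dynamic_variant(base)
        status = "documented" if is_documented_dynamic(name) else "not confirmed here"
        print(f"{name}: {status}")
```

Checking a derived name against the documented set before passing it to --type avoids assuming that every family has a dynamic implementation.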

Installation

For the CAPTCHA web server only:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements-server.txt

For MobileAgent demos and batch evaluation:

pip install -r requirements.txt

MobileAgent evaluation requires an Android emulator or device, working ADB, and an OpenAI-compatible model API.

Configuration

Copy the configuration template:

cp config.yaml.example config.yaml

Common fields:

env:
  adb_path: "/path/to/android-sdk/platform-tools/adb"
  API_url: "https://api.openai.com/v1"
  token: ""
  caption_model: "gpt-4o"
  reflect_model: "gpt-4o"
  judge_model: "gpt-4o"

captcha:
  unsplash:
    access_key: ""

If env.API_url or env.token is empty, the evaluation scripts fall back to the OPENAI_API_URL and OPENAI_API_KEY environment variables. Remote images are optional: PEXELS_API_KEY can provide them for missing_piece, and UNSPLASH_ACCESS_KEY or captcha.unsplash.access_key can provide them for pic_puzzle and image_grid.
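The fallback described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `resolve_api_settings` is a hypothetical function, the config is passed as an already-parsed dictionary, and the sketch assumes the fallback applies to each field independently, which the README does not spell out.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_api_settings(
    config: Mapping,
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_url, token), preferring config.yaml values over environment variables.

    Assumption: an empty or missing field falls back on its own,
    independently of the other field.
    """
    environ = os.environ if environ is None else environ
    env_section = config.get("env") or {}
    # `or` treats both a missing key and an empty string as "not set".
    api_url = env_section.get("API_url") or environ.get("OPENAI_API_URL", "")
    token = env_section.get("token") or environ.get("OPENAI_API_KEY", "")
    return api_url, token


if __name__ == "__main__":
    # Mirrors the template: API_url is set, token is left empty.
    parsed = {"env": {"API_url": "https://api.openai.com/v1", "token": ""}}
    url, token = resolve_api_settings(parsed)
    print(url, "(token set)" if token else "(token missing)")
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.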

Quick Start

Run a local CAPTCHA page:

python -m web.captcha_page --type text --port 8000 --no-debug

Open http://localhost:8000.
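When the page is started from a script rather than by hand, it can help to wait until the port accepts connections before pointing a browser or agent at it. The following is a generic sketch using only the Python standard library; `wait_for_port` is a hypothetical helper and is not part of HLL.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_sec: float = 10.0,
                  interval_sec: float = 0.2) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_sec):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval_sec)
    return False


if __name__ == "__main__":
    if wait_for_port("localhost", 8000):
        print("CAPTCHA page is reachable at http://localhost:8000")
    else:
        print("Nothing is listening on port 8000 yet")
```

This only confirms that the port is open; it does not verify that the CAPTCHA page itself rendered correctly.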

Useful examples:

python -m web.captcha_page --type slider_real --difficulty hard --port 8000 --no-debug
python -m web.captcha_page --type text --distraction-level 2 --port 8000 --no-debug
python -m web.captcha_page --type image_grid_dynamic --port 8000 --no-debug

Debug mode is available for local development:

python -m web.captcha_page --type text --port 8000 --debug

Agent Evaluation

Single MobileAgent demo:

python run_captcha_demo.py \
  --config config.yaml \
  --type slider_real \
  --attempts 3 \
  --distraction-level 1

Batch evaluation:

python run_captcha_eval.py \
  --config config.yaml \
  --models gpt-4o \
  --captcha-types text click_real click_icon slider_real puzzle_real missing_piece pic_puzzle five_line image_grid logic_arithmetic \
  --samples-per-type 5 \
  --attempts 1 \
  --max-steps 15 \
  --sample-timeout-sec 1200 \
  --output-dir eval_outputs/hll_repro_demo
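For repeated runs, the same command can be assembled from Python. This is a sketch under assumptions: `build_eval_command` is a hypothetical helper, the flags are exactly those shown in the command above, and it assumes --models accepts several space-separated names in the same way --captcha-types does.

```python
import sys
from typing import List, Sequence


def build_eval_command(
    models: Sequence[str],
    captcha_types: Sequence[str],
    output_dir: str,
    config: str = "config.yaml",
    samples_per_type: int = 5,
    attempts: int = 1,
    max_steps: int = 15,
    sample_timeout_sec: int = 1200,
) -> List[str]:
    """Build the argument list for run_captcha_eval.py using the flags shown in this README."""
    if not models or not captcha_types:
        raise ValueError("at least one model and one CAPTCHA type are required")
    return [
        sys.executable, "run_captcha_eval.py",
        "--config", config,
        "--models", *models,  # assumption: multiple values are allowed
        "--captcha-types", *captcha_types,
        "--samples-per-type", str(samples_per_type),
        "--attempts", str(attempts),
        "--max-steps", str(max_steps),
        "--sample-timeout-sec", str(sample_timeout_sec),
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        models=["gpt-4o"],
        captcha_types=["text", "slider_real"],
        output_dir="eval_outputs/hll_repro_demo",
    )
    # To launch the run: subprocess.run(cmd, check=True) from the repository root.
    print(" ".join(cmd))
```

Building the command as a list rather than a shell string avoids quoting problems when model names or paths contain spaces.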

Script example:

bash eval_captcha.sh gpt-4o

Project Map

HLL/
├── web/                         # CAPTCHA web server and task implementations
│   ├── captcha_page.py          # HTTP server entry point
│   └── captcha/                 # CAPTCHA implementations
├── MobileAgent/                 # Android GUI-agent evaluation loop
├── run_captcha_demo.py          # Single closed-loop demo
├── run_captcha_eval.py          # Batch evaluation entry point
├── eval_captcha.sh              # Batch evaluation example script
├── config.yaml.example          # Configuration template
├── requirements-server.txt      # Minimal web-server dependencies
└── requirements.txt             # Full evaluation dependencies

Evaluation Output

run_captcha_eval.py writes results to the directory passed with --output-dir, including aggregated results, per-sample metadata, agent traces, and screenshots.
