HLL is a CAPTCHA benchmark for evaluating multimodal GUI agents in an end-to-end interaction loop. It provides a local CAPTCHA web server, configurable webpage distraction, dynamic interaction validation, and MobileAgent-based evaluation scripts.
- CAPTCHA web server: `python -m web.captcha_page`.
- Single MobileAgent demo: `run_captcha_demo.py`.
- Multi-model batch evaluation: `run_captcha_eval.py` and `eval_captcha.sh`.
- Configuration templates: `config.yaml.example` and `.env.example`.
HLL follows the ten base task families used in the paper. Difficulty, distraction, and dynamic interaction validation are evaluation dimensions layered on top of these families; they are not counted as separate task families.
| Paper task family | CLI type | Interaction |
|---|---|---|
| Text transcription | `text` | Read a visual code and enter it correctly |
| Natural-image sequence selection | `click_real` | Click targets in the required order over natural-image content |
| Icon sequence selection | `click_icon` | Click symbolic targets in the specified sequence |
| Slider alignment | `slider_real` | Drag a slider until a visual gap is aligned |
| Jigsaw alignment | `puzzle_real` | Drag a missing component to its correct geometric position |
| Missing-patch selection | `missing_piece` | Select the local patch that completes an image |
| Tile restoration | `pic_puzzle` | Restore an image by swapping misplaced tiles |
| Board reconfiguration | `five_line` | Move a board piece until a valid five-in-a-row state is formed |
| Category-guided image selection | `image_grid` | Select all grid images matching a semantic category |
| Logic-and-arithmetic interaction | `logic_arithmetic` | Solve arithmetic, symbolic, or interface-mediated reasoning challenges |
Dynamic validation uses the corresponding _dynamic CLI suffix where implemented, for example slider_real_dynamic, click_real_dynamic, image_grid_dynamic, and logic_arithmetic_dynamic. Internal smoke-test and debugging types are kept out of the benchmark task list.
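The naming rule above can be sketched in a few lines. This is an illustration only: `dynamic_type` is a hypothetical helper, not part of HLL, and `KNOWN_DYNAMIC` lists just the four dynamic variants named in this README, so other implemented variants may be missing from it.

```python
from typing import Optional

# The ten base CLI types from the task-family table.
BASE_TYPES = (
    "text", "click_real", "click_icon", "slider_real", "puzzle_real",
    "missing_piece", "pic_puzzle", "five_line", "image_grid", "logic_arithmetic",
)

DYNAMIC_SUFFIX = "_dynamic"

# Assumption: only the variants explicitly named in this README are listed here.
KNOWN_DYNAMIC = {
    "slider_real_dynamic",
    "click_real_dynamic",
    "image_grid_dynamic",
    "logic_arithmetic_dynamic",
}


def dynamic_type(base: str) -> Optional[str]:
    """Return the dynamic CLI type for a base type, or None if none is known."""
    if base not in BASE_TYPES:
        raise ValueError(f"unknown base CLI type: {base!r}")
    candidate = base + DYNAMIC_SUFFIX
    return candidate if candidate in KNOWN_DYNAMIC else None
```

For example, `dynamic_type("slider_real")` yields `slider_real_dynamic`, while `dynamic_type("text")` yields `None` because no dynamic text variant is named here.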
For the CAPTCHA web server only:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements-server.txt

For MobileAgent demos and batch evaluation:

    pip install -r requirements.txt

MobileAgent evaluation requires an Android emulator or device, working ADB, and an OpenAI-compatible model API.
Copy the configuration template:

    cp config.yaml.example config.yaml

Common fields:

    env:
      adb_path: "/path/to/android-sdk/platform-tools/adb"
      API_url: "https://api.openai.com/v1"
      token: ""
      caption_model: "gpt-4o"
      reflect_model: "gpt-4o"
      judge_model: "gpt-4o"
    captcha:
      unsplash:
        access_key: ""

If `env.API_url` or `env.token` is empty, the evaluation scripts read `OPENAI_API_URL` and `OPENAI_API_KEY`. `PEXELS_API_KEY` can provide remote images for `missing_piece`; `UNSPLASH_ACCESS_KEY` or `captcha.unsplash.access_key` can provide remote images for `pic_puzzle` and `image_grid`.
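The fallback from config fields to environment variables might look roughly like the sketch below. `resolve_api_settings` is a hypothetical name, not a function from the HLL scripts, and the sketch assumes the fallback is applied per field (an empty `API_url` falls back to `OPENAI_API_URL` independently of `token`), which the description above does not spell out.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_api_settings(
    env_cfg: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (api_url, token), preferring config values over environment variables."""
    environ = os.environ if environ is None else environ
    # An empty or missing config value falls back to the environment variable.
    api_url = env_cfg.get("API_url") or environ.get("OPENAI_API_URL", "")
    token = env_cfg.get("token") or environ.get("OPENAI_API_KEY", "")
    return api_url, token
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.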
Run a local CAPTCHA page:

    python -m web.captcha_page --type text --port 8000 --no-debug

Open http://localhost:8000.
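When the page is started from a script before an agent is launched, it can help to wait until the server actually answers. The following is a minimal standard-library sketch; `wait_until_up` is a hypothetical helper, not part of HLL, and it treats any HTTP response (including an error status) as evidence that the server is up.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url: str, timeout_s: float = 10.0, interval_s: float = 0.2) -> bool:
    """Poll `url` until it returns any HTTP response or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s + 1.0):
                return True
        except urllib.error.HTTPError:
            # The server responded, even if with an error status code.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not ready yet.
            time.sleep(interval_s)
    return False
```

A caller would typically run `wait_until_up("http://localhost:8000")` after starting the server process and proceed only if it returns `True`.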
Useful examples:

    python -m web.captcha_page --type slider_real --difficulty hard --port 8000 --no-debug
    python -m web.captcha_page --type text --distraction-level 2 --port 8000 --no-debug
    python -m web.captcha_page --type image_grid_dynamic --port 8000 --no-debug

Debug mode is available for local development:

    python -m web.captcha_page --type text --port 8000 --debug

Single MobileAgent demo:

    python run_captcha_demo.py \
      --config config.yaml \
      --type slider_real \
      --attempts 3 \
      --distraction-level 1

Batch evaluation:

    python run_captcha_eval.py \
      --config config.yaml \
      --models gpt-4o \
      --captcha-types text click_real click_icon slider_real puzzle_real missing_piece pic_puzzle five_line image_grid logic_arithmetic \
      --samples-per-type 5 \
      --attempts 1 \
      --max-steps 15 \
      --sample-timeout-sec 1200 \
      --output-dir eval_outputs/hll_repro_demo

Script example:

    bash eval_captcha.sh gpt-4o

Repository layout:

    HLL/
    ├── web/                      # CAPTCHA web server and task implementations
    │   ├── captcha_page.py       # HTTP server entry point
    │   └── captcha/              # CAPTCHA implementations
    ├── MobileAgent/              # Android GUI-agent evaluation loop
    ├── run_captcha_demo.py       # Single closed-loop demo
    ├── run_captcha_eval.py       # Batch evaluation entry point
    ├── eval_captcha.sh           # Batch evaluation example script
    ├── config.yaml.example       # Configuration template
    ├── requirements-server.txt   # Minimal web-server dependencies
    └── requirements.txt          # Full evaluation dependencies
run_captcha_eval.py writes results to the directory passed with --output-dir, including aggregated results, per-sample metadata, agent traces, and screenshots.
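For sweeps over several models or task subsets, the batch command can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this README; `build_eval_command` is a hypothetical helper, and it assumes `--models` accepts several space-separated names in the same way `--captcha-types` does.

```python
import sys
from typing import List, Sequence


def build_eval_command(
    models: Sequence[str],
    captcha_types: Sequence[str],
    output_dir: str,
    config: str = "config.yaml",
    samples_per_type: int = 5,
    attempts: int = 1,
    max_steps: int = 15,
    sample_timeout_sec: int = 1200,
) -> List[str]:
    """Build the argv list for run_captcha_eval.py without executing it."""
    if not models or not captcha_types:
        raise ValueError("at least one model and one CAPTCHA type are required")
    return [
        sys.executable, "run_captcha_eval.py",
        "--config", config,
        # Assumption: multiple models are passed as separate arguments.
        "--models", *models,
        "--captcha-types", *captcha_types,
        "--samples-per-type", str(samples_per_type),
        "--attempts", str(attempts),
        "--max-steps", str(max_steps),
        "--sample-timeout-sec", str(sample_timeout_sec),
        "--output-dir", output_dir,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root; keeping it as a list avoids shell-quoting problems with paths that contain spaces.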