Evaluates the Browser Use Cloud API v3 on the Online-Mind2Web benchmark — 300 real-world web navigation tasks across 136 websites.
| Model | Tasks | Passed | Pass Rate |
|---|---|---|---|
| bu-max | 300 | 291 | 97% |
Detailed results for each task are in `results/`.
9 tasks were not completed — 2 are fundamentally impossible for any browser automation agent, and 7 were genuine failures:
| Task | Reason |
|---|---|
| Create a meme with a frog as the background and leave the only text with "Enjoy your life". | Impossible — requires image editing outside the browser |
| Pass the first trending chess puzzle. | Impossible — requires board interaction that is not supported |
| Browse senior spayed/neutered dogs near zip code 90028 | Failed |
| If there are any used Apple Mac Studio that has M4 Max 16-core processor, add the cheapest one to the cart | Failed |
| My name is Mike. I want to send my friend Christene a Dillard's "Merry Christmas" eGift Card | Failed |
| Find the cheapest travel deal or discount to Thailand that lasts more than 10 days | Failed |
| Find an Airbnb in Cleveland for three nights | Failed |
| Using a calculator to determine how much I can have in my 401(k) account at retirement | Failed |
| Create a list of drip coffee makers that are on sale and within $25-60 | Failed |
Excluding the 2 impossible tasks, the pass rate is 291/298 = 97.7%.
`run.py` loads all 300 Online-Mind2Web tasks from HuggingFace, dispatches them to the
Browser Use v3 API in parallel, and saves each session's output to
`results/{task_id}/result.json`. Runs are resumable — tasks with an existing
`result.json` are skipped automatically.
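The skip-if-done check can be sketched as follows (a minimal illustration, not the actual `run.py` internals — the `pending_tasks` helper and the task-dict shape are assumptions):

```python
from pathlib import Path

def pending_tasks(tasks, results_dir="results"):
    """Return only tasks that have no saved result.json yet (resume support).

    `tasks` is assumed to be a list of dicts with a "task_id" key.
    """
    return [
        t for t in tasks
        if not (Path(results_dir) / t["task_id"] / "result.json").exists()
    ]

# With no results directory present, every task is still pending
tasks = [{"task_id": "abc123"}, {"task_id": "def456"}]
print(len(pending_tasks(tasks, results_dir="fresh_results")))  # → 2
```

Because the check is purely file-based, an interrupted run can be resumed by re-invoking the script with the same `--results-dir`.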
```bash
# 1. Install dependencies
pip install httpx datasets python-dotenv

# 2. Set your API key
cp .env.example .env
# Edit .env and fill in BROWSER_USE_API_KEY
```

```bash
python run.py [options]
```

| Flag | Default | Description |
|---|---|---|
| `--model` | `bu-ultra` | Browser Use model: `bu-mini`, `bu-max`, or `bu-ultra` |
| `--concurrency` | 25 | Number of sessions running in parallel |
| `--results-dir` | `results` | Directory to save results |
| `--split` | `test` | HuggingFace dataset split |
| `--limit` | (all) | Run only the first N tasks — useful for smoke tests |
| `--task-ids-file` | (none) | Path to a JSON list of task IDs to run (subset filter) |
```bash
# Full run — all 300 tasks with bu-ultra (default)
python run.py

# Run with bu-max, 20 parallel sessions
python run.py --model bu-max --concurrency 20

# Smoke test — first 10 tasks
python run.py --limit 10

# Resume an interrupted run (pass the same results-dir)
python run.py --results-dir results

# Run a specific subset of task IDs
python run.py --task-ids-file my_task_ids.json
```
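The task-IDs file is a plain JSON list of ID strings. A sketch of how such a filter might be parsed (the `load_task_id_filter` helper is an assumption for illustration, not part of `run.py`):

```python
import json
import tempfile

def load_task_id_filter(path):
    """Parse a --task-ids-file: expected to contain a JSON array of task-ID strings."""
    with open(path) as f:
        ids = json.load(f)
    if not isinstance(ids, list):
        raise ValueError("task-ids file must contain a JSON list")
    return set(ids)

# Write and read back a sample filter file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(["task_001", "task_002"], f)
print(sorted(load_task_id_filter(f.name)))  # → ['task_001', 'task_002']
```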
Results are written to `results/{task_id}/result.json`:

```
results/
  {task_id}/
    result.json
```
`result.json` fields:

| Field | Description |
|---|---|
| `task_id` | Benchmark task identifier |
| `task` | Natural-language instruction |
| `final_result_response` | Agent's final output |
| `judge.score` | 0 or 100 (binary pass/fail) — generated by Claude, see below |
| `judge.reasoning` | Judge's step-by-step reasoning — generated by Claude, see below |
| `_meta` | Run metadata (model, session_id, status, cost) — only present for new runs |
The judge fields are not produced by run.py. To generate them, open Claude Code in this
repo and paste the following prompt:
You are an expert evaluator judging whether a web automation agent completed a task.
Judging principles:
1. Outcome over journey — Did the user get what they needed?
2. Trust extractions only when supported — Specific details count only if the action trace shows the agent reached the source.
3. Accept reasonable interpretations — Semantic equivalence, paraphrasing, and concise summaries all count.
4. "Not found" is a valid answer — A thorough search that correctly finds nothing IS task completion.
5. No output = fail — An empty or purely navigational response is a failure.
6. Wrong information = fail — Incorrect facts, wrong prices, wrong dates are failures.
Scores are binary: 0 (fail) or 100 (pass). No partial credit.
Read results/{task_id}/result.json. Evaluate the agent's `final_result_response` against
the `task`. Write your verdict as:
**Score:** 0 or 100
**Reasoning:** [Step-by-step explanation]
To judge all tasks, glob results/*/result.json and evaluate each one.
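Once the judge fields are filled in, an overall pass rate can be tallied with a small script (a sketch assuming the `judge.score` field described in the table above; unjudged files are skipped):

```python
import json
from pathlib import Path

def summarize(results_dir="results"):
    """Tally binary judge scores (0/100) across all result.json files."""
    passed = total = 0
    for path in Path(results_dir).glob("*/result.json"):
        data = json.loads(path.read_text())
        judge = data.get("judge")
        if judge is None:
            continue  # not yet judged by Claude
        total += 1
        passed += judge["score"] == 100
    return passed, total

passed, total = summarize()
if total:
    print(f"{passed}/{total} = {passed / total:.1%}")
else:
    print("no judged results yet")
```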
| Variable | Required | Description |
|---|---|---|
| `BROWSER_USE_API_KEY` | Yes | Browser Use Cloud API key (v3) |
If you use Online-Mind2Web in your work, please cite the original paper:
```bibtex
@article{xue2025illusionprogressassessingcurrent,
  title={An Illusion of Progress? Assessing the Current State of Web Agents},
  author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and
          Boyu Gou and Dawn Song and Huan Sun and Yu Su},
  year={2025},
  eprint={2504.01382},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
}
```