Evaluates the Browser Use Cloud API v3 on the Online-Mind2Web benchmark — 300 real-world web navigation tasks across 136 websites.
| Model | Tasks | Passed | Pass Rate |
|---|---|---|---|
| bu-max | 300 | 291 | 97% |
Detailed results for each task are in `results/`.
9 tasks were not completed — 2 are fundamentally impossible for any browser automation agent, and 7 were genuine failures:
| Task | Reason |
|---|---|
| Create a meme with a frog as the background and leave the only text with "Enjoy your life". | Impossible — requires image editing outside the browser |
| Pass the first trending chess puzzle. | Impossible — requires board interaction that is not supported |
| Browse senior spayed/neutered dogs near zip code 90028 | Failed |
| If there are any used Apple Mac Studio that has M4 Max 16-core processor, add the cheapest one to the cart | Failed |
| My name is Mike. I want to send my friend Christene a Dillard's "Merry Christmas" eGift Card | Failed |
| Find the cheapest travel deal or discount to Thailand that lasts more than 10 days | Failed |
| Find an Airbnb in Cleveland for three nights | Failed |
| Using a calculator to determine how much I can have in my 401(k) account at retirement | Failed |
| Create a list of drip coffee makers that are on sale and within $25-60 | Failed |
Excluding the 2 impossible tasks, the pass rate is 291/298 = 97.7%.
`run.py` loads all 300 Online-Mind2Web tasks from HuggingFace, dispatches them to the
Browser Use v3 API in parallel, and saves each session's output to
`results/{task_id}/result.json`. Runs are resumable — tasks with an existing
`result.json` are skipped automatically.
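The skip-if-done check can be sketched as follows (a minimal illustration, not the actual `run.py` internals — the `pending_tasks` helper and the task-dict shape are assumptions):

```python
from pathlib import Path

def pending_tasks(tasks, results_dir="results"):
    """Return only tasks that have no saved result.json yet (resume support).

    `tasks` is assumed to be a list of dicts with a "task_id" key.
    """
    return [
        t for t in tasks
        if not (Path(results_dir) / t["task_id"] / "result.json").exists()
    ]

# With no results directory present, every task is still pending
tasks = [{"task_id": "abc123"}, {"task_id": "def456"}]
print(len(pending_tasks(tasks, results_dir="fresh_results")))  # → 2
```

Because the check is purely file-based, an interrupted run can be resumed by re-invoking the script with the same `--results-dir`.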
```bash
# 1. Install dependencies
pip install httpx datasets python-dotenv

# 2. Set your API key
cp .env.example .env
# Edit .env and fill in BROWSER_USE_API_KEY
```

```bash
python run.py [options]
```

| Flag | Default | Description |
|---|---|---|
| `--model` | `bu-ultra` | Browser Use model: `bu-mini`, `bu-max`, or `bu-ultra` |
| `--concurrency` | 25 | Number of sessions running in parallel |
| `--results-dir` | `results` | Directory to save results |
| `--split` | `test` | HuggingFace dataset split |
| `--limit` | (all) | Run only the first N tasks — useful for smoke tests |
| `--task-ids-file` | (none) | Path to a JSON list of task IDs to run (subset filter) |
```bash
# Full run — all 300 tasks with bu-ultra (default)
python run.py

# Run with bu-max, 20 parallel sessions
python run.py --model bu-max --concurrency 20

# Smoke test — first 10 tasks
python run.py --limit 10

# Resume an interrupted run (pass the same results-dir)
python run.py --results-dir results

# Run a specific subset of task IDs
python run.py --task-ids-file my_task_ids.json
```
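The task-IDs file is a plain JSON list of ID strings. A sketch of how such a filter might be parsed (the `load_task_id_filter` helper is an assumption for illustration, not part of `run.py`):

```python
import json
import tempfile

def load_task_id_filter(path):
    """Parse a --task-ids-file: expected to contain a JSON array of task-ID strings."""
    with open(path) as f:
        ids = json.load(f)
    if not isinstance(ids, list):
        raise ValueError("task-ids file must contain a JSON list")
    return set(ids)

# Write and read back a sample filter file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(["task_001", "task_002"], f)
print(sorted(load_task_id_filter(f.name)))  # → ['task_001', 'task_002']
```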
Results are written to `results/{task_id}/result.json`:

```
results/
  {task_id}/
    result.json
```
`result.json` fields:

| Field | Description |
|---|---|
| `task_id` | Benchmark task identifier |
| `task` | Natural-language instruction |
| `final_result_response` | Agent's final output |
| `judge.score` | 0 or 100 (binary pass/fail) — generated by Claude, see below |
| `judge.reasoning` | Judge's step-by-step reasoning — generated by Claude, see below |
| `_meta` | Run metadata (model, session_id, status, cost) — only present for new runs |
The judge fields are not produced by run.py. To generate them, open Claude Code in this
repo and paste the following prompt:
You are an expert evaluator judging whether a web automation agent completed a task.
Judging principles:
1. Outcome over journey — Did the user get what they needed?
2. Trust extractions only when supported — Specific details count only if the action trace shows the agent reached the source.
3. Accept reasonable interpretations — Semantic equivalence, paraphrasing, and concise summaries all count.
4. "Not found" is a valid answer — A thorough search that correctly finds nothing IS task completion.
5. No output = fail — An empty or purely navigational response is a failure.
6. Wrong information = fail — Incorrect facts, wrong prices, wrong dates are failures.
Scores are binary: 0 (fail) or 100 (pass). No partial credit.
Read results/{task_id}/result.json. Evaluate the agent's `final_result_response` against
the `task`. Write your verdict as:
**Score:** 0 or 100
**Reasoning:** [Step-by-step explanation]
To judge all tasks, glob results/*/result.json and evaluate each one.
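Once the judge fields are filled in, an overall pass rate can be tallied with a small script (a sketch assuming the `judge.score` field described in the table above; unjudged files are skipped):

```python
import json
from pathlib import Path

def summarize(results_dir="results"):
    """Tally binary judge scores (0/100) across all result.json files."""
    passed = total = 0
    for path in Path(results_dir).glob("*/result.json"):
        data = json.loads(path.read_text())
        judge = data.get("judge")
        if judge is None:
            continue  # not yet judged by Claude
        total += 1
        passed += judge["score"] == 100
    return passed, total

passed, total = summarize()
if total:
    print(f"{passed}/{total} = {passed / total:.1%}")
else:
    print("no judged results yet")
```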
| Variable | Required | Description |
|---|---|---|
| `BROWSER_USE_API_KEY` | Yes | Browser Use Cloud API key (v3) |
If you use Online-Mind2Web in your work, please cite the original paper:
```bibtex
@article{xue2025illusionprogressassessingcurrent,
  title={An Illusion of Progress? Assessing the Current State of Web Agents},
  author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and
          Boyu Gou and Dawn Song and Huan Sun and Yu Su},
  year={2025},
  eprint={2504.01382},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
}
```