
Online-Mind2Web

Check out the full blog post

Evaluates the Browser Use Cloud API v3 on the Online-Mind2Web benchmark — 300 real-world web navigation tasks across 136 websites.

Results

Browser Agent Benchmark

| Model | Tasks | Passed | Pass Rate |
| --- | --- | --- | --- |
| bu-max | 300 | 291 | 97% |

Detailed results for each task are in results/.

Tasks not completed

9 tasks were not completed — 2 are fundamentally impossible for any browser automation agent, and 7 were genuine failures:

| Task | Reason |
| --- | --- |
| Create a meme with a frog as the background and leave the only text with "Enjoy your life". | Impossible — requires image editing outside the browser |
| Pass the first trending chess puzzle. | Impossible — requires board interaction that is not supported |
| Browse senior spayed/neutered dogs near zip code 90028 | Failed |
| If there are any used Apple Mac Studio that has M4 Max 16-core processor, add the cheapest one to the cart | Failed |
| My name is Mike. I want to send my friend Christene a Dillard's "Merry Christmas" eGift Card | Failed |
| Find the cheapest travel deal or discount to Thailand that lasts more than 10 days | Failed |
| Find an Airbnb in Cleveland for three nights | Failed |
| Using a calculator to determine how much I can have in my 401(k) account at retirement | Failed |
| Create a list of drip coffee makers that are on sale and within $25-60 | Failed |

Excluding the 2 impossible tasks: 291/298 = 97.7%.

How it works

run.py loads all 300 Online-Mind2Web tasks from HuggingFace, dispatches them to the Browser Use v3 API in parallel, and saves each session's output to results/{task_id}/result.json. Results are resumable — tasks with an existing result.json are skipped automatically.
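The skip-on-resume behavior described above can be sketched with a small helper. This is a minimal sketch with hypothetical names (`pending_tasks`, `save_result` are illustrative, not the actual functions in run.py):

```python
import json
from pathlib import Path

def pending_tasks(tasks, results_dir="results"):
    """Yield only tasks whose result.json does not exist yet,
    mirroring the resume behavior: finished tasks are skipped."""
    for task in tasks:
        out = Path(results_dir) / task["task_id"] / "result.json"
        if not out.exists():
            yield task

def save_result(task_id, payload, results_dir="results"):
    """Write one session's output to results/{task_id}/result.json."""
    out = Path(results_dir) / task_id / "result.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2))
```

Because presence of `result.json` is the only completion marker, an interrupted run can simply be restarted against the same results directory.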


Setup

```bash
# 1. Install dependencies
pip install httpx datasets python-dotenv

# 2. Set your API key
cp .env.example .env
# Edit .env and fill in BROWSER_USE_API_KEY
```

Running the benchmark

```bash
python run.py [options]
```
| Flag | Default | Description |
| --- | --- | --- |
| `--model` | `bu-ultra` | Browser Use model: `bu-mini`, `bu-max`, or `bu-ultra` |
| `--concurrency` | `25` | Number of sessions running in parallel |
| `--results-dir` | `results` | Directory to save results |
| `--split` | `test` | HuggingFace dataset split |
| `--limit` | (all) | Run only the first N tasks — useful for smoke tests |
| `--task-ids-file` | (none) | Path to a JSON list of task IDs to run (subset filter) |
```bash
# Full run — all 300 tasks with bu-ultra (default)
python run.py

# Run with bu-max, 20 parallel sessions
python run.py --model bu-max --concurrency 20

# Smoke test — first 10 tasks
python run.py --limit 10

# Resume an interrupted run (pass the same results-dir)
python run.py --results-dir results

# Run a specific subset of task IDs
python run.py --task-ids-file my_task_ids.json
```
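`--task-ids-file` expects a JSON array of task IDs. One way to build such a file is to collect the IDs of failed runs for a retry; a sketch, assuming the judge fields (see Output format) have already been written:

```python
import json
from pathlib import Path

# Collect task IDs whose judge marked them as failed (score == 0).
failed = []
for path in Path("results").glob("*/result.json"):
    result = json.loads(path.read_text())
    if result.get("judge", {}).get("score") == 0:
        failed.append(result["task_id"])

# Write the subset file consumed by --task-ids-file.
Path("my_task_ids.json").write_text(json.dumps(failed))
```

Note that re-running failed tasks into the same results directory requires deleting their existing `result.json` first, since present files are skipped.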

Output format

Results are written to results/{task_id}/result.json:

```
results/
  {task_id}/
    result.json
```

result.json fields:

| Field | Description |
| --- | --- |
| `task_id` | Benchmark task identifier |
| `task` | Natural-language instruction |
| `final_result_response` | Agent's final output |
| `judge.score` | 0 or 100 (binary pass/fail) — generated by Claude, see below |
| `judge.reasoning` | Judge's step-by-step reasoning — generated by Claude, see below |
| `_meta` | Run metadata (model, session_id, status, cost) — only present for new runs |

The judge fields are not produced by run.py. To generate them, open Claude Code in this repo and paste the following prompt:

```
You are an expert evaluator judging whether a web automation agent completed a task.

Judging principles:
1. Outcome over journey — Did the user get what they needed?
2. Trust extractions only when supported — Specific details count only if the action trace shows the agent reached the source.
3. Accept reasonable interpretations — Semantic equivalence, paraphrasing, and concise summaries all count.
4. "Not found" is a valid answer — A thorough search that correctly finds nothing IS task completion.
5. No output = fail — An empty or purely navigational response is a failure.
6. Wrong information = fail — Incorrect facts, wrong prices, wrong dates are failures.

Scores are binary: 0 (fail) or 100 (pass). No partial credit.

Read results/{task_id}/result.json. Evaluate the agent's `final_result_response` against
the `task`. Write your verdict as:

**Score:** 0 or 100
**Reasoning:** [Step-by-step explanation]

To judge all tasks, glob results/*/result.json and evaluate each one.
```
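Once the judge fields are in place, the overall pass rate can be tallied from the saved scores. A minimal sketch, assuming every `result.json` carries a `judge.score`:

```python
import json
from pathlib import Path

# Gather all judge scores from the results tree.
scores = [
    json.loads(p.read_text()).get("judge", {}).get("score")
    for p in Path("results").glob("*/result.json")
]
passed = sum(1 for s in scores if s == 100)
total = len(scores)
rate = f" ({passed / total:.1%})" if total else ""
print(f"{passed}/{total} passed{rate}")
```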

Environment variables

| Variable | Required | Description |
| --- | --- | --- |
| `BROWSER_USE_API_KEY` | Yes | Browser Use Cloud API key (v3) |
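run.py reads the key from the environment (loaded from `.env` via python-dotenv, per the setup step). A sketch of a fail-fast lookup, where `get_api_key` is a hypothetical helper name, not a function in the repo:

```python
import os

def get_api_key() -> str:
    """Return the Browser Use Cloud API key, failing fast if unset."""
    key = os.environ.get("BROWSER_USE_API_KEY")
    if not key:
        raise SystemExit("BROWSER_USE_API_KEY is not set; see .env.example")
    return key
```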

Citation

If you use Online-Mind2Web in your work, please cite the original paper:

```bibtex
@article{xue2025illusionprogressassessingcurrent,
  title={An Illusion of Progress? Assessing the Current State of Web Agents},
  author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and
          Boyu Gou and Dawn Song and Huan Sun and Yu Su},
  year={2025},
  eprint={2504.01382},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
}
```
