bettyguo/computer_use_agent

careful-browse

A local-VLM browser agent built around three deliberate choices:

  1. Accessibility-tree first. Most agents look at pixels and ground actions with a VLM. We look at the DOM accessibility tree (the AT tree) first, because it is faster, semantically labelled, and cheaper, and fall back to a vision-language model only when the AT tree is missing or ambiguous. Every routing decision is logged.
  2. Refusals enforced in code, not prompts. Password fields, payment flows, social-media posts, sensitive domains, robots.txt-disallowed paths, and SSRF-style private-network navigation are blocked architecturally. The filter has its own test suite, including jailbreak attempts.
  3. Honest benchmarks, including the failures. The repo ships a failure atlas alongside the success metrics. Every cherry-picked success GIF is paired with a failure GIF. Numbers are pre-registered.
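
To make "enforced in code, not prompts" concrete, here is a minimal sketch of two of the refusals listed above: private-network navigation and password fields. The names `Verdict`, `check_navigation`, and `check_fill` are hypothetical and are not taken from this repo; the real filter covers more categories and is not reproduced here.

```python
import ipaddress
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Verdict:
    """Hypothetical result type: a decision plus a loggable reason."""
    allowed: bool
    reason: str


def check_navigation(url: str) -> Verdict:
    """Deny non-HTTP(S) schemes and literal private-network addresses."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return Verdict(False, "malformed URL")
    if parts.scheme not in ("http", "https"):
        return Verdict(False, f"scheme not allowed: {parts.scheme or '(none)'}")
    host = parts.hostname
    if host is None:
        return Verdict(False, "missing host")
    if host == "localhost" or host.endswith(".localhost"):
        return Verdict(False, "loopback hostname")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. This sketch stops here; a production filter
        # would also need to check resolved addresses and redirects.
        return Verdict(True, "hostname (not resolved in this sketch)")
    if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
        return Verdict(False, f"private or reserved address: {addr}")
    return Verdict(True, "public address")


def check_fill(input_type: str) -> Verdict:
    """Deny typing into password inputs, whatever the planner asked for."""
    if input_type.strip().lower() == "password":
        return Verdict(False, "password field")
    return Verdict(True, "non-credential field")
```

Because the check is an ordinary function, it runs after the model has chosen an action and cannot be talked out of its verdict by prompt text, which is also what makes it straightforward to unit-test.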

Status: 0.1.0a1. The architecture and safety surface are real and tested. Headline benchmark numbers and the hosted demo are pending live runs. Don't cite numbers from this repo until they're published.

What this is

Is | Is not
Apache-2.0 browser agent on local VLMs (Qwen2.5-VL 7B default). | A frontier-API wrapper. No OpenAI/Anthropic key path in 0.1, on purpose.
Read-only and no-credential by default. | A general authenticated browsing assistant.
Auditable: every action carries a routing decision, a safety verdict, an execution result. | A black-box system you should trust without inspection.
A research artifact with a published failure atlas. | A claim to outperform closed agents. We are not racing them.
Identifies as automation (no fingerprint evasion). | A stealth bot. Sites that block automation block this agent.

Related work

  • UI-TARS: purpose-built UI VLM. Used here as a grounding baseline.
  • Stagehand: closed-LLM browser agent with opportunistic AT-tree usage.
  • Browser-Use: frontier-API browser agent. Different identity, different scope.

If you maintain one of these and the comparison reads as unfair, open an issue.

Quick start

uv sync --all-extras
uv run playwright install chromium

# unit tests, no VLM, no live browser
uv run pytest

# run with a local VLM (see docs/serving-vlm.md for serving Qwen2.5-VL 7B)
export CAREFUL_BROWSE_VLM_URL=http://localhost:8000
uv run careful-browse run "search Wikipedia for 'attention is all you need' and read me the first sentence"

# benchmarks (requires VLM server)
make bench

Architecture

The planner turns the user goal plus a screenshot and AT-tree snapshot into typed steps. For each step the router resolves the target via the AT tree (preferred) or the VLM (fallback). The resolved action passes through the safety filter; denies are logged, allows go to a Playwright executor. After each action the verifier checks the expected effect; failures consume a retry budget and trigger a replan. The full trajectory is written to disk for every run. Full detail: docs/architecture.md.
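
The control flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's implementation: `Step`, `Record`, `route`, and `run` are hypothetical names, the AT tree is reduced to a dict of node id to accessible name, and the planner, VLM, filter, executor, and verifier are passed in as plain callables. Skipping a denied step and continuing is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str   # e.g. "click" or "fill"
    target: str   # description of the element to act on


@dataclass
class Record:
    step: Step
    route: str    # "at_tree" or "vlm"
    verdict: str  # "allow" or "deny"
    result: str   # "ok", "failed", or "skipped"


def route(step, at_tree, vlm_locate):
    """Prefer the AT tree; use the VLM when there is not exactly one match."""
    wanted = step.target.lower()
    matches = [node for node, name in at_tree.items() if wanted in name.lower()]
    if len(matches) == 1:
        return "at_tree", matches[0]
    return "vlm", vlm_locate(step)  # missing (0) or ambiguous (>1)


def run(goal, plan, at_tree, vlm_locate, is_allowed, execute, verify,
        retry_budget=2):
    """Plan, route, filter, execute, verify; replan on failure."""
    trajectory = []
    queue = list(plan(goal, trajectory))
    while queue:
        step = queue.pop(0)
        how, locator = route(step, at_tree, vlm_locate)
        if not is_allowed(step, locator):
            # Denies are recorded and never reach the executor.
            trajectory.append(Record(step, how, "deny", "skipped"))
            continue
        execute(step, locator)
        if verify(step):
            trajectory.append(Record(step, how, "allow", "ok"))
            continue
        trajectory.append(Record(step, how, "allow", "failed"))
        if retry_budget == 0:
            break
        retry_budget -= 1
        queue = list(plan(goal, trajectory))  # replan with the failure in view
    return trajectory
```

Each `Record` carries the three fields the README promises per action (routing decision, safety verdict, execution result), so the returned list is what a run would serialize to disk.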

Safety

Development

make ci         # lint + typecheck + unit tests
make test       # pytest
make bench      # benchmark runs (requires VLM server)
make safety     # red-team suite
make demo       # local demo against Playwright + local VLM

Useful scripts and tools:

  • scripts/smoke.py: end-to-end smoke test against Wikipedia using a hand-written plan. No VLM server needed; uses FixedPlanner + the noop VLM. Useful for verifying that Playwright + the safety filter + the AT-tree router are wired up before standing up a real VLM.
  • careful-browse safety-check <trajectory.json> --replay: re-evaluates a recorded trajectory against the current filter and reports any divergence.
  • python -m bench.tools.atlas <run_dir>/: produces the failure-atlas JSON from a benchmark run.
  • python -m bench.tools.safety_metrics --red-team-only: per-category true-deny / true-allow counts from the red-team suite.
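
The idea behind the `--replay` check can be sketched as below. The trajectory schema here (a JSON list of objects with `action` and `verdict` keys) and the function name `find_divergences` are assumptions made for illustration; the actual file format and CLI internals may differ.

```python
import json


def find_divergences(trajectory_json, current_filter):
    """Re-run a filter over recorded actions and list verdicts that changed.

    `current_filter` is any callable mapping a recorded action (a dict)
    to "allow" or "deny".
    """
    divergences = []
    for index, entry in enumerate(json.loads(trajectory_json)):
        now = current_filter(entry["action"])
        if now != entry["verdict"]:
            divergences.append(
                {"index": index, "recorded": entry["verdict"], "current": now}
            )
    return divergences
```

An empty result means the current filter agrees with every recorded verdict; any entry points at a step whose outcome would change under the current rules.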

Serving a local VLM: docs/serving-vlm.md.

Contributing

See CONTRIBUTING.md and CODE_OF_CONDUCT.md. Reports of safety-filter bypasses get priority; see SECURITY.md.

License

Apache-2.0. See LICENSE.
