A local-VLM browser agent built around three deliberate choices:
- Accessibility-tree first. Most agents look at pixels and ground via VLM. We look at the DOM accessibility tree first (faster, semantically labelled, cheaper) and fall back to a vision-language model only when the AT tree is missing or ambiguous. Every routing decision is logged.
- Refusals enforced in code, not prompts. Password fields, payment flows, social-media posts, sensitive domains, robots.txt-disallowed paths, and SSRF-style private-network navigation are blocked architecturally. The filter has its own test suite, including jailbreak attempts.
- Honest benchmarks, including the failures. The repo ships a failure atlas alongside the success metrics. Every cherry-picked success GIF is paired with a failure GIF. Numbers are pre-registered.
Status: 0.1.0a1. The architecture and safety surface are real and tested. Headline benchmark numbers and the hosted demo are pending live runs. Don't cite numbers from this repo until they're published.
| Is | Is not |
|---|---|
| Apache-2.0 browser agent on local VLMs (Qwen2.5-VL 7B default). | A frontier-API wrapper. No OpenAI/Anthropic key path in 0.1, on purpose. |
| Read-only and no-credential by default. | A general authenticated browsing assistant. |
| Auditable: every action carries a routing decision, a safety verdict, and an execution result. | A black-box system you should trust without inspection. |
| A research artifact with a published failure atlas. | A claim to outperform closed agents. We are not racing them. |
| Identifies as automation (no fingerprint evasion). | A stealth bot. Sites that block automation block this agent. |
- UI-TARS: purpose-built UI VLM. Used here as a grounding baseline.
- Stagehand: closed-LLM browser agent with opportunistic AT-tree usage.
- Browser-Use: frontier-API browser agent. Different identity, different scope.
If you maintain one of these and the comparison reads as unfair, open an issue.
uv sync --all-extras
uv run playwright install chromium
# unit tests, no VLM, no live browser
uv run pytest
# run with a local VLM (see docs/serving-vlm.md for serving Qwen2.5-VL 7B)
export CAREFUL_BROWSE_VLM_URL=http://localhost:8000
uv run careful-browse run "search Wikipedia for 'attention is all you need' and read me the first sentence"
# benchmarks (requires VLM server)
make bench
The planner turns the user goal plus a screenshot and AT-tree snapshot into typed steps. For each step the router resolves the target via the AT tree (preferred) or the VLM (fallback). The resolved action passes through the safety filter; denies are logged, allows go to a Playwright executor. After each action the verifier checks the expected effect; failures consume a retry budget and trigger a replan. The full trajectory is written to disk for every run. Full detail: docs/architecture.md.
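The routing step described above can be sketched as follows. This is a minimal illustration, not the code in this repo: `ATNode`, `RoutingDecision`, `route`, and the `vlm_ground` callback are hypothetical names, and the "exactly one match" rule is an assumed reading of "missing or ambiguous".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ATNode:
    """Hypothetical flattened accessibility-tree node."""
    role: str
    name: str
    node_id: str


@dataclass
class RoutingDecision:
    """Hypothetical record of how a target was resolved; meant to be logged."""
    source: str            # "at_tree" or "vlm"
    target: Optional[str]  # resolved node id, or whatever the VLM returned
    reason: str


def route(
    target_role: str,
    target_name: str,
    nodes: list[ATNode],
    vlm_ground: Callable[[str], Optional[str]],
) -> RoutingDecision:
    """Prefer the AT tree; fall back to the VLM when there is no unique match."""
    matches = [
        n for n in nodes
        if n.role == target_role and target_name.lower() in n.name.lower()
    ]
    if len(matches) == 1:
        return RoutingDecision("at_tree", matches[0].node_id, "unique AT-tree match")
    # Zero matches means the tree is missing the target; several means ambiguity.
    reason = "no AT-tree match" if not matches else "ambiguous AT-tree match"
    return RoutingDecision("vlm", vlm_ground(f"{target_role} {target_name}"), reason)
```

Because the decision is returned as a plain record instead of being acted on directly, it can be written to the trajectory before the safety filter and executor ever see the action.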
- Refusal categories are defined in src/careful_browse/safety/categories.py.
- The red-team suite lives at tests/safety/; CI runs it with strict markers and requires a 100% pass rate.
- For bypass reports, see SECURITY.md. For the ethics statement, ETHICS.md.
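To make "enforced in code" concrete, here is a minimal sketch of one refusal category, the SSRF-style private-network check, using only the Python standard library. The function name `is_private_target` is hypothetical and this is not the filter shipped in src/careful_browse/safety/; in particular, it only inspects IP literals and `localhost`, whereas a real filter would also need to resolve hostnames and check the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_private_target(url: str) -> bool:
    """Return True if navigation to `url` should be denied as private-network.

    Illustrative only: hostnames other than localhost are not resolved here.
    """
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return True  # malformed URL: fail closed
    if host is None:
        return True  # no host component: fail closed
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real filter must resolve DNS and re-check.
        return False
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_unspecified
    )
```

The check fails closed: anything it cannot parse is treated as a deny, so a malformed URL cannot slip past by confusing the parser.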
make ci # lint + typecheck + unit tests
make test # pytest
make bench # benchmark runs (requires VLM server)
make safety # red-team suite
make demo # local demo against Playwright + local VLM
Useful scripts and tools:
- scripts/smoke.py: end-to-end smoke test against Wikipedia using a hand-written plan. No VLM server needed; uses FixedPlanner + the noop VLM. Useful for verifying that Playwright, the safety filter, and the AT-tree router are wired up before standing up a real VLM.
- careful-browse safety-check <trajectory.json> --replay: re-evaluates a recorded trajectory against the current filter and reports any divergence.
- python -m bench.tools.atlas <run_dir>/: produces the failure-atlas JSON from a benchmark run.
- python -m bench.tools.safety_metrics --red-team-only: per-category true-deny / true-allow counts from the red-team suite.
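The replay idea behind safety-check can be sketched as below. The trajectory layout (a top-level `steps` list whose entries hold an `action` and a recorded `verdict`) is an assumption made for illustration, not the actual on-disk format, and `replay_divergences` is a hypothetical name.

```python
import json
from typing import Callable


def replay_divergences(
    trajectory_json: str,
    current_filter: Callable[[dict], str],
) -> list[dict]:
    """Re-run each recorded action through the filter and list changed verdicts.

    Assumes a hypothetical layout:
    {"steps": [{"action": {...}, "verdict": "allow" | "deny"}, ...]}
    """
    steps = json.loads(trajectory_json)["steps"]
    diverged = []
    for index, step in enumerate(steps):
        verdict_now = current_filter(step["action"])
        if verdict_now != step["verdict"]:
            diverged.append(
                {"step": index, "recorded": step["verdict"], "current": verdict_now}
            )
    return diverged
```

An empty result means the current filter agrees with every recorded verdict; any entry marks a step whose outcome would change if the run happened today.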
Serving a local VLM: docs/serving-vlm.md.
See CONTRIBUTING.md and CODE_OF_CONDUCT.md. Safety filter bypass reports get priority; see SECURITY.md.
Apache-2.0. See LICENSE.