careful-browse

A local-VLM browser agent built around three deliberate choices:

Accessibility-tree first. Most agents look at pixels and ground actions through a VLM. We consult the DOM accessibility tree (the AT tree) first because it is faster, semantically labelled, and cheaper, and fall back to a vision-language model only when the AT tree is missing or ambiguous. Every routing decision is logged.
Refusals enforced in code, not prompts. Password fields, payment flows, social-media posts, sensitive domains, robots.txt-disallowed paths, and SSRF-style private-network navigation are blocked architecturally. The filter has its own test suite, including jailbreak attempts.
Honest benchmarks, including the failures. The repo ships a failure atlas alongside the success metrics. Every cherry-picked success GIF comes paired with a failure GIF. Numbers are pre-registered.
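
To make the first choice concrete, here is a minimal sketch of an AT-tree-first routing rule. It is an illustration, not the router in this repo: the `ATNode` and `RoutingDecision` shapes, the `route` function, and the reason strings are all assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("router_sketch")


@dataclass(frozen=True)
class ATNode:
    """One accessibility-tree node (hypothetical shape)."""
    node_id: str
    role: str
    name: str


@dataclass(frozen=True)
class RoutingDecision:
    source: str              # "at_tree" or "vlm"
    reason: str              # why this path was taken
    node_id: Optional[str]   # set only when the AT tree resolved the target


def route(nodes: list, role: str, name: str) -> RoutingDecision:
    """Prefer the AT tree; fall back to the VLM when it is missing or ambiguous."""
    wanted = name.strip().lower()
    matches = [
        n for n in nodes
        if n.role == role and n.name.strip().lower() == wanted
    ]
    if not nodes:
        decision = RoutingDecision("vlm", "at_tree_missing", None)
    elif len(matches) == 1:
        decision = RoutingDecision("at_tree", "unique_match", matches[0].node_id)
    elif not matches:
        decision = RoutingDecision("vlm", "no_match", None)
    else:
        decision = RoutingDecision("vlm", "ambiguous_match", None)
    # Every decision is logged, whichever path was taken.
    log.info("routing decision: %s", decision)
    return decision
```

A unique role-and-name match resolves through the tree; an empty tree, no match, or several matches all hand the step to the VLM, and the reason string records which case applied.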

Status: 0.1.0a1. The architecture and safety surface are real and tested. Headline benchmark numbers and the hosted demo are pending live runs. Don't cite numbers from this repo until they're published.

What this is

Is	Is not
Apache-2.0 browser agent on local VLMs (Qwen2.5-VL 7B default).	A frontier-API wrapper. No OpenAI/Anthropic key path in 0.1, on purpose.
Read-only and no-credential by default.	A general authenticated browsing assistant.
Auditable: every action carries a routing decision, a safety verdict, an execution result.	A black-box system you should trust without inspection.
A research artifact with a published failure atlas.	A claim to outperform closed agents. We are not racing them.
Identifies as automation (no fingerprint evasion).	A stealth bot. Sites that block automation block this agent.

Related work

UI-TARS: purpose-built UI VLM. Used here as a grounding baseline.
Stagehand: closed-LLM browser agent with opportunistic AT-tree usage.
Browser-Use: frontier-API browser agent. Different identity, different scope.

If you maintain one of these and the comparison reads as unfair, open an issue.

Quick start

uv sync --all-extras
uv run playwright install chromium

# unit tests, no VLM, no live browser
uv run pytest

# run with a local VLM (see docs/serving-vlm.md for serving Qwen2.5-VL 7B)
export CAREFUL_BROWSE_VLM_URL=http://localhost:8000
uv run careful-browse run "search Wikipedia for 'attention is all you need' and read me the first sentence"

# benchmarks (requires VLM server)
make bench

Architecture

The planner turns the user goal plus a screenshot and AT-tree snapshot into typed steps. For each step the router resolves the target via the AT tree (preferred) or the VLM (fallback). The resolved action passes through the safety filter; denies are logged, allows go to a Playwright executor. After each action the verifier checks the expected effect; failures consume a retry budget and trigger a replan. The full trajectory is written to disk for every run. Full detail: docs/architecture.md.
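
The control flow described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in src/careful_browse: `Step`, `Record`, `run_goal`, and the callables passed in are hypothetical names, the retry budget is treated as one budget per run, and writing the trajectory to disk is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    action: str   # e.g. "click", "type", "read"
    target: str   # human-readable description of the target


@dataclass(frozen=True)
class Record:
    step: Step
    routing: str  # "at_tree" or "vlm"
    verdict: str  # "allow" or "deny"
    result: str   # "ok", "verify_failed", or "not_executed"


def run_goal(
    goal: str,
    plan: Callable[[str, list], list],
    route: Callable[[Step], str],
    is_allowed: Callable[[Step], bool],
    execute: Callable[[Step], object],
    verify: Callable[[Step], bool],
    retry_budget: int = 2,
) -> list:
    """Plan, then route -> safety filter -> execute -> verify for each step."""
    trajectory: list = []
    queue = list(plan(goal, trajectory))
    while queue:
        step = queue.pop(0)
        routing = route(step)
        if not is_allowed(step):
            # Denied steps are recorded but never reach the executor.
            trajectory.append(Record(step, routing, "deny", "not_executed"))
            continue
        execute(step)
        if verify(step):
            trajectory.append(Record(step, routing, "allow", "ok"))
            continue
        trajectory.append(Record(step, routing, "allow", "verify_failed"))
        if retry_budget == 0:
            break  # budget exhausted: stop instead of replanning again
        retry_budget -= 1
        queue = list(plan(goal, trajectory))  # replan from what has happened so far
    return trajectory
```

Each `Record` carries the three things the README promises per action: a routing decision, a safety verdict, and an execution result.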

Safety

Refusal categories are defined in src/careful_browse/safety/categories.py.
The red-team suite lives in tests/safety/; CI runs it with strict markers and requires a 100% pass rate.
For bypass reports, see SECURITY.md. For the ethics statement, ETHICS.md.
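
As an illustration of what "enforced in code" can look like, here is a small sketch covering two of the refusal categories: private-network navigation and password fields. It is not the filter in this repo. `Verdict`, `check_action`, `is_private_host`, and the category strings are hypothetical; the real category definitions are in src/careful_browse/safety/categories.py. The sketch only inspects IP literals and localhost names, so a real filter would also have to check what a DNS name resolves to.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    category: str  # "ok", or the (hypothetical) refusal category that fired


def is_private_host(host: str) -> bool:
    """True for localhost names and for IP literals in non-public ranges."""
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. DNS resolution is out of scope for this sketch.
        return False
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_unspecified


def check_action(kind: str, url: str, input_type: Optional[str] = None) -> Verdict:
    """Decide allow/deny from the action itself, independent of any prompt."""
    host = urlsplit(url).hostname or ""
    if kind == "navigate" and is_private_host(host):
        return Verdict(False, "private_network")
    if kind == "type" and input_type == "password":
        return Verdict(False, "password_field")
    return Verdict(True, "ok")
```

Because the check runs on the resolved action rather than on model output, a jailbreak prompt cannot change its result; the only way to change the verdict is to change the action.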

Development

make ci         # lint + typecheck + unit tests
make test       # pytest
make bench      # benchmark runs (requires VLM server)
make safety     # red-team suite
make demo       # local demo against Playwright + local VLM

Useful scripts and tools:

scripts/smoke.py: end-to-end smoke test against Wikipedia using a hand-written plan. No VLM server needed; uses FixedPlanner + the noop VLM. Useful for verifying that Playwright + the safety filter + the AT-tree router are wired up before standing up a real VLM.
careful-browse safety-check <trajectory.json> --replay: re-evaluates a recorded trajectory against the current filter and reports any divergence.
python -m bench.tools.atlas <run_dir>/: produces the failure-atlas JSON from a benchmark run.
python -m bench.tools.safety_metrics --red-team-only: per-category true-deny / true-allow counts from the red-team suite.
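
The idea behind replaying a trajectory can be sketched in a few lines. This is an illustration only: the trajectory schema (a list of steps, each with "action" and "verdict" keys), the `find_divergences` name, and the filter signature are assumptions, not the format that careful-browse writes.

```python
from typing import Callable


def find_divergences(trajectory: list, current_filter: Callable[[dict], str]) -> list:
    """Re-evaluate each recorded action and report verdicts that changed."""
    divergences = []
    for index, step in enumerate(trajectory):
        current = current_filter(step["action"])
        if current != step["verdict"]:
            divergences.append(
                {"step": index, "recorded": step["verdict"], "current": current}
            )
    return divergences


# Example: a filter that has since started denying typing into password fields.
def example_filter(action: dict) -> str:
    if action.get("kind") == "type" and action.get("input_type") == "password":
        return "deny"
    return "allow"


recorded = [
    {"action": {"kind": "click", "target": "Search"}, "verdict": "allow"},
    {"action": {"kind": "type", "input_type": "password"}, "verdict": "allow"},
]
report = find_divergences(recorded, example_filter)
```

Here `report` contains one entry, for step 1, whose recorded verdict was "allow" and whose current verdict is "deny". An empty report means the current filter agrees with every recorded verdict.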

Serving a local VLM: docs/serving-vlm.md.

Contributing

See CONTRIBUTING.md and CODE_OF_CONDUCT.md. Safety-filter bypass reports get priority; see SECURITY.md.

License

Apache-2.0. See LICENSE.
