Summary
Add browser automation capabilities to the ChatAgent via a new `BrowserToolsMixin`, enabling the agent to navigate websites, interact with page elements, and extract content — all within a sandboxed browser environment.
Why Browser First (Before Desktop CUA)
- Sandboxed — browser actions can't damage the OS
- No credential risk — doesn't capture desktop passwords/emails
- Structured DOM — accessibility tree snapshots are more reliable than coordinate-based clicking
- Cross-platform — identical behavior on Windows/Linux/Mac
- High user demand — "research this", "fill out this form", "check this page"
MVP Tools (5 core)
| Tool | Description |
| --- | --- |
| `browse_navigate(url)` | Navigate to a URL, return page title and status |
| `browse_snapshot()` | Get accessibility tree or screenshot of current page |
| `browse_click(selector)` | Click an element by CSS selector or accessibility ref |
| `browse_type(selector, text)` | Type text into an input field |
| `browse_extract_content()` | Extract visible text content from current page |
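
As a rough illustration of how three of these tools might wrap a Playwright page, here is a minimal sketch. It assumes the sync API, where `page.goto`, `page.title`, `page.fill`, and `page.inner_text` exist; the return shapes and the `_StubPage` stand-in are hypothetical and only there so the example runs without a browser.

```python
from typing import Any, Dict, Optional


def browse_navigate(page: Any, url: str) -> Dict[str, Any]:
    """Navigate and report title and HTTP status (return shape is an assumption)."""
    response = page.goto(url)  # Playwright may return None instead of a Response
    status: Optional[int] = response.status if response is not None else None
    return {"url": url, "title": page.title(), "status": status}


def browse_type(page: Any, selector: str, text: str) -> Dict[str, Any]:
    """Fill an input field identified by a selector."""
    page.fill(selector, text)
    return {"selector": selector, "typed_chars": len(text)}


def browse_extract_content(page: Any) -> Dict[str, Any]:
    """Return the visible text of the page body."""
    return {"text": page.inner_text("body")}


class _StubResponse:
    status = 200


class _StubPage:
    """Hypothetical stand-in exposing the same method names as a Playwright page."""

    def __init__(self) -> None:
        self.fields: Dict[str, str] = {}

    def goto(self, url: str) -> _StubResponse:
        self.url = url
        return _StubResponse()

    def title(self) -> str:
        return "Example Domain"

    def fill(self, selector: str, text: str) -> None:
        self.fields[selector] = text

    def inner_text(self, selector: str) -> str:
        return "Example body text"


page = _StubPage()
nav_result = browse_navigate(page, "https://example.com")
type_result = browse_type(page, "#q", "gaia")
content_result = browse_extract_content(page)
```

Passing the page in explicitly keeps each tool a thin, testable wrapper; in the actual mixin the page would presumably come from the per-session browser context instead.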
Optional MVP+ Tools
- `browse_scroll(direction)` — scroll page up/down
- `browse_back()` — navigate back
- `browse_screenshot()` — capture page as PNG for VLM analysis
- `browse_evaluate(js)` — execute JavaScript (with security constraints)
Architecture
src/gaia/agents/chat/tools/browser_tools.py (NEW — BrowserToolsMixin)
├── Uses Playwright (headless Chromium) under the hood
├── Persistent browser context per agent session
├── Accessibility tree parsing for element identification
└── Security: URL allowlist, no file:// protocol, rate limiting
src/gaia/agents/chat/agent.py (MODIFY — register mixin)
├── Add BrowserToolsMixin to ChatAgent class
├── Enable via config: enable_browser=True
└── Lazy browser initialization (only when tools first used)
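
The lazy-initialization idea above could look roughly like the sketch below. The names `LazyBrowserMixin`, `_browser_factory`, and `_get_browser` are hypothetical, as is the choice to raise when `enable_browser` is false; only `sync_playwright().start()` and `chromium.launch(headless=True)` are real Playwright calls. Cleanup (`browser.close()`, `playwright.stop()`) is left out for brevity.

```python
from typing import Any


def launch_chromium() -> Any:
    """Start headless Chromium; Playwright is imported only when this runs."""
    from playwright.sync_api import sync_playwright

    playwright = sync_playwright().start()
    return playwright.chromium.launch(headless=True)


class LazyBrowserMixin:
    """Hypothetical sketch: create the browser on first tool use, then reuse it."""

    enable_browser: bool = False
    _browser_factory = staticmethod(launch_chromium)
    _browser: Any = None

    def _get_browser(self) -> Any:
        if not self.enable_browser:
            raise RuntimeError("browser tools are disabled; set enable_browser=True")
        if self._browser is None:
            # First use: pay the startup cost once and cache the instance.
            self._browser = self._browser_factory()
        return self._browser


# Usage with a stand-in factory so the sketch runs without Playwright installed.
launches = []


class DemoAgent(LazyBrowserMixin):
    enable_browser = True
    _browser_factory = staticmethod(lambda: launches.append("launch") or "fake-browser")


agent = DemoAgent()
first = agent._get_browser()
second = agent._get_browser()
```

Deferring the import as well as the launch means agents that never browse do not need Playwright installed at all.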
Security Policy
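from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BrowserSecurityPolicy:
    allowed_domains: Optional[List[str]] = None  # None = all domains allowed
    blocked_domains: Optional[List[str]] = None  # Blocked even if also in allowed_domains
    # A bare list default raises ValueError in a dataclass; use default_factory
    blocked_protocols: List[str] = field(default_factory=lambda: ["file", "ftp", "data"])
    max_pages_per_session: int = 50
    navigation_timeout_ms: int = 30000
    allow_javascript: bool = True
    allow_downloads: bool = False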
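
One way the domain and protocol fields might be enforced before navigation is sketched below. `is_url_allowed` and `_matches` are hypothetical helpers, and treating subdomains as covered by their parent domain is an assumption, not something the policy above specifies. The dataclass here is a trimmed copy holding only the fields the check needs.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class BrowserSecurityPolicy:
    # Trimmed copy of the policy: only the fields used for URL checks.
    allowed_domains: Optional[List[str]] = None
    blocked_domains: Optional[List[str]] = None
    blocked_protocols: List[str] = field(default_factory=lambda: ["file", "ftp", "data"])


def _matches(host: str, domain: str) -> bool:
    """True if host equals domain or is a subdomain of it (assumed semantics)."""
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def is_url_allowed(url: str, policy: BrowserSecurityPolicy) -> bool:
    """Hypothetical helper: decide whether navigating to url is permitted."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    if not scheme or scheme in policy.blocked_protocols:
        return False
    host = (parsed.hostname or "").lower()
    if not host:
        return False
    # The blocklist wins over the allowlist.
    if any(_matches(host, d) for d in policy.blocked_domains or []):
        return False
    if policy.allowed_domains is None:
        return True
    return any(_matches(host, d) for d in policy.allowed_domains)


policy = BrowserSecurityPolicy(
    allowed_domains=["example.com"],
    blocked_domains=["ads.example.com"],
)
```

Checking the blocklist first mirrors the "blocked even if in allowed" comment in the policy; URLs without a scheme or host are rejected rather than guessed at.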
Dependencies
Add to `setup.py`:
"browser": ["playwright>=1.40.0"]
Post-install: `playwright install chromium` (~150MB one-time download)
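
Because the dependency is optional, the tools would likely need a friendly failure when it is missing. A minimal sketch, with hypothetical helper names and error wording:

```python
import importlib.util


def browser_extra_installed() -> bool:
    """Check for the playwright package without importing it."""
    return importlib.util.find_spec("playwright") is not None


def require_browser_extra() -> None:
    """Raise a descriptive error if the optional dependency is absent."""
    if not browser_extra_installed():
        raise ImportError(
            "Browser tools need the optional 'browser' extra (playwright>=1.40.0). "
            "After installing it, run `playwright install chromium`."
        )
```

This only confirms the Python package is present; whether the Chromium binary has been downloaded would still surface at launch time.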
Integration with Chat UI
- SSE events already handle `tool_start`/`tool_end`/`tool_result` — no backend changes needed
- Frontend: optionally render page screenshots inline in chat
- Agent activity panel shows browsing steps naturally
Files to Create/Modify
- `src/gaia/agents/chat/tools/browser_tools.py` (NEW, ~400 lines)
- `src/gaia/agents/chat/agent.py` (MODIFY, +15 lines)
- `src/gaia/agents/chat/tools/__init__.py` (MODIFY, +2 lines)
- `setup.py` (MODIFY, +1 extra)
- `tests/unit/chat/test_browser_tools.py` (NEW, ~300 lines)
- `docs/guides/browser.mdx` (NEW, ~400 lines)
- `docs/docs.json` (MODIFY, +1 nav entry)
Acceptance Criteria