Implement BrowserToolsMixin with Playwright for web browsing agent #458

@kovtcharov

Summary

Add browser automation capabilities to the ChatAgent via a new `BrowserToolsMixin`, enabling the agent to navigate websites, interact with page elements, and extract content — all within a sandboxed browser environment.

Why Browser First (Before Desktop CUA)

  • Sandboxed — browser actions can't damage the OS
  • No credential risk — doesn't capture desktop passwords/emails
  • Structured DOM — accessibility tree snapshots are more reliable than coordinate-based clicking
  • Cross-platform — identical behavior on Windows/Linux/Mac
  • High user demand — "research this", "fill out this form", "check this page"

MVP Tools (5 core)

| Tool | Description |
| --- | --- |
| `browse_navigate(url)` | Navigate to a URL, return page title and status |
| `browse_snapshot()` | Get accessibility tree or screenshot of current page |
| `browse_click(selector)` | Click an element by CSS selector or accessibility ref |
| `browse_type(selector, text)` | Type text into an input field |
| `browse_extract_content()` | Extract visible text content from current page |
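
As a rough illustration of the tool surface above, here is a minimal sketch of how four of the tools might delegate to a Playwright page. It assumes the sync API's `Page` methods `goto`, `title`, `click`, `fill`, and `inner_text`; the class name `BrowserToolsSketch`, the `_page` attribute, and the returned dictionaries are hypothetical. `browse_snapshot()` is left out because the snapshot format is not specified in this issue.

```python
from typing import Any, Dict


class BrowserToolsSketch:
    """Hypothetical mixin body: each tool delegates to a Playwright-style page."""

    _page: Any  # assumed to be a playwright.sync_api.Page, assigned elsewhere

    def browse_navigate(self, url: str) -> Dict[str, Any]:
        # Page.goto returns a Response, or None for some navigations
        response = self._page.goto(url)
        status = response.status if response is not None else None
        return {"title": self._page.title(), "status": status}

    def browse_click(self, selector: str) -> Dict[str, Any]:
        self._page.click(selector)
        return {"clicked": selector}

    def browse_type(self, selector: str, text: str) -> Dict[str, Any]:
        # fill() replaces the field's current value with `text`
        self._page.fill(selector, text)
        return {"selector": selector, "typed": len(text)}

    def browse_extract_content(self) -> Dict[str, Any]:
        # inner_text("body") is one approximation of "visible text"
        return {"text": self._page.inner_text("body")}
```

Because the tools only touch `self._page`, a unit test can swap in a fake page object without launching a browser.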

Optional MVP+ Tools

  • `browse_scroll(direction)` — scroll page up/down
  • `browse_back()` — navigate back
  • `browse_screenshot()` — capture page as PNG for VLM analysis
  • `browse_evaluate(js)` — execute JavaScript (with security constraints)

Architecture

src/gaia/agents/chat/tools/browser_tools.py  (NEW — BrowserToolsMixin)
├── Uses Playwright (headless Chromium) under the hood
├── Persistent browser context per agent session
├── Accessibility tree parsing for element identification
└── Security: URL allowlist, no file:// protocol, rate limiting

src/gaia/agents/chat/agent.py  (MODIFY — register mixin)
├── Add BrowserToolsMixin to ChatAgent class
├── Enable via config: enable_browser=True
└── Lazy browser initialization (only when tools first used)
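
The lazy initialization and cleanup described above could look roughly like the following. This is a sketch under assumptions: `LazyBrowser` and the injected `launch` factory are hypothetical names, and the returned object is only assumed to expose `close()`.

```python
from typing import Any, Callable, Optional


class LazyBrowser:
    """Creates the browser on first use and closes it at most once."""

    def __init__(self, launch: Callable[[], Any]) -> None:
        self._launch = launch  # hypothetical factory that starts the browser
        self._browser: Optional[Any] = None

    @property
    def started(self) -> bool:
        return self._browser is not None

    def get(self) -> Any:
        # Launch only when a browser tool is first called
        if self._browser is None:
            self._browser = self._launch()
        return self._browser

    def close(self) -> None:
        # Safe to call repeatedly, e.g. from session teardown
        if self._browser is not None:
            self._browser.close()
            self._browser = None
```

With Playwright, the factory might wrap `sync_playwright().start().chromium.launch(headless=True)`; the Playwright driver object would then also need to be stopped on teardown, which this sketch does not cover.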

Security Policy

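from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BrowserSecurityPolicy:
    allowed_domains: Optional[List[str]] = None   # None = all domains allowed
    blocked_domains: Optional[List[str]] = None   # Blocked even if also in allowed_domains
    # A bare list default raises ValueError in a dataclass, so use default_factory
    blocked_protocols: List[str] = field(default_factory=lambda: ["file", "ftp", "data"])
    max_pages_per_session: int = 50
    navigation_timeout_ms: int = 30000
    allow_javascript: bool = True
    allow_downloads: bool = False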
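
One way the policy fields might be enforced before navigation is sketched below. The function `is_url_allowed`, the helper `_matches`, the subdomain-matching rule, and the rejection of host-less URLs are all assumptions for illustration; only the field semantics (`None` means all allowed, the blocklist wins over the allowlist) come from the policy above.

```python
from typing import Optional, Sequence
from urllib.parse import urlsplit


def _matches(host: str, domain: str) -> bool:
    # "example.com" matches itself and subdomains such as "www.example.com"
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def is_url_allowed(
    url: str,
    allowed_domains: Optional[Sequence[str]] = None,
    blocked_domains: Optional[Sequence[str]] = None,
    blocked_protocols: Sequence[str] = ("file", "ftp", "data"),
) -> bool:
    parts = urlsplit(url)
    if parts.scheme.lower() in blocked_protocols:
        return False
    host = (parts.hostname or "").lower()
    if not host:
        return False  # assumption: URLs without a host are rejected
    if any(_matches(host, d) for d in blocked_domains or ()):
        return False  # the blocklist wins over the allowlist
    if allowed_domains is None:
        return True  # None means every non-blocked domain is allowed
    return any(_matches(host, d) for d in allowed_domains)
```

Matching on the parsed hostname rather than on the raw string avoids treating `notexample.com` as a match for `example.com`.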

Dependencies

Add to `setup.py`:

"browser": ["playwright>=1.40.0"]

Post-install: `playwright install chromium` (~150MB one-time download)
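
Since Playwright would be an optional extra, the mixin may want to fail with a readable message when it is missing. A minimal sketch, assuming a hypothetical helper named `ensure_module` that is called before the first browser launch:

```python
import importlib.util


def ensure_module(module_name: str, hint: str) -> None:
    """Raise a readable error when an optional top-level module is missing."""
    # find_spec returns None when a top-level module cannot be found
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Optional dependency '{module_name}' is missing. {hint}"
        )
```

A call such as `ensure_module("playwright", "Install the 'browser' extra, then run: playwright install chromium")` would surface both setup steps in one message.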

Integration with Chat UI

  • SSE events already handle `tool_start`/`tool_end`/`tool_result` — no backend changes needed
  • Frontend: optionally render page screenshots inline in chat
  • Agent activity panel shows browsing steps naturally

Files to Create/Modify

  • `src/gaia/agents/chat/tools/browser_tools.py` (NEW, ~400 lines)
  • `src/gaia/agents/chat/agent.py` (MODIFY, +15 lines)
  • `src/gaia/agents/chat/tools/__init__.py` (MODIFY, +2 lines)
  • `setup.py` (MODIFY, +1 extra)
  • `tests/unit/chat/test_browser_tools.py` (NEW, ~300 lines)
  • `docs/guides/browser.mdx` (NEW, ~400 lines)
  • `docs/docs.json` (MODIFY, +1 nav entry)

Acceptance Criteria

  • Agent can navigate to a URL and report page content
  • Agent can click elements and fill forms
  • Multi-step web task works end-to-end (e.g., search → click result → extract info)
  • Security policy enforced (blocked domains, protocols, rate limits)
  • Browser cleaned up on session end (no leaked processes)
  • Unit tests with mocked browser
  • Documentation with examples

Metadata

Labels

  • `agent`
  • `browser-use`: Browser automation and control features
  • `chat`: Chat SDK changes
  • `domain:agent-core`: Framework, tools, registry, memory, skills, orchestration
  • `enhancement`: New feature or request
  • `p1`: medium priority
  • `track:consumer-app`: Hermes-competitor consumer product (mobile-first, voice + messaging + memory + skills)
