
Implement DesktopToolsMixin for screenshot, mouse, and keyboard control (CUA) #460

@kovtcharov

Description


Summary

Add desktop control capabilities to GAIA via a `DesktopToolsMixin`, enabling the agent to observe the screen via screenshots, analyze them with VLM, and perform mouse/keyboard actions — creating a full Computer Use Agent (CUA) loop.

CUA Loop

1. Agent takes screenshot → PNG base64
2. VLM (Qwen3-VL-4B) analyzes screenshot → identifies UI elements
3. Agent decides action → click(x, y) or type_text("hello")
4. Repeat until task complete
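The four steps above can be sketched as a plain Python loop. The tool and VLM function names below are illustrative stubs, not the real GAIA API; the VLM responses are scripted to stand in for Qwen3-VL-4B output:

```python
import base64

def desktop_screenshot() -> str:
    # Stub: the real tool would capture the primary monitor via mss.
    return base64.b64encode(b"fake-png-bytes").decode("ascii")

# Scripted decisions standing in for VLM analysis of each screenshot.
_scripted = iter([
    {"action": "click", "x": 120, "y": 340},
    {"action": "type", "text": "hello"},
    {"action": "done"},
])

def analyze_with_vlm(screenshot_b64: str, task: str) -> dict:
    return next(_scripted)

def cua_loop(task: str, max_steps: int = 10) -> list[dict]:
    """Screenshot -> VLM -> action, repeated until the VLM reports 'done'."""
    actions = []
    for _ in range(max_steps):
        shot = desktop_screenshot()
        decision = analyze_with_vlm(shot, task)
        if decision["action"] == "done":
            break
        actions.append(decision)  # a real agent would dispatch to pyautogui here
    return actions

history = cua_loop("open the settings dialog")
```

The `max_steps` cap is a safety backstop so a confused VLM cannot loop forever.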

MVP Tools (5 core)

| Tool | Description |
| --- | --- |
| `desktop_screenshot()` | Capture primary monitor as PNG, return base64 |
| `desktop_click(x, y)` | Click at screen coordinates |
| `desktop_type(text)` | Type text via keyboard |
| `desktop_press_key(key)` | Press a single key (Enter, Tab, Escape, etc.) |
| `desktop_scroll(direction, amount)` | Scroll mouse wheel up/down |
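A rough shape of the mixin surface, with the five MVP tool signatures from the table. The bodies here only log the requested action (useful for tests and rate limiting); the real implementation would dispatch to pyautogui for input and mss for capture:

```python
class DesktopToolsMixin:
    """Illustrative sketch of the MVP tool surface, not the final API."""

    def __init__(self):
        self.action_log: list[tuple] = []  # audit trail for rate limiting/tests

    def desktop_screenshot(self) -> str:
        self.action_log.append(("screenshot",))
        return ""  # base64 PNG in the real tool

    def desktop_click(self, x: int, y: int) -> None:
        self.action_log.append(("click", x, y))

    def desktop_type(self, text: str) -> None:
        self.action_log.append(("type", text))

    def desktop_press_key(self, key: str) -> None:
        self.action_log.append(("press_key", key))

    def desktop_scroll(self, direction: str, amount: int = 3) -> None:
        if direction not in ("up", "down"):
            raise ValueError(f"unknown scroll direction: {direction!r}")
        self.action_log.append(("scroll", direction, amount))

agent = DesktopToolsMixin()
agent.desktop_click(100, 200)
agent.desktop_type("hello")
```

Logging every action in one place also gives the SSE event stream a single hook point.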

Optional MVP+ Tools

  • `desktop_drag(x1, y1, x2, y2)` — mouse drag
  • `desktop_get_active_window()` — return window title
  • `desktop_move_mouse(x, y)` — move without clicking
  • `desktop_analyze_screen(task)` — combined screenshot + VLM analysis

Architecture

src/gaia/agents/chat/tools/desktop_tools.py  (NEW — DesktopToolsMixin)
├── Uses pyautogui for mouse/keyboard
├── Uses mss for fast screenshot capture
├── Integrates VLMClient for screenshot analysis
├── Security policy enforcement
└── Rate limiting on actions

src/gaia/agents/chat/agent.py  (MODIFY)
├── Add DesktopToolsMixin (opt-in via config)
└── enable_desktop_control: bool = False
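One possible shape for the opt-in wiring, assuming the policy is that both the config flag and the `GAIA_DESKTOP_CONTROL_ENABLED` environment variable must agree before the mixin activates (the config class name here is hypothetical):

```python
import os
from dataclasses import dataclass

@dataclass
class ChatAgentConfig:
    # Disabled by default, mirroring GAIA_DESKTOP_CONTROL_ENABLED=false
    enable_desktop_control: bool = False

def desktop_control_enabled(cfg: ChatAgentConfig) -> bool:
    """Opt-in requires both the config flag and the env var (assumed policy)."""
    env = os.environ.get("GAIA_DESKTOP_CONTROL_ENABLED", "false").lower()
    return cfg.enable_desktop_control and env in ("1", "true", "yes")
```

Requiring both signals means a stale config file alone cannot silently re-enable desktop control.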

Dependencies

"desktop": [
    "pyautogui>=0.9.53",   # Mouse/keyboard control (cross-platform, ~100KB)
    "mss>=6.1.0",          # Fast screenshot capture (~50KB)
]

`pillow` is already included in the base dependencies for image processing.

VLM Integration

  • Uses existing `VLMClient` (`src/gaia/llm/vlm_client.py`)
  • Model: `Qwen3-VL-4B-Instruct-GGUF` (already supported)
  • Screenshot → base64 → VLM prompt asking to identify elements
  • Returns structured description of screen state
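The screenshot-to-prompt step might look like the following, assuming `VLMClient` accepts an OpenAI-style multimodal message with a base64 data URL (the request shape is an assumption, not the client's confirmed wire format):

```python
import base64

def build_vlm_request(png_bytes: bytes, task: str) -> dict:
    """Build an OpenAI-style multimodal request (assumed VLMClient format)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "Qwen3-VL-4B-Instruct-GGUF",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Identify the UI elements relevant to: {task}. "
                         "Return a structured description of the screen."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = build_vlm_request(b"fake-png", "find the Save button")
```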

Security (see separate issue for full policy)

  • Opt-in only: `GAIA_DESKTOP_CONTROL_ENABLED=false` by default
  • CLI flag: `gaia chat --enable-desktop-control`
  • Warning banner on activation
  • App allowlist/blocklist
  • Rate limiting (max 60 clicks/min)
  • No keyboard input containing credential keywords

Files to Create/Modify

  • `src/gaia/agents/chat/tools/desktop_tools.py` (NEW, ~350 lines)
  • `src/gaia/agents/chat/agent.py` (MODIFY, +20 lines)
  • `src/gaia/agents/chat/tools/__init__.py` (MODIFY, +2 lines)
  • `setup.py` (MODIFY, +1 extra)
  • `tests/unit/chat/test_desktop_tools.py` (NEW, ~250 lines)
  • `docs/guides/desktop-control.mdx` (NEW, ~500 lines)
  • `docs/docs.json` (MODIFY, +1 nav entry)

Acceptance Criteria

  • Agent can take screenshot and describe what it sees via VLM
  • Agent can click a UI element by coordinates
  • Agent can type text and press keys
  • Multi-step desktop task works (e.g., open app → click button → type text)
  • Opt-in only — disabled by default with warning
  • Security policy enforced
  • Unit tests with mocked pyautogui
  • SSE events show CUA actions in Chat UI agent activity panel
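For the "unit tests with mocked pyautogui" criterion, one pattern is to route input through an injectable backend so tests never touch the real screen. The wrapper below is hypothetical; the real tool would call `pyautogui.click` directly and tests would patch the module:

```python
from unittest import mock

def desktop_click(x: int, y: int, backend=None) -> None:
    # backend stands in for the pyautogui module in tests.
    backend.click(x, y)

def test_desktop_click_uses_backend():
    fake = mock.Mock()
    desktop_click(10, 20, backend=fake)
    fake.click.assert_called_once_with(10, 20)

test_desktop_click_uses_backend()
```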

Metadata


Labels

  • `agent`
  • `cua`: Computer Use Agent
  • `domain:multimodal`: Voice (ASR/TTS), Vision (VLM), Image gen (SD), CUA
  • `enhancement`: New feature or request
  • `p2`: low priority
  • `track:consumer-app`: Hermes-competitor consumer product — mobile-first, voice + messaging + memory + skills
