Summary
Add desktop control capabilities to GAIA via a `DesktopToolsMixin`, enabling the agent to observe the screen via screenshots, analyze them with a VLM, and perform mouse/keyboard actions. Together these form a full Computer Use Agent (CUA) loop.
CUA Loop
1. Agent takes screenshot → PNG base64
2. VLM (Qwen3-VL-4B) analyzes screenshot → identifies UI elements
3. Agent decides action → `desktop_click(x, y)` or `desktop_type("hello")`
4. Repeat until task complete
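The four steps above can be sketched as a small driver loop. This is a minimal illustration, not the proposed implementation: the `Action` type, the `run_cua_loop` name, the `max_steps` cap, and the four injected callables are all assumptions made so the loop can be shown without a real screen or model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; kind is "click", "type", or "done"."""
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""


def run_cua_loop(
    take_screenshot: Callable[[], str],   # step 1: returns PNG as base64
    analyze: Callable[[str], str],        # step 2: VLM describes the screen
    decide: Callable[[str], Action],      # step 3: agent picks the next action
    execute: Callable[[Action], None],    # performs the click or keystrokes
    max_steps: int = 20,                  # assumed safety cap, not in the proposal
) -> List[Action]:
    """Run observe -> analyze -> act until the agent reports it is done."""
    executed: List[Action] = []
    for _ in range(max_steps):
        screen_state = analyze(take_screenshot())
        action = decide(screen_state)
        if action.kind == "done":         # step 4: stop when the task is complete
            return executed
        execute(action)
        executed.append(action)
    raise RuntimeError(f"task not complete after {max_steps} steps")
```

Passing the four stages in as callables keeps the loop testable with stubs; in the real agent they would presumably be bound to the desktop tools and the VLM client.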
MVP Tools (5 core)
| Tool | Description |
| --- | --- |
| `desktop_screenshot()` | Capture primary monitor as PNG, return base64 |
| `desktop_click(x, y)` | Click at screen coordinates |
| `desktop_type(text)` | Type text via keyboard |
| `desktop_press_key(key)` | Press a single key (Enter, Tab, Escape, etc.) |
| `desktop_scroll(direction, amount)` | Scroll mouse wheel up/down |
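One possible shape for the five core tools, written as plain functions over `mss` and `pyautogui`. This is a sketch under assumptions: the helper names `encode_png` and `scroll_clicks` are hypothetical, the libraries are imported lazily so the module loads without the optional `desktop` extra, and treating `monitors[1]` as the primary monitor is an assumption about the display layout.

```python
import base64


def encode_png(png_bytes: bytes) -> str:
    """Base64-encode PNG bytes as ASCII text."""
    return base64.b64encode(png_bytes).decode("ascii")


def scroll_clicks(direction: str, amount: int) -> int:
    """Map ("up" | "down", amount) to the signed value pyautogui.scroll expects."""
    if direction not in ("up", "down"):
        raise ValueError(f"direction must be 'up' or 'down', got {direction!r}")
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return amount if direction == "up" else -amount


def desktop_screenshot() -> str:
    import mss          # lazy import: only needed when desktop control is used
    import mss.tools
    with mss.mss() as sct:
        # monitors[0] is the union of all displays; monitors[1] is the first
        # individual monitor, assumed here to be the primary one.
        shot = sct.grab(sct.monitors[1])
        return encode_png(mss.tools.to_png(shot.rgb, shot.size))


def desktop_click(x: int, y: int) -> None:
    import pyautogui
    pyautogui.click(x=x, y=y)


def desktop_type(text: str) -> None:
    import pyautogui
    pyautogui.write(text, interval=0.02)  # small delay between keystrokes


def desktop_press_key(key: str) -> None:
    import pyautogui
    pyautogui.press(key)  # valid names are listed in pyautogui.KEYBOARD_KEYS


def desktop_scroll(direction: str, amount: int) -> None:
    import pyautogui
    pyautogui.scroll(scroll_clicks(direction, amount))  # positive scrolls up
```

Keeping the encoding and direction logic in pure helpers means those parts can be unit-tested without a display attached.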
Optional MVP+ Tools
- `desktop_drag(x1, y1, x2, y2)` — mouse drag
- `desktop_get_active_window()` — return window title
- `desktop_move_mouse(x, y)` — move without clicking
- `desktop_analyze_screen(task)` — combined screenshot + VLM analysis
Architecture
src/gaia/agents/chat/tools/desktop_tools.py (NEW — DesktopToolsMixin)
├── Uses pyautogui for mouse/keyboard
├── Uses mss for fast screenshot capture
├── Integrates VLMClient for screenshot analysis
├── Security policy enforcement
└── Rate limiting on actions
src/gaia/agents/chat/agent.py (MODIFY)
├── Add DesktopToolsMixin (opt-in via config)
└── enable_desktop_control: bool = False
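The opt-in wiring could look roughly like the following. Only `DesktopToolsMixin` and `enable_desktop_control` come from the proposal; `SketchConfig`, `SketchChatAgent`, `desktop_tool_names`, and the list-of-names registry are hypothetical stand-ins, since GAIA's actual agent class and tool-registration API are not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchConfig:
    # Hypothetical config object; only the flag name is from the proposal.
    enable_desktop_control: bool = False


class DesktopToolsMixin:
    # Names of the five MVP tools listed in the proposal.
    DESKTOP_TOOL_NAMES = (
        "desktop_screenshot",
        "desktop_click",
        "desktop_type",
        "desktop_press_key",
        "desktop_scroll",
    )

    def desktop_tool_names(self) -> List[str]:
        return list(self.DESKTOP_TOOL_NAMES)


class SketchChatAgent(DesktopToolsMixin):
    """Stand-in for the chat agent; exposes desktop tools only when enabled."""

    def __init__(self, config: SketchConfig) -> None:
        self.config = config
        self.tool_names: List[str] = []
        if config.enable_desktop_control:
            self.tool_names.extend(self.desktop_tool_names())
```

With the default config the agent exposes no desktop tools, which matches the opt-in intent: the mixin is always in the class hierarchy, but nothing is registered unless the flag is set.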
Dependencies
"desktop": [
"pyautogui>=0.9.53", # Mouse/keyboard control (cross-platform, ~100KB)
"mss>=6.1.0", # Fast screenshot capture (~50KB)
]
`pillow` is already in the base dependencies for image processing.
VLM Integration
- Uses existing `VLMClient` (`src/gaia/llm/vlm_client.py`)
- Model: `Qwen3-VL-4B-Instruct-GGUF` (already supported)
- Screenshot → base64 → VLM prompt asking to identify elements
- Returns structured description of screen state
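A sketch of the screenshot-to-description step. The `VLMClient` interface is not shown in this issue, so the model call is injected as a plain callable `vlm_call(prompt, image_b64)`. That signature, the prompt wording, and the JSON element format (`label`, `x`, `y`) are all assumptions made for illustration.

```python
import base64
import json
from typing import Any, Callable, Dict, List

# Assumed prompt; the real wording would be tuned for the chosen model.
ELEMENT_PROMPT = (
    "Identify the visible UI elements in this screenshot. Respond with a JSON "
    "array of objects with keys: label, x, y."
)


def analyze_screenshot(
    png_bytes: bytes,
    vlm_call: Callable[[str, str], str],  # hypothetical: (prompt, image_b64) -> text
) -> List[Dict[str, Any]]:
    """Send a base64 screenshot to the VLM and parse a list of UI elements."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    raw = vlm_call(ELEMENT_PROMPT, image_b64)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # model returned free text instead of JSON
    if not isinstance(parsed, list):
        return []
    # Keep only well-formed entries so callers can rely on the keys.
    required = {"label", "x", "y"}
    return [e for e in parsed if isinstance(e, dict) and required <= e.keys()]
```

Returning an empty list on malformed output lets the agent simply take another screenshot and retry rather than fail mid-task.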
Security (see separate issue for full policy)
- Opt-in only: `GAIA_DESKTOP_CONTROL_ENABLED=false` by default
- CLI flag: `gaia chat --enable-desktop-control`
- Warning banner on activation
- App allowlist/blocklist
- Rate limiting (max 60 clicks/min)
- No keyboard input containing credential keywords
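Two of these checks, the rate limit and the credential filter, are easy to sketch with the standard library. The full policy lives in the separate issue, so treat this as illustrative only: the sliding-window approach, the `ActionRateLimiter` and `contains_credential_keyword` names, and the keyword list are assumptions, not part of the proposal.

```python
import time
from collections import deque
from typing import Callable, Deque


class ActionRateLimiter:
    """Sliding-window limiter: at most max_actions per window_seconds."""

    def __init__(
        self,
        max_actions: int = 60,             # matches "max 60 clicks/min"
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_actions
        self._window = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def allow(self) -> bool:
        """Record and permit the action if the window has room, else refuse."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._max:
            return False
        self._stamps.append(now)
        return True


# Assumed keyword list; the actual policy may differ.
CREDENTIAL_KEYWORDS = ("password", "passwd", "secret", "api_key", "token")


def contains_credential_keyword(text: str) -> bool:
    """Case-insensitive substring check against the keyword list."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in CREDENTIAL_KEYWORDS)
```

A tool such as `desktop_type` would call `contains_credential_keyword(text)` before typing, and every action would go through `allow()` first. A substring match is deliberately coarse and will produce false positives (for example, "tokenizer").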
Files to Create/Modify
- `src/gaia/agents/chat/tools/desktop_tools.py` (NEW, ~350 lines)
- `src/gaia/agents/chat/agent.py` (MODIFY, +20 lines)
- `src/gaia/agents/chat/tools/__init__.py` (MODIFY, +2 lines)
- `setup.py` (MODIFY, +1 extra)
- `tests/unit/chat/test_desktop_tools.py` (NEW, ~250 lines)
- `docs/guides/desktop-control.mdx` (NEW, ~500 lines)
- `docs/docs.json` (MODIFY, +1 nav entry)
Acceptance Criteria