Read what's on screen without taking a screenshot.
ScreenRead gives AI agents access to the macOS accessibility tree — the same structured data that powers VoiceOver and other screen readers. Instead of capturing pixels and feeding them through vision models, your agent gets instant, structured text describing every UI element on screen.
~100ms instead of 1-3 seconds. Zero hallucination — it reads what the OS knows, not what a model thinks it sees.
Most AI agent tooling uses screenshots to "see" the screen:
- Capture PNG (~200ms)
- Base64 encode and transfer (~500KB-2MB)
- Vision model processes pixels (expensive, slow)
- Model describes what it thinks it sees (sometimes wrong)
But ~90% of agent tasks are text-based: "what does the error say?", "is this button visible?", "what's the page title?". Screenshots are overkill.
ScreenRead skips all of that. It asks macOS directly: "What UI elements exist in this window?" and returns structured text instantly.
|  | Screenshots | ScreenRead |
|---|---|---|
| Speed | 1-3 seconds | ~100ms |
| Token cost | High (vision model) | Low (text) |
| Accuracy | Can hallucinate text | Exact (reads from OS) |
| Scope | Web only (Playwright) or full screen | Any macOS app |
| Good for | Visual checks (layout, colors) | Content verification, UI state |
Use ScreenRead for the 90% of tasks that are about content and structure. Keep screenshots for the 10% that need pixels.
```sh
git clone https://github.com/Bambushu/screenread.git
cd screenread
swift build -c release
cp .build/release/screenread ~/.local/bin/
cp .build/release/screenread-mcp ~/.local/bin/
```

Requires:
- macOS 13+ (Ventura or later)
- Accessibility permission (System Settings > Privacy & Security > Accessibility)
```sh
# Read the frontmost app
screenread

# Read a specific app
screenread --app Safari

# Fuzzy match a window title
screenread --window "inbox"

# Text only (no structure)
screenread --app Warp --text-only

# Shallow read (depth 2)
screenread --app Finder --shallow

# Full text, no truncation
screenread --app Terminal --full

# JSON output
screenread --app Safari --json

# Search for text across all open windows
screenread --find "error"
screenread --find "Submit"

# List all open windows
screenread --list

# Filter by role
screenread --app Safari --role AXButton,AXLink

# Exclude roles
screenread --app Safari --ignore AXGroup,AXScrollArea

# List interactive elements with click coordinates
screenread --clickable --app Safari
screenread --clickable --app Finder --json

# Watch for UI changes (poll every 2s, Ctrl+C to stop)
screenread --watch --app Safari
screenread --watch --app Safari --interval 5

# Stream JSONL (one JSON object per node per line)
screenread --stream --app Safari
screenread --stream --app Safari | jq 'select(.role == "AXButton")'
```

Add to your MCP config:
Claude Code (project-scoped `.mcp.json` in your project root):
```json
{
  "mcpServers": {
    "screenread": {
      "command": "screenread-mcp"
    }
  }
}
```

Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "screenread": {
      "command": "/path/to/screenread-mcp"
    }
  }
}
```

This exposes five tools to any MCP-compatible client:
`screenread_snapshot`: reads the accessibility tree of a specific app or window. With no parameters, it reads the frontmost (active) app.
| Parameter | Type | Description |
|---|---|---|
| `app` | string | App name (e.g. "Safari") |
| `window` | string | Fuzzy match on window title |
| `pid` | integer | Target by process ID |
| `depth` | integer | Max tree depth (default: 5); use 0 for unlimited, which may be slow on large apps |
| `textOnly` | boolean | Text only, no structure |
| `roles` | string | Comma-separated AX roles to include (e.g. "AXButton,AXLink") |
| `ignore` | string | Comma-separated AX roles to exclude (e.g. "AXGroup,AXScrollArea") |
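As an illustration, an MCP client would invoke this tool with a standard JSON-RPC `tools/call` request over the server's framed stdio transport (the request shape follows the MCP specification; the argument values here are arbitrary):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "screenread_snapshot",
    "arguments": { "app": "Safari", "depth": 3, "roles": "AXButton,AXLink" }
  }
}
```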
List all open windows. Returns one line per window in the format: AppName [PID] — Window Title. No parameters.
Search for visible text across all open windows. Plain substring match (no regex).
| Parameter | Type | Description |
|---|---|---|
| `query` | string | Plain-text substring to search for (required) |
| `caseSensitive` | boolean | Case-sensitive matching (default: false) |
Results are capped at 100 matches. Use `screenread_snapshot` with a specific app for more targeted results.
List interactive elements (buttons, links, text fields) with their click coordinates.
| Parameter | Type | Description |
|---|---|---|
| `app` | string | App name (e.g. "Safari") |
| `window` | string | Fuzzy match on window title |
| `pid` | integer | Target by process ID |
| `roles` | string | Override the default interactive roles (e.g. "AXButton,AXLink") |
Returns a table with role, label, center x/y coordinates, and state (enabled/disabled/focused/selected).
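The coordinates are meant for driving a separate click/automation tool. As a sketch of consuming them (the tab-separated column layout here is an assumption for illustration, not the tool's documented output format), you could pull out the x/y for a given role with awk:

```sh
# Hypothetical --clickable row: role, label, x, y, state (tab-separated).
# Extract the center coordinates of the first AXButton.
printf 'AXButton\tSubmit\t512\t384\tenabled\n' |
  awk -F'\t' '$1 == "AXButton" { print $3, $4 }'
# → 512 384
```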
Watch an app for UI changes over a duration.
| Parameter | Type | Description |
|---|---|---|
| `app` | string | App name (e.g. "Safari") |
| `window` | string | Fuzzy match on window title |
| `pid` | integer | Target by process ID |
| `duration` | integer | How long to watch, in seconds (default: 10, max: 60) |
| `interval` | integer | Poll interval in seconds (default: 2, min: 1) |
| `textOnly` | boolean | Compare text content only (default: false) |
Polls the accessibility tree at the given interval and reports additions, removals, and value/state changes.
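Conceptually, each poll is a diff between successive snapshots. A rough shell analogue of what the differ reports (the snapshot lines below are made up; the real tool compares full AX trees, not flat text files):

```sh
# Simulate two successive snapshots of an app's text content and diff them.
a=$(mktemp); b=$(mktemp)
printf 'AXButton Submit\nAXStaticText Hello\n' > "$a"
printf 'AXButton Submit\nAXStaticText Goodbye\n' > "$b"
diff "$a" "$b" || true   # "<" lines were removed, ">" lines were added
rm -f "$a" "$b"
```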
```
screenread/
├── Sources/
│   ├── ScreenReadCore/       # Shared library
│   │   ├── AXHelpers.swift       # Shared AX attribute accessors
│   │   ├── AXTreeWalker.swift    # Recursive accessibility tree traversal + streaming callback
│   │   ├── Formatter.swift       # Text tree, text-only, JSON, clickable output
│   │   ├── MCPProtocol.swift     # JSON-RPC types, tool dispatch, parameter validation
│   │   ├── StreamFormatter.swift # JSONL single-node encoder
│   │   ├── TargetResolver.swift  # App/window/PID resolution with fuzzy matching
│   │   ├── TreeDiffer.swift      # Compare two tree snapshots for changes
│   │   └── Types.swift           # AXNode, WalkResult, WindowInfo, errors
│   ├── screenread/           # CLI (uses ArgumentParser)
│   └── screenread-mcp/       # MCP server (Content-Length framed stdio)
└── Tests/
    └── ScreenReadCoreTests/  # 27 tests across 4 suites
```
The core library (ScreenReadCore) does all the work. Both the CLI and MCP server are thin wrappers around it.
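Content-Length framing means each JSON-RPC message is preceded by a `Content-Length: N` header, a blank line, then exactly N bytes of JSON. A minimal sketch of building one frame by hand (the `initialize` method is the standard MCP handshake; whether this server accepts this exact request is not verified here):

```sh
# Frame a JSON-RPC request for a Content-Length framed stdio transport.
body='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}'
len=$(printf '%s' "$body" | wc -c | tr -d ' ')   # byte length of the body
printf 'Content-Length: %s\r\n\r\n%s' "$len" "$body"
```

Piping framed bytes like these into `screenread-mcp` and reading the framed response back is, in principle, all an MCP client needs to do.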
macOS only. ScreenRead uses Apple's AXUIElement accessibility API, which has no equivalent on other platforms. Linux would need AT-SPI, Windows needs UI Automation — fundamentally different APIs.
MIT