UI Perception Engine

A perception layer for AI agents — human-like understanding of web interfaces.

Fuses structural (DOM + a11y tree + CSS), visual (three-tier vision pipeline), and temporal data into a unified UI Scene Graph that an LLM can reason over fluently. Exposed as an MCP server so Claude can navigate and inspect any live URL.

What It Does

Navigates to any URL in a real Playwright browser
Extracts a compact, LLM-readable scene graph from the live DOM
Detects UI elements visually via OmniParser V2, understands layout via Qwen3-VL, and performs deep UX analysis via Claude Vision
Tracks UI transitions and diffs between states
Predicts affordances (what you can interact with, and what happens when you do)
Exposes everything as MCP tools that Claude can call directly

MCP Tools

Tool	Description
`navigate`	Navigate to a URL and return the initial scene graph
`get_scene`	Re-capture the current scene (compact text or full JSON)
`get_affordances`	List interactive elements ranked by priority
`act`	Execute browser actions: click, type, scroll, hover, keypress, navigate, wait
`get_console_logs`	Return captured browser console messages
`get_network_errors`	Return failed network requests
`get_screenshot`	Capture a screenshot and return it as an image
`detect_elements`	Run element detection (OmniParser or Claude Vision fallback)
`analyze_visual`	Visual understanding via Qwen3-VL: hierarchy, contrast, spacing, UX
`compare_states`	Diff current vs previous scene graph — shows what changed
`watch`	Start real-time keyframe capture (CDP screencast + perceptual hashing)
`stop_watch`	Stop keyframe capture and return summary of changes

Three-tier vision pipeline (`visual=true`)

Both navigate and get_scene accept an optional visual: true parameter. When enabled, the engine runs a three-tier vision pipeline:

Tier A: OmniParser V2 — Fast element detection (~0.8s). A YOLOv8 + Florence-2 model running as a Python sidecar on port 8100. Detects buttons, inputs, images, icons, and other UI elements with bounding boxes and labels.
Tier B: Qwen3-VL via Ollama — Visual understanding (~2-4s). Analyzes the screenshot with detected element context to assess visual hierarchy, contrast issues, spacing problems, affordance clarity, and state indicators.
Tier C: Claude Vision API — Deep UX analysis (~3-5s, on-demand only). Provides detailed qualitative analysis when requested via analyze_visual or depth: 'deep'.

Graceful degradation: Each tier skips silently if its backing service is unavailable. The system works with any combination of services running — from all three tiers down to structural-only analysis with no vision services at all.

Use detect_elements for fast Tier A detection only, or analyze_visual for the full Tier A + B pipeline.

`act` action types

Type	Parameters
`click`	`x`, `y`
`clickSelector`	`selector`
`type`	`text`, `selector` (optional)
`scroll`	`direction` (`up`/`down`), `amount` (optional)
`hover`	`x`, `y`
`wait`	`ms`
`navigate`	`url`
`back`	—
`pressKey`	`key`

Installation

From npm

npm install -g ui-perception-engine

Or use without installing:

npx ui-perception-engine

Install Playwright browsers (first time only):

npx playwright install chromium

From source

git clone https://github.com/dirkknibbe/uipe.git
cd uipe
pnpm install
pnpm build

Claude Code MCP Configuration

Add to your Claude Code MCP config (~/.claude/mcp.json or project .mcp.json):

{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "npx",
      "args": ["ui-perception-engine"]
    }
  }
}

Or if installed globally:

{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "uipe"
    }
  }
}

Or from a local clone:

{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "node",
      "args": ["/path/to/uipe/ui-perception-engine/dist/src/mcp/index.js"],
      "env": {
        "OLLAMA_URL": "http://localhost:11434",
        "OLLAMA_MODEL": "qwen3-vl:8b",
        "OMNIPARSER_URL": "http://localhost:8100",
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}

Environment variables

Variable	Purpose
`ANTHROPIC_API_KEY`	Claude Vision API — detection fallback + deep analysis (Tier C)
`OLLAMA_URL`	Ollama server URL (default: `http://localhost:11434`)
`OLLAMA_MODEL`	Vision model name (default: `qwen3-vl:8b`)
`OMNIPARSER_URL`	OmniParser V2 sidecar URL (default: `http://localhost:8100`)

See .env.example for the full list of configurable variables including frame capture, browser, and temporal settings.

Local Vision Services

The three-tier vision pipeline uses two local services. Both are optional — the system degrades gracefully without them.

Ollama (Tier B — visual understanding):

# Install Ollama: https://ollama.com
ollama pull qwen3-vl:8b
ollama list                    # verify model is available
# Ollama serves on http://localhost:11434 by default

OmniParser V2 (Tier A — element detection):

OmniParser runs as a Python FastAPI sidecar on port 8100. See the Local Vision Handoff doc section 5 for full setup instructions.

# Quick check if OmniParser is running:
curl -s http://localhost:8100/health

Without local services: If neither Ollama nor OmniParser is running, visual=true falls back to Claude Vision API (requires ANTHROPIC_API_KEY). If no vision service is available at all, the engine uses structural-only analysis (DOM + a11y tree).

Using with the `live-deployment-check` Skill

The live-deployment-check skill pairs directly with this MCP server to visually verify a deployed site or app — catching broken images, empty routes, stuck spinners, and placeholder text that only surface in a real browser.

Workflow

1. navigate(url)            → load the page, get initial scene
2. get_scene()              → re-capture after JS hydrates (critical for SPAs)
3. get_console_logs()       → check for JS errors (type="error")
4. get_network_errors()     → check for failed API/resource requests
5. Scan scene output        → look for broken signals (see below)
6. act() on nav links       → walk routes, verify each one loads
7. Report findings          → list what's working and what's broken

Common Signals in Scene Output

# Broken image:
img[img]:"broken"

# Empty SPA route (component failed to load):
router-outlet[element]          ← no children = problem

# Stuck loading spinner:
progressbar[progressbar]        ← present after JS settles = API error

# Route loaded correctly:
router-outlet[element]
  app-order-list[element]:"Order Management..."   ← has content = good

Example

// After deploying an Angular app
navigate("http://your-app.vercel.app")
get_scene()                     // wait for hydration
→ check router-outlet has content, no broken img nodes

// Walk routes
act({ type: "clickSelector", selector: "a[href='/orders']" })
get_scene()
→ verify orders page loaded

act({ type: "clickSelector", selector: "a[href='/customers']" })
get_scene()
→ verify customers page loaded

What to Check

Broken images — img nodes with "broken" content
Empty routes — router-outlet with no child elements
Stuck spinners — progressbar still present after get_scene()
Placeholder text — undefined, null, TODO, <repo-url> in visible text
Error pages — 404 or error component rendered instead of expected content

Development

pnpm test          # run tests
pnpm test:watch    # watch mode
pnpm lint          # lint
pnpm build         # compile TypeScript
pnpm mcp           # start MCP server (after build)
pnpm start:dev     # check services + start MCP server

Architecture

src/
├── config.ts           ← centralized config (dotenv)
├── types/              ← shared types (contracts between pipelines)
├── browser/            ← BrowserRuntime (Playwright)
├── pipelines/
│   ├── structural/     ← DOM + a11y tree extraction
│   ├── visual/
│   │   ├── index.ts        ← Three-tier orchestrator (detect/understand/deep)
│   │   ├── omniparser.ts   ← OmniParser V2 client (Tier A)
│   │   ├── claude-vision.ts ← Claude Vision API (Tier C)
│   │   ├── ollama-vision.ts ← Qwen3-VL via Ollama (Tier B)
│   │   └── frame-capture.ts ← CDP screencast + perceptual hashing
│   ├── fusion/         ← merge visual + structural → SceneGraph
│   ├── temporal/       ← change detection + state tracking
│   └── affordance/     ← predict interaction outcomes
├── mcp/                ← MCP server (12 tools)
└── utils/

Viewport default: 1280x720 (configurable via env) Screenshot format: PNG (lossless, required for vision models)

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
scripts		scripts
sidecar/omniparser		sidecar/omniparser
src		src
tests		tests
.env.example		.env.example
.eslintrc.json		.eslintrc.json
.gitignore		.gitignore
.mcp.json.example		.mcp.json.example
README.md		README.md
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
tsconfig.json		tsconfig.json
tsconfig.node.json		tsconfig.node.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UI Perception Engine

What It Does

MCP Tools

Three-tier vision pipeline (`visual=true`)

`act` action types

Installation

From npm

From source

Claude Code MCP Configuration

Environment variables

Local Vision Services

Using with the `live-deployment-check` Skill

Workflow

Common Signals in Scene Output

Example

What to Check

Development

Architecture

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

UI Perception Engine

What It Does

MCP Tools

Three-tier vision pipeline (visual=true)

act action types

Installation

From npm

From source

Claude Code MCP Configuration

Environment variables

Local Vision Services

Using with the live-deployment-check Skill

Workflow

Common Signals in Scene Output

Example

What to Check

Development

Architecture

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Three-tier vision pipeline (`visual=true`)

`act` action types

Using with the `live-deployment-check` Skill

Packages