AI-Powered Locators: Proposed architecture discussion #138

vivekkrishna · 2026-05-06T10:22:50Z

The roadmap lists AI-Powered Locators as a V2 priority with several open questions. After digging into the codebase, I want to propose a concrete approach and get alignment before anyone starts building.

Why this isn't the binary the roadmap describes
The roadmap frames the choice as:

Screenshot → model → coordinates or DOM → model → selector

Both have real problems:

Screenshot → coordinates: the model must do spatial reasoning and output precise pixel positions, which is fragile across zoom, scroll, resize, and retina displays
DOM → selector: the full DOM can be enormous, and the approach fails entirely on canvas elements or anything not in the accessibility tree

The annotated screenshot is a third path the roadmap didn't consider, and it's already implemented in Vibium.

How annotated screenshots work (relevant context)
vibium screenshot --annotate does the following:

1. Runs browserMap, which discovers all interactive elements and stores them in a refMap as @e1 → CSS selector, @e2 → CSS selector, ...
2. Injects red numbered badge overlays into the DOM at each element's getBoundingClientRect() position
3. Captures the screenshot
4. Removes the badges

The refMap lives in memory. So when a model looks at the annotated screenshot and identifies badge 3, Vibium already has the CSS selector for @e3, and the model never needs to output coordinates. The model's job is label identification, not spatial reasoning. You get the visual grounding of a screenshot with the precision of a selector.
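
To make the "label identification, not spatial reasoning" point concrete, here is a rough sketch of the lookup on the Vibium side. RefMap and ResolveBadge are hypothetical names for illustration, not the actual Vibium types, and the selectors are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RefMap is a hypothetical stand-in for the in-memory map that browserMap
// builds: ref label ("@e3") -> CSS selector.
type RefMap map[string]string

// ResolveBadge turns the model's answer (a badge number such as "3" or a ref
// such as "@e3") into a selector. It never deals with pixel coordinates.
func (m RefMap) ResolveBadge(answer string) (string, error) {
	s := strings.TrimPrefix(strings.TrimSpace(answer), "@e")
	n, err := strconv.Atoi(s)
	if err != nil || n < 1 {
		return "", fmt.Errorf("not a badge label: %q", answer)
	}
	ref := "@e" + strconv.Itoa(n)
	sel, ok := m[ref]
	if !ok {
		return "", fmt.Errorf("unknown ref %s", ref)
	}
	return sel, nil
}

func main() {
	// Illustrative entries only.
	refs := RefMap{
		"@e1": "input#email",
		"@e2": "input#password",
		"@e3": "button[type=submit]",
	}
	sel, err := refs.ResolveBadge("3")
	fmt.Println(sel, err)
}
```

Anything the model returns that is not a known label is rejected up front, which gives the pipeline a clean "not found" signal to fall through on.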

Proposed Resolution Pipeline

vibe.do("click login button")
        │
        ▼
1. Accessibility tree extraction           ← no vision, cheap
        │
        ▼
2. Claude Haiku: match query → @en         ← ~50ms, ~$0.0001
   found → Vibium resolves @en → selector → done
   not found ↓
        │
        ▼
3. Annotated screenshot
        │
        ▼
4. Claude Sonnet vision: pick @en          ← ~500ms, ~$0.003
   found → Vibium resolves @en → selector → done
   not found ↓
        │
        ▼
5. Plain screenshot → model returns (x,y)  ← last resort, fragile
        │
        ▼
6. Cache result keyed on hash(url + normalized_query)

Steps 1-2 cover the majority of cases cheaply. Steps 3-4 handle visually complex pages without coordinate inference. Step 5 is last resort only — for canvas elements and anything outside the accessibility tree and refMap. Caching makes repeated agent runs nearly free after warmup; invalidation can use BiDi DOM mutation events.
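
The fall-through behaviour could be wired roughly like this. It is only a sketch: Stage, Resolve, and ErrNoMatch are hypothetical names, and the stage bodies are stubs standing in for the real model calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is one hypothetical resolution step. It returns a target (an @en ref
// or, for the last resort, coordinates) and whether it found a match.
type Stage struct {
	Name string
	Try  func(query string) (target string, found bool)
}

// ErrNoMatch is returned when every stage comes back empty.
var ErrNoMatch = errors.New("no stage matched the query")

// Resolve runs the stages in order and stops at the first match, so the
// expensive stages only run when the cheaper ones return nothing.
func Resolve(stages []Stage, query string) (target, stage string, err error) {
	for _, s := range stages {
		if t, ok := s.Try(query); ok {
			return t, s.Name, nil
		}
	}
	return "", "", ErrNoMatch
}

func main() {
	// Stubbed stages: the text path misses, the vision path hits.
	stages := []Stage{
		{"a11y+haiku", func(q string) (string, bool) { return "", false }},
		{"annotated+sonnet", func(q string) (string, bool) { return "@e3", true }},
		{"plain+coordinates", func(q string) (string, bool) { return "412,88", true }},
	}
	target, stage, err := Resolve(stages, "click login button")
	fmt.Println(target, stage, err)
}
```

Returning the stage name alongside the target makes it easy to log how often each tier is actually hit, which is the data needed to judge whether the last resort is firing too often.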

On the open questions
Local model (Qwen-VL) vs API?
API for V2 — local models are an optimization after the approach is validated.

Ambiguity handling?
The annotated screenshot collapses most of it — the model sees labeled elements and picks one. If still ambiguous, surface the candidates back to the caller.

Caching?
Yes, keyed on hash(url + normalized_query). Invalidate on BiDi DOM mutation events.

Two levels:
Result cache (in Vibium): if the same query on the same URL was resolved before, skip the model entirely and return the cached selector directly
Prompt cache (via Anthropic API or Bedrock): when the model is called, mark the accessibility tree / annotated screenshot as the cached prefix — same page, multiple vibe.do() calls, ~90% input token cost reduction on cache hits
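
For the result cache key, something along these lines would do. The thread does not define "normalized", so normalizeQuery below (lowercase plus whitespace collapsing) is an assumption, as are the function names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// normalizeQuery lowercases the query and collapses runs of whitespace. This
// is one assumed definition of "normalized"; the proposal does not fix one.
func normalizeQuery(q string) string {
	return strings.Join(strings.Fields(strings.ToLower(q)), " ")
}

// cacheKey hashes the URL and the normalized query. The NUL separator keeps
// ("a", "bc") and ("ab", "c") from feeding identical bytes to the hash.
func cacheKey(url, query string) string {
	sum := sha256.Sum256([]byte(url + "\x00" + normalizeQuery(query)))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := cacheKey("https://example.com/login", "Click the  Login button")
	b := cacheKey("https://example.com/login", "click the login button")
	fmt.Println(a == b) // both queries normalize to the same string
}
```

Whether the URL itself should be normalized (query strings, fragments, trailing slashes) is a separate decision this sketch leaves alone.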

Open for discussion
Does browserMap reliably capture enough elements, or do we hit shadow DOM / canvas gaps that push us to the last resort too often?
Should vibe.check() be a separate assertion code path rather than a locator variant?
Any objections to Claude as the default model backend vs. keeping it model-agnostic from day one?

vivekkrishna · 2026-05-06T10:32:16Z

1. Model integration layer
   Build a model-agnostic Go interface (Resolver) with backends for Anthropic API and Bedrock. Takes a prompt + optional image, returns a string. Prompt caching (cache_control markers on accessibility tree / screenshot prefix) lives here.
2. Locator resolution pipeline
   Wire the pipeline in the Go daemon as an internal function:

   Step 1: accessibility tree → Haiku (text only) → @en
   Step 2: annotated screenshot → Sonnet vision → @en
   Step 3: plain screenshot → Sonnet vision → (x,y) (last resort)
   Each step falls through only if the previous returns no match
3. Result cache
   In-memory map in the daemon keyed on hash(url + normalized_query) → selector. No external dependency needed for V2.
4. DOM mutation invalidation
   Listen for BiDi DOM mutation events to invalidate cache entries for the affected URL. Prevents stale selectors after page updates.
5. vibe.do() command
   New CLI command + MCP tool. Accepts natural language action ("click the login button"), runs the resolution pipeline, executes the action.
6. vibe.find() natural language upgrade
   Extend the existing find command to accept natural language in addition to CSS selectors. Resolution pipeline returns the matched element ref.
7. vibe.check() assertion path
   Separate from the existing checkbox check command. A new assertion variant ("verify the dashboard loaded") that resolves via pipeline and returns pass/fail.
8. Ambiguity handling
   If resolution returns multiple candidates, surface them to the caller rather than guessing. Define the response contract (JSON list of candidates + confidence).
9. Client library bindings
   Expose do(), natural language find(), and assertion check() in JS, Python, and Java clients.
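
As a starting point for the response contract mentioned under Ambiguity handling above, here is one possible shape. Every field name and status value is a placeholder to argue about, not a settled format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Candidate is a hypothetical entry in the candidate list.
type Candidate struct {
	Ref        string  `json:"ref"`
	Role       string  `json:"role,omitempty"`
	Name       string  `json:"name,omitempty"`
	Confidence float64 `json:"confidence"`
}

// Resolution is a hypothetical envelope returned to the caller.
type Resolution struct {
	Status     string      `json:"status"` // "resolved", "ambiguous", or "not_found"
	Candidates []Candidate `json:"candidates"`
}

// classify marks the result ambiguous when more than one candidate remains,
// leaving the choice to the caller instead of guessing.
func classify(cands []Candidate) Resolution {
	switch len(cands) {
	case 0:
		return Resolution{Status: "not_found", Candidates: []Candidate{}}
	case 1:
		return Resolution{Status: "resolved", Candidates: cands}
	default:
		return Resolution{Status: "ambiguous", Candidates: cands}
	}
}

func main() {
	// Illustrative values only.
	r := classify([]Candidate{
		{Ref: "@e3", Role: "button", Name: "Sign In", Confidence: 0.62},
		{Ref: "@e7", Role: "link", Name: "Sign in with SSO", Confidence: 0.35},
	})
	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
}
```

Returning an empty list rather than null for "not_found" keeps client code in JS, Python, and Java from needing a null check before iterating.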

vivekkrishna · 2026-05-11T15:13:40Z

Two gaps are worth correcting before anyone starts building:

Gap 1: A11y tree and refMap are disconnected systems

The a11y tree (vibium:page.a11yTree) returns {role, name, children} nodes. The refMap is built separately by browserMap via querySelectorAll. They share no cross-reference, so steps 1-2 as written ("a11y tree → Haiku → @en") don't work. Haiku gets role/name nodes with no @en attached and has no way to produce one.

Fix: run browserMap first (which already produces @e1 label, @e2 label output), then send that alongside the a11y tree data to Haiku so it can pick an @en directly. The combined input gives Haiku both the semantic richness of the a11y tree and the actionable refs from the map.

Gap 2: browserMap doesn't pierce shadow DOM

mapScript uses querySelectorAll which doesn't pierce shadow DOM. So elements inside shadow DOM won't appear in refMap — meaning the annotated screenshot won't badge them either. The a11y tree handles shadow DOM correctly (getChildren checks el.shadowRoot), but browserMap doesn't. For shadow DOM elements the pipeline falls through to last resort (plain screenshot → coordinates) more often than the proposal implies. Worth being explicit about this boundary.

Minor: cache check should be first

The pipeline lists cache as step 6 (after everything). It should be the first check — lookup before any model call — with storage after successful resolution.
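
The corrected ordering, sketched as a thin wrapper: lookup before any model call, store only after a successful resolution. CachedResolver and resolveFunc are hypothetical names, and the plain concatenated key stands in for the hashed one.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoMatch signals that the pipeline found nothing for the query.
var ErrNoMatch = errors.New("no match")

// resolveFunc stands in for the model-backed pipeline (assumed signature).
type resolveFunc func(url, query string) (selector string, err error)

// CachedResolver checks the result cache first and stores only successes.
// It is not safe for concurrent use; a daemon would need a mutex around it.
type CachedResolver struct {
	cache   map[string]string
	resolve resolveFunc
}

func NewCachedResolver(r resolveFunc) *CachedResolver {
	return &CachedResolver{cache: make(map[string]string), resolve: r}
}

func (c *CachedResolver) Resolve(url, query string) (string, error) {
	key := url + "\x00" + query // a real key would hash url + normalized query
	if sel, ok := c.cache[key]; ok {
		return sel, nil // hit: no model call at all
	}
	sel, err := c.resolve(url, query)
	if err != nil {
		return "", err // failures are not cached
	}
	c.cache[key] = sel
	return sel, nil
}

func main() {
	calls := 0
	r := NewCachedResolver(func(url, query string) (string, error) {
		calls++ // counts how often the "model" is actually invoked
		return "button[type=submit]", nil
	})
	r.Resolve("https://example.com", "click login")
	r.Resolve("https://example.com", "click login")
	fmt.Println("model calls:", calls)
}
```

Not caching failures is a deliberate choice here: a "not found" on a page that is still loading should not poison later calls.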

vivekkrishna · 2026-05-11T15:22:00Z

Follow-up: splitting this into testable PRs

Based on the corrected pipeline, here's how I'd break the implementation into independently testable PRs:

PR 1 — Resolver interface + Anthropic backend
Foundation only, no CLI changes. A Go Resolver interface: takes prompt + optional image, returns string. One concrete implementation: Anthropic API with prompt caching markers on the a11y tree / screenshot prefix. Tested via Go unit tests (mock + real API).
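
A possible shape for the interface plus the mock used in unit tests. This is a sketch only: Request, MockResolver, and Ask are hypothetical names, and no real backend or SDK call is shown.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Request is a hypothetical input shape: a prompt plus an optional image.
type Request struct {
	Prompt string
	Image  []byte // nil for the text-only path
}

// Resolver is the model-agnostic seam. Each backend (Anthropic API, Bedrock,
// later a local model) would provide its own implementation.
type Resolver interface {
	Resolve(ctx context.Context, req Request) (string, error)
}

// MockResolver is a test double that answers from a fixed table. If several
// keys match a prompt, which one wins is unspecified, so tests should keep
// the keys disjoint.
type MockResolver struct {
	Answers map[string]string // substring of prompt -> canned reply
	Calls   int
}

func (m *MockResolver) Resolve(ctx context.Context, req Request) (string, error) {
	m.Calls++
	if err := ctx.Err(); err != nil {
		return "", err
	}
	for needle, reply := range m.Answers {
		if strings.Contains(req.Prompt, needle) {
			return reply, nil
		}
	}
	return "NONE", nil
}

// Ask is a convenience wrapper for the text-only path.
func Ask(r Resolver, prompt string) (string, error) {
	return r.Resolve(context.Background(), Request{Prompt: prompt})
}

func main() {
	var r Resolver = &MockResolver{Answers: map[string]string{"sign in": "@e3"}}
	out, err := Ask(r, "click the sign in button")
	fmt.Println(out, err)
}
```

Taking a context.Context from the start means timeouts and cancellation can be added per backend without changing the interface later.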

PR 2 — Combined browserMap + a11y tree enrichment
New internal function that merges both into a single output: @e1 | button | "Sign In". Surfaced as vibium map --rich. Testable: run against a real page, verify @en refs have role and name attached. No model calls yet.
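
For the output format, a small sketch of the rendering step. How refs get joined to a11y nodes is still open, so RichRef below assumes that join has already happened; the type and function names are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// RichRef is a hypothetical merged record: a browserMap ref plus the role and
// accessible name of the same element. The join itself is not shown.
type RichRef struct {
	Ref, Role, Name string
}

// FormatRichMap renders one line per element in the
// `@e1 | button | "Sign In"` shape, ready for a text-only prompt.
func FormatRichMap(refs []RichRef) string {
	var b strings.Builder
	for _, r := range refs {
		role := r.Role
		if role == "" {
			role = "unknown" // element present in refMap but without a role
		}
		fmt.Fprintf(&b, "%s | %s | %q\n", r.Ref, role, r.Name)
	}
	return b.String()
}

func main() {
	fmt.Print(FormatRichMap([]RichRef{
		{Ref: "@e1", Role: "button", Name: "Sign In"},
		{Ref: "@e2", Role: "textbox", Name: "Email"},
	}))
}
```

Quoting the name with %q keeps names that contain pipes or newlines from breaking the line format.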

PR 3 — vibe.do() text path (MVP)
First fully working user-facing slice: cache lookup → combined map → Haiku picks @en → execute → cache store. New vibium do "..." CLI command + MCP tool. Testable end-to-end: vibium do "click the sign in button" on a page with reasonable a11y labels. This is the PR that proves the concept.

PR 4 — vibe.do() vision fallback
Extends PR 3: when Haiku returns no match, fall back to annotated screenshot → Sonnet vision → @en. Testable: run against a page with poor or missing a11y labels where PR 3 alone would fail.

PR 5 — vibe.check() assertion path
Separate from the existing checkbox check command. New command + MCP tool: takes a natural language assertion, returns pass/fail. Testable: vibium check "the dashboard is loaded" passing and failing against appropriate pages.

PR 6 — Natural language vibe.find()
Extend the existing find command to accept natural language in addition to CSS selectors. Runs the resolution pipeline, returns the matched element ref. Testable: vibium find "the blue submit button" returns @en.

PR 7 — Client library bindings
Expose do(), natural language find(), and assertion check() in JS, Python, and Java clients. Tested via existing client test suites.

PR 8 — DOM mutation cache invalidation
BiDi DOM mutation event listener in the daemon to invalidate cache entries when the page changes. Testable: resolve an element, mutate the DOM, verify re-resolution runs fresh.
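
One design point for this PR: a key of the form hash(url + normalized_query) cannot be searched by URL, so invalidating "everything for this page" needs a second index. A rough sketch, with hypothetical names and with the mutation signal itself left out of scope:

```go
package main

import "fmt"

// ResultCache maps a (url, query) key to a selector and keeps a per-URL index
// so a single mutation event can drop every entry for that page.
// Not safe for concurrent use; a daemon would guard it with a mutex.
type ResultCache struct {
	entries map[string]string   // key -> selector
	byURL   map[string][]string // url -> keys stored for that url
}

func NewResultCache() *ResultCache {
	return &ResultCache{
		entries: make(map[string]string),
		byURL:   make(map[string][]string),
	}
}

func key(url, query string) string { return url + "\x00" + query }

func (c *ResultCache) Put(url, query, selector string) {
	k := key(url, query)
	if _, exists := c.entries[k]; !exists {
		c.byURL[url] = append(c.byURL[url], k)
	}
	c.entries[k] = selector
}

func (c *ResultCache) Get(url, query string) (string, bool) {
	sel, ok := c.entries[key(url, query)]
	return sel, ok
}

// InvalidateURL drops all entries for url and reports how many were removed.
// It would be called from whatever handler receives the mutation signal.
func (c *ResultCache) InvalidateURL(url string) int {
	keys := c.byURL[url]
	for _, k := range keys {
		delete(c.entries, k)
	}
	delete(c.byURL, url)
	return len(keys)
}

func main() {
	c := NewResultCache()
	c.Put("https://example.com/login", "click login", "button[type=submit]")
	removed := c.InvalidateURL("https://example.com/login")
	_, ok := c.Get("https://example.com/login", "click login")
	fmt.Println(removed, ok)
}
```

Dropping the whole page on any mutation is the blunt version; narrowing it to entries whose selector falls inside the mutated subtree is a possible refinement once the basic path is tested.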

PRs 1-2 are invisible foundations. PR 3 is the first thing a user can actually run. PRs 4-6 widen coverage. PRs 7-8 are polish and reliability. Each PR leaves main in a shippable state.
