AI-Powered Locators: Proposed architecture discussion #138
Replies: 3 comments
Step 1: accessibility tree → Haiku (text only) → @en
Two gaps worth correcting before anyone starts building:
Gap 1: A11y tree and refMap are disconnected systems
The a11y tree (vibium:page.a11yTree) returns {role, name, children} nodes. The refMap is built separately by browserMap via querySelectorAll. They share no cross-reference, so steps 1-2 as written ("a11y tree → Haiku → @en") don't work. Haiku gets role/name nodes with no @en attached and has no way to produce one.
Fix: run browserMap first (which already produces @e1 label, @e2 label output), then send that alongside the a11y tree data to Haiku so it can pick an @en directly. The combined input gives Haiku both the semantic richness of the a11y tree and the actionable refs from the map.
Gap 2: browserMap doesn't pierce shadow DOM
mapScript uses querySelectorAll, which doesn't pierce shadow DOM. So elements inside shadow DOM won't appear in refMap, meaning the annotated screenshot won't badge them either. The a11y tree handles shadow DOM correctly (getChildren checks el.shadowRoot), but browserMap doesn't. For shadow DOM elements the pipeline falls through to last resort (plain screenshot → coordinates) more often than the proposal implies. Worth being explicit about this boundary.
Minor: cache check should be first
The pipeline lists cache as step 6 (after everything). It should be the first check (lookup before any model call), with storage after successful resolution.
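To make the Gap 1 fix concrete, here is a minimal Python sketch of what "send the browserMap output alongside the a11y tree" could look like as a single text input. The A11yNode class, the build_model_input function and the prompt layout are assumptions for illustration, not Vibium's actual types or prompt; only the {role, name, children} node shape and the "@e1 label" format come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class A11yNode:
    """One {role, name, children} node (hypothetical Python stand-in)."""
    role: str
    name: str = ""
    children: list["A11yNode"] = field(default_factory=list)


def render_a11y(node: A11yNode, depth: int = 0) -> list[str]:
    """Flatten the tree into indented 'role "name"' lines."""
    lines = [f'{"  " * depth}{node.role} "{node.name}"']
    for child in node.children:
        lines.extend(render_a11y(child, depth + 1))
    return lines


def build_model_input(ref_labels: dict[str, str], tree: A11yNode, query: str) -> str:
    """Combine refs and a11y tree into one prompt (layout is an assumption)."""
    ref_lines = [f"{ref} {label}" for ref, label in ref_labels.items()]
    return "\n".join([
        "Interactive elements:", *ref_lines,
        "", "Accessibility tree:", *render_a11y(tree),
        "", f"Task: {query}", "Answer with one @e ref, or NONE.",
    ])
```

With both sections in one input, the model can answer with a ref that Vibium already knows how to resolve, instead of a role/name pair it would have to match back to the map.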
Follow-up: splitting this into testable PRs
Based on the corrected pipeline, here's how I'd break the implementation into independently testable PRs:
PR 1: Resolver interface + Anthropic backend
PR 2: Combined browserMap + a11y tree enrichment
PR 3: vibe.do() text path (MVP)
PR 4: vibe.do() vision fallback
PR 5: vibe.check() assertion path
PR 6: Natural language vibe.find()
PR 7: Client library bindings
PR 8: DOM mutation cache invalidation
PRs 1-2 are invisible foundations. PR 3 is the first thing a user can actually run. PRs 4-6 widen coverage. PRs 7-8 are polish and reliability. Each PR leaves main in a shippable state.
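For PR 1, the resolver interface could be as small as the Python sketch below. Every name here (Resolver, Resolution, KeywordResolver) is hypothetical; the keyword matcher is only an offline stand-in so the interface can be tested without a model call, and it is not the Anthropic backend.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Resolution:
    ref: Optional[str]  # e.g. "@e3", or None when nothing matched
    backend: str        # which backend produced the answer


class Resolver(Protocol):
    """Anything that maps a query plus page text to an @eN ref."""

    def resolve(self, query: str, page_text: str) -> Resolution: ...


class KeywordResolver:
    """Offline stand-in for tests; a real backend would call a model here."""

    name = "keyword"

    def resolve(self, query: str, page_text: str) -> Resolution:
        words = set(query.lower().split())
        for line in page_text.splitlines():
            ref, _, label = line.partition(" ")
            # Match the first "@eN label" line sharing a word with the query.
            if ref.startswith("@e") and words & set(label.lower().split()):
                return Resolution(ref, self.name)
        return Resolution(None, self.name)
```

Keeping the backend behind a protocol like this would also leave the model-agnostic question open without blocking the first PRs.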
The roadmap lists AI-Powered Locators as a V2 priority with several open questions. After digging into the codebase, I want to propose a concrete approach and get alignment before anyone starts building.
Why this isn't the binary the roadmap describes
The roadmap frames the choice as:
Screenshot → model → coordinates or DOM → model → selector
Both have real problems:
Screenshot → coordinates: the model must do spatial reasoning and output precise pixel positions — fragile across zoom, scroll, resize, retina
DOM → selector: the full DOM can be enormous, and the approach fails entirely on canvas elements or anything not in the accessibility tree
The annotated screenshot is a third path the roadmap didn't consider — and it's already implemented in Vibium.
How annotated screenshots work (relevant context)
vibium screenshot --annotate does the following:
1. Runs browserMap: discovers all interactive elements and stores them as @e1 → CSS selector, @e2 → CSS selector, ... in a refMap
2. Injects red numbered badge overlays into the DOM at each element's getBoundingClientRect() position
3. Captures the screenshot
4. Removes the badges
The refMap lives in memory. So when a model looks at the annotated screenshot and identifies badge 3, Vibium already has the CSS selector for @e3 — the model never needs to output coordinates. The model's job is label identification, not spatial reasoning. You get the visual grounding of a screenshot with the precision of a selector.
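As a rough illustration of that last point, here is a minimal Python sketch of turning a model's badge answer into a selector. The function name, the reply formats it accepts and the sample selectors are made up for illustration; only the @eN → selector shape of the refMap comes from the description above.

```python
import re
from typing import Optional


def selector_for_answer(answer: str, ref_map: dict[str, str]) -> Optional[str]:
    """Map a reply such as "3", "@e3" or "badge 3" to the stored selector."""
    match = re.search(r"\d+", answer)
    if match is None:
        return None  # the model did not name a badge
    ref = f"@e{int(match.group())}"
    # Unknown badge numbers return None rather than a guessed selector.
    return ref_map.get(ref)


# Hypothetical refMap contents.
ref_map = {
    "@e1": "#search",
    "@e2": "a.signup",
    "@e3": "button[type=submit]",
}
```

The model only ever emits a small integer, so a hallucinated badge number fails the lookup cleanly instead of producing a click at the wrong coordinates.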
Proposed Resolution Pipeline
vibe.do("click login button")
│
▼
│
▼
found → Vibium resolves @en → selector → done
not found ↓
│
▼
│
▼
found → Vibium resolves @en → selector → done
not found ↓
│
▼
│
▼
Steps 1-2 cover the majority of cases cheaply. Steps 3-4 handle visually complex pages without coordinate inference. Step 5 is last resort only — for canvas elements and anything outside the accessibility tree and refMap. Caching makes repeated agent runs nearly free after warmup; invalidation can use BiDi DOM mutation events.
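The control flow can be sketched in Python as a list of tiers tried in order, with the cache lookup moved to the front as suggested in the replies. This is a sketch under assumptions: the resolve signature, the tier names and the (url, query) tuple key are all hypothetical, and each tier is just a callable that returns an @eN ref or None.

```python
from typing import Callable, Optional

# A tier takes the query and returns an "@eN" ref, or None if it found nothing.
Tier = Callable[[str], Optional[str]]


def resolve(
    query: str,
    url: str,
    tiers: list[tuple[str, Tier]],
    ref_map: dict[str, str],
    cache: dict[tuple[str, str], str],
) -> tuple[str, Optional[str]]:
    """Return (source, selector); selector is None when only the
    coordinate-based last resort is left."""
    key = (url, query)
    if key in cache:  # cache check before any model call
        return "cache", cache[key]
    for name, tier in tiers:  # cheapest tier first
        ref = tier(query)
        selector = ref_map.get(ref) if ref else None
        if selector:
            cache[key] = selector  # store only after successful resolution
            return name, selector
    return "coordinates", None  # caller falls back to plain screenshot
```

A caller would pass something like [("text", text_tier), ("vision", vision_tier)], so the text path and the annotated-screenshot path stay independently testable.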
On the open questions
Local model (Qwen-VL) vs API?
API for V2 — local models are an optimization after the approach is validated.
Ambiguity handling?
The annotated screenshot collapses most of it — the model sees labeled elements and picks one. If still ambiguous, surface the candidates back to the caller.
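One possible shape for "surface the candidates back to the caller", as a minimal Python sketch. LocateResult and settle are hypothetical names, and it assumes the model is allowed to return more than one ref when it cannot decide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LocateResult:
    selector: Optional[str] = None
    # (ref, label) pairs for the caller to choose from when ambiguous.
    candidates: list[tuple[str, str]] = field(default_factory=list)


def settle(
    picked_refs: list[str],
    ref_map: dict[str, str],
    labels: dict[str, str],
) -> LocateResult:
    """Resolve a single pick; otherwise hand the candidates back."""
    # Drop refs that are not in the map, and de-duplicate preserving order.
    known = list(dict.fromkeys(r for r in picked_refs if r in ref_map))
    if len(known) == 1:
        return LocateResult(selector=ref_map[known[0]])
    return LocateResult(candidates=[(r, labels.get(r, "")) for r in known])
```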
Caching?
Yes, keyed on hash(url + normalized_query). Invalidate on BiDi DOM mutation events.
Two levels:
Result cache (in Vibium): if the same query on the same URL was resolved before, skip the model entirely and return the cached selector directly
Prompt cache (via Anthropic API or Bedrock): when the model is called, mark the accessibility tree / annotated screenshot as the cached prefix — same page, multiple vibe.do() calls, ~90% input token cost reduction on cache hits
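Both levels can be sketched in a few lines of Python. The normalization rule (lowercase, collapsed whitespace), the choice of SHA-256 and the ResultCache class are assumptions; the proposal only specifies hash(url + normalized_query). The invalidate method is where a DOM mutation hook would plug in, and that wiring is not shown. The last function builds a content block with cache_control set to ephemeral, which is how Anthropic's prompt caching marks a cacheable prefix.

```python
import hashlib
from typing import Optional


def normalize_query(query: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def cache_key(url: str, query: str) -> str:
    raw = f"{url}\n{normalize_query(query)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ResultCache:
    """Level 1: resolved selectors, grouped by URL for cheap invalidation."""

    def __init__(self) -> None:
        self._by_url: dict[str, dict[str, str]] = {}

    def get(self, url: str, query: str) -> Optional[str]:
        return self._by_url.get(url, {}).get(cache_key(url, query))

    def put(self, url: str, query: str, selector: str) -> None:
        self._by_url.setdefault(url, {})[cache_key(url, query)] = selector

    def invalidate(self, url: str) -> None:
        """Drop every entry for a page, e.g. when its DOM has mutated."""
        self._by_url.pop(url, None)


def cached_prefix_block(page_text: str) -> dict:
    """Level 2: mark the page description as a cacheable prompt prefix."""
    return {
        "type": "text",
        "text": page_text,
        "cache_control": {"type": "ephemeral"},
    }
```

Grouping entries by URL is a design choice for this sketch: a mutation on one page clears that page's entries without touching the rest.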
Open for discussion
Does browserMap reliably capture enough elements, or do we hit shadow DOM / canvas gaps that push us to the last resort too often?
Should vibe.check() be a separate assertion code path rather than a locator variant?
Any objections to Claude as the default model backend vs. keeping it model-agnostic from day one?