Add screenshot option to Extract #2149
🦋 Changeset detected
Latest commit: d0c7500
The changes in this PR will be included in the next version bump. This PR includes changesets to release 5 packages.
5241d31 to 2b4eca9
No issues found across 13 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Client as "Client (User Code)"
participant V3 as "V3.extract()"
participant ExtractHandler as "ExtractHandler"
participant Page as "Page (Playwright)"
participant Inference as "extract() (inference.ts)"
participant LLM as "LLMClient (AI SDK)"
participant Prompts as "Prompt Builders"
Note over Client,Prompts: NEW: Screenshot extraction flow
Client->>V3: extract(instruction, { screenshot: true })
V3->>ExtractHandler: extract({ screenshot: true, ... })
ExtractHandler->>ExtractHandler: resolveLlmClient(model)
alt llmClient.type is NOT 'aisdk'
ExtractHandler-->>Client: throw StagehandInvalidArgumentError
end
ExtractHandler->>Page: captureHybridSnapshot(...)
Page-->>ExtractHandler: snapshot (combinedTree)
ExtractHandler->>Page: page.screenshot({ fullPage: false, type: 'png' })
Page-->>ExtractHandler: screenshotBuffer (Buffer)
ExtractHandler->>Inference: extract({ screenshot: screenshotBuffer, ... })
alt llmClient.type is NOT 'aisdk'
Inference-->>ExtractHandler: throw StagehandInvalidArgumentError
end
Inference->>Prompts: buildExtractSystemPrompt(includeScreenshot=true)
Prompts-->>Inference: system prompt referencing accessibility tree and screenshot
Note over Inference: Builds multimodal user message
Inference->>Prompts: buildExtractUserPrompt(instruction, domElements, screenshotDataUrl)
Prompts->>Prompts: Convert screenshot buffer to data URL
alt screenshotDataUrl present
Prompts-->>Inference: Multimodal message (text + image_url)
else no screenshot
Prompts-->>Inference: Text-only message
end
Inference->>LLM: createChatCompletion(messages with image_url)
LLM-->>Inference: LLM response + usage
Inference-->>ExtractHandler: extracted data
ExtractHandler-->>V3: results
V3-->>Client: extracted content
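The handler-side ordering in the diagram (validate the client type, take the snapshot, capture the viewport only when asked, then call inference) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PageLike`, `InferenceInput`, `handleExtract`, the error classes, and the locally defined `ensureTimeRemaining` closure are all hypothetical stand-ins for the real Stagehand types and helpers.

```typescript
// Hypothetical, simplified shapes; these are NOT Stagehand's real types.
type LlmClientType = "aisdk" | "other";

interface PageLike {
  snapshot(): Promise<string>;
  screenshot(opts: { fullPage: boolean; type: "png" }): Promise<Uint8Array>;
}

interface ExtractArgs {
  instruction: string;
  screenshot?: boolean; // treated as false when omitted
  timeoutMs?: number;
}

interface InferenceInput {
  instruction: string;
  domElements: string;
  screenshot?: Uint8Array;
}

class InvalidArgumentError extends Error {}
class ExtractTimeoutError extends Error {}

async function handleExtract(
  page: PageLike,
  clientType: LlmClientType,
  args: ExtractArgs,
  infer: (input: InferenceInput) => Promise<unknown>,
): Promise<unknown> {
  const wantScreenshot = args.screenshot ?? false;

  // Validate before touching the page, so an unsupported client
  // never triggers a snapshot or a screenshot.
  if (wantScreenshot && clientType !== "aisdk") {
    throw new InvalidArgumentError("screenshot extraction requires an AI SDK client");
  }

  // Assumed timeout guard: throws once the overall deadline has passed.
  const deadline = args.timeoutMs === undefined ? Infinity : Date.now() + args.timeoutMs;
  const ensureTimeRemaining = (): void => {
    if (Date.now() > deadline) throw new ExtractTimeoutError("extract timed out");
  };

  ensureTimeRemaining();
  const domElements = await page.snapshot();

  let screenshot: Uint8Array | undefined;
  if (wantScreenshot) {
    ensureTimeRemaining();
    // Viewport only, PNG, matching the call shown in the diagram.
    screenshot = await page.screenshot({ fullPage: false, type: "png" });
  }

  ensureTimeRemaining();
  return infer({ instruction: args.instruction, domElements, screenshot });
}
```

Checking the client type first keeps the failure cheap: a rejected call does no page work at all, which is the behavior the handler test in the test plan asserts.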
pirate left a comment
looks good, just make sure we don't emit full base64 to any DB/logs
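One way to address this concern is to redact data URLs before anything is logged or persisted. The sketch below is illustrative only: `redactDataUrls` and the placeholder format are hypothetical and not part of this PR or of Stagehand's logging code.

```typescript
// Hypothetical helper; the name and placeholder format are illustrative only.
// Matches base64 data URLs such as "data:image/png;base64,iVBORw0...".
const DATA_URL_PATTERN = /data:([\w.+-]+\/[\w.+-]+);base64,([A-Za-z0-9+/=]+)/g;

// Replaces each base64 payload with a short marker that keeps the MIME type
// and the payload length, so logs stay useful without carrying the image.
function redactDataUrls(text: string): string {
  return text.replace(
    DATA_URL_PATTERN,
    (_match: string, mime: string, payload: string) =>
      `data:${mime};base64,<redacted ${payload.length} chars>`,
  );
}

// Usage: serialize first, then redact, then hand the result to the logger.
const loggable = redactDataUrls(
  JSON.stringify({
    role: "user",
    content: [{ type: "image_url", image_url: { url: "data:image/png;base64,iVBORw0KGgo=" } }],
  }),
);
console.log(loggable);
```

Redacting the serialized string rather than walking the message object keeps the helper independent of the message shape, at the cost of running a regex over the full payload.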
why
Some pages do not properly expose the information needed to extract the content requested by the user, and can benefit from a hybrid approach with vision.
what changed
- extract({ options: { screenshot: true } }) flag. Captures the current viewport as PNG and sends it as an image_url part alongside the a11y tree text in the extraction LLM call.
- Defaults to false, so existing callers see no behavior change.
- Only supported with AI SDK clients; throws StagehandInvalidArgumentError otherwise (validated in both the handler and inference.extract, with the handler check running before the screenshot is taken).
- Screenshot capture is guarded by ensureTimeRemaining() so it respects the extract timeout.
- Plumbed through the ExtractOptions type, ExtractOptionsSchema, OpenAPI ExtractOptions, StagehandAPIClient.extract wire payload, and V3.extract → ExtractHandler → inference.extract → buildExtract{System,User}Prompt.
test plan
- extract-screenshot.test.ts (new): multimodal message shape on happy path; rejection for non-AI SDK clients.
- timeout-handlers.test.ts: handler captures viewport when enabled, skips capture by default, rejects screenshot + non-AI SDK client without touching page.screenshot or the snapshot pipeline.
- api-client-serialization.test.ts (renamed from api-client-observe-variables.test.ts): wire payload preserves options.screenshot.
Summary by cubic
Adds a screenshot option to extract() that sends the current viewport screenshot with the accessibility tree to improve results on visually-driven pages.
- extract({ options: { screenshot: boolean } }) captures a PNG of the current viewport and includes it with the accessibility tree; defaults to false; only supported with aisdk clients (throws StagehandInvalidArgumentError otherwise).
- Sends the screenshot as an image_url data URI; logs when a screenshot is used.
- Adds the option to the types, schema, and OpenAPI spec (ExtractOptions.screenshot); the API client forwards the flag; tests cover prompt shape, viewport capture, default-off behavior, timeout guards, client-type validation, and serialization.
Written for commit d0c7500.
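The multimodal message shape described above (a text part plus an image_url part carrying a data URI, with a text-only fallback) might look roughly like this. The sketch assumes the common chat-completions content-part layout; `ContentPart`, `UserMessage`, `toPngDataUrl`, and `buildUserMessage` are hypothetical names, and the prompt wording is a placeholder rather than what buildExtractUserPrompt produces.

```typescript
// Hypothetical shapes modeled on the common chat-completions content-part format.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface UserMessage {
  role: "user";
  content: string | ContentPart[];
}

// Encodes raw PNG bytes as a base64 data URL. The binary string is built
// byte by byte so large screenshots do not overflow the argument list.
function toPngDataUrl(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:image/png;base64,${btoa(binary)}`;
}

// Returns a text-only message when no screenshot is given, and a
// two-part (text + image_url) message when one is.
function buildUserMessage(
  instruction: string,
  domElements: string,
  screenshot?: Uint8Array,
): UserMessage {
  // Placeholder prompt text; the real wording is not shown in this PR page.
  const text = `Instruction: ${instruction}\nDOM: ${domElements}`;
  if (!screenshot) {
    return { role: "user", content: text };
  }
  return {
    role: "user",
    content: [
      { type: "text", text },
      { type: "image_url", image_url: { url: toPngDataUrl(screenshot) } },
    ],
  };
}
```

Keeping the text-only branch as a plain string means callers that never set the flag produce the same message shape as before.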