feat(verifier): record agent trajectories by miguelg719 · Pull Request #2131 · browserbase/stagehand

miguelg719 · 2026-05-15T20:58:41Z

Why

The new verifier needs richer evidence than a final screenshot, especially for DOM and Hybrid agent modes where the important facts often live in tool returns, ARIA snapshots, and per-step observations. This PR adds trajectory recording without changing the verifier judgment engine.

What Changed

Added typed agent events for screenshot, step-finished, step-observed, and final-answer events.
Added listener-gated post-step probes for screenshots and ARIA trees.
Attached the settled post-turn probe to every tool call in a DOM/Hybrid turn.
Added CUA step evidence pairing and final answer capture.
Added TrajectoryRecorder persistence and a smoke script for trajectory shape and disk layout.
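For orientation, the four event kinds can be sketched as a discriminated union. This is a rough sketch only: the event names and most field names are taken from the architecture diagram posted in the review of this PR, while the `type` discriminant, the `answer` field, and the `summarize` helper are assumptions for illustration. Later commits in this PR also trimmed some fields, so the merged types may differ.

```typescript
// Hypothetical shapes; field names follow the PR's architecture diagram,
// not the merged source. Uint8Array stands in for Node's Buffer.
type EvidenceRole = "agent" | "probe";

interface AgentStepFinishedEvent {
  type: "agent_step_finished_event";
  stepIndex: number;
  actionName: string;
  actionArgs: Record<string, unknown>;
  reasoning?: string;
  toolOutput?: unknown;
}

interface AgentScreenshotTakenEvent {
  type: "agent_screenshot_taken_event";
  stepIndex: number;
  screenshot: Uint8Array;
  url: string;
  evidenceRole: EvidenceRole;
}

interface AgentStepObservedEvent {
  type: "agent_step_observed_event";
  stepIndex: number;
  url: string;
  ariaTree?: string; // best-effort probe, so it may be absent
}

interface AgentFinalAnswerEvent {
  type: "agent_final_answer_event";
  answer: string; // assumed payload; the PR does not spell out this field
}

type AgentEvidenceEvent =
  | AgentStepFinishedEvent
  | AgentScreenshotTakenEvent
  | AgentStepObservedEvent
  | AgentFinalAnswerEvent;

// Illustrative consumer: the switch narrows on the discriminant.
function summarize(event: AgentEvidenceEvent): string {
  switch (event.type) {
    case "agent_step_finished_event":
      return `step ${event.stepIndex}: ${event.actionName}`;
    case "agent_screenshot_taken_event":
      return `step ${event.stepIndex}: ${event.evidenceRole} screenshot at ${event.url}`;
    case "agent_step_observed_event":
      return `step ${event.stepIndex}: observed ${event.url}`;
    case "agent_final_answer_event":
      return `final answer: ${event.answer}`;
  }
}
```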

Tests

pnpm --filter @browserbasehq/stagehand run typecheck
pnpm --filter @browserbasehq/stagehand-evals run typecheck
node --import tsx packages/evals/scripts/verify-trajectory-recorder.ts
git diff --check

changeset-bot · 2026-05-15T20:58:53Z

🦋 Changeset detected

Latest commit: 2c95836

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages

Name	Type
@browserbasehq/stagehand	Patch
@browserbasehq/browse-cli	Patch
@browserbasehq/stagehand-evals	Patch
@browserbasehq/stagehand-server-v3	Patch

cubic-dev-ai

No issues found across 8 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant Agent as Agent Handlers
    participant Bus as Event Bus
    participant Recorder as TrajectoryRecorder
    participant FS as File System
    participant Page as Browser Page

    Note over Agent,FS: NEW: Step-level evidence capture (DOM/Hybrid mode)

    Agent->>Agent: onStepFinish callback fires
    Agent->>Agent: stepCounter++ (per tool call)
    Agent->>Bus: emit agent_step_finished_event
    Note over Bus: stepIndex, actionName, actionArgs, reasoning, toolOutput, finishedAt
    Agent->>Page: page.screenshot() (post-step probe)
    Page-->>Agent: screenshot Buffer
    Agent->>Agent: captureAriaTreeProbe(v3)
    Note over Agent: Best-effort, token-budgeted a11y tree capture
    Agent-->>Agent: ariaTree string | undefined
    loop For each tool call in turn
        Agent->>Bus: emit agent_screenshot_taken_event
        Note over Bus: stepIndex, screenshot, url, evidenceRole: "probe"
        Agent->>Bus: emit agent_step_observed_event
        Note over Bus: stepIndex, url, ariaTree (optional), scroll (optional)
    end
    opt done tool call present
        Agent->>Agent: Build lastFinalAnswer
        Agent->>Bus: emit agent_final_answer_event
    end

    Note over Agent,FS: NEW: Step-level evidence capture (CUA mode)

    Agent->>Agent: screenshotProvider called
    Agent->>Page: page.screenshot()
    Page-->>Agent: screenshot Buffer
    Agent->>Bus: emit agent_screenshot_taken_event
    Note over Bus: stepIndex++, screenshot, url, evidenceRole: "agent"
    Agent->>Agent: executeAction(action)
    Agent->>Agent: emitCuaActionStep()
    Agent->>Bus: emit agent_step_finished_event
    Note over Bus: stepIndex paired with preceding screenshot
    Agent->>Page: page.screenshot() (post-action probe)
    Page-->>Agent: probe screenshot
    Agent->>Bus: emit agent_screenshot_taken_event
    Note over Bus: same stepIndex, screenshot, url, evidenceRole: "probe"
    Agent->>Agent: captureAriaTreeProbe(v3)
    Agent->>Bus: emit agent_step_observed_event
    Note over Bus: stepIndex, url, ariaTree (optional)

    Note over Agent,FS: NEW: Trajectory assembly and persistence

    Recorder->>Bus: subscribe to agent_step_finished_event
    Recorder->>Bus: subscribe to agent_screenshot_taken_event
    Recorder->>Bus: subscribe to agent_step_observed_event
    Recorder->>Bus: subscribe to agent_final_answer_event
    Bus-->>Recorder: events arrive (may be out-of-order)
    Recorder->>Recorder: ensurePartial(stepIndex)
    Recorder->>Recorder: Merge evidence into partial steps

    alt persistEnabled (env-gated by VERIFIER_PERSIST_TRAJECTORIES)
        Recorder->>Recorder: assembleSteps()
        Recorder->>FS: mkdir -p .trajectories/{runId}/{taskId}/
        Recorder->>FS: write trajectory.json
        Recorder->>FS: write core.log
        Recorder->>FS: write task_data.json
        Recorder->>FS: write times.json
        Recorder->>FS: write screenshots/probe/{1..N}.png
        alt verdict provided
            Recorder->>FS: write scores/mmrubric_v1.json
            Recorder->>FS: update task_data.json with verdict
        end
    else persistence disabled
        Recorder->>Recorder: Return in-memory Trajectory only
    end
    Recorder-->>Agent: Trajectory object
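The "Trajectory assembly" part of the diagram (events may arrive out of order, `ensurePartial(stepIndex)`, then merge) could look roughly like the sketch below. The `SketchRecorder` class and its method signatures are hypothetical, not the recorder in `packages/evals`. Later commits in this PR moved to a sequential recorder without `stepIndex`, so this reflects the diagram rather than the merged code.

```typescript
// Hypothetical sketch of merging evidence keyed by stepIndex.
type Role = "agent" | "probe";

interface PartialStep {
  actionName?: string;
  toolOutput?: unknown;
  agentScreenshot?: Uint8Array;
  probeScreenshot?: Uint8Array;
  url?: string;
  ariaTree?: string;
}

class SketchRecorder {
  private partials = new Map<number, PartialStep>();

  // Create the slot on first touch so any event kind can arrive first.
  private ensurePartial(stepIndex: number): PartialStep {
    let partial = this.partials.get(stepIndex);
    if (!partial) {
      partial = {};
      this.partials.set(stepIndex, partial);
    }
    return partial;
  }

  onStepFinished(stepIndex: number, actionName: string, toolOutput: unknown): void {
    const partial = this.ensurePartial(stepIndex);
    partial.actionName = actionName;
    partial.toolOutput = toolOutput;
  }

  onScreenshot(stepIndex: number, screenshot: Uint8Array, role: Role): void {
    const partial = this.ensurePartial(stepIndex);
    // The role decides which slot the frame lands in.
    if (role === "agent") partial.agentScreenshot = screenshot;
    else partial.probeScreenshot = screenshot;
  }

  onStepObserved(stepIndex: number, url: string, ariaTree?: string): void {
    const partial = this.ensurePartial(stepIndex);
    partial.url = url;
    if (ariaTree !== undefined) partial.ariaTree = ariaTree;
  }

  // Order by stepIndex regardless of arrival order.
  assembleSteps(): PartialStep[] {
    return Array.from(this.partials.entries())
      .sort(([a], [b]) => a - b)
      .map(([, step]) => step);
  }
}
```

Keying on `stepIndex` is what makes arrival order irrelevant: a probe screenshot that lands before its `step_finished` event simply creates the slot early.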

cubic-dev-ai

1 issue found across 4 files (changes from recent commits).

cubic-dev-ai

1 issue found across 10 files (changes from recent commits).

monadoid · 2026-05-22T12:22:38Z

@@ -0,0 +1,5 @@
+---
+"@browserbasehq/stagehand": patch


Do we need these three different patch files or just 1?

I can consolidate at the end if we want to have all be 1 patch

cubic-dev-ai

2 issues found across 15 files (changes from recent commits).

User-supplied onEvidence callbacks must never abort the agent loop. Wrap the callback once where each handler receives it; internal emit sites keep calling it as a plain await. Also unify CUA step_finished.toolOutput construction behind a shared inferCuaToolOutput helper alongside the existing inferToolOutput.
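A minimal sketch of the "wrap once at the boundary" idea, assuming a callback that takes an event with a `type` field. The helper name `makeNonFatal`, the `warn` parameter, and the appended error detail are hypothetical; only the `onEvidence callback failed for <event.type>` message prefix comes from this PR's commit notes.

```typescript
// Hypothetical boundary helper: wrap the user callback once so that
// internal emit sites can keep using a plain `await onEvidence(event)`.
type EvidenceEvent = { type: string };
type OnEvidence = (event: EvidenceEvent) => void | Promise<void>;
type SafeOnEvidence = (event: EvidenceEvent) => Promise<void>;

function makeNonFatal(
  onEvidence: OnEvidence | undefined,
  warn: (message: string) => void,
): SafeOnEvidence | undefined {
  if (!onEvidence) return undefined;
  return async (event) => {
    try {
      await onEvidence(event);
    } catch (err) {
      // Log and continue; a failing callback must not abort the agent loop.
      const detail = err instanceof Error ? err.message : String(err);
      warn(`onEvidence callback failed for ${event.type}: ${detail}`);
    }
  };
}
```

Because the wrapper handles both synchronous throws and rejected promises, call sites need no per-site try/catch.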

cubic-dev-ai · 2026-05-22T23:47:51Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

miguelg719 · 2026-05-23T01:33:10Z

@cubic-dev-ai review

cubic-dev-ai · 2026-05-23T01:33:15Z

@cubic-dev-ai review

@miguelg719 I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

0 issues found across 4 files (changes from recent commits).

Follow-up cleanup on the sequential-recorder refactor:
- Drop step.index from TrajectoryStep; array position is the canonical index. Trajectory writer and v3Evaluator use entries()/map index.
- Drop unused scroll field from AgentStepObservedEvent, AgentFinalObservation, and ProbeEvidence; no producer ever set it.
- Require evidenceRole on AgentScreenshotEvidenceEvent; the role routes the event into different recorder slots, so a missing role can't silently misroute.
- Flatten the identity mergeAgentEvidence in onStepFinished.
- Drop unused url field from the recorder's pending screenshot slots.
- Remove the no-op TrajectoryRecorder.start() method and test call sites.
- Remove the dead early-return guard in onStepObserved.

A single post-turn probe is fanned across every step of a multi-tool turn, so those steps share the same screenshot Buffer by reference. writeTrajectoryDir was writing an identical PNG per step (probe/1.png, probe/2.png, ...). Dedupe by Buffer identity: write the PNG once and point every sharing step's screenshotPath at the same file. Behavior-preserving for single-probe steps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
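The identity-based dedupe described above can be sketched with a `Map` keyed by the buffer object itself. The function name `assignScreenshotPaths`, the injected `writeFile` parameter, and the choice to name the file after the first step that uses it are assumptions for illustration, not `writeTrajectoryDir` itself. `Uint8Array` stands in for Node's `Buffer`, which is a subclass of it.

```typescript
// Hypothetical sketch: write each distinct screenshot object once and point
// every step that shares it at the same path.
interface StepWithShot {
  screenshot?: Uint8Array;
  screenshotPath?: string;
}

function assignScreenshotPaths(
  steps: StepWithShot[],
  writeFile: (path: string, data: Uint8Array) => void,
): void {
  // Map compares object keys by reference, so two buffers with equal bytes
  // but different identity are still written separately.
  const written = new Map<Uint8Array, string>();
  steps.forEach((step, i) => {
    if (!step.screenshot) return;
    let path = written.get(step.screenshot);
    if (path === undefined) {
      path = `screenshots/probe/${i + 1}.png`;
      writeFile(path, step.screenshot);
      written.set(step.screenshot, path);
    }
    step.screenshotPath = path;
  });
}
```

Comparing by identity rather than by content keeps the check cheap and leaves single-probe steps untouched, which matches the "behavior-preserving" note in the commit message.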

…oss batched steps

Two trajectory-fidelity gaps in CUA runs:

1. Failed actions were dropped. emitCuaActionStep only ran after a successful executeAction; a throwing action jumped to catch and rethrew, so no step_finished was recorded. Now the catch emits a step_finished {ok:false, error} (with a best-effort post-failure probe) before rethrowing, in a nested try/catch so evidence emission never masks the original error. emitCuaActionStep now takes an explicit toolOutput instead of deriving it from `result ?? {success:true}`.
2. Batched actions lost the agent screenshot. A CUA provider can choose several actions from one screenshot, but the recorder cleared the pending agent screenshot after the first step_finished, so later steps got no tier-1 frame. Renamed to latestAgentScreenshot; it now applies to every step until a newer agent screenshot replaces it (wiped on cancel()). writeTrajectoryDir dedupes the now-shared agent Buffer by identity so it isn't written once per step. Public onEvidence contract doc updated to describe the replay semantics.

Tests: failed-action emits step_finished{ok:false} and rethrows; batched two-action turn shares the agent screenshot across both steps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
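The failed-action path from item 1 can be sketched as follows. The function `runActionWithEvidence`, the `EmitStep` signature, and the `{ok:true, result}` success shape are hypothetical; only the `{ok:false, error}` failure shape and the nested try/catch come from the commit message. The post-failure probe is omitted to keep the sketch short.

```typescript
// Hypothetical sketch: record a failed action as a step, then rethrow the
// original error even if recording itself fails.
type ToolOutput =
  | { ok: true; result?: unknown }
  | { ok: false; error: string };

type EmitStep = (actionName: string, toolOutput: ToolOutput) => Promise<void>;

async function runActionWithEvidence(
  actionName: string,
  execute: () => Promise<unknown>,
  emitStep: EmitStep,
): Promise<unknown> {
  let result: unknown;
  try {
    result = await execute();
  } catch (err) {
    try {
      const error = err instanceof Error ? err.message : String(err);
      await emitStep(actionName, { ok: false, error });
    } catch {
      // Evidence emission must never mask the original action error.
    }
    throw err;
  }
  // Success path: the caller passes an explicit tool output to the emitter.
  await emitStep(actionName, { ok: true, result });
  return result;
}
```

The inner try/catch is the important part: without it, a throwing emitter would replace the action's own error with an unrelated one.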

screenshotProvider/captureAndSendScreenshot call emitCuaScreenshot unconditionally; early-return when no recorder is attached so a plain CUA run does no extra work (the lastAgentScreenshotUrl bookkeeping is only read by evidence-gated code). Mirrors the emitCuaActionStep call-site gating.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 15, 2026

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 72774c7 to da0c152 Compare May 15, 2026 21:23

miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from d7d2c59 to 2765781 Compare May 15, 2026 21:23

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from da0c152 to d77e596 Compare May 15, 2026 21:45

miguelg719 commented May 15, 2026

Comment thread packages/core/lib/v3/agent/AnthropicCUAClient.ts Outdated

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 8e4fbe2 to 56a3465 Compare May 15, 2026 22:33

miguelg719 commented May 15, 2026

Comment thread packages/core/lib/v3/agent/AnthropicCUAClient.ts Outdated

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch 3 times, most recently from fd043bc to 635b3d2 Compare May 16, 2026 05:50

miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 60e4321 to 231a90d Compare May 18, 2026 23:54

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 635b3d2 to 0f37a65 Compare May 18, 2026 23:54

cubic-dev-ai Bot reviewed May 19, 2026

Comment thread packages/core/lib/v3/verifier/trajectory.ts

miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 231a90d to 50cdf0a Compare May 19, 2026 00:36

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from ddb59bb to 3a7ef3f Compare May 19, 2026 00:36

cubic-dev-ai Bot reviewed May 21, 2026

Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts

miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 50cdf0a to 84042d4 Compare May 22, 2026 00:54

monadoid reviewed May 22, 2026

Comment thread packages/core/lib/v3/handlers/v3AgentHandler.ts Outdated

monadoid reviewed May 22, 2026

Comment thread packages/evals/framework/trajectoryRecorder.ts Outdated

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from ba69dd9 to 9d222ca Compare May 22, 2026 15:07

miguelg719 changed the base branch from miguelgonzalez/verifier-02-backend-routing to main May 22, 2026 15:11

miguelg719 changed the base branch from main to miguelgonzalez/verifier-02-backend-routing May 22, 2026 15:19

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 9d222ca to a2c2112 Compare May 22, 2026 15:34

miguelg719 changed the base branch from miguelgonzalez/verifier-02-backend-routing to main May 22, 2026 19:39

miguelg719 changed the base branch from main to miguelgonzalez/verifier-02-backend-routing May 22, 2026 20:06

miguelg719 added 3 commits May 22, 2026 13:09

feat(verifier): record agent trajectories

6f27af2

fix(verifier): align trajectory naming

40e7ab3

chore(evals): remove upstream trajectory references

c25367b

miguelg719 changed the base branch from miguelgonzalez/verifier-02-backend-routing to main May 22, 2026 20:11

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from a2c2112 to 2ba6c1f Compare May 22, 2026 20:11

miguelg719 added 2 commits May 22, 2026 13:37

fix(verifier): redact inline screenshot payloads

2780db2

refactor(verifier): centralize trajectory evidence handling

25fadb1

cubic-dev-ai Bot reviewed May 22, 2026

Comment thread packages/core/lib/v3/agent/utils/toolOutputEvidence.ts Outdated

Comment thread packages/core/lib/v3/verifier/evidenceNormalization.ts

seanmcguire12 reviewed May 22, 2026

Comment thread packages/core/lib/v3/agent/utils/toolOutputEvidence.ts

seanmcguire12 reviewed May 22, 2026

Comment thread packages/evals/framework/trajectoryRecorder.ts Outdated

seanmcguire12 reviewed May 22, 2026

Comment thread packages/core/lib/v3/handlers/v3AgentHandler.ts

Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts Outdated

seanmcguire12 reviewed May 22, 2026

Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts Outdated

cubic-dev-ai Bot reviewed May 23, 2026

miguelg719 added 5 commits May 22, 2026 18:49

test(agent): update warning-message assertion to generic onEvidence l…

4d203ca

…abel The non-fatal wrapper now logs `onEvidence callback failed for <event.type>` from a single boundary helper rather than the per-site `CUA screenshot evidence callback failed`. Update the assertion to match.

fix(verifier): preserve final evidence observations

2418db3

Remove verifier trajectory timestamps

1252462

refactor(verifier): simplify evidence event sequencing

b4a1537

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from a7bef3e to d6fb72b Compare May 24, 2026 17:24

miguelg719 and others added 5 commits May 24, 2026 11:36

refactor(verifier): drop unused inferCuaToolOutput

754a54b

The CUA handler calls inferToolOutput directly now that the general helper handles the {success: boolean, error?: ...} shape via normalizeError.

only emit step when evidenceCallback is provided

a748399

seanmcguire12 approved these changes May 26, 2026

miguelg719 merged commit e102a89 into main May 26, 2026
424 of 425 checks passed

This was referenced May 26, 2026

Version Packages #2110

Open

Version Packages edisplay/stagehand#5

Open

Version Packages CloudEngineHub/stagehand#1

Open
