
feat(verifier): record agent trajectories#2131

Merged
miguelg719 merged 27 commits into main from miguelgonzalez/verifier-03-trajectory-recorder on May 26, 2026

miguelg719 (Collaborator) commented May 15, 2026

Why

The new verifier needs richer evidence than a final screenshot, especially for DOM and Hybrid agent modes where the important facts often live in tool returns, ARIA snapshots, and per-step observations. This PR adds trajectory recording without changing the verifier judgment engine.

What Changed

  • Added typed agent events for screenshot, step-finished, step-observed, and final-answer events.
  • Added listener-gated post-step probes for screenshots and ARIA trees.
  • Attached the settled post-turn probe to every tool call in a DOM/Hybrid turn.
  • Added CUA step evidence pairing and final answer capture.
  • Added TrajectoryRecorder persistence and a smoke script for trajectory shape and disk layout.
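
As a rough illustration of the typed events listed above, the four event kinds could be modeled as a discriminated union. The event names and most field names are taken from the architecture diagram in this PR; the exact field types, the `answer` field, and the `describeEvent` helper are assumptions made for this sketch, not the actual definitions in the codebase.

```typescript
// Hypothetical event shapes. Names follow the diagram notes; types are guesses.
type EvidenceRole = "agent" | "probe";

interface AgentStepFinishedEvent {
  type: "agent_step_finished_event";
  stepIndex: number;
  actionName: string;
  actionArgs: Record<string, unknown>;
  reasoning?: string;
  toolOutput: unknown;
  finishedAt: string;
}

interface AgentScreenshotTakenEvent {
  type: "agent_screenshot_taken_event";
  stepIndex: number;
  screenshot: Uint8Array; // Node's Buffer is a Uint8Array subclass
  url: string;
  evidenceRole: EvidenceRole;
}

interface AgentStepObservedEvent {
  type: "agent_step_observed_event";
  stepIndex: number;
  url: string;
  ariaTree?: string;
}

interface AgentFinalAnswerEvent {
  type: "agent_final_answer_event";
  answer: string; // assumed field name
}

type AgentEvidenceEvent =
  | AgentStepFinishedEvent
  | AgentScreenshotTakenEvent
  | AgentStepObservedEvent
  | AgentFinalAnswerEvent;

// Narrowing on `type` gives each branch access to its own fields only.
function describeEvent(event: AgentEvidenceEvent): string {
  switch (event.type) {
    case "agent_step_finished_event":
      return `step ${event.stepIndex}: ${event.actionName}`;
    case "agent_screenshot_taken_event":
      return `step ${event.stepIndex}: ${event.evidenceRole} screenshot at ${event.url}`;
    case "agent_step_observed_event":
      return `step ${event.stepIndex}: observed ${event.url}`;
    case "agent_final_answer_event":
      return `final answer: ${event.answer}`;
  }
}
```

A listener written against such a union can switch on `type` and let the compiler flag any event kind it forgot to handle.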

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • node --import tsx packages/evals/scripts/verify-trajectory-recorder.ts
  • git diff --check

changeset-bot (Bot) commented May 15, 2026

🦋 Changeset detected

Latest commit: 2c95836

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages:

| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Patch |
| @browserbasehq/browse-cli | Patch |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server-v3 | Patch |


cubic-dev-ai (Bot, Contributor) left a comment

No issues found across 8 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Agent as Agent Handlers
    participant Bus as Event Bus
    participant Recorder as TrajectoryRecorder
    participant FS as File System
    participant Page as Browser Page

    Note over Agent,FS: NEW: Step-level evidence capture (DOM/Hybrid mode)

    Agent->>Agent: onStepFinish callback fires
    Agent->>Agent: stepCounter++ (per tool call)
    Agent->>Bus: emit agent_step_finished_event
    Note over Bus: stepIndex, actionName, actionArgs, reasoning, toolOutput, finishedAt
    Agent->>Page: page.screenshot() (post-step probe)
    Page-->>Agent: screenshot Buffer
    Agent->>Agent: captureAriaTreeProbe(v3)
    Note over Agent: Best-effort, token-budgeted a11y tree capture
    Agent-->>Agent: ariaTree string | undefined
    loop For each tool call in turn
        Agent->>Bus: emit agent_screenshot_taken_event
        Note over Bus: stepIndex, screenshot, url, evidenceRole: "probe"
        Agent->>Bus: emit agent_step_observed_event
        Note over Bus: stepIndex, url, ariaTree (optional), scroll (optional)
    end
    opt done tool call present
        Agent->>Agent: Build lastFinalAnswer
        Agent->>Bus: emit agent_final_answer_event
    end

    Note over Agent,FS: NEW: Step-level evidence capture (CUA mode)

    Agent->>Agent: screenshotProvider called
    Agent->>Page: page.screenshot()
    Page-->>Agent: screenshot Buffer
    Agent->>Bus: emit agent_screenshot_taken_event
    Note over Bus: stepIndex++, screenshot, url, evidenceRole: "agent"
    Agent->>Agent: executeAction(action)
    Agent->>Agent: emitCuaActionStep()
    Agent->>Bus: emit agent_step_finished_event
    Note over Bus: stepIndex paired with preceding screenshot
    Agent->>Page: page.screenshot() (post-action probe)
    Page-->>Agent: probe screenshot
    Agent->>Bus: emit agent_screenshot_taken_event
    Note over Bus: same stepIndex, screenshot, url, evidenceRole: "probe"
    Agent->>Agent: captureAriaTreeProbe(v3)
    Agent->>Bus: emit agent_step_observed_event
    Note over Bus: stepIndex, url, ariaTree (optional)

    Note over Agent,FS: NEW: Trajectory assembly and persistence

    Recorder->>Bus: subscribe to agent_step_finished_event
    Recorder->>Bus: subscribe to agent_screenshot_taken_event
    Recorder->>Bus: subscribe to agent_step_observed_event
    Recorder->>Bus: subscribe to agent_final_answer_event
    Bus-->>Recorder: events arrive (may be out-of-order)
    Recorder->>Recorder: ensurePartial(stepIndex)
    Recorder->>Recorder: Merge evidence into partial steps

    alt persistEnabled (env-gated by VERIFIER_PERSIST_TRAJECTORIES)
        Recorder->>Recorder: assembleSteps()
        Recorder->>FS: mkdir -p .trajectories/{runId}/{taskId}/
        Recorder->>FS: write trajectory.json
        Recorder->>FS: write core.log
        Recorder->>FS: write task_data.json
        Recorder->>FS: write times.json
        Recorder->>FS: write screenshots/probe/{1..N}.png
        alt verdict provided
            Recorder->>FS: write scores/mmrubric_v1.json
            Recorder->>FS: update task_data.json with verdict
        end
    else persistence disabled
        Recorder->>Recorder: Return in-memory Trajectory only
    end
    Recorder-->>Agent: Trajectory object
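
The "Trajectory assembly" part of the diagram, where events may arrive out of order and are merged per step, can be sketched as follows. `ensurePartial` and `assembleSteps` are named in the diagram; the `SketchRecorder` class, the `PartialStep` shape, and the handler signatures are hypothetical, and the real recorder likely tracks more fields.

```typescript
// Hypothetical partial-step shape; the real recorder stores more evidence.
interface PartialStep {
  actionName?: string;
  probeScreenshot?: Uint8Array;
  ariaTree?: string;
}

class SketchRecorder {
  private partials = new Map<number, PartialStep>();

  // Create the slot for a step on first sight, whichever event arrives first.
  private ensurePartial(stepIndex: number): PartialStep {
    let partial = this.partials.get(stepIndex);
    if (!partial) {
      partial = {};
      this.partials.set(stepIndex, partial);
    }
    return partial;
  }

  onStepFinished(stepIndex: number, actionName: string): void {
    this.ensurePartial(stepIndex).actionName = actionName;
  }

  onProbeScreenshot(stepIndex: number, screenshot: Uint8Array): void {
    this.ensurePartial(stepIndex).probeScreenshot = screenshot;
  }

  onStepObserved(stepIndex: number, ariaTree?: string): void {
    this.ensurePartial(stepIndex).ariaTree = ariaTree;
  }

  // Order by stepIndex so arrival order does not leak into the trajectory.
  assembleSteps(): PartialStep[] {
    return [...this.partials.entries()]
      .sort(([a], [b]) => a - b)
      .map(([, step]) => step);
  }
}
```

Keying the partials by `stepIndex` is what makes the merge tolerant of an observation arriving before its `step_finished` event.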

@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 72774c7 to da0c152 Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from d7d2c59 to 2765781 Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from da0c152 to d77e596 Compare May 15, 2026 21:45
Comment thread packages/core/lib/v3/agent/AnthropicCUAClient.ts Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 8e4fbe2 to 56a3465 Compare May 15, 2026 22:33
Comment thread packages/core/lib/v3/agent/AnthropicCUAClient.ts Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch 3 times, most recently from fd043bc to 635b3d2 Compare May 16, 2026 05:50
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 60e4321 to 231a90d Compare May 18, 2026 23:54
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 635b3d2 to 0f37a65 Compare May 18, 2026 23:54
cubic-dev-ai (Bot, Contributor) left a comment

1 issue found across 4 files (changes from recent commits).

Comment thread packages/core/lib/v3/verifier/trajectory.ts
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 231a90d to 50cdf0a Compare May 19, 2026 00:36
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from ddb59bb to 3a7ef3f Compare May 19, 2026 00:36
cubic-dev-ai (Bot, Contributor) left a comment

1 issue found across 10 files (changes from recent commits).

Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-02-backend-routing branch from 50cdf0a to 84042d4 Compare May 22, 2026 00:54
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
Contributor

Do we need these three different patch files or just 1?

Collaborator Author

I can consolidate them at the end if we want everything in one patch.

Comment thread packages/core/lib/v3/handlers/v3AgentHandler.ts Outdated
Comment thread packages/evals/framework/trajectoryRecorder.ts Outdated
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from ba69dd9 to 9d222ca Compare May 22, 2026 15:07
@miguelg719 miguelg719 changed the base branch from miguelgonzalez/verifier-02-backend-routing to main May 22, 2026 15:11
@miguelg719 miguelg719 changed the base branch from main to miguelgonzalez/verifier-02-backend-routing May 22, 2026 15:19
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 9d222ca to a2c2112 Compare May 22, 2026 15:34
@miguelg719 miguelg719 changed the base branch from miguelgonzalez/verifier-02-backend-routing to main May 22, 2026 19:39
@miguelg719 miguelg719 changed the base branch from main to miguelgonzalez/verifier-02-backend-routing May 22, 2026 20:06
@miguelg719 miguelg719 changed the base branch from miguelgonzalez/verifier-02-backend-routing to main May 22, 2026 20:11
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from a2c2112 to 2ba6c1f Compare May 22, 2026 20:11
cubic-dev-ai (Bot, Contributor) left a comment

2 issues found across 15 files (changes from recent commits).

Comment thread packages/core/lib/v3/agent/utils/toolOutputEvidence.ts Outdated
Comment thread packages/core/lib/v3/verifier/evidenceNormalization.ts
Comment thread packages/core/lib/v3/agent/utils/toolOutputEvidence.ts
Comment thread packages/evals/framework/trajectoryRecorder.ts Outdated
Comment thread packages/core/lib/v3/handlers/v3AgentHandler.ts
Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts Outdated
Comment thread packages/core/lib/v3/handlers/v3CuaAgentHandler.ts Outdated
User-supplied onEvidence callbacks must never abort the agent loop. Wrap
the callback once where each handler receives it; internal emit sites
keep calling it as a plain await. Also unify CUA step_finished.toolOutput
construction behind a shared inferCuaToolOutput helper alongside the
existing inferToolOutput.
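
A minimal sketch of the "wrap once at the boundary" idea described in this commit message, assuming a callback that may be sync or async. The `makeNonFatal` name, the `OnEvidence` type, and the injected `log` function are hypothetical; only the log message prefix is taken from this PR.

```typescript
// Hypothetical types; the real onEvidence signature may differ.
type EvidenceEvent = { type: string };
type OnEvidence = (event: EvidenceEvent) => void | Promise<void>;

// Wrap a user-supplied callback so a throw or rejection is logged, not propagated.
function makeNonFatal(
  onEvidence: OnEvidence,
  log: (message: string) => void,
): OnEvidence {
  return async (event) => {
    try {
      await onEvidence(event);
    } catch (error) {
      const detail = error instanceof Error ? error.message : String(error);
      log(`onEvidence callback failed for ${event.type}: ${detail}`);
    }
  };
}
```

Because the wrapper is applied once where the handler receives the callback, the internal emit sites can keep using a plain `await onEvidence(event)` without their own try/catch.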
cubic-dev-ai (Bot, Contributor) commented May 22, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

miguelg719 (Collaborator, Author)

@cubic-dev-ai review

cubic-dev-ai (Bot, Contributor) commented May 23, 2026

@cubic-dev-ai review

@miguelg719 I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai (Bot, Contributor) left a comment

0 issues found across 4 files (changes from recent commits).

…abel

The non-fatal wrapper now logs `onEvidence callback failed for <event.type>`
from a single boundary helper rather than the per-site
`CUA screenshot evidence callback failed`. Update the assertion to match.

Follow-up cleanup on the sequential-recorder refactor:

- Drop step.index from TrajectoryStep; array position is the canonical
  index. Trajectory writer and v3Evaluator use entries()/map index.
- Drop unused scroll field from AgentStepObservedEvent, AgentFinalObservation,
  and ProbeEvidence — no producer ever set it.
- Require evidenceRole on AgentScreenshotEvidenceEvent; the role routes the
  event into different recorder slots, so a missing role can't silently
  misroute.
- Flatten the identity mergeAgentEvidence in onStepFinished.
- Drop unused url field from the recorder's pending screenshot slots.
- Remove the no-op TrajectoryRecorder.start() method and test call sites.
- Remove the dead early-return guard in onStepObserved.
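
To illustrate the first cleanup item, here is a small sketch of deriving a step's index from its array position instead of a stored `index` field. The `TrajectoryStep` shape shown and the `stepLabels` helper are hypothetical, and the 1-based labels are an assumption that mirrors the `probe/{1..N}.png` naming in the diagram.

```typescript
// Hypothetical, trimmed-down step shape with no stored index.
interface TrajectoryStep {
  actionName: string;
}

// The map callback's second argument is the canonical (0-based) position.
function stepLabels(steps: TrajectoryStep[]): string[] {
  return steps.map((step, index) => `${index + 1}: ${step.actionName}`);
}
```

With no stored index, a step cannot disagree with its own position after steps are reordered or filtered.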
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from a7bef3e to d6fb72b Compare May 24, 2026 17:24
miguelg719 and others added 5 commits May 24, 2026 11:36
The CUA handler calls inferToolOutput directly now that the general helper
handles the {success: boolean, error?: ...} shape via normalizeError.

A single post-turn probe is fanned across every step of a multi-tool
turn, so those steps share the same screenshot Buffer by reference.
writeTrajectoryDir was writing an identical PNG per step
(probe/1.png, probe/2.png, ...). Dedupe by Buffer identity: write the
PNG once and point every sharing step's screenshotPath at the same
file. Behavior-preserving for single-probe steps.
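
The identity-based dedupe described above can be sketched with a `Map` keyed by the buffer object itself. This is not the actual `writeTrajectoryDir` code: `planProbeWrites`, `StepWithProbe`, and `PlannedWrite` are hypothetical names, the sketch only plans the writes rather than touching disk, and it uses `Uint8Array` (which Node's `Buffer` extends) to stay runtime-neutral.

```typescript
interface StepWithProbe {
  probeScreenshot?: Uint8Array;
}

interface PlannedWrite {
  path: string;
  data: Uint8Array;
}

// Map compares object keys by reference, so two steps holding the same buffer
// object resolve to one file, while equal-but-distinct buffers stay separate.
function planProbeWrites(steps: StepWithProbe[]): {
  writes: PlannedWrite[];
  screenshotPaths: (string | undefined)[];
} {
  const pathByBuffer = new Map<Uint8Array, string>();
  const writes: PlannedWrite[] = [];
  const screenshotPaths = steps.map((step, index) => {
    const data = step.probeScreenshot;
    if (!data) return undefined;
    let path = pathByBuffer.get(data);
    if (path === undefined) {
      // First step that owns this buffer decides the file name.
      path = `screenshots/probe/${index + 1}.png`;
      pathByBuffer.set(data, path);
      writes.push({ path, data });
    }
    return path;
  });
  return { writes, screenshotPaths };
}
```

Comparing by identity rather than by content keeps the check O(1) per step and never merges screenshots that merely happen to be byte-identical.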

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oss batched steps

Two trajectory-fidelity gaps in CUA runs:

1. Failed actions were dropped. emitCuaActionStep only ran after a successful
   executeAction; a throwing action jumped to catch and rethrew, so no
   step_finished was recorded. Now the catch emits a step_finished
   {ok:false, error} (with a best-effort post-failure probe) before rethrowing,
   in a nested try/catch so evidence emission never masks the original error.
   emitCuaActionStep now takes an explicit toolOutput instead of deriving it
   from `result ?? {success:true}`.

2. Batched actions lost the agent screenshot. A CUA provider can choose several
   actions from one screenshot, but the recorder cleared the pending agent
   screenshot after the first step_finished, so later steps got no tier-1 frame.
   Renamed to latestAgentScreenshot; it now applies to every step until a newer
   agent screenshot replaces it (wiped on cancel()). writeTrajectoryDir dedupes
   the now-shared agent Buffer by identity so it isn't written once per step.
   Public onEvidence contract doc updated to describe the replay semantics.
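
The replay semantics from item 2 can be sketched as a single slot that is read, not consumed, by each finished step. `latestAgentScreenshot` and `cancel()` are named in this commit message; the `AgentScreenshotSlot` class and its method signatures are hypothetical.

```typescript
interface RecordedStep {
  actionName: string;
  agentScreenshot: Uint8Array | undefined;
}

class AgentScreenshotSlot {
  private latestAgentScreenshot: Uint8Array | undefined;

  // A newer agent screenshot replaces the previous one.
  onAgentScreenshot(screenshot: Uint8Array): void {
    this.latestAgentScreenshot = screenshot;
  }

  // Reading does not clear the slot, so batched steps share the same frame.
  onStepFinished(actionName: string): RecordedStep {
    return { actionName, agentScreenshot: this.latestAgentScreenshot };
  }

  // Cancelling a run wipes the slot so a stale frame is not replayed.
  cancel(): void {
    this.latestAgentScreenshot = undefined;
  }
}
```

Since batched steps hold the same object, the identity-based dedupe in the writer applies to the agent screenshot as well.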

Tests: failed-action emits step_finished{ok:false} and rethrows; batched
two-action turn shares the agent screenshot across both steps.
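
The failed-action path from item 1 can be sketched as below: the catch emits a `step_finished` with `{ok: false, error}` inside a nested try/catch, then rethrows the original error. `runCuaAction`, `emitStepFinished`, and the `StepFinished` shape are hypothetical stand-ins for the handler's `executeAction` and `emitCuaActionStep` flow, and the post-failure probe is omitted.

```typescript
interface StepFinished {
  actionName: string;
  toolOutput: { ok: boolean; error?: string };
}

async function runCuaAction(
  actionName: string,
  executeAction: () => Promise<void>,
  emitStepFinished: (event: StepFinished) => Promise<void>,
): Promise<void> {
  try {
    await executeAction();
  } catch (error) {
    try {
      // Record the failure so the trajectory does not silently drop the step.
      const message = error instanceof Error ? error.message : String(error);
      await emitStepFinished({ actionName, toolOutput: { ok: false, error: message } });
    } catch {
      // Evidence emission is best-effort and must not mask the original error.
    }
    throw error;
  }
  // Success path: toolOutput is passed explicitly rather than inferred.
  await emitStepFinished({ actionName, toolOutput: { ok: true } });
}
```

The nested try/catch is the key design point: even if recording the failure itself fails, the caller still sees the action's original error.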

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

screenshotProvider/captureAndSendScreenshot call emitCuaScreenshot
unconditionally; early-return when no recorder is attached so a plain CUA
run does no extra work (the lastAgentScreenshotUrl bookkeeping is only read
by evidence-gated code). Mirrors the emitCuaActionStep call-site gating.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@miguelg719 miguelg719 merged commit e102a89 into main May 26, 2026
424 of 425 checks passed
3 participants