feat(verifier): record agent trajectories #2131
🦋 Changeset detected
Latest commit: 2c95836
The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
No issues found across 8 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Agent as Agent Handlers
participant Bus as Event Bus
participant Recorder as TrajectoryRecorder
participant FS as File System
participant Page as Browser Page
Note over Agent,FS: NEW: Step-level evidence capture (DOM/Hybrid mode)
Agent->>Agent: onStepFinish callback fires
Agent->>Agent: stepCounter++ (per tool call)
Agent->>Bus: emit agent_step_finished_event
Note over Bus: stepIndex, actionName, actionArgs, reasoning, toolOutput, finishedAt
Agent->>Page: page.screenshot() (post-step probe)
Page-->>Agent: screenshot Buffer
Agent->>Agent: captureAriaTreeProbe(v3)
Note over Agent: Best-effort, token-budgeted a11y tree capture
Agent-->>Agent: ariaTree string | undefined
loop For each tool call in turn
Agent->>Bus: emit agent_screenshot_taken_event
Note over Bus: stepIndex, screenshot, url, evidenceRole: "probe"
Agent->>Bus: emit agent_step_observed_event
Note over Bus: stepIndex, url, ariaTree (optional), scroll (optional)
end
opt done tool call present
Agent->>Agent: Build lastFinalAnswer
Agent->>Bus: emit agent_final_answer_event
end
Note over Agent,FS: NEW: Step-level evidence capture (CUA mode)
Agent->>Agent: screenshotProvider called
Agent->>Page: page.screenshot()
Page-->>Agent: screenshot Buffer
Agent->>Bus: emit agent_screenshot_taken_event
Note over Bus: stepIndex++, screenshot, url, evidenceRole: "agent"
Agent->>Agent: executeAction(action)
Agent->>Agent: emitCuaActionStep()
Agent->>Bus: emit agent_step_finished_event
Note over Bus: stepIndex paired with preceding screenshot
Agent->>Page: page.screenshot() (post-action probe)
Page-->>Agent: probe screenshot
Agent->>Bus: emit agent_screenshot_taken_event
Note over Bus: same stepIndex, screenshot, url, evidenceRole: "probe"
Agent->>Agent: captureAriaTreeProbe(v3)
Agent->>Bus: emit agent_step_observed_event
Note over Bus: stepIndex, url, ariaTree (optional)
Note over Agent,FS: NEW: Trajectory assembly and persistence
Recorder->>Bus: subscribe to agent_step_finished_event
Recorder->>Bus: subscribe to agent_screenshot_taken_event
Recorder->>Bus: subscribe to agent_step_observed_event
Recorder->>Bus: subscribe to agent_final_answer_event
Bus-->>Recorder: events arrive (may be out-of-order)
Recorder->>Recorder: ensurePartial(stepIndex)
Recorder->>Recorder: Merge evidence into partial steps
alt persistEnabled (env-gated by VERIFIER_PERSIST_TRAJECTORIES)
Recorder->>Recorder: assembleSteps()
Recorder->>FS: mkdir -p .trajectories/{runId}/{taskId}/
Recorder->>FS: write trajectory.json
Recorder->>FS: write core.log
Recorder->>FS: write task_data.json
Recorder->>FS: write times.json
Recorder->>FS: write screenshots/probe/{1..N}.png
alt verdict provided
Recorder->>FS: write scores/mmrubric_v1.json
Recorder->>FS: update task_data.json with verdict
end
else persistence disabled
Recorder->>Recorder: Return in-memory Trajectory only
end
Recorder-->>Agent: Trajectory object
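The assembly phase in the diagram (events arriving out of order, `ensurePartial(stepIndex)`, then `assembleSteps()`) could be sketched roughly as below. This is a minimal illustration, not the PR's implementation: the event field names follow the diagram notes, while `SketchRecorder`, `PartialStep`, and `handle` are hypothetical names, and screenshots are typed as `Uint8Array` (which Node's `Buffer` extends) to keep the sketch self-contained.

```typescript
// Hypothetical event shapes, loosely modeled on the fields named in the diagram.
type RecorderEvent =
  | { type: "agent_step_finished_event"; stepIndex: number; actionName: string; toolOutput?: unknown }
  | { type: "agent_screenshot_taken_event"; stepIndex: number; screenshot: Uint8Array; evidenceRole: "agent" | "probe" }
  | { type: "agent_step_observed_event"; stepIndex: number; url: string; ariaTree?: string };

// A step that may still be missing some of its evidence.
interface PartialStep {
  actionName?: string;
  toolOutput?: unknown;
  agentScreenshot?: Uint8Array;
  probeScreenshot?: Uint8Array;
  url?: string;
  ariaTree?: string;
}

class SketchRecorder {
  private partials = new Map<number, PartialStep>();

  // Create the slot on first sight so any event type may arrive first.
  private ensurePartial(stepIndex: number): PartialStep {
    let partial = this.partials.get(stepIndex);
    if (!partial) {
      partial = {};
      this.partials.set(stepIndex, partial);
    }
    return partial;
  }

  // Merge one event into the partial step it belongs to.
  handle(event: RecorderEvent): void {
    const partial = this.ensurePartial(event.stepIndex);
    switch (event.type) {
      case "agent_step_finished_event":
        partial.actionName = event.actionName;
        partial.toolOutput = event.toolOutput;
        break;
      case "agent_screenshot_taken_event":
        if (event.evidenceRole === "agent") partial.agentScreenshot = event.screenshot;
        else partial.probeScreenshot = event.screenshot;
        break;
      case "agent_step_observed_event":
        partial.url = event.url;
        partial.ariaTree = event.ariaTree;
        break;
    }
  }

  // Order steps by stepIndex regardless of event arrival order.
  assembleSteps(): PartialStep[] {
    return [...this.partials.entries()]
      .sort(([a], [b]) => a - b)
      .map(([, step]) => step);
  }
}
```

Keying partial steps by `stepIndex` means the recorder does not need to care which of the three event types reaches it first for a given step.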
1 issue found across 4 files (changes from recent commits).
1 issue found across 10 files (changes from recent commits).
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
Do we need these three different patch files or just 1?
I can consolidate at the end if we want to have all be 1 patch
2 issues found across 15 files (changes from recent commits).
User-supplied onEvidence callbacks must never abort the agent loop. Wrap the callback once where each handler receives it; internal emit sites keep calling it as a plain await. Also unify CUA step_finished.toolOutput construction behind a shared inferCuaToolOutput helper alongside the existing inferToolOutput.
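A minimal sketch of the "wrap once at the boundary" idea, under assumptions: `makeNonFatal`, `EvidenceEvent`, and the `log` parameter are hypothetical names for illustration, and the log wording is borrowed from the follow-up commit message rather than from the actual code.

```typescript
// Hypothetical minimal event and callback types.
type EvidenceEvent = { type: string };
type OnEvidence = (event: EvidenceEvent) => void | Promise<void>;

// Wrap a user-supplied callback so sync throws and async rejections are
// logged instead of propagating into the agent loop.
function makeNonFatal(
  onEvidence: OnEvidence,
  log: (message: string) => void,
): OnEvidence {
  return async (event) => {
    try {
      await onEvidence(event);
    } catch (error) {
      log(`onEvidence callback failed for ${event.type}: ${String(error)}`);
    }
  };
}
```

Because the wrapper is applied once where the handler receives the callback, each internal emit site can stay a plain `await onEvidence(event)` with no per-site try/catch.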
…abel The non-fatal wrapper now logs `onEvidence callback failed for <event.type>` from a single boundary helper rather than the per-site `CUA screenshot evidence callback failed`. Update the assertion to match.
Follow-up cleanup on the sequential-recorder refactor:
- Drop step.index from TrajectoryStep; array position is the canonical index. Trajectory writer and v3Evaluator use entries()/map index.
- Drop unused scroll field from AgentStepObservedEvent, AgentFinalObservation, and ProbeEvidence; no producer ever set it.
- Require evidenceRole on AgentScreenshotEvidenceEvent; the role routes the event into different recorder slots, so a missing role can't silently misroute.
- Flatten the identity mergeAgentEvidence in onStepFinished.
- Drop unused url field from the recorder's pending screenshot slots.
- Remove the no-op TrajectoryRecorder.start() method and test call sites.
- Remove the dead early-return guard in onStepObserved.
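To illustrate why a required `evidenceRole` prevents silent misrouting, here is a rough sketch. The type and function names (`ScreenshotEvidence`, `ScreenshotSlots`, `routeScreenshot`) are hypothetical and only the two role values come from the PR text; the real event type carries more fields.

```typescript
type EvidenceRole = "agent" | "probe";

// evidenceRole is required, so an event without a role fails type checking.
interface ScreenshotEvidence {
  screenshot: Uint8Array;
  evidenceRole: EvidenceRole;
}

// Hypothetical recorder slots, one per role.
interface ScreenshotSlots {
  agent?: Uint8Array;
  probe?: Uint8Array;
}

function routeScreenshot(slots: ScreenshotSlots, event: ScreenshotEvidence): void {
  switch (event.evidenceRole) {
    case "agent":
      slots.agent = event.screenshot;
      return;
    case "probe":
      slots.probe = event.screenshot;
      return;
    default: {
      // Exhaustiveness check: adding a new role without a case is a compile error.
      const unhandled: never = event.evidenceRole;
      throw new Error(`Unhandled evidence role: ${String(unhandled)}`);
    }
  }
}
```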
The CUA handler calls inferToolOutput directly now that the general helper
handles the {success: boolean, error?: ...} shape via normalizeError.
A single post-turn probe is fanned across every step of a multi-tool turn, so those steps share the same screenshot Buffer by reference. writeTrajectoryDir was writing an identical PNG per step (probe/1.png, probe/2.png, ...). Dedupe by Buffer identity: write the PNG once and point every sharing step's screenshotPath at the same file. Behavior-preserving for single-probe steps.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
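The dedupe-by-identity step could look roughly like the sketch below. It only plans the writes rather than touching the file system, and `planProbeWrites`, `StepWithProbe`, and `PlannedWrite` are hypothetical names. Numbering the shared file after the first step that uses it is an assumption, not something the commit message states.

```typescript
interface StepWithProbe {
  probeScreenshot?: Uint8Array; // Node's Buffer is a Uint8Array subclass
}

interface PlannedWrite {
  path: string;
  data: Uint8Array;
}

// Decide which PNGs to write and which path each step should reference.
function planProbeWrites(steps: StepWithProbe[]): {
  writes: PlannedWrite[];
  screenshotPaths: (string | undefined)[];
} {
  // Map keys are compared by object identity, not by byte content.
  const pathByBuffer = new Map<Uint8Array, string>();
  const writes: PlannedWrite[] = [];

  const screenshotPaths = steps.map((step, index) => {
    const shot = step.probeScreenshot;
    if (!shot) return undefined;
    let path = pathByBuffer.get(shot);
    if (path === undefined) {
      // First step to reference this buffer: schedule exactly one write.
      path = `screenshots/probe/${index + 1}.png`;
      pathByBuffer.set(shot, path);
      writes.push({ path, data: shot });
    }
    return path;
  });

  return { writes, screenshotPaths };
}
```

Identity comparison is cheap and matches the situation described: the fanned-out probe is literally the same object on every step, so no hashing of image bytes is needed. Two distinct buffers with identical bytes would still be written twice under this approach.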
…oss batched steps
Two trajectory-fidelity gaps in CUA runs:
1. Failed actions were dropped. emitCuaActionStep only ran after a successful
executeAction; a throwing action jumped to catch and rethrew, so no
step_finished was recorded. Now the catch emits a step_finished
{ok:false, error} (with a best-effort post-failure probe) before rethrowing,
in a nested try/catch so evidence emission never masks the original error.
emitCuaActionStep now takes an explicit toolOutput instead of deriving it
from `result ?? {success:true}`.
2. Batched actions lost the agent screenshot. A CUA provider can choose several
actions from one screenshot, but the recorder cleared the pending agent
screenshot after the first step_finished, so later steps got no tier-1 frame.
Renamed to latestAgentScreenshot; it now applies to every step until a newer
agent screenshot replaces it (wiped on cancel()). writeTrajectoryDir dedupes
the now-shared agent Buffer by identity so it isn't written once per step.
Public onEvidence contract doc updated to describe the replay semantics.
Tests: failed-action emits step_finished{ok:false} and rethrows; batched
two-action turn shares the agent screenshot across both steps.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
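Both fixes in this commit can be sketched together under stated assumptions. `ReplayRecorder` and `executeWithEvidence` are hypothetical stand-ins for the recorder and the CUA handler's action path, and the `{ ok, error }` output shape follows the commit message; the real code also captures a post-failure probe, which is omitted here.

```typescript
type ToolOutput = { ok: true } | { ok: false; error: string };

interface RecordedStep {
  actionName: string;
  toolOutput: ToolOutput;
  agentScreenshot?: Uint8Array;
}

class ReplayRecorder {
  private latestAgentScreenshot?: Uint8Array;
  readonly steps: RecordedStep[] = [];

  // A newer agent screenshot replaces the previous one.
  onAgentScreenshot(screenshot: Uint8Array): void {
    this.latestAgentScreenshot = screenshot;
  }

  // The screenshot is NOT cleared here, so batched actions chosen from one
  // frame all receive that frame.
  onStepFinished(actionName: string, toolOutput: ToolOutput): void {
    this.steps.push({ actionName, toolOutput, agentScreenshot: this.latestAgentScreenshot });
  }

  cancel(): void {
    this.latestAgentScreenshot = undefined;
  }
}

async function executeWithEvidence(
  recorder: ReplayRecorder,
  actionName: string,
  executeAction: () => Promise<void>,
): Promise<void> {
  try {
    await executeAction();
  } catch (error) {
    try {
      // Record the failed step before rethrowing.
      recorder.onStepFinished(actionName, { ok: false, error: String(error) });
    } catch {
      // Evidence emission must never mask the original error.
    }
    throw error;
  }
  recorder.onStepFinished(actionName, { ok: true });
}
```

The nested try/catch is the key design point: whatever goes wrong while recording, the caller still sees the error thrown by the action itself.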
screenshotProvider/captureAndSendScreenshot call emitCuaScreenshot unconditionally; early-return when no recorder is attached so a plain CUA run does no extra work (the lastAgentScreenshotUrl bookkeeping is only read by evidence-gated code). Mirrors the emitCuaActionStep call-site gating.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Why
The new verifier needs richer evidence than a final screenshot, especially for DOM and Hybrid agent modes where the important facts often live in tool returns, ARIA snapshots, and per-step observations. This PR adds trajectory recording without changing the verifier judgment engine.
What Changed
- `TrajectoryRecorder` persistence and a smoke script for trajectory shape and disk layout.
Tests
- `pnpm --filter @browserbasehq/stagehand run typecheck`
- `pnpm --filter @browserbasehq/stagehand-evals run typecheck`
- `node --import tsx packages/evals/scripts/verify-trajectory-recorder.ts`
- `git diff --check`