feat(verifier): add verifier evaluator shell and types by miguelg719 · Pull Request #2157 · browserbase/stagehand

miguelg719 · 2026-05-22T00:54:32Z

Replacement for #2130, which was merged into the PR1 branch instead of main. This branch is rebased onto current main and contains the PR2 verifier evaluator shell/type changes.

Summary by cubic

Adds a verifier evaluator shell with public trajectory/rubric/result types and utilities, plus a V3Evaluator.verify(trajectory) facade that uses the legacy backend by default without breaking existing flows.

New Features
- New public types under v3/verifier: Trajectory, Rubric, EvaluationResult, Verifier, and more.
- Utilities: normalizeRubric, loadTrajectoryFromDisk (rehydrates screenshots and image modalities), nextResultFilename.
- V3Evaluator supports verify() and generateRubric(). Backend selectable via STAGEHAND_EVALUATOR_BACKEND (legacy default; verifier is stubbed for future). Legacy path maps trajectory screenshots/final answer/reasoning to the old evaluator and returns an EvaluationResult.
- Re-exported on the V3 public API: Stagehand.loadTrajectoryFromDisk, Stagehand.nextResultFilename, Stagehand.normalizeRubric.
Bug Fixes
- Normalizes dataset rubrics to public camelCase and validates required fields; strips serialized earned_points noise.
- Secures screenshotPath to stay within the trajectory directory and decodes on-disk bytesBase64 to Buffer.
- Derives the verification task from the Trajectory to prevent mismatched task specs and simplify verify().
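
The backend selection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the `STAGEHAND_EVALUATOR_BACKEND` variable name and the `legacy`/`verifier` values come from the summary. The function names, the case-insensitive matching, and the fallback for unrecognized values are assumptions.

```typescript
// Hypothetical sketch: only the env var name and the two backend values
// come from the PR summary; everything else is illustrative.
type EvaluatorBackend = "legacy" | "verifier";

// The environment is passed in as a parameter so the logic stays pure and testable.
function resolveBackend(
  env: Record<string, string | undefined>,
): EvaluatorBackend {
  const raw = env["STAGEHAND_EVALUATOR_BACKEND"]?.trim().toLowerCase();
  // Assumption: anything other than an explicit "verifier" falls back to legacy.
  return raw === "verifier" ? "verifier" : "legacy";
}

// The verifier backend is described as a stub, so this sketch rejects it up front.
function assertBackendAvailable(backend: EvaluatorBackend): void {
  if (backend === "verifier") {
    throw new Error("verifier backend not available");
  }
}
```

In real code the caller would presumably pass `process.env`; keeping the lookup behind a parameter makes the default-to-legacy behavior easy to unit test.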

Written for commit 3d8c324. Summary will update on new commits.

changeset-bot · 2026-05-22T00:54:36Z

🦋 Changeset detected

Latest commit: 3d8c324

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages

Name	Type
@browserbasehq/stagehand	Patch
@browserbasehq/browse-cli	Patch
@browserbasehq/stagehand-evals	Patch
@browserbasehq/stagehand-server-v3	Patch

cubic-dev-ai

No issues found across 11 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant UserCode as "User/CLI"
    participant V3Eval as V3Evaluator
    participant LegacyEval as LegacyV3Evaluator
    participant TrajUtil as "trajectory.ts"
    participant FileSys as "File System"
    participant BackendEnv as STAGEHAND_EVALUATOR_BACKEND
    participant LLM as LLM

    Note over UserCode,BackendEnv: Verifier Facade Initialization

    UserCode->>V3Eval: new V3Evaluator(v3Instance, opts)
    V3Eval->>BackendEnv: Read env var
    alt backend=verifier
        V3Eval->>V3Eval: Store backend=verifier
    else backend=legacy
        V3Eval->>V3Eval: Store backend=legacy
    end

    Note over UserCode,FileSys: NEW: verify(trajectory, taskSpec)

    UserCode->>V3Eval: verify(trajectory, taskSpec)
    V3Eval->>V3Eval: assertVerifierInput()
    alt backend=verifier
        V3Eval-->>UserCode: Throw "backend not available"
    else backend=legacy
        V3Eval->>V3Eval: collectLegacyScreenshots(trajectory)
        V3Eval->>V3Eval: renderLegacyAgentReasoning(trajectory)
        alt no screenshots AND no finalAnswer
            V3Eval-->>UserCode: legacyInsufficientEvidenceResult()
        else
            V3Eval->>LegacyEval: ask({question, screenshot, answer, agentReasoning})
            LegacyEval-->>V3Eval: legacy result (YES/NO/INVALID)
            V3Eval->>V3Eval: legacyEvaluationToResult()
            V3Eval-->>UserCode: EvaluationResult with rawSteps.backend="legacy"
        end
    end

    Note over UserCode,V3Eval: NEW: generateRubric(taskSpec)

    UserCode->>V3Eval: generateRubric(taskSpec)
    alt backend=verifier
        V3Eval-->>UserCode: Throw "backend not available"
    else backend=legacy
        V3Eval->>V3Eval: Create single criterion rubric
        V3Eval-->>UserCode: Rubric { items: [legacyTaskCompletionCriterion] }
    end

    Note over UserCode,FileSys: NEW: On-disk Trajectory Loading

    UserCode->>TrajUtil: loadTrajectoryFromDisk(dir)
    TrajUtil->>FileSys: readFile(trajectory.json)
    FileSys-->>TrajUtil: raw JSON
    TrajUtil->>TrajUtil: Parse JSON
    loop each step
        TrajUtil->>FileSys: readFile(screenshotPath) for probe
        alt screenshot file exists
            FileSys-->>TrajUtil: Buffer
            TrajUtil->>TrajUtil: Set probeEvidence.screenshot
        else file missing
            TrajUtil->>TrajUtil: Leave screenshot unset
        end
        alt image modality with bytesBase64
            TrajUtil->>TrajUtil: Decode base64 → Buffer
        end
    end
    TrajUtil-->>UserCode: Hydrated Trajectory

    Note over UserCode,FileSys: NEW: Path Security Check

    TrajUtil->>TrajUtil: resolveWithinTrajectoryDir(candidate)
    alt path escapes trajectory directory
        TrajUtil-->>UserCode: Throw error
    else safe
        TrajUtil->>FileSys: readFile(resolved)
    end

    Note over UserCode,FileSys: Runtime: Legacy Evaluator with Answer

    LegacyEval->>LegacyEval: _evaluateWithMultipleScreenshots()
    rect over LegacyEval
        Note over LegacyEval: CHANGED: included answer in prompt
    end
    LegacyEval->>LLM: prompt(text + image contents + "the answer is {answer}")
    Note over LLM: NEW: answer appended to user message
    LLM-->>LegacyEval: YES/NO + reasoning
    LegacyEval-->>UserCode: LegacyEvaluationResult
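
The path security check in the diagram can be illustrated with a small containment helper. This is a sketch under assumptions: the PR's `resolveWithinTrajectoryDir` is not shown in this conversation, so the name `resolveWithinDir`, its signature, and the error message below are hypothetical. It relies only on Node's `path.resolve`, `path.relative`, and `path.isAbsolute`.

```typescript
import * as path from "node:path";

// Hypothetical helper; the PR's resolveWithinTrajectoryDir may differ in detail.
function resolveWithinDir(trajectoryDir: string, candidate: string): string {
  const root = path.resolve(trajectoryDir);
  // An absolute candidate replaces the root here, which the check below catches.
  const resolved = path.resolve(root, candidate);
  const rel = path.relative(root, resolved);
  // Escapes if the relative path climbs out of root, or if it is absolute
  // (for example a different drive on Windows).
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`screenshotPath escapes trajectory directory: ${candidate}`);
  }
  return resolved;
}
```

Comparing via `path.relative` rather than a string prefix avoids accepting a sibling directory such as `run1-evil` when the root is `run1`.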

monadoid · 2026-05-22T11:37:45Z

+
+  const trajectoryPath = path.join(trajectoryDir, "trajectory.json");
+  const raw = await fs.readFile(trajectoryPath, "utf8");
+  const parsed = JSON.parse(raw) as Trajectory & {


This could be made more typesafe at runtime if we used zod at the parsing boundary, like: TrajectorySchema.safeParse(JSON.parse(raw))

and then you could z.infer to still have a Trajectory type (the array of trajectory steps could all be part of the zod schema too)

but might be a nit!

miguelg719: Going to add an extra PR at the end to parse as much as possible. This one is tricky because downstream we use some precomputed rubrics, from webtailbench specifically, that don't match our schema (they are snake_case).
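
To make the suggestion above concrete, here is a dependency-free sketch of validating at the parsing boundary. It mimics the `{ success, data | error }` result shape that zod's `safeParse` returns, but uses a hand-written guard so it runs without any library; with zod, the guard would be replaced by a schema and the type derived with `z.infer`. The `MiniTrajectory` shape is hypothetical and far smaller than the real `Trajectory` type.

```typescript
// Hypothetical minimal shape; the real Trajectory type has more fields.
interface MiniTrajectory {
  task: { instruction: string };
  steps: Array<{ screenshotPath?: string }>;
}

type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function fail(error: string): ParseResult<MiniTrajectory> {
  return { success: false, error };
}

// Validates untrusted JSON instead of casting it with `as`.
function safeParseTrajectory(raw: string): ParseResult<MiniTrajectory> {
  let json: unknown;
  try {
    json = JSON.parse(raw);
  } catch {
    return fail("invalid JSON");
  }
  if (!isRecord(json)) return fail("expected an object");

  const task = json["task"];
  const instruction = isRecord(task) ? task["instruction"] : undefined;
  if (typeof instruction !== "string") {
    return fail("task.instruction must be a string");
  }

  const rawSteps = json["steps"];
  if (!Array.isArray(rawSteps)) return fail("steps must be an array");

  const steps: MiniTrajectory["steps"] = [];
  for (const step of rawSteps) {
    if (!isRecord(step)) return fail("each step must be an object");
    const screenshotPath = step["screenshotPath"];
    if (typeof screenshotPath === "string") steps.push({ screenshotPath });
    else if (screenshotPath === undefined) steps.push({});
    else return fail("screenshotPath must be a string when present");
  }
  return { success: true, data: { task: { instruction }, steps } };
}
```

A caller would branch on `result.success` before touching `result.data`, so malformed files surface as a readable error rather than a failure deep in the evaluator.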

monadoid · 2026-05-22T11:45:23Z

+ * return an EvaluationResult — they MUST NOT touch a live browser.
+ */
+export interface Verifier {
+  verify(trajectory: Trajectory, taskSpec: TaskSpec): Promise<EvaluationResult>;


Since the trajectory already contains a taskspec within it, can we remove the additional taskSpec that gets passed in here? This would make it simpler, but maybe i'm missing something.

miguelg719: good catch 🤝
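
The simplification discussed here, where `verify()` reads the task from the trajectory instead of taking a second argument, might look roughly like this. All type shapes and field names below (`task`, `instruction`, `finalAnswer`, `taskInstruction`, `hasEvidence`) are hypothetical stand-ins chosen for illustration; the PR's public types are richer and may name things differently.

```typescript
// Hypothetical minimal types; not the PR's actual definitions.
interface SketchTaskSpec {
  instruction: string;
}
interface SketchTrajectory {
  task: SketchTaskSpec;
  finalAnswer?: string;
}
interface SketchEvaluationResult {
  taskInstruction: string;
  hasEvidence: boolean;
}

// Single-argument form: the trajectory is the only source of the task.
interface Verifier {
  verify(trajectory: SketchTrajectory): Promise<SketchEvaluationResult>;
}

// Toy implementation that only shows where the task comes from.
const echoVerifier: Verifier = {
  async verify(trajectory) {
    const task = trajectory.task; // derived, so it cannot disagree with the trajectory
    return {
      taskInstruction: task.instruction,
      hasEvidence: trajectory.finalAnswer !== undefined,
    };
  },
};
```

With one source for the task, a caller can no longer pass a trajectory recorded for one task together with a spec for another.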

monadoid · 2026-05-22T12:18:40Z

+ * Snake-case dataset fields are accepted here so serialized quirks do not leak
+ * into the canonical rubric type.
+ */
+export function normalizeRubric(rubric: unknown): Rubric | undefined {


Also here - a rubric could just be a zod object + a z.infer type, and then this logic could be built into the zod object (via superrefine or otherwise), and it might be a bit simpler, but optional nit!
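
For readers unfamiliar with this kind of normalizer, here is a rough sketch of accepting snake_case dataset fields while exposing only camelCase. It is not the PR's `normalizeRubric`: only the `(rubric: unknown): Rubric | undefined` signature and the `items` array appear in this conversation. The item fields `criterion`, `max_points`, and `maxPoints` are hypothetical, and returning `undefined` for every malformed input is an assumption.

```typescript
// Hypothetical item fields; only `items` is known from the PR discussion.
interface SketchRubricItem {
  criterion: string;
  maxPoints: number;
}
interface SketchRubric {
  items: SketchRubricItem[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function normalizeRubricSketch(rubric: unknown): SketchRubric | undefined {
  const rawItems = isRecord(rubric) ? rubric["items"] : undefined;
  if (!Array.isArray(rawItems)) return undefined;

  const items: SketchRubricItem[] = [];
  for (const item of rawItems) {
    if (!isRecord(item)) return undefined;
    const criterion = item["criterion"];
    // Accept both casings; the canonical camelCase key wins when both exist.
    const maxPoints = item["maxPoints"] ?? item["max_points"];
    if (typeof criterion !== "string" || typeof maxPoints !== "number") {
      return undefined; // a required field is missing or has the wrong type
    }
    // Only known keys are copied, so extra serialized keys are dropped.
    items.push({ criterion, maxPoints });
  }
  return { items };
}
```

The same mapping could live in a schema transform, as the comment suggests; the hand-written version is shown only because it needs no dependencies.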

miguelg719 added 14 commits May 21, 2026 17:54

feat(verifier): add verifier evaluator shell

ca4d3f0

fix(verifier): normalize public rubric naming

2fb8bb3

style(verifier): format rubric normalizer

d8b6dc3

chore(verifier): remove upstream verifier references

017c70e

docs(verifier): remove rollout comments from public types

3f4b770

refactor(verifier): consolidate public types

c44a3c7

refactor(verifier): remove rollout stub reason

2f81239

refactor(verifier): remove proxy type barrels

9b71108

fix(verifier): keep rubric earned points numeric

f31fd98

fix(verifier): constrain trajectory screenshot paths

1aa1578

test(verifier): cover trajectory normalization boundaries

99cf719

test(verifier): cover evaluator facade helpers

4825e92

fix(verifier): clean public result API

36e9342

docs(verifier): align evaluator changeset wording

84042d4

cubic-dev-ai Bot reviewed May 22, 2026

monadoid reviewed May 22, 2026

monadoid approved these changes May 22, 2026

monadoid reviewed May 22, 2026

fix(verifier): derive verification task from trajectory

3d8c324

monadoid approved these changes May 22, 2026

miguelg719 merged commit 2cd60a3 into main May 22, 2026
381 of 422 checks passed

github-actions Bot mentioned this pull request May 22, 2026

Version Packages #2110

Open

miguelg719 merged 15 commits into main from miguelgonzalez/verifier-02-backend-routing
