feat(verifier): add verifier evaluator shell and types #2157

Merged
miguelg719 merged 15 commits into main from miguelgonzalez/verifier-02-backend-routing
May 22, 2026
Collaborator

@miguelg719 miguelg719 commented May 22, 2026

Replacement for #2130, which was merged into the PR1 branch instead of main. This branch is rebased onto current main and contains the PR2 verifier evaluator shell/type changes.


Summary by cubic

Adds a verifier evaluator shell with public trajectory/rubric/result types and utilities, plus a V3Evaluator.verify(trajectory) facade that uses the legacy backend by default without breaking existing flows.

  • New Features

    • New public types under v3/verifier: Trajectory, Rubric, EvaluationResult, Verifier, and more.
    • Utilities: normalizeRubric, loadTrajectoryFromDisk (rehydrates screenshots and image modalities), nextResultFilename.
    • V3Evaluator supports verify() and generateRubric(). The backend is selectable via STAGEHAND_EVALUATOR_BACKEND (legacy is the default; the verifier backend is stubbed for future use). The legacy path maps trajectory screenshots, final answer, and reasoning to the old evaluator and returns an EvaluationResult.
    • Re-exported on the V3 public API: Stagehand.loadTrajectoryFromDisk, Stagehand.nextResultFilename, Stagehand.normalizeRubric.
  • Bug Fixes

    • Normalizes dataset rubrics to public camelCase and validates required fields; strips serialized earned_points noise.
    • Secures screenshotPath to stay within the trajectory directory and decodes on-disk bytesBase64 to Buffer.
    • Derives the verification task from the Trajectory to prevent mismatched task specs and simplify verify().

Written for commit 3d8c324. Summary will update on new commits.
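
For readers skimming the summary, the backend selection it describes could look roughly like the sketch below. Only the STAGEHAND_EVALUATOR_BACKEND variable name and the "legacy"/"verifier" values come from this PR; the function names, the trimming and lowercasing, and the error text are assumptions made for illustration, not the merged implementation.

```typescript
// Hypothetical sketch: only STAGEHAND_EVALUATOR_BACKEND and the two backend
// names are taken from the PR summary; everything else is illustrative.
type EvaluatorBackend = "legacy" | "verifier";

function resolveEvaluatorBackend(
  env: Record<string, string | undefined>,
): EvaluatorBackend {
  const raw = env["STAGEHAND_EVALUATOR_BACKEND"];
  // Assumption: values are compared case-insensitively after trimming.
  const value = raw === undefined ? "" : raw.trim().toLowerCase();
  // Anything other than an explicit "verifier" falls back to the legacy default.
  return value === "verifier" ? "verifier" : "legacy";
}

function assertBackendAvailable(backend: EvaluatorBackend): void {
  // The verifier backend is described as a stub, so this sketch rejects it.
  if (backend === "verifier") {
    throw new Error("verifier backend not available");
  }
}
```

Taking the environment as a parameter (rather than reading `process.env` inside the function) keeps the sketch easy to test; callers would pass `process.env`.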

changeset-bot Bot commented May 22, 2026

🦋 Changeset detected

Latest commit: 3d8c324

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@browserbasehq/stagehand Patch
@browserbasehq/browse-cli Patch
@browserbasehq/stagehand-evals Patch
@browserbasehq/stagehand-server-v3 Patch

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 11 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant UserCode as "User/CLI"
    participant V3Eval as V3Evaluator
    participant LegacyEval as LegacyV3Evaluator
    participant TrajUtil as "trajectory.ts"
    participant FileSys as "File System"
    participant BackendEnv as STAGEHAND_EVALUATOR_BACKEND
    participant LLM as LLM

    Note over UserCode,BackendEnv: Verifier Facade Initialization

    UserCode->>V3Eval: new V3Evaluator(v3Instance, opts)
    V3Eval->>BackendEnv: Read env var
    alt backend=verifier
        V3Eval->>V3Eval: Store backend=verifier
    else backend=legacy
        V3Eval->>V3Eval: Store backend=legacy
    end

    Note over UserCode,FileSys: NEW: verify(trajectory, taskSpec)

    UserCode->>V3Eval: verify(trajectory, taskSpec)
    V3Eval->>V3Eval: assertVerifierInput()
    alt backend=verifier
        V3Eval-->>UserCode: Throw "backend not available"
    else backend=legacy
        V3Eval->>V3Eval: collectLegacyScreenshots(trajectory)
        V3Eval->>V3Eval: renderLegacyAgentReasoning(trajectory)
        alt no screenshots AND no finalAnswer
            V3Eval-->>UserCode: legacyInsufficientEvidenceResult()
        else
            V3Eval->>LegacyEval: ask({question, screenshot, answer, agentReasoning})
            LegacyEval-->>V3Eval: legacy result (YES/NO/INVALID)
            V3Eval->>V3Eval: legacyEvaluationToResult()
            V3Eval-->>UserCode: EvaluationResult with rawSteps.backend="legacy"
        end
    end

    Note over UserCode,V3Eval: NEW: generateRubric(taskSpec)

    UserCode->>V3Eval: generateRubric(taskSpec)
    alt backend=verifier
        V3Eval-->>UserCode: Throw "backend not available"
    else backend=legacy
        V3Eval->>V3Eval: Create single criterion rubric
        V3Eval-->>UserCode: Rubric { items: [legacyTaskCompletionCriterion] }
    end

    Note over UserCode,FileSys: NEW: On-disk Trajectory Loading

    UserCode->>TrajUtil: loadTrajectoryFromDisk(dir)
    TrajUtil->>FileSys: readFile(trajectory.json)
    FileSys-->>TrajUtil: raw JSON
    TrajUtil->>TrajUtil: Parse JSON
    loop each step
        TrajUtil->>FileSys: readFile(screenshotPath) for probe
        alt screenshot file exists
            FileSys-->>TrajUtil: Buffer
            TrajUtil->>TrajUtil: Set probeEvidence.screenshot
        else file missing
            TrajUtil->>TrajUtil: Leave screenshot unset
        end
        alt image modality with bytesBase64
            TrajUtil->>TrajUtil: Decode base64 → Buffer
        end
    end
    TrajUtil-->>UserCode: Hydrated Trajectory

    Note over UserCode,FileSys: NEW: Path Security Check

    TrajUtil->>TrajUtil: resolveWithinTrajectoryDir(candidate)
    alt path escapes trajectory directory
        TrajUtil-->>UserCode: Throw error
    else safe
        TrajUtil->>FileSys: readFile(resolved)
    end

    Note over UserCode,FileSys: Runtime: Legacy Evaluator with Answer

    LegacyEval->>LegacyEval: _evaluateWithMultipleScreenshots()
    rect over LegacyEval
        Note over LegacyEval: CHANGED: included answer in prompt
    end
    LegacyEval->>LLM: prompt(text + image contents + "the answer is {answer}")
    Note over LLM: NEW: answer appended to user message
    LLM-->>LegacyEval: YES/NO + reasoning
    LegacyEval-->>UserCode: LegacyEvaluationResult
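
The "Path Security Check" step in the diagram can be sketched as a containment test on the resolved path. This is a minimal illustration under assumptions: the diagram names a one-argument `resolveWithinTrajectoryDir(candidate)`, while the two-argument `resolveWithinDir` helper and its error message below are hypothetical and are not the code from this PR.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolves `candidate` against `baseDir` and rejects any
// result that lands outside `baseDir`.
function resolveWithinDir(baseDir: string, candidate: string): string {
  const base = path.resolve(baseDir);
  const resolved = path.resolve(base, candidate);
  const relative = path.relative(base, resolved);
  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Path escapes trajectory directory: ${candidate}`);
  }
  return resolved;
}
```

Comparing via `path.relative` instead of a plain string prefix avoids treating a sibling such as `/tmp/traj-other` as being inside `/tmp/traj`.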

const trajectoryPath = path.join(trajectoryDir, "trajectory.json");
const raw = await fs.readFile(trajectoryPath, "utf8");
const parsed = JSON.parse(raw) as Trajectory & {
Contributor

@monadoid monadoid May 22, 2026

This could be made more typesafe at runtime if we used zod at the parsing boundary, like: TrajectorySchema.safeParse(JSON.parse(raw))

and then you could z.infer to still have a Trajectory type (the array of trajectory steps could all be part of the zod schema too)

but might be a nit!

Collaborator Author
Going to add an extra PR at the end to parse as much as possible. This one is tricky because downstream we use some precomputed rubrics, from webtailbench specifically, that don't match our schema (they are snake_case).

Comment thread packages/core/lib/v3/verifier/types.ts Outdated
* return an EvaluationResult — they MUST NOT touch a live browser.
*/
export interface Verifier {
verify(trajectory: Trajectory, taskSpec: TaskSpec): Promise<EvaluationResult>;
Contributor

@monadoid monadoid May 22, 2026

Since the trajectory already contains a TaskSpec, can we remove the additional taskSpec argument that gets passed in here? That would make it simpler, but maybe I'm missing something.

Collaborator Author

good catch 🤝
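
A rough sketch of the simplification agreed on in this thread, where `verify` takes only the trajectory and reads the task from it. The type shapes here (`task`, `instruction`, `finalAnswer`, `passed`, `reasoning`) and the `AnswerPresenceVerifier` class are hypothetical stand-ins, since the real definitions in `types.ts` are not shown in this conversation.

```typescript
// Hypothetical, simplified shapes; the real types in v3/verifier may differ.
interface TaskSpec {
  instruction: string;
}
interface Trajectory {
  task: TaskSpec; // assumed field name for the embedded task spec
  finalAnswer?: string;
}
interface EvaluationResult {
  passed: boolean;
  reasoning: string;
}

// Single-argument form: the task is derived from the trajectory, so callers
// cannot pass a TaskSpec that disagrees with the one recorded in the run.
interface Verifier {
  verify(trajectory: Trajectory): Promise<EvaluationResult>;
}

// Toy implementation used only to show the call shape.
class AnswerPresenceVerifier implements Verifier {
  async verify(trajectory: Trajectory): Promise<EvaluationResult> {
    const hasAnswer = (trajectory.finalAnswer ?? "").trim().length > 0;
    return {
      passed: hasAnswer,
      reasoning: `Task "${trajectory.task.instruction}": final answer ${
        hasAnswer ? "present" : "missing"
      }`,
    };
  }
}
```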

* Snake-case dataset fields are accepted here so serialized quirks do not leak
* into the canonical rubric type.
*/
export function normalizeRubric(rubric: unknown): Rubric | undefined {
Contributor

Also here - a rubric could just be a zod object + a z.infer type, and then this logic could be built into the zod object (via superRefine or otherwise), and it might be a bit simpler, but optional nit!
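
For context on what such a normalization step does, here is a dependency-free sketch of snake_case to camelCase rubric normalization. It is an illustration only: `items` and `earned_points` appear in this PR, but the item fields `criterion`, `maxPoints`, and `max_points`, the function name `normalizeRubricSketch`, and the choice to throw on missing fields are all assumptions.

```typescript
// Hypothetical rubric shape; only `items` is taken from the PR discussion.
interface RubricItem {
  criterion: string;
  maxPoints: number;
}
interface Rubric {
  items: RubricItem[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function normalizeRubricSketch(rubric: unknown): Rubric | undefined {
  if (!isRecord(rubric)) return undefined;
  const rawItems = rubric["items"];
  if (!Array.isArray(rawItems)) return undefined;

  const items = rawItems.map((item: unknown, index: number): RubricItem => {
    if (!isRecord(item)) {
      throw new Error(`rubric item ${index} is not an object`);
    }
    const criterion = item["criterion"];
    // Accept either the public camelCase key or the dataset's snake_case key.
    const maxPoints = item["maxPoints"] ?? item["max_points"];
    if (typeof criterion !== "string" || typeof maxPoints !== "number") {
      throw new Error(`rubric item ${index} is missing required fields`);
    }
    // Only known fields are copied, so extras such as earned_points are dropped.
    return { criterion, maxPoints };
  });
  return { items };
}
```

A zod schema with a transform or `superRefine` could express the same checks declaratively, as suggested above; the hand-written version just makes each step visible.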

@miguelg719 miguelg719 merged commit 2cd60a3 into main May 22, 2026
381 of 422 checks passed
@github-actions github-actions Bot mentioned this pull request May 22, 2026