Evals v2 by miguelg719 · Pull Request #2011 · browserbase/stagehand

miguelg719 · 2026-04-18T23:57:03Z

why

Evals CLI v2.0

what changed

You can find this diagram in ARCHITECTURE.md

flowchart TB
    subgraph Entrypoints["Entrypoints and Distribution"]
        CLI["cli.ts<br/>single-shot argv"]
        REPL["tui/repl.ts<br/>interactive shell<br/>Esc abort support"]
        BuildCLI["scripts/build-cli.ts<br/>bundled dist/cli bin"]
        BuildESM["scripts/build-esm.ts<br/>dist/esm + assets + datasets"]
        LegacyCLI["cli-legacy.ts / index.eval.ts<br/>--legacy escape hatch"]
    end

    subgraph TUI["TUI Command Layer"]
        Help["help<br/>usage text"]
        List["list<br/>registry views"]
        ConfigCmd["config<br/>defaults + core tool setup"]
        NewTask["new<br/>task scaffold"]
        Experiments["experiments<br/>Braintrust inspect/compare"]
        RunCmd["run<br/>progress + dry-run plan"]
        Parse["parse.ts<br/>flag validation<br/>target shorthand<br/>env override scope"]
        ConfigFile[("evals.config.json<br/>defaults, core config,<br/>benchmark metadata")]
    end

    subgraph Registry["Task Registry and Target Resolution"]
        RuntimePaths["runtimePaths.ts<br/>source vs dist roots"]
        Discovery["framework/discovery.ts<br/>filesystem scan<br/>category overrides<br/>lazy/eager metadata"]
        Resolver["resolveTarget()<br/>tier, category, task,<br/>partial name"]
        CoreTasks[("core/tasks/*<br/>deterministic tool tasks")]
        BenchTasks[("tasks/bench/*<br/>act, extract, observe,<br/>agent, combination, suites")]
        DefineTask["framework/defineTask.ts<br/>defineCoreTask<br/>defineBenchTask"]
    end

    subgraph Planning["Run Planning"]
        ResolvedOptions["ResolvedRunOptions<br/>env, trials, concurrency,<br/>model, provider, api,<br/>harness, agent mode"]
        BenchPlanner["framework/benchPlanner.ts<br/>model matrix<br/>suite expansion<br/>metadata/tags"]
        TaskConfig["taskConfig.ts<br/>default models<br/>agent model entries"]
        Datasets[("datasets/*<br/>WebVoyager<br/>OnlineMind2Web<br/>WebTailBench")]
        DryRun["dry-run JSON<br/>tasks, skippedTasks,<br/>envOverrides, runOptions"]
    end

    subgraph Runner["Unified Runner"]
        RunEvals["framework/runner.ts<br/>generate testcases<br/>Braintrust Eval wrapper<br/>progress callbacks"]
        BraintrustAdapter["framework/braintrust.ts<br/>lazy import<br/>traced spans"]
        Cleanup["activeRunCleanup.ts<br/>signal cleanup<br/>registered session cleanup"]
        Scoring["scoring.ts<br/>passRate<br/>exactMatch<br/>errorMatch"]
        Summary["summary.ts<br/>local run summary"]
    end

    subgraph CoreTier["Core Tier Execution"]
        CoreContext["framework/context.ts<br/>buildCoreContext"]
        FixtureServer["core/fixtures<br/>local deterministic pages"]
        CoreTargets["core/targets<br/>local Chrome<br/>Browserbase CDP"]
        CoreTools["core/tools registry<br/>understudy_code<br/>playwright_code<br/>cdp_code<br/>MCP + browse_cli"]
        CoreAssertions["assertions + metrics<br/>adapter-backed results"]
    end

    subgraph BenchTier["Bench Tier Execution"]
        BenchRunner["framework/benchRunner.ts<br/>load task module<br/>execute bench task"]
        TaskLoader["framework/taskLoader.ts<br/>defineBenchTask + legacy<br/>module loader"]
        HarnessRegistry["framework/benchHarness.ts<br/>stagehand + claude_code<br/>(codex slot pending)"]
        InitV3["initV3.ts<br/>V3 session + agent"]
        AISDK["AISdkClientWrapped<br/>provider/model client"]
        ClaudeCodeRunner["framework/claudeCodeRunner.ts<br/>Claude Code SDK adapter<br/>prompt + result parse"]
        ClaudeCodeToolAdapter["framework/claudeCodeToolAdapter.ts<br/>tool-surface handoff<br/>browse_cli wrapper"]
        ExternalHarnessPlan["framework/externalHarnessPlan.ts<br/>webvoyager / onlineMind2Web /<br/>webtailbench task plan"]
        BenchSuites["suites/*<br/>dataset-backed suite cases"]
        BenchTaskDefs["bench task modules<br/>defineBenchTask or legacy fn"]
    end

    subgraph Reporting["Reporting and Inspection"]
        Progress["tui/progress.ts<br/>live task table"]
        ResultsView["tui/results.ts<br/>model summary<br/>results table"]
        ReportLib["lib/braintrust-report.ts<br/>experiment fetch<br/>comparison data"]
        RenderReport["scripts/render-braintrust-core-report.ts<br/>core report rendering"]
    end

    subgraph External["External Systems"]
        Braintrust["Braintrust<br/>experiments, spans,<br/>scores, comparisons"]
        Browserbase["Browserbase<br/>remote browser sessions"]
        LocalBrowser["Local Chromium<br/>fixture-backed sessions"]
        Providers["LLM Providers<br/>OpenAI, Anthropic,<br/>Google, others"]
        Stagehand["Stagehand package<br/>V3, AgentProvider,<br/>available models"]
        ClaudeCodeSDK["@anthropic-ai/claude-agent-sdk<br/>external Claude Code agent<br/>headless run"]
    end

    %% Entrypoints
    CLI --> Parse
    CLI --> RunCmd
    CLI --> List
    CLI --> ConfigCmd
    CLI --> Experiments
    CLI --> NewTask
    CLI --> Help
    REPL --> Parse
    REPL --> RunCmd
    BuildCLI --> CLI
    BuildESM --> RuntimePaths
    LegacyCLI --> LegacyRunner["legacy runner path<br/>pre-refactor behavior"]

    %% TUI and config
    ConfigCmd <--> ConfigFile
    Parse --> ConfigFile
    Parse --> ResolvedOptions
    Parse --> DryRun
    RunCmd --> ResolvedOptions
    RunCmd --> RuntimePaths
    RunCmd --> Discovery
    List --> Discovery
    NewTask --> CoreTasks
    NewTask --> BenchTasks
    Experiments --> ReportLib

    %% Registry
    RuntimePaths --> CoreTasks
    RuntimePaths --> BenchTasks
    CoreTasks --> Discovery
    BenchTasks --> Discovery
    DefineTask --> CoreTasks
    DefineTask --> BenchTasks
    Discovery --> Resolver
    Resolver --> RunCmd

    %% Planning
    RunCmd --> BenchPlanner
    RunCmd --> DryRun
    ResolvedOptions --> BenchPlanner
    TaskConfig --> BenchPlanner
    Datasets --> BenchSuites
    BenchSuites --> BenchPlanner
    BenchPlanner --> RunEvals
    Resolver --> RunEvals

    %% Runner
    RunCmd --> RunEvals
    RunEvals --> BraintrustAdapter
    RunEvals --> Cleanup
    RunEvals --> Scoring
    RunEvals --> Summary
    RunEvals --> CoreContext
    RunEvals --> BenchRunner
    RunEvals --> Progress
    Progress --> ResultsView

    %% Core path
    CoreContext --> FixtureServer
    CoreContext --> CoreTargets
    CoreContext --> CoreTools
    CoreContext --> CoreAssertions
    CoreTargets --> LocalBrowser
    CoreTargets --> Browserbase
    CoreTools --> LocalBrowser
    CoreTools --> Browserbase

    %% Bench path
    BenchRunner --> HarnessRegistry
    BenchRunner --> TaskLoader
    TaskLoader --> BenchTaskDefs
    HarnessRegistry --> InitV3
    InitV3 --> Stagehand
    InitV3 --> Browserbase
    InitV3 --> LocalBrowser
    HarnessRegistry --> AISDK
    AISDK --> Providers
    BenchRunner --> BenchTaskDefs
    BenchTaskDefs --> Stagehand
    BenchSuites --> Datasets

    %% Claude Code external harness path
    HarnessRegistry --> ClaudeCodeRunner
    HarnessRegistry --> ClaudeCodeToolAdapter
    HarnessRegistry --> ExternalHarnessPlan
    ClaudeCodeToolAdapter --> CoreTools
    ClaudeCodeToolAdapter --> CoreTargets
    ClaudeCodeRunner --> ClaudeCodeSDK
    ClaudeCodeSDK --> Providers

    %% Reporting and external
    BraintrustAdapter --> Braintrust
    ReportLib --> Braintrust
    RenderReport --> ReportLib
    Summary --> Braintrust
    Stagehand --> Providers

    %% Styling
    style Entrypoints fill:#263238,stroke:#111111,stroke-width:3px,color:#ffffff
    style TUI fill:#0d47a1,stroke:#082d69,stroke-width:3px,color:#ffffff
    style Registry fill:#1b5e20,stroke:#0d3010,stroke-width:3px,color:#ffffff
    style Planning fill:#6a1b9a,stroke:#4a148c,stroke-width:3px,color:#ffffff
    style Runner fill:#004d40,stroke:#00251f,stroke-width:3px,color:#ffffff
    style CoreTier fill:#bf360c,stroke:#7f2207,stroke-width:3px,color:#ffffff
    style BenchTier fill:#1565c0,stroke:#0d47a1,stroke-width:3px,color:#ffffff
    style Reporting fill:#4e342e,stroke:#2d1b17,stroke-width:3px,color:#ffffff
    style External fill:#7f0000,stroke:#4d0000,stroke-width:3px,color:#ffffff

    style CLI fill:#eceff1,stroke:#263238,stroke-width:2px,color:#000000
    style REPL fill:#eceff1,stroke:#263238,stroke-width:2px,color:#000000
    style BuildCLI fill:#cfd8dc,stroke:#263238,stroke-width:2px,color:#000000
    style BuildESM fill:#cfd8dc,stroke:#263238,stroke-width:2px,color:#000000
    style LegacyCLI fill:#ffcdd2,stroke:#b71c1c,stroke-width:2px,color:#000000
    style LegacyRunner fill:#ffcdd2,stroke:#b71c1c,stroke-width:2px,color:#000000

    style Parse fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000000
    style RunCmd fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000000
    style ConfigFile fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px,color:#000000
    style Discovery fill:#c8e6c9,stroke:#1b5e20,stroke-width:2px,color:#000000
    style Resolver fill:#a5d6a7,stroke:#1b5e20,stroke-width:2px,color:#000000
    style CoreTasks fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000000
    style BenchTasks fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000000

    style BenchPlanner fill:#e1bee7,stroke:#6a1b9a,stroke-width:2px,color:#000000
    style Datasets fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000000
    style RunEvals fill:#b2dfdb,stroke:#004d40,stroke-width:2px,color:#000000
    style BraintrustAdapter fill:#e0f2f1,stroke:#004d40,stroke-width:2px,color:#000000
    style Cleanup fill:#e0f2f1,stroke:#004d40,stroke-width:2px,color:#000000

    style CoreContext fill:#ffe0b2,stroke:#bf360c,stroke-width:2px,color:#000000
    style CoreTargets fill:#ffcc80,stroke:#bf360c,stroke-width:2px,color:#000000
    style CoreTools fill:#ffcc80,stroke:#bf360c,stroke-width:2px,color:#000000
    style BenchRunner fill:#bbdefb,stroke:#1565c0,stroke-width:2px,color:#000000
    style HarnessRegistry fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style InitV3 fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style TaskLoader fill:#bbdefb,stroke:#1565c0,stroke-width:2px,color:#000000
    style ClaudeCodeRunner fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style ClaudeCodeToolAdapter fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style ExternalHarnessPlan fill:#bbdefb,stroke:#1565c0,stroke-width:2px,color:#000000

    style Braintrust fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style Browserbase fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style Providers fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style Stagehand fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style ClaudeCodeSDK fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000

test plan

Summary by cubic

Evals v2 rebuilds the suite into a core + bench system with auto-discovered tasks, adapter-backed tools, a unified Braintrust runner, and a stable CLI/TUI. It now stabilizes the playwright_code and cdp_code tool surfaces, adds browse_cli and MCP adapters, ships a Braintrust comparison report, fixes simple_google_search, guards file-upload evals with cleanup-on-failure, and removes outdated planning docs.

New Features
- Core: deterministic tasks under packages/evals/core/tasks/**, shared contracts/assertions/metrics, local fixture server, lazy getEnv(), runner-provided Chrome targets (local, Browserbase) with artifacts/cleanup and a shared cleanup registry.
- Tools: stable playwright_code and cdp_code; plus understudy_code, playwright_mcp, chrome_devtools_mcp, browse_cli.
- Bench: matrix planner and external-suite support (WebVoyager, OnlineMind2Web, WebTailBench); harness registry with stagehand and claude_code, Claude Code SDK boundary, and a browse_cli tool handoff; bench runner propagates sessionUrl/debugUrl.
- CLI/TUI: stable dist/cli (pnpm evals ...) with bin/evals, experiments command, startup warning suppression, quieter logs via EvalLogger echo toggle; legacy CLI via pnpm evals:old.
- Reporting: Braintrust data layer (lib/braintrust-report.ts), report:core script, and summaries that include the experiment URL and scores.
- CI/Deps/Docs: defaults updated (EVAL_AGENT_MODELS= openai/gpt-5.4-mini, anthropic/claude-haiku-4-5, google/gemini-3-flash-preview; EVAL_CATEGORIES includes regression); .gitignore ignores Playwright/MCP artifacts; version bumped to 2.0.0; add playwright, @anthropic-ai/claude-agent-sdk, and vitest; updated README, architecture diagram, and skills/browser guide.
Migration
- Tasks are auto-discovered; remove task lists from evals.config.json.
- Bench tasks must default-export defineBenchTask({ name }) under packages/evals/tasks/bench/**; core tasks live under packages/evals/core/tasks/** via defineCoreTask.
- Use the new CLI (pnpm evals ... or the evals bin); the legacy interface remains via pnpm evals:old.
- Set BROWSERBASE_API_KEY (and optionally BROWSERBASE_PROJECT_ID); set CHROME_PATH if auto-detect fails; EVAL_ENV is chosen at runtime.
- Filters/categories changed; remove webbench references — use b:webvoyager or b:onlineMind2Web.

^{Written for commit 5a0056d. Summary will update on new commits. Review in cubic}

changeset-bot · 2026-04-18T23:57:07Z

⚠️ No Changeset found

Latest commit: ea9895a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

cubic-dev-ai

14 issues found across 427 files

Confidence score: 2/5

There is a high-confidence async cleanup bug in packages/evals/framework/benchRunner.ts: missing await can run finally too early, potentially tearing down browser sessions while work is still in flight and masking downstream errors.
packages/evals/framework/assertions.ts and packages/evals/core/targets/localChrome.ts introduce meaningful user/runtime risk: unsanitized assertion errors may leak sensitive values, and Chrome/temp-dir cleanup can be skipped on debugger URL failures, causing resource leaks.
Given multiple medium-to-high severity findings (6–8/10) across execution flow, resource lifecycle, and task stability, this does not yet look safe to merge without fixes despite clear, targeted remediation paths.
Pay close attention to packages/evals/framework/benchRunner.ts, packages/evals/framework/assertions.ts, and packages/evals/core/targets/localChrome.ts - async lifecycle ordering, error-data exposure, and failure-path cleanup are the key risk areas.

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/suites/gaia.ts">

<violation number="1" location="packages/evals/suites/gaia.ts:11">
P2: Duplicate code: `normalizeModelEntries` is copy-pasted identically in all four suite files (`gaia.ts`, `webvoyager.ts`, `onlineMind2Web.ts`, `webtailbench.ts`). Extract it into a shared utility (e.g., `utils.ts` which is already imported here) to avoid divergence when the mode-inference logic changes.</violation>

<violation number="2" location="packages/evals/suites/gaia.ts:17">
P2: Custom agent: **Ensure we never check against hardcoded lists of allowed LLM model names**

New code introduces allowlist-based model-name checking (`AVAILABLE_CUA_MODELS.includes`) to drive runtime mode, which violates the rule against hardcoded allowed-model checks.</violation>
</file>

<file name="packages/evals/suites/onlineMind2Web.ts">

<violation number="1" location="packages/evals/suites/onlineMind2Web.ts:17">
P2: Custom agent: **Ensure we never check against hardcoded lists of allowed LLM model names**

New code gates model behavior using a hardcoded model-name allowlist (`AVAILABLE_CUA_MODELS.includes(modelName)`), violating the rule prohibiting hardcoded model-name checks.</violation>
</file>

<file name="packages/evals/framework/assertions.ts">

<violation number="1" location="packages/evals/framework/assertions.ts:30">
P1: Custom agent: **Exception and error message sanitization**

Assertion helper failures expose unsanitized runtime values in user-visible error messages, risking secret/data leakage.</violation>
</file>

<file name="packages/evals/core/targets/browserbase.ts">

<violation number="1" location="packages/evals/core/targets/browserbase.ts:7">
P2: Use `||` instead of `??` for environment variable fallback chains. If `BROWSERBASE_API_KEY` is explicitly set to an empty string (e.g., `BROWSERBASE_API_KEY=` in `.env` or Docker), `??` won't fall through to `BB_API_KEY` because `""` is not nullish. The error thrown then misleadingly says `BROWSERBASE_API_KEY` is required, even if `BB_API_KEY` holds a valid key.</violation>
</file>

<file name="packages/evals/core/targets/index.ts">

<violation number="1" location="packages/evals/core/targets/index.ts:16">
P2: `environment` and `toolSurface` are accepted in the input but never used. In particular, `environment` is not validated against the chosen `startupProfile`, so contradictory combinations (e.g., `LOCAL` + `runner_provided_browserbase_cdp`) are silently accepted. Either remove the unused fields or add a consistency check.</violation>
</file>

<file name="packages/evals/tasks/bench/act/iframe_scroll.ts">

<violation number="1" location="packages/evals/tasks/bench/act/iframe_scroll.ts:28">
P2: A 1-pixel tolerance for the 50% scroll check is extremely tight and likely to cause flaky results. Sub-pixel rendering, rounding, and varying viewport sizes can easily produce offsets of several pixels. Consider a more forgiving threshold (e.g., 50–100px or a percentage of `scrollHeight`).</violation>
</file>

<file name="packages/evals/framework/metrics.ts">

<violation number="1" location="packages/evals/framework/metrics.ts:42">
P2: `value: singleValue` is included in the multi-sample summary where it's just a duplicate of `min`. Per the interface docs, `value` is meant only for single measurements. Remove it from the multi-sample return to match the documented contract and avoid confusing consumers.</violation>
</file>

<file name="packages/evals/framework/benchRunner.ts">

<violation number="1" location="packages/evals/framework/benchRunner.ts:69">
P1: Missing `await` on `legacyFn` inside try/catch/finally. Without `await`, the `finally` block (cleanup) runs before the returned promise settles, tearing down browser sessions while the task is still running, and any rejection from `legacyFn` bypasses the `catch` block's error handling. The `definition.fn` path above correctly uses `return await`.</violation>

<violation number="2" location="packages/evals/framework/benchRunner.ts:101">
P2: `JSON.stringify` on a native `Error` produces `"{}"` because `message`/`stack`/`name` are non-enumerable. The `TaskResult.error` field will be an empty object, losing all error information. Serialize the relevant fields explicitly instead.</violation>
</file>

<file name="packages/evals/runtimePaths.ts">

<violation number="1" location="packages/evals/runtimePaths.ts:124">
P2: `path.join` will produce backslash paths on Windows, breaking consistency with the rest of this file which normalizes all paths to forward slashes. Use template literals with `/` (or `path.posix.join`) to match the convention.

(Based on your team's feedback about preferring `/` over `path.join` to avoid introducing backslashes.) [FEEDBACK_USED]</violation>
</file>

<file name="packages/evals/core/tasks/actions/click_coordinates.ts">

<violation number="1" location="packages/evals/core/tasks/actions/click_coordinates.ts:9">
P2: Missing wait-for-visible before reading the bounding box. The sibling `click.ts` task waits for the button to be visible before interacting; this task skips that step, so `page.evaluate` can run before the element is laid out, producing incorrect coordinates or a confusing `null` error instead of a proper timeout.</violation>
</file>

<file name="packages/evals/core/targets/localChrome.ts">

<violation number="1" location="packages/evals/core/targets/localChrome.ts:18">
P2: The candidates list only contains the macOS Chrome path. Add common Linux paths so auto-detection works on CI runners without requiring `CHROME_PATH`.</violation>

<violation number="2" location="packages/evals/core/targets/localChrome.ts:127">
P1: If `waitForDebuggerUrl` throws (e.g., on timeout), the spawned Chrome process and temp directory are leaked because `cleanup` is only returned on success. Wrap the call in a try/catch that terminates Chrome and removes the directory on failure.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/runner.ts">

<violation number="1" location="packages/evals/framework/runner.ts:458">
P2: The returned `experimentName` doesn't reflect the Braintrust-assigned name. `generateSummary` receives `evalResult.summary?.experimentName ?? experimentName`, but the return value still uses the local `experimentName`. Extract the resolved name into a variable and use it in both places so callers get a consistent experiment name.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tasks/bench/observe/observe_yc_startup.ts">

<violation number="1" location="packages/evals/tasks/bench/observe/observe_yc_startup.ts:28">
P1: Broken CSS selector: the `.` (dot) before `_results_18olp_345` was dropped and replaced with a space. This makes the selector invalid — it looks for a nonexistent HTML element type `_results_18olp_345` as a descendant, instead of matching a `div` with both classes. This locator will never match, silently degrading the test.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tasks/bench/observe/observe_file_uploads.ts">

<violation number="1" location="packages/evals/tasks/bench/observe/observe_file_uploads.ts:43">
P1: Missing `await v3.close()` in a `finally` block. Every other observe task cleans up the Stagehand instance this way; omitting it leaks the session.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-04-29T02:55:37Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

miguelg719 · 2026-04-29T20:37:31Z

@cubic-dev-ai review

cubic-dev-ai · 2026-04-29T20:38:14Z

@cubic-dev-ai review

@miguelg719 I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

5 issues found across 432 files

Confidence score: 3/5

There is some merge risk because several medium-severity, high-confidence findings point to user-facing error handling regressions (unsanitized exceptions and generic errors) rather than isolated style issues.
The most severe concern is in packages/evals/core/targets/browserbase.ts and packages/evals/tasks/bench/act/csr_in_oopif.ts, where generic/unsanitized errors can expose raw exception details in eval output and reduce consistency of failure handling.
Similar patterns in packages/evals/core/targets/index.ts, packages/evals/tasks/bench/act/csr_in_spif.ts, and packages/evals/tasks/bench/act/login.ts suggest a broader sanitization gap that is likely user-visible if not addressed.
Pay close attention to packages/evals/core/targets/browserbase.ts, packages/evals/tasks/bench/act/csr_in_oopif.ts, packages/evals/core/targets/index.ts, packages/evals/tasks/bench/act/csr_in_spif.ts, and packages/evals/tasks/bench/act/login.ts - user-facing errors should be typed and sanitized to avoid leaking stack/context details.

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/core/targets/browserbase.ts">

<violation number="1" location="packages/evals/core/targets/browserbase.ts:11">
P2: Custom agent: **Exception and error message sanitization**

Uses generic `new Error()` in a user-facing failure path instead of a typed/sanitized error class.</violation>
</file>

<file name="packages/evals/tasks/bench/act/csr_in_oopif.ts">

<violation number="1" location="packages/evals/tasks/bench/act/csr_in_oopif.ts:41">
P2: Custom agent: **Exception and error message sanitization**

Raw exception text is returned in user-facing eval output instead of a sanitized error message.</violation>
</file>

<file name="packages/evals/core/targets/index.ts">

<violation number="1" location="packages/evals/core/targets/index.ts:26">
P2: Custom agent: **Exception and error message sanitization**

Uses generic `Error` for user-visible validation failures instead of a typed error class.</violation>
</file>

<file name="packages/evals/tasks/bench/act/csr_in_spif.ts">

<violation number="1" location="packages/evals/tasks/bench/act/csr_in_spif.ts:40">
P2: Custom agent: **Exception and error message sanitization**

Raw exception text is returned to the user-facing failure message without sanitization.</violation>
</file>

<file name="packages/evals/tasks/bench/act/login.ts">

<violation number="1" location="packages/evals/tasks/bench/act/login.ts:32">
P2: Custom agent: **Exception and error message sanitization**

Returns the raw caught error object in the task result instead of a sanitized error payload, which can leak stack/context data into eval output.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

Entire-Checkpoint: f268fd2c0c30

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

pirate · 2026-05-01T18:35:40Z

+    case "chrome_devtools_mcp":
+      return new ChromeDevtoolsMcpTool();
+    case "browse_cli":
+      return new BrowseCliTool();


should we add more competitors? 😁

pirate reviewed Apr 20, 2026

View reviewed changes

Comment thread packages/evals/core/contracts/tool.ts

pirate approved these changes Apr 20, 2026

View reviewed changes

miguelg719 marked this pull request as ready for review April 28, 2026 20:58

cubic-dev-ai Bot reviewed Apr 28, 2026

View reviewed changes

miguelg719 added act These changes pertain to the act function extract These changes pertain to the extract function observe These changes pertain to the observe function combination These changes affect multiple Stagehand functions agent Agentic evals labels Apr 28, 2026

miguelg719 changed the title ~~Evals overhaul~~ Evals v2 Apr 28, 2026

miguelg719 removed the agent Agentic evals label Apr 28, 2026

cubic-dev-ai Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread packages/evals/framework/runner.ts Outdated

miguelg719 removed the combination These changes affect multiple Stagehand functions label Apr 28, 2026

cubic-dev-ai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread packages/evals/tasks/bench/observe/observe_yc_startup.ts Outdated

cubic-dev-ai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread packages/evals/tasks/bench/observe/observe_file_uploads.ts

miguelg719 force-pushed the evals-overhaul branch from 097068a to 4019a8a Compare April 29, 2026 20:27

miguelg719 added the agent Agentic evals label Apr 29, 2026

cubic-dev-ai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread packages/evals/core/targets/browserbase.ts

Comment thread packages/evals/tasks/bench/act/csr_in_oopif.ts

Comment thread packages/evals/core/targets/index.ts

Comment thread packages/evals/tasks/bench/act/csr_in_spif.ts

Comment thread packages/evals/tasks/bench/act/login.ts

miguelg719 removed the agent Agentic evals label Apr 29, 2026

miguelg719 added 7 commits April 30, 2026 14:58

unclean evals overhaul

aef6659

use wrapper function for all evals

e1bb6a6

Initial abstraction for core runner

1d86fda

updates on sprint 2

ff795ab

add more tool adapters and fixes

9a70879

Entire-Checkpoint: f268fd2c0c30

add browse cli as a tool

71a92bb

proper mcp adapters

72e83ff

miguelg719 and others added 25 commits April 30, 2026 14:58

first pass at bench runner

f2431a3

fixed benchmarks

9b6265c

claude agents sdk runner wired up, missing evaluator

ef24814

remove experimental claude code flag

167abf9

cleanup

f484d14

fix CI issues

eb9de77

ci updates

e3e52c8

Update packages/evals/runtimePaths.ts

2c573da

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

rerun ci

9c91315

Fix summary in CI workflow

8fc7d90

more CI fixes

12922e6

fix session and debug urls not showing in traces

0836d0b

update broken selectors

d724e2a

add timeout to wait for page load on act/heal_simple_google_search

be5d0cd

fix remaining stale evals

1587cfc

add finally block to file uploads eval

0e2e85e

fix simple_google_search

92c4b9e

update models for agent evals and summaries/previews

08eba12

address review comments

7143e1e

prettier

2687ece

Remove planning files

908c055

add stable playwright_code/cdp_code tool surfaces

b3423d6

linting

850cea8

linting

309b788

update lockfile

a1ad2e0

miguelg719 force-pushed the evals-overhaul branch from 1d9ca2c to a1ad2e0 Compare April 30, 2026 22:02

update to use hybrid models

ea9895a

pirate approved these changes May 1, 2026

View reviewed changes

pirate reviewed May 1, 2026

View reviewed changes

miguelg719 merged commit 2eb9416 into main May 1, 2026
207 checks passed

Conversation

miguelg719 commented Apr 18, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

why

Evals CLI v2.0

what changed

test plan

Summary by cubic

Uh oh!

changeset-bot Bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot commented Apr 29, 2026

Uh oh!

miguelg719 commented Apr 29, 2026

Uh oh!

cubic-dev-ai Bot commented Apr 29, 2026

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pirate May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

miguelg719 commented Apr 18, 2026 •

edited by cubic-dev-ai Bot

Loading

changeset-bot Bot commented Apr 18, 2026 •

edited

Loading