
Evals v2#2011

Merged
miguelg719 merged 41 commits into main from evals-overhaul
May 1, 2026

Conversation

Collaborator

@miguelg719 miguelg719 commented Apr 18, 2026

why

Evals CLI v2.0

Screenshot 2026-04-28 at 3 34 44 PM

what changed

You can find this diagram in ARCHITECTURE.md

flowchart TB
    subgraph Entrypoints["Entrypoints and Distribution"]
        CLI["cli.ts<br/>single-shot argv"]
        REPL["tui/repl.ts<br/>interactive shell<br/>Esc abort support"]
        BuildCLI["scripts/build-cli.ts<br/>bundled dist/cli bin"]
        BuildESM["scripts/build-esm.ts<br/>dist/esm + assets + datasets"]
        LegacyCLI["cli-legacy.ts / index.eval.ts<br/>--legacy escape hatch"]
    end

    subgraph TUI["TUI Command Layer"]
        Help["help<br/>usage text"]
        List["list<br/>registry views"]
        ConfigCmd["config<br/>defaults + core tool setup"]
        NewTask["new<br/>task scaffold"]
        Experiments["experiments<br/>Braintrust inspect/compare"]
        RunCmd["run<br/>progress + dry-run plan"]
        Parse["parse.ts<br/>flag validation<br/>target shorthand<br/>env override scope"]
        ConfigFile[("evals.config.json<br/>defaults, core config,<br/>benchmark metadata")]
    end

    subgraph Registry["Task Registry and Target Resolution"]
        RuntimePaths["runtimePaths.ts<br/>source vs dist roots"]
        Discovery["framework/discovery.ts<br/>filesystem scan<br/>category overrides<br/>lazy/eager metadata"]
        Resolver["resolveTarget()<br/>tier, category, task,<br/>partial name"]
        CoreTasks[("core/tasks/*<br/>deterministic tool tasks")]
        BenchTasks[("tasks/bench/*<br/>act, extract, observe,<br/>agent, combination, suites")]
        DefineTask["framework/defineTask.ts<br/>defineCoreTask<br/>defineBenchTask"]
    end

    subgraph Planning["Run Planning"]
        ResolvedOptions["ResolvedRunOptions<br/>env, trials, concurrency,<br/>model, provider, api,<br/>harness, agent mode"]
        BenchPlanner["framework/benchPlanner.ts<br/>model matrix<br/>suite expansion<br/>metadata/tags"]
        TaskConfig["taskConfig.ts<br/>default models<br/>agent model entries"]
        Datasets[("datasets/*<br/>WebVoyager<br/>OnlineMind2Web<br/>WebTailBench")]
        DryRun["dry-run JSON<br/>tasks, skippedTasks,<br/>envOverrides, runOptions"]
    end

    subgraph Runner["Unified Runner"]
        RunEvals["framework/runner.ts<br/>generate testcases<br/>Braintrust Eval wrapper<br/>progress callbacks"]
        BraintrustAdapter["framework/braintrust.ts<br/>lazy import<br/>traced spans"]
        Cleanup["activeRunCleanup.ts<br/>signal cleanup<br/>registered session cleanup"]
        Scoring["scoring.ts<br/>passRate<br/>exactMatch<br/>errorMatch"]
        Summary["summary.ts<br/>local run summary"]
    end

    subgraph CoreTier["Core Tier Execution"]
        CoreContext["framework/context.ts<br/>buildCoreContext"]
        FixtureServer["core/fixtures<br/>local deterministic pages"]
        CoreTargets["core/targets<br/>local Chrome<br/>Browserbase CDP"]
        CoreTools["core/tools registry<br/>understudy_code<br/>playwright_code<br/>cdp_code<br/>MCP + browse_cli"]
        CoreAssertions["assertions + metrics<br/>adapter-backed results"]
    end

    subgraph BenchTier["Bench Tier Execution"]
        BenchRunner["framework/benchRunner.ts<br/>load task module<br/>execute bench task"]
        TaskLoader["framework/taskLoader.ts<br/>defineBenchTask + legacy<br/>module loader"]
        HarnessRegistry["framework/benchHarness.ts<br/>stagehand + claude_code<br/>(codex slot pending)"]
        InitV3["initV3.ts<br/>V3 session + agent"]
        AISDK["AISdkClientWrapped<br/>provider/model client"]
        ClaudeCodeRunner["framework/claudeCodeRunner.ts<br/>Claude Code SDK adapter<br/>prompt + result parse"]
        ClaudeCodeToolAdapter["framework/claudeCodeToolAdapter.ts<br/>tool-surface handoff<br/>browse_cli wrapper"]
        ExternalHarnessPlan["framework/externalHarnessPlan.ts<br/>webvoyager / onlineMind2Web /<br/>webtailbench task plan"]
        BenchSuites["suites/*<br/>dataset-backed suite cases"]
        BenchTaskDefs["bench task modules<br/>defineBenchTask or legacy fn"]
    end

    subgraph Reporting["Reporting and Inspection"]
        Progress["tui/progress.ts<br/>live task table"]
        ResultsView["tui/results.ts<br/>model summary<br/>results table"]
        ReportLib["lib/braintrust-report.ts<br/>experiment fetch<br/>comparison data"]
        RenderReport["scripts/render-braintrust-core-report.ts<br/>core report rendering"]
    end

    subgraph External["External Systems"]
        Braintrust["Braintrust<br/>experiments, spans,<br/>scores, comparisons"]
        Browserbase["Browserbase<br/>remote browser sessions"]
        LocalBrowser["Local Chromium<br/>fixture-backed sessions"]
        Providers["LLM Providers<br/>OpenAI, Anthropic,<br/>Google, others"]
        Stagehand["Stagehand package<br/>V3, AgentProvider,<br/>available models"]
        ClaudeCodeSDK["@anthropic-ai/claude-agent-sdk<br/>external Claude Code agent<br/>headless run"]
    end

    %% Entrypoints
    CLI --> Parse
    CLI --> RunCmd
    CLI --> List
    CLI --> ConfigCmd
    CLI --> Experiments
    CLI --> NewTask
    CLI --> Help
    REPL --> Parse
    REPL --> RunCmd
    BuildCLI --> CLI
    BuildESM --> RuntimePaths
    LegacyCLI --> LegacyRunner["legacy runner path<br/>pre-refactor behavior"]

    %% TUI and config
    ConfigCmd <--> ConfigFile
    Parse --> ConfigFile
    Parse --> ResolvedOptions
    Parse --> DryRun
    RunCmd --> ResolvedOptions
    RunCmd --> RuntimePaths
    RunCmd --> Discovery
    List --> Discovery
    NewTask --> CoreTasks
    NewTask --> BenchTasks
    Experiments --> ReportLib

    %% Registry
    RuntimePaths --> CoreTasks
    RuntimePaths --> BenchTasks
    CoreTasks --> Discovery
    BenchTasks --> Discovery
    DefineTask --> CoreTasks
    DefineTask --> BenchTasks
    Discovery --> Resolver
    Resolver --> RunCmd

    %% Planning
    RunCmd --> BenchPlanner
    RunCmd --> DryRun
    ResolvedOptions --> BenchPlanner
    TaskConfig --> BenchPlanner
    Datasets --> BenchSuites
    BenchSuites --> BenchPlanner
    BenchPlanner --> RunEvals
    Resolver --> RunEvals

    %% Runner
    RunCmd --> RunEvals
    RunEvals --> BraintrustAdapter
    RunEvals --> Cleanup
    RunEvals --> Scoring
    RunEvals --> Summary
    RunEvals --> CoreContext
    RunEvals --> BenchRunner
    RunEvals --> Progress
    Progress --> ResultsView

    %% Core path
    CoreContext --> FixtureServer
    CoreContext --> CoreTargets
    CoreContext --> CoreTools
    CoreContext --> CoreAssertions
    CoreTargets --> LocalBrowser
    CoreTargets --> Browserbase
    CoreTools --> LocalBrowser
    CoreTools --> Browserbase

    %% Bench path
    BenchRunner --> HarnessRegistry
    BenchRunner --> TaskLoader
    TaskLoader --> BenchTaskDefs
    HarnessRegistry --> InitV3
    InitV3 --> Stagehand
    InitV3 --> Browserbase
    InitV3 --> LocalBrowser
    HarnessRegistry --> AISDK
    AISDK --> Providers
    BenchRunner --> BenchTaskDefs
    BenchTaskDefs --> Stagehand
    BenchSuites --> Datasets

    %% Claude Code external harness path
    HarnessRegistry --> ClaudeCodeRunner
    HarnessRegistry --> ClaudeCodeToolAdapter
    HarnessRegistry --> ExternalHarnessPlan
    ClaudeCodeToolAdapter --> CoreTools
    ClaudeCodeToolAdapter --> CoreTargets
    ClaudeCodeRunner --> ClaudeCodeSDK
    ClaudeCodeSDK --> Providers

    %% Reporting and external
    BraintrustAdapter --> Braintrust
    ReportLib --> Braintrust
    RenderReport --> ReportLib
    Summary --> Braintrust
    Stagehand --> Providers

    %% Styling
    style Entrypoints fill:#263238,stroke:#111111,stroke-width:3px,color:#ffffff
    style TUI fill:#0d47a1,stroke:#082d69,stroke-width:3px,color:#ffffff
    style Registry fill:#1b5e20,stroke:#0d3010,stroke-width:3px,color:#ffffff
    style Planning fill:#6a1b9a,stroke:#4a148c,stroke-width:3px,color:#ffffff
    style Runner fill:#004d40,stroke:#00251f,stroke-width:3px,color:#ffffff
    style CoreTier fill:#bf360c,stroke:#7f2207,stroke-width:3px,color:#ffffff
    style BenchTier fill:#1565c0,stroke:#0d47a1,stroke-width:3px,color:#ffffff
    style Reporting fill:#4e342e,stroke:#2d1b17,stroke-width:3px,color:#ffffff
    style External fill:#7f0000,stroke:#4d0000,stroke-width:3px,color:#ffffff

    style CLI fill:#eceff1,stroke:#263238,stroke-width:2px,color:#000000
    style REPL fill:#eceff1,stroke:#263238,stroke-width:2px,color:#000000
    style BuildCLI fill:#cfd8dc,stroke:#263238,stroke-width:2px,color:#000000
    style BuildESM fill:#cfd8dc,stroke:#263238,stroke-width:2px,color:#000000
    style LegacyCLI fill:#ffcdd2,stroke:#b71c1c,stroke-width:2px,color:#000000
    style LegacyRunner fill:#ffcdd2,stroke:#b71c1c,stroke-width:2px,color:#000000

    style Parse fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000000
    style RunCmd fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000000
    style ConfigFile fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px,color:#000000
    style Discovery fill:#c8e6c9,stroke:#1b5e20,stroke-width:2px,color:#000000
    style Resolver fill:#a5d6a7,stroke:#1b5e20,stroke-width:2px,color:#000000
    style CoreTasks fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000000
    style BenchTasks fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000000

    style BenchPlanner fill:#e1bee7,stroke:#6a1b9a,stroke-width:2px,color:#000000
    style Datasets fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000000
    style RunEvals fill:#b2dfdb,stroke:#004d40,stroke-width:2px,color:#000000
    style BraintrustAdapter fill:#e0f2f1,stroke:#004d40,stroke-width:2px,color:#000000
    style Cleanup fill:#e0f2f1,stroke:#004d40,stroke-width:2px,color:#000000

    style CoreContext fill:#ffe0b2,stroke:#bf360c,stroke-width:2px,color:#000000
    style CoreTargets fill:#ffcc80,stroke:#bf360c,stroke-width:2px,color:#000000
    style CoreTools fill:#ffcc80,stroke:#bf360c,stroke-width:2px,color:#000000
    style BenchRunner fill:#bbdefb,stroke:#1565c0,stroke-width:2px,color:#000000
    style HarnessRegistry fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style InitV3 fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style TaskLoader fill:#bbdefb,stroke:#1565c0,stroke-width:2px,color:#000000
    style ClaudeCodeRunner fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style ClaudeCodeToolAdapter fill:#90caf9,stroke:#1565c0,stroke-width:2px,color:#000000
    style ExternalHarnessPlan fill:#bbdefb,stroke:#1565c0,stroke-width:2px,color:#000000

    style Braintrust fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style Browserbase fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style Providers fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style Stagehand fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000
    style ClaudeCodeSDK fill:#ffcdd2,stroke:#7f0000,stroke-width:2px,color:#000000


test plan


Summary by cubic

Evals v2 rebuilds the suite into a core + bench system with auto-discovered tasks, adapter-backed tools, a unified Braintrust runner, and a stable CLI/TUI. It now stabilizes the playwright_code and cdp_code tool surfaces, adds browse_cli and MCP adapters, ships a Braintrust comparison report, fixes simple_google_search, guards file-upload evals with cleanup-on-failure, and removes outdated planning docs.

  • New Features

    • Core: deterministic tasks under packages/evals/core/tasks/**, shared contracts/assertions/metrics, local fixture server, lazy getEnv(), runner-provided Chrome targets (local, Browserbase) with artifacts/cleanup and a shared cleanup registry.
    • Tools: stable playwright_code and cdp_code; plus understudy_code, playwright_mcp, chrome_devtools_mcp, browse_cli.
    • Bench: matrix planner and external-suite support (WebVoyager, OnlineMind2Web, WebTailBench); harness registry with stagehand and claude_code, Claude Code SDK boundary, and a browse_cli tool handoff; bench runner propagates sessionUrl/debugUrl.
    • CLI/TUI: stable dist/cli (pnpm evals ...) with bin/evals, experiments command, startup warning suppression, quieter logs via EvalLogger echo toggle; legacy CLI via pnpm evals:old.
    • Reporting: Braintrust data layer (lib/braintrust-report.ts), report:core script, and summaries that include the experiment URL and scores.
    • CI/Deps/Docs: defaults updated (EVAL_AGENT_MODELS=openai/gpt-5.4-mini, anthropic/claude-haiku-4-5, google/gemini-3-flash-preview; EVAL_CATEGORIES includes regression); .gitignore ignores Playwright/MCP artifacts; version bumped to 2.0.0; add playwright, @anthropic-ai/claude-agent-sdk, and vitest; updated README, architecture diagram, and skills/browser guide.
  • Migration

    • Tasks are auto-discovered; remove task lists from evals.config.json.
    • Bench tasks must default-export defineBenchTask({ name }) under packages/evals/tasks/bench/**; core tasks live under packages/evals/core/tasks/** via defineCoreTask.
    • Use the new CLI (pnpm evals ... or the evals bin); the legacy interface remains via pnpm evals:old.
    • Set BROWSERBASE_API_KEY (and optionally BROWSERBASE_PROJECT_ID); set CHROME_PATH if auto-detect fails; EVAL_ENV is chosen at runtime.
    • Filters/categories changed; remove webbench references — use b:webvoyager or b:onlineMind2Web.
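
The migration bullets above can be sketched as a minimal bench task module. The real `defineBenchTask` lives in `framework/defineTask.ts`; the stand-in below, its field names, and the task name are assumptions for illustration only.

```typescript
// Hypothetical stand-in for the package's defineBenchTask helper; the real
// signature is in packages/evals/framework/defineTask.ts.
interface BenchTaskDefinition {
  name: string;
  fn: (ctx: { input: string }) => Promise<{ success: boolean }>;
}

function defineBenchTask(def: BenchTaskDefinition): BenchTaskDefinition {
  // Assumed to simply return the definition for the task loader to register.
  return def;
}

// A migrated module under packages/evals/tasks/bench/** would look like this.
const task = defineBenchTask({
  name: "example_task", // hypothetical task name
  fn: async () => ({ success: true }),
});

// The real module would end with: export default task;
```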

Written for commit 5a0056d. Summary will update on new commits. Review in cubic

@changeset-bot

changeset-bot Bot commented Apr 18, 2026

⚠️ No Changeset found

Latest commit: ea9895a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Comment thread packages/evals/core/contracts/tool.ts
@miguelg719 miguelg719 marked this pull request as ready for review April 28, 2026 20:58
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


14 issues found across 427 files

Confidence score: 2/5

  • There is a high-confidence async cleanup bug in packages/evals/framework/benchRunner.ts: missing await can run finally too early, potentially tearing down browser sessions while work is still in flight and masking downstream errors.
  • packages/evals/framework/assertions.ts and packages/evals/core/targets/localChrome.ts introduce meaningful user/runtime risk: unsanitized assertion errors may leak sensitive values, and Chrome/temp-dir cleanup can be skipped on debugger URL failures, causing resource leaks.
  • Given multiple medium-to-high severity findings (6–8/10) across execution flow, resource lifecycle, and task stability, this does not yet look safe to merge without fixes despite clear, targeted remediation paths.
  • Pay close attention to packages/evals/framework/benchRunner.ts, packages/evals/framework/assertions.ts, and packages/evals/core/targets/localChrome.ts - async lifecycle ordering, error-data exposure, and failure-path cleanup are the key risk areas.

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/suites/gaia.ts">

<violation number="1" location="packages/evals/suites/gaia.ts:11">
P2: Duplicate code: `normalizeModelEntries` is copy-pasted identically in all four suite files (`gaia.ts`, `webvoyager.ts`, `onlineMind2Web.ts`, `webtailbench.ts`). Extract it into a shared utility (e.g., `utils.ts` which is already imported here) to avoid divergence when the mode-inference logic changes.</violation>

<violation number="2" location="packages/evals/suites/gaia.ts:17">
P2: Custom agent: **Ensure we never check against hardcoded lists of allowed LLM model names**

New code introduces allowlist-based model-name checking (`AVAILABLE_CUA_MODELS.includes`) to drive runtime mode, which violates the rule against hardcoded allowed-model checks.</violation>
</file>

<file name="packages/evals/suites/onlineMind2Web.ts">

<violation number="1" location="packages/evals/suites/onlineMind2Web.ts:17">
P2: Custom agent: **Ensure we never check against hardcoded lists of allowed LLM model names**

New code gates model behavior using a hardcoded model-name allowlist (`AVAILABLE_CUA_MODELS.includes(modelName)`), violating the rule prohibiting hardcoded model-name checks.</violation>
</file>

<file name="packages/evals/framework/assertions.ts">

<violation number="1" location="packages/evals/framework/assertions.ts:30">
P1: Custom agent: **Exception and error message sanitization**

Assertion helper failures expose unsanitized runtime values in user-visible error messages, risking secret/data leakage.</violation>
</file>

<file name="packages/evals/core/targets/browserbase.ts">

<violation number="1" location="packages/evals/core/targets/browserbase.ts:7">
P2: Use `||` instead of `??` for environment variable fallback chains. If `BROWSERBASE_API_KEY` is explicitly set to an empty string (e.g., `BROWSERBASE_API_KEY=` in `.env` or Docker), `??` won't fall through to `BB_API_KEY` because `""` is not nullish. The error thrown then misleadingly says `BROWSERBASE_API_KEY` is required, even if `BB_API_KEY` holds a valid key.</violation>
</file>
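
The `??` vs `||` distinction in the finding above can be shown directly. The helper names and key values below are illustrative, not from the target source:

```typescript
// ?? only falls through on null/undefined; || falls through on any falsy
// value, including the empty string from BROWSERBASE_API_KEY="".
function pickKeyNullish(primary: string | undefined, fallback: string | undefined) {
  return primary ?? fallback; // "" is not nullish, so "" is returned
}

function pickKeyFalsy(primary: string | undefined, fallback: string | undefined) {
  return primary || fallback; // "" is falsy, so the fallback key is used
}

// Simulates BROWSERBASE_API_KEY set to "" while BB_API_KEY holds a real key:
const nullishResult = pickKeyNullish("", "bb-key");
const falsyResult = pickKeyFalsy("", "bb-key");
```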

<file name="packages/evals/core/targets/index.ts">

<violation number="1" location="packages/evals/core/targets/index.ts:16">
P2: `environment` and `toolSurface` are accepted in the input but never used. In particular, `environment` is not validated against the chosen `startupProfile`, so contradictory combinations (e.g., `LOCAL` + `runner_provided_browserbase_cdp`) are silently accepted. Either remove the unused fields or add a consistency check.</violation>
</file>

<file name="packages/evals/tasks/bench/act/iframe_scroll.ts">

<violation number="1" location="packages/evals/tasks/bench/act/iframe_scroll.ts:28">
P2: A 1-pixel tolerance for the 50% scroll check is extremely tight and likely to cause flaky results. Sub-pixel rendering, rounding, and varying viewport sizes can easily produce offsets of several pixels. Consider a more forgiving threshold (e.g., 50–100px or a percentage of `scrollHeight`).</violation>
</file>
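
A more forgiving 50%-scroll check, as suggested above, might look like this. The threshold value and function name are illustrative, not from the task source:

```typescript
// Checks whether scrollTop is within tolerancePx of the halfway point of the
// scrollable range, instead of requiring a 1px-exact match.
function isScrolledToMiddle(
  scrollTop: number,
  scrollHeight: number,
  clientHeight: number,
  tolerancePx = 75, // assumed tolerance; the review suggests 50-100px
): boolean {
  const expected = (scrollHeight - clientHeight) / 2;
  return Math.abs(scrollTop - expected) <= tolerancePx;
}
```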

<file name="packages/evals/framework/metrics.ts">

<violation number="1" location="packages/evals/framework/metrics.ts:42">
P2: `value: singleValue` is included in the multi-sample summary where it's just a duplicate of `min`. Per the interface docs, `value` is meant only for single measurements. Remove it from the multi-sample return to match the documented contract and avoid confusing consumers.</violation>
</file>

<file name="packages/evals/framework/benchRunner.ts">

<violation number="1" location="packages/evals/framework/benchRunner.ts:69">
P1: Missing `await` on `legacyFn` inside try/catch/finally. Without `await`, the `finally` block (cleanup) runs before the returned promise settles, tearing down browser sessions while the task is still running, and any rejection from `legacyFn` bypasses the `catch` block's error handling. The `definition.fn` path above correctly uses `return await`.</violation>

<violation number="2" location="packages/evals/framework/benchRunner.ts:101">
P2: `JSON.stringify` on a native `Error` produces `"{}"` because `message`/`stack`/`name` are non-enumerable. The `TaskResult.error` field will be an empty object, losing all error information. Serialize the relevant fields explicitly instead.</violation>
</file>
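
Both benchRunner findings are easy to reproduce in isolation. The snippet below (all names illustrative) shows that `return work()` without `await` lets `finally` run before the promise settles, and that `JSON.stringify` on a native `Error` yields `"{}"` unless fields are serialized explicitly:

```typescript
const order: string[] = [];

async function runWithoutAwait(work: () => Promise<void>) {
  try {
    return work(); // missing await: finally fires before `work` resolves
  } finally {
    order.push("cleanup");
  }
}

async function demo() {
  await runWithoutAwait(async () => {
    await new Promise((r) => setTimeout(r, 10));
    order.push("task"); // lands after "cleanup" because of the missing await
  });
}

// Explicit field serialization preserves the error information that
// JSON.stringify(err) would drop (message/name/stack are non-enumerable).
function serializeError(err: Error): string {
  return JSON.stringify({ name: err.name, message: err.message });
}
```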

<file name="packages/evals/runtimePaths.ts">

<violation number="1" location="packages/evals/runtimePaths.ts:124">
P2: `path.join` will produce backslash paths on Windows, breaking consistency with the rest of this file which normalizes all paths to forward slashes. Use template literals with `/` (or `path.posix.join`) to match the convention.

(Based on your team's feedback about preferring `/` over `path.join` to avoid introducing backslashes.) [FEEDBACK_USED]</violation>
</file>

<file name="packages/evals/core/tasks/actions/click_coordinates.ts">

<violation number="1" location="packages/evals/core/tasks/actions/click_coordinates.ts:9">
P2: Missing wait-for-visible before reading the bounding box. The sibling `click.ts` task waits for the button to be visible before interacting; this task skips that step, so `page.evaluate` can run before the element is laid out, producing incorrect coordinates or a confusing `null` error instead of a proper timeout.</violation>
</file>

<file name="packages/evals/core/targets/localChrome.ts">

<violation number="1" location="packages/evals/core/targets/localChrome.ts:18">
P2: The candidates list only contains the macOS Chrome path. Add common Linux paths so auto-detection works on CI runners without requiring `CHROME_PATH`.</violation>

<violation number="2" location="packages/evals/core/targets/localChrome.ts:127">
P1: If `waitForDebuggerUrl` throws (e.g., on timeout), the spawned Chrome process and temp directory are leaked because `cleanup` is only returned on success. Wrap the call in a try/catch that terminates Chrome and removes the directory on failure.</violation>
</file>
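
The failure-path cleanup the localChrome finding asks for follows a generic pattern: tear down what you launched before rethrowing. Everything below is an illustrative stand-in, not the actual target implementation:

```typescript
interface Launched {
  kill(): void; // stand-in for terminating Chrome and removing temp dirs
}

async function launchWithCleanup(
  launch: () => Launched,
  waitForDebuggerUrl: () => Promise<string>,
): Promise<{ proc: Launched; url: string }> {
  const proc = launch();
  try {
    const url = await waitForDebuggerUrl();
    return { proc, url };
  } catch (err) {
    proc.kill(); // clean up on timeout/failure instead of leaking the process
    throw err;
  }
}
```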

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/evals/framework/assertions.ts
Comment thread packages/evals/framework/benchRunner.ts Outdated
Comment thread packages/evals/core/targets/localChrome.ts Outdated
Comment thread packages/evals/suites/gaia.ts Outdated
Comment thread packages/evals/suites/onlineMind2Web.ts Outdated
Comment thread packages/evals/framework/benchRunner.ts Outdated
Comment thread packages/evals/runtimePaths.ts Outdated
Comment thread packages/evals/core/tasks/actions/click_coordinates.ts
Comment thread packages/evals/suites/gaia.ts Outdated
Comment thread packages/evals/core/targets/localChrome.ts
@miguelg719 miguelg719 added the act, extract, observe, combination, and agent labels Apr 28, 2026
@miguelg719 miguelg719 changed the title Evals overhaul Evals v2 Apr 28, 2026
@miguelg719 miguelg719 removed the agent Agentic evals label Apr 28, 2026
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/runner.ts">

<violation number="1" location="packages/evals/framework/runner.ts:458">
P2: The returned `experimentName` doesn't reflect the Braintrust-assigned name. `generateSummary` receives `evalResult.summary?.experimentName ?? experimentName`, but the return value still uses the local `experimentName`. Extract the resolved name into a variable and use it in both places so callers get a consistent experiment name.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/evals/framework/runner.ts Outdated
@miguelg719 miguelg719 removed the combination These changes affect multiple Stagehand functions label Apr 28, 2026
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tasks/bench/observe/observe_yc_startup.ts">

<violation number="1" location="packages/evals/tasks/bench/observe/observe_yc_startup.ts:28">
P1: Broken CSS selector: the `.` (dot) before `_results_18olp_345` was dropped and replaced with a space. This makes the selector invalid — it looks for a nonexistent HTML element type `_results_18olp_345` as a descendant, instead of matching a `div` with both classes. This locator will never match, silently degrading the test.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/evals/tasks/bench/observe/observe_yc_startup.ts Outdated
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/tasks/bench/observe/observe_file_uploads.ts">

<violation number="1" location="packages/evals/tasks/bench/observe/observe_file_uploads.ts:43">
P1: Missing `await v3.close()` in a `finally` block. Every other observe task cleans up the Stagehand instance this way; omitting it leaks the session.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/evals/tasks/bench/observe/observe_file_uploads.ts
@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Apr 29, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

@miguelg719 miguelg719 added the agent Agentic evals label Apr 29, 2026
@miguelg719
Collaborator Author

@cubic-dev-ai review

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented Apr 29, 2026

@cubic-dev-ai review

@miguelg719 I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


5 issues found across 432 files

Confidence score: 3/5

  • There is some merge risk because several medium-severity, high-confidence findings point to user-facing error handling regressions (unsanitized exceptions and generic errors) rather than isolated style issues.
  • The most severe concern is in packages/evals/core/targets/browserbase.ts and packages/evals/tasks/bench/act/csr_in_oopif.ts, where generic/unsanitized errors can expose raw exception details in eval output and reduce consistency of failure handling.
  • Similar patterns in packages/evals/core/targets/index.ts, packages/evals/tasks/bench/act/csr_in_spif.ts, and packages/evals/tasks/bench/act/login.ts suggest a broader sanitization gap that is likely user-visible if not addressed.
  • Pay close attention to packages/evals/core/targets/browserbase.ts, packages/evals/tasks/bench/act/csr_in_oopif.ts, packages/evals/core/targets/index.ts, packages/evals/tasks/bench/act/csr_in_spif.ts, and packages/evals/tasks/bench/act/login.ts - user-facing errors should be typed and sanitized to avoid leaking stack/context details.

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/core/targets/browserbase.ts">

<violation number="1" location="packages/evals/core/targets/browserbase.ts:11">
P2: Custom agent: **Exception and error message sanitization**

Uses generic `new Error()` in a user-facing failure path instead of a typed/sanitized error class.</violation>
</file>

<file name="packages/evals/tasks/bench/act/csr_in_oopif.ts">

<violation number="1" location="packages/evals/tasks/bench/act/csr_in_oopif.ts:41">
P2: Custom agent: **Exception and error message sanitization**

Raw exception text is returned in user-facing eval output instead of a sanitized error message.</violation>
</file>

<file name="packages/evals/core/targets/index.ts">

<violation number="1" location="packages/evals/core/targets/index.ts:26">
P2: Custom agent: **Exception and error message sanitization**

Uses generic `Error` for user-visible validation failures instead of a typed error class.</violation>
</file>

<file name="packages/evals/tasks/bench/act/csr_in_spif.ts">

<violation number="1" location="packages/evals/tasks/bench/act/csr_in_spif.ts:40">
P2: Custom agent: **Exception and error message sanitization**

Raw exception text is returned to the user-facing failure message without sanitization.</violation>
</file>

<file name="packages/evals/tasks/bench/act/login.ts">

<violation number="1" location="packages/evals/tasks/bench/act/login.ts:32">
P2: Custom agent: **Exception and error message sanitization**

Returns the raw caught error object in the task result instead of a sanitized error payload, which can leak stack/context data into eval output.</violation>
</file>
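
The sanitization findings above all reduce to one pattern: map the raw caught value to a typed, bounded payload before it reaches eval output. The shape and names below are assumed for illustration:

```typescript
interface SanitizedError {
  message: string;
}

// Returns only a truncated message: no stack trace, no raw non-Error values,
// no nested context objects that could carry secrets into eval output.
function sanitizeError(err: unknown, maxLen = 200): SanitizedError {
  const message = err instanceof Error ? err.message : "Unknown error";
  return { message: message.slice(0, maxLen) };
}
```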

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread packages/evals/core/targets/browserbase.ts
Comment thread packages/evals/tasks/bench/act/csr_in_oopif.ts
Comment thread packages/evals/core/targets/index.ts
Comment thread packages/evals/tasks/bench/act/csr_in_spif.ts
Comment thread packages/evals/tasks/bench/act/login.ts
@miguelg719 miguelg719 removed the agent Agentic evals label Apr 29, 2026
case "chrome_devtools_mcp":
return new ChromeDevtoolsMcpTool();
case "browse_cli":
return new BrowseCliTool();
Member


should we add more competitors? 😁

@miguelg719 miguelg719 merged commit 2eb9416 into main May 1, 2026
207 checks passed