Evals v2 #2011
14 issues found across 427 files
Confidence score: 2/5
- There is a high-confidence async cleanup bug in `packages/evals/framework/benchRunner.ts`: a missing `await` can run `finally` too early, potentially tearing down browser sessions while work is still in flight and masking downstream errors.
- `packages/evals/framework/assertions.ts` and `packages/evals/core/targets/localChrome.ts` introduce meaningful user/runtime risk: unsanitized assertion errors may leak sensitive values, and Chrome/temp-dir cleanup can be skipped on debugger URL failures, causing resource leaks.
- Given multiple medium-to-high severity findings (6–8/10) across execution flow, resource lifecycle, and task stability, this does not yet look safe to merge without fixes, despite clear, targeted remediation paths.
- Pay close attention to `packages/evals/framework/benchRunner.ts`, `packages/evals/framework/assertions.ts`, and `packages/evals/core/targets/localChrome.ts`: async lifecycle ordering, error-data exposure, and failure-path cleanup are the key risk areas.
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/suites/gaia.ts">
<violation number="1" location="packages/evals/suites/gaia.ts:11">
P2: Duplicate code: `normalizeModelEntries` is copy-pasted identically in all four suite files (`gaia.ts`, `webvoyager.ts`, `onlineMind2Web.ts`, `webtailbench.ts`). Extract it into a shared utility (e.g., `utils.ts` which is already imported here) to avoid divergence when the mode-inference logic changes.</violation>
<violation number="2" location="packages/evals/suites/gaia.ts:17">
P2: Custom agent: **Ensure we never check against hardcoded lists of allowed LLM model names**
New code introduces allowlist-based model-name checking (`AVAILABLE_CUA_MODELS.includes`) to drive runtime mode, which violates the rule against hardcoded allowed-model checks.</violation>
</file>
<file name="packages/evals/suites/onlineMind2Web.ts">
<violation number="1" location="packages/evals/suites/onlineMind2Web.ts:17">
P2: Custom agent: **Ensure we never check against hardcoded lists of allowed LLM model names**
New code gates model behavior using a hardcoded model-name allowlist (`AVAILABLE_CUA_MODELS.includes(modelName)`), violating the rule prohibiting hardcoded model-name checks.</violation>
</file>
<file name="packages/evals/framework/assertions.ts">
<violation number="1" location="packages/evals/framework/assertions.ts:30">
P1: Custom agent: **Exception and error message sanitization**
Assertion helper failures expose unsanitized runtime values in user-visible error messages, risking secret/data leakage.</violation>
</file>
<file name="packages/evals/core/targets/browserbase.ts">
<violation number="1" location="packages/evals/core/targets/browserbase.ts:7">
P2: Use `||` instead of `??` for environment variable fallback chains. If `BROWSERBASE_API_KEY` is explicitly set to an empty string (e.g., `BROWSERBASE_API_KEY=` in `.env` or Docker), `??` won't fall through to `BB_API_KEY` because `""` is not nullish. The error thrown then misleadingly says `BROWSERBASE_API_KEY` is required, even if `BB_API_KEY` holds a valid key.</violation>
</file>
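A minimal sketch of the suggested fallback, assuming `BB_API_KEY` is the secondary variable named in the finding. `||` treats an empty string as unset, while `??` only falls through on `null`/`undefined`:

```typescript
// Hypothetical helper; the real browserbase.ts may structure this differently.
function resolveBrowserbaseKey(
  env: Record<string, string | undefined>,
): string {
  // With `??`, an explicitly empty BROWSERBASE_API_KEY ("") would win and
  // mask a valid BB_API_KEY; `||` falls through on the empty string too.
  const key = env.BROWSERBASE_API_KEY || env.BB_API_KEY;
  if (!key) {
    throw new Error("BROWSERBASE_API_KEY (or BB_API_KEY) is required");
  }
  return key;
}
```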
<file name="packages/evals/core/targets/index.ts">
<violation number="1" location="packages/evals/core/targets/index.ts:16">
P2: `environment` and `toolSurface` are accepted in the input but never used. In particular, `environment` is not validated against the chosen `startupProfile`, so contradictory combinations (e.g., `LOCAL` + `runner_provided_browserbase_cdp`) are silently accepted. Either remove the unused fields or add a consistency check.</violation>
</file>
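The consistency check could look like the following sketch; the field names come from the finding, and the full input type is an assumption:

```typescript
interface TargetInput {
  environment?: "LOCAL" | "BROWSERBASE";
  startupProfile: string;
}

// Reject contradictory combinations instead of silently accepting them.
function validateTargetInput(input: TargetInput): void {
  if (
    input.environment === "LOCAL" &&
    input.startupProfile === "runner_provided_browserbase_cdp"
  ) {
    throw new Error(
      `environment LOCAL is incompatible with startupProfile ${input.startupProfile}`,
    );
  }
}
```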
<file name="packages/evals/tasks/bench/act/iframe_scroll.ts">
<violation number="1" location="packages/evals/tasks/bench/act/iframe_scroll.ts:28">
P2: A 1-pixel tolerance for the 50% scroll check is extremely tight and likely to cause flaky results. Sub-pixel rendering, rounding, and varying viewport sizes can easily produce offsets of several pixels. Consider a more forgiving threshold (e.g., 50–100px or a percentage of `scrollHeight`).</violation>
</file>
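The looser threshold could look like this sketch (names are hypothetical; the real task computes the scroll position inside `page.evaluate`):

```typescript
// Check that scrollTop sits in a forgiving band around the 50% mark.
// Tolerance is the larger of 50px or 5% of scrollHeight, per the suggestion.
function isScrolledAboutHalfway(
  scrollTop: number,
  scrollHeight: number,
  clientHeight: number,
): boolean {
  const target = (scrollHeight - clientHeight) / 2;
  const tolerance = Math.max(50, scrollHeight * 0.05);
  return Math.abs(scrollTop - target) <= tolerance;
}
```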
<file name="packages/evals/framework/metrics.ts">
<violation number="1" location="packages/evals/framework/metrics.ts:42">
P2: `value: singleValue` is included in the multi-sample summary where it's just a duplicate of `min`. Per the interface docs, `value` is meant only for single measurements. Remove it from the multi-sample return to match the documented contract and avoid confusing consumers.</violation>
</file>
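A sketch of the documented contract, with an assumed summary interface (the real `metrics.ts` shape may differ):

```typescript
interface MetricSummary {
  value?: number; // populated only for single measurements, per the docs
  min: number;
  max: number;
  mean: number;
}

function summarize(samples: number[]): MetricSummary {
  const min = Math.min(...samples);
  const max = Math.max(...samples);
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  if (samples.length === 1) {
    return { value: samples[0], min, max, mean };
  }
  // Multi-sample: omit `value` so it never duplicates `min`.
  return { min, max, mean };
}
```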
<file name="packages/evals/framework/benchRunner.ts">
<violation number="1" location="packages/evals/framework/benchRunner.ts:69">
P1: Missing `await` on `legacyFn` inside try/catch/finally. Without `await`, the `finally` block (cleanup) runs before the returned promise settles, tearing down browser sessions while the task is still running, and any rejection from `legacyFn` bypasses the `catch` block's error handling. The `definition.fn` path above correctly uses `return await`.</violation>
<violation number="2" location="packages/evals/framework/benchRunner.ts:101">
P2: `JSON.stringify` on a native `Error` produces `"{}"` because `message`/`stack`/`name` are non-enumerable. The `TaskResult.error` field will be an empty object, losing all error information. Serialize the relevant fields explicitly instead.</violation>
</file>
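Both benchRunner findings can be sketched together; `legacyFn`, `cleanup`, and the result shape are stand-ins for the actual runner code:

```typescript
// Serialize Error fields explicitly: JSON.stringify(new Error(...)) yields
// "{}" because name/message/stack are non-enumerable.
function serializeError(
  err: unknown,
): { name: string; message: string; stack?: string } {
  if (err instanceof Error) {
    return { name: err.name, message: err.message, stack: err.stack };
  }
  return { name: "UnknownError", message: String(err) };
}

// `return await` makes the promise settle inside the try block, so the
// finally-based cleanup runs only after the task finishes, and rejections
// reach the catch block instead of escaping unserialized.
async function runLegacy(
  legacyFn: () => Promise<{ success: boolean }>,
  cleanup: () => Promise<void>,
): Promise<{ success: boolean; error?: ReturnType<typeof serializeError> }> {
  try {
    return await legacyFn();
  } catch (err) {
    return { success: false, error: serializeError(err) };
  } finally {
    await cleanup();
  }
}
```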
<file name="packages/evals/runtimePaths.ts">
<violation number="1" location="packages/evals/runtimePaths.ts:124">
P2: `path.join` will produce backslash paths on Windows, breaking consistency with the rest of this file which normalizes all paths to forward slashes. Use template literals with `/` (or `path.posix.join`) to match the convention.
(Based on your team's feedback about preferring `/` over `path.join` to avoid introducing backslashes.) [FEEDBACK_USED]</violation>
</file>
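For the path convention, `path.posix.join` produces `/`-separated output on every platform; the segment list below is illustrative, not the real one from `runtimePaths.ts`:

```typescript
import path from "node:path";

// path.join would emit "dist\\esm\\tasks" on Windows; the posix variant
// always uses forward slashes, matching the rest of the file.
const distTasks = path.posix.join("dist", "esm", "tasks");
```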
<file name="packages/evals/core/tasks/actions/click_coordinates.ts">
<violation number="1" location="packages/evals/core/tasks/actions/click_coordinates.ts:9">
P2: Missing wait-for-visible before reading the bounding box. The sibling `click.ts` task waits for the button to be visible before interacting; this task skips that step, so `page.evaluate` can run before the element is laid out, producing incorrect coordinates or a confusing `null` error instead of a proper timeout.</violation>
</file>
<file name="packages/evals/core/targets/localChrome.ts">
<violation number="1" location="packages/evals/core/targets/localChrome.ts:18">
P2: The candidates list only contains the macOS Chrome path. Add common Linux paths so auto-detection works on CI runners without requiring `CHROME_PATH`.</violation>
<violation number="2" location="packages/evals/core/targets/localChrome.ts:127">
P1: If `waitForDebuggerUrl` throws (e.g., on timeout), the spawned Chrome process and temp directory are leaked because `cleanup` is only returned on success. Wrap the call in a try/catch that terminates Chrome and removes the directory on failure.</violation>
</file>
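A generic sketch of the suggested fix for the second finding: if acquiring the resource fails, run the cleanup before rethrowing so nothing leaks. Here `acquire` and `cleanup` stand in for `waitForDebuggerUrl` and the Chrome/temp-dir teardown:

```typescript
async function acquireOrCleanup<R>(
  acquire: () => Promise<R>,
  cleanup: () => Promise<void>,
): Promise<{ resource: R; cleanup: () => Promise<void> }> {
  try {
    const resource = await acquire();
    // Success: hand cleanup back to the caller, as before.
    return { resource, cleanup };
  } catch (err) {
    // Failure: the caller never receives `cleanup`, so run it here
    // (terminate Chrome, remove the temp profile dir) before rethrowing.
    await cleanup();
    throw err;
  }
}
```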
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/framework/runner.ts">
<violation number="1" location="packages/evals/framework/runner.ts:458">
P2: The returned `experimentName` doesn't reflect the Braintrust-assigned name. `generateSummary` receives `evalResult.summary?.experimentName ?? experimentName`, but the return value still uses the local `experimentName`. Extract the resolved name into a variable and use it in both places so callers get a consistent experiment name.</violation>
</file>
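The suggested fix resolves the name once and uses the same value everywhere; the types below are stand-ins for the real runner types:

```typescript
interface EvalResultLike {
  summary?: { experimentName?: string };
}

// Prefer the Braintrust-assigned experiment name, falling back to the
// locally generated one; callers then see a consistent name in both the
// summary and the returned result.
function resolveExperimentName(
  evalResult: EvalResultLike,
  localExperimentName: string,
): string {
  return evalResult.summary?.experimentName ?? localExperimentName;
}
```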
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tasks/bench/observe/observe_yc_startup.ts">
<violation number="1" location="packages/evals/tasks/bench/observe/observe_yc_startup.ts:28">
P1: Broken CSS selector: the `.` (dot) before `_results_18olp_345` was dropped and replaced with a space. This makes the selector invalid — it looks for a nonexistent HTML element type `_results_18olp_345` as a descendant, instead of matching a `div` with both classes. This locator will never match, silently degrading the test.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tasks/bench/observe/observe_file_uploads.ts">
<violation number="1" location="packages/evals/tasks/bench/observe/observe_file_uploads.ts:43">
P1: Missing `await v3.close()` in a `finally` block. Every other observe task cleans up the Stagehand instance this way; omitting it leaks the session.</violation>
</file>
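A minimal sketch of the convention the finding references; the Stagehand handle and the task body are stubbed:

```typescript
// Stub session; the real task holds a Stagehand V3 instance.
let closed = false;
const v3 = { close: async () => { closed = true; } };

async function runObserveTask(): Promise<void> {
  try {
    // ... observe-task body that may throw ...
  } finally {
    // Runs on success and failure alike, so the session is never leaked.
    await v3.close();
  }
}
```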
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment
Force-pushed from 097068a to 4019a8a
@cubic-dev-ai review

@miguelg719 I have started the AI code review. It will take a few minutes to complete.
5 issues found across 432 files
Confidence score: 3/5
- There is some merge risk because several medium-severity, high-confidence findings point to user-facing error-handling regressions (unsanitized exceptions and generic errors) rather than isolated style issues.
- The most severe concern is in `packages/evals/core/targets/browserbase.ts` and `packages/evals/tasks/bench/act/csr_in_oopif.ts`, where generic/unsanitized errors can expose raw exception details in eval output and reduce consistency of failure handling.
- Similar patterns in `packages/evals/core/targets/index.ts`, `packages/evals/tasks/bench/act/csr_in_spif.ts`, and `packages/evals/tasks/bench/act/login.ts` suggest a broader sanitization gap that is likely user-visible if not addressed.
- Pay close attention to `packages/evals/core/targets/browserbase.ts`, `packages/evals/tasks/bench/act/csr_in_oopif.ts`, `packages/evals/core/targets/index.ts`, `packages/evals/tasks/bench/act/csr_in_spif.ts`, and `packages/evals/tasks/bench/act/login.ts`: user-facing errors should be typed and sanitized to avoid leaking stack/context details.
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/core/targets/browserbase.ts">
<violation number="1" location="packages/evals/core/targets/browserbase.ts:11">
P2: Custom agent: **Exception and error message sanitization**
Uses generic `new Error()` in a user-facing failure path instead of a typed/sanitized error class.</violation>
</file>
<file name="packages/evals/tasks/bench/act/csr_in_oopif.ts">
<violation number="1" location="packages/evals/tasks/bench/act/csr_in_oopif.ts:41">
P2: Custom agent: **Exception and error message sanitization**
Raw exception text is returned in user-facing eval output instead of a sanitized error message.</violation>
</file>
<file name="packages/evals/core/targets/index.ts">
<violation number="1" location="packages/evals/core/targets/index.ts:26">
P2: Custom agent: **Exception and error message sanitization**
Uses generic `Error` for user-visible validation failures instead of a typed error class.</violation>
</file>
<file name="packages/evals/tasks/bench/act/csr_in_spif.ts">
<violation number="1" location="packages/evals/tasks/bench/act/csr_in_spif.ts:40">
P2: Custom agent: **Exception and error message sanitization**
Raw exception text is returned to the user-facing failure message without sanitization.</violation>
</file>
<file name="packages/evals/tasks/bench/act/login.ts">
<violation number="1" location="packages/evals/tasks/bench/act/login.ts:32">
P2: Custom agent: **Exception and error message sanitization**
Returns the raw caught error object in the task result instead of a sanitized error payload, which can leak stack/context data into eval output.</violation>
</file>
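One way to satisfy the sanitization rule, sketched generically (the exact `TaskResult` error shape is an assumption): keep the error type for debugging, but drop the raw message and stack, which may embed page content or secrets:

```typescript
interface SanitizedTaskError {
  type: string;
  message: string;
}

// Map any caught value to a fixed, non-sensitive payload; only the error
// class name survives, never the raw message or stack.
function sanitizeTaskError(err: unknown): SanitizedTaskError {
  const type = err instanceof Error ? err.name : "UnknownError";
  return { type, message: "Task failed; see debug logs for details." };
}
```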
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Force-pushed from 1d9ca2c to a1ad2e0
```ts
case "chrome_devtools_mcp":
  return new ChromeDevtoolsMcpTool();
case "browse_cli":
  return new BrowseCliTool();
```
should we add more competitors? 😁
why
Evals CLI v2.0
what changed
You can find this diagram in `ARCHITECTURE.md`.

(The embedded Mermaid `flowchart TB` source is omitted here; it maps Entrypoints and Distribution, the TUI command layer, the task registry and target resolution, run planning, the unified runner, the core and bench execution tiers, reporting and inspection, and the external systems: Braintrust, Browserbase, local Chromium, LLM providers, Stagehand, and the Claude Code SDK.)

test plan
Summary by cubic
Evals v2 rebuilds the suite into a core + bench system with auto-discovered tasks, adapter-backed tools, a unified Braintrust runner, and a stable CLI/TUI. It now stabilizes the `playwright_code` and `cdp_code` tool surfaces, adds `browse_cli` and MCP adapters, ships a Braintrust comparison report, fixes `simple_google_search`, guards file-upload evals with cleanup-on-failure, and removes outdated planning docs.

New Features

- Core tier: `packages/evals/core/tasks/**`, shared contracts/assertions/metrics, a local fixture server, lazy `getEnv()`, and runner-provided Chrome targets (local, Browserbase) with artifacts/cleanup and a shared cleanup registry.
- Tool surfaces: `playwright_code` and `cdp_code`; plus `understudy_code`, `playwright_mcp`, `chrome_devtools_mcp`, and `browse_cli`.
- Bench harnesses: `stagehand` and `claude_code`, a Claude Code SDK boundary, and a `browse_cli` tool handoff; the bench runner propagates `sessionUrl`/`debugUrl`.
- CLI: `dist/cli` (`pnpm evals ...`) with `bin/evals`, an `experiments` command, startup warning suppression, and quieter logs via an `EvalLogger` echo toggle; the legacy CLI remains via `pnpm evals:old`.
- Reporting: a Braintrust comparison report (`lib/braintrust-report.ts`), a `report:core` script, and summaries that include the experiment URL and scores.
- Config: `EVAL_AGENT_MODELS` = openai/gpt-5.4-mini, anthropic/claude-haiku-4-5, google/gemini-3-flash-preview; `EVAL_CATEGORIES` includes `regression`; `.gitignore` ignores Playwright/MCP artifacts; version bumped to `2.0.0`; adds `playwright`, `@anthropic-ai/claude-agent-sdk`, and `vitest`; updates the README, architecture diagram, and `skills/browser` guide.

Migration

- Configuration lives in `evals.config.json`.
- Bench tasks use `defineBenchTask({ name })` under `packages/evals/tasks/bench/**`; core tasks live under `packages/evals/core/tasks/**` via `defineCoreTask`.
- Run via `pnpm evals ...` or the `evals` bin; the legacy interface remains via `pnpm evals:old`.
- Set `BROWSERBASE_API_KEY` (and optionally `BROWSERBASE_PROJECT_ID`); set `CHROME_PATH` if auto-detect fails; `EVAL_ENV` is chosen at runtime.
- Replace `webbench` references with `b:webvoyager` or `b:onlineMind2Web`.

Written for commit 5a0056d. Summary will update on new commits. Review in cubic