Workflow: publish eval results by miguelg719 · Pull Request #2093 · browserbase/stagehand

miguelg719 · 2026-05-06T23:07:36Z

why

what changed

test plan

Summary by cubic

Adds a workflow and script to publish Braintrust benchmark results to Upstash/Vercel KV for the Evals UI, with model/agent‑mode aggregation and token‑based cost estimates. Also fixes manual evals to strip trailing “.0” from numeric inputs to avoid CLI parse errors.

New Features
- Adds Publish Evals workflow (.github/workflows/publish-evals.yml) to fetch a Braintrust experiment and write the UI payload to KV (latest key plus optional <prefix>:<experiment-id>), with dry_run support and artifact/summary output.
- Introduces packages/evals/scripts/publish-braintrust-ui-data.ts (run via pnpm --filter @browserbasehq/stagehand-evals exec tsx) that infers the benchmark, groups by model+agent mode, computes pass rate, avg duration, and cost per task/total (fallback to model pricing), merges into the KV key, and sorts rows; supports --dry-run, --dataset-id, and --out.
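The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `CaseResult` and `AggregateRow` shapes, their field names, and the `aggregate` function are all hypothetical, and the token-based pricing fallback is left out (a missing cost is simply counted as 0).

```typescript
// Hypothetical per-case result; the real script's row type is not shown in this PR page.
interface CaseResult {
  model: string;
  agentMode: string;
  passed: boolean;
  durationMs: number;
  costUsd?: number; // assumed optional; the real script falls back to model pricing
}

// Hypothetical aggregated row, one per model + agent mode pair.
interface AggregateRow {
  model: string;
  agentMode: string;
  tasks: number;
  passRate: number;
  avgDurationMs: number;
  totalCostUsd: number;
  costPerTaskUsd: number;
}

function aggregate(results: CaseResult[]): AggregateRow[] {
  // Bucket results by a composite key of model and agent mode.
  const groups = new Map<string, CaseResult[]>();
  for (const result of results) {
    const key = JSON.stringify([result.model, result.agentMode]);
    const bucket = groups.get(key) ?? [];
    bucket.push(result);
    groups.set(key, bucket);
  }

  // Reduce each bucket to pass rate, average duration, and cost figures.
  return Array.from(groups.values()).map((bucket) => {
    const tasks = bucket.length; // never 0: a bucket exists only if it has a result
    const passedCount = bucket.filter((r) => r.passed).length;
    const totalDurationMs = bucket.reduce((sum, r) => sum + r.durationMs, 0);
    const totalCostUsd = bucket.reduce((sum, r) => sum + (r.costUsd ?? 0), 0);
    return {
      model: bucket[0].model,
      agentMode: bucket[0].agentMode,
      tasks,
      passRate: passedCount / tasks,
      avgDurationMs: totalDurationMs / tasks,
      totalCostUsd,
      costPerTaskUsd: totalCostUsd / tasks,
    };
  });
}
```

Rows come back in first-seen order of each model and agent mode pair, since a `Map` preserves insertion order.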
Bug Fixes
- In .github/workflows/manual-evals.yml, strip a single trailing .0 from EVAL_TRIALS and EVAL_CONCURRENCY so the eval CLI accepts integer inputs from the GitHub iOS app.

Written for commit 2c88217. Summary will update on new commits.

changeset-bot · 2026-05-06T23:07:40Z

⚠️ No Changeset found

Latest commit: 2c88217

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai

1 issue found across 2 files

Confidence score: 4/5

This PR is likely safe to merge, with a modest logic risk: in packages/evals/scripts/publish-braintrust-ui-data.ts, benchCase.category is currently unreachable because benchCase.suite.replace(...) always yields a string.
The most significant impact is potential benchmark-source misclassification when dataset is missing, which is user-visible in published eval metadata but not likely to break core execution paths.
Given the reported severity (5/10) and a single focused issue, the risk appears contained rather than merge-blocking.
Pay close attention to packages/evals/scripts/publish-braintrust-ui-data.ts - unreachable fallback logic may misclassify source data when dataset is absent.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/publish-braintrust-ui-data.ts">

<violation number="1" location="packages/evals/scripts/publish-braintrust-ui-data.ts:291">
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</violation>
</file>

Architecture diagram

sequenceDiagram
    participant Dev as Developer (Manual Trigger)
    participant GHA as GitHub Actions Workflow
    participant Script as publish-braintrust-ui-data.ts
    participant Braintrust as Braintrust API
    participant Upstash as Upstash Redis
    participant Artifact as GitHub Artifacts

    Note over Dev,Artifact: Publish Evals Workflow

    Dev->>GHA: Trigger workflow_dispatch with inputs
    Note over Dev,GHA: experiment, project, kv_key, dry_run, etc.

    GHA->>GHA: Checkout code & setup node/pnpm

    GHA->>Script: Execute pnpm tsx script with parsed args
    Note over GHA,Script: Passes env vars (BRAINTRUST_API_KEY, UPSTASH_*)

    Script->>Braintrust: Fetch experiment data
    Note over Script,Braintrust: Uses experiment name/UUID & project

    Braintrust-->>Script: Return experiment results & bench cases

    Script->>Script: Infer benchmark key & label
    Note over Script: Extracts from dataset/suite/category

    Script->>Script: Infer model, provider, & provider key
    Note over Script: Parses model string for provider prefix

    Script->>Script: Compute pass rate, avg duration (speed), & cost
    Note over Script: Scans bench results for pass/fail, timing, & cost metrics

    Script->>Script: Build UiBenchmarkRow structure

    alt Not dry run
        Script->>Upstash: Read existing data from <kv_key>
        Upstash-->>Script: Return existing payload (or empty)

        Script->>Script: Merge new row into existing dataset
        Note over Script: Dedup by model+provider, sort by accuracy then speed

        Script->>Upstash: SET <kv_key> with merged payload
        Script->>Upstash: OPTIONAL: SET <experiment_key_prefix>:<experiment_id>
        Upstash-->>Script: Confirm write

        Script->>Script: Build output JSON with summary & keys written
    else Dry run
        Script->>Script: Build output JSON without Upstash calls
    end

    Script-->>GHA: Write evals-ui-data.json & publish-evals-ui-data-output.json

    GHA->>GHA: Parse output JSON and extract summary

    GHA->>GHA: Add publish summary to GitHub Step Summary

    GHA->>Artifact: Upload generated payload files
    Note over GHA,Artifact: Always uploads, even on failure
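The merge step in the diagram (dedup by model+provider, then sort by accuracy and speed) might look roughly like the sketch below. The `UiRow` shape and the `mergeRow` name are hypothetical, and the sort direction (higher accuracy first, then lower duration first) is an assumption, since the diagram does not state it.

```typescript
// Hypothetical simplified row; the real UiBenchmarkRow structure is not shown here.
interface UiRow {
  model: string;
  provider: string;
  accuracy: number; // pass rate, assumed higher-is-better
  speedMs: number; // average duration, assumed lower-is-better
}

function mergeRow(existing: UiRow[], incoming: UiRow): UiRow[] {
  // Drop any previous row for the same model + provider so the new one replaces it.
  const merged = existing.filter(
    (row) => !(row.model === incoming.model && row.provider === incoming.provider),
  );
  merged.push(incoming);

  // Sort by accuracy descending, breaking ties by duration ascending.
  merged.sort((a, b) => b.accuracy - a.accuracy || a.speedMs - b.speedMs);
  return merged;
}
```

The function returns a new array and leaves `existing` untouched, which keeps a dry run free of side effects on the payload read from KV.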

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

cubic-dev-ai · 2026-05-06T23:11:57Z

+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+  return (
+    benchCase.dataset ??
+    benchCase.suite.replace(/^agent\//, "") ??


P2: benchCase.category is unreachable here because benchCase.suite.replace(...) always returns a string, so this fallback never executes. This can misclassify benchmark source when dataset is missing.
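One possible way to make the fallback reachable is sketched below. It assumes the intent is to fall through to `category` when the stripped suite name is empty; the `BenchCaseRow` fields shown are a guess at the relevant subset of the real type, and the fix actually merged may differ.

```typescript
// Assumed subset of the real BenchCaseRow type; field optionality is a guess.
interface BenchCaseRow {
  dataset?: string;
  suite: string;
  category?: string;
}

function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  // String.prototype.replace always returns a string, so `??` can never skip past it.
  const suite = benchCase.suite.replace(/^agent\//, "");
  // Test for an empty string explicitly so `category` becomes a real fallback.
  return benchCase.dataset ?? (suite !== "" ? suite : benchCase.category);
}
```

If `suite` is guaranteed to be non-empty in practice, `category` stays unused even with this change, and removing it from the chain would be the simpler fix.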

Prompt for AI agents

Check if this issue is valid. If so, understand the root cause and fix it. At packages/evals/scripts/publish-braintrust-ui-data.ts, line 291:

<comment>`benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</comment>

<file context>
@@ -0,0 +1,744 @@
+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+  return (
+    benchCase.dataset ??
+    benchCase.suite.replace(/^agent\//, "") ??
+    benchCase.category
+  );
</file context>

The GitHub iOS app dispatches workflows with numeric inputs serialized as doubles ("1.0"/"50.0") even when the receiving field is declared `type: number` for an integer. GitHub accepts the float and threads it into the run, so the workflow triggers, but the evals CLI's `parsePositiveInteger` (regex /^[0-9]+$/) then rejects the value and the job fails immediately.

Strip a single trailing `.0` from `EVAL_TRIALS` and `EVAL_CONCURRENCY` in the dispatch shell before forwarding them to the eval CLI. Integers and non-zero decimals are passed through unchanged, so behaviour for the web UI / API callers is identical.

---

## Summary by cubic

Strip a trailing `.0` from `EVAL_TRIALS` and `EVAL_CONCURRENCY` in the manual-evals workflow so the evals CLI accepts integer inputs. Prevents job failures when the GitHub iOS app sends numeric inputs as doubles (e.g., "1.0", "50.0").

Written for commit c6c1d73. Summary will update on new commits. Review in cubic: https://cubic.dev/pr/browserbase/stagehand/pull/2154?utm_source=github

Co-authored-by: Chromie <chromie@anthropic.com>
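The normalization itself lives in the workflow's shell step, but its behaviour can be illustrated in TypeScript. The helper names below are hypothetical; only the `/^[0-9]+$/` pattern comes from the description above.

```typescript
// Remove exactly one trailing ".0"; every other value is returned unchanged.
function stripTrailingPointZero(value: string): string {
  return value.replace(/\.0$/, "");
}

// Same digit-only pattern the description attributes to the CLI's integer parser.
function looksLikeInteger(value: string): boolean {
  return /^[0-9]+$/.test(value);
}

for (const raw of ["1.0", "50.0", "3", "1.5"]) {
  const cleaned = stripTrailingPointZero(raw);
  console.log(`${raw} -> ${cleaned} (integer: ${looksLikeInteger(cleaned)})`);
}
```

A value such as "1.5" still fails the integer check after normalization, so genuinely invalid input continues to be rejected by the CLI instead of being silently rounded.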

Publish eval results

5fcdb16

miguelg719 marked this pull request as ready for review May 6, 2026 23:07

cubic-dev-ai Bot reviewed May 6, 2026


miguelg719 and others added 4 commits May 7, 2026 15:23

linting

d06f816

update

900b504

update workflow

32a0a0d

Kylejeong2 approved these changes May 21, 2026


miguelg719 merged commit 4354837 into main May 21, 2026
34 checks passed

miguelg719 merged 5 commits into main from miguelgonzalez/stg-1926-ci-evals-page-integration-2

3 participants
