Workflow: publish eval results #2093
1 issue found across 2 files
Confidence score: 4/5
- This PR is likely safe to merge, with a modest logic risk: in `packages/evals/scripts/publish-braintrust-ui-data.ts`, `benchCase.category` is currently unreachable because `benchCase.suite.replace(...)` always yields a string.
- The most significant impact is potential benchmark-source misclassification when `dataset` is missing, which is user-visible in published eval metadata but not likely to break core execution paths.
- Given the reported severity (5/10) and a single focused issue, the risk appears contained rather than merge-blocking.
- Pay close attention to `packages/evals/scripts/publish-braintrust-ui-data.ts`: unreachable fallback logic may misclassify source data when `dataset` is absent.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/scripts/publish-braintrust-ui-data.ts">
<violation number="1" location="packages/evals/scripts/publish-braintrust-ui-data.ts:291">
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Dev as Developer (Manual Trigger)
participant GHA as GitHub Actions Workflow
participant Script as publish-braintrust-ui-data.ts
participant Braintrust as Braintrust API
participant Upstash as Upstash Redis
participant Artifact as GitHub Artifacts
Note over Dev,Artifact: Publish Evals Workflow
Dev->>GHA: Trigger workflow_dispatch with inputs
Note over Dev,GHA: experiment, project, kv_key, dry_run, etc.
GHA->>GHA: Checkout code & setup node/pnpm
GHA->>Script: Execute pnpm tsx script with parsed args
Note over GHA,Script: Passes env vars (BRAINTRUST_API_KEY, UPSTASH_*)
Script->>Braintrust: Fetch experiment data
Note over Script,Braintrust: Uses experiment name/UUID & project
Braintrust-->>Script: Return experiment results & bench cases
Script->>Script: Infer benchmark key & label
Note over Script: Extracts from dataset/suite/category
Script->>Script: Infer model, provider, & provider key
Note over Script: Parses model string for provider prefix
Script->>Script: Compute pass rate, avg duration (speed), & cost
Note over Script: Scans bench results for pass/fail, timing, & cost metrics
Script->>Script: Build UiBenchmarkRow structure
alt Not dry run
Script->>Upstash: Read existing data from <kv_key>
Upstash-->>Script: Return existing payload (or empty)
Script->>Script: Merge new row into existing dataset
Note over Script: Dedup by model+provider, sort by accuracy then speed
Script->>Upstash: SET <kv_key> with merged payload
Script->>Upstash: OPTIONAL: SET <experiment_key_prefix>:<experiment_id>
Upstash-->>Script: Confirm write
Script->>Script: Build output JSON with summary & keys written
else Dry run
Script->>Script: Build output JSON without Upstash calls
end
Script-->>GHA: Write evals-ui-data.json & publish-evals-ui-data-output.json
GHA->>GHA: Parse output JSON and extract summary
GHA->>GHA: Add publish summary to GitHub Step Summary
GHA->>Artifact: Upload generated payload files
Note over GHA,Artifact: Always uploads, even on failure
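The merge step in the diagram (dedup by model+provider, then sort by accuracy and speed) could look roughly like the sketch below. The `UiRow` type, its field names, the `mergeRows` helper, and the sort directions (higher accuracy first, then shorter average duration) are assumptions made for illustration; the script's actual `UiBenchmarkRow` shape is not shown on this page.

```typescript
// Hypothetical row shape; the real UiBenchmarkRow may differ.
interface UiRow {
  model: string;
  provider: string;
  accuracy: number; // pass rate in the range 0..1
  avgDurationMs: number; // "speed" in the diagram
}

// Merge one new row into the existing rows.
// A row with the same model+provider replaces the older entry.
function mergeRows(existing: UiRow[], incoming: UiRow): UiRow[] {
  const byKey = new Map<string, UiRow>();
  for (const row of existing.concat([incoming])) {
    byKey.set(`${row.model}::${row.provider}`, row);
  }
  // Assumed ordering: accuracy descending, then duration ascending.
  return Array.from(byKey.values()).sort(
    (a, b) => b.accuracy - a.accuracy || a.avgDurationMs - b.avgDurationMs,
  );
}
```

Keying the map on model+provider keeps the dedup and the replace-on-conflict behaviour in one place, so a re-run for the same model overwrites its earlier row instead of adding a second one.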
function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  return (
    benchCase.dataset ??
    benchCase.suite.replace(/^agent\//, "") ??
P2: `benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/scripts/publish-braintrust-ui-data.ts, line 291:
<comment>`benchCase.category` is unreachable here because `benchCase.suite.replace(...)` always returns a string, so this fallback never executes. This can misclassify benchmark source when `dataset` is missing.</comment>
<file context>
@@ -0,0 +1,744 @@
+function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
+ return (
+ benchCase.dataset ??
+ benchCase.suite.replace(/^agent\//, "") ??
+ benchCase.category
+ );
</file context>
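One possible fix, sketched under assumptions: `??` only falls through on `null` or `undefined`, and `String.prototype.replace` always returns a string, so the `category` fallback needs a different trigger. The version below treats an empty result as missing by using `||`, and guards `suite` with optional chaining. The `BenchCaseRow` shape here (all three fields optional strings) is a guess for illustration; the real type is not visible in this diff context.

```typescript
// Hypothetical shape; the real BenchCaseRow may declare these differently.
interface BenchCaseRow {
  dataset?: string;
  suite?: string;
  category?: string;
}

function benchmarkSource(benchCase: BenchCaseRow): string | undefined {
  // Strip the "agent/" prefix only when a suite is present.
  const suite = benchCase.suite?.replace(/^agent\//, "");
  // `||` skips empty strings as well as undefined, so `category` is reachable.
  return benchCase.dataset || suite || benchCase.category;
}
```

Whether an empty `dataset` or `suite` should fall through to the next field is a design choice for the author; if empty strings are meaningful values, keeping `??` and only adding the optional chaining on `suite` would be the narrower change.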
The GitHub iOS app dispatches workflows with numeric inputs serialized
as doubles ("1.0"/"50.0") even when the receiving field is declared
`type: number` for an integer. GitHub accepts the float and threads it
into the run, so the workflow triggers, but the evals CLI's
`parsePositiveInteger` (regex /^[0-9]+$/) then rejects the value and the
job fails immediately.
Strip a single trailing `.0` from `EVAL_TRIALS` and `EVAL_CONCURRENCY`
in the dispatch shell before forwarding them to the eval CLI. Integers
and non-zero decimals are passed through unchanged, so behaviour for the
web UI / API callers is identical.
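The workflow applies this normalization in shell; the sketch below restates the same idea in TypeScript purely to illustrate the behaviour. `stripTrailingDotZero` and `parsePositiveIntegerSketch` are hypothetical names. The latter is only a stand-in built around the `/^[0-9]+$/` check quoted above, not the CLI's actual implementation.

```typescript
// Stand-in for the CLI check: accepts digit-only strings, rejects anything else.
function parsePositiveIntegerSketch(raw: string): number {
  if (!/^[0-9]+$/.test(raw)) {
    throw new Error(`expected an integer, got "${raw}"`);
  }
  return Number.parseInt(raw, 10);
}

// Remove one trailing ".0"; integers and non-zero decimals pass through unchanged.
function stripTrailingDotZero(raw: string): string {
  return raw.endsWith(".0") ? raw.slice(0, -2) : raw;
}

// "50.0" (as sent by the iOS app) becomes "50" and now passes the check.
const trials = parsePositiveIntegerSketch(stripTrailingDotZero("50.0"));
```

A value such as "1.5" is left untouched and is still rejected by the integer check, which matches the intent of only repairing the double-serialized integer case.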
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Strip a trailing `.0` from `EVAL_TRIALS` and `EVAL_CONCURRENCY` in the
manual-evals workflow so the evals CLI accepts integer inputs. Prevents
job failures when the GitHub iOS app sends numeric inputs as doubles
(e.g., "1.0", "50.0").
<sup>Written for commit c6c1d73.
Summary will update on new commits. <a
href="https://cubic.dev/pr/browserbase/stagehand/pull/2154?utm_source=github">Review
in cubic</a></sup>
<!-- End of auto-generated description by cubic. -->
Co-authored-by: Chromie <chromie@anthropic.com>
Summary by cubic
Adds a workflow and script to publish Braintrust benchmark results to Upstash/Vercel KV for the Evals UI, with model/agent‑mode aggregation and token‑based cost estimates. Also fixes manual evals to strip trailing “.0” from numeric inputs to avoid CLI parse errors.
New Features
- `Publish Evals` workflow (`.github/workflows/publish-evals.yml`) to fetch a Braintrust experiment and write the UI payload to KV (latest key plus optional `<prefix>:<experiment-id>`), with `dry_run` support and artifact/summary output.
- `packages/evals/scripts/publish-braintrust-ui-data.ts` (run via `pnpm --filter @browserbasehq/stagehand-evals exec tsx`) that infers the benchmark, groups by model+agent mode, computes pass rate, avg duration, and cost per task/total (fallback to model pricing), merges into the KV key, and sorts rows; supports `--dry-run`, `--dataset-id`, and `--out`.
Bug Fixes
- In `.github/workflows/manual-evals.yml`, strip a single trailing `.0` from `EVAL_TRIALS` and `EVAL_CONCURRENCY` so the eval CLI accepts integer inputs from the GitHub iOS app.
Written for commit 2c88217. Summary will update on new commits.