Manual eval dispatch workflow #2085
Merged
miguelg719 merged 1 commit into main on May 6, 2026
No issues found across 1 file
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
    participant Dev as Developer Trigger
    participant GHA as GitHub Actions
    participant Prepare as Prepare Job
    participant Build as Build Job
    participant Eval as Eval Jobs (per matrix)
    participant Braintrust as Braintrust API
    participant BB as Browserbase

    Note over Dev,BB: Manual Eval Dispatch Workflow

    Dev->>GHA: workflow_dispatch (eval_suite, model, agent_mode, trials, concurrency)
    GHA->>Prepare: Start prepare job
    Prepare->>Prepare: Build matrix from eval_suite & external_benchmark
    alt eval_suite == "agent"
        Prepare->>Prepare: matrix = [agent]
    else eval_suite == "all"
        Prepare->>Prepare: matrix = [act, extract, observe]
    else eval_suite == "external_agent_benchmarks"
        alt external_benchmark in [webvoyager, onlineMind2Web, webtailbench]
            Prepare->>Prepare: matrix = [agent/{benchmark}]
        else unsupported
            Prepare->>GHA: Fail workflow
        end
    end
    Prepare-->>GHA: Output matrix JSON

    GHA->>Build: Start build job (after prepare)
    Build->>Build: Checkout repo
    Build->>Build: Setup Node/PNPM/Turbo
    Build->>Build: pnpm exec turbo build
    Build->>Build: Save turbo cache
    Build->>Build: Upload build artifacts (1 day retention)
    Build-->>GHA: Build complete

    GHA->>Eval: Start eval jobs (needs prepare + build)
    loop For each matrix entry
        Eval->>Eval: Setup Node/PNPM/Turbo
        Eval->>BB: Select Browserbase region (weighted distribution)
        Note over Eval,Braintrust: ENV: API keys + LLM_MAX_MS + BROWSERBASE_REGION_DISTRIBUTION
        Eval->>Eval: pnpm exec turbo run test:evals (with evals args)
        alt agent_mode != "auto"
            Eval->>Eval: Pass --agent-mode flag
        end
        Eval->>Braintrust: Run eval suite (publish results)
        Braintrust-->>Eval: Experiment data + scores
        alt eval complete (success or failure)
            Eval->>Eval: Parse eval-summary.json
            Eval->>Eval: Print summary to GITHUB_OUTPUT
        end
    end

    Note over Eval,GHA: Post-processing per job
    alt Always (success or failure)
        Eval->>Eval: Log performance to GITHUB_STEP_SUMMARY
        alt eval-summary.json exists
            Eval->>Eval: Generate Braintrust link + primary score
        else no summary
            Eval->>GHA: Fail workflow
        end
        Eval->>GHA: Upload CTRF report
        Eval->>GHA: Upload V8 coverage
    end
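The matrix-building branches of the Prepare job in the diagram can be sketched in Python as follows. This is only an illustration of the branching logic, not the workflow's actual implementation (the summary suggests the workflow itself uses jq). The function name `build_matrix` is hypothetical, and the pass-through behaviour for suite names the diagram does not list is an assumption.

```python
import json

# Benchmarks named in the diagram's external_agent_benchmarks branch.
SUPPORTED_BENCHMARKS = ("webvoyager", "onlineMind2Web", "webtailbench")


def build_matrix(eval_suite: str, external_benchmark: str = "") -> list:
    """Return the list of matrix entries for a given eval_suite input."""
    if eval_suite == "agent":
        return ["agent"]
    if eval_suite == "all":
        return ["act", "extract", "observe"]
    if eval_suite == "external_agent_benchmarks":
        if external_benchmark not in SUPPORTED_BENCHMARKS:
            # Mirrors the diagram's "unsupported -> Fail workflow" branch.
            raise ValueError(f"unsupported external_benchmark: {external_benchmark!r}")
        return [f"agent/{external_benchmark}"]
    # ASSUMPTION: the diagram does not show other suite values; here any
    # other name is passed through as a single-entry matrix.
    return [eval_suite]


if __name__ == "__main__":
    # Emit the matrix as JSON, as the Prepare job does for downstream jobs.
    print(json.dumps(build_matrix("external_agent_benchmarks", "webvoyager")))
    # prints ["agent/webvoyager"]
```

Raising an exception on an unsupported benchmark makes the failure happen in the Prepare step, before any build or eval job is started.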
pirate approved these changes on May 6, 2026
why
what changed
test plan
Summary by cubic
Adds a manual GitHub Actions workflow to run evaluation suites on demand with selectable model, agent mode, trials, and concurrency. Builds once, runs chosen evals in a matrix, and posts Braintrust results and summaries.
- model, agent_mode (auto/dom/hybrid/cua), trials (default 1), concurrency (default 50).
- jq.
- pnpm + turbo against @browserbasehq/stagehand-evals, with OpenAI/Anthropic/Google/Braintrust/Browserbase env wired and region selection.
Written for commit 8267815. Summary will update on new commits.
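The weighted region selection mentioned in the summary could look roughly like the Python sketch below. The format of BROWSERBASE_REGION_DISTRIBUTION is not shown on this page, so the comma-separated "region=weight" format, the function names, and the example region values are all assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def parse_distribution(spec: str) -> Dict[str, float]:
    """Parse 'region=weight,region=weight' into a dict (HYPOTHETICAL format)."""
    weights: Dict[str, float] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        region, _, weight = part.partition("=")
        weights[region.strip()] = float(weight)
    if not weights or any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("distribution needs non-negative weights with a positive total")
    return weights


def pick_region(spec: str, rng: Optional[random.Random] = None) -> str:
    """Pick one region at random, with probability proportional to its weight."""
    weights = parse_distribution(spec)
    rng = rng or random.Random()
    regions = list(weights)
    return rng.choices(regions, weights=[weights[r] for r in regions], k=1)[0]


if __name__ == "__main__":
    # Example values only; the real variable's contents are not given here.
    print(pick_region("us-west-2=3,us-east-1=1"))
```

Accepting an optional `random.Random` instance keeps the selection reproducible in tests while staying random in normal runs.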