
Manual eval dispatch workflow #2085

Merged

miguelg719 merged 1 commit into main from miguelgonzalez/stg-1903-ci-evals-page-integration-1 on May 6, 2026

Conversation

@miguelg719 (Collaborator) commented May 6, 2026

why

what changed

test plan


Summary by cubic

Adds a manual GitHub Actions workflow to run evaluation suites on demand with selectable model, agent mode, trials, and concurrency. Builds once, runs chosen evals in a matrix, and posts Braintrust results and summaries.

  • New Features
    • Run agent, act/extract/observe, or external agent benchmarks (webvoyager, onlineMind2Web, webtailbench).
    • Inputs: model, agent_mode (auto/dom/hybrid/cua), trials (default 1), concurrency (default 50).
    • One build job; artifacts reused across eval jobs. Dynamic matrix built via jq.
    • Executes evals with pnpm + turbo against @browserbasehq/stagehand-evals, with OpenAI/Anthropic/Google/Braintrust/Browserbase env wired and region selection.
    • Publishes job summary with Braintrust link and primary score, and uploads CTRF report + V8 coverage (fails if no summary).

Written for commit 8267815. Summary will update on new commits.

@changeset-bot Bot commented May 6, 2026

⚠️ No Changeset found

Latest commit: 8267815

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
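For context, a changeset is a small markdown file under `.changeset/` pairing a semver bump with a release note; a minimal example (the package name and bump type here are illustrative, not taken from this PR) would look like:

```markdown
---
"@browserbasehq/stagehand": patch
---

Add manual eval dispatch workflow
```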


@miguelg719 miguelg719 marked this pull request as ready for review May 6, 2026 06:37
@cubic-dev-ai Bot (Contributor) left a comment


No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Dev as Developer Trigger
    participant GHA as GitHub Actions
    participant Prepare as Prepare Job
    participant Build as Build Job
    participant Eval as Eval Jobs (per matrix)
    participant Braintrust as Braintrust API
    participant BB as Browserbase

    Note over Dev,BB: Manual Eval Dispatch Workflow

    Dev->>GHA: workflow_dispatch (eval_suite, model, agent_mode, trials, concurrency)
    
    GHA->>Prepare: Start prepare job
    Prepare->>Prepare: Build matrix from eval_suite & external_benchmark
    alt eval_suite == "agent"
        Prepare->>Prepare: matrix = [agent]
    else eval_suite == "all"
        Prepare->>Prepare: matrix = [act, extract, observe]
    else eval_suite == "external_agent_benchmarks"
        alt external_benchmark in [webvoyager, onlineMind2Web, webtailbench]
            Prepare->>Prepare: matrix = [agent/{benchmark}]
        else unsupported
            Prepare->>GHA: Fail workflow
        end
    end
    Prepare-->>GHA: Output matrix JSON

    GHA->>Build: Start build job (after prepare)
    Build->>Build: Checkout repo
    Build->>Build: Setup Node/PNPM/Turbo
    Build->>Build: pnpm exec turbo build
    Build->>Build: Save turbo cache
    Build->>Build: Upload build artifacts (1 day retention)
    Build-->>GHA: Build complete

    GHA->>Eval: Start eval jobs (needs prepare + build)
    loop For each matrix entry
        Eval->>Eval: Setup Node/PNPM/Turbo
        Eval->>BB: Select Browserbase region (weighted distribution)
        Note over Eval,Braintrust: ENV: API keys + LLM_MAX_MS + BROWSERBASE_REGION_DISTRIBUTION
        Eval->>Eval: pnpm exec turbo run test:evals (with evals args)
        alt agent_mode != "auto"
            Eval->>Eval: Pass --agent-mode flag
        end
        Eval->>Braintrust: Run eval suite (publish results)
        Braintrust-->>Eval: Experiment data + scores
        alt eval complete (success or failure)
            Eval->>Eval: Parse eval-summary.json
            Eval->>Eval: Print summary to GITHUB_OUTPUT
        end
    end

    Note over Eval,GHA: Post-processing per job
    alt Always (success or failure)
        Eval->>Eval: Log performance to GITHUB_STEP_SUMMARY
        alt eval-summary.json exists
            Eval->>Eval: Generate Braintrust link + primary score
        else no summary
            Eval->>GHA: Fail workflow
        end
        Eval->>GHA: Upload CTRF report
        Eval->>GHA: Upload V8 coverage
    end
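The prepare job's matrix branching in the diagram above can be sketched in shell (a hypothetical sketch: the variable names and the jq invocation are assumptions, not the workflow's actual code):

```shell
#!/bin/sh
# Sketch of the prepare job's dynamic matrix selection.
# EVAL_SUITE / EXTERNAL_BENCHMARK stand in for the workflow_dispatch inputs.
EVAL_SUITE="all"
EXTERNAL_BENCHMARK=""

case "$EVAL_SUITE" in
  agent)
    matrix='["agent"]' ;;
  all)
    matrix='["act","extract","observe"]' ;;
  external_agent_benchmarks)
    case "$EXTERNAL_BENCHMARK" in
      webvoyager|onlineMind2Web|webtailbench)
        # Build ["agent/<benchmark>"] as compact JSON
        matrix=$(jq -cn --arg b "$EXTERNAL_BENCHMARK" '["agent/\($b)"]') ;;
      *)
        # Unsupported benchmark: fail the workflow, as in the diagram
        echo "Unsupported external benchmark: $EXTERNAL_BENCHMARK" >&2
        exit 1 ;;
    esac ;;
esac

# In CI this would be appended to $GITHUB_OUTPUT for downstream jobs
echo "matrix=$matrix"
```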

@miguelg719 miguelg719 merged commit 886884b into main May 6, 2026
35 checks passed
