
Manual eval dispatch workflow #2085

Merged

miguelg719 merged 1 commit into main from miguelgonzalez/stg-1903-ci-evals-page-integration-1 on May 6, 2026

Conversation

@miguelg719 (Collaborator) commented May 6, 2026

why

what changed

test plan


Summary by cubic

Adds a manual GitHub Actions workflow to run evaluation suites on demand with selectable model, agent mode, trials, and concurrency. Builds once, runs chosen evals in a matrix, and posts Braintrust results and summaries.

  • New Features
    • Run agent, act/extract/observe, or external agent benchmarks (webvoyager, onlineMind2Web, webtailbench).
    • Inputs: model, agent_mode (auto/dom/hybrid/cua), trials (default 1), concurrency (default 50).
    • One build job; artifacts reused across eval jobs. Dynamic matrix built via jq.
    • Executes evals with pnpm + turbo against @browserbasehq/stagehand-evals, with OpenAI/Anthropic/Google/Braintrust/Browserbase env wired and region selection.
    • Publishes job summary with Braintrust link and primary score, and uploads CTRF report + V8 coverage (fails if no summary).

Written for commit 8267815. Summary will update on new commits.

@changeset-bot Bot commented May 6, 2026

⚠️ No Changeset found

Latest commit: 8267815

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
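For context, a changeset is a small markdown file under `.changeset/` pairing a semver bump with a release note; a minimal example (the package name and bump type here are illustrative, not taken from this PR) would look like:

```markdown
---
"@browserbasehq/stagehand": patch
---

Add manual eval dispatch workflow
```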


@miguelg719 miguelg719 marked this pull request as ready for review May 6, 2026 06:37
@cubic-dev-ai Bot (Contributor) left a comment


No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Dev as Developer Trigger
    participant GHA as GitHub Actions
    participant Prepare as Prepare Job
    participant Build as Build Job
    participant Eval as Eval Jobs (per matrix)
    participant Braintrust as Braintrust API
    participant BB as Browserbase

    Note over Dev,BB: Manual Eval Dispatch Workflow

    Dev->>GHA: workflow_dispatch (eval_suite, model, agent_mode, trials, concurrency)
    
    GHA->>Prepare: Start prepare job
    Prepare->>Prepare: Build matrix from eval_suite & external_benchmark
    alt eval_suite == "agent"
        Prepare->>Prepare: matrix = [agent]
    else eval_suite == "all"
        Prepare->>Prepare: matrix = [act, extract, observe]
    else eval_suite == "external_agent_benchmarks"
        alt external_benchmark in [webvoyager, onlineMind2Web, webtailbench]
            Prepare->>Prepare: matrix = [agent/{benchmark}]
        else unsupported
            Prepare->>GHA: Fail workflow
        end
    end
    Prepare-->>GHA: Output matrix JSON

    GHA->>Build: Start build job (after prepare)
    Build->>Build: Checkout repo
    Build->>Build: Setup Node/PNPM/Turbo
    Build->>Build: pnpm exec turbo build
    Build->>Build: Save turbo cache
    Build->>Build: Upload build artifacts (1 day retention)
    Build-->>GHA: Build complete

    GHA->>Eval: Start eval jobs (needs prepare + build)
    loop For each matrix entry
        Eval->>Eval: Setup Node/PNPM/Turbo
        Eval->>BB: Select Browserbase region (weighted distribution)
        Note over Eval,Braintrust: ENV: API keys + LLM_MAX_MS + BROWSERBASE_REGION_DISTRIBUTION
        Eval->>Eval: pnpm exec turbo run test:evals (with evals args)
        alt agent_mode != "auto"
            Eval->>Eval: Pass --agent-mode flag
        end
        Eval->>Braintrust: Run eval suite (publish results)
        Braintrust-->>Eval: Experiment data + scores
        alt eval complete (success or failure)
            Eval->>Eval: Parse eval-summary.json
            Eval->>Eval: Print summary to GITHUB_OUTPUT
        end
    end

    Note over Eval,GHA: Post-processing per job
    alt Always (success or failure)
        Eval->>Eval: Log performance to GITHUB_STEP_SUMMARY
        alt eval-summary.json exists
            Eval->>Eval: Generate Braintrust link + primary score
        else no summary
            Eval->>GHA: Fail workflow
        end
        Eval->>GHA: Upload CTRF report
        Eval->>GHA: Upload V8 coverage
    end
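The prepare job's matrix branching in the diagram above can be sketched in shell (a hypothetical sketch: the variable names and the jq invocation are assumptions, not the workflow's actual code):

```shell
#!/bin/sh
# Sketch of the prepare job's dynamic matrix selection.
# EVAL_SUITE / EXTERNAL_BENCHMARK stand in for the workflow_dispatch inputs.
EVAL_SUITE="all"
EXTERNAL_BENCHMARK=""

case "$EVAL_SUITE" in
  agent)
    matrix='["agent"]' ;;
  all)
    matrix='["act","extract","observe"]' ;;
  external_agent_benchmarks)
    case "$EXTERNAL_BENCHMARK" in
      webvoyager|onlineMind2Web|webtailbench)
        # Build ["agent/<benchmark>"] as compact JSON
        matrix=$(jq -cn --arg b "$EXTERNAL_BENCHMARK" '["agent/\($b)"]') ;;
      *)
        # Unsupported benchmark: fail the workflow, as in the diagram
        echo "Unsupported external benchmark: $EXTERNAL_BENCHMARK" >&2
        exit 1 ;;
    esac ;;
esac

# In CI this would be appended to $GITHUB_OUTPUT for downstream jobs
echo "matrix=$matrix"
```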

@miguelg719 miguelg719 merged commit 886884b into main May 6, 2026
35 checks passed
