Updated readme for evals package by miguelg719 · Pull Request #2112 · browserbase/stagehand

miguelg719 · 2026-05-14T00:04:20Z

why

what changed

test plan

Summary by cubic

Revamped the packages/evals README to explain the current Stagehand Evals TUI/CLI: quickstart, commands, run targets, common flags (harness/agent modes), preview/progress/results, task scaffolding, and Braintrust tracing. Added screenshots plus a run demo GIF, clarified .env loading, and removed outdated notes to match auto-discovered tasks and the evals binary.

Written for commit 0f80088. Summary will update on new commits.
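
For readers unfamiliar with the auto-discovery the summary mentions, here is a minimal sketch of how a directory scan could map files to tasks. It is not the package's implementation. The names `DiscoveredTask`, `toTask`, and `discoverTasks` are hypothetical, and the assumption that the first directory under the scan root is the tier is a guess based on the `evals new <tier> <name>` command.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical shape for a discovered task; not the package's real type.
interface DiscoveredTask {
  tier: string; // first directory under the scan root (assumed layout)
  name: string; // file name without its extension
  file: string; // path relative to the scan root, POSIX-style
}

// Map a path relative to the scan root to a task entry.
// Returns null for non-.ts/.js files, .d.ts files, and files directly in the root.
function toTask(relPath: string): DiscoveredTask | null {
  const posix = relPath.split(path.sep).join("/");
  const ext = path.posix.extname(posix);
  const slash = posix.indexOf("/");
  const isSource = (ext === ".ts" || ext === ".js") && !posix.endsWith(".d.ts");
  if (!isSource || slash <= 0) return null;
  return {
    tier: posix.slice(0, slash),
    name: path.posix.basename(posix, ext),
    file: posix,
  };
}

// Recursively walk `root` (for example "tasks/bench") and collect task entries.
function discoverTasks(root: string): DiscoveredTask[] {
  const found: DiscoveredTask[] = [];
  const walk = (dir: string): void => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else {
        const task = toTask(path.relative(root, full));
        if (task) found.push(task);
      }
    }
  };
  walk(root);
  return found.sort((a, b) => a.file.localeCompare(b.file));
}
```

Calling `discoverTasks("tasks/bench")` would return one entry per task file under that directory. The point of this design is that adding a file is enough to register a task, with no manual list to maintain.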

changeset-bot · 2026-05-14T00:04:25Z

⚠️ No Changeset found

Latest commit: 0f80088

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

cubic-dev-ai

No issues found across 6 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant User as User / Shell
    participant CLI as Evals TUI / REPL
    participant Disc as Task Discovery
    participant Runner as Execution Runner
    participant Task as Bench Task (TS/JS)
    participant SH as Stagehand SDK
    participant Ext as External (Browser/LLM)
    participant BT as Braintrust API

    Note over User, BT: Startup & Configuration
    User->>CLI: Launch `evals` or `evals run`
    CLI->>CLI: Load .env & evals.config.json
    CLI->>Disc: NEW: Scan tasks/bench/**
    Disc-->>CLI: Return auto-discovered tasks

    Note over User, BT: Execution Flow
    User->>CLI: Select target (e.g. b:webvoyager)
    CLI->>Runner: Initialize run (trials, concurrency)
    
    loop Per Task Trial
        Runner->>Task: Execute run()
        Task->>SH: Initialize Stagehand
        SH->>Ext: CHANGED: Connect (Local or Browserbase)
        SH->>Ext: Prompt LLM (Model/Provider Matrix)
        Ext-->>SH: Action/Observation
        SH-->>Task: Result (_success, etc.)
        Task-->>Runner: Return metrics
        
        opt BRAINTRUST_API_KEY set
            Runner->>BT: CHANGED: Stream experiment logs
        end

        Runner->>CLI: NEW: Update live progress table
    end

    Runner->>CLI: Aggregate final results
    CLI->>User: Display summary & by-model breakdown

    Note over User, Disc: Task Scaffolding (Optional)
    User->>CLI: `evals new <tier> <name>`
    CLI->>Disc: NEW: Generate defineBenchTask file
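
The "Per Task Trial" loop and the final by-model breakdown in the diagram can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the runner that ships in packages/evals. `TrialResult`, `TaskFn`, `runTrials`, `passRateByModel`, and `tracingEnabled` are hypothetical names. Only the `_success` field and the `BRAINTRUST_API_KEY` variable come from the diagram.

```typescript
// Hypothetical types for illustration; the evals package defines its own.
interface TrialResult {
  task: string;
  model: string;
  _success: boolean;
}

type TaskFn = (model: string) => Promise<{ _success: boolean }>;

// Tracing is optional: only enabled when BRAINTRUST_API_KEY is set.
function tracingEnabled(env: Record<string, string | undefined>): boolean {
  return Boolean(env.BRAINTRUST_API_KEY);
}

// Run every (task, model, trial) combination with at most `concurrency` in flight.
async function runTrials(
  tasks: Map<string, TaskFn>,
  models: string[],
  trials: number,
  concurrency: number,
  onProgress: (done: number, total: number) => void = () => {},
): Promise<TrialResult[]> {
  const queue: Array<{ task: string; model: string; fn: TaskFn }> = [];
  tasks.forEach((fn, task) => {
    for (const model of models) {
      for (let i = 0; i < trials; i++) queue.push({ task, model, fn });
    }
  });
  const total = queue.length;
  const results: TrialResult[] = [];

  // Each worker pulls jobs from the shared queue until it is empty.
  const worker = async (): Promise<void> => {
    for (let job = queue.shift(); job; job = queue.shift()) {
      let ok = false;
      try {
        ok = (await job.fn(job.model))._success;
      } catch {
        ok = false; // a thrown error counts as a failed trial
      }
      results.push({ task: job.task, model: job.model, _success: ok });
      onProgress(results.length, total); // e.g. refresh a live progress table
    }
  };
  const workers = Array.from({ length: Math.max(1, concurrency) }, () => worker());
  await Promise.all(workers);
  return results;
}

// Aggregate pass rate per model for the final summary.
function passRateByModel(results: TrialResult[]): Map<string, number> {
  const tally = new Map<string, { pass: number; total: number }>();
  for (const r of results) {
    const t = tally.get(r.model) ?? { pass: 0, total: 0 };
    t.total += 1;
    if (r._success) t.pass += 1;
    tally.set(r.model, t);
  }
  const rates = new Map<string, number>();
  tally.forEach((t, model) => rates.set(model, t.pass / t.total));
  return rates;
}
```

A caller would pass the discovered tasks and the model list to `runTrials`, then feed the results to `passRateByModel` to print the summary. A streaming step to Braintrust would run only when `tracingEnabled(process.env)` returns true.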

Updated readme for evals package

d910984

miguelg719 marked this pull request as ready for review May 14, 2026 00:04

seanmcguire12 approved these changes May 14, 2026

cubic-dev-ai Bot reviewed May 14, 2026

gif demo

0f80088

seanmcguire12 self-requested a review May 14, 2026 00:55

seanmcguire12 approved these changes May 14, 2026

miguelg719 merged commit 342681f into main May 14, 2026
32 checks passed

miguelg719 merged 2 commits into main from miguelgonzalez/stg-1988-evals-readme

2 participants