Updated readme for evals package #2112

Merged
miguelg719 merged 2 commits into main from miguelgonzalez/stg-1988-evals-readme on May 14, 2026

Collaborator

@miguelg719 miguelg719 commented May 14, 2026

why

what changed

test plan


Summary by cubic

Revamped the packages/evals README to explain the current Stagehand Evals TUI/CLI: quickstart, commands, run targets, common flags (harness/agent modes), preview/progress/results, task scaffolding, and Braintrust tracing. Added screenshots plus a run demo GIF, clarified .env loading, and removed outdated notes to match auto-discovered tasks and the evals binary.

Written for commit 0f80088. Summary will update on new commits.

changeset-bot Bot commented May 14, 2026

⚠️ No Changeset found

Latest commit: 0f80088

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@miguelg719 miguelg719 marked this pull request as ready for review May 14, 2026 00:04
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 6 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant User as User / Shell
    participant CLI as Evals TUI / REPL
    participant Disc as Task Discovery
    participant Runner as Execution Runner
    participant Task as Bench Task (TS/JS)
    participant SH as Stagehand SDK
    participant Ext as External (Browser/LLM)
    participant BT as Braintrust API

    Note over User, BT: Startup & Configuration
    User->>CLI: Launch `evals` or `evals run`
    CLI->>CLI: Load .env & evals.config.json
    CLI->>Disc: NEW: Scan tasks/bench/**
    Disc-->>CLI: Return auto-discovered tasks

    Note over User, BT: Execution Flow
    User->>CLI: Select target (e.g. b:webvoyager)
    CLI->>Runner: Initialize run (trials, concurrency)
    
    loop Per Task Trial
        Runner->>Task: Execute run()
        Task->>SH: Initialize Stagehand
        SH->>Ext: CHANGED: Connect (Local or Browserbase)
        SH->>Ext: Prompt LLM (Model/Provider Matrix)
        Ext-->>SH: Action/Observation
        SH-->>Task: Result (_success, etc.)
        Task-->>Runner: Return metrics
        
        opt BRAINTRUST_API_KEY set
            Runner->>BT: CHANGED: Stream experiment logs
        end

        Runner->>CLI: NEW: Update live progress table
    end

    Runner->>CLI: Aggregate final results
    CLI->>User: Display summary & by-model breakdown

    Note over User, Disc: Task Scaffolding (Optional)
    User->>CLI: `evals new <tier> <name>`
    CLI->>Disc: NEW: Generate defineBenchTask file
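
The "Per Task Trial" loop in the diagram above can be sketched roughly as follows. This is a minimal illustration and not the evals package's actual runner: `BenchTask`, `TrialResult`, `RunOptions`, and `runTask` are hypothetical names invented for this example, and only the `_success` field comes from the diagram. Trials run sequentially here for simplicity, whereas the real runner is described as taking a concurrency setting.

```typescript
// Hypothetical shapes for illustration only; not the real evals API.
type TrialResult = { _success: boolean };
type BenchTask = {
  name: string;
  run: (model: string) => Promise<TrialResult>;
};
type RunOptions = {
  trials: number;
  models: string[];
  // Called after each trial, standing in for the live progress table.
  onProgress?: (done: number, total: number) => void;
};
type ModelStats = { passed: number; total: number };

// Runs every trial for every model and returns a by-model breakdown.
async function runTask(
  task: BenchTask,
  opts: RunOptions,
): Promise<Map<string, ModelStats>> {
  const byModel = new Map<string, ModelStats>();
  const total = opts.trials * opts.models.length;
  let done = 0;

  for (const model of opts.models) {
    const stats: ModelStats = { passed: 0, total: 0 };
    byModel.set(model, stats);

    for (let i = 0; i < opts.trials; i++) {
      let success = false;
      try {
        const result = await task.run(model);
        success = result._success;
      } catch {
        // A thrown error is counted as a failed trial.
        success = false;
      }
      stats.total += 1;
      if (success) stats.passed += 1;
      done += 1;
      opts.onProgress?.(done, total);
    }
  }
  return byModel;
}
```

Counting a thrown error as a failed trial keeps one broken task from aborting the whole run, which is one plausible way to arrive at the aggregated summary shown at the end of the diagram.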

@seanmcguire12 seanmcguire12 self-requested a review May 14, 2026 00:55
@miguelg719 miguelg719 merged commit 342681f into main May 14, 2026
32 checks passed