Updated readme for evals package #2112
Merged
seanmcguire12
approved these changes
May 14, 2026
Contributor
No issues found across 6 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
```mermaid
sequenceDiagram
    participant User as User / Shell
    participant CLI as Evals TUI / REPL
    participant Disc as Task Discovery
    participant Runner as Execution Runner
    participant Task as Bench Task (TS/JS)
    participant SH as Stagehand SDK
    participant Ext as External (Browser/LLM)
    participant BT as Braintrust API

    Note over User, BT: Startup & Configuration
    User->>CLI: Launch `evals` or `evals run`
    CLI->>CLI: Load .env & evals.config.json
    CLI->>Disc: NEW: Scan tasks/bench/**
    Disc-->>CLI: Return auto-discovered tasks

    Note over User, BT: Execution Flow
    User->>CLI: Select target (e.g. b:webvoyager)
    CLI->>Runner: Initialize run (trials, concurrency)
    loop Per Task Trial
        Runner->>Task: Execute run()
        Task->>SH: Initialize Stagehand
        SH->>Ext: CHANGED: Connect (Local or Browserbase)
        SH->>Ext: Prompt LLM (Model/Provider Matrix)
        Ext-->>SH: Action/Observation
        SH-->>Task: Result (_success, etc.)
        Task-->>Runner: Return metrics
        opt BRAINTRUST_API_KEY set
            Runner->>BT: CHANGED: Stream experiment logs
        end
        Runner->>CLI: NEW: Update live progress table
    end
    Runner->>CLI: Aggregate final results
    CLI->>User: Display summary & by-model breakdown

    Note over User, Disc: Task Scaffolding (Optional)
    User->>CLI: `evals new <tier> <name>`
    CLI->>Disc: NEW: Generate defineBenchTask file
```
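The execution flow in the diagram (a per-trial loop bounded by a concurrency setting, a progress update after each trial, then a by-model breakdown) can be sketched roughly as follows. This is an illustrative sketch only, not the evals package's actual runner: `TrialResult`, `RunOptions`, `runAll`, `summarizeByModel`, and the `runTrial` callback are all hypothetical names introduced here, and the real Stagehand and Braintrust calls are deliberately left out.

```typescript
// Hypothetical types for illustration; none of these names come from the evals package.
type TrialResult = { task: string; model: string; success: boolean };
type RunTrial = (task: string, model: string) => Promise<TrialResult>;

interface RunOptions {
  tasks: string[];
  models: string[];
  trials: number;      // trials per (task, model) pair
  concurrency: number; // maximum trials in flight at once
  onProgress?: (done: number, total: number) => void;
}

// Runs every (task, model, trial) combination with a bounded worker pool.
async function runAll(opts: RunOptions, runTrial: RunTrial): Promise<TrialResult[]> {
  const jobs: Array<{ task: string; model: string }> = [];
  for (const task of opts.tasks) {
    for (const model of opts.models) {
      for (let i = 0; i < opts.trials; i++) jobs.push({ task, model });
    }
  }

  const results: TrialResult[] = [];
  let next = 0; // index of the next unclaimed job, shared by all workers

  const worker = async (): Promise<void> => {
    while (next < jobs.length) {
      const job = jobs[next++];
      let result: TrialResult;
      try {
        result = await runTrial(job.task, job.model);
      } catch {
        // A trial that throws is recorded as a failure instead of aborting the run.
        result = { ...job, success: false };
      }
      results.push(result);
      opts.onProgress?.(results.length, jobs.length); // stand-in for the live progress table
    }
  };

  const workerCount = Math.max(1, Math.min(opts.concurrency, jobs.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

// Aggregates pass counts per model, mirroring the "by-model breakdown" step.
function summarizeByModel(
  results: TrialResult[],
): Map<string, { passed: number; total: number }> {
  const summary = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = summary.get(r.model) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.success) entry.passed += 1;
    summary.set(r.model, entry);
  }
  return summary;
}
```

In this sketch a trial that throws counts as a failed trial, so one broken task does not stop the rest of the run. Whether the real runner behaves that way is an assumption.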
why
what changed
test plan
Summary by cubic
Revamped the `packages/evals` README to explain the current Stagehand Evals TUI/CLI: quickstart, commands, run targets, common flags (harness/agent modes), preview/progress/results, task scaffolding, and Braintrust tracing. Added screenshots plus a run demo GIF, clarified `.env` loading, and removed outdated notes to match auto-discovered tasks and the `evals` binary.
Written for commit 0f80088. Summary will update on new commits.