Conversation

@miguelg719

@miguelg719 miguelg719 commented Dec 16, 2025

why

Fixes benchmark eval runners to be more resilient and adds metadata for filtering/grouping eval tasks

what changed

  • Exports imageResize utility from core

test plan

@changeset-bot

changeset-bot bot commented Dec 16, 2025

🦋 Changeset detected

Latest commit: d2f1436

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Patch |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server | Patch |


@miguelg719 miguelg719 changed the title from "mprove benchmark handling and add metadata" to "Improve benchmark handling and add metadata" Dec 16, 2025
@miguelg719 miguelg719 force-pushed the miguelgonzalez/stg-1075-improve-benchmark-handling-and-add-metadata branch from f155f5b to cb69ecb Compare January 14, 2026 19:16
@github-actions

github-actions bot commented Jan 14, 2026

✱ Stainless preview builds

This PR will update the stagehand SDKs with the following commit message.

feat: Improve benchmark handling and add metadata

Edit this comment to update it. It will appear in the SDK's changelogs.

⚠️ stagehand-ruby studio · conflict

There was a regression in your SDK.

New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-python studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ✅) → build ✅ lint ❗ test ❗

pip install https://pkg.stainless.com/s/stagehand-python/27f7fd6f16a1f8f0510db4d5df90d46136907b33/stagehand_alpha-0.3.0-py3-none-any.whl
New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-openapi studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ⏳ test ⏳

New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-typescript studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ✅) → build ✅ lint ✅ test ❗

npm install https://pkg.stainless.com/s/stagehand-typescript/39b056138c65bc7e381ad838ac1bf0e7734cf3f4/dist.tar.gz
New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-kotlin studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ⚠️) → lint ⏳ test ⏳

New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-java studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ⚠️) → lint ❗ test ❗

New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-php studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ✅ test ✅

New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
stagehand-csharp studio · conflict
⚠️ stagehand-go studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ❗ test ❗

go get github.com/stainless-sdks/stagehand-go@6ad6453b72a4f1b9ac8324f30460f7b8e8a22b29
New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-cli studio · code · diff

There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ❗ test ❗

New diagnostics (1 error, 8 note)
Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless: Transformation merge had no effect on the OpenAPI schema and can safely be removed.

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-20 20:55:12 UTC

@miguelg719 miguelg719 marked this pull request as ready for review January 14, 2026 20:20
@greptile-apps

greptile-apps bot commented Jan 14, 2026

Greptile Summary

This PR improves the resilience and observability of benchmark evaluation runners by adding proper resource cleanup, fixing a typo in event names, and adding metadata tracking.

Key Changes:

  • Exported imageResize utility from core package to resize screenshots during eval runs, reducing memory usage and API payload sizes
  • Fixed critical typo in webvoyager.ts event name (agent_screensot_taken_event → agent_screenshot_taken_event) that was preventing screenshots from being collected
  • Added finally blocks to both eval tasks to ensure event listeners are always cleaned up, preventing resource leaks and hanging processes
  • Increased page navigation timeout from 60s to 120s in onlineMind2Web.ts and added explicit 120s timeout to webvoyager.ts
  • Enhanced testcase metadata with structured fields (dataset, task_id, website, difficulty) for better benchmark analysis
  • Simplified tag structure by removing redundant per-task tags and category prefixes
  • Added explicit memory cleanup by clearing screenshot buffers after evaluation
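
To illustrate why structured metadata helps with filtering and grouping, here is a minimal sketch. Only the four field names (`dataset`, `task_id`, `website`, `difficulty`) come from the summary above; the `Testcase` shape, the `groupByMetadata` helper, and the sample values are assumptions made up for this example and are not the repository's actual types.

```typescript
// Hypothetical testcase shape; only the four metadata field names come from the PR summary.
interface TestcaseMetadata {
  dataset: string;
  task_id: string;
  website?: string;
  difficulty?: string;
}

interface Testcase {
  name: string;
  metadata: TestcaseMetadata;
}

// Group testcases by one metadata field; missing values land under "unknown".
function groupByMetadata(
  testcases: Testcase[],
  field: keyof TestcaseMetadata,
): Record<string, Testcase[]> {
  const groups: Record<string, Testcase[]> = {};
  for (const tc of testcases) {
    const key = tc.metadata[field] ?? "unknown";
    if (!groups[key]) groups[key] = [];
    groups[key].push(tc);
  }
  return groups;
}

// Sample data invented for the example.
const sampleTestcases: Testcase[] = [
  { name: "a", metadata: { dataset: "webvoyager", task_id: "t1", website: "example.com" } },
  { name: "b", metadata: { dataset: "webvoyager", task_id: "t2", difficulty: "hard" } },
  { name: "c", metadata: { dataset: "onlineMind2Web", task_id: "t3", difficulty: "easy" } },
];

const byDataset = groupByMetadata(sampleTestcases, "dataset");
const byDifficulty = groupByMetadata(sampleTestcases, "difficulty");
```

With fields like these, a benchmark report can slice results by dataset or difficulty without parsing tag strings.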

Minor Issue:

  • The imageResize function doesn't validate that metadata.width and metadata.height exist before using them, which could cause runtime errors with certain image types
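
One way to address this is to validate the dimensions before computing the resize target. The sketch below is an assumption about how such a guard could look, not the actual `imageResize` implementation: `ImageMetadataLike` only mimics the optional `width`/`height` fields described in the review, and `scaledDimensions` is a hypothetical helper that leaves the call into the image library out entirely.

```typescript
// Mimics the optional width/height fields the review describes; not the library's own type.
interface ImageMetadataLike {
  width?: number;
  height?: number;
}

// Hypothetical helper: compute target dimensions, failing loudly instead of producing NaN.
function scaledDimensions(
  metadata: ImageMetadataLike,
  scale: number,
): { width: number; height: number } {
  const { width, height } = metadata;
  if (width === undefined || height === undefined || width <= 0 || height <= 0) {
    throw new Error("Image metadata is missing a usable width or height");
  }
  if (!isFinite(scale) || scale <= 0) {
    throw new Error("scale must be a positive, finite number");
  }
  // Round to whole pixels and never go below 1px on either side.
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

const target = scaledDimensions({ width: 1000, height: 500 }, 0.7);
```

Throwing early keeps the failure close to its cause; a caller that prefers to skip unreadable screenshots could catch the error and fall back to the original buffer.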

Confidence Score: 4/5

  • This PR is safe to merge with one minor issue requiring attention
  • The changes significantly improve benchmark reliability through proper cleanup and error handling. The typo fix is critical and was preventing proper functionality. The metadata additions are well-structured. However, the imageResize function has a potential issue with undefined metadata properties that should be addressed to prevent runtime errors.
  • Pay close attention to packages/core/lib/utils.ts - the imageResize function needs validation for metadata width/height

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/core/lib/utils.ts | Added imageResize utility function with sharp dependency, but lacks error handling for undefined metadata properties |
| packages/evals/tasks/agent/onlineMind2Web.ts | Fixed typo, improved error handling with finally block, increased timeout, added screenshot resizing, and enhanced memory management |
| packages/evals/tasks/agent/webvoyager.ts | Fixed event name typo, improved error handling with finally block, added timeout config, added screenshot resizing, and enhanced memory management |
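
The "finally block" cleanup mentioned for both eval tasks can be sketched as follows. `TinyBus` is a hypothetical stand-in written for this example and is not the real `v3.bus` API; `withListener` is likewise an assumed helper. The real tasks are async, but the same shape applies with `await` inside the `try`.

```typescript
type Handler = (payload: unknown) => void;

// Tiny stand-in for an event bus; NOT the real v3.bus implementation.
class TinyBus {
  private handlers: { [event: string]: Handler[] } = {};

  on(event: string, handler: Handler): void {
    if (!this.handlers[event]) this.handlers[event] = [];
    this.handlers[event].push(handler);
  }

  off(event: string, handler: Handler): void {
    this.handlers[event] = (this.handlers[event] || []).filter((h) => h !== handler);
  }

  emit(event: string, payload: unknown): void {
    for (const h of this.handlers[event] || []) h(payload);
  }

  listenerCount(event: string): number {
    return (this.handlers[event] || []).length;
  }
}

// Run `body` with a listener attached and always detach it afterwards.
function withListener<T>(bus: TinyBus, event: string, handler: Handler, body: () => T): T {
  bus.on(event, handler);
  try {
    return body();
  } finally {
    // Runs on success and on throw, so the listener cannot leak.
    bus.off(event, handler);
  }
}
```

Because the detach happens in `finally`, a failing agent run leaves no handler behind to keep the process alive or to receive events from a later task.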

Sequence Diagram

sequenceDiagram
    participant Runner as Eval Runner
    participant Task as onlineMind2Web/webvoyager
    participant V3 as V3 Instance
    participant Agent as Agent
    participant Collector as ScreenshotCollector
    participant Utils as imageResize

    Runner->>Task: Execute eval with input params
    Task->>Task: Initialize screenshotCollector & handler (nullable)
    Task->>V3: Get page and navigate (120s timeout)
    Task->>Agent: Create agent with model & systemPrompt
    Task->>Collector: new ScreenshotCollector(v3, maxScreenshots: 7)
    Task->>V3: bus.on("agent_screenshot_taken_event", handler)
    
    Task->>Agent: execute(instruction, maxSteps)
    loop During Agent Execution
        Agent->>V3: Emit agent_screenshot_taken_event
        V3->>Task: screenshotHandler(buffer)
        Task->>Collector: addScreenshot(buffer)
        Collector->>Collector: Apply MSE/SSIM filtering
    end
    
    Agent-->>Task: agentResult
    Task->>V3: bus.off("agent_screenshot_taken_event")
    Task->>Collector: stop()
    Collector-->>Task: screenshots[]
    
    loop For each screenshot
        Task->>Utils: imageResize(screenshot, 0.7)
        Utils->>Utils: Get metadata, calculate dimensions
        Utils->>Utils: Resize with lanczos3 kernel
        Utils-->>Task: resized Buffer
    end
    
    Task->>Task: Clear screenshot buffers
    Task->>V3: V3Evaluator.ask(question, screenshots)
    V3-->>Task: evalResult
    
    alt Finally Block
        Task->>V3: bus.off (cleanup)
        Task->>Collector: stop() (cleanup)
    end
    
    Task-->>Runner: Return evaluation result
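
As a rough illustration of the "Apply MSE/SSIM filtering" step with a capped collector, the sketch below drops frames that are nearly identical to the previous one and keeps only the most recent frames. Everything here is an assumption for illustration: the real `ScreenshotCollector`'s thresholds, eviction policy, and pixel format are not shown in this conversation, `FrameCollector` is a hypothetical name, frames are modeled as plain number arrays, and only MSE (not SSIM) is covered.

```typescript
// Mean squared error between two equally sized pixel arrays.
function meanSquaredError(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("frames must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum / a.length;
}

// Hypothetical collector: skips near-duplicate frames and keeps only the latest maxFrames.
class FrameCollector {
  private frames: number[][] = [];
  private maxFrames: number;
  private mseThreshold: number;

  constructor(maxFrames: number, mseThreshold: number) {
    this.maxFrames = maxFrames;
    this.mseThreshold = mseThreshold;
  }

  // Returns true when the frame was kept, false when it was filtered out.
  add(frame: number[]): boolean {
    const last = this.frames[this.frames.length - 1];
    if (
      last &&
      last.length === frame.length &&
      meanSquaredError(last, frame) < this.mseThreshold
    ) {
      return false;
    }
    this.frames.push(frame);
    // Evict the oldest frame once the cap is exceeded (assumed policy).
    if (this.frames.length > this.maxFrames) this.frames.shift();
    return true;
  }

  getFrames(): number[][] {
    return this.frames.slice();
  }
}
```

Filtering before storing keeps memory bounded during long agent runs, which fits the stated goal of reducing memory usage and payload size.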

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 1 comment


@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 10 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/evals/tasks/agent/webvoyager.ts">

<violation number="1" location="packages/evals/tasks/agent/webvoyager.ts:111">
P2: The `screenshotCollector.stop()` is called twice on successful runs - once in the try block and again in the finally block. Consider either:
1. Only doing cleanup in the finally block (remove from try block), or
2. Setting `screenshotCollector = null` after cleanup in the success path to prevent double-stop.</violation>
</file>

<file name="packages/evals/tasks/agent/onlineMind2Web.ts">

<violation number="1" location="packages/evals/tasks/agent/onlineMind2Web.ts:122">
P2: Missing `await` for async `stop()` call. The `ScreenshotCollector.stop()` method returns a `Promise<Buffer[]>`, so it should be awaited to ensure cleanup completes before the function exits.</violation>
</file>

<file name="packages/core/lib/utils.ts">

<violation number="1" location="packages/core/lib/utils.ts:852">
P1: Missing null check for `metadata.width` and `metadata.height`. Sharp's `metadata()` returns optional width/height properties that can be `undefined` for corrupted images or certain image types. This will cause `NaN` to be passed to `resize()`, leading to runtime errors.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@@ -1,5 +1,5 @@
import { EvalFunction } from "../../types/evals";

P2: The screenshotCollector.stop() is called twice on successful runs - once in the try block and again in the finally block. Consider either:

  1. Only doing cleanup in the finally block (remove from try block), or
  2. Setting screenshotCollector = null after cleanup in the success path to prevent double-stop.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/agent/webvoyager.ts, line 111:

<file context>
@@ -88,5 +105,21 @@ export const webvoyager: EvalFunction = async ({
+  } finally {
+    // Always clean up event listener and stop collector to prevent hanging
+    if (screenshotHandler) {
+      try {
+        v3.bus.off("agent_screenshot_taken_event", screenshotHandler);
+      } catch {
</file context>
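
The two P2 findings (a double `stop()` on success and a missing `await`) could both be handled by awaiting the stop in the success path and clearing the reference so the `finally` block only stops a collector that is still active. This is a sketch under assumptions, not the code that was merged: `StoppableCollector` and `runWithCollector` are hypothetical names, and `stop()` is modeled as returning `Promise<string[]>` instead of `Promise<Buffer[]>` to keep the example dependency-free.

```typescript
// Hypothetical interface; the review states the real stop() returns Promise<Buffer[]>.
interface StoppableCollector {
  stop(): Promise<string[]>;
}

async function runWithCollector(
  collector: StoppableCollector,
  work: () => Promise<void>,
): Promise<string[]> {
  let active: StoppableCollector | null = collector;
  try {
    await work();
    // Success path: await the stop, then clear the reference so finally skips it.
    const screenshots = await active.stop();
    active = null;
    return screenshots;
  } finally {
    // Failure path only: the collector is still active, so stop it here.
    if (active) {
      try {
        await active.stop();
      } catch {
        // Cleanup is best effort; the original error keeps propagating.
      }
    }
  }
}
```

With this shape `stop()` runs exactly once per run, whether the work succeeds or throws, and the function never returns before cleanup has finished.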

@miguelg719 miguelg719 force-pushed the miguelgonzalez/stg-1075-improve-benchmark-handling-and-add-metadata branch from 3aa02e4 to d2f1436 Compare January 20, 2026 20:55
@miguelg719 miguelg719 merged commit 40e1d80 into main Jan 20, 2026
18 checks passed
seanmcguire12 pushed a commit that referenced this pull request Jan 22, 2026
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine; whenever you add more changesets to main, this PR will
be updated.


# Releases
## @browserbasehq/stagehand@3.0.8

### Patch Changes

- [#1514](#1514)
[`40ce5cc`](40ce5cc)
Thanks [@tkattkat](https://github.com/tkattkat)! - Rename the close tool
in agent to "done"

- [#1574](#1574)
[`5506f41`](5506f41)
Thanks [@tkattkat](https://github.com/tkattkat)! - fix(server): pass
cdpUrl to localBrowserLaunchOptions when launchOptions absent

- [#1521](#1521)
[`84c05ca`](84c05ca)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: get
agent cache working in API mode

- [#1486](#1486)
[`692ffa0`](692ffa0)
Thanks [@tkattkat](https://github.com/tkattkat)! - improve logging in
agent

- [#1551](#1551)
[`1ef8901`](1ef8901)
Thanks [@miguelg719](https://github.com/miguelg719)! - move extract
handler response log to after URL injection

- [#1495](#1495)
[`72ac775`](72ac775)
Thanks [@tkattkat](https://github.com/tkattkat)! - export tool function
& type to simplify defining custom tools

- [#1481](#1481)
[`3d5af07`](3d5af07)
Thanks [@tkattkat](https://github.com/tkattkat)! - add waitForTimeout to
page

- [#1423](#1423)
[`40e1d80`](40e1d80)
Thanks [@miguelg719](https://github.com/miguelg719)! - Improve benchmark
handling and add metadata

- [#1588](#1588)
[`56c0d24`](56c0d24)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add
SnapshotOptions to page.snapshot()

- [#1483](#1483)
[`16d72fb`](16d72fb)
Thanks [@tkattkat](https://github.com/tkattkat)! - Optimize screenshot
handling in agent hybrid mode

- [#1498](#1498)
[`088c4cc`](088c4cc)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix:
replaying cached actions (for agent & act) now uses the originally
defined model (instead of the default model) when an action fails and
inference must be rerun

- [#1575](#1575)
[`4276f4a`](4276f4a)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - expose port
param in localBrowserLaunchOptions

- [#1544](#1544)
[`6005786`](6005786)
Thanks [@tkattkat](https://github.com/tkattkat)! - Recommend hybrid mode
over DOM mode in agent, which is now considered legacy

- [#1505](#1505)
[`6fbf5fc`](6fbf5fc)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add structured output
to agent result + ensure close tool is always called

- [#1511](#1511)
[`704cf18`](704cf18)
Thanks [@shrey150](https://github.com/shrey150)! - Fix ControlOrMeta
keypress event

- [#1480](#1480)
[`091296e`](091296e)
Thanks [@tkattkat](https://github.com/tkattkat)! - Update agent to only
calculate xpath when caching is enabled

- [#1509](#1509)
[`e56c6eb`](e56c6eb)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support
for page.waitForSelector()

- [#1478](#1478)
[`2cb78d0`](2cb78d0)
Thanks [@tkattkat](https://github.com/tkattkat)! - update agent message
handling

- [#1518](#1518)
[`5dad639`](5dad639)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add
page.snapshot() for capturing a stringified DOM snapshot of the page,
including an xpath map & url map

- [#1576](#1576)
[`b7c2571`](b7c2571)
Thanks [@tkattkat](https://github.com/tkattkat)! - utilize
waitForSelector when running agent cache

- [#1560](#1560)
[`4c69117`](4c69117)
Thanks [@tkattkat](https://github.com/tkattkat)! - Update coordinate
handling in cua and hybrid

## @browserbasehq/stagehand-server@3.5.0

### Minor Changes

- [#1578](#1578)
[`a5074bd`](a5074bd)
Thanks [@monadoid](https://github.com/monadoid)! - /end endpoint no
longer takes an empty object - instead, no request body is required.

### Patch Changes

- Updated dependencies
\[[`40ce5cc`](40ce5cc),
[`5506f41`](5506f41),
[`84c05ca`](84c05ca),
[`692ffa0`](692ffa0),
[`1ef8901`](1ef8901),
[`72ac775`](72ac775),
[`3d5af07`](3d5af07),
[`40e1d80`](40e1d80),
[`56c0d24`](56c0d24),
[`16d72fb`](16d72fb),
[`088c4cc`](088c4cc),
[`4276f4a`](4276f4a),
[`6005786`](6005786),
[`6fbf5fc`](6fbf5fc),
[`704cf18`](704cf18),
[`091296e`](091296e),
[`e56c6eb`](e56c6eb),
[`2cb78d0`](2cb78d0),
[`5dad639`](5dad639),
[`b7c2571`](b7c2571),
[`4c69117`](4c69117)]:
    -   @browserbasehq/stagehand@3.0.8

## @browserbasehq/stagehand-evals@1.1.7

### Patch Changes

- Updated dependencies
\[[`40ce5cc`](40ce5cc),
[`5506f41`](5506f41),
[`84c05ca`](84c05ca),
[`692ffa0`](692ffa0),
[`1ef8901`](1ef8901),
[`72ac775`](72ac775),
[`3d5af07`](3d5af07),
[`40e1d80`](40e1d80),
[`56c0d24`](56c0d24),
[`16d72fb`](16d72fb),
[`088c4cc`](088c4cc),
[`4276f4a`](4276f4a),
[`6005786`](6005786),
[`6fbf5fc`](6fbf5fc),
[`704cf18`](704cf18),
[`091296e`](091296e),
[`e56c6eb`](e56c6eb),
[`2cb78d0`](2cb78d0),
[`5dad639`](5dad639),
[`b7c2571`](b7c2571),
[`4c69117`](4c69117)]:
    -   @browserbasehq/stagehand@3.0.8

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>