Improve benchmark handling and add metadata #1423
🦋 Changeset detected
Latest commit: d2f1436
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
f155f5b to cb69ecb
✱ Stainless preview builds
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-python
There was a regression in your SDK.
generate ❗ (prev: generate ✅) → build ✅ → lint ❗ → test ❗
pip install https://pkg.stainless.com/s/stagehand-python/27f7fd6f16a1f8f0510db4d5df90d46136907b33/stagehand_alpha-0.3.0-py3-none-any.whl
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-openapi
There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ⏳ → test ⏳
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-typescript
There was a regression in your SDK.
generate ❗ (prev: generate ✅) → build ✅ → lint ✅ → test ❗
npm install https://pkg.stainless.com/s/stagehand-typescript/39b056138c65bc7e381ad838ac1bf0e7734cf3f4/dist.tar.gz
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-kotlin
There was a regression in your SDK.
generate ❗ (prev: generate ⚠️) → lint ⏳ → test ⏳
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-java
There was a regression in your SDK.
generate ❗ (prev: generate ⚠️) → lint ❗ → test ❗
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-php
There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ✅ → test ✅
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-go
There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ❗ → test ❗
go get github.com/stainless-sdks/stagehand-go@6ad6453b72a4f1b9ac8324f30460f7b8e8a22b29
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⚠️ stagehand-cli
There was a regression in your SDK.
generate ❗ (prev: generate ✅) → lint ❗ → test ❗
New diagnostics (1 error, 8 note)
❗ Transform/Failed: Transformation remove at step 0 exited with a failure
💡 Transform/Useless (×6): Transformation update had no effect on the OpenAPI schema and can safely be removed.
💡 Transform/Useless (×2): Transformation merge had no effect on the OpenAPI schema and can safely be removed.
⏳ These are partial results; builds are still running.
This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-20 20:55:12 UTC
Greptile Summary
This PR improves the resilience and observability of benchmark evaluation runners by adding proper resource cleanup, fixing a typo in event names, and adding metadata tracking.
Key Changes:
Minor Issue:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Runner as Eval Runner
participant Task as onlineMind2Web/webvoyager
participant V3 as V3 Instance
participant Agent as Agent
participant Collector as ScreenshotCollector
participant Utils as imageResize
Runner->>Task: Execute eval with input params
Task->>Task: Initialize screenshotCollector & handler (nullable)
Task->>V3: Get page and navigate (120s timeout)
Task->>Agent: Create agent with model & systemPrompt
Task->>Collector: new ScreenshotCollector(v3, maxScreenshots: 7)
Task->>V3: bus.on("agent_screenshot_taken_event", handler)
Task->>Agent: execute(instruction, maxSteps)
loop During Agent Execution
Agent->>V3: Emit agent_screenshot_taken_event
V3->>Task: screenshotHandler(buffer)
Task->>Collector: addScreenshot(buffer)
Collector->>Collector: Apply MSE/SSIM filtering
end
Agent-->>Task: agentResult
Task->>V3: bus.off("agent_screenshot_taken_event")
Task->>Collector: stop()
Collector-->>Task: screenshots[]
loop For each screenshot
Task->>Utils: imageResize(screenshot, 0.7)
Utils->>Utils: Get metadata, calculate dimensions
Utils->>Utils: Resize with lanczos3 kernel
Utils-->>Task: resized Buffer
end
Task->>Task: Clear screenshot buffers
Task->>V3: V3Evaluator.ask(question, screenshots)
V3-->>Task: evalResult
alt Finally Block
Task->>V3: bus.off (cleanup)
Task->>Collector: stop() (cleanup)
end
Task-->>Runner: Return evaluation result
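The "Get metadata, calculate dimensions" step in the diagram is where the review below flags a missing null check. A minimal sketch of a guarded dimension calculation follows. It assumes only that the metadata object may omit `width` and `height`, as the review states for Sharp's `metadata()`. The names `ImageMetadataLike` and `scaledDimensions` are hypothetical and are not taken from the PR.

```typescript
// Hypothetical shape: only the two optional fields the review comment mentions.
interface ImageMetadataLike {
  width?: number;
  height?: number;
}

interface Dimensions {
  width: number;
  height: number;
}

// Returns scaled integer dimensions, or null when the source size is unknown
// or the scale factor is unusable, so NaN never reaches a resize call.
function scaledDimensions(
  metadata: ImageMetadataLike,
  scaleFactor: number,
): Dimensions | null {
  const { width, height } = metadata;
  if (!width || !height) return null; // undefined or 0: nothing to scale
  if (!Number.isFinite(scaleFactor) || scaleFactor <= 0) return null;
  return {
    width: Math.max(1, Math.round(width * scaleFactor)),
    height: Math.max(1, Math.round(height * scaleFactor)),
  };
}

// Example: a caller can fall back to the original buffer when null is returned.
const dims = scaledDimensions({ width: 1000, height: 500 }, 0.7);
console.log(dims);
```

Returning `null` instead of throwing leaves the fallback decision (skip the resize, keep the original image) to the caller. Whether that matches the PR's intended behavior is an assumption.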
9 files reviewed, 1 comment
3 issues found across 10 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="packages/evals/tasks/agent/webvoyager.ts">
<violation number="1" location="packages/evals/tasks/agent/webvoyager.ts:111">
P2: The `screenshotCollector.stop()` is called twice on successful runs - once in the try block and again in the finally block. Consider either:
1. Only doing cleanup in the finally block (remove from try block), or
2. Setting `screenshotCollector = null` after cleanup in the success path to prevent double-stop.</violation>
</file>
<file name="packages/evals/tasks/agent/onlineMind2Web.ts">
<violation number="1" location="packages/evals/tasks/agent/onlineMind2Web.ts:122">
P2: Missing `await` for async `stop()` call. The `ScreenshotCollector.stop()` method returns a `Promise<Buffer[]>`, so it should be awaited to ensure cleanup completes before the function exits.</violation>
</file>
<file name="packages/core/lib/utils.ts">
<violation number="1" location="packages/core/lib/utils.ts:852">
P1: Missing null check for `metadata.width` and `metadata.height`. Sharp's `metadata()` returns optional width/height properties that can be `undefined` for corrupted images or certain image types. This will cause `NaN` to be passed to `resize()`, leading to runtime errors.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
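To illustrate the missing-`await` issue reported for `onlineMind2Web.ts`, here is a small sketch of how an un-awaited asynchronous `stop()` lets the caller continue before cleanup has finished. `SlowCollector` is a hypothetical stand-in, not the real `ScreenshotCollector`.

```typescript
// Hypothetical collector whose stop() completes asynchronously.
class SlowCollector {
  stopped = false;

  async stop(): Promise<string[]> {
    await Promise.resolve(); // simulate asynchronous teardown
    this.stopped = true;
    return [];
  }
}

// Fire-and-forget: the function returns while stop() is still pending.
async function finishWithoutAwait(collector: SlowCollector): Promise<boolean> {
  try {
    // task body would run here
  } finally {
    void collector.stop();
  }
  return collector.stopped;
}

// Awaited: cleanup has completed by the time the function returns.
async function finishWithAwait(collector: SlowCollector): Promise<boolean> {
  try {
    // task body would run here
  } finally {
    await collector.stop();
  }
  return collector.stopped;
}
```

In this sketch `finishWithoutAwait` resolves to `false` and `finishWithAwait` resolves to `true`, which is the difference the review comment is pointing at.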
@@ -1,5 +1,5 @@
import { EvalFunction } from "../../types/evals";
P2: `screenshotCollector.stop()` is called twice on successful runs: once in the try block and again in the finally block. Consider either:
- Only doing cleanup in the finally block (remove it from the try block), or
- Setting `screenshotCollector = null` after cleanup in the success path to prevent a double stop.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/agent/webvoyager.ts, line 111:
<comment>The `screenshotCollector.stop()` is called twice on successful runs - once in the try block and again in the finally block. Consider either:
1. Only doing cleanup in the finally block (remove from try block), or
2. Setting `screenshotCollector = null` after cleanup in the success path to prevent double-stop.</comment>
<file context>
@@ -88,5 +105,21 @@ export const webvoyager: EvalFunction = async ({
+ } finally {
+ // Always clean up event listener and stop collector to prevent hanging
+ if (screenshotHandler) {
+ try {
+ v3.bus.off("agent_screenshot_taken_event", screenshotHandler);
+ } catch {
</file context>
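The second option from the comment above (null out the references after the success-path cleanup so the `finally` block becomes a no-op) can be sketched as follows. `TinyBus`, `CountingCollector`, and `runTask` are hypothetical stand-ins written for this example. The real `v3.bus` and `ScreenshotCollector` APIs may differ, and only the event name is taken from the diff context.

```typescript
type Handler = (payload: string) => void;

// Minimal hypothetical event bus (stand-in for v3.bus).
class TinyBus {
  private handlers = new Map<string, Set<Handler>>();
  on(event: string, handler: Handler): void {
    if (!this.handlers.has(event)) this.handlers.set(event, new Set());
    this.handlers.get(event)!.add(handler);
  }
  off(event: string, handler: Handler): void {
    this.handlers.get(event)?.delete(handler);
  }
  emit(event: string, payload: string): void {
    this.handlers.get(event)?.forEach((h) => h(payload));
  }
  listenerCount(event: string): number {
    return this.handlers.get(event)?.size ?? 0;
  }
}

// Hypothetical collector that counts how often stop() is called.
class CountingCollector {
  stopCalls = 0;
  private shots: string[] = [];
  add(shot: string): void {
    this.shots.push(shot);
  }
  async stop(): Promise<string[]> {
    this.stopCalls += 1;
    return this.shots;
  }
}

async function runTask(
  bus: TinyBus,
  collector: CountingCollector,
  fail = false,
): Promise<string[]> {
  const event = "agent_screenshot_taken_event";
  let activeCollector: CountingCollector | null = collector;
  let handler: Handler | null = (shot) => collector.add(shot);
  bus.on(event, handler);
  try {
    bus.emit(event, "shot-1"); // stands in for agent.execute(...)
    if (fail) throw new Error("agent failed");
    // Success path: clean up, then clear the references.
    bus.off(event, handler);
    handler = null;
    const shots = await activeCollector.stop();
    activeCollector = null;
    return shots;
  } finally {
    // Runs on every path, but only acts on what is still live.
    if (handler) bus.off(event, handler);
    if (activeCollector) await activeCollector.stop();
  }
}
```

With this shape, `stop()` runs exactly once whether the task succeeds or throws, and the listener is always removed.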
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
3aa02e4 to d2f1436
This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated. # Releases ## @browserbasehq/stagehand@3.0.8 ### Patch Changes - [#1514](#1514) [`40ce5cc`](40ce5cc) Thanks [@tkattkat](https://github.com/tkattkat)! - Rename the close tool in agent to "done" - [#1574](#1574) [`5506f41`](5506f41) Thanks [@tkattkat](https://github.com/tkattkat)! - fix(server): pass cdpUrl to localBrowserLaunchOptions when launchOptions absent - [#1521](#1521) [`84c05ca`](84c05ca) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: get agent cache working in API mode - [#1486](#1486) [`692ffa0`](692ffa0) Thanks [@tkattkat](https://github.com/tkattkat)! - improve logging in agent - [#1551](#1551) [`1ef8901`](1ef8901) Thanks [@miguelg719](https://github.com/miguelg719)! - move extract handler response log to after URL injection - [#1495](#1495) [`72ac775`](72ac775) Thanks [@tkattkat](https://github.com/tkattkat)! - export tool function & type to simplify defining custom tools - [#1481](#1481) [`3d5af07`](3d5af07) Thanks [@tkattkat](https://github.com/tkattkat)! - add waitForTimeout to page - [#1423](#1423) [`40e1d80`](40e1d80) Thanks [@miguelg719](https://github.com/miguelg719)! - Improve benchmark handling and add metadata - [#1588](#1588) [`56c0d24`](56c0d24) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add SnapshotOptions to page.snapshot() - [#1483](#1483) [`16d72fb`](16d72fb) Thanks [@tkattkat](https://github.com/tkattkat)! - Optimize screenshot handling in agent hybrid mode - [#1498](#1498) [`088c4cc`](088c4cc) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! 
- fix: replaying cached actions (for agent & act) now uses the originally defined model, (instead of default model) when action fails and rerunning inference is needed - [#1575](#1575) [`4276f4a`](4276f4a) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - expose port param in localBrowserLaunchOptions - [#1544](#1544) [`6005786`](6005786) Thanks [@tkattkat](https://github.com/tkattkat)! - Recommend hybrid mode over DOM mode in agent, which is now considered legacy - [#1505](#1505) [`6fbf5fc`](6fbf5fc) Thanks [@tkattkat](https://github.com/tkattkat)! - Add structured output to agent result + ensure close tool is always called - [#1511](#1511) [`704cf18`](704cf18) Thanks [@shrey150](https://github.com/shrey150)! - Fix ControlOrMeta keypress event - [#1480](#1480) [`091296e`](091296e) Thanks [@tkattkat](https://github.com/tkattkat)! - Update agent to only calculate xpath when caching is enabled - [#1509](#1509) [`e56c6eb`](e56c6eb) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support for page.waitForSelector() - [#1478](#1478) [`2cb78d0`](2cb78d0) Thanks [@tkattkat](https://github.com/tkattkat)! - update agent message handling - [#1518](#1518) [`5dad639`](5dad639) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add page.snapshot() for capturing a stringified DOM snapshot of the page, including an xpath map & url map - [#1576](#1576) [`b7c2571`](b7c2571) Thanks [@tkattkat](https://github.com/tkattkat)! - utilize waitForSelector when running agent cache - [#1560](#1560) [`4c69117`](4c69117) Thanks [@tkattkat](https://github.com/tkattkat)! - Update coordinate handling in cua and hybrid ## @browserbasehq/stagehand-server@3.5.0 ### Minor Changes - [#1578](#1578) [`a5074bd`](a5074bd) Thanks [@monadoid](https://github.com/monadoid)! - /end endpoint no longer takes an empty object - instead, no request body is required. 
### Patch Changes - Updated dependencies \[[`40ce5cc`](40ce5cc), [`5506f41`](5506f41), [`84c05ca`](84c05ca), [`692ffa0`](692ffa0), [`1ef8901`](1ef8901), [`72ac775`](72ac775), [`3d5af07`](3d5af07), [`40e1d80`](40e1d80), [`56c0d24`](56c0d24), [`16d72fb`](16d72fb), [`088c4cc`](088c4cc), [`4276f4a`](4276f4a), [`6005786`](6005786), [`6fbf5fc`](6fbf5fc), [`704cf18`](704cf18), [`091296e`](091296e), [`e56c6eb`](e56c6eb), [`2cb78d0`](2cb78d0), [`5dad639`](5dad639), [`b7c2571`](b7c2571), [`4c69117`](4c69117)]: - @browserbasehq/stagehand@3.0.8 ## @browserbasehq/stagehand-evals@1.1.7 ### Patch Changes - Updated dependencies \[[`40ce5cc`](40ce5cc), [`5506f41`](5506f41), [`84c05ca`](84c05ca), [`692ffa0`](692ffa0), [`1ef8901`](1ef8901), [`72ac775`](72ac775), [`3d5af07`](3d5af07), [`40e1d80`](40e1d80), [`56c0d24`](56c0d24), [`16d72fb`](16d72fb), [`088c4cc`](088c4cc), [`4276f4a`](4276f4a), [`6005786`](6005786), [`6fbf5fc`](6fbf5fc), [`704cf18`](704cf18), [`091296e`](091296e), [`e56c6eb`](e56c6eb), [`2cb78d0`](2cb78d0), [`5dad639`](5dad639), [`b7c2571`](b7c2571), [`4c69117`](4c69117)]: - @browserbasehq/stagehand@3.0.8 Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
why
Fixes benchmark eval runners to be more resilient and adds metadata for filtering/grouping eval tasks
what changed
imageResize utility from core
test plan