update evals #1139
Conversation
Greptile Overview
Summary
This PR enhances the eval infrastructure with more granular metadata tagging and improved error parsing capabilities.
Key Changes:
- Added `PROXY_ERROR` error type and improved error detection logic with case-insensitive matching for ANTIBOT, proxy, timeout, network, and parsing errors (see the sketch after this list)
- Standardized error handling across eval tasks by removing try-catch blocks and letting errors propagate for centralized categorization
- Added `execution_time`, `final_answer`, and `agent_steps` tracking to eval outputs for better metrics
- Refactored agent initialization to use `modelToAgentProviderMap` with validation to ensure only supported models run agent tasks
- Increased timeouts (30s → 60s DOM settle, 75s → 120s page navigation) and max steps (50 → 75-80) for agent tasks
- Removed WebBench and OSWorld dataset support, simplified dataset filtering logic
- Changed evaluator API key from `GOOGLE_GENERATIVE_AI_API_KEY` to `GEMINI_API_KEY`
- Simplified GAIA evaluation to text-only (removed screenshot collection)
- Removed WebVoyager ground truth checking in favor of standard evaluation flow
- Added access denied detection in onlineMind2Web that throws for proper error categorization
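As a rough illustration of the centralized, case-insensitive error categorization described in the first bullet, here is a minimal sketch; the `categorizeError` helper and the exact matched substrings are assumptions for illustration, not the PR's actual code:

```typescript
// Hypothetical sketch of runner-level error categorization.
// The ErrorType values mirror those named in this review; the matched substrings are assumptions.
type ErrorType =
  | "ANTIBOT"
  | "PROXY_ERROR"
  | "TIMEOUT"
  | "NETWORK_ERROR"
  | "PARSING_ERROR"
  | "UNKNOWN";

function categorizeError(error: unknown): ErrorType {
  const message = (error instanceof Error ? error.message : String(error)).toLowerCase();

  if (message.includes("antibot") || message.includes("access denied")) return "ANTIBOT";
  if (message.includes("proxy")) return "PROXY_ERROR";
  if (message.includes("timeout") || message.includes("timed out")) return "TIMEOUT";
  if (message.includes("network")) return "NETWORK_ERROR";
  if (message.includes("pars")) return "PARSING_ERROR"; // matches "parse" / "parsing"
  return "UNKNOWN";
}
```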
Confidence Score: 4/5
- This PR is safe to merge with minor attention to configuration changes
- The changes are well-structured improvements to the eval infrastructure: the refactoring standardizes error handling, adds better metadata tracking, and removes unused datasets. The score is 4 rather than 5 because (1) the environment variable rename to GEMINI_API_KEY may require documentation or a migration step (see the sketch after this list), (2) the increased timeouts could mask underlying issues, and (3) removing try-catch blocks from tasks means errors must be properly caught at the runner level
- Pay attention to evals/initStagehand.ts for the config restructuring and evals/index.eval.ts for the error handling logic changes
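Since the review flags the GEMINI_API_KEY rename as a possible migration concern, one hypothetical way to soften it would be a fallback to the old variable name; this is a sketch of an option, not something the PR does:

```typescript
// Hypothetical migration shim for the evaluator API key rename.
// Reading the old variable as a fallback keeps existing CI configurations working.
const geminiApiKey =
  process.env.GEMINI_API_KEY ?? process.env.GOOGLE_GENERATIVE_AI_API_KEY;

if (!geminiApiKey) {
  throw new Error(
    "GEMINI_API_KEY is not set (formerly GOOGLE_GENERATIVE_AI_API_KEY)",
  );
}
```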
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| types/evals.ts | 5/5 | Added PROXY_ERROR to ErrorType enum for better error categorization |
| evals/initStagehand.ts | 4/5 | Refactored config structure, increased DOM timeout to 60s, added error handling wrapper, updated agent config logic to use modelToAgentProviderMap |
| evals/index.eval.ts | 4/5 | Removed WebBench and OSWorld dataset support, improved error parsing with case-insensitive checks and new ANTIBOT/PROXY_ERROR detection, added dataset filtering |
| evals/tasks/agent/gaia.ts | 4/5 | Removed screenshot collection, simplified evaluation to use text-only, added model validation, increased timeout to 120s |
| evals/tasks/agent/webvoyager.ts | 4/5 | Removed ground truth checking logic, moved agent initialization into task, increased timeout and maxSteps, simplified evaluation flow |
| evals/tasks/agent/onlineMind2Web.ts | 4/5 | Added model validation, increased timeouts, added access denied detection that throws error for proper categorization |
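To illustrate the model-validation pattern noted for evals/initStagehand.ts and the agent tasks, here is a hedged sketch; `modelToAgentProviderMap` is named in the PR, but its contents and the helper around it are hypothetical:

```typescript
// Hypothetical sketch of gating agent tasks on supported models.
// The map's entries here are examples only.
const modelToAgentProviderMap: Record<string, string> = {
  "gpt-4o": "openai",
  "claude-3-7-sonnet-latest": "anthropic",
};

function resolveAgentProvider(modelName: string): string {
  const provider = modelToAgentProviderMap[modelName];
  if (!provider) {
    // Throwing lets the runner record the failure instead of silently skipping the task.
    throw new Error(`Model "${modelName}" is not supported for agent tasks`);
  }
  return provider;
}
```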
Sequence Diagram
```mermaid
sequenceDiagram
    participant Runner as Eval Runner
    participant Init as initStagehand
    participant Task as Agent Task
    participant Agent as Stagehand Agent
    participant Eval as Evaluator
    participant Error as Error Handler

    Runner->>Init: initStagehand({modelName, config})
    Init->>Init: Validate model in modelToAgentProviderMap
    Init->>Init: Create Stagehand instance with config
    Init-->>Runner: {stagehand, agent, logger, debugUrl}
    Runner->>Task: Execute task function
    Task->>Task: Validate params (web, ques, etc)
    Task->>Agent: page.goto(url, {timeout: 120s})
    Task->>Task: Validate model support
    Task->>Agent: agent.execute({instruction, maxSteps})
    Agent->>Agent: Execute browsing steps
    Agent-->>Task: {message, screenshots}
    Task->>Eval: evaluator.ask({question, answer, screenshot})
    Eval->>Eval: Take screenshot(s)
    Eval->>Eval: Call LLM with structured output
    Eval-->>Task: {evaluation: YES/NO, reasoning}
    alt Evaluation Success
        Task-->>Runner: {_success: true, execution_time, final_answer, logs}
    else Evaluation Failure
        Task-->>Runner: {_success: false, reasoning, logs}
    end
    alt Error Occurs
        Task->>Error: Throw error (no try-catch)
        Error->>Error: Parse error message (case-insensitive)
        Error->>Error: Categorize: ANTIBOT, PROXY_ERROR, TIMEOUT, etc
        Error-->>Runner: {_success: false, error_type, error_message, error_stack}
    end
```
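The success branch of the diagram implies a task return shape that carries the new metadata fields; below is a hypothetical sketch using the field names from the PR (`execution_time`, `final_answer`, `agent_steps`), with the surrounding task body invented for illustration:

```typescript
// Hypothetical agent task body for the success path shown above.
// There is deliberately no try-catch: errors propagate to the runner,
// which categorizes them (ANTIBOT, PROXY_ERROR, TIMEOUT, ...) and records the stack.
interface AgentExecuteResult {
  message: string;
  steps?: unknown[];
}

async function runAgentTask(
  agent: {
    execute: (opts: { instruction: string; maxSteps: number }) => Promise<AgentExecuteResult>;
  },
  instruction: string,
) {
  const start = Date.now();
  const result = await agent.execute({ instruction, maxSteps: 75 });

  return {
    _success: true,
    execution_time: Date.now() - start, // time spent in the agent loop (ms)
    final_answer: result.message, // the agent's final message, later checked by the evaluator
    agent_steps: result.steps?.length ?? 0,
  };
}
```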
30 files reviewed, 1 comment
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
why
Make it easier to parse/filter/group evals
what changed
Evals are now tagged with more granular metadata, and error parsing has been improved
test plan