
Conversation


@miguelg719 miguelg719 commented Oct 16, 2025

why

Make it easier to parse/filter/group evals

what changed

Evals are now tagged with more granular metadata, and error parsing has been improved.

test plan


changeset-bot bot commented Oct 16, 2025

⚠️ No Changeset found

Latest commit: 747be8e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@miguelg719 miguelg719 marked this pull request as ready for review October 16, 2025 19:49

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Summary

This PR enhances the eval infrastructure with more granular metadata tagging and improved error parsing capabilities.

Key Changes:

  • Added PROXY_ERROR error type and improved error detection logic with case-insensitive matching for ANTIBOT, proxy, timeout, network, and parsing errors
  • Standardized error handling across eval tasks by removing try-catch blocks and letting errors propagate for centralized categorization
  • Added execution_time, final_answer, and agent_steps tracking to eval outputs for better metrics
  • Refactored agent initialization to use modelToAgentProviderMap with validation to ensure only supported models run agent tasks
  • Increased timeouts (30s → 60s DOM settle, 75s → 120s page navigation) and max steps (50 → 75-80) for agent tasks
  • Removed WebBench and OSWorld dataset support, simplified dataset filtering logic
  • Changed evaluator API key from GOOGLE_GENERATIVE_AI_API_KEY to GEMINI_API_KEY
  • Simplified GAIA evaluation to text-only (removed screenshot collection)
  • Removed WebVoyager ground truth checking in favor of standard evaluation flow
  • Added access denied detection in onlineMind2Web that throws for proper error categorization

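The centralized, case-insensitive error categorization described above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the helper name `categorizeError`, the exact substrings matched, and the `UNKNOWN` fallback are all assumptions; only the `ErrorType` values mirror the review's description.

```typescript
// Hypothetical sketch of runner-level error categorization.
// Enum values follow the PR description; matching logic is assumed.
enum ErrorType {
  ANTIBOT = "ANTIBOT",
  PROXY_ERROR = "PROXY_ERROR",
  TIMEOUT = "TIMEOUT",
  NETWORK = "NETWORK",
  PARSING = "PARSING",
  UNKNOWN = "UNKNOWN",
}

function categorizeError(err: unknown): ErrorType {
  // Normalize to a lowercase message so matching is case-insensitive.
  const msg = (err instanceof Error ? err.message : String(err)).toLowerCase();
  if (msg.includes("antibot") || msg.includes("access denied")) return ErrorType.ANTIBOT;
  if (msg.includes("proxy")) return ErrorType.PROXY_ERROR;
  if (msg.includes("timeout")) return ErrorType.TIMEOUT;
  if (msg.includes("network")) return ErrorType.NETWORK;
  if (msg.includes("parse") || msg.includes("parsing")) return ErrorType.PARSING;
  return ErrorType.UNKNOWN;
}

console.log(categorizeError(new Error("Proxy connection refused"))); // PROXY_ERROR
```

Because tasks no longer wrap work in try-catch, any thrown error (e.g. the access-denied throw in onlineMind2Web) reaches the runner and is bucketed by a single function like this, keeping categorization consistent across all eval tasks.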
Confidence Score: 4/5

  • This PR is safe to merge with minor attention to configuration changes
  • The changes are well-structured improvements to the eval infrastructure. The refactoring standardizes error handling, adds better metadata tracking, and removes unused datasets. However, the score is 4 rather than 5 due to: (1) the environment variable rename for GEMINI_API_KEY may require documentation/migration, (2) increased timeouts could mask underlying issues, and (3) removal of error try-catch blocks in tasks means errors must be properly caught at the runner level
  • Pay attention to evals/initStagehand.ts for the config restructuring and evals/index.eval.ts for the error handling logic changes

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| types/evals.ts | 5/5 | Added PROXY_ERROR to ErrorType enum for better error categorization |
| evals/initStagehand.ts | 4/5 | Refactored config structure, increased DOM timeout to 60s, added error handling wrapper, updated agent config logic to use modelToAgentProviderMap |
| evals/index.eval.ts | 4/5 | Removed WebBench and OSWorld dataset support, improved error parsing with case-insensitive checks and new ANTIBOT/PROXY_ERROR detection, added dataset filtering |
| evals/tasks/agent/gaia.ts | 4/5 | Removed screenshot collection, simplified evaluation to use text-only, added model validation, increased timeout to 120s |
| evals/tasks/agent/webvoyager.ts | 4/5 | Removed ground truth checking logic, moved agent initialization into task, increased timeout and maxSteps, simplified evaluation flow |
| evals/tasks/agent/onlineMind2Web.ts | 4/5 | Added model validation, increased timeouts, added access denied detection that throws error for proper categorization |
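The model validation that several of these files add could look something like the sketch below. This is an assumption-laden illustration: the map's entries and the helper `resolveAgentProvider` are invented for the example; only the name `modelToAgentProviderMap` and the idea of rejecting unsupported models before an agent task runs come from the review.

```typescript
// Hypothetical provider map; real entries live in the Stagehand repo.
const modelToAgentProviderMap: Record<string, string> = {
  "gpt-4o": "openai",
  "claude-sonnet-4": "anthropic",
  "gemini-2.0-flash": "google",
};

// Fail fast with a clear message instead of letting an unsupported
// model produce a confusing downstream error mid-task.
function resolveAgentProvider(modelName: string): string {
  const provider = modelToAgentProviderMap[modelName];
  if (!provider) {
    throw new Error(
      `Model "${modelName}" is not supported for agent tasks. ` +
        `Supported models: ${Object.keys(modelToAgentProviderMap).join(", ")}`,
    );
  }
  return provider;
}
```

Validating up front means an unsupported model surfaces as a single categorizable error at task start, rather than a partial run with ambiguous failures.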

Sequence Diagram

```mermaid
sequenceDiagram
    participant Runner as Eval Runner
    participant Init as initStagehand
    participant Task as Agent Task
    participant Agent as Stagehand Agent
    participant Eval as Evaluator
    participant Error as Error Handler

    Runner->>Init: initStagehand({modelName, config})
    Init->>Init: Validate model in modelToAgentProviderMap
    Init->>Init: Create Stagehand instance with config
    Init-->>Runner: {stagehand, agent, logger, debugUrl}

    Runner->>Task: Execute task function
    Task->>Task: Validate params (web, ques, etc)
    Task->>Agent: page.goto(url, {timeout: 120s})
    Task->>Task: Validate model support
    Task->>Agent: agent.execute({instruction, maxSteps})
    Agent->>Agent: Execute browsing steps
    Agent-->>Task: {message, screenshots}

    Task->>Eval: evaluator.ask({question, answer, screenshot})
    Eval->>Eval: Take screenshot(s)
    Eval->>Eval: Call LLM with structured output
    Eval-->>Task: {evaluation: YES/NO, reasoning}

    alt Evaluation Success
        Task-->>Runner: {_success: true, execution_time, final_answer, logs}
    else Evaluation Failure
        Task-->>Runner: {_success: false, reasoning, logs}
    end

    alt Error Occurs
        Task->>Error: Throw error (no try-catch)
        Error->>Error: Parse error message (case-insensitive)
        Error->>Error: Categorize: ANTIBOT, PROXY_ERROR, TIMEOUT, etc
        Error-->>Runner: {_success: false, error_type, error_message, error_stack}
    end
```

30 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

miguelg719 and others added 2 commits October 16, 2025 12:56
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@miguelg719 miguelg719 merged commit 9afc0a8 into main Oct 17, 2025
15 checks passed