update evals #1139
Conversation
Greptile Overview
Summary
This PR enhances the eval infrastructure with more granular metadata tagging and improved error parsing capabilities.
Key Changes:
- Added `PROXY_ERROR` error type and improved error detection logic with case-insensitive matching for ANTIBOT, proxy, timeout, network, and parsing errors (see the sketch after this list)
- Standardized error handling across eval tasks by removing try-catch blocks and letting errors propagate for centralized categorization
- Added `execution_time`, `final_answer`, and `agent_steps` tracking to eval outputs for better metrics
- Refactored agent initialization to use `modelToAgentProviderMap` with validation to ensure only supported models run agent tasks
- Increased timeouts (30s → 60s DOM settle, 75s → 120s page navigation) and max steps (50 → 75-80) for agent tasks
- Removed WebBench and OSWorld dataset support, simplified dataset filtering logic
- Changed evaluator API key from `GOOGLE_GENERATIVE_AI_API_KEY` to `GEMINI_API_KEY`
- Simplified GAIA evaluation to text-only (removed screenshot collection)
- Removed WebVoyager ground truth checking in favor of standard evaluation flow
- Added access denied detection in onlineMind2Web that throws for proper error categorization
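As a rough illustration of the centralized, case-insensitive error categorization described in the first bullet, here is a minimal sketch; the `categorizeError` helper and the exact matched substrings are assumptions for illustration, not the PR's actual code:

```typescript
// Hypothetical sketch of runner-level error categorization.
// The ErrorType values mirror those named in this review; the matched substrings are assumptions.
type ErrorType =
  | "ANTIBOT"
  | "PROXY_ERROR"
  | "TIMEOUT"
  | "NETWORK_ERROR"
  | "PARSING_ERROR"
  | "UNKNOWN";

function categorizeError(error: unknown): ErrorType {
  const message = (error instanceof Error ? error.message : String(error)).toLowerCase();

  if (message.includes("antibot") || message.includes("access denied")) return "ANTIBOT";
  if (message.includes("proxy")) return "PROXY_ERROR";
  if (message.includes("timeout") || message.includes("timed out")) return "TIMEOUT";
  if (message.includes("network")) return "NETWORK_ERROR";
  if (message.includes("pars")) return "PARSING_ERROR"; // matches "parse" / "parsing"
  return "UNKNOWN";
}
```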
Confidence Score: 4/5
- This PR is safe to merge with minor attention to configuration changes
- The changes are well-structured improvements to the eval infrastructure: the refactoring standardizes error handling, adds better metadata tracking, and removes unused datasets. The score is 4 rather than 5 because (1) the environment variable rename to GEMINI_API_KEY may require documentation or a migration step (see the sketch after this list), (2) the increased timeouts could mask underlying issues, and (3) removing try-catch blocks from tasks means errors must be properly caught at the runner level
- Pay attention to evals/initStagehand.ts for the config restructuring and evals/index.eval.ts for the error handling logic changes
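Since the review flags the GEMINI_API_KEY rename as a possible migration concern, one hypothetical way to soften it would be a fallback to the old variable name; this is a sketch of an option, not something the PR does:

```typescript
// Hypothetical migration shim for the evaluator API key rename.
// Reading the old variable as a fallback keeps existing CI configurations working.
const geminiApiKey =
  process.env.GEMINI_API_KEY ?? process.env.GOOGLE_GENERATIVE_AI_API_KEY;

if (!geminiApiKey) {
  throw new Error(
    "GEMINI_API_KEY is not set (formerly GOOGLE_GENERATIVE_AI_API_KEY)",
  );
}
```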
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| types/evals.ts | 5/5 | Added PROXY_ERROR to ErrorType enum for better error categorization |
| evals/initStagehand.ts | 4/5 | Refactored config structure, increased DOM timeout to 60s, added error handling wrapper, updated agent config logic to use modelToAgentProviderMap |
| evals/index.eval.ts | 4/5 | Removed WebBench and OSWorld dataset support, improved error parsing with case-insensitive checks and new ANTIBOT/PROXY_ERROR detection, added dataset filtering |
| evals/tasks/agent/gaia.ts | 4/5 | Removed screenshot collection, simplified evaluation to use text-only, added model validation, increased timeout to 120s |
| evals/tasks/agent/webvoyager.ts | 4/5 | Removed ground truth checking logic, moved agent initialization into task, increased timeout and maxSteps, simplified evaluation flow |
| evals/tasks/agent/onlineMind2Web.ts | 4/5 | Added model validation, increased timeouts, added access denied detection that throws error for proper categorization |
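To illustrate the model-validation pattern noted for evals/initStagehand.ts and the agent tasks, here is a hedged sketch; `modelToAgentProviderMap` is named in the PR, but its contents and the helper around it are hypothetical:

```typescript
// Hypothetical sketch of gating agent tasks on supported models.
// The map's entries here are examples only.
const modelToAgentProviderMap: Record<string, string> = {
  "gpt-4o": "openai",
  "claude-3-7-sonnet-latest": "anthropic",
};

function resolveAgentProvider(modelName: string): string {
  const provider = modelToAgentProviderMap[modelName];
  if (!provider) {
    // Throwing lets the runner record the failure instead of silently skipping the task.
    throw new Error(`Model "${modelName}" is not supported for agent tasks`);
  }
  return provider;
}
```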
Sequence Diagram
```mermaid
sequenceDiagram
    participant Runner as Eval Runner
    participant Init as initStagehand
    participant Task as Agent Task
    participant Agent as Stagehand Agent
    participant Eval as Evaluator
    participant Error as Error Handler

    Runner->>Init: initStagehand({modelName, config})
    Init->>Init: Validate model in modelToAgentProviderMap
    Init->>Init: Create Stagehand instance with config
    Init-->>Runner: {stagehand, agent, logger, debugUrl}
    Runner->>Task: Execute task function
    Task->>Task: Validate params (web, ques, etc)
    Task->>Agent: page.goto(url, {timeout: 120s})
    Task->>Task: Validate model support
    Task->>Agent: agent.execute({instruction, maxSteps})
    Agent->>Agent: Execute browsing steps
    Agent-->>Task: {message, screenshots}
    Task->>Eval: evaluator.ask({question, answer, screenshot})
    Eval->>Eval: Take screenshot(s)
    Eval->>Eval: Call LLM with structured output
    Eval-->>Task: {evaluation: YES/NO, reasoning}
    alt Evaluation Success
        Task-->>Runner: {_success: true, execution_time, final_answer, logs}
    else Evaluation Failure
        Task-->>Runner: {_success: false, reasoning, logs}
    end
    alt Error Occurs
        Task->>Error: Throw error (no try-catch)
        Error->>Error: Parse error message (case-insensitive)
        Error->>Error: Categorize: ANTIBOT, PROXY_ERROR, TIMEOUT, etc
        Error-->>Runner: {_success: false, error_type, error_message, error_stack}
    end
```
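The success branch of the diagram implies a task return shape that carries the new metadata fields; below is a hypothetical sketch using the field names from the PR (`execution_time`, `final_answer`, `agent_steps`), with the surrounding task body invented for illustration:

```typescript
// Hypothetical agent task body for the success path shown above.
// There is deliberately no try-catch: errors propagate to the runner,
// which categorizes them (ANTIBOT, PROXY_ERROR, TIMEOUT, ...) and records the stack.
interface AgentExecuteResult {
  message: string;
  steps?: unknown[];
}

async function runAgentTask(
  agent: {
    execute: (opts: { instruction: string; maxSteps: number }) => Promise<AgentExecuteResult>;
  },
  instruction: string,
) {
  const start = Date.now();
  const result = await agent.execute({ instruction, maxSteps: 75 });

  return {
    _success: true,
    execution_time: Date.now() - start, // time spent in the agent loop (ms)
    final_answer: result.message, // the agent's final message, later checked by the evaluator
    agent_steps: result.steps?.length ?? 0,
  };
}
```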
30 files reviewed, 1 comment
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
why
Make it easier to parse/filter/group evals
what changed
Evals are now tagged with more granular metadata, and error parsing has been improved
test plan