@chughtapan
Owner

  • Fix: Save model_hashes.json to enable proper change detection

    • AppWorld's evaluator relies on model hash counters to detect DB changes
    • Without saving these hashes, evaluation fails even when the agent completes the task
    • Added save_model_hashes=True in mcp_server.py:234
  • Improve: Remove turn limit from agent prompt

    • Removed max_steps parameter and references from system instruction
    • Eliminates artificial pressure that made the agent overly conservative
    • Improves task completion rate by reducing unnecessary batching
  • Increase test timeout from 5 to 15 minutes

    • Accounts for bearer token expiration issues with some tasks
    • Prevents premature timeout on complex tasks

Verified with successful test runs on tasks 692c77d_2 and 22cc237_3.
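
For readers unfamiliar with hash-based change detection, the idea behind the first fix can be sketched as follows. This is an illustrative sketch only, not AppWorld's evaluator or the code in mcp_server.py: the helper names (`hash_rows`, `snapshot_hashes`, `save_hashes`, `changed_tables`) and the in-memory `db` dict are hypothetical, and it uses content digests where the real evaluator uses model hash counters. It shows why a missing `model_hashes.json` leaves the evaluator with nothing to compare against.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def hash_rows(rows):
    """Return a stable SHA-256 digest for a list of JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def snapshot_hashes(db):
    """Map each table name to a digest of its current rows."""
    return {table: hash_rows(rows) for table, rows in db.items()}


def save_hashes(db, path):
    """Persist the current digests (hypothetical stand-in for model_hashes.json)."""
    Path(path).write_text(json.dumps(snapshot_hashes(db), indent=2))


def changed_tables(db, path):
    """Return tables whose digest differs from the saved baseline.

    Returns None when no baseline file exists, i.e. changes cannot be detected.
    """
    path = Path(path)
    if not path.exists():
        return None
    baseline = json.loads(path.read_text())
    current = snapshot_hashes(db)
    return sorted(t for t in current if baseline.get(t) != current[t])


if __name__ == "__main__":
    hashes_path = os.path.join(tempfile.mkdtemp(), "model_hashes.json")
    db = {"notes": [{"id": 1, "text": "a"}], "users": [{"id": 1}]}

    print(changed_tables(db, hashes_path))  # no baseline saved yet
    save_hashes(db, hashes_path)
    db["notes"].append({"id": 2, "text": "b"})  # simulated agent action
    print(changed_tables(db, hashes_path))
```

In this sketch, skipping `save_hashes` makes `changed_tables` return `None` regardless of what the agent did, which mirrors the failure mode described above.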

Tapan Chugh added 5 commits October 27, 2025 15:47
@chughtapan chughtapan merged commit 9c99603 into main Oct 27, 2025
3 checks passed
@chughtapan chughtapan deleted the analyze_appworld_logs branch October 31, 2025 21:35