feat: MCP tool calling reliability test framework #718
itomek-amd wants to merge 5 commits into main from
Conversation
Preload the LLM model and ML dependencies at application startup instead of on the first user prompt, eliminating a 30-120s delay. A new DispatchQueue tracks background task status and exposes it to the frontend via `/api/system/status` and `/api/system/tasks`.

Backend:
- Add DispatchQueue (`src/gaia/ui/dispatch.py`) with job lifecycle, dependency support, visibility filter, and auto-pruning
- Replace fire-and-forget `asyncio.create_task()` in lifespan with three dispatched startup jobs: check Lemonade, import modules, load AI model
- Promote `_model_load_lock` to `model_load_lock` for cross-module use
- Extend SystemStatus with `init_state` and `init_tasks` fields
- Add `GET /api/system/tasks` endpoint

Frontend:
- Adaptive status polling (3s during init, 15s otherwise)
- ConnectionBanner Case 0: blue init banner with task name + spinner
- Disable chat input and WelcomeScreen during initialization

Tests:
- 8 unit tests for DispatchQueue (lifecycle, dependencies, visibility, concurrency, pruning)
- 8 integration tests for startup init (status endpoint, tasks endpoint, wiring, defaults)

Closes #713

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
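The commit message above describes a DispatchQueue with job lifecycle, dependency support, and failure propagation. A minimal sketch of that idea follows; the names `JobStatus`, `Job`, `dispatch`, and `snapshot` are assumptions chosen to mirror the description, not the actual `src/gaia/ui/dispatch.py` implementation.

```python
import asyncio
import enum
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Optional


class JobStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    status: JobStatus = JobStatus.PENDING
    error: Optional[str] = None
    completed_at: Optional[float] = None


class DispatchQueue:
    """Track background startup jobs so a status endpoint can report them."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def dispatch(self, job_id: str, fn: Callable[[], Awaitable[None]],
                 depends_on: Optional[str] = None) -> Job:
        # Fail loudly on an unknown dependency instead of silently skipping.
        if depends_on is not None and depends_on not in self._jobs:
            raise KeyError(f"unknown dependency job: {depends_on}")
        job = Job(job_id)
        self._jobs[job_id] = job
        asyncio.create_task(self._run(job, fn, depends_on))
        return job

    async def _run(self, job: Job, fn: Callable[[], Awaitable[None]],
                   depends_on: Optional[str]) -> None:
        if depends_on is not None:
            dep = self._jobs[depends_on]
            while dep.status not in (JobStatus.DONE, JobStatus.FAILED):
                await asyncio.sleep(0.01)
            if dep.status is JobStatus.FAILED:
                job.status = JobStatus.FAILED
                job.error = f"dependency {depends_on} failed"
                job.completed_at = time.monotonic()
                return
        job.status = JobStatus.RUNNING
        try:
            await fn()
            job.status = JobStatus.DONE
        except Exception as exc:
            # Let the job fail visibly rather than report a silent DONE.
            job.status = JobStatus.FAILED
            job.error = str(exc)
        finally:
            # Only set completed_at if an early return has not already done so.
            if job.completed_at is None:
                job.completed_at = time.monotonic()

    def snapshot(self) -> Dict[str, str]:
        return {job_id: job.status.value for job_id, job in self._jobs.items()}
```

A status endpoint would then return `queue.snapshot()` so the frontend can render per-task progress.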
- Let `_load_model` exceptions propagate so the job fails properly (not silent DONE when Lemonade is unreachable)
- Import the `_DEFAULT_MODEL_NAME` constant instead of hardcoding it
- Use the InitTaskInfo model instead of `Dict[str, Any]` for `init_tasks`
- Use the `get_dispatch_queue` DI helper in the `/api/system/tasks` endpoint
- Remove unused `patch` import from the test file

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
- Fix finally block overwriting `completed_at` set by early returns
- Fail loudly when a dependency job ID is not found (instead of silently skipping)
- Use defensive `getattr` in the `get_dispatch_queue` DI helper
- Fix adaptive polling replacing the 3s interval with 15s before the first status response
- Add regression test for `init_tasks` error field absence

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
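The adaptive-polling fix above (don't switch from 3s to 15s before the first status response) is an interval-selection rule that can be isolated as a pure function. This is a language-agnostic sketch in Python; the actual frontend code and the `init_state` value `"ready"` are assumptions.

```python
INIT_INTERVAL_S = 3.0
STEADY_INTERVAL_S = 15.0


def next_poll_interval(response, current):
    """Pick the delay before the next status poll.

    Before the first response arrives (response is None) the fast
    init-time interval must be kept -- the bug fixed here was dropping
    to the slow interval before the backend had answered even once.
    """
    if response is None:
        return current if current is not None else INIT_INTERVAL_S
    if response.get("init_state") == "ready":
        return STEADY_INTERVAL_S
    return INIT_INTERVAL_S
```

Keeping the decision pure makes the regression easy to test without timers.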
Add MCP reliability eval scenarios and an `--iterations` CLI flag for measuring tool calling reliability across models and hardware configurations.

- Add 10 MCP reliability scenarios (simple/moderate/complex tiers) in `eval/scenarios/mcp_reliability/`
- Add `--iterations N` flag to `gaia eval agent` for repeated runs
- Add `_print_reliability_summary()` with colorized pass rate table and GO/NO_GO readiness signal
- Block `--fix` + `--iterations > 1` (incompatible modes)
- Write `reliability_report.json` with per-scenario pass rates
- Add 16 unit tests for scenario validation and iteration aggregation

Closes #709
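The per-scenario pass rates and GO/NO_GO readiness signal described above amount to a small aggregation step over repeated runs. A hedged sketch, assuming one `(scenario_id, passed)` record per iteration and an illustrative 0.9 pass-rate cutoff (the PR does not state the actual readiness rule):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_reliability(results: List[Tuple[str, bool]],
                          threshold: float = 0.9) -> Tuple[Dict[str, float], str]:
    """Compute per-scenario pass rates across iterations plus a GO/NO_GO signal.

    `threshold` is an assumed cutoff for illustration only.
    """
    runs: Dict[str, List[bool]] = defaultdict(list)
    for scenario_id, passed in results:
        runs[scenario_id].append(passed)
    pass_rates = {sid: sum(r) / len(r) for sid, r in runs.items()}
    signal = "GO" if all(rate >= threshold for rate in pass_rates.values()) else "NO_GO"
    return pass_rates, signal
```

The resulting dict maps directly onto the `reliability_report.json` idea: one pass rate per scenario, plus a single readiness verdict.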
- Add `gaia eval agent` command examples to `docs/reference/cli.mdx`, including the `--iterations`, `--category`, `--fix`, and `--compare` flags
- Add AMD copyright header to `tests/unit/eval/__init__.py`
@claude review this PR
Claude finished @itomek's task. PR Review: MCP Tool Calling Reliability Test Framework
Summary

This is a well-structured PR that adds a comprehensive MCP reliability evaluation framework along with a boot-time initialization system for the Agent UI. The code is clean, well-documented, and includes thorough test coverage.

Review Feedback

🟢 Strengths

1. Clean Architecture - The
2. Comprehensive Test Coverage - 16 new tests covering:
3. Well-Designed Eval Scenarios - The 10 MCP reliability scenarios cover a good range:
4. Good UX - Frontend changes properly:
🟡 Suggestions (Non-blocking)

1. Consider using

```python
while dep.status not in (JobStatus.DONE, JobStatus.FAILED):
    if time.monotonic() >= deadline:
        ...
    await asyncio.sleep(_DEPENDENCY_POLL_INTERVAL)
```

An

2. CLI table formatting edge case
```python
print(f" {sid:<40} {rate_str:>12} {result_str}")
```

Consider truncating long IDs or using dynamic width.

3. Test isolation with subprocess
🔴 Issues (Should Address)

1. Unused import in test file

```python
import io
```

This import is not used in the test file.

2.
📋 Test Plan Verification

Based on code review, the test plan items should work:
Overall Assessment

Approve ✅ - This is high-quality code with good design patterns, comprehensive tests, and proper documentation. The minor issues noted above are non-blocking. The PR successfully adds:
Nice work! 🎉
Summary
- 10 MCP reliability scenarios in `eval/scenarios/mcp_reliability/`
- `--iterations N` flag to `gaia eval agent` for running scenarios multiple times to measure consistency
- `_print_reliability_summary()` with colorized pass rate table and GO/NO_GO readiness signal
- Documentation for the `gaia eval agent` command in `docs/reference/cli.mdx`

Closes #709
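The pass rate table that `_print_reliability_summary()` renders can size its scenario column dynamically, which also addresses the review note about fixed-width `{sid:<40}` formatting misaligning long IDs. A sketch, with the 0.9 PASS/FAIL cutoff and the colorless layout both being illustrative assumptions:

```python
from typing import List, Tuple


def format_reliability_table(rows: List[Tuple[str, float]],
                             threshold: float = 0.9) -> str:
    """Render a pass-rate table with a column sized to the longest scenario ID.

    `threshold` is an assumed cutoff; the real summary also colorizes output.
    """
    # Dynamic width keeps columns aligned even for long scenario IDs.
    width = max((len(sid) for sid, _ in rows), default=12)
    lines = []
    for sid, rate in rows:
        verdict = "PASS" if rate >= threshold else "FAIL"
        lines.append(f"  {sid:<{width}}  {rate:>7.1%}  {verdict}")
    return "\n".join(lines)
```

Colorization (e.g. via ANSI escapes) layers on top of this without changing the alignment logic.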
How It Works
The `--iterations` loop runs at the CLI level (not inside `AgentEvalRunner`), calling `runner.run()` N times as a black box. Each iteration gets its own run_id and run_dir — no trace collisions, no resume conflicts.

Test plan
- `python -m pytest tests/unit/eval/ -xvs` — 16 new tests pass
- `python -m pytest tests/test_eval.py -x` — 99 existing eval tests unbroken
- `gaia eval agent --help` shows the `--iterations` flag
- `gaia eval agent --iterations 0` prints an error and exits
- `gaia eval agent --fix --iterations 3` prints an incompatibility error
- `gaia eval agent --category mcp_reliability` runs the MCP scenarios (requires Agent UI backend + claude CLI)

Generated with Claude Code
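The CLI-level loop described under "How It Works" can be sketched as follows. `make_runner` stands in for however `AgentEvalRunner` is actually constructed (its real signature is not shown in this PR), and the run_id naming scheme is invented for illustration.

```python
import uuid
from pathlib import Path
from typing import Callable, List


def run_iterations(make_runner: Callable[..., object], n: int,
                   base_dir: str) -> List[object]:
    """Call runner.run() N times as a black box, one run_id/run_dir each.

    `make_runner` is a hypothetical factory standing in for the real
    AgentEvalRunner constructor.
    """
    if n < 1:
        raise ValueError("--iterations must be a positive integer")
    results = []
    for i in range(n):
        # Fresh run_id and run_dir per iteration: no trace collisions,
        # no resume conflicts between repeated runs.
        run_id = f"iter{i + 1}-{uuid.uuid4().hex[:8]}"
        run_dir = Path(base_dir) / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        runner = make_runner(run_id=run_id, run_dir=run_dir)
        results.append(runner.run())
    return results
```

Because each iteration gets a distinct directory, per-iteration traces can later be aggregated into the reliability report without post-hoc disambiguation.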