
feat: MCP tool calling reliability test framework#718

Draft
itomek-amd wants to merge 5 commits into main from feat/issue-709-mcp-reliability

Conversation

@itomek-amd
Collaborator

Summary

  • Add 10 MCP reliability eval scenarios across 3 complexity tiers (simple/moderate/complex) in eval/scenarios/mcp_reliability/
  • Add --iterations N flag to gaia eval agent for running scenarios multiple times to measure consistency
  • Add _print_reliability_summary() with colorized pass rate table and GO/NO_GO readiness signal
  • Add CLI docs for gaia eval agent command in docs/reference/cli.mdx

Closes #709
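To make the GO/NO_GO signal concrete, here is a minimal sketch of the kind of pass-rate aggregation a `_print_reliability_summary()`-style helper could perform. The function name `summarize`, the result shape, and the 90% threshold are illustrative assumptions, not the actual implementation.

```python
def summarize(results, go_threshold=0.9):
    """Aggregate per-scenario pass/fail outcomes into a readiness signal.

    results: {scenario_id: [bool per iteration]}
    Returns (per-scenario pass rates, overall rate, GO/NO_GO verdict).
    """
    rates = {sid: sum(outcomes) / len(outcomes) for sid, outcomes in results.items()}
    overall = sum(rates.values()) / len(rates)
    # GO only if every scenario clears the threshold (hypothetical policy)
    verdict = "GO" if all(r >= go_threshold for r in rates.values()) else "NO_GO"
    return rates, overall, verdict
```

With 5 iterations per scenario, one flaky failure in a scenario drops its rate to 0.8 and flips the verdict to NO_GO under this (assumed) all-scenarios-must-pass policy.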

How It Works

# Run MCP reliability scenarios once
gaia eval agent --category mcp_reliability

# Run 5 iterations for reliability measurement
gaia eval agent --category mcp_reliability --iterations 5

# Cross-model comparison: run per model, compare scorecards
gaia eval agent --category mcp_reliability --iterations 5  # with Qwen loaded
gaia eval agent --category mcp_reliability --iterations 5  # with Liquid loaded
gaia eval agent --compare eval/results/run1/scorecard.json eval/results/run2/scorecard.json

The --iterations loop runs at the CLI level (not inside AgentEvalRunner), calling runner.run() N times as a black box. Each iteration gets its own run_id and run_dir — no trace collisions, no resume conflicts.
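A sketch of that black-box loop, with `make_runner` standing in for however the CLI constructs an `AgentEvalRunner`; the `run_id` format and directory layout here are hypothetical:

```python
import uuid
from pathlib import Path

def run_iterations(make_runner, iterations, base_dir):
    """Call the runner N times as a black box, one fresh run per iteration."""
    scorecards = []
    for i in range(1, iterations + 1):
        run_id = f"iter{i}-{uuid.uuid4().hex[:8]}"  # unique per iteration
        run_dir = Path(base_dir) / run_id           # own dir: no trace collisions
        runner = make_runner(run_id=run_id, run_dir=run_dir)
        scorecards.append(runner.run())             # runner internals untouched
    return scorecards
```

Because each iteration gets its own `run_id` and `run_dir`, resume logic inside the runner never sees a half-finished sibling run.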

Test plan

  • python -m pytest tests/unit/eval/ -xvs — 16 new tests pass
  • python -m pytest tests/test_eval.py -x — 99 existing eval tests unbroken
  • gaia eval agent --help shows --iterations flag
  • gaia eval agent --iterations 0 prints error and exits
  • gaia eval agent --fix --iterations 3 prints incompatibility error
  • gaia eval agent --category mcp_reliability runs MCP scenarios (requires Agent UI backend + claude CLI)

Generated with Claude Code

itomek and others added 5 commits April 2, 2026 11:54
Preload LLM model and ML dependencies at application startup instead
of on the first user prompt, eliminating a 30-120s delay. A new
DispatchQueue tracks background task status and exposes it to the
frontend via /api/system/status and /api/system/tasks.

Backend:
- Add DispatchQueue (src/gaia/ui/dispatch.py) with job lifecycle,
  dependency support, visibility filter, and auto-pruning
- Replace fire-and-forget asyncio.create_task() in lifespan with
  three dispatched startup jobs: check Lemonade, import modules,
  load AI model
- Promote _model_load_lock to model_load_lock for cross-module use
- Extend SystemStatus with init_state and init_tasks fields
- Add GET /api/system/tasks endpoint

Frontend:
- Adaptive status polling (3s during init, 15s otherwise)
- ConnectionBanner Case 0: blue init banner with task name + spinner
- Disable chat input and WelcomeScreen during initialization

Tests:
- 8 unit tests for DispatchQueue (lifecycle, dependencies, visibility,
  concurrency, pruning)
- 8 integration tests for startup init (status endpoint, tasks
  endpoint, wiring, defaults)

Closes #713

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
- Let _load_model exceptions propagate so job fails properly (not
  silent DONE when Lemonade unreachable)
- Import _DEFAULT_MODEL_NAME constant instead of hardcoding
- Use InitTaskInfo model instead of Dict[str, Any] for init_tasks
- Use get_dispatch_queue DI helper in /api/system/tasks endpoint
- Remove unused patch import from test file

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
- Fix finally block overwriting completed_at set by early returns
- Fail loudly when dependency job ID is not found (instead of silent skip)
- Use defensive getattr in get_dispatch_queue DI helper
- Fix adaptive polling replacing 3s with 15s before first status response
- Add regression test for init_tasks error field absence

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
Add MCP reliability eval scenarios and --iterations CLI flag for measuring
tool calling reliability across models and hardware configurations.

- Add 10 MCP reliability scenarios (simple/moderate/complex tiers)
  in eval/scenarios/mcp_reliability/
- Add --iterations N flag to gaia eval agent for repeated runs
- Add _print_reliability_summary() with colorized pass rate table
  and GO/NO_GO readiness signal
- Block --fix + --iterations > 1 (incompatible modes)
- Write reliability_report.json with per-scenario pass rates
- Add 16 unit tests for scenario validation and iteration aggregation

Closes #709
- Add gaia eval agent command examples to docs/reference/cli.mdx
  including --iterations, --category, --fix, --compare flags
- Add AMD copyright header to tests/unit/eval/__init__.py
@github-actions github-actions bot added the documentation (Documentation changes), cli (CLI changes), tests (Test changes), and electron (Electron app changes) labels on Apr 2, 2026
@itomek itomek self-assigned this Apr 3, 2026
@itomek
Collaborator

itomek commented Apr 9, 2026

@claude review this PR

@github-actions

github-actions bot commented Apr 9, 2026

Claude finished @itomek's task


PR Review: MCP Tool Calling Reliability Test Framework

  • Read PR diff and changed files
  • Review CLI changes (src/gaia/cli.py)
  • Review dispatch module (src/gaia/ui/dispatch.py)
  • Review test scenarios (eval/scenarios/mcp_reliability/)
  • Review unit tests
  • Review frontend changes
  • Provide comprehensive feedback

Summary

This is a well-structured PR that adds a comprehensive MCP reliability evaluation framework along with a boot-time initialization system for the Agent UI. The code is clean, well-documented, and includes thorough test coverage.


Review Feedback

🟢 Strengths

1. Clean Architecture - The DispatchQueue in src/gaia/ui/dispatch.py is well-designed with:

  • Clear job lifecycle states (PENDING → RUNNING → DONE/FAILED)
  • Dependency handling with timeout protection (60s)
  • Automatic pruning of old jobs
  • Thread-safe design using asyncio event loop

2. Comprehensive Test Coverage - 16 new tests covering:

  • Job lifecycle transitions
  • Dependency handling edge cases
  • Concurrency verification
  • CLI validation (--iterations 0, --fix --iterations N)

3. Well-Designed Eval Scenarios - The 10 MCP reliability scenarios cover a good range:

  • Simple: basic tool calls, no-tool-needed restraint, introspection
  • Moderate: structured params, error handling, search context
  • Complex: multi-step chains, conditional selection, multi-document

4. Good UX - Frontend changes properly:

  • Block input during initialization (ChatView.tsx:503-506)
  • Show initialization progress with animated banner (ConnectionBanner.tsx:66-82)
  • Adaptive polling (3s during init, 15s after)

🟡 Suggestions (Non-blocking)

1. Consider using asyncio.Event instead of polling for dependencies

src/gaia/ui/dispatch.py:126-138 - The current implementation polls every 0.2s:

while dep.status not in (JobStatus.DONE, JobStatus.FAILED):
    if time.monotonic() >= deadline:
        ...
    await asyncio.sleep(_DEPENDENCY_POLL_INTERVAL)

An asyncio.Event per job would be more efficient for CPU and latency, though the current approach is perfectly acceptable for boot-time use.
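A sketch of that event-based alternative, with `Job` as a simplified stand-in for the real dispatch module's job class (statuses shown as strings for brevity):

```python
import asyncio

class Job:
    def __init__(self):
        self.status = "PENDING"
        self.done = asyncio.Event()  # set exactly once, on completion

    def finish(self, status):
        self.status = status  # "DONE" or "FAILED"
        self.done.set()       # wakes all waiting dependents immediately

async def wait_for_dep(dep, timeout=60.0):
    """Wait for a dependency without polling; raises TimeoutError on deadline."""
    await asyncio.wait_for(dep.done.wait(), timeout)
    return dep.status
```

Waiters block on the event instead of waking every 0.2s, and they observe completion with no polling-interval latency.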

2. CLI table formatting edge case

src/gaia/cli.py:904 - If a scenario ID exceeds 40 characters, the table formatting may break:

print(f"  {sid:<40} {rate_str:>12} {result_str}")

Consider truncating long IDs or using dynamic width.
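One possible shape for that fix, reusing the `sid`/`rate_str`/`result_str` names from the snippet above (the helper itself is hypothetical):

```python
def format_row(sid, rate_str, result_str, width=40):
    """Render one table row, truncating over-long scenario IDs with an ellipsis."""
    if len(sid) > width:
        sid = sid[: width - 1] + "…"  # keep the column width fixed
    return f"  {sid:<{width}} {rate_str:>12} {result_str}"
```

Alternatively, computing `width = max(len(s) for s in scenario_ids)` once per table keeps full IDs at the cost of a variable-width column.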

3. Test isolation with subprocess

tests/unit/eval/test_iterations.py:112-146 - The subprocess tests call the full CLI. While this tests the real behavior, consider mocking AgentEvalRunner for faster unit tests and reserving subprocess tests for integration.


🔴 Issues (Should Address)

1. Unused import in test file

tests/unit/eval/test_iterations.py:6:

import io

This import is not used in the test file.

2. model_load_lock visibility change should be documented

src/gaia/ui/_chat_helpers.py:62-63 - The lock was renamed from _model_load_lock to model_load_lock (public) for cross-module access. The comment explains why, but this is a public API change that could affect external code importing this module. Consider adding to the module docstring.


📋 Test Plan Verification

Based on code review, the test plan items should work:

  • python -m pytest tests/unit/eval/ -xvs — Tests are well-structured
  • gaia eval agent --help — the --iterations flag is registered at lines 2106-2112
  • gaia eval agent --iterations 0 — Validation at line 3972-3974
  • gaia eval agent --fix --iterations 3 — Incompatibility check at line 3976-3982

Overall Assessment

Approve ✅ - This is high-quality code with good design patterns, comprehensive tests, and proper documentation. The minor issues noted above are non-blocking. The PR successfully adds:

  1. A robust MCP reliability evaluation framework with 10 scenarios across 3 complexity tiers
  2. A --iterations flag for measuring consistency across multiple runs
  3. A boot-time initialization system with frontend progress tracking
  4. GO/NO_GO readiness signals for reliability assessment

Nice work! 🎉




Development

Successfully merging this pull request may close these issues.

MCP tool calling reliability test framework
