
feat: MCP tool calling reliability test framework#718

Draft
itomek-amd wants to merge 5 commits into main from feat/issue-709-mcp-reliability

Conversation

@itomek-amd
Collaborator

Summary

  • Add 10 MCP reliability eval scenarios across 3 complexity tiers (simple/moderate/complex) in eval/scenarios/mcp_reliability/
  • Add --iterations N flag to gaia eval agent for running scenarios multiple times to measure consistency
  • Add _print_reliability_summary() with colorized pass rate table and GO/NO_GO readiness signal
  • Add CLI docs for gaia eval agent command in docs/reference/cli.mdx

Closes #709
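To make the GO/NO_GO signal concrete, here is a minimal sketch of the kind of pass-rate aggregation a `_print_reliability_summary()`-style helper could perform. The function name `summarize`, the result shape, and the 90% threshold are illustrative assumptions, not the actual implementation.

```python
def summarize(results, go_threshold=0.9):
    """Aggregate per-scenario pass/fail outcomes into a readiness signal.

    results: {scenario_id: [bool per iteration]}
    Returns (per-scenario pass rates, overall rate, GO/NO_GO verdict).
    """
    rates = {sid: sum(outcomes) / len(outcomes) for sid, outcomes in results.items()}
    overall = sum(rates.values()) / len(rates)
    # GO only if every scenario clears the threshold (hypothetical policy)
    verdict = "GO" if all(r >= go_threshold for r in rates.values()) else "NO_GO"
    return rates, overall, verdict
```

With 5 iterations per scenario, one flaky failure in a scenario drops its rate to 0.8 and flips the verdict to NO_GO under this (assumed) all-scenarios-must-pass policy.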

How It Works

# Run MCP reliability scenarios once
gaia eval agent --category mcp_reliability

# Run 5 iterations for reliability measurement
gaia eval agent --category mcp_reliability --iterations 5

# Cross-model comparison: run per model, compare scorecards
gaia eval agent --category mcp_reliability --iterations 5  # with Qwen loaded
gaia eval agent --category mcp_reliability --iterations 5  # with Liquid loaded
gaia eval agent --compare eval/results/run1/scorecard.json eval/results/run2/scorecard.json

The --iterations loop runs at the CLI level (not inside AgentEvalRunner), calling runner.run() N times as a black box. Each iteration gets its own run_id and run_dir — no trace collisions, no resume conflicts.
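A sketch of that black-box loop, with `make_runner` standing in for however the CLI constructs an `AgentEvalRunner`; the `run_id` format and directory layout here are hypothetical:

```python
import uuid
from pathlib import Path

def run_iterations(make_runner, iterations, base_dir):
    """Call the runner N times as a black box, one fresh run per iteration."""
    scorecards = []
    for i in range(1, iterations + 1):
        run_id = f"iter{i}-{uuid.uuid4().hex[:8]}"  # unique per iteration
        run_dir = Path(base_dir) / run_id           # own dir: no trace collisions
        runner = make_runner(run_id=run_id, run_dir=run_dir)
        scorecards.append(runner.run())             # runner internals untouched
    return scorecards
```

Because each iteration gets its own `run_id` and `run_dir`, resume logic inside the runner never sees a half-finished sibling run.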

Test plan

  • python -m pytest tests/unit/eval/ -xvs — 16 new tests pass
  • python -m pytest tests/test_eval.py -x — 99 existing eval tests unbroken
  • gaia eval agent --help shows --iterations flag
  • gaia eval agent --iterations 0 prints error and exits
  • gaia eval agent --fix --iterations 3 prints incompatibility error
  • gaia eval agent --category mcp_reliability runs MCP scenarios (requires Agent UI backend + claude CLI)

Generated with Claude Code

itomek and others added 5 commits April 2, 2026 11:54
Preload LLM model and ML dependencies at application startup instead
of on the first user prompt, eliminating a 30-120s delay. A new
DispatchQueue tracks background task status and exposes it to the
frontend via /api/system/status and /api/system/tasks.

Backend:
- Add DispatchQueue (src/gaia/ui/dispatch.py) with job lifecycle,
  dependency support, visibility filter, and auto-pruning
- Replace fire-and-forget asyncio.create_task() in lifespan with
  three dispatched startup jobs: check Lemonade, import modules,
  load AI model
- Promote _model_load_lock to model_load_lock for cross-module use
- Extend SystemStatus with init_state and init_tasks fields
- Add GET /api/system/tasks endpoint

Frontend:
- Adaptive status polling (3s during init, 15s otherwise)
- ConnectionBanner Case 0: blue init banner with task name + spinner
- Disable chat input and WelcomeScreen during initialization

Tests:
- 8 unit tests for DispatchQueue (lifecycle, dependencies, visibility,
  concurrency, pruning)
- 8 integration tests for startup init (status endpoint, tasks
  endpoint, wiring, defaults)

Closes #713

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
- Let _load_model exceptions propagate so job fails properly (not
  silent DONE when Lemonade unreachable)
- Import _DEFAULT_MODEL_NAME constant instead of hardcoding
- Use InitTaskInfo model instead of Dict[str, Any] for init_tasks
- Use get_dispatch_queue DI helper in /api/system/tasks endpoint
- Remove unused patch import from test file

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
- Fix finally block overwriting completed_at set by early returns
- Fail loudly when dependency job ID is not found (instead of silent skip)
- Use defensive getattr in get_dispatch_queue DI helper
- Fix adaptive polling replacing 3s with 15s before first status response
- Add regression test for init_tasks error field absence

Co-Authored-By: Tomasz Iniewicz <itomek@users.noreply.github.com>
Add MCP reliability eval scenarios and --iterations CLI flag for measuring
tool calling reliability across models and hardware configurations.

- Add 10 MCP reliability scenarios (simple/moderate/complex tiers)
  in eval/scenarios/mcp_reliability/
- Add --iterations N flag to gaia eval agent for repeated runs
- Add _print_reliability_summary() with colorized pass rate table
  and GO/NO_GO readiness signal
- Block --fix + --iterations > 1 (incompatible modes)
- Write reliability_report.json with per-scenario pass rates
- Add 16 unit tests for scenario validation and iteration aggregation

Closes #709
- Add gaia eval agent command examples to docs/reference/cli.mdx
  including --iterations, --category, --fix, --compare flags
- Add AMD copyright header to tests/unit/eval/__init__.py
@github-actions github-actions bot added the documentation (Documentation changes), cli (CLI changes), tests (Test changes), and electron (Electron app changes) labels on Apr 2, 2026
@itomek itomek self-assigned this Apr 3, 2026
@itomek
Collaborator

itomek commented Apr 9, 2026

@claude review this PR

@github-actions

github-actions bot commented Apr 9, 2026

Claude finished @itomek's task


PR Review: MCP Tool Calling Reliability Test Framework

  • Read PR diff and changed files
  • Review CLI changes (src/gaia/cli.py)
  • Review dispatch module (src/gaia/ui/dispatch.py)
  • Review test scenarios (eval/scenarios/mcp_reliability/)
  • Review unit tests
  • Review frontend changes
  • Provide comprehensive feedback

Summary

This is a well-structured PR that adds a comprehensive MCP reliability evaluation framework along with a boot-time initialization system for the Agent UI. The code is clean, well-documented, and includes thorough test coverage.


Review Feedback

🟢 Strengths

1. Clean Architecture - The DispatchQueue in src/gaia/ui/dispatch.py is well-designed with:

  • Clear job lifecycle states (PENDING → RUNNING → DONE/FAILED)
  • Dependency handling with timeout protection (60s)
  • Automatic pruning of old jobs
  • Thread-safe design using asyncio event loop

2. Comprehensive Test Coverage - 16 new tests covering:

  • Job lifecycle transitions
  • Dependency handling edge cases
  • Concurrency verification
  • CLI validation (--iterations 0, --fix --iterations N)

3. Well-Designed Eval Scenarios - The 10 MCP reliability scenarios cover a good range:

  • Simple: basic tool calls, no-tool-needed restraint, introspection
  • Moderate: structured params, error handling, search context
  • Complex: multi-step chains, conditional selection, multi-document

4. Good UX - Frontend changes properly:

  • Block input during initialization (ChatView.tsx:503-506)
  • Show initialization progress with animated banner (ConnectionBanner.tsx:66-82)
  • Adaptive polling (3s during init, 15s after)

🟡 Suggestions (Non-blocking)

1. Consider using asyncio.Event instead of polling for dependencies

src/gaia/ui/dispatch.py:126-138 - The current implementation polls every 0.2s:

while dep.status not in (JobStatus.DONE, JobStatus.FAILED):
    if time.monotonic() >= deadline:
        ...
    await asyncio.sleep(_DEPENDENCY_POLL_INTERVAL)

An asyncio.Event per job would be more efficient for CPU and latency, though the current approach is perfectly acceptable for boot-time use.
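A sketch of that event-based alternative, with `Job` as a simplified stand-in for the real dispatch module's job class (statuses shown as strings for brevity):

```python
import asyncio

class Job:
    def __init__(self):
        self.status = "PENDING"
        self.done = asyncio.Event()  # set exactly once, on completion

    def finish(self, status):
        self.status = status  # "DONE" or "FAILED"
        self.done.set()       # wakes all waiting dependents immediately

async def wait_for_dep(dep, timeout=60.0):
    """Wait for a dependency without polling; raises TimeoutError on deadline."""
    await asyncio.wait_for(dep.done.wait(), timeout)
    return dep.status
```

Waiters block on the event instead of waking every 0.2s, and they observe completion with no polling-interval latency.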

2. CLI table formatting edge case

src/gaia/cli.py:904 - If a scenario ID exceeds 40 characters, the table formatting may break:

print(f"  {sid:<40} {rate_str:>12} {result_str}")

Consider truncating long IDs or using dynamic width.
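One possible shape for that fix, reusing the `sid`/`rate_str`/`result_str` names from the snippet above (the helper itself is hypothetical):

```python
def format_row(sid, rate_str, result_str, width=40):
    """Render one table row, truncating over-long scenario IDs with an ellipsis."""
    if len(sid) > width:
        sid = sid[: width - 1] + "…"  # keep the column width fixed
    return f"  {sid:<{width}} {rate_str:>12} {result_str}"
```

Alternatively, computing `width = max(len(s) for s in scenario_ids)` once per table keeps full IDs at the cost of a variable-width column.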

3. Test isolation with subprocess

tests/unit/eval/test_iterations.py:112-146 - The subprocess tests call the full CLI. While this tests the real behavior, consider mocking AgentEvalRunner for faster unit tests and reserving subprocess tests for integration.


🔴 Issues (Should Address)

1. Unused import in test file

tests/unit/eval/test_iterations.py:6:

import io

This import is not used in the test file.

2. model_load_lock visibility change should be documented

src/gaia/ui/_chat_helpers.py:62-63 - The lock was renamed from _model_load_lock to model_load_lock (public) for cross-module access. The comment explains why, but this is a public API change that could affect external code importing this module. Consider adding to the module docstring.


📋 Test Plan Verification

Based on code review, the test plan items should work:

  • python -m pytest tests/unit/eval/ -xvs — Tests are well-structured
  • gaia eval agent --help — the --iterations flag is registered at lines 2106-2112
  • gaia eval agent --iterations 0 — Validation at line 3972-3974
  • gaia eval agent --fix --iterations 3 — Incompatibility check at line 3976-3982

Overall Assessment

Approve ✅ - This is high-quality code with good design patterns, comprehensive tests, and proper documentation. The minor issues noted above are non-blocking. The PR successfully adds:

  1. A robust MCP reliability evaluation framework with 10 scenarios across 3 complexity tiers
  2. A --iterations flag for measuring consistency across multiple runs
  3. A boot-time initialization system with frontend progress tracking
  4. GO/NO_GO readiness signals for reliability assessment

Nice work! 🎉




Development

Successfully merging this pull request may close these issues.

MCP tool calling reliability test framework
