
Installer tests: custom-agent + MCP harness (dummy MCP, with-MCP, no-MCP) #993

@itomek

Parent: #989
Blocked by: #989 Child 1 (pre-release artifact handling) and Child 2 (self-hosted runners)

Problem

The installer matrix in Child 3 (#989) verifies install-time correctness — does GAIA install and reach state: ready across the OS / UV / Lemonade combinations. It does not exercise what users actually do with GAIA after install: run a custom agent, optionally connected to an MCP server.

We have unit tests for MCP under tests/mcp/servers/test_* but no test that drives an installed GAIA binary running a custom agent through real MCP and non-MCP paths. A regression in the MCP wiring or the agent runtime would not be caught by Child 3.

Scope

Build a custom-agent + MCP test harness that runs against the installed binary on the same self-hosted runners as Child 3:

  1. Dummy MCP server — minimal HTTP/stdio server that exposes a known set of fake tools (e.g., echo, add_two_numbers, mock_search). Lives in tests/fixtures/mcp/dummy_server/ and is launched as a subprocess by the test.
  2. Custom agent fixture — a small custom agent definition (skill spec or equivalent) that the test installs into the running GAIA instance.
  3. Three test cases:
    • Custom agent + dummy MCP — agent invokes a dummy MCP tool, harness asserts the tool was called with the right args and the result reached the agent
    • Custom agent + real MCP (one of the bundled servers, e.g., the file MCP) — sanity check that the dummy harness isn't masking a real-MCP-only failure
    • Custom agent + no MCP — agent runs a code-only path, harness asserts no MCP traffic
  4. Run on every cell of Child 3 (or a smaller subset — TBD) so we know the agent runtime works on every OS the matrix covers.
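
To make item 1 concrete, here is a minimal sketch of what the stdio variant of the dummy server could look like. It assumes newline-delimited JSON-RPC 2.0 with only the tools/list and tools/call methods and skips the MCP initialize handshake entirely; a real fixture would more likely be built on an MCP SDK. The names handle_request, serve, and call_log are hypothetical, and the tool behaviours are invented for illustration.

```python
import json
import sys
from typing import Any, Callable, TextIO

# Fake tools named in the issue; their behaviour here is an assumption.
TOOLS: dict[str, Callable[..., Any]] = {
    "echo": lambda text: text,
    "add_two_numbers": lambda a, b: a + b,
    "mock_search": lambda query: [f"result for {query}"],
}


def _error(req_id: Any, code: int, message: str) -> dict:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def handle_request(request: dict, call_log: list) -> dict:
    """Answer one JSON-RPC request and record every tool call in call_log."""
    req_id = request.get("id")
    method = request.get("method")
    if method == "tools/list":
        tools = [{"name": name} for name in sorted(TOOLS)]
        return {"jsonrpc": "2.0", "id": req_id, "result": {"tools": tools}}
    if method == "tools/call":
        params = request.get("params", {})
        name = params.get("name")
        arguments = params.get("arguments", {})
        if name not in TOOLS:
            return _error(req_id, -32602, f"unknown tool: {name}")
        # Record the call before running it so the harness can assert on args.
        call_log.append({"name": name, "arguments": arguments})
        try:
            value = TOOLS[name](**arguments)
        except TypeError as exc:
            return _error(req_id, -32602, f"bad arguments for {name}: {exc}")
        content = [{"type": "text", "text": json.dumps(value)}]
        return {"jsonrpc": "2.0", "id": req_id, "result": {"content": content, "isError": False}}
    return _error(req_id, -32601, f"method not found: {method}")


def serve(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> list:
    """Read one JSON request per line, write one JSON response per line."""
    call_log: list = []
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        response = handle_request(json.loads(line), call_log)
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()
    return call_log
```

Keeping handle_request separate from the I/O loop means the same logic can back a pytest fixture (call it directly and inspect call_log) and a CLI entry point (call serve() from a script), which is what the acceptance criteria ask for.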

Acceptance criteria

  • Dummy MCP server fixture exists, has its own pytest test for sanity, and is invokable both as a CLI and as a pytest fixture
  • Custom agent fixture exists and can be installed into a running GAIA instance via the supported install flow (skill install, agent registration, whatever the supported entry point is — not hand-editing files)
  • All three test cases (with-MCP, dummy-MCP, no-MCP) pass on Windows + Ubuntu
  • An intentional break (returning the wrong type from the dummy MCP, removing the agent, or killing the MCP server mid-run) produces a clear failure with diagnosable logs
  • Each test cell uploads agent logs + MCP server logs as artifacts on failure
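
The "right args" and "no MCP traffic" checks, together with the requirement that failures be diagnosable, could be expressed as small assertion helpers along the following lines. This is only a sketch: assert_tool_called and assert_no_mcp_traffic are hypothetical names, and it assumes the dummy server exposes its recorded calls as a list of {"name", "arguments"} dicts.

```python
import json


def assert_tool_called(call_log: list, name: str, arguments: dict) -> None:
    """Fail with the full recorded call log if the expected call is missing."""
    for call in call_log:
        if call["name"] == name and call["arguments"] == arguments:
            return
    raise AssertionError(
        f"expected call {name}({json.dumps(arguments, sort_keys=True)}); "
        f"recorded calls: {json.dumps(call_log, sort_keys=True)}"
    )


def assert_no_mcp_traffic(call_log: list) -> None:
    """Fail if the no-MCP case produced any tool call at all."""
    if call_log:
        raise AssertionError(
            f"expected no MCP traffic, recorded {len(call_log)} call(s): "
            f"{json.dumps(call_log, sort_keys=True)}"
        )
```

Putting the recorded calls into the failure message means a CI log is enough to see what the agent actually sent, without rerunning the cell.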

Open questions

  • Subset or full coverage? Running the agent harness on all 8 Child 3 cells doubles matrix runtime. Default proposal: run only on the {UV installed, Lemonade installed} cell of each OS — since this issue is about agent runtime, not install preconditions. Open for discussion.
  • Should the dummy MCP server be reused by other test surfaces (e.g., the Playwright Agent UI E2E work in #875, "Deep audit: CI/CD and test coverage gaps + Playwright/Strix Halo E2E plan")? If so, factor it into tests/fixtures/mcp/ deliberately.
  • Custom-agent definition format: skill, agent spec, or both? Match whatever the supported user path is at the time this lands.
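
For the subset question, the default proposal amounts to a simple filter over the matrix cells. The sketch below assumes the 8 cells are the product of 2 OSes, UV installed or not, and Lemonade installed or not; the issue gives the count and the dimensions but not the exact layout, so the cell shape and key names here are hypothetical.

```python
from itertools import product

# Assumed matrix layout: 2 OSes x UV present/absent x Lemonade present/absent.
OSES = ["windows", "ubuntu"]
cells = [
    {"os": os_name, "uv": uv, "lemonade": lemonade}
    for os_name, uv, lemonade in product(OSES, [True, False], [True, False])
]

# Default proposal: run the agent harness only where both UV and Lemonade are installed.
agent_cells = [cell for cell in cells if cell["uv"] and cell["lemonade"]]

for cell in agent_cells:
    print(cell["os"])
```

Under these assumptions the harness would run on 2 of the 8 cells, one per OS.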

Labels: devops (DevOps/infrastructure changes); domain:quality (Tests, CI/CD, security, performance, evals); installer (Installer changes); p1 (medium priority); tests (Test changes)