Installer tests: custom-agent + MCP harness (dummy MCP, with-MCP, no-MCP) #993

Parent: #989
Blocked by: #989 Child 1 (pre-release artifact handling) and Child 2 (self-hosted runners)

## Problem

The installer matrix in Child 3 (#989) verifies install-time correctness: does GAIA install and reach `state: ready` across the OS / UV / Lemonade combinations? It does not exercise what users actually do with GAIA after install, which is running a custom agent, optionally connected to an MCP server.

We have MCP unit tests under `tests/mcp/servers/test_*`, but no test drives an installed GAIA binary running a custom agent through real MCP and non-MCP paths. A regression in the MCP wiring or the agent runtime would therefore slip past Child 3.

## Scope

Build a custom-agent + MCP test harness that runs against the installed binary on the same self-hosted runners as Child 3:

1. **Dummy MCP server** — minimal HTTP/stdio server that exposes a known set of fake tools (e.g., `echo`, `add_two_numbers`, `mock_search`). Lives in `tests/fixtures/mcp/dummy_server/` and is launched as a subprocess by the test.
2. **Custom agent fixture** — a small custom agent definition (skill spec or equivalent) that the test installs into the running GAIA instance.
3. **Three test cases:**
   - Custom agent + dummy MCP — agent invokes a dummy MCP tool, harness asserts the tool was called with the right args and the result reached the agent
   - Custom agent + real MCP (one of the bundled servers, e.g., the file MCP) — sanity check that the dummy harness isn't masking a real-MCP-only failure
   - Custom agent + no MCP — agent runs a code-only path, harness asserts no MCP traffic
4. **Run on every cell of Child 3** (or a smaller subset, TBD; see Open questions) so we know the agent runtime works on every OS the matrix covers.

## Acceptance criteria

- [ ] Dummy MCP server fixture exists, has its own pytest sanity test, and can be invoked both as a CLI and as a pytest fixture
- [ ] Custom agent fixture exists and can be installed into a running GAIA instance via the supported install flow (skill install, agent registration, or whatever the supported entry point is), not by hand-editing files
- [ ] All three test cases (with-MCP, dummy-MCP, no-MCP) pass on Windows + Ubuntu
- [ ] Each intentional break (dummy MCP returns the wrong type, agent is removed, MCP server is killed mid-run) produces a clear failure with diagnosable logs
- [ ] Each test cell uploads agent logs + MCP server logs as artifacts on failure
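
The "tool was called with the right args" and "no MCP traffic" checks, together with the diagnosable-failure criterion, could share a pair of helpers. This is a sketch that assumes the dummy server appends one `{"tool": ..., "arguments": ...}` JSON line per `tools/call` (a hypothetical log format); the helper names are also hypothetical. Each failure message embeds the full call log, so CI output is readable before anyone downloads artifacts.

```python
# Hypothetical harness helpers (sketch): assertions over the dummy MCP server's
# JSONL call log, assumed to hold one {"tool": ..., "arguments": ...} per line.
import json
from pathlib import Path


def read_calls(log_path):
    """Return recorded tool calls; a missing log means nothing was called."""
    path = Path(log_path)
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def assert_tool_called(log_path, tool, arguments):
    """Fail with the whole call log if `tool` never ran with exactly `arguments`."""
    calls = read_calls(log_path)
    expected = {"tool": tool, "arguments": arguments}
    if expected not in calls:
        # Explicit raise rather than `assert`, so `python -O` cannot strip the check.
        raise AssertionError(
            f"expected MCP call {expected!r} not found in {log_path}; "
            f"recorded calls ({len(calls)}):\n{json.dumps(calls, indent=2)}"
        )


def assert_no_mcp_traffic(log_path):
    """No-MCP case: fail, listing every call, if anything reached the dummy server."""
    calls = read_calls(log_path)
    if calls:
        raise AssertionError(
            f"expected no MCP traffic, but {log_path} recorded {len(calls)} call(s):\n"
            f"{json.dumps(calls, indent=2)}"
        )
```

`assert_no_mcp_traffic` only proves nothing reached the dummy server. For the no-MCP case we would presumably start the server without wiring it into the agent and still check the agent logs for MCP activity.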

## Open questions

- Subset or full coverage? Running the agent harness on all 8 Child 3 cells would roughly double matrix runtime. Default proposal: run it only on the {UV installed, Lemonade installed} cell of each OS, since this issue is about the agent runtime, not install preconditions. Open for discussion.
- Should the dummy MCP server be reused by other test surfaces (e.g., Playwright Agent UI E2E in #875)? If so, factor it into `tests/fixtures/mcp/` deliberately.
- Custom-agent definition format: skill, agent spec, or both? Match whatever the supported user path is at the time this lands.
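
To make the coverage trade-off concrete, here is the default subset proposal as a small computation. It assumes Child 3's 8 cells are {Windows, Ubuntu} × {UV installed, not installed} × {Lemonade installed, not installed}; the axis labels are hypothetical and the real matrix may name them differently.

```python
# Sketch of the subset proposal; axis values are assumptions inferred from this issue.
from itertools import product

OSES = ["windows", "ubuntu"]
UV = ["uv-installed", "uv-missing"]
LEMONADE = ["lemonade-installed", "lemonade-missing"]

child3_cells = list(product(OSES, UV, LEMONADE))

# Proposed default: run the agent harness only where both preconditions are met.
agent_cells = [
    cell for cell in child3_cells
    if cell[1] == "uv-installed" and cell[2] == "lemonade-installed"
]

print(len(child3_cells), len(agent_cells))  # prints: 8 2
```

If a harness run costs about the same in every cell, the subset adds 2 runs instead of 8, i.e. a quarter of the extra runtime full coverage would add.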
