Skip to content

Conversation

@ThomasK33
Copy link
Member

Deflake MCP Chrome screenshot integration test.

Changes:

  • Stop leaking MCP/Chrome processes by ensuring setupWorkspace() tears down the workspace via workspace.remove(...) and disposes services during test cleanup.
  • Make the screenshot assertions deterministic by forcing a chrome_take_screenshot tool call via toolPolicy: require (no longer depends on the model deciding to use tools).
  • Reduce CI variance by pinning chrome-devtools-mcp and using a fixed viewport; run PNG/JPEG cases sequentially; increase the post-stream-end tool-call wait.

Validation:

  • make static-check

📋 Implementation Plan

Deflake: tests/ipc/mcpConfig.test.ts MCP screenshot test

What’s failing

  • Test: MCP server integration with model › MCP PNG image content is correctly transformed to AI SDK format
  • Failure: waitForToolCallEnd(…, "chrome_take_screenshot", …) returns undefined → no matching tool-call-end event.

Likely root causes (ranked)

  1. Model nondeterminism: the prompt uses a well-known page (example.com), so the model can sometimes answer from prior knowledge and skip chrome_take_screenshot entirely.
  2. Leaked MCP server processes between tests: setupWorkspace() (in tests/ipc/setup.ts) never calls workspace.remove, so MCP servers started during a test can keep running. That matches the suite’s “Force exiting Jest…” warning and can cause resource contention / sporadic MCP startup failures.
  3. Timing/resource contention: the test runs PNG+JPEG cases concurrently and starts headless Chrome via npx; on slower CI hosts, tool execution and event delivery may exceed the current 20s polling window.
🔎 Evidence in repo
  • tests/ipc/setup.ts::setupWorkspace() cleanup only deletes temp dirs; it does not call env.orpc.workspace.remove({ workspaceId }).
  • WorkspaceService.remove() explicitly stops MCP servers via mcpServerManager.stopServers(workspaceId).
  • mcpConfig.test.ts depends on the model choosing to call chrome_take_screenshot (not enforced).

Recommended approach (A): Keep the integration test, but make it deterministic

Net new product LoC: ~0 (test/harness only)

A1) Fix cleanup so MCP servers don’t leak across tests

  1. Update tests/ipc/setup.ts::setupWorkspace()’s cleanup() to:
    • await env.orpc.workspace.remove({ workspaceId }).catch(() => {}) (must run before deleting env.tempDir)
    • await env.services.dispose() (clears MCP idle interval + terminates background procs)
    • then run existing cleanupTestEnvironment(env) + cleanupTempGitRepo(tempGitRepo)

This should eliminate orphaned Chrome/MCP processes and reduce CI flake across the whole integration suite.

A2) Stop relying on the model “choosing” to call screenshot tools

Modify mcpConfig.test.ts so the test asserts the transformation path without depending on free-form model behavior.

Concrete options (pick one):

Option 1 (preferred): force the tool call using toolPolicy: require and don’t assert the description

  • Send a minimal prompt like:
    • PNG: “Call chrome_take_screenshot now.”
    • JPEG: “Call chrome_take_screenshot with format "jpeg".”
  • Pass options.toolPolicy = [{ regex_match: "chrome_take_screenshot", action: "require" }].
  • Only assert:
    • a tool-call-end event exists for chrome_take_screenshot
    • assertValidScreenshotResult(…, mediaTypePattern) passes
  • Drop (or relax) assertModelDescribesScreenshot(); it adds LLM-output flake and isn’t needed to validate the MCP→AI-SDK media transformation.

Option 2: split into two required calls (navigate then screenshot)

  • Message 1: require chrome_navigate_page and instruct URL.
  • Message 2: require chrome_take_screenshot.
  • This is useful only if we still want to validate “example.com” specifically; otherwise it’s extra moving parts.

A3) Reduce environment-driven variance

  • Pin the MCP server version: replace chrome-devtools-mcp@latest with the currently observed version (chrome-devtools-mcp@0.12.1).
  • Add a deterministic viewport (smaller = faster + avoids huge PNGs): --viewport 1280x720.
  • If CI still flakes, run PNG/JPEG sequentially (remove test.concurrent.each).
  • Increase waitForToolCallEnd timeout from 20s → 60s (CI headless Chrome can be slow).

Alternative approach (B): Move correctness to unit tests; keep only a small integration smoke test

Net new product LoC: ~0

  1. Add a unit test suite for src/node/services/mcpResultTransform.ts:
    • converts MCP { content: [{type:"image", data, mimeType}] }{ type:"content", value:[{type:"media", …}] }
    • preserves mimeTypemediaType
    • validates the size guard behavior (MAX_IMAGE_DATA_BYTES) deterministically
  2. Replace the flaky Chrome+model integration assertion with:
    • existing memory_create_entities MCP integration test (already present)
    • optional: a chrome MCP “tools available” test (no screenshot, no model)

Use this if we decide that a full Chrome+LLM flow is too expensive/flaky for CI.


Optional product hardening (nice-to-have)

Net new product LoC: ~20–60

  • Consider making MCPServerManager.dispose() stop all running workspace servers (not just clear the idle interval). This would harden app shutdown behavior and prevent long-lived processes in any non-test embedding.

Validation

  • Run the failing test in a loop (CI-like):
    • TEST_INTEGRATION=1 bun x jest tests/ipc/mcpConfig.test.ts -t "image content" --runInBand
  • Confirm:
    • tool-call-end for chrome_take_screenshot is always present
    • no lingering Node handles (the “Force exiting Jest…” warning disappears or is reduced)

Generated with mux • Model: openai:gpt-5.2 • Thinking: xhigh

@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Change-Id: Iec69cd54774598041af89b2a402419c2147071c9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Change-Id: I82163d301bc87cc3ab79e7b9eb72951f9287fe51
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Change-Id: I021d09b7ae832aa6c03fc4ce299aa3c1cb31eda7
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Queued auto-send can race with workspace teardown, causing AgentSession.emitChatEvent assertions. Drop emits / skip streaming once disposed and add a regression test.\n\n---\n_Generated with `mux` • Model: openai:gpt-5.2 • Thinking: xhigh_\n

Change-Id: I678498dbd6c52ad0d64d22ce4da4962e67dd8792
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33 ThomasK33 added this pull request to the merge queue Dec 15, 2025
Merged via the queue into main with commit 9aee49b Dec 15, 2025
20 checks passed
@ThomasK33 ThomasK33 deleted the deflake-mcp-tests branch December 15, 2025 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant