🤖 ci: deflake MCP screenshot integration test #1173

ThomasK33 · 2025-12-15T12:12:21Z

Deflake MCP Chrome screenshot integration test.

Changes:

Stop leaking MCP/Chrome processes by ensuring setupWorkspace() tears down the workspace via workspace.remove(...) and disposes services during test cleanup.
Make the screenshot assertions deterministic by forcing a chrome_take_screenshot tool call via toolPolicy: require (no longer depends on the model deciding to use tools).
Reduce CI variance by pinning chrome-devtools-mcp and using a fixed viewport; run PNG/JPEG cases sequentially; increase the post-stream-end tool-call wait.

Validation:

make static-check

📋 Implementation Plan

Deflake: `tests/ipc/mcpConfig.test.ts` MCP screenshot test

What’s failing

Test: MCP server integration with model › MCP PNG image content is correctly transformed to AI SDK format
Failure: waitForToolCallEnd(…, "chrome_take_screenshot", …) returns undefined → no matching tool-call-end event.

Likely root causes (ranked)

Model nondeterminism: the prompt uses a well-known page (example.com), so the model can sometimes answer from prior knowledge and skip chrome_take_screenshot entirely.
Leaked MCP server processes between tests: setupWorkspace() (in tests/ipc/setup.ts) never calls workspace.remove, so MCP servers started during a test can keep running. That matches the suite’s “Force exiting Jest…” warning and can cause resource contention / sporadic MCP startup failures.
Timing/resource contention: the test runs PNG+JPEG cases concurrently and starts headless Chrome via npx; on slower CI hosts, tool execution and event delivery may exceed the current 20s polling window.

🔎 Evidence in repo

tests/ipc/setup.ts::setupWorkspace() cleanup only deletes temp dirs; it does not call env.orpc.workspace.remove({ workspaceId }).
WorkspaceService.remove() explicitly stops MCP servers via mcpServerManager.stopServers(workspaceId).
mcpConfig.test.ts depends on the model choosing to call chrome_take_screenshot (not enforced).

Recommended approach (A): Keep the integration test, but make it deterministic

Net new product LoC: ~0 (test/harness only)

A1) Fix cleanup so MCP servers don’t leak across tests

Update tests/ipc/setup.ts::setupWorkspace()’s cleanup() to:
- await env.orpc.workspace.remove({ workspaceId }).catch(() => {}) (must run before deleting env.tempDir)
- await env.services.dispose() (clears MCP idle interval + terminates background procs)
- then run existing cleanupTestEnvironment(env) + cleanupTempGitRepo(tempGitRepo)

This should eliminate orphaned Chrome/MCP processes and reduce CI flake across the whole integration suite.

A2) Stop relying on the model “choosing” to call screenshot tools

Modify mcpConfig.test.ts so the test asserts the transformation path without depending on free-form model behavior.

Concrete options (pick one):

Option 1 (preferred): force the tool call using toolPolicy: require and don’t assert the description

Send a minimal prompt like:
- PNG: “Call chrome_take_screenshot now.”
- JPEG: “Call chrome_take_screenshot with format "jpeg".”
Pass options.toolPolicy = [{ regex_match: "chrome_take_screenshot", action: "require" }].
Only assert:
- a tool-call-end event exists for chrome_take_screenshot
- assertValidScreenshotResult(…, mediaTypePattern) passes
Drop (or relax) assertModelDescribesScreenshot(); it adds LLM-output flake and isn’t needed to validate the MCP→AI-SDK media transformation.

Option 2: split into two required calls (navigate then screenshot)

Message 1: require chrome_navigate_page and instruct URL.
Message 2: require chrome_take_screenshot.
This is useful only if we still want to validate “example.com” specifically; otherwise it’s extra moving parts.

A3) Reduce environment-driven variance

Pin the MCP server version: replace chrome-devtools-mcp@latest with the currently observed version (chrome-devtools-mcp@0.12.1).
Add a deterministic viewport (smaller = faster + avoids huge PNGs): --viewport 1280x720.
If CI still flakes, run PNG/JPEG sequentially (remove test.concurrent.each).
Increase waitForToolCallEnd timeout from 20s → 60s (CI headless Chrome can be slow).

Alternative approach (B): Move correctness to unit tests; keep only a small integration smoke test

Net new product LoC: ~0

Add a unit test suite for src/node/services/mcpResultTransform.ts:
- converts MCP { content: [{type:"image", data, mimeType}] } → { type:"content", value:[{type:"media", …}] }
- preserves mimeType → mediaType
- validates the size guard behavior (MAX_IMAGE_DATA_BYTES) deterministically
Replace the flaky Chrome+model integration assertion with:
- existing memory_create_entities MCP integration test (already present)
- optional: a chrome MCP “tools available” test (no screenshot, no model)

Use this if we decide that a full Chrome+LLM flow is too expensive/flaky for CI.

Optional product hardening (nice-to-have)

Net new product LoC: ~20–60

Consider making MCPServerManager.dispose() stop all running workspace servers (not just clear the idle interval). This would harden app shutdown behavior and prevent long-lived processes in any non-test embedding.

Validation

Run the failing test in a loop (CI-like):
- TEST_INTEGRATION=1 bun x jest tests/ipc/mcpConfig.test.ts -t "image content" --runInBand
Confirm:
- tool-call-end for chrome_take_screenshot is always present
- no lingering Node handles (the “Force exiting Jest…” warning disappears or is reduced)

Generated with mux • Model: openai:gpt-5.2 • Thinking: xhigh

chatgpt-codex-connector · 2025-12-15T12:12:29Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Change-Id: Iec69cd54774598041af89b2a402419c2147071c9 Signed-off-by: Thomas Kosiewski <tk@coder.com>

Change-Id: I82163d301bc87cc3ab79e7b9eb72951f9287fe51 Signed-off-by: Thomas Kosiewski <tk@coder.com>

Change-Id: I021d09b7ae832aa6c03fc4ce299aa3c1cb31eda7 Signed-off-by: Thomas Kosiewski <tk@coder.com>

Queued auto-send can race with workspace teardown, causing AgentSession.emitChatEvent assertions. Drop emits / skip streaming once disposed and add a regression test.\n\n---\n_Generated with `mux` • Model: openai:gpt-5.2 • Thinking: xhigh_\n Change-Id: I678498dbd6c52ad0d64d22ce4da4962e67dd8792 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 force-pushed the deflake-mcp-tests branch from fa3f7a1 to 074f092 Compare December 15, 2025 13:57

ThomasK33 added 3 commits December 15, 2025 16:05

ci: deflake MCP screenshot integration test

f0ac63f

Change-Id: Iec69cd54774598041af89b2a402419c2147071c9 Signed-off-by: Thomas Kosiewski <tk@coder.com>

fix: use correct AI SDK toolChoice shape

fb57f45

Change-Id: I82163d301bc87cc3ab79e7b9eb72951f9287fe51 Signed-off-by: Thomas Kosiewski <tk@coder.com>

fix: ignore queued message autosend after session dispose

1199d2f

Change-Id: I021d09b7ae832aa6c03fc4ce299aa3c1cb31eda7 Signed-off-by: Thomas Kosiewski <tk@coder.com>

ThomasK33 force-pushed the deflake-mcp-tests branch from e22197b to 1199d2f Compare December 15, 2025 15:06

ThomasK33 added this pull request to the merge queue Dec 15, 2025

Merged via the queue into main with commit 9aee49b Dec 15, 2025
20 checks passed

ThomasK33 deleted the deflake-mcp-tests branch December 15, 2025 16:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

🤖 ci: deflake MCP screenshot integration test #1173

🤖 ci: deflake MCP screenshot integration test #1173

Uh oh!

ThomasK33 commented Dec 15, 2025

Uh oh!

chatgpt-codex-connector bot commented Dec 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🤖 ci: deflake MCP screenshot integration test #1173

🤖 ci: deflake MCP screenshot integration test #1173

Uh oh!

Conversation

ThomasK33 commented Dec 15, 2025

Deflake: tests/ipc/mcpConfig.test.ts MCP screenshot test

What’s failing

Likely root causes (ranked)

Recommended approach (A): Keep the integration test, but make it deterministic

A1) Fix cleanup so MCP servers don’t leak across tests

A2) Stop relying on the model “choosing” to call screenshot tools

A3) Reduce environment-driven variance

Alternative approach (B): Move correctness to unit tests; keep only a small integration smoke test

Optional product hardening (nice-to-have)

Validation

Uh oh!

chatgpt-codex-connector bot commented Dec 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Deflake: `tests/ipc/mcpConfig.test.ts` MCP screenshot test