🤖 ci: deflake MCP screenshot integration test #1173
Merged
Deflake MCP Chrome screenshot integration test.
Changes:
- `setupWorkspace()` tears down the workspace via `workspace.remove(...)` and disposes services during test cleanup.
- Force the `chrome_take_screenshot` tool call via `toolPolicy: require` (no longer depends on the model deciding to use tools).
- Reduce variance by pinning `chrome-devtools-mcp` and using a fixed viewport; run PNG/JPEG cases sequentially; increase the post-`stream-end` tool-call wait.

Validation:
- `make static-check`

📋 Implementation Plan
Deflake: `tests/ipc/mcpConfig.test.ts` MCP screenshot test

What’s failing
- `MCP server integration with model › MCP PNG image content is correctly transformed to AI SDK format`
- `waitForToolCallEnd(…, "chrome_take_screenshot", …)` returns `undefined` → no matching `tool-call-end` event.

Likely root causes (ranked)
1. The prompt targets a well-known site (`example.com`), so the model can sometimes answer from prior knowledge and skip `chrome_take_screenshot` entirely.
2. `setupWorkspace()` (in `tests/ipc/setup.ts`) never calls `workspace.remove`, so MCP servers started during a test can keep running. That matches the suite’s “Force exiting Jest…” warning and can cause resource contention / sporadic MCP startup failures.
3. The MCP server is launched via `npx`; on slower CI hosts, tool execution and event delivery may exceed the current 20s polling window.

🔎 Evidence in repo
- `tests/ipc/setup.ts::setupWorkspace()` cleanup only deletes temp dirs; it does not call `env.orpc.workspace.remove({ workspaceId })`.
- `WorkspaceService.remove()` explicitly stops MCP servers via `mcpServerManager.stopServers(workspaceId)`.
- `mcpConfig.test.ts` depends on the model choosing to call `chrome_take_screenshot` (not enforced).

Recommended approach (A): Keep the integration test, but make it deterministic
Net new product LoC: ~0 (test/harness only)
A1) Fix cleanup so MCP servers don’t leak across tests
Update `tests/ipc/setup.ts::setupWorkspace()`’s `cleanup()` to:

1. `await env.orpc.workspace.remove({ workspaceId }).catch(() => {})` (must run before deleting `env.tempDir`)
2. `await env.services.dispose()` (clears MCP idle interval + terminates background procs)
3. `cleanupTestEnvironment(env)` + `cleanupTempGitRepo(tempGitRepo)`

This should eliminate orphaned Chrome/MCP processes and reduce CI flake across the whole integration suite.
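The ordering above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repo's actual harness: `CleanupDeps` and `makeCleanup` are hypothetical names, and the four callbacks merely stand in for `env.orpc.workspace.remove`, `env.services.dispose`, `cleanupTestEnvironment`, and `cleanupTempGitRepo`.

```typescript
// Hypothetical dependency bag: each callback stands in for a real call named in the plan.
interface CleanupDeps {
  removeWorkspace: () => Promise<unknown>; // stands in for env.orpc.workspace.remove({ workspaceId })
  disposeServices: () => Promise<void>; // stands in for env.services.dispose()
  cleanupTestEnvironment: () => Promise<void>; // stands in for cleanupTestEnvironment(env)
  cleanupTempGitRepo: () => Promise<void>; // stands in for cleanupTempGitRepo(tempGitRepo)
}

// Hypothetical helper that builds a cleanup() with the ordering proposed in A1.
function makeCleanup(deps: CleanupDeps): () => Promise<void> {
  return async () => {
    // 1. Remove the workspace first so its MCP servers are stopped.
    //    A rejected promise is swallowed so the remaining teardown still runs.
    await deps.removeWorkspace().catch(() => {});
    // 2. Dispose services next.
    await deps.disposeServices();
    // 3. Only now delete temp dirs and the temp git repo.
    await deps.cleanupTestEnvironment();
    await deps.cleanupTempGitRepo();
  };
}
```

Injecting the steps as callbacks keeps the ordering testable without starting Chrome or an MCP server; the real `cleanup()` would call the harness functions directly.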
A2) Stop relying on the model “choosing” to call screenshot tools
Modify `mcpConfig.test.ts` so the test asserts the transformation path without depending on free-form model behavior.

Concrete options (pick one):
Option 1 (preferred): force the tool call using `toolPolicy: require` and don’t assert the description

- Use prompts that name the tool explicitly: “… `chrome_take_screenshot` now.” and “… `chrome_take_screenshot` with format "jpeg".”
- Set `options.toolPolicy = [{ regex_match: "chrome_take_screenshot", action: "require" }]`.
- Assert that a `tool-call-end` event exists for `chrome_take_screenshot` and that `assertValidScreenshotResult(…, mediaTypePattern)` passes.
- Drop `assertModelDescribesScreenshot()`; it adds LLM-output flake and isn’t needed to validate the MCP→AI-SDK media transformation.

Option 2: split into two required calls (navigate then screenshot)
1. Require `chrome_navigate_page` and instruct the URL.
2. Require `chrome_take_screenshot`.

A3) Reduce environment-driven variance
- Replace `chrome-devtools-mcp@latest` with the currently observed version (`chrome-devtools-mcp@0.12.1`).
- Use a fixed viewport: `--viewport 1280x720`.
- Run the PNG/JPEG cases sequentially (no `test.concurrent.each`).
- Increase the `waitForToolCallEnd` timeout from 20s → 60s (CI headless Chrome can be slow).

Alternative approach (B): Move correctness to unit tests; keep only a small integration smoke test
Net new product LoC: ~0
- Add unit tests for `src/node/services/mcpResultTransform.ts`:
  - `{ content: [{type:"image", data, mimeType}] }` → `{ type:"content", value:[{type:"media", …}] }`
  - `mimeType` → `mediaType`
  - size-limit handling (`MAX_IMAGE_DATA_BYTES`), deterministically
- Keep the `memory_create_entities` MCP integration test (already present)

Use this if we decide that a full Chrome+LLM flow is too expensive/flaky for CI.
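A unit-level check of the shape mapping might look roughly like the sketch below. `transformImageResult` is a hypothetical stand-in written for illustration; the real export name and behavior in `mcpResultTransform.ts` are not shown in this plan. The `data` field on the media part is also an assumption, since the plan elides that part of the shape. Size-limit behavior (`MAX_IMAGE_DATA_BYTES`) is deliberately left out because the plan does not specify it.

```typescript
// Input/output shapes follow the plan; anything beyond the fields shown there is assumed.
interface McpImagePart {
  type: "image";
  data: string; // base64 payload
  mimeType: string;
}
interface McpResult {
  content: McpImagePart[];
}
interface MediaPart {
  type: "media";
  data: string; // assumed field: the plan elides the media part's contents
  mediaType: string;
}
interface ContentOutput {
  type: "content";
  value: MediaPart[];
}

// Hypothetical stand-in for the transform under test (not the repo's implementation).
function transformImageResult(result: McpResult): ContentOutput {
  return {
    type: "content",
    value: result.content.map((part) => ({
      type: "media" as const,
      data: part.data,
      mediaType: part.mimeType, // the mimeType → mediaType rename
    })),
  };
}

// Example input: a single PNG image part with a placeholder base64 string.
const transformed = transformImageResult({
  content: [{ type: "image", data: "iVBORw0KGgo=", mimeType: "image/png" }],
});
console.log(JSON.stringify(transformed));
```

A check like this needs no Chrome, MCP server, or model call, which is what makes approach (B) deterministic.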
Optional product hardening (nice-to-have)
Net new product LoC: ~20–60
- Make `MCPServerManager.dispose()` stop all running workspace servers (not just clear the idle interval). This would harden app shutdown behavior and prevent long-lived processes in any non-test embedding.

Validation
- Run `TEST_INTEGRATION=1 bun x jest tests/ipc/mcpConfig.test.ts -t "image content" --runInBand`.
- Confirm `tool-call-end` for `chrome_take_screenshot` is always present.

Generated with `mux` • Model: `openai:gpt-5.2` • Thinking: `xhigh`