feat: reload MCP config on change via lazy stat-on-request by mafredri · Pull Request #24700 · coder/coder

mafredri · 2026-04-24T08:26:43Z

The MCP manager previously read .mcp.json exactly once at agent startup.
Editing the file had no effect until workspace rebuild or agent restart.

This change adds lazy stat-on-request: the handleListTools handler checks
config file mtimes on every tool-list request and triggers a differential
reload when any file changed. Unchanged servers keep their client pointer
so in-flight tool calls survive. Concurrent reload requests coalesce via
singleflight.

MCP stdio subprocesses now use the agent's execer for resource
limits and receive the same enriched environment as SSH sessions
(via updateCommandEnv). This ensures ScriptBinDir is in PATH
so Coder module binaries are found.

On the chatd side, WorkspaceMCPTool.Run now detects 404 responses from
CallMCPTool (indicating the server was removed) and drops the chat's
cached tool list so the next turn refetches from the agent.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

coder-agents-review

Round 5. Two new commits add login shell wrapping and environment enrichment for MCP stdio subprocesses (121 insertions). The enrichment design is correct (same updateCommandEnv as SSH, same execer, config env overrides applied last). One P1, one P2, three P3, one P4.

The P1 (DEREM-35) is a platform breakage: the shell wrapping uses POSIX-only constructs unconditionally, breaking every stdio MCP server on Windows. Three reviewers traced the same failure: -l -c and exec are not valid for pwsh.exe or cmd.exe. The CI failures on windows-2022 are consistent with this. The SSH code has explicit platform guards for this exact case.

The P2 (DEREM-36) is a protocol corruption risk: login shell -l sources user profiles that can write to stdout, injecting non-JSON bytes before the MCP handshake. The SSH code deliberately avoids -l for command execution; the MCP code should follow the same pattern.

Process observation: the two commits add 121 lines of new behavior (buildEnv enrichment, login shell wrapping, resolveUserShell, execer integration) and update 21 test call sites to compile, but add zero new assertions for the new behavior. This is the same pattern as five prior findings (DEREM-7, DEREM-9, DEREM-10, DEREM-27, DEREM-28), each fixed when pointed out.

"What happens at scale: a platform team pushes a template with a base image whose /etc/profile.d/company-banner.sh writes an ASCII banner. Every MCP server on that template silently fails." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

coder-agents-review

Round 6. Shell wrapping removed entirely (DEREM-34, DEREM-35, DEREM-36 all resolved by deletion). GOTRACEBACK leak fixed (DEREM-37). DEREM-38 moot (resolveUserShell removed). CI passing.

DEREM-33 (buildEnv test gap) contested and closed by panel vote (3/4 accept). Mafuuu: error fallback does not corrupt state or alter lifecycle. Meruem: consequence is degraded but functional, logged. Zoro: no new evidence. Bisky dissented with a concrete 10-line test sketch disproving the "requires real agent" claim, but the majority found the risk bounded by the function's simplicity. This needs a human decision: the test is trivial to write but the panel found the gap acceptable.

All 39 findings across 6 rounds resolved: 29 fixed, 3 accepted, 2 panel-closed (DEREM-8 4/4, DEREM-33 3/4), 5 dropped (Notes). One Nit remains open.

"For a void-return operation, the conventional instantiation would be Group[string, struct{}] with the reload error in the second return." (Mafuuu, on the singleflight semantics inversion)

🤖 This review was automatically generated with Coder Agents.

coder-agents-review

Round 7. PR rebased. Latest commit refactors manager.go (serverDiff struct, public API reordering, 288 ins / 283 del). DEREM-39 (singleflight semantics) fixed. CI has a race test failure (test-go-race-pg). Three P3s, two Nits.

The refactor introduced DEREM-40 (orphaned doc comment fragment), DEREM-41 (Close ExitError filtering untested), and surfaced DEREM-42 (micro-TOCTOU in parseAndDedup where ReadFile precedes Stat). DEREM-42 is the same class of bug as the original DEREM-3, but with a microsecond window instead of 30 seconds.

"If someone removes the errors.As(err, &exitErr) guard, Close() returns subprocess kill errors on every agent shutdown. No test catches this." (Bisky)

🤖 This review was automatically generated with Coder Agents.

coder-agents-review

Round 8. All 5 R7 findings verified fixed by the panel. Mafuuu confirmed all five fixes and ran a four-mode contract sweep. Meruem traced six structural modes (lock discipline, generation counter, singleflight, updateEnv goroutine safety, Close-during-RefreshTools, test coverage). Bisky verified every test assertion. Chopper confirmed all failure paths and diagnostics. No new P0-P3 findings.

All 45 findings across 8 rounds resolved: 34 fixed, 3 accepted, 2 panel-closed, 6 dropped. One Nit open (DEREM-45, drive-by assertion change in an unrelated storybook file). CI pending.

"Eight rounds of polishing and it shows." (Bisky)

🤖 This review was automatically generated with Coder Agents.

johnstcn

Some comments below, but non-blocking. I don't need to review again.

johnstcn · 2026-04-27T18:53:13Z

+// fakeAgentConn implements just enough of workspacesdk.AgentConn
+// for testing CallMCPTool.
+type fakeAgentConn struct {
+	workspacesdk.AgentConn
+	callMCPToolFunc func(ctx context.Context, req workspacesdk.CallMCPToolRequest) (workspacesdk.CallMCPToolResponse, error)
+}


suggestion: the only usage of conn is CallMCPTool, so suggest defining a CallMCPTooler interface. Then you can just define a fakeCallMCPTooler and pass it in.

Agree in principle. The test only needs CallMCPTool, so a narrow interface would be cleaner. However, WorkspaceMCPTool.getConn returns workspacesdk.AgentConn (the full interface), so introducing CallMCPTooler requires changing the production type signature. That's a broader refactor beyond this PR's scope. The current embedding approach works and is idiomatic for test fakes.

🤖 Posted using /amend-review skill via Coder Agents.

…request The MCP manager previously read .mcp.json exactly once at agent startup. Editing the file had no effect until workspace rebuild. This change adds: 1. Snapshot-based change detection: the Manager records (mtime, size) for each config file after every reload and compares on each tool-list request. 2. Differential reload (D12): unchanged servers keep their client pointer so in-flight tool calls survive. Only new or changed servers get a fresh connection. Removed servers are closed. 3. Singleflight coalescing (D10): concurrent reload requests share a single in-flight connect pass via DoChan. The reload body runs under the Manager's lifetime context so it survives caller cancellation. 4. Chatd cache invalidation on 404 (D13): when CallMCPTool returns a 404 (ErrUnknownServer), the chat's cached tool list is dropped so the next turn refetches from the agent. The handleListTools handler now stats config files on every request and triggers a reload when any file differs from the snapshot. New chats see updated tools; existing chats recover from server removals via the 404 invalidation path.

- Restructure Manager struct: ctx at top, all mu-protected fields grouped tight under mu with no blank-line separators. - Trim verbose comments on struct fields. - Extract classifyServers, connectAll, installServers from connect to reduce function length. - Revert local mutex name from cmu back to mu (no shadowing issue). - Rename callerCtx to ctx in Reload (idiomatic Go). - Make TestSnapshotChanged and TestHandleListTools table-driven. - Keep real filesystem (os.Stat) consistent with package patterns; ParseConfig already uses os.ReadFile directly.

- Fix TOCTOU: capture snapshot at parse time, before connectAll, so mid-reload config edits trigger a re-check on the next request. - Fix perpetual reload: deduplicate paths in SnapshotChanged before comparing lengths against the snapshot map. - Rename connect to doReload for clarity. - Fix stale log message (connect -> reload). - Fix handleListTools doc comment (order was backwards). - Skip redundant RefreshTools when reload already ran. - Replace magic 404 literal with http.StatusNotFound. - Use maps.Clone, wg.Go (Go 1.25), remove dead tc := tc. - Fix InFlightToolCallSurvivesReload to require.Error. - Add multi-config-file test (TestSnapshotChanged_MultipleConfigFiles). - Delete duplicate e2e test, hoist closure, restore dedup comment. - Add race-guard and singleflight key documentation.

… test names

…tion

MCP stdio subprocesses used os.Environ() directly, missing ScriptBinDir and manifest environment variables. Thread the agent's updateCommandEnv callback into the Manager so MCP servers see the same PATH and env as SSH sessions.

Thread the agent's execer and updateCommandEnv callback into the MCP Manager. Stdio subprocesses now get proper resource limits via execer.CommandContext and the same enriched environment as SSH sessions (CODER_* vars, manifest env, ScriptBinDir in PATH). Use usershell.SystemEnvInfo{}.Environ() instead of os.Environ() to filter GOTRACEBACK=none from subprocess environments.

…ppress expected close errors Two fixes in manager.go: 1. The reload singleflight used Group[string, error], placing the reload error in res.Val while res.Err was always nil. This inverts the convention used by every other typed singleflight in the codebase (e.g. coderd/x/chatd/configcache.go). Change to Group[string, struct{}] with the error in the second return. 2. Close() propagated exec.ExitError from subprocess kills during shutdown, causing ERROR-level log noise in cli golden tests. Filter out exec.ExitError since subprocess kill signals are expected during intentional close.

Replace the 5-return classifyServers and 5-parameter installServers with a focused serverDiff struct that captures only the classify output. connectAll and installServers take explicit parameters; the compiler enforces what each function receives. Reorder file: public API (Reload, SnapshotChanged, Tools, CallTool, RefreshTools, Close) appears after the constructor. Private reload internals, connection internals, and helpers follow in sections. Also restores three comments from the pre-refactor code: - parseAndDedup: missing files are silently skipped (not logged). - doReload close loop: preserves the leak-prevention rationale. - connectAll: restores 'Don't fail the group' on the nil return.

Five fixes from deep-review R7: 1. parseAndDedup: stat config files before reading so the snapshot is conservatively old. Eliminates a micro-TOCTOU where a file modified between read and stat would be missed. 2. classifyServers: Lock -> RLock since it only reads m.closed and m.servers. Avoids blocking concurrent Tools/CallTool callers. 3. Orphaned doc comment fragment (connectServer) deleted from above Tools(), restored with complete sentence above connectServer. 4. buildEnv doc: updateCommandEnv -> updateEnv to match the field. 5. New TestClose_SuppressesSubprocessExitError verifies Close() returns nil when subprocess exits with kill signal.

- Add comments to SnapshotChanged stat-error and file-appeared branches. - Use early-continue in captureSnapshot instead of else block. - Remove 'D12' plan reference from test comment. - Clarify 'just ran' in api.go refresh comment.

github-actions Bot assigned mafredri Apr 24, 2026

This comment was marked as resolved.

Sign in to view

mafredri changed the title ~~feat(agent/x/agentmcp): reload MCP config on change via lazy stat-on-request~~ feat: reload MCP config on change via lazy stat-on-request Apr 24, 2026

mafredri changed the title ~~feat: reload MCP config on change via lazy stat-on-request~~ feat(agent/x/agentmcp): reload MCP config on change via lazy stat-on-request Apr 24, 2026

This comment was marked as resolved.

Sign in to view

mafredri changed the title ~~feat(agent/x/agentmcp): reload MCP config on change via lazy stat-on-request~~ feat: reload MCP config on change via lazy stat-on-request Apr 24, 2026

This comment was marked as resolved.

Sign in to view

mafredri force-pushed the mcp-reload-lazy-stat branch 2 times, most recently from 1431ef1 to a8bafb2 Compare April 24, 2026 16:57

coder-agents-review Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread agent/x/agentmcp/manager.go Outdated

Comment thread agent/x/agentmcp/manager.go Outdated

Comment thread agent/x/agentmcp/manager.go

Comment thread agent/x/agentmcp/manager.go

Comment thread agent/x/agentmcp/manager.go Outdated

Comment thread agent/x/agentmcp/manager.go Outdated

mafredri force-pushed the mcp-reload-lazy-stat branch from a8bafb2 to 87acc04 Compare April 24, 2026 18:15

coder-agents-review Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread agent/x/agentmcp/manager.go Outdated

mafredri force-pushed the mcp-reload-lazy-stat branch 4 times, most recently from f8a2e43 to 8495559 Compare April 27, 2026 11:37

coder-agents-review Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread agent/x/agentmcp/manager.go Outdated

Comment thread agent/x/agentmcp/manager.go

Comment thread agent/x/agentmcp/manager.go Outdated

Comment thread agent/x/agentmcp/manager.go Outdated

Comment thread agent/x/agentmcp/manager.go Outdated

mafredri force-pushed the mcp-reload-lazy-stat branch 2 times, most recently from 5c81ad9 to de1483a Compare April 27, 2026 14:55

coder-agents-review Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread site/src/pages/AgentsPage/components/ChatConversation/ConversationTimeline.stories.tsx Outdated

mafredri force-pushed the mcp-reload-lazy-stat branch 2 times, most recently from ef45bc2 to b2158b5 Compare April 27, 2026 16:31

mafredri marked this pull request as ready for review April 27, 2026 16:32

johnstcn approved these changes Apr 27, 2026

View reviewed changes

mafredri added 3 commits April 28, 2026 15:38

mafredri added 8 commits April 28, 2026 15:38

test(agent/x/agentmcp): cover refresh param, verify client close, fix…

9c9b8df

… test names

fix(agent/x/agentmcp): guard RefreshTools against stale server genera…

7fcc5e8

…tion

fix: use agent environment for MCP server subprocesses

971e597

MCP stdio subprocesses used os.Environ() directly, missing ScriptBinDir and manifest environment variables. Thread the agent's updateCommandEnv callback into the Manager so MCP servers see the same PATH and env as SSH sessions.

fix(agent/x/agentmcp): address human review feedback

0e5cc33

- Add comments to SnapshotChanged stat-error and file-appeared branches. - Use early-continue in captureSnapshot instead of else block. - Remove 'D12' plan reference from test comment. - Clarify 'just ran' in api.go refresh comment.

mafredri force-pushed the mcp-reload-lazy-stat branch from e10d3f1 to 0e5cc33 Compare April 28, 2026 15:51

mafredri merged commit 881df9a into main Apr 28, 2026
45 of 47 checks passed

mafredri deleted the mcp-reload-lazy-stat branch April 28, 2026 16:47

github-actions Bot locked and limited conversation to collaborators Apr 28, 2026

Conversation

mafredri commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

johnstcn left a comment

Choose a reason for hiding this comment

Uh oh!

johnstcn Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

mafredri Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mafredri commented Apr 24, 2026 •

edited

Loading