feat: reload MCP config on change via lazy stat-on-request#24700
feat: reload MCP config on change via lazy stat-on-request#24700
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
1431ef1 to
a8bafb2
Compare
There was a problem hiding this comment.
Round 5. Two new commits add login shell wrapping and environment enrichment for MCP stdio subprocesses (121 insertions). The enrichment design is correct (same updateCommandEnv as SSH, same execer, config env overrides applied last). One P1, one P2, three P3, one P4.
The P1 (DEREM-35) is a platform breakage: the shell wrapping uses POSIX-only constructs unconditionally, breaking every stdio MCP server on Windows. Three reviewers traced the same failure: -l -c and exec are not valid for pwsh.exe or cmd.exe. The CI failures on windows-2022 are consistent with this. The SSH code has explicit platform guards for this exact case.
The P2 (DEREM-36) is a protocol corruption risk: login shell -l sources user profiles that can write to stdout, injecting non-JSON bytes before the MCP handshake. The SSH code deliberately avoids -l for command execution; the MCP code should follow the same pattern.
Process observation: the two commits add 121 lines of new behavior (buildEnv enrichment, login shell wrapping, resolveUserShell, execer integration) and update 21 test call sites to compile, but add zero new assertions for the new behavior. This is the same pattern as five prior findings (DEREM-7, DEREM-9, DEREM-10, DEREM-27, DEREM-28), each fixed when pointed out.
"What happens at scale: a platform team pushes a template with a base image whose
/etc/profile.d/company-banner.shwrites an ASCII banner. Every MCP server on that template silently fails." (Hisoka)
🤖 This review was automatically generated with Coder Agents.
a8bafb2 to
87acc04
Compare
There was a problem hiding this comment.
Round 6. Shell wrapping removed entirely (DEREM-34, DEREM-35, DEREM-36 all resolved by deletion). GOTRACEBACK leak fixed (DEREM-37). DEREM-38 moot (resolveUserShell removed). CI passing.
DEREM-33 (buildEnv test gap) contested and closed by panel vote (3/4 accept). Mafuuu: error fallback does not corrupt state or alter lifecycle. Meruem: consequence is degraded but functional, logged. Zoro: no new evidence. Bisky dissented with a concrete 10-line test sketch disproving the "requires real agent" claim, but the majority found the risk bounded by the function's simplicity. This needs a human decision: the test is trivial to write but the panel found the gap acceptable.
All 39 findings across 6 rounds resolved: 29 fixed, 3 accepted, 2 panel-closed (DEREM-8 4/4, DEREM-33 3/4), 5 dropped (Notes). One Nit remains open.
"For a void-return operation, the conventional instantiation would be Group[string, struct{}] with the reload error in the second return." (Mafuuu, on the singleflight semantics inversion)
🤖 This review was automatically generated with Coder Agents.
f8a2e43 to
8495559
Compare
There was a problem hiding this comment.
Round 7. PR rebased. Latest commit refactors manager.go (serverDiff struct, public API reordering, 288 ins / 283 del). DEREM-39 (singleflight semantics) fixed. CI has a race test failure (test-go-race-pg). Three P3s, two Nits.
The refactor introduced DEREM-40 (orphaned doc comment fragment), DEREM-41 (Close ExitError filtering untested), and surfaced DEREM-42 (micro-TOCTOU in parseAndDedup where ReadFile precedes Stat). DEREM-42 is the same class of bug as the original DEREM-3, but with a microsecond window instead of 30 seconds.
"If someone removes the
errors.As(err, &exitErr)guard, Close() returns subprocess kill errors on every agent shutdown. No test catches this." (Bisky)
🤖 This review was automatically generated with Coder Agents.
5c81ad9 to
de1483a
Compare
There was a problem hiding this comment.
Round 8. All 5 R7 findings verified fixed by the panel. Mafuuu confirmed all five fixes and ran a four-mode contract sweep. Meruem traced six structural modes (lock discipline, generation counter, singleflight, updateEnv goroutine safety, Close-during-RefreshTools, test coverage). Bisky verified every test assertion. Chopper confirmed all failure paths and diagnostics. No new P0-P3 findings.
All 45 findings across 8 rounds resolved: 34 fixed, 3 accepted, 2 panel-closed, 6 dropped. One Nit open (DEREM-45, drive-by assertion change in an unrelated storybook file). CI pending.
"Eight rounds of polishing and it shows." (Bisky)
🤖 This review was automatically generated with Coder Agents.
ef45bc2 to
b2158b5
Compare
johnstcn
left a comment
There was a problem hiding this comment.
Some comments below, but non-blocking. I don't need to review again.
| // fakeAgentConn implements just enough of workspacesdk.AgentConn | ||
| // for testing CallMCPTool. | ||
| type fakeAgentConn struct { | ||
| workspacesdk.AgentConn | ||
| callMCPToolFunc func(ctx context.Context, req workspacesdk.CallMCPToolRequest) (workspacesdk.CallMCPToolResponse, error) | ||
| } |
There was a problem hiding this comment.
suggestion: the only usage of conn is CallMCPTool, so suggest defining a CallMCPTooler interface. Then you can just define a fakeCallMCPTooler and pass it in.
There was a problem hiding this comment.
Agree in principle. The test only needs CallMCPTool, so a narrow interface would be cleaner. However, WorkspaceMCPTool.getConn returns workspacesdk.AgentConn (the full interface), so introducing CallMCPTooler requires changing the production type signature. That's a broader refactor beyond this PR's scope. The current embedding approach works and is idiomatic for test fakes.
🤖 Posted using
/amend-reviewskill via Coder Agents.
…request The MCP manager previously read .mcp.json exactly once at agent startup. Editing the file had no effect until workspace rebuild. This change adds: 1. Snapshot-based change detection: the Manager records (mtime, size) for each config file after every reload and compares on each tool-list request. 2. Differential reload (D12): unchanged servers keep their client pointer so in-flight tool calls survive. Only new or changed servers get a fresh connection. Removed servers are closed. 3. Singleflight coalescing (D10): concurrent reload requests share a single in-flight connect pass via DoChan. The reload body runs under the Manager's lifetime context so it survives caller cancellation. 4. Chatd cache invalidation on 404 (D13): when CallMCPTool returns a 404 (ErrUnknownServer), the chat's cached tool list is dropped so the next turn refetches from the agent. The handleListTools handler now stats config files on every request and triggers a reload when any file differs from the snapshot. New chats see updated tools; existing chats recover from server removals via the 404 invalidation path.
- Restructure Manager struct: ctx at top, all mu-protected fields grouped tight under mu with no blank-line separators. - Trim verbose comments on struct fields. - Extract classifyServers, connectAll, installServers from connect to reduce function length. - Revert local mutex name from cmu back to mu (no shadowing issue). - Rename callerCtx to ctx in Reload (idiomatic Go). - Make TestSnapshotChanged and TestHandleListTools table-driven. - Keep real filesystem (os.Stat) consistent with package patterns; ParseConfig already uses os.ReadFile directly.
- Fix TOCTOU: capture snapshot at parse time, before connectAll, so mid-reload config edits trigger a re-check on the next request. - Fix perpetual reload: deduplicate paths in SnapshotChanged before comparing lengths against the snapshot map. - Rename connect to doReload for clarity. - Fix stale log message (connect -> reload). - Fix handleListTools doc comment (order was backwards). - Skip redundant RefreshTools when reload already ran. - Replace magic 404 literal with http.StatusNotFound. - Use maps.Clone, wg.Go (Go 1.25), remove dead tc := tc. - Fix InFlightToolCallSurvivesReload to require.Error. - Add multi-config-file test (TestSnapshotChanged_MultipleConfigFiles). - Delete duplicate e2e test, hoist closure, restore dedup comment. - Add race-guard and singleflight key documentation.
MCP stdio subprocesses used os.Environ() directly, missing ScriptBinDir and manifest environment variables. Thread the agent's updateCommandEnv callback into the Manager so MCP servers see the same PATH and env as SSH sessions.
Thread the agent's execer and updateCommandEnv callback into the
MCP Manager. Stdio subprocesses now get proper resource limits
via execer.CommandContext and the same enriched environment as
SSH sessions (CODER_* vars, manifest env, ScriptBinDir in PATH).
Use usershell.SystemEnvInfo{}.Environ() instead of os.Environ()
to filter GOTRACEBACK=none from subprocess environments.
…ppress expected close errors
Two fixes in manager.go:
1. The reload singleflight used Group[string, error], placing the reload
error in res.Val while res.Err was always nil. This inverts the
convention used by every other typed singleflight in the codebase
(e.g. coderd/x/chatd/configcache.go). Change to Group[string, struct{}]
with the error in the second return.
2. Close() propagated exec.ExitError from subprocess kills during
shutdown, causing ERROR-level log noise in cli golden tests. Filter
out exec.ExitError since subprocess kill signals are expected during
intentional close.
Replace the 5-return classifyServers and 5-parameter installServers with a focused serverDiff struct that captures only the classify output. connectAll and installServers take explicit parameters; the compiler enforces what each function receives. Reorder file: public API (Reload, SnapshotChanged, Tools, CallTool, RefreshTools, Close) appears after the constructor. Private reload internals, connection internals, and helpers follow in sections. Also restores three comments from the pre-refactor code: - parseAndDedup: missing files are silently skipped (not logged). - doReload close loop: preserves the leak-prevention rationale. - connectAll: restores 'Don't fail the group' on the nil return.
Five fixes from deep-review R7: 1. parseAndDedup: stat config files before reading so the snapshot is conservatively old. Eliminates a micro-TOCTOU where a file modified between read and stat would be missed. 2. classifyServers: Lock -> RLock since it only reads m.closed and m.servers. Avoids blocking concurrent Tools/CallTool callers. 3. Orphaned doc comment fragment (connectServer) deleted from above Tools(), restored with complete sentence above connectServer. 4. buildEnv doc: updateCommandEnv -> updateEnv to match the field. 5. New TestClose_SuppressesSubprocessExitError verifies Close() returns nil when subprocess exits with kill signal.
- Add comments to SnapshotChanged stat-error and file-appeared branches. - Use early-continue in captureSnapshot instead of else block. - Remove 'D12' plan reference from test comment. - Clarify 'just ran' in api.go refresh comment.
e10d3f1 to
0e5cc33
Compare
The MCP manager previously read
.mcp.jsonexactly once at agent startup.Editing the file had no effect until workspace rebuild or agent restart.
This change adds lazy stat-on-request: the
handleListToolshandler checksconfig file mtimes on every tool-list request and triggers a differential
reload when any file changed. Unchanged servers keep their client pointer
so in-flight tool calls survive. Concurrent reload requests coalesce via
singleflight.
MCP stdio subprocesses now use the agent's
execerfor resourcelimits and receive the same enriched environment as SSH sessions
(via
updateCommandEnv). This ensures ScriptBinDir is in PATHso Coder module binaries are found.
On the chatd side,
WorkspaceMCPTool.Runnow detects 404 responses fromCallMCPTool(indicating the server was removed) and drops the chat'scached tool list so the next turn refetches from the agent.