Skip to content

feat: reload MCP config on change via lazy stat-on-request#24700

Merged
mafredri merged 11 commits intomainfrom
mcp-reload-lazy-stat
Apr 28, 2026
Merged

feat: reload MCP config on change via lazy stat-on-request#24700
mafredri merged 11 commits intomainfrom
mcp-reload-lazy-stat

Conversation

@mafredri
Copy link
Copy Markdown
Member

@mafredri mafredri commented Apr 24, 2026

The MCP manager previously read .mcp.json exactly once at agent startup.
Editing the file had no effect until workspace rebuild or agent restart.

This change adds lazy stat-on-request: the handleListTools handler checks
config file mtimes on every tool-list request and triggers a differential
reload when any file changed. Unchanged servers keep their client pointer
so in-flight tool calls survive. Concurrent reload requests coalesce via
singleflight.

MCP stdio subprocesses now use the agent's execer for resource
limits and receive the same enriched environment as SSH sessions
(via updateCommandEnv). This ensures ScriptBinDir is in PATH
so Coder module binaries are found.

On the chatd side, WorkspaceMCPTool.Run now detects 404 responses from
CallMCPTool (indicating the server was removed) and drops the chat's
cached tool list so the next turn refetches from the agent.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

coder-agents-review[bot]

This comment was marked as resolved.

@mafredri mafredri changed the title feat(agent/x/agentmcp): reload MCP config on change via lazy stat-on-request feat: reload MCP config on change via lazy stat-on-request Apr 24, 2026
@mafredri mafredri changed the title feat: reload MCP config on change via lazy stat-on-request feat(agent/x/agentmcp): reload MCP config on change via lazy stat-on-request Apr 24, 2026

This comment was marked as resolved.

@mafredri mafredri changed the title feat(agent/x/agentmcp): reload MCP config on change via lazy stat-on-request feat: reload MCP config on change via lazy stat-on-request Apr 24, 2026
coder-agents-review[bot]

This comment was marked as resolved.

coder-agents-review[bot]

This comment was marked as resolved.

This comment was marked as resolved.

coder-agents-review[bot]

This comment was marked as resolved.

@mafredri mafredri force-pushed the mcp-reload-lazy-stat branch 2 times, most recently from 1431ef1 to a8bafb2 Compare April 24, 2026 16:57
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 5. Two new commits add login shell wrapping and environment enrichment for MCP stdio subprocesses (121 insertions). The enrichment design is correct (same updateCommandEnv as SSH, same execer, config env overrides applied last). One P1, one P2, three P3, one P4.

The P1 (DEREM-35) is a platform breakage: the shell wrapping uses POSIX-only constructs unconditionally, breaking every stdio MCP server on Windows. Three reviewers traced the same failure: -l -c and exec are not valid for pwsh.exe or cmd.exe. The CI failures on windows-2022 are consistent with this. The SSH code has explicit platform guards for this exact case.

The P2 (DEREM-36) is a protocol corruption risk: login shell -l sources user profiles that can write to stdout, injecting non-JSON bytes before the MCP handshake. The SSH code deliberately avoids -l for command execution; the MCP code should follow the same pattern.

Process observation: the two commits add 121 lines of new behavior (buildEnv enrichment, login shell wrapping, resolveUserShell, execer integration) and update 21 test call sites to compile, but add zero new assertions for the new behavior. This is the same pattern as five prior findings (DEREM-7, DEREM-9, DEREM-10, DEREM-27, DEREM-28), each fixed when pointed out.

"What happens at scale: a platform team pushes a template with a base image whose /etc/profile.d/company-banner.sh writes an ASCII banner. Every MCP server on that template silently fails." (Hisoka)

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/manager.go
Comment thread agent/x/agentmcp/manager.go
Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/manager.go Outdated
@mafredri mafredri force-pushed the mcp-reload-lazy-stat branch from a8bafb2 to 87acc04 Compare April 24, 2026 18:15
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 6. Shell wrapping removed entirely (DEREM-34, DEREM-35, DEREM-36 all resolved by deletion). GOTRACEBACK leak fixed (DEREM-37). DEREM-38 moot (resolveUserShell removed). CI passing.

DEREM-33 (buildEnv test gap) contested and closed by panel vote (3/4 accept). Mafuuu: error fallback does not corrupt state or alter lifecycle. Meruem: consequence is degraded but functional, logged. Zoro: no new evidence. Bisky dissented with a concrete 10-line test sketch disproving the "requires real agent" claim, but the majority found the risk bounded by the function's simplicity. This needs a human decision: the test is trivial to write but the panel found the gap acceptable.

All 39 findings across 6 rounds resolved: 29 fixed, 3 accepted, 2 panel-closed (DEREM-8 4/4, DEREM-33 3/4), 5 dropped (Notes). One Nit remains open.

"For a void-return operation, the conventional instantiation would be Group[string, struct{}] with the reload error in the second return." (Mafuuu, on the singleflight semantics inversion)

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/x/agentmcp/manager.go Outdated
@mafredri mafredri force-pushed the mcp-reload-lazy-stat branch 4 times, most recently from f8a2e43 to 8495559 Compare April 27, 2026 11:37
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 7. PR rebased. Latest commit refactors manager.go (serverDiff struct, public API reordering, 288 ins / 283 del). DEREM-39 (singleflight semantics) fixed. CI has a race test failure (test-go-race-pg). Three P3s, two Nits.

The refactor introduced DEREM-40 (orphaned doc comment fragment), DEREM-41 (Close ExitError filtering untested), and surfaced DEREM-42 (micro-TOCTOU in parseAndDedup where ReadFile precedes Stat). DEREM-42 is the same class of bug as the original DEREM-3, but with a microsecond window instead of 30 seconds.

"If someone removes the errors.As(err, &exitErr) guard, Close() returns subprocess kill errors on every agent shutdown. No test catches this." (Bisky)

🤖 This review was automatically generated with Coder Agents.

Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/manager.go
Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/manager.go Outdated
@mafredri mafredri force-pushed the mcp-reload-lazy-stat branch 2 times, most recently from 5c81ad9 to de1483a Compare April 27, 2026 14:55
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 8. All 5 R7 findings verified fixed by the panel. Mafuuu confirmed all five fixes and ran a four-mode contract sweep. Meruem traced six structural modes (lock discipline, generation counter, singleflight, updateEnv goroutine safety, Close-during-RefreshTools, test coverage). Bisky verified every test assertion. Chopper confirmed all failure paths and diagnostics. No new P0-P3 findings.

All 45 findings across 8 rounds resolved: 34 fixed, 3 accepted, 2 panel-closed, 6 dropped. One Nit open (DEREM-45, drive-by assertion change in an unrelated storybook file). CI pending.

"Eight rounds of polishing and it shows." (Bisky)

🤖 This review was automatically generated with Coder Agents.

@mafredri mafredri force-pushed the mcp-reload-lazy-stat branch 2 times, most recently from ef45bc2 to b2158b5 Compare April 27, 2026 16:31
@mafredri mafredri marked this pull request as ready for review April 27, 2026 16:32
Copy link
Copy Markdown
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below, but non-blocking. I don't need to review again.

Comment on lines +18 to +23
// fakeAgentConn implements just enough of workspacesdk.AgentConn
// for testing CallMCPTool.
type fakeAgentConn struct {
workspacesdk.AgentConn
callMCPToolFunc func(ctx context.Context, req workspacesdk.CallMCPToolRequest) (workspacesdk.CallMCPToolResponse, error)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: the only usage of conn is CallMCPTool, so suggest defining a CallMCPTooler interface. Then you can just define a fakeCallMCPTooler and pass it in.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree in principle. The test only needs CallMCPTool, so a narrow interface would be cleaner. However, WorkspaceMCPTool.getConn returns workspacesdk.AgentConn (the full interface), so introducing CallMCPTooler requires changing the production type signature. That's a broader refactor beyond this PR's scope. The current embedding approach works and is idiomatic for test fakes.

🤖 Posted using /amend-review skill via Coder Agents.

Comment thread agent/x/agentmcp/manager.go
Comment thread agent/x/agentmcp/manager.go
Comment thread agent/x/agentmcp/manager.go Outdated
Comment thread agent/x/agentmcp/reload_internal_test.go Outdated
Comment thread agent/x/agentmcp/api.go
…request

The MCP manager previously read .mcp.json exactly once at agent
startup. Editing the file had no effect until workspace rebuild.

This change adds:

1. Snapshot-based change detection: the Manager records (mtime,
   size) for each config file after every reload and compares
   on each tool-list request.

2. Differential reload (D12): unchanged servers keep their client
   pointer so in-flight tool calls survive. Only new or changed
   servers get a fresh connection. Removed servers are closed.

3. Singleflight coalescing (D10): concurrent reload requests
   share a single in-flight connect pass via DoChan. The reload
   body runs under the Manager's lifetime context so it survives
   caller cancellation.

4. Chatd cache invalidation on 404 (D13): when CallMCPTool
   returns a 404 (ErrUnknownServer), the chat's cached tool
   list is dropped so the next turn refetches from the agent.

The handleListTools handler now stats config files on every
request and triggers a reload when any file differs from the
snapshot. New chats see updated tools; existing chats recover
from server removals via the 404 invalidation path.
- Restructure Manager struct: ctx at top, all mu-protected fields
  grouped tight under mu with no blank-line separators.
- Trim verbose comments on struct fields.
- Extract classifyServers, connectAll, installServers from connect
  to reduce function length.
- Revert local mutex name from cmu back to mu (no shadowing issue).
- Rename callerCtx to ctx in Reload (idiomatic Go).
- Make TestSnapshotChanged and TestHandleListTools table-driven.
- Keep real filesystem (os.Stat) consistent with package patterns;
  ParseConfig already uses os.ReadFile directly.
- Fix TOCTOU: capture snapshot at parse time, before connectAll,
  so mid-reload config edits trigger a re-check on the next request.
- Fix perpetual reload: deduplicate paths in SnapshotChanged before
  comparing lengths against the snapshot map.
- Rename connect to doReload for clarity.
- Fix stale log message (connect -> reload).
- Fix handleListTools doc comment (order was backwards).
- Skip redundant RefreshTools when reload already ran.
- Replace magic 404 literal with http.StatusNotFound.
- Use maps.Clone, wg.Go (Go 1.25), remove dead tc := tc.
- Fix InFlightToolCallSurvivesReload to require.Error.
- Add multi-config-file test (TestSnapshotChanged_MultipleConfigFiles).
- Delete duplicate e2e test, hoist closure, restore dedup comment.
- Add race-guard and singleflight key documentation.
MCP stdio subprocesses used os.Environ() directly, missing
ScriptBinDir and manifest environment variables. Thread the
agent's updateCommandEnv callback into the Manager so MCP
servers see the same PATH and env as SSH sessions.
Thread the agent's execer and updateCommandEnv callback into the
MCP Manager. Stdio subprocesses now get proper resource limits
via execer.CommandContext and the same enriched environment as
SSH sessions (CODER_* vars, manifest env, ScriptBinDir in PATH).

Use usershell.SystemEnvInfo{}.Environ() instead of os.Environ()
to filter GOTRACEBACK=none from subprocess environments.
…ppress expected close errors

Two fixes in manager.go:

1. The reload singleflight used Group[string, error], placing the reload
   error in res.Val while res.Err was always nil. This inverts the
   convention used by every other typed singleflight in the codebase
   (e.g. coderd/x/chatd/configcache.go). Change to Group[string, struct{}]
   with the error in the second return.

2. Close() propagated exec.ExitError from subprocess kills during
   shutdown, causing ERROR-level log noise in cli golden tests. Filter
   out exec.ExitError since subprocess kill signals are expected during
   intentional close.
Replace the 5-return classifyServers and 5-parameter installServers
with a focused serverDiff struct that captures only the classify
output. connectAll and installServers take explicit parameters;
the compiler enforces what each function receives.

Reorder file: public API (Reload, SnapshotChanged, Tools, CallTool,
RefreshTools, Close) appears after the constructor. Private reload
internals, connection internals, and helpers follow in sections.

Also restores three comments from the pre-refactor code:
- parseAndDedup: missing files are silently skipped (not logged).
- doReload close loop: preserves the leak-prevention rationale.
- connectAll: restores 'Don't fail the group' on the nil return.
Five fixes from deep-review R7:

1. parseAndDedup: stat config files before reading so the snapshot
   is conservatively old. Eliminates a micro-TOCTOU where a file
   modified between read and stat would be missed.

2. classifyServers: Lock -> RLock since it only reads m.closed and
   m.servers. Avoids blocking concurrent Tools/CallTool callers.

3. Orphaned doc comment fragment (connectServer) deleted from above
   Tools(), restored with complete sentence above connectServer.

4. buildEnv doc: updateCommandEnv -> updateEnv to match the field.

5. New TestClose_SuppressesSubprocessExitError verifies Close()
   returns nil when subprocess exits with kill signal.
- Add comments to SnapshotChanged stat-error and file-appeared branches.
- Use early-continue in captureSnapshot instead of else block.
- Remove 'D12' plan reference from test comment.
- Clarify 'just ran' in api.go refresh comment.
@mafredri mafredri force-pushed the mcp-reload-lazy-stat branch from e10d3f1 to 0e5cc33 Compare April 28, 2026 15:51
@mafredri mafredri merged commit 881df9a into main Apr 28, 2026
45 of 47 checks passed
@mafredri mafredri deleted the mcp-reload-lazy-stat branch April 28, 2026 16:47
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 28, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants