Skip to content

feat: enforce lifecycle.startup_timeout for MCP/LSP toolset startup#3373

Merged
Sayt-0 merged 3 commits into
docker:mainfrom
dgageot:feat/enforce-startup-timeout
Jul 1, 2026
Merged

feat: enforce lifecycle.startup_timeout for MCP/LSP toolset startup#3373
Sayt-0 merged 3 commits into
docker:mainfrom
dgageot:feat/enforce-startup-timeout

Conversation

@dgageot

@dgageot dgageot commented Jul 1, 2026

Copy link
Copy Markdown
Member

What

Enforces lifecycle.startup_timeout for MCP/LSP toolset startup. Until now this config field (and its strict-profile 30s default) was parsed and validated but never acted onEffectiveStartupTimeout() had no callers, so a server that hung mid-initialize would block startup indefinitely.

Why

The MCP clientConnector.Connect deliberately detaches its context with context.WithoutCancel (the session must outlive the request that triggered it). A consequence is that the initialize handshake has no deadline of its own: a server that accepts the connection but never completes initialize wedges the startup attempt forever, and startup_timeout could not rescue it because nothing consumed the value.

How

  • Add StartupTimeout to lifecycle.Policy and wire EffectiveStartupTimeout() into PolicyFromConfig.
  • Enforce it in Supervisor.connect() by racing Connect against a timer rather than a ctx deadline (a deadline would be stripped by WithoutCancel). On expiry, Start returns ErrInitTimeout and the toolset stays Stopped so the runtime retries on the next turn.
  • Guarantee at most one Connect is ever in flight: a timed-out Connect goroutine is left running and recorded in inflightConnect; the next Start adopts it and Stop reaps it. This is required because the MCP transport shares one underlying client across Connect calls — two overlapping connects would race on the shared session (a late one could clobber/close a newer one). An adopted flag under the mutex ensures the resulting session is adopted-or-closed exactly once.

Scope

startup_timeout bounds only the initial Start, not background watcher reconnects (which already have their own restart/backoff budget). strict → 30s default; other profiles → 0 (no timeout); explicit value always wins.

Tests

New supervisor tests (all pass under -race): timeout returns ErrInitTimeout with a single Connect; the next Start adopts a late-completing connect; Stop reaps and closes a late connect. New PolicyFromConfig test covers the nil/resilient/strict/explicit cases.

@dgageot dgageot requested a review from a team as a code owner July 1, 2026 12:24

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

This PR correctly implements the startup-timeout enforcement mechanism for MCP/LSP toolset startup with a well-thought-out single-inflight-connect guarantee. Two findings worth the author's attention are filed as inline comments.

Comment thread pkg/tools/lifecycle/supervisor.go Outdated
Comment thread pkg/tools/lifecycle/supervisor.go
@aheritier aheritier added area/config For configuration parsing, YAML, environment variables area/mcp MCP protocol, MCP tool servers, integration area/tools For features/issues/fixes related to the usage of built-in and MCP tools kind/feat PR adds a new feature (maps to feat:). Use on PRs only. labels Jul 1, 2026
dgageot added 3 commits July 1, 2026 14:38
Race Connect against a timer so a server that hangs mid-handshake is
bounded by startup_timeout instead of blocking indefinitely. The strict
profile defaults to 30s; other profiles default to 0 (no timeout).

Assisted-By: docker-agent
…imeout

Keep at most one connector.Connect in flight via inflightConnect/pendingConnect.
A timed-out Connect is adopted by the next Start or reaped by Stop, preventing
a late session from clobbering/closing a newer shared session.

Assisted-By: Claude <noreply@anthropic.com>
Addresses PR review feedback:
- Launch the timed-out Connect goroutine with context.WithoutCancel so a
  later adopting Start does not receive a stale context.Canceled when the
  first caller's context is cancelled (matters for ctx-respecting connectors
  such as LSP).
- Defensively close the adopted session if the connector returns both a
  session and an error, since Start discards the session on error.
@dgageot dgageot force-pushed the feat/enforce-startup-timeout branch from b07e435 to 328b837 Compare July 1, 2026 12:40
@dgageot

dgageot commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Addressed both review findings, rebased on origin/main, and verified locally (task lint clean, go test -race ./pkg/tools/lifecycle/... and task build pass).

  • [MEDIUM] The timed-out in-flight Connect goroutine now runs with context.WithoutCancel(ctx), so a cancellation of the first caller's context can't surface as a stale context.Canceled to a later adopting Start. Added TestSupervisor_StartAdoptsLateConnectAfterFirstCtxCancelled.
  • [LOW] The adopt path now closes the session defensively when a connector returns both a non-nil session and a non-nil error, rather than discarding it.

@Sayt-0 Sayt-0 merged commit 5b42995 into docker:main Jul 1, 2026
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/config For configuration parsing, YAML, environment variables area/mcp MCP protocol, MCP tool servers, integration area/tools For features/issues/fixes related to the usage of built-in and MCP tools kind/feat PR adds a new feature (maps to feat:). Use on PRs only.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants