Skip to content

fix(shim/manager): retry on pipe busy/timeout when waiting for shim pipe on Windows#218

Merged
dmcgowan merged 1 commit into
containerd:mainfrom
austinvazquez:fix/windows-shim-pipe-dial-timeout
Jun 3, 2026
Merged

fix(shim/manager): retry on pipe busy/timeout when waiting for shim pipe on Windows#218
dmcgowan merged 1 commit into
containerd:mainfrom
austinvazquez:fix/windows-shim-pipe-dial-timeout

Conversation

@austinvazquez
Copy link
Copy Markdown
Member

@austinvazquez austinvazquez commented Jun 3, 2026

Root cause

Primary bug: winio.DialPipe(address, nil) defaults to a 2-second per-attempt timeout. When the shim has called ListenPipe but no goroutine has yet reached Accept(), DialPipe blocks for 2 s and returns winio.ErrTimeout ("i/o timeout"). That error is not os.IsNotExist, so the old code treated it as fatal and returned immediately — discarding the rest of the 10 s budget.

Fix

  • Use a 1 s per-attempt DialPipe timeout; retry on winio.ErrTimeout and windows.ERROR_PIPE_BUSY
  • Only truly unexpected errors are fatal; include the transient error in debug logs to aid diagnosis
  • Mirrors the strategy containerd/v2/pkg/shim/shim_windows.go:awaitPipeReady already uses

Tests

Added manager_windows_test.go with four cases:

  • TestWaitForPipe_SlowListen: server starts after 300 ms — must succeed
  • TestWaitForPipe_SlowAccept: server listens immediately but delays Accept 600 ms (regression for primary bug)
  • TestWaitForPipe_Timeout: no server, short deadline — must return descriptive timeout error
  • TestWaitForPipe_CtxCancelled: pre-cancelled context — must error immediately
  • TestWaitForShimPipe_ShimExit: shim exits before pipe is ready — must return shim exit code immediately

Copilot AI review requested due to automatic review settings June 3, 2026 01:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Windows shim startup reliability by making the shim TTRPC named-pipe readiness wait more robust (short per-attempt dials, retries on transient errors, longer overall budget bounded by the caller’s context) and adds Windows-specific tests to cover key race/timeout scenarios.

Changes:

  • Update Start() Windows pipe polling to use a short DialPipe timeout, retry on transient readiness errors, and extend the overall wait budget to 30s (bounded by caller context).
  • Improve timeout diagnostics by tracking the last transient error encountered while polling.
  • Add Windows-only unit tests covering slow listen, slow accept, timeout, and context cancellation behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
pkg/shim/manager/manager_windows.go Makes the shim pipe readiness wait retry on transient conditions with a longer overall budget.
pkg/shim/manager/manager_windows_test.go Adds Windows-only tests to validate readiness polling behavior under races/timeouts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/shim/manager/manager_windows_test.go Outdated
Comment thread pkg/shim/manager/manager_windows_test.go
Comment thread pkg/shim/manager/manager_windows_test.go Outdated
Comment thread pkg/shim/manager/manager_windows.go Outdated
Comment thread pkg/shim/manager/manager_windows.go Outdated
@austinvazquez austinvazquez marked this pull request as draft June 3, 2026 01:36
@austinvazquez austinvazquez force-pushed the fix/windows-shim-pipe-dial-timeout branch 2 times, most recently from 13d77c1 to 95a5d98 Compare June 3, 2026 02:17
Copilot AI review requested due to automatic review settings June 3, 2026 02:24
@austinvazquez austinvazquez force-pushed the fix/windows-shim-pipe-dial-timeout branch from 95a5d98 to 6e5e046 Compare June 3, 2026 02:24
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread pkg/shim/manager/manager_windows.go Outdated
Comment thread pkg/shim/manager/manager_windows.go
Comment thread pkg/shim/manager/manager_windows_test.go
Comment thread pkg/shim/manager/manager_windows_test.go
@austinvazquez austinvazquez force-pushed the fix/windows-shim-pipe-dial-timeout branch from 6e5e046 to 4e76bf8 Compare June 3, 2026 02:41
Copilot AI review requested due to automatic review settings June 3, 2026 14:38
@austinvazquez austinvazquez force-pushed the fix/windows-shim-pipe-dial-timeout branch from 4e76bf8 to 569bd52 Compare June 3, 2026 14:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread pkg/shim/manager/manager_windows.go
@KaustubhUp025

This comment was marked as spam.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Comment thread pkg/shim/manager/manager_windows.go
Comment thread pkg/shim/manager/manager_windows.go
Comment thread pkg/shim/manager/manager_windows.go
Comment thread pkg/shim/manager/manager_windows.go
Comment thread pkg/shim/manager/manager_windows.go
Comment thread pkg/shim/manager/manager_windows.go Outdated
@austinvazquez austinvazquez force-pushed the fix/windows-shim-pipe-dial-timeout branch from 1ea1193 to baaec09 Compare June 3, 2026 18:52
Copilot AI review requested due to automatic review settings June 3, 2026 19:14
@austinvazquez
Copy link
Copy Markdown
Member Author

🟠 HIGH — RULE_06: Retry Loop Uses Fixed Backoff Without Jitter
Confidence: 95% | pkg/shim/manager/manager_windows.go:188

The new waitForShimPipe function implements a retry loop using a constant retryDelay (10ms). When multiple processes encounter a transient error simultaneously, this deterministic backoff will cause them to retry in lock-step. This 'thundering herd' can overwhelm local system resources and amplify a small slowdown into a larger availability issue. Adding a small amount of random jitter to the sleep interval prevents this synchronization and improves system stability under contention.

In your diff:

time.Sleep(retryDelay)
Found via semantic search:

A semantic search for existing retry utilities with jitter returned no results, indicating this pattern may be new or inconsistent within the codebase.
Suggested fix: Add a small random duration to the sleep interval to spread out retries. For example: time.Sleep(retryDelay + time.Duration(rand.Intn(10))*time.Millisecond).

Reference: AWS Architecture Blog — Exponential Backoff and Jitter

Probably mostly theoretical for nerdbox request volume but simple enough to resolve.

@austinvazquez austinvazquez marked this pull request as ready for review June 3, 2026 19:15
@austinvazquez
Copy link
Copy Markdown
Member Author

Ready for review. @eginez, I know you did a similar change to containerd windows shim manager if you want to take a look.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread pkg/shim/manager/manager_windows.go Outdated
Comment thread pkg/shim/manager/manager_windows.go
…ipe on Windows

1. winio.DialPipe(address, nil) uses a 2-second per-attempt timeout (the
   default when nil is passed). When the shim has called ListenPipe but no
   goroutine has reached Accept() yet, DialPipe blocks for 2 s and returns
   winio.ErrTimeout ("i/o timeout"). That error is not os.IsNotExist, so the
   old code treated it as fatal and returned immediately, discarding the rest
   of the 10 s budget. This is the primary defect.

Fix:
- Use a 1s per-attempt DialPipe timeout so individual probes complete
  quickly and do not eat the whole budget.
- Retry on winio.ErrTimeout and windows.ERROR_PIPE_BUSY (pipe exists, no
  Accept yet or all instances momentarily busy); only truly unexpected errors
  are fatal.
- Respect the caller's context deadline if it is shorter than 10 s.
- Log the transient error to debug logs to aid with diagnosis.

This mirrors the strategy that containerd's pkg/shim/shim_windows.go already
uses in awaitPipeReady.

Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
@austinvazquez austinvazquez force-pushed the fix/windows-shim-pipe-dial-timeout branch from 84d1ae4 to f08556b Compare June 3, 2026 19:51
Comment thread pkg/shim/manager/manager_windows.go
Copy link
Copy Markdown

@eginez eginez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dmcgowan dmcgowan merged commit ce02387 into containerd:main Jun 3, 2026
15 checks passed
@austinvazquez austinvazquez deleted the fix/windows-shim-pipe-dial-timeout branch June 3, 2026 23:36
eginez added a commit to eginez/containerd that referenced this pull request Jun 5, 2026
Match the error checks in containerd/nerdbox#218: retry on os.IsNotExist,
winio.ErrTimeout, and windows.ERROR_PIPE_BUSY. ERROR_PIPE_BUSY is normally
surfaced as winio.ErrTimeout by go-winio's internal retry loop once the
per-attempt deadline fires, but guard it explicitly for safety.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants