fix(shim/manager): retry on pipe busy/timeout when waiting for shim pipe on Windows#218
Pull request overview
This PR improves Windows shim startup reliability by making the shim TTRPC named-pipe readiness wait more robust (short per-attempt dials, retries on transient errors, longer overall budget bounded by the caller’s context) and adds Windows-specific tests to cover key race/timeout scenarios.
Changes:
- Update `Start()` Windows pipe polling to use a short `DialPipe` timeout, retry on transient readiness errors, and extend the overall wait budget to 30s (bounded by caller context).
- Improve timeout diagnostics by tracking the last transient error encountered while polling.
- Add Windows-only unit tests covering slow listen, slow accept, timeout, and context cancellation behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/shim/manager/manager_windows.go | Makes the shim pipe readiness wait retry on transient conditions with a longer overall budget. |
| pkg/shim/manager/manager_windows_test.go | Adds Windows-only tests to validate readiness polling behavior under races/timeouts. |
Probably mostly theoretical for nerdbox request volume but simple enough to resolve.
Ready for review. @eginez, I know you did a similar change to containerd windows shim manager if you want to take a look.
…ipe on Windows
1. winio.DialPipe(address, nil) uses a 2-second per-attempt timeout (the
default when nil is passed). When the shim has called ListenPipe but no
goroutine has reached Accept() yet, DialPipe blocks for 2 s and returns
winio.ErrTimeout ("i/o timeout"). That error is not os.IsNotExist, so the
old code treated it as fatal and returned immediately, discarding the rest
of the 10 s budget. This is the primary defect.
Fix:
- Use a 1s per-attempt DialPipe timeout so individual probes complete
quickly and do not eat the whole budget.
- Retry on winio.ErrTimeout and windows.ERROR_PIPE_BUSY (pipe exists, no
Accept yet or all instances momentarily busy); only truly unexpected errors
are fatal.
- Respect the caller's context deadline if it is shorter than 10 s.
- Log the transient error to debug logs to aid with diagnosis.
This mirrors the strategy that containerd's pkg/shim/shim_windows.go already
uses in awaitPipeReady.
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
Match the error checks in containerd/nerdbox#218: retry on os.IsNotExist, winio.ErrTimeout, and windows.ERROR_PIPE_BUSY. ERROR_PIPE_BUSY is normally surfaced as winio.ErrTimeout by go-winio's internal retry loop once the per-attempt deadline fires, but guard it explicitly for safety. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause
Primary bug: `winio.DialPipe(address, nil)` defaults to a 2-second per-attempt timeout. When the shim has called `ListenPipe` but no goroutine has yet reached `Accept()`, `DialPipe` blocks for 2 s and returns `winio.ErrTimeout` ("i/o timeout"). That error is not `os.IsNotExist`, so the old code treated it as fatal and returned immediately, discarding the rest of the 10 s budget.
Fix
- 1s per-attempt `DialPipe` timeout; retry on `winio.ErrTimeout` and `windows.ERROR_PIPE_BUSY`
- Mirrors the strategy `containerd/v2/pkg/shim/shim_windows.go:awaitPipeReady` already uses
Tests
Added `manager_windows_test.go` with five cases:
- `TestWaitForPipe_SlowListen`: server starts after 300 ms; must succeed
- `TestWaitForPipe_SlowAccept`: server listens immediately but delays Accept 600 ms (regression for primary bug)
- `TestWaitForPipe_Timeout`: no server, short deadline; must return descriptive timeout error
- `TestWaitForPipe_CtxCancelled`: pre-cancelled context; must error immediately
- `TestWaitForShimPipe_ShimExit`: shim exits before pipe is ready; must return shim exit code immediately