Summary
PR #218 adds waitForShimPipe() in pkg/shim/manager/manager_windows.go. The retry loop calls time.Sleep(retryDelay) where retryDelay is a single fixed time.Duration with no randomisation.
When multiple container shims start simultaneously (a common pattern in pod batch launches or OCI runtime parallel execution), all shims that fail their first pipe dial will sleep for exactly the same retryDelay and retry at the same instant — a thundering herd that can momentarily saturate the Windows named pipe listener queue.
Where it happens
pkg/shim/manager/manager_windows.go (added by PR #218):
func waitForShimPipe(ctx context.Context, address string,
readyTimeout, perAttempt, retryDelay time.Duration) error {
// ...
for {
_, err := winio.DialPipe(address, &perAttempt)
if err == nil {
return nil
}
retryable := os.IsNotExist(err) ||
errors.Is(err, winio.ErrTimeout) ||
errors.Is(err, windows.ERROR_PIPE_BUSY)
if !retryable {
return fmt.Errorf("waiting for shim pipe %s: %w", address, err)
}
time.Sleep(retryDelay) // fixed — no jitter
}
}
Suggested fix
Add ±25% jitter to spread retries:
import "math/rand"
// Spread retries: ±25% jitter around retryDelay
jittered := retryDelay + time.Duration(rand.Int63n(int64(retryDelay)/2)) - retryDelay/4
time.Sleep(jittered)
Or use an exponential backoff with jitter starting from a small base:
backoff := retryDelay
for {
// ...
jitter := time.Duration(rand.Int63n(int64(backoff) / 2))
time.Sleep(backoff/2 + jitter)
if backoff < maxRetryDelay {
backoff *= 2
}
}
The AWS Architecture Blog (Marc Brooker) recommends full jitter for any retry loop where multiple callers may experience the same transient failure simultaneously.
Additional finding: RULE_07 — Test uses time.Sleep for goroutine coordination
pkg/shim/manager/manager_windows_test.go uses time.Sleep to wait for goroutine state — a pattern that makes tests timing-sensitive and intermittently flaky. Prefer a channel signal or sync.WaitGroup to coordinate the goroutine with the test assertion.
How this was found
This issue was identified by Quorum, an open-source Gemini-powered agent that reviews pull requests for distributed coordination anti-patterns (thundering herds, missing saga compensation, lost updates, transactional outbox violations, and more).
Happy to open a follow-up PR with the jitter fix if that would be helpful.
Summary
PR #218 adds
waitForShimPipe()inpkg/shim/manager/manager_windows.go. The retry loop callstime.Sleep(retryDelay)whereretryDelayis a single fixedtime.Durationwith no randomisation.When multiple container shims start simultaneously (a common pattern in pod batch launches or OCI runtime parallel execution), all shims that fail their first pipe dial will sleep for exactly the same
retryDelayand retry at the same instant — a thundering herd that can momentarily saturate the Windows named pipe listener queue.Where it happens
pkg/shim/manager/manager_windows.go(added by PR #218):Suggested fix
Add ±25% jitter to spread retries:
Or use an exponential backoff with jitter starting from a small base:
The AWS Architecture Blog (Marc Brooker) recommends full jitter for any retry loop where multiple callers may experience the same transient failure simultaneously.
Additional finding: RULE_07 — Test uses
time.Sleepfor goroutine coordinationpkg/shim/manager/manager_windows_test.gousestime.Sleepto wait for goroutine state — a pattern that makes tests timing-sensitive and intermittently flaky. Prefer a channel signal orsync.WaitGroupto coordinate the goroutine with the test assertion.How this was found
This issue was identified by Quorum, an open-source Gemini-powered agent that reviews pull requests for distributed coordination anti-patterns (thundering herds, missing saga compensation, lost updates, transactional outbox violations, and more).
Happy to open a follow-up PR with the jitter fix if that would be helpful.