Skip to content

fix(shim/manager): waitForShimPipe retry uses fixed delay — no jitter, thundering herd risk on Windows #219

@KaustubhUp025

Description

@KaustubhUp025

Summary

PR #218 adds waitForShimPipe() in pkg/shim/manager/manager_windows.go. The retry loop calls time.Sleep(retryDelay) where retryDelay is a single fixed time.Duration with no randomisation.

When multiple container shims start simultaneously (a common pattern in pod batch launches or OCI runtime parallel execution), all shims that fail their first pipe dial will sleep for exactly the same retryDelay and retry at the same instant — a thundering herd that can momentarily saturate the Windows named pipe listener queue.

Where it happens

pkg/shim/manager/manager_windows.go (added by PR #218):

func waitForShimPipe(ctx context.Context, address string,
    readyTimeout, perAttempt, retryDelay time.Duration) error {
    // ...
    for {
        _, err := winio.DialPipe(address, &perAttempt)
        if err == nil {
            return nil
        }
        retryable := os.IsNotExist(err) ||
            errors.Is(err, winio.ErrTimeout) ||
            errors.Is(err, windows.ERROR_PIPE_BUSY)
        if !retryable {
            return fmt.Errorf("waiting for shim pipe %s: %w", address, err)
        }
        time.Sleep(retryDelay)  // fixed — no jitter
    }
}

Suggested fix

Add ±25% jitter to spread retries:

import "math/rand"

// Spread retries: ±25% jitter around retryDelay
jittered := retryDelay + time.Duration(rand.Int63n(int64(retryDelay)/2)) - retryDelay/4
time.Sleep(jittered)

Or use an exponential backoff with jitter starting from a small base:

backoff := retryDelay
for {
    // ...
    jitter := time.Duration(rand.Int63n(int64(backoff) / 2))
    time.Sleep(backoff/2 + jitter)
    if backoff < maxRetryDelay {
        backoff *= 2
    }
}

The AWS Architecture Blog (Marc Brooker) recommends full jitter for any retry loop where multiple callers may experience the same transient failure simultaneously.

Additional finding: RULE_07 — Test uses time.Sleep for goroutine coordination

pkg/shim/manager/manager_windows_test.go uses time.Sleep to wait for goroutine state — a pattern that makes tests timing-sensitive and intermittently flaky. Prefer a channel signal or sync.WaitGroup to coordinate the goroutine with the test assertion.

How this was found

This issue was identified by Quorum, an open-source Gemini-powered agent that reviews pull requests for distributed coordination anti-patterns (thundering herds, missing saga compensation, lost updates, transactional outbox violations, and more).

Happy to open a follow-up PR with the jitter fix if that would be helpful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions