fix(shim/manager): waitForShimPipe retry uses fixed delay — no jitter, thundering herd risk on Windows

## Summary

PR #218 adds `waitForShimPipe()` in `pkg/shim/manager/manager_windows.go`. The retry loop calls `time.Sleep(retryDelay)` where `retryDelay` is a single fixed `time.Duration` with no randomisation.

When multiple container shims start simultaneously (a common pattern in pod batch launches or OCI runtime parallel execution), all shims that fail their first pipe dial will sleep for exactly the same `retryDelay` and retry at the same instant — a thundering herd that can momentarily saturate the Windows named pipe listener queue.

## Where it happens

`pkg/shim/manager/manager_windows.go` (added by PR #218):

```go
func waitForShimPipe(ctx context.Context, address string,
    readyTimeout, perAttempt, retryDelay time.Duration) error {
    // ...
    for {
        _, err := winio.DialPipe(address, &perAttempt)
        if err == nil {
            return nil
        }
        retryable := os.IsNotExist(err) ||
            errors.Is(err, winio.ErrTimeout) ||
            errors.Is(err, windows.ERROR_PIPE_BUSY)
        if !retryable {
            return fmt.Errorf("waiting for shim pipe %s: %w", address, err)
        }
        time.Sleep(retryDelay)  // fixed — no jitter
    }
}
```

## Suggested fix

Add ±25% jitter to spread retries:

```go
import "math/rand"

// Spread retries: ±25% jitter around retryDelay
jittered := retryDelay + time.Duration(rand.Int63n(int64(retryDelay)/2)) - retryDelay/4
time.Sleep(jittered)
```

Or use an exponential backoff with jitter starting from a small base:

```go
backoff := retryDelay
for {
    // ...
    jitter := time.Duration(rand.Int63n(int64(backoff) / 2))
    time.Sleep(backoff/2 + jitter)
    if backoff < maxRetryDelay {
        backoff *= 2
    }
}
```

The AWS Architecture Blog (Marc Brooker) recommends [full jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/) for any retry loop where multiple callers may experience the same transient failure simultaneously.

## Additional finding: RULE_07 — Test uses `time.Sleep` for goroutine coordination

`pkg/shim/manager/manager_windows_test.go` uses `time.Sleep` to wait for goroutine state — a pattern that makes tests timing-sensitive and intermittently flaky. Prefer a channel signal or `sync.WaitGroup` to coordinate the goroutine with the test assertion.

## How this was found

This issue was identified by [Quorum](https://github.com/KaustubhUp025/quorum), an open-source Gemini-powered agent that reviews pull requests for distributed coordination anti-patterns (thundering herds, missing saga compensation, lost updates, transactional outbox violations, and more).

Happy to open a follow-up PR with the jitter fix if that would be helpful.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(shim/manager): waitForShimPipe retry uses fixed delay — no jitter, thundering herd risk on Windows #219

Summary

Where it happens

Suggested fix

Additional finding: RULE_07 — Test uses `time.Sleep` for goroutine coordination

How this was found

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

fix(shim/manager): waitForShimPipe retry uses fixed delay — no jitter, thundering herd risk on Windows #219

Description

Summary

Where it happens

Suggested fix

Additional finding: RULE_07 — Test uses time.Sleep for goroutine coordination

How this was found

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Additional finding: RULE_07 — Test uses `time.Sleep` for goroutine coordination