wait has no overall timeout — orchestrators hang indefinitely on permanently-broken dependencies #18

@dolph

Summary

connectivity wait is documented as a way to gate other processes on dependency readiness (init containers, deploy scripts, etc.). It polls forever with a fixed 15s sleep and no overall deadline. If a dependency is permanently broken (a typo'd hostname, a service that never comes up, a destination blocked by a firewall change), wait hangs indefinitely. The orchestrator (k8s, systemd, GitHub Actions) eventually times out and reports a generic "still waiting" failure that gives no indication of which dependency was unreachable.

For an SRE, the desired UX is:

  • connectivity wait --timeout 5m exits with a distinct non-zero code on timeout
  • The log clearly identifies which destination(s) were not reached
  • The error is detectable from process exit code without parsing logs

Code

destinations.go:219-230:

func (dest *Destination) WaitFor() {
	for {
		reachable := dest.Check()
		if reachable {
			LogDestination(dest, "Connected")
			return
		}
		time.Sleep(15 * time.Second)
	}
}

connectivity.go:155-166:

func WaitLoop(destinations []*Destination) {
	var wg sync.WaitGroup
	for _, dest := range destinations {
		wg.Add(1)
		go func(dest *Destination) {
			defer wg.Done()
			dest.WaitFor()
		}(dest)
	}
	wg.Wait()
}

No context.Context, no deadline, no progress reporting beyond per-attempt logs.

Suggested fix

Add a --timeout flag (the default could stay unlimited, but the docs should recommend setting one) and propagate it through a context.Context:

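// Assumes *timeout > 0: context.WithTimeout with a zero duration is already expired.
ctx, cancel := context.WithTimeout(context.Background(), *timeout)
defer cancel()

var wg sync.WaitGroup
// Buffered to len(destinations) so no goroutine blocks on send.
errs := make(chan string, len(destinations))
for _, dest := range destinations {
    wg.Add(1)
    go func(d *Destination) {
        defer wg.Done()
        // WaitFor(ctx) is expected to return false if ctx expires first.
        if !d.WaitFor(ctx) {
            errs <- d.Label
        }
    }(dest)
}
wg.Wait()
close(errs)

// Drain the closed channel into a slice of unreached destination labels.
var unreached []string
for label := range errs {
    unreached = append(unreached, label)
}
if len(unreached) > 0 {
    log.Printf("Timed out waiting for: %s", strings.Join(unreached, ", "))
    os.Exit(1)
}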

WaitFor(ctx) should also use capped exponential backoff rather than a flat 15s sleep, so that a fast-failing destination doesn't issue 4 DNS lookups per minute for an hour, and a slow-to-converge one isn't hit with an aggressive polling cadence.

Distinct exit codes (e.g., 1 = timeout, 2 = config error) would also help orchestrator-level alerting distinguish "dependency missing" from "we never started".
