net.Dial has no timeout — checks can hang for the OS default (often minutes) #8

@dolph

Summary

dialer.go calls net.Dial with no timeout. When a destination silently drops SYNs (firewall blackhole, network partition, hung server), the OS TCP stack retries up to its system default before failing. On Linux the default is typically tcp_syn_retries=6, which with a 1-second initial retransmission timeout works out to roughly 127 seconds (1+2+4+8+16+32+64) before connect() gives up. In monitor mode, a single bad destination can stall the check thread for that long on every iteration.

This is the same root cause as the HTTP check timeout (#7), but it affects every tcp://, udp://, or other scheme-based destination, not just HTTP.

Note: udp:// is even worse here — net.Dial("udp", ...) does no on-wire work, so it appears to "succeed" instantly even for unreachable hosts. That's not what this issue is about, but it is worth a separate look (see linked issue if filed).
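
For reference, the Linux figure quoted in the summary can be reproduced with a few lines of arithmetic. This is only a sketch under stated assumptions: a 1-second initial retransmission timeout that doubles after every unanswered SYN, and no upper cap reached. The function name synGiveUpSeconds is made up for illustration and is not part of this repo.

```go
package main

import "fmt"

// synGiveUpSeconds estimates how long a blocking connect() waits before
// giving up when every SYN is dropped.
// Assumptions (not taken from this repo): the first retransmission timeout
// is 1 second and it doubles after each unanswered attempt.
func synGiveUpSeconds(retries int) int {
	total := 0
	wait := 1
	// The initial SYN and each of the retries is followed by one wait period.
	for i := 0; i <= retries; i++ {
		total += wait
		wait *= 2
	}
	return total
}

func main() {
	fmt.Println(synGiveUpSeconds(6)) // 127
	fmt.Println(synGiveUpSeconds(5)) // 63
}
```

With tcp_syn_retries=6 that is just over two minutes per dial, which matches the "~2 minutes per IP" estimate under Impact.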

Code

dialer.go:11-26:

func Dial(route *Route, dest *Destination, ip net.IP) bool {
	metricTags := []string{fmt.Sprintf("dest_ip:%s", ip.String())}
	hostPort := fmt.Sprintf("%s:%d", ip.String(), dest.Port)

	dest.Increment("connectivity.dial", metricTags)
	conn, err := net.Dial(dest.Protocol, hostPort)     // no timeout
	...
}

Impact

  • A connectivity check against a blackholed host blocks for ~2 minutes per IP, per call. Because CheckLoop (connectivity.go:144) runs sequentially, those delays add up, which is catastrophic for any non-trivial config.
  • Monitor mode never recovers a snappy cadence once a destination begins failing.

Suggested fix

Use net.DialTimeout, or a net.Dialer with Timeout set and (for HTTP/HTTPS) DialContext so the dial also honors a Context:

conn, err := net.DialTimeout(dest.Protocol, hostPort, 10*time.Second)
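
The net.Dialer variant could look roughly like the sketch below. dialWithTimeout and demoLocalDial are hypothetical names, not functions from dialer.go, and the real Dial would also need the route, destination, and metrics handling that is left out here. The demo dials a throwaway local listener so it does not depend on any external host.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialWithTimeout is a hypothetical helper (not from dialer.go). It reports
// whether a connection to hostPort could be opened before the timeout
// elapsed or ctx was cancelled, whichever came first.
func dialWithTimeout(ctx context.Context, protocol, hostPort string, timeout time.Duration) bool {
	d := net.Dialer{Timeout: timeout}
	conn, err := d.DialContext(ctx, protocol, hostPort)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demoLocalDial dials a temporary listener on the loopback interface, once
// with a live context and once with a context that is already cancelled.
func demoLocalDial() (reachable, afterCancel bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	defer ln.Close()
	addr := ln.Addr().String()

	reachable = dialWithTimeout(context.Background(), "tcp", addr, 2*time.Second)

	// A cancelled context makes the dial fail without waiting for the timeout.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	afterCancel = dialWithTimeout(ctx, "tcp", addr, 2*time.Second)
	return reachable, afterCancel, nil
}

func main() {
	reachable, afterCancel, err := demoLocalDial()
	fmt.Println(reachable, afterCancel, err)
}
```

Compared with net.DialTimeout, going through DialContext lets a caller such as monitor mode abort an in-flight dial on shutdown instead of waiting out the full timeout.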

Make the timeout configurable (and probably independent of the HTTP timeout in #7).
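
One possible shape for the configurable value, as a sketch only: parse a user-supplied duration string and fall back to a default when it is missing or unusable. The names dialTimeoutFrom and defaultDialTimeout are hypothetical, and where the raw string would come from (flag, config file, environment) is left open; the 10-second default simply mirrors the value suggested above.

```go
package main

import (
	"fmt"
	"time"
)

// defaultDialTimeout mirrors the 10-second value from the suggested fix.
const defaultDialTimeout = 10 * time.Second

// dialTimeoutFrom is a hypothetical helper. It parses a duration string such
// as "3s" or "500ms" and returns defaultDialTimeout when the input is empty,
// malformed, or not positive.
func dialTimeoutFrom(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultDialTimeout
	}
	return d
}

func main() {
	fmt.Println(dialTimeoutFrom("3s"))   // 3s
	fmt.Println(dialTimeoutFrom(""))     // 10s
	fmt.Println(dialTimeoutFrom("soon")) // 10s
}
```

Rejecting zero and negative values matters here because a zero Timeout on net.Dialer means "no timeout", which would quietly reintroduce the hang.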
