Summary
dialer.go uses net.Dial with no timeout. When a destination silently drops SYNs (firewall blackhole, network partition, hung server), the OS TCP stack retransmits up to its system default before failing. On Linux this is typically tcp_syn_retries=6, which with the default 1-second initial retransmission timeout takes about 127 seconds. In monitor mode, a single bad destination can stall the check loop for that long every iteration.
This is the same root cause as the HTTP check timeout, but it affects every tcp:// (or other connection-oriented) scheme-based destination, not just HTTP.
Note: udp:// fails in a different way. net.Dial("udp", ...) does no on-wire work, so it appears to "succeed" instantly even for unreachable hosts. That's not what this issue is about, but it is worth a separate look (see linked issue if filed).
Code
dialer.go:11-26:
func Dial(route *Route, dest *Destination, ip net.IP) bool {
	metricTags := []string{fmt.Sprintf("dest_ip:%s", ip.String())}
	hostPort := fmt.Sprintf("%s:%d", ip.String(), dest.Port)
	dest.Increment("connectivity.dial", metricTags)
	conn, err := net.Dial(dest.Protocol, hostPort) // no timeout
	...
}
Impact
connectivity check against a blackholed host blocks ~2 minutes per IP, per call. Because CheckLoop (connectivity.go:144) runs sequentially, these delays add up across every blackholed IP, which is catastrophic for any non-trivial config.
monitor mode cannot keep a snappy cadence once a destination begins failing.
Suggested fix
Use net.DialTimeout, or a net.Dialer with Timeout set and (for HTTP/HTTPS) DialContext with a Context:
conn, err := net.DialTimeout(dest.Protocol, hostPort, 10*time.Second)
Make the timeout configurable (and probably independent of the HTTP timeout in #7).