Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
MB-47254 (7.1.0 1910) Avoid log flooding from watcherServer connect fail
indexing repo piece of this fix. There is also a gometa repo piece. In a cluster with a large number of Index nodes, a long-lived network partition flooded the logs with error messages from watcherServer.go runOnce(), because even though the retries backed off to be 30 seconds apart for each peer, these retries were ongoing for attempts to contact all other Index nodes, which multiplies their frequency by the number of such nodes. This made indexer.log wrap in less than an hour at one customer. The retry timers were all ticking at integer numbers of seconds from their start times, which themselves are all one second apart in a network partition case because of a foreground 1-second wait for success by the outer caller, metadata_provider.go WatchMetadata(), before switching to background forever waits. The fix is: 1. Only log the connection failure messages for each peer on first failure and then every 100 retries thereafter. I also added the try number and the hostname with which contact failed to the logging. 2. (Minor:) In the case of an explicit kill, the old Timer needs to be stopped and its channel potentially drained before returning, else it can never be garbage collected. 3. Change the 1000 ms foreground wait in WatchMetadata() to 971 ms, a prime number, to prevent the retry Timers from all waking up on 1-second harmonics of the start of launch if the network is in fact partitioned. Change-Id: Ic88fe91cc18cd806901042443dca171e074a16ec
- Loading branch information