MB-47254 (7.1.0 1910) Avoid log flooding from watcherServer connect fail

indexing repo piece of this fix. There is also a gometa repo piece.

In a cluster with a large number of Index nodes, a long-lived network partition flooded the logs with error messages from watcherServer.go runOnce(), because even though the retries backed off to be 30 seconds apart for each peer, these retries were ongoing for attempts to contact all other Index nodes, which multiplies their frequency by the number of such nodes. This made indexer.log wrap in less than an hour at one customer.

The retry timers were all ticking at integer numbers of seconds from their start times, which themselves are all one second apart in a network partition case because of a foreground 1-second wait for success by the outer caller, metadata_provider.go WatchMetadata(), before switching to background forever waits.

The fix is:

1. Only log the connection failure messages for each peer on first failure and then every 100 retries thereafter. I also added the try number and the hostname with which contact failed to the logging.

2. (Minor:) In the case of an explicit kill, the old Timer needs to be stopped and its channel potentially drained before returning, else it can never be garbage collected.

3. Change the 1000 ms foreground wait in WatchMetadata() to 971 ms, a prime number, to prevent the retry Timers from all waking up on 1-second harmonics of the start of launch if the network is in fact partitioned.

Change-Id: Ic88fe91cc18cd806901042443dca171e074a16ec
couchbase · Dec 20, 2021 · 451687a
1 parent 7d0d14c
Showing 1 changed file with 4 additions and 2 deletions.
diff --git a/secondary/manager/client/metadata_provider.go b/secondary/manager/client/metadata_provider.go
@@ -293,8 +293,10 @@ func (o *MetadataProvider) WatchMetadata(indexAdminPort string, callback watcher
 	// start a watcher to the indexer admin
 	watcher, readych := o.startWatcher(indexAdminPort)
 
-	// wait for indexer to connect
-	success, _ := watcher.waitForReady(readych, 1000, nil)
+	// Wait for indexer to connect for a prime number of ms to prevent retry Timers in watcherServer
+	// from all being aligned on harmonics of 1 sec if the network is partitioned. (This used to
+	// foreground wait for 1,000 ms which led to "thundering herd" retries.)
+	success, _ := watcher.waitForReady(readych, 971, nil)
 	if success {
 		// if successfully connected, retrieve indexerId
 		success, _ = watcher.notifyReady(indexAdminPort, 0, nil)
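
The arithmetic behind the 971 ms choice can be checked with a simplified model. It assumes, as the commit message describes, that each peer's watcher starts one foreground wait after the previous one and that retry Timers then fire at whole-second offsets from their own start. The function `distinctPhases` is a hypothetical helper for illustration and is not part of the indexing code.

```go
package main

import "fmt"

// distinctPhases counts how many different sub-second offsets (in ms) the
// retry Timers of `peers` watchers land on, when watcher k starts at
// k*waitMs and each Timer fires a whole number of seconds after its start.
func distinctPhases(peers, waitMs int) int {
	seen := make(map[int]struct{})
	for k := 0; k < peers; k++ {
		seen[(k*waitMs)%1000] = struct{}{}
	}
	return len(seen)
}

func main() {
	for _, waitMs := range []int{1000, 971} {
		fmt.Printf("wait=%dms peers=50 distinct phases=%d\n",
			waitMs, distinctPhases(50, waitMs))
	}
}
```

With a 1000 ms wait, all 50 Timers share a single phase, so they wake together. With 971 ms they fall on 50 different phases, because 971 and 1000 have no common factor.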