stability: Stuck ranges with lots of raft elections #26257

Closed
bdarnell opened this issue May 30, 2018 · 2 comments
Assignees: bdarnell
Labels: A-kv-replication (Relating to Raft, consensus, and coordination.), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), O-robot (Originated from a bot.), S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting.)

Comments

@bdarnell (Contributor) commented:

cyan is currently in a pretty bad state: it's not serving any SQL traffic (qps is zero). It looks like the problem first started on May 23 with a build from the master branch and has persisted for much of the time since then. Restarting all the nodes gets the cluster working for about an hour, then it stops again; it doesn't appear to recover on its own.

While the cluster is in the broken state, all raft messages except vote, prevote, their responses, and TimeoutNow drop to near zero, and vote-related traffic is much higher than normal. The number of quiesced ranges increases to about 95% of the total ranges (when things are working, it's more like 60%). The problem ranges page shows about a dozen ranges with either no raft leader or leader-not-leaseholder.

On these ranges, it appears that something is going wrong with quiescence. We see this cycle repeat itself:

  • Node 1 is the leader and leaseholder. Nothing appears to be happening on the range; no writes are getting proposed.
  • Nodes 2 and 3 log "not quiescing; not leader" periodically, indicating that they are unquiesced.
  • Either node 2 or node 3 times out and starts an election, indicating that it is not receiving heartbeats.
  • The election succeeds; node 2 (or 3) becomes leader. That node then sees it is not the leaseholder and transfers leadership back to node 1.
  • Node 1's election succeeds and the cycle repeats.

There is no evidence of network or raft transport problems (this was my first suspicion, given the recent grpc upgrade). Each node is able to win an election and is able to communicate bidirectionally with each of its peers.

Node 1 never logs any quiescing or unquiescing messages for this replica. An unquiesced replica will periodically log (at a high V level), so this suggests that node 1's replica is quiesced throughout this process, but somehow failed to send a quiesce heartbeat to the followers (or it failed to unquiesce when it should have, and is somehow sleepwalking through these elections).

#24956 (merged May 21) is the most likely culprit I can see. I don't have any theories as to why this is only affecting cyan.
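
For orientation, the quiescence bookkeeping in question involves two structures that have to agree: a per-replica quiescent flag and, since #24956, a store-level map of unquiesced replicas that drives the tick loop. The toy Go program below (not CockroachDB code; toyReplica, toyStore, and everything else in it are invented for illustration) sketches how the hypothesized drift between the two could silence a leader's heartbeats:

    package main

    import (
        "fmt"
        "sync"
    )

    // toyReplica stands in for a replica that tracks its own quiescence
    // state under its own mutex (a different lock than the store's).
    type toyReplica struct {
        mu        sync.Mutex
        id        int
        isLeader  bool
        quiescent bool
    }

    // tick drives the replica's raft timers; a quiescent replica does
    // nothing, and a ticking leader is what keeps heartbeats flowing.
    func (r *toyReplica) tick() {
        r.mu.Lock()
        defer r.mu.Unlock()
        if r.isLeader && !r.quiescent {
            fmt.Printf("replica %d: heartbeating followers\n", r.id)
        }
    }

    // toyStore mirrors the idea of a store-level unquiescedReplicas map:
    // the tick loop consults the map, not the per-replica flag.
    type toyStore struct {
        mu         sync.Mutex
        replicas   map[int]*toyReplica
        unquiesced map[int]struct{}
    }

    // tickLoopIteration ticks only replicas registered in the map. A
    // replica whose flag says it is awake but which was never (re)added
    // to the map is skipped every time, so a leader in that state stops
    // heartbeating and its followers eventually call elections.
    func (s *toyStore) tickLoopIteration() {
        s.mu.Lock()
        defer s.mu.Unlock()
        for id := range s.unquiesced {
            s.replicas[id].tick()
        }
    }

    func main() {
        leader := &toyReplica{id: 1, isLeader: true, quiescent: false}
        s := &toyStore{
            replicas:   map[int]*toyReplica{1: leader},
            unquiesced: map[int]struct{}{}, // drifted: the leader is missing
        }
        s.tickLoopIteration() // prints nothing; no heartbeats go out
    }

Run as written, the leader is never ticked even though its own flag says it is awake, which is one way a leader could go quiet while its followers keep campaigning.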

bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
I suspect that cockroachdb#26257 is caused by the unquiescedReplicas map
introduced in cockroachdb#24956 getting out of sync with the per-replica
quiescent flag. Add debug pages to help us see if that's happening.

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2018
26269: server,ui: Add debugging for quiesced ranges r=bdarnell a=bdarnell

I suspect that #26257 is caused by the unquiescedReplicas map
introduced in #24956 getting out of sync with the per-replica
quiescent flag. Add debug pages to help us see if that's happening.

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
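
The debug pages themselves are not shown in the commit message. In terms of the toy types from the sketch above, a cross-check that would surface the suspected inconsistency might look like the following (a hypothetical helper, not the actual PR code; real code would also have to respect the store/replica lock ordering, which this toy ignores):

    // reportQuiescenceMismatches returns the IDs of replicas whose
    // per-replica flag disagrees with their membership in the
    // store-level unquiesced map. The intended invariant is that a
    // replica is in the map if and only if it is not quiescent.
    func (s *toyStore) reportQuiescenceMismatches() []int {
        s.mu.Lock()
        defer s.mu.Unlock()
        var mismatched []int
        for id, r := range s.replicas {
            _, inMap := s.unquiesced[id]
            r.mu.Lock()
            quiescent := r.quiescent
            r.mu.Unlock()
            if inMap == quiescent { // both true or both false violates the invariant
                mismatched = append(mismatched, id)
            }
        }
        return mismatched
    }
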
@bdarnell (Contributor, Author) commented:

Omega has been hit by what appears to be the same issue (a day and a half after being upgraded to the alpha candidate).

I've got a hypothesis that the unquiesced map introduced in #24956 is getting out of sync with the per-replica quiescent flag. I've added some debugging for this on master (#26269) which I'll test on cyan. If this is the case, I'll probably revert #24956 for the alpha.

bdarnell added the C-bug, O-robot, S-1-stability, and A-kv-replication labels on May 31, 2018
bdarnell self-assigned this on May 31, 2018
@bdarnell (Contributor, Author) commented:

Confirmed on cyan that we have leaders with replicas where the quiescent flag is true but they are not present in the unquiesced map. I'm going to revert #24956 and also cherry-pick the revert onto the alpha candidate.

bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
This reverts the main effect of cockroachdb#24956. The supporting infrastructure
is left in place to minimize merge conflicts and to aid in diagnosing
why the maps get out of sync.

Fixes cockroachdb#26257

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2018
26291: storage: Tick all replicas, not just unquiesced ones r=bdarnell a=bdarnell

This reverts the main effect of #24956. The supporting infrastructure
is left in place to minimize merge conflicts and to aid in diagnosing
why the maps get out of sync.

Fixes #26257

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
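
In terms of the same toy model, the effect of the revert is roughly to walk every replica on each pass of the tick loop and let the per-replica flag alone decide whether anything happens, so a stale map can no longer starve a replica of ticks. This is a hedged sketch, not the actual change:

    // tickAllReplicas visits every replica rather than only the ones in
    // the unquiesced map; tick() is already a no-op for quiescent
    // replicas, so correctness no longer depends on the map being right.
    func (s *toyStore) tickAllReplicas() {
        s.mu.Lock()
        all := make([]*toyReplica, 0, len(s.replicas))
        for _, r := range s.replicas {
            all = append(all, r)
        }
        s.mu.Unlock()

        for _, r := range all {
            r.tick()
        }
    }

The trade-off is the extra work of visiting quiesced replicas on every tick, presumably the overhead #24956 was trying to avoid in the first place.
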
craig bot closed this as completed in #26291 on May 31, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
There are two paths by which a replica can become unquiesced, and only
one of them updated the unquiesced map.

Updates cockroachdb#26257

This is probably the underlying issue there, although I'm still
looking for potential synchronization problems coming from the use of
two different locks.

Release note: None
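
Continuing the toy model, the class of bug described here is two entry points that both clear the flag while only one of them registers the replica in the map. Funneling every wake-up through a single helper that updates both structures closes that gap. This is a hedged sketch; the real fix and its locking live in #26297:

    // maybeUnquiesce is the single place where a replica wakes up: it
    // clears the per-replica flag and adds the replica to the store's
    // unquiesced map, so no caller can update one without the other.
    // Note that two different locks are involved, which is exactly the
    // synchronization question the commit message raises.
    func (s *toyStore) maybeUnquiesce(r *toyReplica) {
        r.mu.Lock()
        wasQuiescent := r.quiescent
        r.quiescent = false
        r.mu.Unlock()

        if wasQuiescent {
            s.mu.Lock()
            s.unquiesced[r.id] = struct{}{}
            s.mu.Unlock()
        }
    }
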
craig bot pushed a commit that referenced this issue May 31, 2018
26265: sql: stop using a txn for getting the index backfill timestamp r=andreimatei a=andreimatei

Index backfills use AOST so that their reads are less intrusive.
Before this patch, the timestamp for these reads was chosen in a bizarre
way (unless I'm missing something): we were creating a txn for the sole
purpose of getting its timestamp. That txn was used to save the job
details with the timestamp in them, but I don't think using a txn for
that is necessary.
This patch gets a timestamp directly from the node's clock.

Besides seeming bizarre, the old code had a bug - it was incorrectly
using the job.WithTxn() API. The code was:

       if err := sc.db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
               // First WithTxn() call: as a side effect, it attaches txn to the job.
               details := *sc.job.WithTxn(txn).Payload().Details.(*jobs.Payload_SchemaChange).SchemaChange
               if details.ReadAsOf == (hlc.Timestamp{}) {
                       details.ReadAsOf = txn.CommitTimestamp()
                       // Second WithTxn() call: SetDetails() indirectly clears the attached
                       // txn, but only if this branch is actually taken.
                       if err := sc.job.WithTxn(txn).SetDetails(ctx, details); err != nil {
                               return errors.Wrapf(err, "failed to store readAsOf on job %d", *sc.job.ID())
                       }
               }
               sc.readAsOf = details.ReadAsOf
               return nil
       }); err != nil {
               return err
       }

The first WithTxn() call is just bad: it mutates the job and persists
the txn in it, for no reason. If the second WithTxn() call runs, the
effects of the first one are negated, since the SetDetails() call
indirectly wipes the txn. However, if the second call doesn't run (which
happens, for example, when the txn gets restarted), then the job ends up
holding on to a committed transaction, which causes it to fail when it
tries to do further operations.

Fixes #25680

Release note(bug fix): Fixed a bug causing index creation to fail under
rare circumstances.
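
A minimal sketch of the shape of the fix described above, assuming the schema changer can reach the node's hlc.Clock (the sc.clock field and the exact persistence call are assumptions made for this sketch, not the actual patch):

       // Read the current job details without attaching a txn to the job.
       details := *sc.job.Payload().Details.(*jobs.Payload_SchemaChange).SchemaChange
       if details.ReadAsOf == (hlc.Timestamp{}) {
               // The node's HLC clock supplies the timestamp directly; no
               // transaction is needed just to obtain a timestamp.
               details.ReadAsOf = sc.clock.Now()
               if err := sc.job.SetDetails(ctx, details); err != nil {
                       return errors.Wrapf(err, "failed to store readAsOf on job %d", *sc.job.ID())
               }
       }
       sc.readAsOf = details.ReadAsOf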

26297: storage: Fix a discrepancy between quiescent flag and unquiesced map r=petermattis a=bdarnell

There are two paths by which a replica can become unquiesced, and only
one of them updated the unquiesced map.

Updates #26257

This is probably the underlying issue there, although I'm still
looking for potential synchronization problems coming from the use of
two different locks.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Ben Darnell <ben@cockroachlabs.com>