stability: Stuck ranges with lots of raft elections #26257
Comments
I suspect that cockroachdb#26257 is caused by the unquiescedReplicas map introduced in cockroachdb#24956 getting out of sync with the per-replica quiescent flag. Add debug pages to help us see if that's happening.

Release note: None
26269: server,ui: Add debugging for quiesced ranges r=bdarnell a=bdarnell

I suspect that #26257 is caused by the unquiescedReplicas map introduced in #24956 getting out of sync with the per-replica quiescent flag. Add debug pages to help us see if that's happening.

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
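For readers unfamiliar with the invariant being discussed, the following Go sketch is purely illustrative (the type and field names are assumptions, not the actual storage package code). It models the consistency check the new debug pages are meant to make visible: every replica whose quiescent flag is false should appear in the store-level unquiescedReplicas set, and nothing else should.

```go
// Illustrative model of the quiescence invariant; not CockroachDB code.
package main

import (
	"fmt"
	"sync"
)

type replica struct {
	id        int64
	mu        sync.Mutex
	quiescent bool // per-replica flag
}

type store struct {
	mu                 sync.Mutex
	unquiescedReplicas map[int64]struct{} // store-level set introduced in #24956
	replicas           map[int64]*replica
}

// checkConsistency returns the IDs of replicas whose quiescent flag
// disagrees with their membership in unquiescedReplicas. Drift between
// the two is the hypothesis behind this issue.
func (s *store) checkConsistency() (suspect []int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, r := range s.replicas {
		r.mu.Lock()
		quiescent := r.quiescent
		r.mu.Unlock()
		_, inMap := s.unquiescedReplicas[id]
		if quiescent == inMap { // quiesced but tracked, or unquiesced but untracked
			suspect = append(suspect, id)
		}
	}
	return suspect
}

func main() {
	s := &store{
		unquiescedReplicas: map[int64]struct{}{1: {}},
		replicas: map[int64]*replica{
			1: {id: 1, quiescent: false}, // consistent: unquiesced and tracked
			2: {id: 2, quiescent: false}, // unquiesced but missing from the map: out of sync
		},
	}
	fmt.Println("out-of-sync replicas:", s.checkConsistency())
}
```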
Omega has been hit by what appears to be the same issue (a day and a half after being upgraded to the alpha candidate). I've got a hypothesis that the unquiesced map introduced in #24956 is getting out of sync with the per-replica quiescent flag. I've added some debugging for this on master (#26269) which I'll test on cyan. If this is the case, I'll probably revert #24956 for the alpha.
Confirmed on cyan that we have leaders with replicas where the quiescent flag disagrees with the unquiescedReplicas map.
This reverts the main effect of cockroachdb#24956. The supporting infrastructure is left in place to minimize merge conflicts and to aid in diagnosing why the maps get out of sync.

Fixes cockroachdb#26257

Release note: None
26291: storage: Tick all replicas, not just unquiesced ones r=bdarnell a=bdarnell

This reverts the main effect of #24956. The supporting infrastructure is left in place to minimize merge conflicts and to aid in diagnosing why the maps get out of sync.

Fixes #26257

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
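As a rough illustration of what the revert changes (again with made-up names, not the real tick loop), the difference is whether the tick path trusts the unquiescedReplicas set to be complete or visits every replica and lets the per-replica flag decide:

```go
// Illustrative sketch of "tick only the map" vs. "tick everything"; not the real code.
package main

import "fmt"

type replica struct {
	id        int64
	quiescent bool
	ticks     int
}

func (r *replica) tick() {
	if r.quiescent {
		return // quiesced replicas don't drive their raft groups
	}
	r.ticks++
}

func main() {
	replicas := map[int64]*replica{
		1: {id: 1},
		2: {id: 2}, // unquiesced, but imagine it is missing from the unquiesced set
		3: {id: 3, quiescent: true},
	}
	unquiesced := map[int64]struct{}{1: {}} // hypothetically stale: replica 2 was never added

	// Reverted approach (#24956): only tick what the map says is unquiesced.
	for id := range unquiesced {
		replicas[id].tick()
	}

	// Restored approach (#26291): tick every replica; the per-replica flag decides.
	for _, r := range replicas {
		r.tick()
	}

	for id, r := range replicas {
		fmt.Printf("replica %d ticked %d time(s)\n", id, r.ticks)
	}
}
```

The apparent motivation for the revert is robustness: with the restored approach, a live replica that is missing from the map can no longer be skipped indefinitely.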
There are two paths by which a replica can become unquiesced, and only one of them updated the unquiesced map.

Updates cockroachdb#26257. This is probably the underlying issue there, although I'm still looking for potential synchronization problems coming from the use of two different locks.

Release note: None
26265: sql: stop using a txn for getting the index backfill timestamp r=andreimatei a=andreimatei

Index backfills use AOST so that their reads are less intrusive. Before this patch, the timestamp for their reads was chosen in a bizarre way (unless I'm missing something): we were creating a txn for the sole purpose of getting its timestamp. That txn was used to save the job details with the timestamp in them, but I don't think using a txn for that is necessary. This patch gets a timestamp directly from the node's clock.

Besides seeming bizarre, the old code had a bug: it was incorrectly using the job.WithTxn() API. The code was:

    if err := sc.db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
        details := *sc.job.WithTxn(txn).Payload().Details.(*jobs.Payload_SchemaChange).SchemaChange
        if details.ReadAsOf == (hlc.Timestamp{}) {
            details.ReadAsOf = txn.CommitTimestamp()
            if err := sc.job.WithTxn(txn).SetDetails(ctx, details); err != nil {
                return errors.Wrapf(err, "failed to store readAsOf on job %d", *sc.job.ID())
            }
        }
        sc.readAsOf = details.ReadAsOf
        return nil
    }); err != nil {
        return err
    }

The first WithTxn() call is just bad: it mutates the job and persists the txn in it, for no reason. If the second WithTxn() call runs, then the effects of the first one will be negated; the SetDetails() call will indirectly wipe the txn. However, if the second call didn't run (which happens, for example, when the txn gets restarted), then the job ends up holding on to a committed transaction, which will cause it to fail when it tries to do some further operations.

Fixes #25680

Release note (bug fix): Fixed a bug causing index creation to fail under rare circumstances.

26297: storage: Fix a discrepancy between quiescent flag and unquiesced map r=petermattis a=bdarnell

There are two paths by which a replica can become unquiesced, and only one of them updated the unquiesced map. Updates #26257. This is probably the underlying issue there, although I'm still looking for potential synchronization problems coming from the use of two different locks.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
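The shape of the quiescence fix, as described in #26297 above, is that every path that unquiesces a replica should update the flag and the map together. The sketch below is a hedged illustration of that idea only; the entry-point names, lock layout, and structure are invented for this example and deliberately sidestep the two-lock ordering questions the PR description mentions.

```go
// Illustrative sketch: funnel all unquiesce paths through one helper so the
// per-replica flag and the store-level map cannot disagree. Not CockroachDB code.
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu                 sync.Mutex
	unquiescedReplicas map[int64]struct{}
}

type replica struct {
	id        int64
	store     *store
	mu        sync.Mutex
	quiescent bool
}

// unquiesce is the single place where both the flag and the map change.
func (r *replica) unquiesce() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.quiescent {
		return
	}
	r.quiescent = false
	r.store.mu.Lock()
	r.store.unquiescedReplicas[r.id] = struct{}{}
	r.store.mu.Unlock()
}

// Two hypothetical entry points standing in for the "two paths" the PR
// describes; both go through the same helper.
func (r *replica) onRaftMessage()   { r.unquiesce() }
func (r *replica) onLocalProposal() { r.unquiesce() }

func main() {
	s := &store{unquiescedReplicas: map[int64]struct{}{}}
	r := &replica{id: 7, store: s, quiescent: true}
	r.onLocalProposal()
	_, tracked := s.unquiescedReplicas[r.id]
	fmt.Printf("quiescent=%t tracked=%t\n", r.quiescent, tracked)
}
```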
cyan is currently in a pretty bad state: it's not serving any SQL qps (it looks like the problem first started on May 23 with a build from the master branch, and has been continuing much of the time since then). Restarting all the nodes gets it working for about an hour, then it stops again. It doesn't appear to recover on its own.

While it's in the broken state, all raft messages except vote, prevote, their responses, and TimeoutNow drop to near zero, while vote-related traffic is much higher than normal. The number of quiesced ranges increases to about 95% of the total ranges (when things are working, it's more like 60%). The problem ranges page shows about a dozen ranges with either no raft leader or leader-not-leaseholder.
On these ranges, it appears that something is going wrong with quiescence. We see this cycle repeat itself:
There is no evidence of network or raft transport problems (this was my first suspicion, blaming the recent grpc upgrade). Each node is able to win an election and is able to communicate bidirectionally with each of its peers.
Node 1 never logs any quiescing or unquiescing messages for this replica. An unquiesced replica will periodically log (at a high V level), so this suggests that node 1's replica is quiesced throughout this process, but somehow failed to send a quiesce heartbeat to the followers (or it failed to unquiesce when it should have, and is somehow sleepwalking through these elections).
#24956 (merged May 21) is the most likely culprit I can see. I don't have any theories as to why this is only affecting cyan.