storage: Simplify raft automatic campaigning after PreVote #24920

bdarnell · 2018-04-18T23:59:43Z

Before we implemented PreVote, we had various heuristics to decide when we should ask raft to campaign (bypassing the usual timeout). Since PreVote has reduced the cost of raft elections (by ensuring that a node that calls for an election it can't win doesn't disrupt its peers), we can get by with simpler logic.
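To illustrate why a losing pre-vote is cheap, here is a toy model of the pre-vote check. It is a sketch only: the `peer` type, its fields, and both functions are hypothetical names invented for this example, not the etcd/raft or CockroachDB code, and log freshness is reduced to comparing the last index (real Raft also compares the last log term).

```go
package main

import "fmt"

// peer is a toy stand-in for one raft node (hypothetical, not the etcd/raft type).
type peer struct {
	term             uint64 // current term
	lastLogIndex     uint64 // index of the last log entry
	ticksSinceLeader int    // ticks since the last message from a leader
	electionTimeout  int    // ticks without a leader before an election is plausible
}

// grantPreVote reports whether p would support an election by the candidate.
// It never modifies p.term, so a rejected pre-vote leaves p (and the leader
// it follows) undisturbed.
func (p *peer) grantPreVote(candTerm, candLastIndex uint64) bool {
	heardFromLeader := p.ticksSinceLeader < p.electionTimeout
	return !heardFromLeader && candTerm > p.term && candLastIndex >= p.lastLogIndex
}

// preVoteSucceeds counts the candidate's own vote plus the grants from its
// peers and checks for a majority of the whole group.
func preVoteSucceeds(cand *peer, peers []*peer) bool {
	votes := 1
	for _, p := range peers {
		// The candidate asks about term+1 without bumping its own term.
		if p.grantPreVote(cand.term+1, cand.lastLogIndex) {
			votes++
		}
	}
	return votes > (len(peers)+1)/2
}

func main() {
	// Both peers recently heard from a leader, so the pre-vote fails
	// and nobody's term changes.
	healthy := []*peer{
		{term: 5, lastLogIndex: 10, ticksSinceLeader: 1, electionTimeout: 10},
		{term: 5, lastLogIndex: 10, ticksSinceLeader: 2, electionTimeout: 10},
	}
	cand := &peer{term: 5, lastLogIndex: 10}
	fmt.Println(preVoteSucceeds(cand, healthy), healthy[0].term) // prints: false 5
}
```

In this model an unnecessary campaign costs one round of messages and changes no terms, which is the property that lets the campaign heuristics be relaxed.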

In addition to simplifying the logic, this PR introduces a new campaign trigger when a range unquiesces. This is a prerequisite for getting rid of the TickQuiesced hack (which is disabled by default in this PR and will be removed in a future one).

Fixes #18365

tbg · 2018-04-23T17:46:05Z

, but as @petermattis pointed out we should merge this when the alpha sha has been picked.

Reviewed 2 of 2 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, 1 of 1 files at r4.
Review status: 3 of 4 files reviewed at latest revision, all discussions resolved, some commit checks failed.

pkg/storage/replica.go, line 548 at r1 (raw file):

		// creation if we gossiped our store descriptor more than the election
		// timeout in the past.
		shouldCampaignOnCreation = (r.mu.internalRaftGroup == nil) && r.store.canCampaignIdleReplica()

The Raft groups are still created lazily, and in a large enough cluster most ranges are going to be quiesced, so in that scenario there won't be a "pre-election storm" that can cause latency blips in the foreground workload on the cluster. Correct?

pkg/storage/replica.go, line 599 at r4 (raw file):

	// If we're already campaigning or know who the leader is, don't
	// start a new term.
	status := r.mu.internalRaftGroup.Status()

The method comment suggests that this needs a nil check. Is the Raft group populated in the new callsites as well? Might be worth adjusting the comment.

bdarnell · 2018-04-23T19:11:40Z

Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

pkg/storage/replica.go, line 548 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The Raft groups are still created lazily, and in a large enough cluster most ranges are going to be quiesced, so in that scenario there won't be a "pre-election storm" that can cause latency blips in the foreground workload on the cluster. Correct?

Right. The fact that most ranges are quiesced doesn't really matter. The key factors in preventing storms are that the ranges are created lazily and that unquiescing a range will not usually cause a campaign because the previous leaseholder is still alive so it is presumed to still be the leader.
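The reasoning above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in replica.go: `wakeState`, its fields, and `shouldCampaignOnWake` are hypothetical names, and the liveness check is reduced to a single boolean.

```go
package main

import "fmt"

// wakeState collects the inputs to the decision. All fields are hypothetical
// names for this sketch, not fields of Replica.
type wakeState struct {
	isFollower      bool // raft state is follower (not already campaigning or leading)
	knownLeader     bool // raft already knows who the leader is
	leaseholderLive bool // the previous leaseholder's node is considered live
}

// shouldCampaignOnWake returns true only when an unquiescing replica has no
// reason to believe a leader exists.
func shouldCampaignOnWake(s wakeState) bool {
	// Already campaigning, already leading, or leader known: don't start a new term.
	if !s.isFollower || s.knownLeader {
		return false
	}
	// A live previous leaseholder is presumed to still be the leader.
	return !s.leaseholderLive
}

func main() {
	// Common case: the range wakes up and the old leaseholder is alive.
	fmt.Println(shouldCampaignOnWake(wakeState{isFollower: true, leaseholderLive: true})) // prints: false
	// Rare case: the old leaseholder is down, so campaign right away.
	fmt.Println(shouldCampaignOnWake(wakeState{isFollower: true, leaseholderLive: false})) // prints: true
}
```

Because the common case returns false, waking many ranges at once does not by itself translate into many campaigns.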

pkg/storage/replica.go, line 599 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The method comment suggests that this needs a nil check. Is the Raft group populated in the new callsites as well? Might be worth adjusting the comment.

Replica.RaftStatus() may return nil. internalRaftGroup.Status() never will (if irg is nil, Status will panic instead of returning nil). I think irg is guaranteed to be initialized at all call sites of this method (if not, it'll crash in Campaign too, not just Status).
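A minimal sketch of the distinction, using toy types rather than the real Replica and raft group: `replica`, `rawGroup`, `status`, and `statusPanics` are hypothetical stand-ins. The nil-tolerant accessor returns nil, while calling Status directly on a nil group panics.

```go
package main

import "fmt"

// status and rawGroup are toy stand-ins for the raft status and raft group.
type status struct{ lead uint64 }

type rawGroup struct{ lead uint64 }

// Status dereferences g, so calling it on a nil *rawGroup panics.
func (g *rawGroup) Status() *status { return &status{lead: g.lead} }

type replica struct{ group *rawGroup }

// RaftStatus is the nil-tolerant accessor: it returns nil when the group
// has not been created yet.
func (r *replica) RaftStatus() *status {
	if r.group == nil {
		return nil
	}
	return r.group.Status()
}

// statusPanics reports whether calling Status directly on g panics.
func statusPanics(g *rawGroup) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	_ = g.Status()
	return false
}

func main() {
	r := &replica{} // group not yet initialized
	fmt.Println(r.RaftStatus() == nil, statusPanics(r.group)) // prints: true true
}
```

So a nil check on the result of the direct call can never fire; the only safeguard is that the group is initialized before the method runs.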

tbg · 2018-04-23T20:10:24Z

Reviewed 2 of 2 files at r5, 1 of 1 files at r6, 1 of 1 files at r7.
Review status: 3 of 4 files reviewed at latest revision, all discussions resolved.

PreVote will be the only option in 2.1.
Release note: None

With PreVote, it is less disruptive to campaign unnecessarily, so there's no need for this additional check.
Release note: None

Fold the necessary checks into withRaftGroupLocked and remove unnecessary arguments. This has the effect of campaigning somewhat more than before but that's OK since PreVote minimizes the disruption.
Fixes cockroachdb#18365
Release note: None

This removes the time-to-recovery penalty for disabling TickQuiesced. Note that the new maybeCampaignOnWakeLocked method is not a verbatim copy of the code that was moved from withRaftGroupLocked; new conditions were added to avoid unnecessary campaigns since this method can be called more than before.
Release note: None

With the change to automatically campaign when unquiescing, this should no longer be necessary. The option will be removed in a subsequent change.
Release note: None

bdarnell · 2018-05-02T18:50:58Z

bors r+

24920: storage: Simplify raft automatic campaigning after PreVote r=bdarnell a=bdarnell

Before we implemented PreVote, we had various heuristics to decide when we should ask raft to campaign (bypassing the usual timeout). Since PreVote has reduced the cost of raft elections (by ensuring that a node that calls for an election it can't win doesn't disrupt its peers), we can get by with simpler logic.

In addition to simplifying the logic, this PR introduces a new campaign trigger when a range unquiesces. This is a prerequisite for getting rid of the TickQuiesced hack (which is disabled by default in this PR and will be removed in a future one).

Fixes #18365

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>

craig · 2018-05-02T19:16:31Z

Build succeeded

GitHub CI (Cockroach)

24956: storage: Maintain a separate set of unquiesced replicas r=petermattis a=bdarnell

This means that idle replicas no longer have a per-tick CPU cost, which is one of the bottlenecks limiting the amount of data we can handle per store.

Fixes #17609

Release note (performance improvement): Reduced CPU overhead of idle ranges

The first five commits are from #24920; that PR should be merged and tested in isolation first.

25735: sql: fix null normalization r=RaduBerinde a=RaduBerinde

The normalization rules are happy to convert `NULL::TEXT` to `NULL`. While both expressions evaluate to `DNull`, the `ResolvedType()` is different. It seems unsound for normalization to change the type.

This issue is shown by trying to run a query containing `ARRAY_AGG(NULL::TEXT)` through distsql planning: by the time the distsql planner looks at it, the `NULL::TEXT` is just `DNull` (with the `Unknown` type) and the distsql planner cannot find the builtin.

This change fixes the normalization rules by retaining the cast in this case. In general, any expression that statically evaluates to NULL gets a cast to the original expression type. The same is done in the opt execbuilder.

Fixes #25724.

Release note (bug fix): Fixed query errors in some cases involving a NULL constant that is cast to a specific type.

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>

bdarnell requested review from tbg and a team April 18, 2018 23:59

bdarnell mentioned this pull request Apr 20, 2018

storage: Maintain a separate set of unquiesced replicas #24956

Merged

petermattis mentioned this pull request Apr 23, 2018

storage: avoid acquiring raftMu in Replica.propose #24990

Merged

bdarnell force-pushed the prevote-cleanup branch from 0cce1b6 to bfa2bfb Compare April 23, 2018 19:18

bdarnell force-pushed the prevote-cleanup branch from bfa2bfb to 19bcdb0 Compare May 2, 2018 17:40

bdarnell requested a review from a team May 2, 2018 17:40

bdarnell added 5 commits May 2, 2018 14:06

storage: Remove COCKROACH_ENABLE_PREVOTE

43e76a0

PreVote will be the only option in 2.1.
Release note: None

storage: Remove Store.canCampaignIdleReplica

2b9c931

With PreVote, it is less disruptive to campaign unnecessarily, so there's no need for this additional check.
Release note: None

storage: Default enableTickQuiesced to false

53a60f5

With the change to automatically campaign when unquiescing, this should no longer be necessary. The option will be removed in a subsequent change.
Release note: None

bdarnell force-pushed the prevote-cleanup branch from 19bcdb0 to 53a60f5 Compare May 2, 2018 18:07

craig bot merged commit 53a60f5 into cockroachdb:master May 2, 2018

a-robinson mentioned this pull request Jun 4, 2018

storage: very slow restart of local cluster #26391

Closed
