Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Simplify raft automatic campaigning after PreVote #24920

Merged
merged 5 commits into from
May 2, 2018

Conversation

bdarnell
Copy link
Contributor

Before we implemented PreVote, we had various heuristics to decide when we should ask raft to campaign (bypassing the usual timeout). Since PreVote has reduced the cost of raft elections (by ensuring that a node that calls for an election it can't win doesn't disrupt its peers), we can get by with simpler logic.

In addition to simplifying the logic, this PR introduces a new campaign trigger when a range unquiesces. This is a prerequisite for getting rid of the TickQuiesced hack (which is disabled by default in this PR and will be removed in a future one).

Fixes #18365

@bdarnell bdarnell requested review from tbg and a team April 18, 2018 23:59
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member

tbg commented Apr 23, 2018

:lgtm:, but as @petermattis pointed out we should merge this when the alpha sha has been picked.


Reviewed 2 of 2 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, 1 of 1 files at r4.
Review status: 3 of 4 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/storage/replica.go, line 548 at r1 (raw file):

		// creation if we gossiped our store descriptor more than the election
		// timeout in the past.
		shouldCampaignOnCreation = (r.mu.internalRaftGroup == nil) && r.store.canCampaignIdleReplica()

The Raft groups are still created lazily, and in a large enough cluster most ranges are going to be quiesced, so in that scenario there won't be a "pre-election storm" that can cause latency blips in the foreground workload on the cluster. Correct?


pkg/storage/replica.go, line 599 at r4 (raw file):

	// If we're already campaigning or know who the leader is, don't
	// start a new term.
	status := r.mu.internalRaftGroup.Status()

The method comment suggests that this needs a nil check. Is the Raft group populated in the new callsites as well? Might be worth adjusting the comment.


Comments from Reviewable

@bdarnell
Copy link
Contributor Author

Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 548 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The Raft groups are still created lazily, and in a large enough cluster most ranges are going to be quiesced, so in that scenario there won't be a "pre-election storm" that can cause latency blips in the foreground workload on the cluster. Correct?

Right. The fact that most ranges are quiesced doesn't really matter. The key factors in preventing storms are that the ranges are created lazily and that unquiescing a range will not usually cause a campaign because the previous leaseholder is still alive so it is presumed to still be the leader.


pkg/storage/replica.go, line 599 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The method comment suggests that this needs a nil check. Is the Raft group populated in the new callsites as well? Might be worth adjusting the comment.

Replica.RaftStatus() may return nil. internalRaftGroup.Status() never will (if irg is nil, Status will panic instead of returning nil). I think irg is guaranteed to be initialized at all call sites of this method (if not, it'll crash in Campaign too, not just Status).


Comments from Reviewable

@tbg
Copy link
Member

tbg commented Apr 23, 2018

Reviewed 2 of 2 files at r5, 1 of 1 files at r6, 1 of 1 files at r7.
Review status: 3 of 4 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@bdarnell bdarnell requested a review from a team May 2, 2018 17:40
PreVote will be the only option in 2.1

Release note: None
With PreVote, it is less disruptive to campaign unnecessarily, so
there's no need for this additional check.

Release note: None
Fold the necessary checks into withRaftGroupLocked and remove
unnecessary arguments. This has the effect of campaigning somewhat
more than before but that's OK since PreVote minimizes the disruption.

Fixes cockroachdb#18365

Release note: None
This removes the time-to-recovery penalty for disabling TickQuiesced.

Note that the new maybeCampaignOnWakeLocked method is not a verbatim
copy of the code that was moved from withRaftGroupLocked; new
conditions were added to avoid unnecessary campaigns since this method
can be called more than before.

Release note: None
With the change to automatically campaign when unquiescing, this
should no longer be necessary. The option will be removed in a
subsequent change.

Release note: None
@bdarnell
Copy link
Contributor Author

bdarnell commented May 2, 2018

bors r+

craig bot pushed a commit that referenced this pull request May 2, 2018
24920: storage: Simplify raft automatic campaigning after PreVote r=bdarnell a=bdarnell

Before we implemented PreVote, we had various heuristics to decide when we should ask raft to campaign (bypassing the usual timeout). Since PreVote has reduced the cost of raft elections (by ensuring that a node that calls for an election it can't win doesn't disrupt its peers), we can get by with simpler logic. 

In addition to simplifying the logic, this PR introduces a new campaign trigger when a range unquiesces. This is a prerequisite for getting rid of the TickQuiesced hack (which is disabled by default in this PR and will be removed in a future one). 

Fixes #18365

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented May 2, 2018

Build succeeded

@craig craig bot merged commit 53a60f5 into cockroachdb:master May 2, 2018
craig bot pushed a commit that referenced this pull request May 21, 2018
24956: storage: Maintain a separate set of unquiesced replicas r=petermattis a=bdarnell

This means that idle replicas no longer have a per-tick CPU cost,
which is one of the bottlenecks limiting the amount of data we can
handle per store.

Fixes #17609

Release note (performance improvement): Reduced CPU overhead of idle
ranges


The first five commits are from #24920; that PR should be merged and tested in isolation first. 

25735: sql: fix null normalization r=RaduBerinde a=RaduBerinde

The normalization rules are happy to convert `NULL::TEXT` to `NULL`.
While both expressions evaluate to `DNull`, the `ResolvedType()` is
different. It seems unsound for normalization to change the type.

This issue is shown by trying to run a query containing
`ARRAY_AGG(NULL::TEXT)` through distsql planning: by the time the
distsql planner looks at it, the `NULL::TEXT` is just `DNull` (with
the `Unknown` type) and the distsql planner cannot find the builtin.

This change fixes the normalization rules by retaining the cast in
this case. In general, any expression that statically evaluates to
NULL gets a cast to the original expression type. The same is done in
the opt execbuilder.

Fixes #25724.

Release note (bug fix): Fixed query errors in some cases involving a
NULL constant that is cast to a specific type.

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants