Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: make queue timeouts controllable, snapshot sending queues dynamic #44809

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Feb 6, 2020

In #42686 we made the raft snapshot queue timeout dynamic and based on the
size of the snapshot being sent. We also added an escape hatch to control
the timeout of processing of that queue. This change generalizes that
cluster setting to apply to all of the queues.

It so happens that the replicate queue and the merge queue also sometimes
need to send snapshots. This PR gives them similar treatment to the
raft snapshot queue.

The previous cluster setting was never released and is reserved so it does not
need a release note.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner requested a review from knz February 6, 2020 15:15
@ajwerner
Copy link
Contributor Author

ajwerner commented Feb 6, 2020

@knz this relates to bad behavior we observed in a cluster this week. I'd like to backport both this and #42686 to 19.1 and 19.2.

@ajwerner
Copy link
Contributor Author

ajwerner commented Feb 6, 2020

Flaked on #44789

@ajwerner ajwerner force-pushed the ajwerner/make-merge-queue-have-a-dynamic-timeout branch from 7e71b63 to e1d39d0 Compare February 6, 2020 18:33
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit on terminology. Code LGTM otherwise.

What's our testing story around this code? I am surprised to see no test.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/merge_queue.go, line 106 at r1 (raw file):

			// sometimes the merge queue needs to send multiple snapshots so perhaps
			// it would want even more time than the snapshot queue. That being said
			// this timeout give leeway for snapshots to be 10x slower than the

This sentence is either too long or ungrammatical, I don't understand it.


pkg/storage/queue.go, line 52 at r1 (raw file):

// queue will time out.
var queueMinimumTimeout = settings.RegisterDurationSetting(
	// NB: this setting has a relatively awkward name because the linter does not

Neither timeout_minimum nor minimum_timeout are good names. The linter is trying to tell you something. :)

Your description says it - guaranteed latency budget (time allotment without a timeout): kv.queue.process.guaranteed_latency_budget.

(English semantics: generally the word "minimum" is not clear if you don't specify what it is the minimum of - there needs to be multiple values to start with.)

@ajwerner ajwerner force-pushed the ajwerner/make-merge-queue-have-a-dynamic-timeout branch 3 times, most recently from d684131 to 7bc07d5 Compare February 6, 2020 23:28
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making write the tests. It needed them. Added a second commit to ensure that our rate settings are always positive. We would have had divide by zero panics both in this code and in the code that used them if they were set to 0.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/storage/merge_queue.go, line 106 at r1 (raw file):

Previously, knz (kena) wrote…

This sentence is either too long or ungrammatical, I don't understand it.

I reworked this comment.


pkg/storage/queue.go, line 52 at r1 (raw file):

Previously, knz (kena) wrote…

Neither timeout_minimum nor minimum_timeout are good names. The linter is trying to tell you something. :)

Your description says it - guaranteed latency budget (time allotment without a timeout): kv.queue.process.guaranteed_latency_budget.

(English semantics: generally the word "minimum" is not clear if you don't specify what it is the minimum of - there needs to be multiple values to start with.)

See how you feel about this.

@ajwerner ajwerner force-pushed the ajwerner/make-merge-queue-have-a-dynamic-timeout branch from 7bc07d5 to a66e072 Compare February 6, 2020 23:59
Before this commit we would permit rate cluster settings to be set to
non-positive values. This would have caused crashes had anybody dared
to try.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/make-merge-queue-have-a-dynamic-timeout branch 3 times, most recently from dd10f60 to 351c2b5 Compare February 10, 2020 06:03
…namic

In cockroachdb#42686 we made the raft snapshot queue timeout dynamic and based on the
size of the snapshot being sent. We also added an escape hatch to control
the timeout of processing of that queue. This change generalizes that
cluster setting to apply to all of the queues.

It so happens that the replicate queue and the merge queue also sometimes
need to send snapshots. This PR gives them similar treatment to the
raft snapshot queue.

The previous cluster setting was never released and is reserved so it does not
need a release note.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/make-merge-queue-have-a-dynamic-timeout branch from 351c2b5 to 72ae564 Compare February 10, 2020 06:07
@ajwerner
Copy link
Contributor Author

@knz friendly ping

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 9 files at r4, 7 of 7 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@ajwerner
Copy link
Contributor Author

TFTR!

bors r=knz

craig bot pushed a commit that referenced this pull request Feb 10, 2020
44809: storage: make queue timeouts controllable, snapshot sending queues dynamic r=knz a=ajwerner

In #42686 we made the raft snapshot queue timeout dynamic and based on the
size of the snapshot being sent. We also added an escape hatch to control
the timeout of processing of that queue. This change generalizes that
cluster setting to apply to all of the queues.

It so happens that the replicate queue and the merge queue also sometimes
need to send snapshots. This PR gives them similar treatment to the
raft snapshot queue.

The previous cluster setting was never released and is reserved so it does not
need a release note.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Feb 10, 2020

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants