kv: lowering the range size can cause unavailability #46271

Closed
ajwerner opened this issue Mar 18, 2020 · 6 comments · Fixed by #46863
@ajwerner
Contributor

Describe the problem

When a range (well, most ranges) gets too large (2x the range_max_bytes dictated by the zone config), we apply backpressure to writes and block them until a split occurs. In the normal course of business, when the growth is due to write traffic to the range, this is reasonable: we expect the situation to be quite rare, since once a range exceeds range_max_bytes by even a bit we'd expect a split to happen soon. A common case where this backpressure does occur is when a range is unsplittable because all of its data lives under a single key.

There is, however, a case where this backpressure is particularly bad: lowering range_max_bytes for a range. Imagine you have a large number of ranges at the new default range size of 512 MiB and you lower it to the old default of 64 MiB. At that point, all writes to those ranges will be blocked until each range splits twice!
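For concreteness, the size check is roughly of the following shape. This is only a minimal sketch of the behavior described above, not the actual implementation; the function name is made up, and the 2x factor mirrors the multiplier mentioned in this issue.

```go
package main

import "fmt"

// backpressureRangeSizeMultiplier mirrors the 2x factor described above:
// writes are blocked once a range grows past this multiple of the zone
// config's range_max_bytes.
const backpressureRangeSizeMultiplier = 2

// shouldBackpressureWrites is a hypothetical helper, not CockroachDB code.
func shouldBackpressureWrites(rangeSizeBytes, rangeMaxBytes int64) bool {
	return rangeSizeBytes > backpressureRangeSizeMultiplier*rangeMaxBytes
}

func main() {
	const mib = 1 << 20
	// A 520 MiB range under a 512 MiB limit is fine; the same range under a
	// 64 MiB limit is far past 2x the limit, so its writes would be blocked.
	fmt.Println(shouldBackpressureWrites(520*mib, 512*mib)) // false
	fmt.Println(shouldBackpressureWrites(520*mib, 64*mib))  // true
}
```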

To Reproduce

Create a large number of ranges at the new default range size of 512 MiB, then lower the range size to the old default of 64 MiB. At that point, all writes to those ranges will be blocked until each range splits twice.
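Concretely, the zone config change looks something like the sketch below. The connection string, driver, and exact byte values are assumptions for illustration, not taken from this issue.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // PostgreSQL-wire driver; CockroachDB speaks pgwire
)

func main() {
	// Assumes a locally running cluster on the default port.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// With ranges already grown toward the 512 MiB default, lower the default
	// zone config back to the old 64 MiB maximum. Existing ranges are now well
	// over 2x the new range_max_bytes, so their writes are backpressured until
	// enough splits complete.
	if _, err := db.Exec(`ALTER RANGE default CONFIGURE ZONE USING
		range_min_bytes = 16777216, range_max_bytes = 67108864`); err != nil {
		log.Fatal(err)
	}
}
```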

Expected behavior

Lowering range_max_bytes causes graceful splitting rather than blocked writes.

Additional context

What's also pretty terrible is that this impacts the timeseries data. Check out the part of the graph below where no timeseries data was written:

[screenshot: timeseries graph with a gap where no data was written during the backpressure]

ajwerner added the A-kv-replication (Relating to Raft, consensus, and coordination) and A-zone-configs labels on Mar 18, 2020
@ajwerner
Contributor Author

We have some options here:

A) Disable all size-based backpressure. In the course of increasing the default range size we've explored the problems with ranges getting too large, and we've now ensured that O(range size) operations no longer buffer anything in RAM and generally carry timeouts proportional to that size.
B) Instead of this harsh backpressure mechanism, just inject latency when the range is too large, an alternative we've discussed before (see the sketch at the end of this comment).
C) Carry some history of the zone configuration in the range state and disable the backpressure if the size was reasonable under a recent zone config.

  • This one is appealing but also complex.

Anybody have preferences or other ideas?
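For illustration, option (B) might look something like the following sketch. The shape, names, and constants here are all hypothetical, not an existing mechanism.

```go
package main

import (
	"fmt"
	"time"
)

// writeDelay sketches option (B): rather than blocking a write outright once
// a range is past 2x its configured maximum, delay it in proportion to how
// far over the threshold the range is.
func writeDelay(rangeSizeBytes, rangeMaxBytes int64) time.Duration {
	const multiplier = 2
	const maxDelay = 5 * time.Second
	threshold := multiplier * rangeMaxBytes
	if rangeSizeBytes <= threshold {
		return 0
	}
	// Scale the delay with the fractional overage, capping it so an
	// oversized range still makes (slow) progress instead of stalling.
	over := float64(rangeSizeBytes-threshold) / float64(threshold)
	delay := time.Duration(over * float64(time.Second))
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}

func main() {
	const mib = 1 << 20
	// A 512 MiB range under a 64 MiB limit is 3x over the 128 MiB threshold.
	fmt.Println(writeDelay(512*mib, 64*mib)) // 3s
}
```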

@andreimatei
Contributor

How about we do C, but really crudely: gossip something any time a range size is decreased, and then don't backpressure for 24h after such a change (anywhere in the cluster).
I'd hope there's no race between finding out about the new zone config and finding out that you don't want to backpressure right away, because both pieces of info are gossiped.

@ajwerner
Contributor Author

> How about we do C, but really crudely: gossip something any time a range size is decreased, and then don't backpressure for 24h after such a change (anywhere in the cluster).
> I'd hope there's no race between finding out about the new zone config and finding out that you don't want to backpressure right away, because both pieces of info are gossiped.

The problem with this approach, and all approaches like it, is that it won't survive a node restart. We don't need to gossip anything separately; we could hook the disablement into the update to the zone config itself.


D) (from @tbg) Disable backpressure if the range is strictly larger than the size at which backpressure would have started. As in, “don’t backpressure if backpressure wouldn’t have let the range get its current size in the first place.”

This option is pretty compelling to me. It seems like generally it should mitigate the problem.

@andreimatei
Contributor

D) sounds kinda fragile because only some writes are backpressured? So if a non-backpressured write happens to take the range over the limit, then the range gets to run wild?
If that's unfounded, it sounds pretty good to me.

@ajwerner
Contributor Author

> So if a non-backpressured write happens to take the range over the limit, then the range gets to run wild?
> If that's unfounded, it sounds pretty good to me.

If we did it such that a single byte would change the behavior, then you're right. Perhaps we only disable the backpressure once the current range size exceeds the current backpressureRangeSizeMultiplier * the configured range size by some amount? Maybe something like 512 KiB, behind a non-public cluster setting? The non-backpressured writes tend not to be gigantic.
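Putting (D) and the tolerance together, the check would be roughly as follows. backpressureRangeSizeMultiplier and backpressureByteTolerance are the names used in this thread, but the code is only a sketch of the idea, not the change that eventually merged.

```go
package main

import "fmt"

const (
	backpressureRangeSizeMultiplier = 2         // backpressure starts at 2x range_max_bytes
	backpressureByteTolerance       = 512 << 10 // 512 KiB, per the suggestion above
)

// shouldBackpressureWrites sketches the combined idea: backpressure only while
// the range is over the threshold by no more than the tolerance. If it is over
// by more than that, backpressure could not have let it get this big, so the
// limit was presumably lowered and writes are let through while splits catch up.
func shouldBackpressureWrites(rangeSizeBytes, rangeMaxBytes int64) bool {
	threshold := backpressureRangeSizeMultiplier * rangeMaxBytes
	bytesOver := rangeSizeBytes - threshold
	return bytesOver > 0 && bytesOver <= backpressureByteTolerance
}

func main() {
	const mib = 1 << 20
	// 100 KiB over the 128 MiB threshold: still backpressured.
	fmt.Println(shouldBackpressureWrites(2*64*mib+(100<<10), 64*mib)) // true
	// Far over the threshold: the limit must have been lowered, so let writes through.
	fmt.Println(shouldBackpressureWrites(512*mib, 64*mib)) // false
}
```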

@ajwerner
Contributor Author

There is probably a pathological case here where a loop of abort spans or lease transfers pushes up the range size and then the range gets to run wild but I'm not all that concerned about that.

ajwerner self-assigned this on Mar 18, 2020
craig (bot) closed this as completed in c36f7cb on Apr 3, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 8, 2020
This commit increases backpressureByteTolerance in order to make it less
fragile in the face of large concurrent writes. The previous, smaller value
was chosen prior to implementing the additional mitigations in cockroachdb#46271.

Release note: None