kv: lowering the range size can cause unavailability #46271

Closed
ajwerner opened this issue Mar 18, 2020 · 6 comments · Fixed by #46863
@ajwerner
Contributor

Describe the problem

When a range (well, most ranges) gets too large (2x the range_max_bytes dictated by the zone config), we apply backpressure to writes and block them until a split occurs. In the normal course of business, when the growth is due to write traffic to the range, this is reasonable: we expect the situation to be quite rare, since once a range exceeds range_max_bytes by even a bit we'd expect a split to happen soon. A common case where this backpressure does occur is when a range is unsplittable because all of its data lives under a single key.

There is, however, a case where this backpressure is particularly bad: lowering range_max_bytes for a range. Imagine you have a large number of ranges at the new default range size of 512 MiB and you lower it to the old default of 64 MiB. At that point, all writes to those ranges will be blocked until each range splits twice!
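For concreteness, the size check is roughly of the following shape. This is only a minimal sketch of the behavior described above, not the actual implementation; the function name is made up, and the 2x factor mirrors the multiplier mentioned in this issue.

```go
package main

import "fmt"

// backpressureRangeSizeMultiplier mirrors the 2x factor described above:
// writes are blocked once a range grows past this multiple of the zone
// config's range_max_bytes.
const backpressureRangeSizeMultiplier = 2

// shouldBackpressureWrites is a hypothetical helper, not CockroachDB code.
func shouldBackpressureWrites(rangeSizeBytes, rangeMaxBytes int64) bool {
	return rangeSizeBytes > backpressureRangeSizeMultiplier*rangeMaxBytes
}

func main() {
	const mib = 1 << 20
	// A 520 MiB range under a 512 MiB limit is fine; the same range under a
	// 64 MiB limit is far past 2x the limit, so its writes would be blocked.
	fmt.Println(shouldBackpressureWrites(520*mib, 512*mib)) // false
	fmt.Println(shouldBackpressureWrites(520*mib, 64*mib))  // true
}
```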

To Reproduce

Create a large number of ranges at the new default range size of 512 MiB, then lower the range size to the old default of 64 MiB. At that point, all writes to those ranges will be blocked until each range splits twice.
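Concretely, the zone config change looks something like the sketch below. The connection string, driver, and exact byte values are assumptions for illustration, not taken from this issue.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // PostgreSQL-wire driver; CockroachDB speaks pgwire
)

func main() {
	// Assumes a locally running cluster on the default port.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// With ranges already grown toward the 512 MiB default, lower the default
	// zone config back to the old 64 MiB maximum. Existing ranges are now well
	// over 2x the new range_max_bytes, so their writes are backpressured until
	// enough splits complete.
	if _, err := db.Exec(`ALTER RANGE default CONFIGURE ZONE USING
		range_min_bytes = 16777216, range_max_bytes = 67108864`); err != nil {
		log.Fatal(err)
	}
}
```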

Expected behavior

Lowering range_max_bytes causes graceful splitting rather than blocked writes.

Additional context

What's also pretty terrible is that this impacts the timeseries data. Check out the part of the graph below where no timeseries data was written:

[screenshot: timeseries graph with a gap where no data was written during the backpressure]

ajwerner added the A-kv-replication (Relating to Raft, consensus, and coordination) and A-zone-configs labels on Mar 18, 2020
@ajwerner
Contributor Author

We have some options here:

A) Disable all size-based backpressure. In the course of increasing the default range size we've explored the problems with ranges getting too large, and we've now ensured that O(range size) operations no longer buffer anything in RAM and generally carry timeouts proportional to that size.
B) Instead of this harsh backpressure mechanism, just inject latency when the range is too large, an alternative we've discussed before (see the sketch at the end of this comment).
C) Carry some history of the zone configuration in the range state and disable the backpressure if the size was reasonable under a recent zone config.

  • This one is appealing but also complex.

Anybody have preferences or other ideas?
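For illustration, option (B) might look something like the following sketch. The shape, names, and constants here are all hypothetical, not an existing mechanism.

```go
package main

import (
	"fmt"
	"time"
)

// writeDelay sketches option (B): rather than blocking a write outright once
// a range is past 2x its configured maximum, delay it in proportion to how
// far over the threshold the range is.
func writeDelay(rangeSizeBytes, rangeMaxBytes int64) time.Duration {
	const multiplier = 2
	const maxDelay = 5 * time.Second
	threshold := multiplier * rangeMaxBytes
	if rangeSizeBytes <= threshold {
		return 0
	}
	// Scale the delay with the fractional overage, capping it so an
	// oversized range still makes (slow) progress instead of stalling.
	over := float64(rangeSizeBytes-threshold) / float64(threshold)
	delay := time.Duration(over * float64(time.Second))
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}

func main() {
	const mib = 1 << 20
	// A 512 MiB range under a 64 MiB limit is 3x over the 128 MiB threshold.
	fmt.Println(writeDelay(512*mib, 64*mib)) // 3s
}
```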

@andreimatei
Contributor

How about we do C, but really crudely: gossip something any time a range size is decreased, and then don't backpressure for 24h after such a change (anywhere in the cluster).
I'd hope there's no race between finding out about the new zone config and finding out that you don't want to backpressure right away, because both pieces of info are gossiped.

@ajwerner
Contributor Author

> How about we do C, but really crudely: gossip something any time a range size is decreased, and then don't backpressure for 24h after such a change (anywhere in the cluster).
> I'd hope there's no race between finding out about the new zone config and finding out that you don't want to backpressure right away, because both pieces of info are gossiped.

The problem with this approach, and all approaches like it, is that it won't survive a node restart. We don't need to gossip anything separately; we could hook the disablement into the update to the zone config itself.


D) (from @tbg) Disable backpressure if the range is strictly larger than the size at which backpressure would have started. As in, “don’t backpressure if backpressure wouldn’t have let the range get its current size in the first place.”

This option is pretty compelling to me. It seems like generally it should mitigate the problem.

@andreimatei
Contributor

D) sounds kinda fragile because only some writes are backpressured? So if a non-backpressured write happens to take the range over the limit, then the range gets to run wild?
If that's unfounded, it sounds pretty good to me.

@ajwerner
Contributor Author

> So if a non-backpressured write happens to take the range over the limit, then the range gets to run wild?
> If that's unfounded, it sounds pretty good to me.

If we did it such that a single byte would change the behavior, then you're right. Perhaps we only disable the backpressure once the current range size exceeds the current backpressureRangeSizeMultiplier * the configured range size by some amount? Maybe something like 512 KiB, behind a non-public cluster setting? The non-backpressured writes tend not to be gigantic.
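Putting (D) and the tolerance together, the check would be roughly as follows. backpressureRangeSizeMultiplier and backpressureByteTolerance are the names used in this thread, but the code is only a sketch of the idea, not the change that eventually merged.

```go
package main

import "fmt"

const (
	backpressureRangeSizeMultiplier = 2         // backpressure starts at 2x range_max_bytes
	backpressureByteTolerance       = 512 << 10 // 512 KiB, per the suggestion above
)

// shouldBackpressureWrites sketches the combined idea: backpressure only while
// the range is over the threshold by no more than the tolerance. If it is over
// by more than that, backpressure could not have let it get this big, so the
// limit was presumably lowered and writes are let through while splits catch up.
func shouldBackpressureWrites(rangeSizeBytes, rangeMaxBytes int64) bool {
	threshold := backpressureRangeSizeMultiplier * rangeMaxBytes
	bytesOver := rangeSizeBytes - threshold
	return bytesOver > 0 && bytesOver <= backpressureByteTolerance
}

func main() {
	const mib = 1 << 20
	// 100 KiB over the 128 MiB threshold: still backpressured.
	fmt.Println(shouldBackpressureWrites(2*64*mib+(100<<10), 64*mib)) // true
	// Far over the threshold: the limit must have been lowered, so let writes through.
	fmt.Println(shouldBackpressureWrites(512*mib, 64*mib)) // false
}
```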

@ajwerner
Contributor Author

There is probably a pathological case here where a loop of abort spans or lease transfers pushes up the range size and then the range gets to run wild but I'm not all that concerned about that.

ajwerner self-assigned this on Mar 18, 2020
craig (bot) closed this as completed in c36f7cb on Apr 3, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 8, 2020
This commit increases backpressureByteTolerance in order to make it less
fragile in the face of large concurrent writes. The previous, smaller value
was chosen prior to implementing the additional mitigations in cockroachdb#46271.

Release note: None