
storage: merge less aggressively after load-based splitting #41317

Closed
nvanbenschoten opened this issue Oct 4, 2019 · 3 comments · Fixed by #50151
Assignees: nvanbenschoten
Labels: A-kv-distribution (Relating to rebalancing and leasing.), S-3-ux-surprise (Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.)

Comments

nvanbenschoten (Member) commented Oct 4, 2019

We're currently very aggressive about merging away split points created by load-based splitting immediately after the load is shut off. This can be an issue for benchmarking because splits created during a ramp-up period are often merged away before the workload begins running, so the system is not immediately ready for the workload when it does.

Here's an example where load-based splits are merged away almost immediately, which isn't desirable.
[Screenshot: Screen Shot 2019-10-04 at 1 02 45 AM]

We can probably delay these merges by O(minutes) and still be completely fine.

nvanbenschoten added the S-3-ux-surprise and A-kv-distribution labels Oct 4, 2019
nvanbenschoten added this to Incoming in KV via automation Oct 4, 2019
tbg (Member) commented Oct 4, 2019

Are you saying that these splits should use a larger sticky bit?

tbg (Member) commented Oct 4, 2019

Hmm, no, you're not saying that. Even if a split happened 10 hours ago and the load dropped just now, you want to hold off on the merge. It sounds like it could be awkward to achieve that.

nvanbenschoten (Member, Author) commented:

Well, I think using a sticky bit would be a quick way to improve what we have here, even if it doesn't get us all the way there. Ideally, though, there would be some memory about historical load on a range.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 12, 2020
Fixes cockroachdb#41317.

In cockroachdb#41317, we saw that we are currently very aggressive about merging
away split points due to load-based splitting immediately after the load
is shut off. This can be an issue for benchmarking because splits
created during a ramp-up period are often merged away before the
workload begins running. This means that the system is not immediately
ready for the workload when it does begin running. I'm seeing this again
while automating the YCSB benchmark suite.

While it would be nice to maintain some load history for each range and
use that to avoid merging split points created due to load, that doesn't
seem like something we intend to do anytime soon. Meanwhile, we have
since fully stabilized the "sticky bit" on range descriptors, which
allows us to arbitrarily delay merges of a split point.

To make some progress here (~80% of the way), we introduce a new
`kv.range_split.by_load_merge_delay` cluster setting, which controls the
delay that range splits created due to load will wait before being
considered for a merge. This setting is used to set the ExpirationTime on
load-based splits. Its value defaults to 5 minutes.

Release note (performance improvement): Range merges are now delayed for
a short amount of time after load-based splitting to prevent load-based
split points from being merged away immediately after load is removed.
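
For illustration, here is a minimal, self-contained Go sketch of the mechanism the commit describes. The names are hypothetical and this is not CockroachDB's actual settings or descriptor API; it only shows how a configurable merge delay can translate into a sticky-bit expiration that the merge queue respects.

    // Hypothetical sketch: a configurable merge delay becomes an expiration
    // timestamp ("sticky bit") on a load-based split point; the merge queue
    // refuses to merge the split point away until the expiration has passed.
    package main

    import (
        "fmt"
        "time"
    )

    // splitByLoadMergeDelay stands in for the kv.range_split.by_load_merge_delay
    // cluster setting; the default described in the commit is 5 minutes.
    const splitByLoadMergeDelay = 5 * time.Minute

    // stickyBitFor returns the expiration to attach to a split triggered by
    // load at time now.
    func stickyBitFor(now time.Time) time.Time {
        return now.Add(splitByLoadMergeDelay)
    }

    // canMerge reports whether a split point whose sticky bit expires at exp
    // may be merged away at time now.
    func canMerge(now, exp time.Time) bool {
        return now.After(exp)
    }

    func main() {
        splitTime := time.Now()
        exp := stickyBitFor(splitTime)
        fmt.Println(canMerge(splitTime, exp))                    // false: load just stopped
        fmt.Println(canMerge(splitTime.Add(6*time.Minute), exp)) // true: delay has elapsed
    }

Since the delay is exposed as a cluster setting, it can presumably be raised for workloads with long ramp-up periods, e.g. `SET CLUSTER SETTING kv.range_split.by_load_merge_delay = '10m';`.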
nvanbenschoten self-assigned this Jun 12, 2020
craig bot pushed a commit that referenced this issue Jun 15, 2020
50070: opt: add GeneratePartialIndexScans rule r=mgartner a=mgartner

This commit introduces a new exploration rule,
GeneratePartialIndexScans. This rule generates unconstrained Scans over
partial indexes when a Select filter implies the partial index
predicate.

For example, consider the table and query:

    CREATE TABLE t (
      i INT,
      s STRING,
      INDEX pidx (i) STORING (s) WHERE s = 'foo'
    )

    SELECT i FROM t WHERE s = 'foo'

GeneratePartialIndexScans will generate a scan over the partial index:

    project
     ├── columns: i:1
     └── scan t@pidx
          └── columns: i:1 s:3!null

It is capable of handling cases where the index does not cover the
selected columns, and where additional filtering is required after the
partial index scan. For example:

    CREATE TABLE t (
      i INT,
      s STRING,
      f FLOAT,
      INDEX pidx (i) WHERE s = 'foo'
    )

    SELECT i FROM t WHERE s = 'foo' AND f = 2.5

In this case, GeneratePartialIndexScans will generate the following
expression with an added Select and IndexJoin:

    project
     ├── columns: i:1
     └── index-join t
          ├── columns: i:1 s:2!null f:3!null
          └── select
               ├── columns: i:1 rowid:4!null
               ├── scan t@pidx
               │    └── columns: i:1 rowid:4!null
               └── filters
                    └── f:3 = 2.5

Fixes #50232 

Release note: None

50151: kv: delay range merges after load based splits r=nvanbenschoten a=nvanbenschoten

Fixes #41317.

In #41317, we saw that we are currently very aggressive about merging
away split points due to load-based splitting immediately after the load
is shut off. This can be an issue for benchmarking because splits
created during a ramp-up period are often merged away before the
workload begins running. This means that the system is not immediately
ready for the workload when it does begin running. I'm seeing this again
while automating the YCSB benchmark suite.

While it would be nice to maintain some load history for each range and
use that to avoid merging split points created due to load, that doesn't
seem like something we intend to do anytime soon. Meanwhile, we have
since fully stabilized the "sticky bit" on range descriptors, which
allows us to arbitrarily delay merges of a split point.

To make some progress here (~80% of the way), we introduce a new
`kv.range_split.by_load_merge_delay` cluster setting, which controls the
delay that range splits created due to load will wait before being
considered for a merge. This setting is used to set the ExpirationTime on
load-based splits. Its value defaults to 5 minutes.

Release note (performance improvement): Range merges are now delayed for
a short amount of time after load-based splitting to prevent load-based
split points from being merged away immediately after load is removed.

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in f0bad68 Jun 15, 2020
KV automation moved this from Incoming to Closed Jun 15, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 26, 2021
Closes cockroachdb#62700.
Re-addresses cockroachdb#41317.

This commit reworks how queries-per-second measurements are used when
determining whether to merge two ranges together. At a high level, the change
moves from a scheme where the QPS over the last second on the LHS and RHS ranges
are combined and compared against a threshold (half the load-based split
threshold) to a scheme where the maximum QPS measured over the past 5 minutes
(configurable) on the LHS and RHS ranges are combined and compared against said
threshold.

The commit makes this change to avoid thrashing and to avoid overreacting to
temporary fluctuations in load. These overreactions lead to general instability
in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead
to cluster-wide meltdowns where a transient slowdown can trigger a wave of range
merges, which can slow the cluster down further, which can lead to more merges,
etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and
it is even worse on large ones, where range merges don't just interrupt traffic,
but also result in a centralization of load in a previously well-distributed
dataset, undoing all of the hard work of load-based splitting and rebalancing
and creating serious hotspots.

The commit improves this situation by introducing a form of memory into the
load-based split `Decider`. This is the object which was previously only
responsible for measuring queries-per-second on a range and triggering the
process of finding a load-based split point. The object is now given an
additional role of taking the second-long QPS samples that it measures and
aggregating them together to track the maximum historical QPS over a
configurable retention period. This maximum QPS measurement can be used to
prevent load-based splits from being merged away until the resulting ranges have
consistently remained below a certain QPS threshold for a sufficiently long
period of time.

The `mergeQueue` is taught how to use this new source of information. It is also
taught that it should be conservative about imprecision in this QPS tracking,
opting to skip a merge rather than perform one when the maximum QPS measurement
has not been tracked for long enough. This means that range merges will
typically no longer fire within 5 minutes of a lease transfer. This seems fine,
as there are almost never situations where a range merge is desperately needed
and we should risk making a bad decision in order to perform one.

I've measured this change on the `clearrange` roachtest that we made heavy use
of in cockroachdb#62700. As expected, it has the same effect as bumping up the
`kv.range_split.by_load_merge_delay` high enough such that ranges never merge on
the active table. Here's a screenshot of a recent run. We still see a period of
increased tail latency and reduced throughput, which has a strong correlation
with Pebble compactions. However, we no longer see the subsequent cluster outage
that used to follow, where ranges on the active table would begin to merge and
throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

<todo insert images>

Release note (performance improvement): Range merges are no longer triggered if
a range has seen significant load over the previous 5 minutes, instead of only
considering the last second. This improves stability, as load-based splits will
no longer rapidly disappear during transient throughput dips.
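
As a rough illustration of the approach described above, here is a self-contained Go sketch. The names are hypothetical (this is not the actual `Decider` or `mergeQueue` code); it shows the two ingredients: tracking the maximum QPS over a retention window, and declining to sanction a merge until a full window of history has been collected.

    // Hypothetical sketch: retain per-second QPS samples for a retention
    // window (5 minutes in the commit's description), expose the windowed
    // maximum, and report whether enough history exists to trust it.
    package main

    import (
        "fmt"
        "time"
    )

    type qpsSample struct {
        at  time.Time
        qps float64
    }

    type maxQPSTracker struct {
        retention time.Duration
        samples   []qpsSample
        firstSeen time.Time
    }

    func (t *maxQPSTracker) record(now time.Time, qps float64) {
        if t.firstSeen.IsZero() {
            t.firstSeen = now
        }
        t.samples = append(t.samples, qpsSample{at: now, qps: qps})
        // Drop samples that have aged out of the retention window.
        cutoff := now.Add(-t.retention)
        for len(t.samples) > 0 && t.samples[0].at.Before(cutoff) {
            t.samples = t.samples[1:]
        }
    }

    // maxQPS returns the maximum QPS over the window and whether the tracker
    // has been collecting for a full retention period. When ok is false, a
    // conservative merge queue skips the merge rather than risking one.
    func (t *maxQPSTracker) maxQPS(now time.Time) (max float64, ok bool) {
        if t.firstSeen.IsZero() || now.Sub(t.firstSeen) < t.retention {
            return 0, false
        }
        for _, s := range t.samples {
            if s.qps > max {
                max = s.qps
            }
        }
        return max, true
    }

    func main() {
        tr := &maxQPSTracker{retention: 5 * time.Minute}
        start := time.Now()
        for i := 0; i < 6*60; i++ { // six minutes of per-second samples
            qps := 100.0
            if i > 5*60 { // load drops during the final minute
                qps = 1.0
            }
            tr.record(start.Add(time.Duration(i)*time.Second), qps)
        }
        max, ok := tr.maxQPS(start.Add(6 * time.Minute))
        fmt.Println(max, ok) // 100 true: the windowed max is still high, so no merge
    }

The conservative default is the important part: with insufficient history the tracker reports ok == false, mirroring the commit's choice to skip a merge rather than make a potentially bad decision right after a lease transfer.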
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 2, 2021
craig bot pushed a commit that referenced this issue May 3, 2021
64201: kv: rationalize load-based range merging r=nvanbenschoten a=nvanbenschoten

Closes #62700.
Fully addresses #41317.

This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold.

The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in #41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in #62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots.

The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time.

The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one.

I've measured this change on the `clearrange` roachtest that we made heavy use of in #62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

<img width="1323" alt="Screen Shot 2021-04-26 at 12 32 53 AM" src="https://user-images.githubusercontent.com/5438456/116037215-c8f66300-a635-11eb-8ff2-9e7db4baee8d.png">

<img width="986" alt="Screen Shot 2021-04-26 at 12 33 04 AM" src="https://user-images.githubusercontent.com/5438456/116037225-cc89ea00-a635-11eb-8f2c-a40b2b3e47a7.png">

Instead of what we originally saw, which looked like:

<img width="1305" alt="Screen Shot 2021-03-27 at 10 52 18 PM" src="https://user-images.githubusercontent.com/5438456/112763884-53668b00-8fd4-11eb-9ebc-61d9494eca10.png">

Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.

cc @cockroachdb/kv 

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 8, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 16, 2021