
storage: dropping a large table will brick a cluster due to compactions #24029

Closed
benesch opened this issue Mar 19, 2018 · 140 comments
Assignees: tbg
Labels: A-storage (relating to our storage engine (Pebble) on-disk storage), S-1-stability (severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)

Comments

benesch (Contributor) commented Mar 19, 2018

$ roachprod create USER-FOO -n10
$ roachprod run USER-FOO 'mkdir -p /mnt/data1/cockroach && gsutil -m -q cp -r gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/stores=10/$((10#$(hostname | grep -oE [0-9]+$)))/* /mnt/data1/cockroach'

Wait ~10m for stores to download. Then drop the 2TiB table:

ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}';
DROP TABLE bank.bank;

The cluster explodes a few minutes later as RocksDB tombstones pile up. I can no longer execute any SQL queries that read from/write to disk.

Very closely related to #21901, but thought I'd file a separate tracking issue.

/cc @spencerkimball

benesch added this to the 2.0 milestone Mar 19, 2018
benesch (Contributor, Author) commented Mar 19, 2018

Ok, I'm counting 22k range deletion tombstones in aggregate:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l | tail -n+2 | cut -f2 -d: | paste -sd+ - | bc
22194

Broken down by node:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l 
benesch-drop-2: grep -a -A 10000 'Range del... 10/10
   1: 61
   2: 2134
   3: 2342
   4: 2292
   5: 2252
   6: 5080
   7: 34
   8: 862
   9: 2992
  10: 4145

I'm surprised there's such a high variance—the replicas looked balanced when I dropped the table.

bdarnell (Member) commented:

Breaking it down on a per-sst basis, most of the 4k ssts have zero tombstones. There are a handful that have hundreds:

007620_dump.txt 	1
007840_dump.txt 	1
008611_dump.txt 	1
008641_dump.txt 	1
008681_dump.txt 	1
008703_dump.txt 	1
008826_dump.txt 	1
008822_dump.txt 	106
008817_dump.txt 	112
007623_dump.txt 	2
008596_dump.txt 	2
008820_dump.txt 	241
008818_dump.txt 	247
007615_dump.txt 	3
008824_dump.txt 	3
008728_dump.txt 	4
008816_dump.txt 	4
008823_dump.txt 	4
008825_dump.txt 	6
008814_dump.txt 	632
008819_dump.txt 	662
008815_dump.txt 	7
008821_dump.txt 	92

Digging into individual values on 008819, we see that the end key of one range and the start key of the next are adjacent in cockroach terms but different at the rocksdb level. These are the start and end keys of two adjacent tombstones:

{/Table/51/1/8035458/0 0.000000000,0}
{/Table/51/1/8036092 0.000000000,0}
{/Table/51/1/8036092/0 0.000000000,0}
{/Table/51/1/8039365 0.000000000,0}

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?
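
For concreteness, here's a tiny standalone C++ sketch (not CockroachDB or RocksDB code; the keys are copied from the dump above) of the coalescing being asked about: two tombstones only collapse if the end key of one exactly equals the start key of the next, which the .../8036092 vs .../8036092/0 gap prevents.

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // [start, end) span covered by a range deletion tombstone.
    using Span = std::pair<std::string, std::string>;

    // Merge adjacent spans whose endpoints line up exactly.
    std::vector<Span> Coalesce(const std::vector<Span>& sorted) {
      std::vector<Span> out;
      for (const Span& s : sorted) {
        if (!out.empty() && out.back().second == s.first) {
          out.back().second = s.second;  // endpoints match: extend the previous span
        } else {
          out.push_back(s);  // gap between spans: keep them separate
        }
      }
      return out;
    }

    int main() {
      std::vector<Span> spans = {
          {"/Table/51/1/8035458/0", "/Table/51/1/8036092"},
          {"/Table/51/1/8036092/0", "/Table/51/1/8039365"},  // starts at .../8036092/0, not .../8036092
      };
      printf("coalesced into %zu span(s)\n", Coalesce(spans).size());  // prints 2, not 1
      return 0;
    }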

tbg (Member) commented Mar 19, 2018

Let's make sure this becomes a variant of the drop roachtest!

benesch (Contributor, Author) commented Mar 19, 2018

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

Do you think that would help that much? A sufficiently scattered table can always produce maximally discontiguous ranges.

bdarnell (Member) commented:

True, but empirically it would make a big difference in this case. A very large fraction of the range tombstones in 008819 are contiguous (or would be with this change).

Another upstream fix would be for rocksdb to prioritize sstables with multiple range tombstones for compaction. These are likely to be easy to compact away to the lowest level, and leaving them in higher levels is disproportionately expensive.

bdarnell (Member) commented:

Note that I'm speculating about this - I haven't found any code in rocksdb that would join two consecutive range deletions. I'm not sure if it's possible - each range tombstone has a sequence number as a payload, but I think it may be possible for those to be flattened away on compaction.

benesch (Contributor, Author) commented Mar 19, 2018

range_del_aggregator.cc has a collapse_deletions_ flag, but I haven't been able to trace what sets it.

bdarnell (Member) commented:

The CPU profile showed that we're spending all our time inside the if (collapse_deletions_) block, so it's definitely getting set. But I've been looking at what that flag does, and if it does join two adjacent tombstones, I can't see where.

tbg (Member) commented Mar 21, 2018

I experimented with a change that replaced the ClearRange with a (post-Raft, i.e. not WriteBatched) ClearIterRange. Presumably due to the high parallelism with which these queries are thrown at RocksDB by DistSender, the nodes ground to a halt, missing heartbeats and all. So this alone isn't an option. I'll run this again with limited parallelism but that will likely slow it down a lot.

benesch (Contributor, Author) commented Mar 21, 2018 via email

benesch (Contributor, Author) commented Mar 21, 2018

Oooh, check this out: https://github.com/facebook/rocksdb/pull/3635/files

tbg (Member) commented Mar 21, 2018

Good find! I'll try those out.

Ping #17229 (comment)

tbg (Member) commented Mar 21, 2018

I tried running with https://github.com/facebook/rocksdb/pull/3635/files (on top of stock master) and it doesn't help. Not sure whether it's because of a bug in the code or because of @benesch's observation that it doesn't really address our problem; either way we see goroutines stuck in RocksDB seeks for >20 minutes at a time (they eventually come back and go into the next syscall, as far as I can tell, but it's clearly not anywhere close to working).

I then added @benesch's PR on top and restarted the nodes in the hope that I would see compaction activity while the node is stuck in initialization. However, that doesn't seem to happen; we only see one single thread maxing out one CPU:

[screenshot]

tbg (Member) commented Mar 21, 2018

Back to the ClearIterRange version: even with an added CompactOnDeletionCollectorFactory it's pretty bad (which makes sense; we've added more work):

root@localhost:26257/> create database foo;
CREATE DATABASE

Time: 1m49.87228088s

[screenshot]

goroutine 52246 [syscall, 15 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	_cgo_gotypes.go:549 +0x74
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0xba
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc42a9b87e0, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0x16b
github.com/cockroachdb/cockroach/pkg/storage/engine.dbIterate(0x7f4e510990c0, 0x2657fc0, 0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2391 +0x166
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Iterate(0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, 0x8, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1548 +0xfc
github.com/cockroachdb/cockroach/pkg/storage/batcheval.ClearRange(0x26486a0, 0xc42f06f440, 0x7f4e65c12328, 0xc42be3fb00, 0x26701c0, 0xc424c7c000, 0x151e0641f6a360bb, 0x0, 0x100000001, 0x3, ...)
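
For reference, a minimal sketch of wiring up the deletion-triggered compaction mentioned above, assuming a RocksDB build that ships NewCompactOnDeletionCollectorFactory (rocksdb/utilities/table_properties_collectors.h). The exact signature and whether range deletions count toward the trigger vary by RocksDB version; the window/trigger values and path here are purely illustrative.

    #include "rocksdb/db.h"
    #include "rocksdb/utilities/table_properties_collectors.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Mark an sstable for compaction once it sees at least 4096 deletion
      // entries within any sliding window of 16384 entries.
      options.table_properties_collector_factories.emplace_back(
          rocksdb::NewCompactOnDeletionCollectorFactory(
              /*sliding_window_size=*/16384, /*deletion_trigger=*/4096));

      rocksdb::DB* db;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compact_on_deletion_demo", &db);
      if (s.ok()) delete db;
      return s.ok() ? 0 : 1;
    }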

spencerkimball (Member) commented:

It might not be a popular suggestion, but we could explore reinstating the original changes to simply drop SSTables for large aggregate suggested compactions. This requires that we set an extra flag on the suggestion indicating it's a range of data that will never be rewritten. This is true after table drops and truncates, but not true after rebalances.

spencerkimball (Member) commented:

Or am I misunderstanding, and dropping the files would be independent of the range tombstone problem?

tbg (Member) commented Mar 21, 2018

I was thinking about that too; I'm just spooked by the unintended consequences this can have, as it pulls the data out even from under open snapshots.
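
For illustration, a sketch of the "just drop the files" idea using RocksDB's DeleteFilesInRange (rocksdb/convenience.h); as noted above, it removes data out from under open snapshots, which is exactly what makes it scary. The keys and path here are illustrative, not CockroachDB's actual key encoding.

    #include <cstdio>
    #include "rocksdb/convenience.h"
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/delete_files_demo", &db);
      if (!s.ok()) { fprintf(stderr, "open: %s\n", s.ToString().c_str()); return 1; }

      // Drop whole sstables that lie entirely inside the dropped table's keyspan.
      // Files straddling the boundaries are untouched, so a DeleteRange or a
      // compaction is still needed to clean up the edges, and data in the dropped
      // files disappears even for readers holding open snapshots.
      rocksdb::Slice begin("table/51/");
      rocksdb::Slice end("table/52/");
      s = rocksdb::DeleteFilesInRange(db, db->DefaultColumnFamily(), &begin, &end);
      fprintf(stderr, "DeleteFilesInRange: %s\n", s.ToString().c_str());

      delete db;
      return s.ok() ? 0 : 1;
    }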

spencerkimball (Member) commented Mar 21, 2018 via email

tbg (Member) commented Mar 21, 2018

It's likely that it would alleviate the problem, but actually introducing this is not an option this late in the 2.0 cycle, so I'm investing my energies elsewhere. That said, if you want to try this, go ahead! It would be fun to see it in action.

lingbin commented Mar 22, 2018

Like I said in facebook/rocksdb#3634, that PR only addresses the performance of non-first seeks, so @benesch is right: facebook/rocksdb#3635 cannot solve your problem, because CockroachDB creates many iterators and each iterator performs only a small number of seeks.

For this issue, I have some rough suggestions for consideration:

  1. A table can contain multiple ranges. When dropping a table, delete the entire table's keyspace in one shot rather than deleting it range by range, which reduces the number of tombstones. I think this only alleviates the problem, though, because the large tombstone may be split apart when rocksdb does a level compaction, and it may be difficult to merge deletions in a rebalancing scenario.
  2. If there is a mechanism that guarantees a dropped range is only truly deleted (i.e., rocksdb's DeleteRange() is only called) once it can never be accessed again, for example via a delayed deletion that waits out a transaction timeout, then you can set ReadOptions.ignore_range_deletions to true, which makes reads skip range tombstones entirely. Because the deleted data no longer needs to be accessed, it is safe to ignore the tombstones (see the sketch after this list).
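
To make the second suggestion concrete, here is a minimal standalone RocksDB sketch (path and keys are illustrative, and this is not how CockroachDB issues these calls): a DeleteRange over the dropped span, then reads that set ReadOptions::ignore_range_deletions once the span is guaranteed never to be read again.

    #include <cstdio>
    #include <memory>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/ignore_range_del_demo", &db).ok()) return 1;

      // One range tombstone covering the dropped table's keyspace.
      if (!db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                           "table/51/", "table/52/").ok()) return 1;

      // If the deleted span is guaranteed never to be read again, reads can skip
      // range tombstones entirely instead of paying to resolve them on every seek.
      rocksdb::ReadOptions ro;
      ro.ignore_range_deletions = true;
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
      int n = 0;
      for (it->SeekToFirst(); it->Valid(); it->Next()) {
        n++;  // note: keys covered by the ignored tombstone may still show up here
      }
      printf("saw %d keys\n", n);

      delete db;
      return 0;
    }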

tbg (Member) commented Mar 22, 2018

Thanks @lingbin. We're aware of these options, but they're tricky for us to implement because we've so far relied on RocksDB giving us consistent snapshots, and we have various mechanisms that check that the multiple copies of the data we keep are perfectly synchronized. That said, somewhere down the road we're going to have to use some combination of vastly improving the performance of seeks over deletion tombstones (what @benesch proposed upstream), DeleteFilesInRange, and/or skipping deletion tombstones.

tbg (Member) commented Mar 22, 2018

@bdarnell, re the below:

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short-term fix, this seems to be a promising avenue if it's true. Were you ever able to figure out whether that should be happening? At least for testing, I can make these holes disappear, but it would be good to know whether it would help in the first place.

I'm also curious what the compactions necessary to clean up this table would do even if seeks weren't affected at all. Running a manual compaction takes hours on this kind of dataset; we were implicitly assuming that everything would just kind of evaporate and that it would be quick, but that doesn't seem to be what's happening here. Maybe this is due to the "holes" we leave in the keyspace, but it's unclear to me. We need this to be reasonably efficient or the ClearRange option isn't one at all, even without the seek problem. Something to investigate.

Ping @a-robinson just in case anything from #21528 is applicable here. I believe the dataset there was much smaller, right?

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

[screenshot]

I'm not advocating that this should be our strategy for 2.0, but it's one of the so-far-equally-bad strategies.

benesch (Contributor, Author) commented Mar 22, 2018 via email

tbg (Member) commented Mar 22, 2018

RocksDB would keep track of the lowest sequence number that's still open in a snapshot, and at compaction time it joins SSTables whose sequence numbers are irrelevant.
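
A minimal sketch of the interaction being described, assuming stock RocksDB (path is illustrative): an open snapshot pins a sequence number, and compactions must keep enough history to serve reads at that sequence number until the snapshot is released.

    #include <cstdio>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/snapshot_seqno_demo", &db).ok()) return 1;

      // The snapshot pins this sequence number; versions visible at it cannot be
      // dropped or collapsed by compactions until the snapshot is released.
      const rocksdb::Snapshot* snap = db->GetSnapshot();
      printf("pinned sequence number: %llu\n",
             (unsigned long long)snap->GetSequenceNumber());

      db->Put(rocksdb::WriteOptions(), "k", "v");  // gets a newer sequence number
      db->ReleaseSnapshot(snap);

      delete db;
      return 0;
    }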

a-robinson (Contributor) commented:

Ping @a-robinson just in case anything from #21528 is applicable here. I believe the dataset there was much smaller, right?

It took about 5 days of running kv --read-percent=0 --max-rate=1200 before problems started showing up on the 6-node cluster under chaos. I'm not sure if I was using a custom block size or how much data had been generated by the time the problems started.

I don't think there's anything particularly applicable here that we learned from that issue. The main lessons were:

  1. manual compactions are slow
  2. manual compactions of key ranges that have had data written to them since the deletion are particularly slow and expensive
  3. using the exclusive_manual_compaction setting is a bad idea if you care about foreground writes (see the sketch below)
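
A sketch of what the knob in lesson 3 looks like at the RocksDB level, assuming stock RocksDB (keys and path are illustrative): setting CompactRangeOptions::exclusive_manual_compaction to false lets automatic compactions keep running alongside the manual one.

    #include <cstdio>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/manual_compaction_demo", &db).ok()) return 1;

      rocksdb::CompactRangeOptions cro;
      cro.exclusive_manual_compaction = false;  // don't starve automatic compactions

      // Manually compact the dropped span; this can still take a long time, but at
      // least flushes and foreground writes aren't blocked behind it.
      rocksdb::Slice begin("table/51/");
      rocksdb::Slice end("table/52/");
      rocksdb::Status s = db->CompactRange(cro, &begin, &end);
      printf("CompactRange: %s\n", s.ToString().c_str());

      delete db;
      return s.ok() ? 0 : 1;
    }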

bdarnell (Member) commented:

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short term fix, this seems to be a promising venue if it's true. Were you ever able to figure out if that should be happening?

I haven't found any code that would do this, but I can't rule it out. It would need to happen on compaction but not, I think, on all uses of the range tombstones so I'm not sure I've looked in the right places. Might be worth trying an empirical test.

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

tbg (Member) commented Mar 22, 2018

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

I'm on the fence about that -- with the range deletion tombstones, you can run a full compaction and you're back. That option does not exist with the ClearIterRange change; you're kinda screwed until it clears up by itself.

tbg self-assigned this Mar 26, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jun 21, 2018
While investigating range deletion performance issues, we realized that
large swaths of contiguous tombstones can cause massive performance
issues when seeking. Seeks of more than 15 minutes (!) were observed on
one cluster [0]. The problem is that every key overlapping the range
deletion tombstone must be loaded from disk, decompressed, and compared
with the sequence number of the tombstone until a key is found that is
not covered by the tombstone. If contiguous keys representing hundreds
of gigabytes are covered by tombstones, RocksDB will need to scan
hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion
tombstones upstream, but we can mitigate the issue locally, too.
Iteration outside of tests is nearly always in the context of a range or
some bounded span of local keys. By plumbing knowledge of the upper
bound we care about for each scan to RocksDB, we can invalidate the
iterator early, once the upper bound has been exceeded, rather than
scanning over potentially hundreds of gigabytes of deleted keys just to
find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path,
this commit requires that all iterators are either prefix iterators or
declare their upper bound. This makes iterating in tests rather irksome,
but I think the tradeoff is worthwhile.

[0]: cockroachdb#24029 (comment)

Release note: None
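
At the RocksDB level, the mitigation described in this commit boils down to giving every iterator an upper bound. A minimal standalone sketch, assuming stock RocksDB (path and keys are illustrative; this is not the actual CockroachDB plumbing):

    #include <memory>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/upper_bound_demo", &db).ok()) return 1;

      rocksdb::ReadOptions ro;
      rocksdb::Slice upper_bound("range/0042/end");  // exclusive end of the span we care about
      ro.iterate_upper_bound = &upper_bound;         // must outlive the iterator

      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
      for (it->Seek("range/0042/"); it->Valid(); it->Next()) {
        // The iterator becomes invalid as soon as it reaches the upper bound,
        // instead of grinding through gigabytes of tombstoned keys beyond it.
      }

      delete db;
      return 0;
    }
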
benesch added a commit to benesch/cockroach that referenced this issue Jun 21, 2018
The current implementation of range deletion tombstones in RocksDB
suffers from a performance bug that causes excessive CPU usage on every
read operation in a database with many range tombstones. Dropping a
large table can easily result in several thousand range deletion
tombstones in one store, resulting in an unusable cluster as documented
in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the
performance problem. This refactoring has also been proposed upstream as
facebook/rocksdb#4014.

A more minimal change was also proposed in facebook/rocksdb#3992--and
that patch better highlights the exact nature of the bug than the patch
backported here, for those looking to understand the problem. But this
refactoring, though more invasive, gets us one step closer to solving a
related problem where range deletions can cause excessively large
compactions (cockroachdb#26693). These large compactions do not appear to brick the
cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
tbg moved this from Milestone 3 to On the horizon in KV Jun 25, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jun 26, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jul 5, 2018
craig bot pushed a commit that referenced this issue Jul 6, 2018
26789: Makefile: don't globally install binaries during lint r=BramGruneir a=benesch

The build system takes pains to scope all binaries it installs to
REPO/bin to avoid polluting the user's GOPATH/bin with binaries that are
internal to the Cockroach build system.

The lint target was violating this rule by running 'go install
./pkg/...', which installed every package in the repository into
GOPATH/bin. This commit adjusts the rule to run 'go build ./pkg/...'
instead, which installs the necessary .a files into GOPATH/pkg without
installing any binaries into GOPATH/bin.

Fix #26633.

Release note: None

26872: engine: require iterators to specify an upper bound r=petermattis,bdarnell a=benesch

As promised. The issue the first commit fixes took longer than I'd like to admit to track down.

I've done my best to keep the patch as small as possible in case we want to backport to 2.0. The changes to the C++ DBIterator are unfortunately large, though. Open to less disruptive ways of plumbing the upper bound there, but I'm not seeing anything obvious.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
benesch added a commit to benesch/cockroach that referenced this issue Jul 6, 2018
While debugging cockroachdb#24029 we discovered that RESTORE generates massive
numbers of suggested compactions as it splits and scatters ranges. As
the cluster rebalances, every removed replica leaves behind a range
deletion tombstone and a suggested compaction over the keys it covered.

Occasionally, the replica chosen for rebalancing will be a member of the
last range of the cluster. This range extends from wherever the restore
has last split the table it's restoring to the very last key. Suppose
we're restoring a table with 10,000 ranges evenly distributed across
primary keys from 1-10,000. If a replica in the last range gets
rebalanced early in the restore—say, after only the first 500 ranges
have been split off—at least one node in the cluster will have a
suggested compaction for a range like the following:

    /Table/51/1/500 - /Max

This creates a huge problem! The restore will eventually create 9500
more ranges in that keyspan, each about 32MiB in size. Some of those
ranges will necessarily rebalance back onto the node with the suggested
compaction. By the time the compaction queue gets around to processing
the suggestion, there might be hundreds of gigabytes within the range.

In our 2TB (replicated) store dump, snapshotted immediately after a
RESTORE, there were two such massive suggested compactions, each of
which took over 1h to complete. This bogs down the compaction queue with
unnecessary work and makes it especially dangerous to initiate a DROP
TABLE (cockroachdb#24029), as the incoming range deletion tombstones will pile up
until the prior compaction finishes, and the cluster grinds to a halt in
the meantime.

The same problem happens whenever any replica is rebalanced away and
back before the compaction queue has a chance to compact away the range
deletion tombstone, though the impact is limited because the keyspan is
smaller.

This commit prevents the compaction queue from getting bogged down with
suggestions based on outdated information. At the time the suggestion is
considered for compaction, the queue checks whether the key span
suggested has any live keys. The sign of even a single key is a good
indicator that something has changed—usually that the replica, or one of
its split children, has been rebalanced back onto the node. The
compaction queue now deletes this suggestion instead of acting on it.

This is a crucial piece of the fix to cockroachdb#24029. The other half, cockroachdb#26449,
involves rate-limiting ClearRange requests.

I suspect this change will have a nice performance boost for RESTORE.
Anecdotally we've noticed that restores slow down over time. I'm willing
to bet it's because nonsense suggested compactions start hogging disk
I/O.

Release note: None
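
A rough sketch of the live-key check described above, assuming stock RocksDB (span keys are illustrative; the actual compactor logic lives in CockroachDB's Go code): scan the suggested span with a bounded iterator and drop the suggestion if any key turns up.

    #include <cstdio>
    #include <memory>
    #include "rocksdb/db.h"

    // Returns true if the suggested compaction span [begin, end) contains any
    // live key, in which case the suggestion should be discarded.
    bool SpanHasLiveKeys(rocksdb::DB* db, const rocksdb::Slice& begin,
                         const rocksdb::Slice& end) {
      rocksdb::ReadOptions ro;
      ro.iterate_upper_bound = &end;  // stop at the end of the suggested span
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
      it->Seek(begin);
      return it->Valid();  // a single live key is enough
    }

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/live_key_check_demo", &db).ok()) return 1;
      printf("span has live keys: %d\n", SpanHasLiveKeys(db, "table/51/", "table/52/"));
      delete db;
      return 0;
    }
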
benesch added a commit to benesch/cockroach that referenced this issue Jul 9, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jul 9, 2018
petermattis added a commit to petermattis/cockroach that referenced this issue Jul 9, 2018
Demonstrate the performance problem with iterating through a range
tombstone.

name                                        time/op
RocksDBDeleteRangeIterate/entries=10-8      6.09µs ± 3%
RocksDBDeleteRangeIterate/entries=1000-8     131µs ± 3%
RocksDBDeleteRangeIterate/entries=100000-8  12.3ms ± 3%

See cockroachdb#24029

Release note: None
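
For anyone who wants to reproduce the effect outside the Go benchmark, here is a rough standalone C++ analogue (not the benchmark above; N, keys, and path are illustrative): write N keys, cover them with a single range tombstone, flush, and time one Seek that has to get past the tombstone. On RocksDB builds without the range-tombstone fixes, this seek is where the time goes.

    #include <chrono>
    #include <cstdio>
    #include <memory>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/range_del_seek_demo", &db).ok()) return 1;

      const int N = 100000;
      char key[32];
      for (int i = 0; i < N; i++) {
        snprintf(key, sizeof(key), "key%09d", i);
        db->Put(rocksdb::WriteOptions(), key, "value");
      }
      // One tombstone covering every key written above, then flush to an sstable.
      db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                      "key000000000", "key999999999");
      db->Flush(rocksdb::FlushOptions());

      auto start = std::chrono::steady_clock::now();
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
      it->Seek("key000000000");  // has to step over all N deleted keys
      auto micros = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();
      printf("seek over %d deleted keys: %lld us (valid=%d)\n", N, (long long)micros,
             (int)it->Valid());

      delete db;
      return 0;
    }
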
craig bot pushed a commit that referenced this issue Jul 10, 2018
26488: compactor: purge suggestions that have live data r=tschottdorf,bdarnell a=benesch

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
petermattis added a commit to petermattis/cockroach that referenced this issue Jul 10, 2018
petermattis (Collaborator) commented:

With the recent improvements to RocksDB range tombstones, the clearrange roachtest now reliably passes:

[screenshot, 2018-07-13]

Note that the test started slightly after 00:00. This test was using a freshly generated bank fixture. For some reason, cockroach was refusing to start with the old store fixtures, complaining about the version being too old. This doesn't make any sense and needs to be investigated. I'll file a separate issue when I can reproduce it.

benesch (Contributor, Author) commented Jul 14, 2018

For some reason, cockroach was refusing to start using the old store fixtures complaining about the version being too old. This doesn't make any sense and needs to be investigated.

Why doesn't it make any sense? I think that fixture was generated with 1.1, and you're running a 2.1 beta.

spencerkimball (Member) commented Jul 14, 2018 via email

petermattis (Collaborator) commented:

Why doesn't it make any sense? I think that fixture was generated with 1.1, and you're running a 2.1 beta.

See #27525. I generated that fixture with a binary at version v2.0-7.

In that graph, why does available stay constant while capacity decreases?

ZFS. For ease of debugging, we take a ZFS snapshot after restoring the store fixtures, which makes subsequent runs of the test fast. The downside is that as we start to delete data, ZFS doesn't actually delete it off disk because it is still referenced by the snapshot.

petermattis (Collaborator) commented:

clearrange has passed 4 nights in a row, so I'm closing this issue. #26693 remains open to investigate the excessive compactions that can still occur but do not drive a cluster to its knees.

KV automation moved this from On the horizon to Finished (milestone 3, ends 7/20) Jul 19, 2018