
storage: dropping a large table will brick a cluster due to compactions #24029

Closed
benesch opened this issue Mar 19, 2018 · 140 comments
Assignees: tbg
Labels: A-storage (relating to our storage engine (Pebble) on-disk storage), S-1-stability (severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)

Comments

benesch (Contributor) commented Mar 19, 2018

$ roachprod create USER-FOO -n10
$ roachprod run USER-FOO 'mkdir -p /mnt/data1/cockroach && gsutil -m -q cp -r gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/stores=10/$((10#$(hostname | grep -oE [0-9]+$)))/* /mnt/data1/cockroach'

Wait ~10m for stores to download. Then drop the 2TiB table:

ALTER TABLE bank.bank EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 30}';
DROP TABLE bank.bank;

The cluster explodes a few minutes later as RocksDB tombstones pile up. I can no longer execute any SQL queries that read from/write to disk.

Very closely related to #21901, but thought I'd file a separate tracking issue.

/cc @spencerkimball

benesch added this to the 2.0 milestone Mar 19, 2018
benesch (Contributor, Author) commented Mar 19, 2018

Ok, I'm counting 22k range deletion tombstones in aggregate:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l | tail -n+2 | cut -f2 -d: | paste -sd+ - | bc
22194

Broken down by node:

$ roachprod run benesch-drop-2 -- grep -a -A 10000 "'Range deletions:'" /mnt/data1/cockroach/*.txt \| grep -a HEX \| wc -l 
benesch-drop-2: grep -a -A 10000 'Range del... 10/10
   1: 61
   2: 2134
   3: 2342
   4: 2292
   5: 2252
   6: 5080
   7: 34
   8: 862
   9: 2992
  10: 4145

I'm surprised there's such a high variance—the replicas looked balanced when I dropped the table.

bdarnell (Member) commented:

Breaking it down on a per-sst basis, most of the 4k ssts have zero tombstones. There are a handful that have hundreds:

007620_dump.txt 	1
007840_dump.txt 	1
008611_dump.txt 	1
008641_dump.txt 	1
008681_dump.txt 	1
008703_dump.txt 	1
008826_dump.txt 	1
008822_dump.txt 	106
008817_dump.txt 	112
007623_dump.txt 	2
008596_dump.txt 	2
008820_dump.txt 	241
008818_dump.txt 	247
007615_dump.txt 	3
008824_dump.txt 	3
008728_dump.txt 	4
008816_dump.txt 	4
008823_dump.txt 	4
008825_dump.txt 	6
008814_dump.txt 	632
008819_dump.txt 	662
008815_dump.txt 	7
008821_dump.txt 	92

Digging into individual values on 008819, we see that the end key of one range and the start key of the next are adjacent in cockroach terms but different at the rocksdb level. These are the start and end keys of two adjacent tombstones:

{/Table/51/1/8035458/0 0.000000000,0}
{/Table/51/1/8036092 0.000000000,0}
{/Table/51/1/8036092/0 0.000000000,0}
{/Table/51/1/8039365 0.000000000,0}

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?
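
For concreteness, here's a tiny standalone C++ sketch (not CockroachDB or RocksDB code; the keys are copied from the dump above) of the coalescing being asked about: two tombstones only collapse if the end key of one exactly equals the start key of the next, which the .../8036092 vs .../8036092/0 gap prevents.

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // [start, end) span covered by a range deletion tombstone.
    using Span = std::pair<std::string, std::string>;

    // Merge adjacent spans whose endpoints line up exactly.
    std::vector<Span> Coalesce(const std::vector<Span>& sorted) {
      std::vector<Span> out;
      for (const Span& s : sorted) {
        if (!out.empty() && out.back().second == s.first) {
          out.back().second = s.second;  // endpoints match: extend the previous span
        } else {
          out.push_back(s);  // gap between spans: keep them separate
        }
      }
      return out;
    }

    int main() {
      std::vector<Span> spans = {
          {"/Table/51/1/8035458/0", "/Table/51/1/8036092"},
          {"/Table/51/1/8036092/0", "/Table/51/1/8039365"},  // starts at .../8036092/0, not .../8036092
      };
      printf("coalesced into %zu span(s)\n", Coalesce(spans).size());  // prints 2, not 1
      return 0;
    }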

tbg (Member) commented Mar 19, 2018

Let's make sure this becomes a variant of the drop roachtest!

benesch (Contributor, Author) commented Mar 19, 2018

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

Do you think that would help that much? A sufficiently scattered table can always produce maximally discontiguous ranges.

bdarnell (Member) commented:

True, but empirically it would make a big difference in this case. A very large fraction of the range tombstones in 008819 are contiguous (or would be with this change).

Another upstream fix would be for rocksdb to prioritize sstables with multiple range tombstones for compaction. These are likely to be easy to compact away to the lowest level, and leaving them in higher levels is disproportionately expensive.

bdarnell (Member) commented:

Note that I'm speculating about this - I haven't found any code in rocksdb that would join two consecutive range deletions. I'm not sure if it's possible - each range tombstone has a sequence number as a payload, but I think it may be possible for those to be flattened away on compaction.

benesch (Contributor, Author) commented Mar 19, 2018

range_del_aggregator.cc has a collapse_deletions_ flag, but I haven't been able to trace what sets it.

bdarnell (Member) commented:

The CPU profile showed that we're spending all our time inside the if (collapse_deletions_) block, so it's definitely getting set. But I've been looking at what that flag does, and if it does join two adjacent tombstones, I can't see where.

tbg (Member) commented Mar 21, 2018

I experimented with a change that replaced the ClearRange with a (post-Raft, i.e. not WriteBatched) ClearIterRange. Presumably due to the high parallelism with which these queries are thrown at RocksDB by DistSender, the nodes ground to a halt, missing heartbeats and all. So this alone isn't an option. I'll run this again with limited parallelism but that will likely slow it down a lot.

benesch (Contributor, Author) commented Mar 21, 2018 via email

benesch (Contributor, Author) commented Mar 21, 2018

Oooh, check this out: https://github.com/facebook/rocksdb/pull/3635/files

tbg (Member) commented Mar 21, 2018

Good find! I'll try those out.

Ping #17229 (comment)

tbg (Member) commented Mar 21, 2018

I tried running with https://github.com/facebook/rocksdb/pull/3635/files (on top of stock master) and it doesn't help. Not sure whether it's because of a bug in the code or because of @benesch's observation that it doesn't really address our problem; either way we see goroutines stuck in RocksDB seeks for >20 minutes at a time (they eventually come back and go into the next syscall, as far as I can tell, but it's clearly not anywhere close to working).

I then added @benesch's PR on top and restarted the nodes in the hope that I would see compaction activity while the node is stuck in initialization. However, that doesn't seem to happen; we only see one single thread maxing out one CPU:

[screenshot]

tbg (Member) commented Mar 21, 2018

Back to the ClearIterRange version: even with an added CompactOnDeletionCollectorFactory it's pretty bad (which makes sense; we've added more work):

root@localhost:26257/> create database foo;
CREATE DATABASE

Time: 1m49.87228088s

[screenshot]

goroutine 52246 [syscall, 15 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterSeek(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	_cgo_gotypes.go:549 +0x74
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek.func2(0x7f4e511e5100, 0xc4254a2628, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0xba
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Seek(0xc42a9b87e0, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1905 +0x16b
github.com/cockroachdb/cockroach/pkg/storage/engine.dbIterate(0x7f4e510990c0, 0x2657fc0, 0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2391 +0x166
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Iterate(0xc42be3fb00, 0xc4254a2628, 0x6, 0x8, 0x0, 0x0, 0xc4254a2640, 0x7, 0x8, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1548 +0xfc
github.com/cockroachdb/cockroach/pkg/storage/batcheval.ClearRange(0x26486a0, 0xc42f06f440, 0x7f4e65c12328, 0xc42be3fb00, 0x26701c0, 0xc424c7c000, 0x151e0641f6a360bb, 0x0, 0x100000001, 0x3, ...)
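
For reference, a minimal sketch of wiring up the deletion-triggered compaction mentioned above, assuming a RocksDB build that ships NewCompactOnDeletionCollectorFactory (rocksdb/utilities/table_properties_collectors.h). The exact signature and whether range deletions count toward the trigger vary by RocksDB version; the window/trigger values and path here are purely illustrative.

    #include "rocksdb/db.h"
    #include "rocksdb/utilities/table_properties_collectors.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Mark an sstable for compaction once it sees at least 4096 deletion
      // entries within any sliding window of 16384 entries.
      options.table_properties_collector_factories.emplace_back(
          rocksdb::NewCompactOnDeletionCollectorFactory(
              /*sliding_window_size=*/16384, /*deletion_trigger=*/4096));

      rocksdb::DB* db;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compact_on_deletion_demo", &db);
      if (s.ok()) delete db;
      return s.ok() ? 0 : 1;
    }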

spencerkimball (Member) commented:

It might not be a popular suggestion, but we could explore reinstating the original changes to simply drop SSTables for large aggregate suggested compactions. This requires that we set an extra flag on the suggestion indicating it's a range of data that will never be rewritten. This is true after table drops and truncates, but not true after rebalances.

spencerkimball (Member) commented:

Or am I misunderstanding, and dropping the files would be independent of the range tombstone problem?

tbg (Member) commented Mar 21, 2018

I was thinking about that too; I'm just spooked by the unintended consequences this can have, as it pulls the data out even from under open snapshots.
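
For illustration, a sketch of the "just drop the files" idea using RocksDB's DeleteFilesInRange (rocksdb/convenience.h); as noted above, it removes data out from under open snapshots, which is exactly what makes it scary. The keys and path here are illustrative, not CockroachDB's actual key encoding.

    #include <cstdio>
    #include "rocksdb/convenience.h"
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/delete_files_demo", &db);
      if (!s.ok()) { fprintf(stderr, "open: %s\n", s.ToString().c_str()); return 1; }

      // Drop whole sstables that lie entirely inside the dropped table's keyspan.
      // Files straddling the boundaries are untouched, so a DeleteRange or a
      // compaction is still needed to clean up the edges, and data in the dropped
      // files disappears even for readers holding open snapshots.
      rocksdb::Slice begin("table/51/");
      rocksdb::Slice end("table/52/");
      s = rocksdb::DeleteFilesInRange(db, db->DefaultColumnFamily(), &begin, &end);
      fprintf(stderr, "DeleteFilesInRange: %s\n", s.ToString().c_str());

      delete db;
      return s.ok() ? 0 : 1;
    }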

spencerkimball (Member) commented Mar 21, 2018 via email

tbg (Member) commented Mar 21, 2018

It's likely that it would alleviate the problem, but actually introducing this is not an option this late in the 2.0 cycle, so I'm investing my energies elsewhere. That said, if you want to try this, go ahead! It would be fun to see it in action.

lingbin commented Mar 22, 2018

Like I said in facebook/rocksdb#3634, that PR only addresses the performance of non-first seeks, so @benesch is right: facebook/rocksdb#3635 cannot solve your problem, because CockroachDB creates many iterators and each iterator performs only a small number of seeks.

For this issue, I have some rough suggestions for consideration:

  1. A table can contain multiple ranges. When dropping a table, delete the entire table's keyspace in one shot rather than deleting it range by range, which reduces the number of tombstones. I think this only alleviates the problem, though, because the large tombstone may be split apart when rocksdb does a level compaction, and it may be difficult to merge deletions in a rebalancing scenario.
  2. If there is a mechanism that guarantees a dropped range is only truly deleted (i.e., rocksdb's DeleteRange() is only called) once it can never be accessed again, for example via a delayed deletion that waits out a transaction timeout, then you can set ReadOptions.ignore_range_deletions to true, which makes reads skip range tombstones entirely. Because the deleted data no longer needs to be accessed, it is safe to ignore the tombstones (see the sketch after this list).
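
To make the second suggestion concrete, here is a minimal standalone RocksDB sketch (path and keys are illustrative, and this is not how CockroachDB issues these calls): a DeleteRange over the dropped span, then reads that set ReadOptions::ignore_range_deletions once the span is guaranteed never to be read again.

    #include <cstdio>
    #include <memory>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/ignore_range_del_demo", &db).ok()) return 1;

      // One range tombstone covering the dropped table's keyspace.
      if (!db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                           "table/51/", "table/52/").ok()) return 1;

      // If the deleted span is guaranteed never to be read again, reads can skip
      // range tombstones entirely instead of paying to resolve them on every seek.
      rocksdb::ReadOptions ro;
      ro.ignore_range_deletions = true;
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
      int n = 0;
      for (it->SeekToFirst(); it->Valid(); it->Next()) {
        n++;  // note: keys covered by the ignored tombstone may still show up here
      }
      printf("saw %d keys\n", n);

      delete db;
      return 0;
    }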

tbg (Member) commented Mar 22, 2018

Thanks @lingbin. We're aware of these options, but they're tricky for us to implement because we've so far relied on RocksDB giving us consistent snapshots, and we have various mechanisms that check that the multiple copies of the data we keep are perfectly synchronized. That said, somewhere down the road we're going to have to use some combination of vastly improving the performance of seeks over deletion tombstones (what @benesch proposed upstream), DeleteFilesInRange, and/or skipping deletion tombstones.

tbg (Member) commented Mar 22, 2018

@bdarnell, re the below:

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short-term fix, this seems to be a promising avenue if it's true. Were you ever able to figure out whether that should be happening? At least for testing, I can make these holes disappear, but it would be good to know whether it would help in the first place.

I'm also curious what the compactions necessary to clean up this table would do even if seeks weren't affected at all. Running a manual compaction takes hours on this kind of dataset; we were implicitly assuming that everything would just kind of evaporate and that it would be quick, but that doesn't seem to be what's happening here. Maybe this is due to the "holes" we leave in the keyspace, but it's unclear to me. We need this to be reasonably efficient or the ClearRange option isn't one at all, even without the seek problem. Something to investigate.

Ping @a-robinson just in case anything from #21528 is applicable here. I believe the dataset there was much smaller, right?

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

[screenshot]

I'm not advocating that this should be our strategy for 2.0, but it's one of the so-far-equally-bad strategies.

benesch (Contributor, Author) commented Mar 22, 2018 via email

tbg (Member) commented Mar 22, 2018

RocksDB would keep track of the lowest sequence number that's still open in a snapshot, and at compaction time it joins SSTables whose sequence numbers are irrelevant.
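
A minimal sketch of the interaction being described, assuming stock RocksDB (path is illustrative): an open snapshot pins a sequence number, and compactions must keep enough history to serve reads at that sequence number until the snapshot is released.

    #include <cstdio>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/snapshot_seqno_demo", &db).ok()) return 1;

      // The snapshot pins this sequence number; versions visible at it cannot be
      // dropped or collapsed by compactions until the snapshot is released.
      const rocksdb::Snapshot* snap = db->GetSnapshot();
      printf("pinned sequence number: %llu\n",
             (unsigned long long)snap->GetSequenceNumber());

      db->Put(rocksdb::WriteOptions(), "k", "v");  // gets a newer sequence number
      db->ReleaseSnapshot(snap);

      delete db;
      return 0;
    }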

a-robinson (Contributor) commented:

Ping @a-robinson just in case anything from #21528 is applicable here. I believe the dataset there was much smaller, right?

It took about 5 days of running kv --read-percent=0 --max-rate=1200 before problems started showing up on the 6-node cluster under chaos. I'm not sure if I was using a custom block size or how much data had been generated by the time the problems started.

I don't think there's anything particularly applicable here that we learned from that issue. The main lessons were:

  1. manual compactions are slow
  2. manual compactions of key ranges that have had data written to them since the deletion are particularly slow and expensive
  3. using the exclusive_manual_compaction setting is a bad idea if you care about foreground writes (see the sketch below)
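
A sketch of what the knob in lesson 3 looks like at the RocksDB level, assuming stock RocksDB (keys and path are illustrative): setting CompactRangeOptions::exclusive_manual_compaction to false lets automatic compactions keep running alongside the manual one.

    #include <cstdio>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/manual_compaction_demo", &db).ok()) return 1;

      rocksdb::CompactRangeOptions cro;
      cro.exclusive_manual_compaction = false;  // don't starve automatic compactions

      // Manually compact the dropped span; this can still take a long time, but at
      // least flushes and foreground writes aren't blocked behind it.
      rocksdb::Slice begin("table/51/");
      rocksdb::Slice end("table/52/");
      rocksdb::Status s = db->CompactRange(cro, &begin, &end);
      printf("CompactRange: %s\n", s.ToString().c_str());

      delete db;
      return s.ok() ? 0 : 1;
    }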

bdarnell (Member) commented:

If we adjusted our endpoints so that the tombstones lined up exactly, would rocksdb be able to coalesce them into a single value?

For a hail-mary short term fix, this seems to be a promising venue if it's true. Were you ever able to figure out if that should be happening?

I haven't found any code that would do this, but I can't rule it out. It would need to happen on compaction but not, I think, on all uses of the range tombstones so I'm not sure I've looked in the right places. Might be worth trying an empirical test.

In other news, I ran the experiment with a) ClearIterRange b) tombstone-sensitive compaction and the cluster was a flaming pile of garbage for most of the time, but came out looking good:

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

tbg (Member) commented Mar 22, 2018

Might make sense as a cluster setting to give people some way to delete large tables without knocking the cluster out completely.

I'm on the fence about that -- with the range deletion tombstones, you can run a full compaction and you're back. That option does not exist with the ClearIterRange change; you're kinda screwed until it clears up by itself.

tbg self-assigned this Mar 26, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jun 21, 2018
While investigating range deletion performance issues, we realized that
large swaths of contiguous tombstones can cause massive performance
issues when seeking. Seeks of more than 15 minutes (!) were observed on
one cluster [0]. The problem is that every key overlapping the range
deletion tombstone must be loaded from disk, decompressed, and compared
with the sequence number of the tombstone until a key is found that is
not covered by the tombstone. If contiguous keys representing hundreds
of gigabytes are covered by tombstones, RocksDB will need to scan
hundreds of gigabytes of data. Needless to say, performance suffers.

We have plans to improve seek performance in the face of range deletion
tombstones upstream, but we can mitigate the issue locally, too.
Iteration outside of tests is nearly always in the context of a range or
some bounded span of local keys. By plumbing knowledge of the upper
bound we care about for each scan to RocksDB, we can invalidate the
iterator early, once the upper bound has been exceeded, rather than
scanning over potentially hundreds of gigabytes of deleted keys just to
find a key that we don't care about.

To ensure we don't forget to specify an upper bound in a critical path,
this commit requires that all iterators are either prefix iterators or
declare their upper bound. This makes iterating in tests rather irksome,
but I think the tradeoff is worthwhile.

[0]: cockroachdb#24029 (comment)

Release note: None
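
At the RocksDB level, the mitigation described in this commit boils down to giving every iterator an upper bound. A minimal standalone sketch, assuming stock RocksDB (path and keys are illustrative; this is not the actual CockroachDB plumbing):

    #include <memory>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/upper_bound_demo", &db).ok()) return 1;

      rocksdb::ReadOptions ro;
      rocksdb::Slice upper_bound("range/0042/end");  // exclusive end of the span we care about
      ro.iterate_upper_bound = &upper_bound;         // must outlive the iterator

      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
      for (it->Seek("range/0042/"); it->Valid(); it->Next()) {
        // The iterator becomes invalid as soon as it reaches the upper bound,
        // instead of grinding through gigabytes of tombstoned keys beyond it.
      }

      delete db;
      return 0;
    }
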
benesch added a commit to benesch/cockroach that referenced this issue Jun 21, 2018
The current implementation of range deletion tombstones in RocksDB
suffers from a performance bug that causes excessive CPU usage on every
read operation in a database with many range tombstones. Dropping a
large table can easily result in several thousand range deletion
tombstones in one store, resulting in an unusable cluster as documented
in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the
performance problem. This refactoring has also been proposed upstream as
facebook/rocksdb#4014.

A more minimal change was also proposed in facebook/rocksdb#3992--and
that patch better highlights the exact nature of the bug than the patch
backported here, for those looking to understand the problem. But this
refactoring, though more invasive, gets us one step closer to solving a
related problem where range deletions can cause excessively large
compactions (cockroachdb#26693). These large compactions do not appear to brick the
cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
tbg moved this from Milestone 3 to On the horizon in KV Jun 25, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jun 26, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jul 5, 2018
craig bot pushed a commit that referenced this issue Jul 6, 2018
26789: Makefile: don't globally install binaries during lint r=BramGruneir a=benesch

The build system takes pains to scope all binaries it installs to
REPO/bin to avoid polluting the user's GOPATH/bin with binaries that are
internal to the Cockroach build system.

The lint target was violating this rule by running 'go install
./pkg/...', which installed every package in the repository into
GOPATH/bin. This commit adjusts the rule to run 'go build ./pkg/...'
instead, which installs the necessary .a files into GOPATH/pkg without
installing any binaries into GOPATH/bin.

Fix #26633.

Release note: None

26872: engine: require iterators to specify an upper bound r=petermattis,bdarnell a=benesch

As promised. The issue the first commit fixes took longer than I'd like to admit to track down.

I've done my best to keep the patch as small as possible in case we want to backport to 2.0. The changes to the C++ DBIterator are unfortunately large, though. Open to less disruptive ways of plumbing the upper bound there, but I'm not seeing anything obvious.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
benesch added a commit to benesch/cockroach that referenced this issue Jul 6, 2018
While debugging cockroachdb#24029 we discovered that RESTORE generates massive
numbers of suggested compactions as it splits and scatters ranges. As
the cluster rebalances, every removed replica leaves behind a range
deletion tombstone and a suggested compaction over the keys it covered.

Occasionally, the replica chosen for rebalancing will be a member of the
last range of the cluster. This range extends from wherever the restore
has last split the table it's restoring to the very last key. Suppose
we're restoring a table with 10,000 ranges evenly distributed across
primary keys from 1-10,000. If a replica in the last range gets
rebalanced early in the restore—say, after only the first 500 ranges
have been split off—at least one node in the cluster will have a
suggested compaction for a range like the following:

    /Table/51/1/500 - /Max

This creates a huge problem! The restore will eventually create 9500
more ranges in that keyspan, each about 32MiB in size. Some of those
ranges will necessarily rebalance back onto the node with the suggested
compaction. By the time the compaction queue gets around to processing
the suggestion, there might be hundreds of gigabytes within the range.

In our 2TB (replicated) store dump, snapshotted immediately after a
RESTORE, there were two such massive suggested compactions, each of
which took over 1h to complete. This bogs down the compaction queue with
unnecessary work and makes it especially dangerous to initiate a DROP
TABLE (cockroachdb#24029), as the incoming range deletion tombstones will pile up
until the prior compaction finishes, and the cluster grinds to a halt in
the meantime.

The same problem happens whenever any replica is rebalanced away and
back before the compaction queue has a chance to compact away the range
deletion tombstone, though the impact is limited because the keyspan is
smaller.

This commit prevents the compaction queue from getting bogged down with
suggestions based on outdated information. At the time the suggestion is
considered for compaction, the queue checks whether the key span
suggested has any live keys. The sign of even a single key is a good
indicator that something has changed—usually that the replica, or one of
its split children, has been rebalanced back onto the node. The
compaction queue now deletes this suggestion instead of acting on it.

This is a crucial piece of the fix to cockroachdb#24029. The other half, cockroachdb#26449,
involves rate-limiting ClearRange requests.

I suspect this change will have a nice performance boost for RESTORE.
Anecdotally we've noticed that restores slow down over time. I'm willing
to bet it's because nonsense suggested compactions start hogging disk
I/O.

Release note: None
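
A rough sketch of the live-key check described above, assuming stock RocksDB (span keys are illustrative; the actual compactor logic lives in CockroachDB's Go code): scan the suggested span with a bounded iterator and drop the suggestion if any key turns up.

    #include <cstdio>
    #include <memory>
    #include "rocksdb/db.h"

    // Returns true if the suggested compaction span [begin, end) contains any
    // live key, in which case the suggestion should be discarded.
    bool SpanHasLiveKeys(rocksdb::DB* db, const rocksdb::Slice& begin,
                         const rocksdb::Slice& end) {
      rocksdb::ReadOptions ro;
      ro.iterate_upper_bound = &end;  // stop at the end of the suggested span
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
      it->Seek(begin);
      return it->Valid();  // a single live key is enough
    }

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/live_key_check_demo", &db).ok()) return 1;
      printf("span has live keys: %d\n", SpanHasLiveKeys(db, "table/51/", "table/52/"));
      delete db;
      return 0;
    }
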
benesch added a commit to benesch/cockroach that referenced this issue Jul 9, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jul 9, 2018
petermattis added a commit to petermattis/cockroach that referenced this issue Jul 9, 2018
Demonstrate the performance problem with iterating through a range
tombstone.

name                                        time/op
RocksDBDeleteRangeIterate/entries=10-8      6.09µs ± 3%
RocksDBDeleteRangeIterate/entries=1000-8     131µs ± 3%
RocksDBDeleteRangeIterate/entries=100000-8  12.3ms ± 3%

See cockroachdb#24029

Release note: None
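
For anyone who wants to reproduce the effect outside the Go benchmark, here is a rough standalone C++ analogue (not the benchmark above; N, keys, and path are illustrative): write N keys, cover them with a single range tombstone, flush, and time one Seek that has to get past the tombstone. On RocksDB builds without the range-tombstone fixes, this seek is where the time goes.

    #include <chrono>
    #include <cstdio>
    #include <memory>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      if (!rocksdb::DB::Open(options, "/tmp/range_del_seek_demo", &db).ok()) return 1;

      const int N = 100000;
      char key[32];
      for (int i = 0; i < N; i++) {
        snprintf(key, sizeof(key), "key%09d", i);
        db->Put(rocksdb::WriteOptions(), key, "value");
      }
      // One tombstone covering every key written above, then flush to an sstable.
      db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                      "key000000000", "key999999999");
      db->Flush(rocksdb::FlushOptions());

      auto start = std::chrono::steady_clock::now();
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
      it->Seek("key000000000");  // has to step over all N deleted keys
      auto micros = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();
      printf("seek over %d deleted keys: %lld us (valid=%d)\n", N, (long long)micros,
             (int)it->Valid());

      delete db;
      return 0;
    }
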
craig bot pushed a commit that referenced this issue Jul 10, 2018
26488: compactor: purge suggestions that have live data r=tschottdorf,bdarnell a=benesch

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
petermattis added a commit to petermattis/cockroach that referenced this issue Jul 10, 2018
petermattis (Collaborator) commented:

With the recent improvements to RocksDB range tombstones, the clearrange roachtest now reliably passes:

[screenshot, 2018-07-13]

Note that the test started slightly after 00:00. This test was using a freshly generated bank fixture. For some reason, cockroach was refusing to start with the old store fixtures, complaining about the version being too old. This doesn't make any sense and needs to be investigated. I'll file a separate issue when I can reproduce it.

benesch (Contributor, Author) commented Jul 14, 2018

For some reason, cockroach was refusing to start using the old store fixtures complaining about the version being too old. This doesn't make any sense and needs to be investigated.

Why doesn't it make any sense? I think that fixture was generated with 1.1, and you're running a 2.1 beta.

spencerkimball (Member) commented Jul 14, 2018 via email

petermattis (Collaborator) commented:

Why doesn't it make any sense? I think that fixture was generated with 1.1, and you're running a 2.1 beta.

See #27525. I generated that fixture with a binary at version v2.0-7.

In that graph, why does available stay constant while capacity decreases?

ZFS. For ease of debugging, we take a ZFS snapshot after restoring the store fixtures, which makes subsequent runs of the test fast. The downside is that as we start to delete data, ZFS doesn't actually delete it off disk because it is still referenced by the snapshot.

petermattis (Collaborator) commented:

clearrange has passed 4 nights in a row, so I'm closing this issue. #26693 remains open to investigate the excessive compactions that can still occur but do not drive a cluster to its knees.

KV automation moved this from On the horizon to Finished (milestone 3, ends 7/20) Jul 19, 2018