Range tombstones can create excessively large compactions #3977

Open
petermattis opened this issue Jun 10, 2018 · 6 comments

@petermattis (Contributor)

The TLDR is that range tombstones can cause compactions to generate sstables that do not adhere to the heuristic that an sstable at level N should cover a reasonable number of sstables at level N+1 (see SubcompactionState::ShouldStopBefore). The first and last output files for a compaction are extended to the left and right to hold range tombstones. If these are the first or last file in the level, they can be extended arbitrarily far, causing the sstable to cover thousands of sstables at the next level.

Consider the following scenario involving level-style compaction where L6 is the lowest level.

6: "a000000000" - "a000009999"
6: "b000000000" - "b000009999"
6: "c000000000" - "c000009999"
6: "d000000000" - "d000009999"
6: "e000000000" - "e000009999"

This shows 5 sstables in L6 with the corresponding key ranges. Each sstable is 128MB in size, which is the target size I've configured for L6.

I then write to key a000000000, perform a delete range on e000000000-e000010000, and flush:

0: "a000000000" - "e000010000"
6: "a000000000" - "a000009999"
6: "b000000000" - "b000009999"
6: "c000000000" - "c000009999"
6: "d000000000" - "d000009999"
6: "e000000000" - "e000009999"

Now there is an L0 sstable that covers the entire key space. I then compact the range e000000000-e000010000. Everything starts fine. The compactions from L0->L4 are simply file movements because those levels are empty.

[default] Manual compaction starting L0 -> L-2
[default] compact range L0 -> L-2: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L0 -> L4: compaction 0 overlapping inputs: 'a000000000 - 'e000010000
[default] Manual compaction starting L1 -> L2
[default] compact range L1 -> L2: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] Manual compaction starting L2 -> L3
[default] compact range L2 -> L3: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] Manual compaction starting L3 -> L4
[default] compact range L3 -> L4: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0

The compaction code decides to actually perform a compaction from L4->L5 (presumably because there are sstables in L6 and we'd like to ensure the sstables in L5 have good boundaries).

[default] Manual compaction starting L4 -> L5
[default] compact range L4 -> L5: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L4 -> L5: compaction 0 overlapping inputs: 'a000000000 - 'e000010000
[default] [JOB 4] Compacting 1@4 files to L5, score -1.00
[default] Compaction start summary: Base version 8 Base level 4, inputs: [13(1318B)]
[default] [JOB 4] Generated table #14: 2 keys, 1314 bytes: a000000000 - e000010000
[default] [JOB 4] Compacted 1@4 files to L5 => 1314 bytes

Notice that only one sstable was generated in L5, and it has exactly the same boundaries as the input from L4. The compaction from L5->L6 is then excessively large, involving all of the L6 sstables:

[default] Manual compaction starting L5 -> L6
[default] compact range L5 -> L6: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L5 -> L6: compaction 5 overlapping inputs: 'a000000000 - 'e000010000
[default] Manual compaction from level-5 to level-6 from 'e000000000' seq:72057594037927935, type:17 .. 'e000010000' seq:0, type:0; will stop at (end)
[default] [JOB 5] Compacting 1@5 + 5@6 files to L6, score -1.00
[default] Compaction start summary: Base version 9 Base level 5, inputs: [14(1314B)], [6(128MB) 8(128MB) 9(128MB) 10(128MB) 11(128MB)]

In the above, I forced this large compaction to occur by issuing a manual compaction. If I hadn't issued a manual compaction, I imagine the compaction picker would never choose this L5->L6 compaction, which is preferable in some ways, though the disk space for the keys covered by the DeleteRange would not be reclaimed for a long time.

Rather than extending the first and last output files in a compaction, I think CompactionJob::FinishCompactionOutputFile should instead create sstables before the first and after the last output file that contain only tombstones, if the tombstones cover enough sstables in the grandparent level.
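
One way to picture that proposal (a hypothetical sketch, not actual RocksDB internals; the helper name, inputs, and the `max_grandparents_per_file` limit are all stand-ins): choose cut points for the tombstone-only files so each one aligns with grandparent file boundaries and covers a bounded number of grandparent files.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the proposal. Given the sorted largest keys of the
// grandparent files and a range tombstone spanning
// [tombstone_start, tombstone_end), return keys at which to finish successive
// tombstone-only output files so that no single output covers more than
// max_grandparents_per_file grandparent files.
std::vector<std::string> TombstoneOnlyCutPoints(
    const std::vector<std::string>& grandparent_largest_keys,
    const std::string& tombstone_start, const std::string& tombstone_end,
    size_t max_grandparents_per_file) {
  std::vector<std::string> cuts;
  size_t covered = 0;
  for (const auto& k : grandparent_largest_keys) {
    if (k < tombstone_start) continue;  // grandparent ends before tombstone
    if (k >= tombstone_end) break;      // grandparent past the tombstone
    if (++covered == max_grandparents_per_file) {
      cuts.push_back(k);  // finish the current tombstone-only file here
      covered = 0;
    }
  }
  return cuts;
}
```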

For the enterprising reader who made it to the end of this report, I have a hack/workaround that mostly addresses the issue: whenever I issue a DeleteRange, I also issue a Delete for the key at the start of the range and for a key near the end of the range. For example, instead of DeleteRange("e000000000", "e000010000"), I issue DeleteRange("e000000000", "e000010000"); Delete("e000000000"); Delete("e000009999"). This has identical semantics if they are issued in the same batch (see the sketch after the logs below), but allows the compaction job to generate sstables with good boundaries. With this hack in place, running the same test as before, the L4->L5 compaction generates two sstables:

[default] [JOB 4] Generated table #14: 1 keys, 1241 bytes: a000000000 - a000000000
[default] [JOB 4] Generated table #15: 2 keys, 1314 bytes: e000000000 - e000010000

The second sstable is then used for the L5->L6 compaction:

[default] Manual compaction starting L5 -> L6
[default] compact range L5 -> L6: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L5 -> L6: compaction 1 overlapping inputs: 'e000000000 - 'e000010000
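
Expressed with the public RocksDB write API, the workaround looks roughly like this (a minimal sketch; issuing all three operations in one WriteBatch gives the identical semantics noted above):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Sketch of the workaround: pair every DeleteRange with point Deletes at the
// start of the range and near its end, applied in one atomic batch.
void DeleteRangeWithAnchors(rocksdb::DB* db) {
  rocksdb::ColumnFamilyHandle* cf = db->DefaultColumnFamily();
  rocksdb::WriteBatch batch;
  batch.DeleteRange(cf, "e000000000", "e000010000");
  // The point tombstones pin sstable boundaries so that compaction outputs
  // are not extended across the whole key space.
  batch.Delete(cf, "e000000000");
  batch.Delete(cf, "e000009999");
  rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
  assert(s.ok());
}
```
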
@riversand963 (Contributor)

cc @ajkr

@ajkr (Contributor) commented Jul 22, 2018

I believe you fixed it fully in #4050. Let us know if it's still an issue.

ajkr closed this as completed Jul 22, 2018
@petermattis (Contributor, Author)

#4050 fixed an issue where adjacent sstables were treated unnecessarily as an atomic unit if they contained part of the same range tombstone.

The issue here was about something different: SubcompactionState::ShouldStopBefore doesn't take range tombstones into account when determining whether an output sstable should be finished. Now that RangeDelAggregator::NewIterator exists, I think it should be possible to adjust the logic in CompactionJob to include range tombstones in that decision.
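
To make that concrete, here is a hypothetical sketch (not RocksDB's actual internals; `GrandparentFile` and the helper are illustrative) of what including range tombstones in that decision could mean: extend the output's upper bound to the end of any range tombstone it will carry before measuring grandparent overlap.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct GrandparentFile {
  std::string smallest_key;
  std::string largest_key;
  uint64_t size_bytes;
};

// Sum the sizes of grandparent files intersecting the output's key range,
// where the upper bound has already been extended to the end of any range
// tombstone the output file will carry.
uint64_t OverlapWithGrandparents(const std::vector<GrandparentFile>& gp,
                                 const std::string& output_smallest,
                                 const std::string& output_largest_extended) {
  uint64_t total = 0;
  for (const auto& f : gp) {
    if (f.largest_key >= output_smallest &&
        f.smallest_key <= output_largest_extended) {
      total += f.size_bytes;
    }
  }
  // ShouldStopBefore-style logic could finish the output file once this
  // total exceeds the configured grandparent-overlap limit.
  return total;
}
```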

@siying (Contributor) commented Oct 16, 2019

@ajkr do you have any suggestions for how we should fix the issue?

@siying (Contributor) commented Oct 16, 2019

@petermattis do you mean that sometimes we can cut the file so that it contains only tombstone information, without actual keys?

@petermattis (Contributor, Author)

@siying Yes.

Looking at this again, another thought comes to mind: it is unfortunate that compactions have to read their inputs in their entirety even when a range tombstone is shadowing a large swath of the data. The very large compactions described above wouldn't be a problem if they could be performed very quickly, and in the scenario depicted here the range tombstone completely shadows the data in the lower-level sstables. I'm not sure how to identify this scenario easily. We'd also have to pay attention to the existence of snapshots, which could invalidate the optimization.
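
A sketch of what that check might look like (hypothetical types and helper, not actual RocksDB code), including the snapshot caveat:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A compaction input could be dropped without reading it if one range
// tombstone covers its whole key range, is newer than everything in the
// file, and no snapshot can still see the file's keys.
struct FileMeta {
  std::string smallest_key, largest_key;
  uint64_t smallest_seqno, largest_seqno;
};

struct RangeTombstone {
  std::string start_key, end_key;  // covers [start_key, end_key)
  uint64_t seqno;
};

bool FileFullyShadowed(const FileMeta& f, const RangeTombstone& t,
                       const std::vector<uint64_t>& snapshot_seqnos) {
  // The tombstone must span the file's entire key range...
  if (t.start_key > f.smallest_key || t.end_key <= f.largest_key) {
    return false;
  }
  // ...and be newer than every entry in the file...
  if (t.seqno <= f.largest_seqno) {
    return false;
  }
  // ...and no snapshot may sit between a file entry and the tombstone,
  // since such a snapshot could still read the shadowed keys.
  for (uint64_t snap : snapshot_seqnos) {
    if (snap >= f.smallest_seqno && snap < t.seqno) {
      return false;
    }
  }
  return true;
}
```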

facebook-github-bot pushed a commit that referenced this issue Oct 24, 2019
…tions (#5956)

Summary:
For more information on the original problem, see #3977.

This change adds two new tests. They are identical except that one uses range tombstones and the other does not. Each test generates files at L2 that overlap with keys in L3. The test that uses range tombstones generates a single file at L2. This single file has a very large range overlap, which in turn creates an excessively large compaction.

1: T001 - T005
2:  000 -  005

In contrast, the test that does not use range tombstones generates 3 files at L2. As a single file is compacted at a time, those 3 files generate less work per compaction iteration.

1:  001 - 002
1:  003 - 004
1:  005
2:  000 - 005
Pull Request resolved: #5956

Differential Revision: D18071631

Pulled By: dlambrig

fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
merryChris pushed a commit to merryChris/rocksdb that referenced this issue Nov 18, 2019