Range tombstones can create excessively large compactions #3977

Open
petermattis opened this issue Jun 10, 2018 · 6 comments

@petermattis (Contributor)

The TLDR is that range tombstones can cause compactions to generate sstables that do not adhere to the heuristic that an sstable at level N should cover a reasonable number of sstables at level N+1 (see SubcompactionState::ShouldStopBefore). The first and last output files for a compaction are extended to the left and right to hold range tombstones. If these are the first or last file in the level, they can be extended arbitrarily far, causing the sstable to cover thousands of sstables at the next level.

Consider the following scenario involving level-style compaction where L6 is the lowest level.

6: "a000000000" - "a000009999"
6: "b000000000" - "b000009999"
6: "c000000000" - "c000009999"
6: "d000000000" - "d000009999"
6: "e000000000" - "e000009999"

This shows 5 sstables in L6 with the corresponding key ranges. Each sstable is 128MB in size, which is the target size I've configured for L6.

I then write to key a000000000, perform a delete range on e000000000-e000010000, and flush:

0: "a000000000" - "e000010000"
6: "a000000000" - "a000009999"
6: "b000000000" - "b000009999"
6: "c000000000" - "c000009999"
6: "d000000000" - "d000009999"
6: "e000000000" - "e000009999"

Now there is an L0 sstable that covers the entire key space. I then compact the range e000000000-e000010000. Everything starts fine. The compactions from L0->L4 are simply file movements because those levels are empty.

[default] Manual compaction starting L0 -> L-2
[default] compact range L0 -> L-2: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L0 -> L4: compaction 0 overlapping inputs: 'a000000000 - 'e000010000
[default] Manual compaction starting L1 -> L2
[default] compact range L1 -> L2: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] Manual compaction starting L2 -> L3
[default] compact range L2 -> L3: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] Manual compaction starting L3 -> L4
[default] compact range L3 -> L4: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0

The compaction code decides to actually perform a compaction from L4->L5 (presumably because there are sstables in L6 and we'd like to ensure the sstables in L5 have good boundaries).

[default] Manual compaction starting L4 -> L5
[default] compact range L4 -> L5: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L4 -> L5: compaction 0 overlapping inputs: 'a000000000 - 'e000010000
[default] [JOB 4] Compacting 1@4 files to L5, score -1.00
[default] Compaction start summary: Base version 8 Base level 4, inputs: [13(1318B)]
[default] [JOB 4] Generated table #14: 2 keys, 1314 bytes: a000000000 - e000010000
[default] [JOB 4] Compacted 1@4 files to L5 => 1314 bytes

Notice that only one sstable was generated in L5, and it has exactly the same boundaries as the input from L4. The compaction from L5->L6 is then excessively large, involving all of the L6 sstables:

[default] Manual compaction starting L5 -> L6
[default] compact range L5 -> L6: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L5 -> L6: compaction 5 overlapping inputs: 'a000000000 - 'e000010000
[default] Manual compaction from level-5 to level-6 from 'e000000000' seq:72057594037927935, type:17 .. 'e000010000' seq:0, type:0; will stop at (end)
[default] [JOB 5] Compacting 1@5 + 5@6 files to L6, score -1.00
[default] Compaction start summary: Base version 9 Base level 5, inputs: [14(1314B)], [6(128MB) 8(128MB) 9(128MB) 10(128MB) 11(128MB)]

In the above, I forced this large compaction to occur by issuing a manual compaction. If I hadn't issued a manual compaction, I imagine the compaction picker would never choose this L5->L6 compaction, which is preferable in some ways, though the disk space for the keys covered by the DeleteRange would not be reclaimed for a long time.

Rather than extending the first and last output files in a compaction, I think CompactionJob::FinishCompactionOutputFile should instead create sstables before the first and after the last output file that contain only tombstones, if the tombstones cover enough sstables in the grandparent level.
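
One way to picture that proposal (a hypothetical sketch, not actual RocksDB internals; the helper name, inputs, and the `max_grandparents_per_file` limit are all stand-ins): choose cut points for the tombstone-only files so each one aligns with grandparent file boundaries and covers a bounded number of grandparent files.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the proposal. Given the sorted largest keys of the
// grandparent files and a range tombstone spanning
// [tombstone_start, tombstone_end), return keys at which to finish successive
// tombstone-only output files so that no single output covers more than
// max_grandparents_per_file grandparent files.
std::vector<std::string> TombstoneOnlyCutPoints(
    const std::vector<std::string>& grandparent_largest_keys,
    const std::string& tombstone_start, const std::string& tombstone_end,
    size_t max_grandparents_per_file) {
  std::vector<std::string> cuts;
  size_t covered = 0;
  for (const auto& k : grandparent_largest_keys) {
    if (k < tombstone_start) continue;  // grandparent ends before tombstone
    if (k >= tombstone_end) break;      // grandparent past the tombstone
    if (++covered == max_grandparents_per_file) {
      cuts.push_back(k);  // finish the current tombstone-only file here
      covered = 0;
    }
  }
  return cuts;
}
```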

For the enterprising reader who made it to the end of this report, I have a hack/workaround that mostly addresses the issue: whenever I issue a DeleteRange, I also issue a Delete for the key at the start of the range and for a key near the end of the range. For example, instead of DeleteRange("e000000000", "e000010000"), I issue DeleteRange("e000000000", "e000010000"); Delete("e000000000"); Delete("e000009999"). This has identical semantics if they are issued in the same batch (see the sketch after the logs below), but allows the compaction job to generate sstables with good boundaries. With this hack in place, running the same test as before, the L4->L5 compaction generates two sstables:

[default] [JOB 4] Generated table #14: 1 keys, 1241 bytes: a000000000 - a000000000
[default] [JOB 4] Generated table #15: 2 keys, 1314 bytes: e000000000 - e000010000

The second sstable is then used for the L5->L6 compaction:

[default] Manual compaction starting L5 -> L6
[default] compact range L5 -> L6: 'e000000000' seq:72057594037927935, type:17 - 'e000010000' seq:0, type:0
[default] L5 -> L6: compaction 1 overlapping inputs: 'e000000000 - 'e000010000
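
Expressed with the public RocksDB write API, the workaround looks roughly like this (a minimal sketch; issuing all three operations in one WriteBatch gives the identical semantics noted above):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Sketch of the workaround: pair every DeleteRange with point Deletes at the
// start of the range and near its end, applied in one atomic batch.
void DeleteRangeWithAnchors(rocksdb::DB* db) {
  rocksdb::ColumnFamilyHandle* cf = db->DefaultColumnFamily();
  rocksdb::WriteBatch batch;
  batch.DeleteRange(cf, "e000000000", "e000010000");
  // The point tombstones pin sstable boundaries so that compaction outputs
  // are not extended across the whole key space.
  batch.Delete(cf, "e000000000");
  batch.Delete(cf, "e000009999");
  rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
  assert(s.ok());
}
```
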
@riversand963 (Contributor)

cc @ajkr

@ajkr (Contributor) commented Jul 22, 2018

I believe you fixed it fully in #4050. Let us know if it's still an issue.

ajkr closed this as completed Jul 22, 2018
@petermattis (Contributor, Author)

#4050 fixed an issue where adjacent sstables were treated unnecessarily as an atomic unit if they contained part of the same range tombstone.

The issue here was about something different: SubcompactionState::ShouldStopBefore doesn't take range tombstones into account when determining whether an output sstable should be finished. Now that RangeDelAggregator::NewIterator exists, I think it should be possible to adjust the logic in CompactionJob to include range tombstones in that decision.
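
To make that concrete, here is a hypothetical sketch (not RocksDB's actual internals; `GrandparentFile` and the helper are illustrative) of what including range tombstones in that decision could mean: extend the output's upper bound to the end of any range tombstone it will carry before measuring grandparent overlap.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct GrandparentFile {
  std::string smallest_key;
  std::string largest_key;
  uint64_t size_bytes;
};

// Sum the sizes of grandparent files intersecting the output's key range,
// where the upper bound has already been extended to the end of any range
// tombstone the output file will carry.
uint64_t OverlapWithGrandparents(const std::vector<GrandparentFile>& gp,
                                 const std::string& output_smallest,
                                 const std::string& output_largest_extended) {
  uint64_t total = 0;
  for (const auto& f : gp) {
    if (f.largest_key >= output_smallest &&
        f.smallest_key <= output_largest_extended) {
      total += f.size_bytes;
    }
  }
  // ShouldStopBefore-style logic could finish the output file once this
  // total exceeds the configured grandparent-overlap limit.
  return total;
}
```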

@siying (Contributor) commented Oct 16, 2019

@ajkr do you have any suggestions for how we should fix the issue?

@siying (Contributor) commented Oct 16, 2019

@petermattis do you mean that sometimes we can cut the file so that it contains only tombstone information, without actual keys?

@petermattis (Contributor, Author)

@siying Yes.

Looking at this again, another thought comes to mind: it is unfortunate that compactions have to read their inputs in their entirety even when a range tombstone is shadowing a large swath of the data. The very large compactions described above wouldn't be a problem if they could be performed very quickly, and in the scenario depicted here the range tombstone completely shadows the data in the lower-level sstables. I'm not sure how to identify this scenario easily. We'd also have to pay attention to the existence of snapshots, which could invalidate the optimization.
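
A sketch of what that check might look like (hypothetical types and helper, not actual RocksDB code), including the snapshot caveat:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A compaction input could be dropped without reading it if one range
// tombstone covers its whole key range, is newer than everything in the
// file, and no snapshot can still see the file's keys.
struct FileMeta {
  std::string smallest_key, largest_key;
  uint64_t smallest_seqno, largest_seqno;
};

struct RangeTombstone {
  std::string start_key, end_key;  // covers [start_key, end_key)
  uint64_t seqno;
};

bool FileFullyShadowed(const FileMeta& f, const RangeTombstone& t,
                       const std::vector<uint64_t>& snapshot_seqnos) {
  // The tombstone must span the file's entire key range...
  if (t.start_key > f.smallest_key || t.end_key <= f.largest_key) {
    return false;
  }
  // ...and be newer than every entry in the file...
  if (t.seqno <= f.largest_seqno) {
    return false;
  }
  // ...and no snapshot may sit between a file entry and the tombstone,
  // since such a snapshot could still read the shadowed keys.
  for (uint64_t snap : snapshot_seqnos) {
    if (snap >= f.smallest_seqno && snap < t.seqno) {
      return false;
    }
  }
  return true;
}
```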

facebook-github-bot pushed a commit that referenced this issue Oct 24, 2019
…tions (#5956)

Summary:
For more information on the original problem, see #3977.

This change adds two new tests. They are identical except that one uses range tombstones and the other does not. Each test generates files at L2 that overlap with keys in L3. The test that uses range tombstones generates a single file at L2. This single file has a very large range overlap, which in turn creates an excessively large compaction.

1: T001 - T005
2:  000 -  005

In contrast, the test that does not use range tombstones generates 3 files at L2. As a single file is compacted at a time, those 3 files generate less work per compaction iteration.

1:  001 - 002
1:  003 - 004
1:  005
2:  000 - 005
Pull Request resolved: #5956

Differential Revision: D18071631

Pulled By: dlambrig

fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
merryChris pushed a commit to merryChris/rocksdb that referenced this issue Nov 18, 2019