Range tombstones can create excessively large compactions #3977
Comments
cc @ajkr
I believe you fixed it fully in #4050. Let us know if it's still an issue.
#4050 fixed an issue where adjacent sstables were treated unnecessarily as an atomic unit if they contained part of the same range tombstone. The issue here was about something different:
@ajkr do you have any suggestions for how we should fix the issue?
@petermattis do you mean that sometimes we can cut a file so that it contains only tombstone information, without actual keys?
@siying Yes. Looking at this again, another thought comes to mind: it is unfortunate that compactions have to read their inputs in their entirety even if a range tombstone is shadowing a large swath of the data. The very large compactions described above wouldn't be a problem if they could be performed very quickly. And in the scenario depicted there, the range tombstone is completely shadowing the data in the lower level sstables. I'm not sure how to identify this scenario easily. We'd also have to pay attention to the existence of snapshots, which could invalidate the optimization.
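The snapshot concern above can be made concrete. Below is a hypothetical Python sketch (not RocksDB code; the function and parameter names are invented for illustration) of the check a compaction might perform before skipping a lower-level file that a range tombstone fully shadows: the tombstone must cover the file's entire key range, be newer than everything in the file, and no live snapshot may sit between the file's entries and the tombstone, since such a snapshot still needs the shadowed data.

```python
def file_fully_droppable(file_min_seq, file_max_seq,
                         file_smallest, file_largest,
                         tomb_start, tomb_end, tomb_seq,
                         snapshots):
    """Hypothetical check: may a compaction skip reading this file?

    The file holds keys in [file_smallest, file_largest] with sequence
    numbers in [file_min_seq, file_max_seq]; the range tombstone covers
    [tomb_start, tomb_end) at sequence tomb_seq; snapshots is the list of
    live snapshot sequence numbers.
    """
    # The tombstone must cover every key in the file...
    covered = tomb_start <= file_smallest and file_largest < tomb_end
    # ...and be newer than every entry in the file.
    if not covered or tomb_seq <= file_max_seq:
        return False
    # A snapshot taken after the file's entries but before the tombstone
    # can still read the shadowed keys, so the file must be kept.
    return all(s < file_min_seq or s >= tomb_seq for s in snapshots)
```

With no snapshots in the gap, the file could in principle be dropped without being read; a single snapshot between the file's entries and the tombstone invalidates the optimization, which matches the caveat above.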
…tions (#5956)

Summary: For more information on the original problem see this [link](#3977).

This change adds two new tests. They are identical other than one uses range tombstones and the other does not. Each test generates sub files at L2 which overlap with keys at L3. The test that uses range tombstones generates a single file at L2. This single file will generate a very large range overlap that will in turn create an excessively large compaction.

```
1: T001 - T005
2: 000 - 005
```

In contrast, the test that uses key ranges generates 3 files at L2. As a single file is compacted at a time, those 3 files will generate less work per compaction iteration.

```
1: 001 - 002
1: 003 - 004
1: 005
2: 000 - 005
```

Pull Request resolved: #5956
Differential Revision: D18071631
Pulled By: dlambrig
fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
The TLDR is that range tombstones can cause compactions to generate sstables that do not adhere to the heuristic that an sstable at level N should cover a reasonable number of sstables at level N+1 (see `SubcompactionState::ShouldStopBefore`). The first and last output file for a compaction are extended to the left and right to hold range tombstones. If these are the first or last file in the level, they can be extended arbitrarily far, causing the sstable to cover thousands of sstables at the next level.

Consider the following scenario involving level-style compaction where L6 is the lowest level.
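The heuristic mentioned above can be sketched as follows. This is an illustrative Python simulation, not the actual RocksDB implementation: while writing compaction output, track how many bytes of grandparent files the current output file spans, and cut a new output file once that exceeds a threshold. The function name and threshold parameter below are assumptions for illustration.

```python
def split_outputs(keys, grandparents, max_overlap_bytes):
    """Split sorted output keys into files, cutting whenever the current
    file would overlap too many bytes of grandparent files.

    grandparents: sorted list of (smallest_key, largest_key, size_bytes).
    """
    outputs, current, overlap, gp_idx = [], [], 0, 0
    for k in keys:
        # Advance past grandparent files entirely before k, accumulating
        # the grandparent bytes the current output file now spans.
        while gp_idx < len(grandparents) and grandparents[gp_idx][1] < k:
            overlap += grandparents[gp_idx][2]
            gp_idx += 1
        # Cut a new output file once the overlap budget is exceeded.
        if current and overlap > max_overlap_bytes:
            outputs.append(current)
            current, overlap = [], 0
        current.append(k)
    if current:
        outputs.append(current)
    return outputs
```

This is exactly the heuristic that tombstone-extended boundaries defeat: extending the first or last output file does not pass through this key-by-key accounting, so the resulting file can overlap far more grandparent bytes than the budget allows.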
This is showing that there are 5 sstables in L6 with the corresponding key ranges. Each sstable is 128MB in size, which is the target size I've configured for L6.
I then write to key `a000000000`, perform a delete range on `e000000000-e000010000`, and flush. Now there is an L0 sstable that covers the entire key space. I then compact the range `e000000000-e000010000`. Everything starts fine. The compactions from L0->L4 are simply file movements because those levels are empty.

The compaction code decides to actually perform a compaction from L4->L5 (I'm assuming because there are sstables in L6 and we'd like to ensure the sstables in L5 have good boundaries).
Notice that only 1 sstable was generated in L5 and it has exactly the same boundary as the input from L4. The compaction from L5->L6 is then excessively large, involving all of the L6 sstables.
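Why one wide L5 file is so costly can be shown with a small simulation (the file boundaries below are hypothetical, not the ones from the test run): compaction inputs at the next level are all files whose key range overlaps the input file, so a tombstone-extended file spanning the keyspace drags in every L6 file at once.

```python
def overlapping_files(smallest, largest, level_files):
    """Return the files in level_files, given as (smallest_key,
    largest_key) pairs, whose key range overlaps [smallest, largest]."""
    return [f for f in level_files
            if not (f[1] < smallest or f[0] > largest)]

# Five hypothetical L6 files, analogous to the scenario above.
l6 = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h"), ("i", "j")]

# A tombstone-extended L5 file spanning the keyspace pulls in all of L6.
wide = overlapping_files("a", "j", l6)

# A narrowly-bounded L5 file overlaps only one L6 file.
narrow = overlapping_files("c", "d", l6)
```

A compaction's size scales with the total bytes of these overlapping inputs, which is why the single wide L5 file produces one huge L5->L6 compaction instead of several small ones.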
In the above, I forced this large compaction to occur by issuing a manual compaction. If I hadn't, I imagine the compaction picker would never choose this L5->L6 compaction, which is preferable in some ways, though the disk space for the keys covered by the `DeleteRange` would not be reclaimed for a long time.

Rather than extending the first and last output file in a compaction, I think `CompactionJob::FinishCompactionOutputFile` should instead create sstables before and after the last output file that contain only tombstones, if the tombstones cover enough sstables in the grandparent level.

For the enterprising reader who made it to the end of this report, I have a hack/workaround that mostly addresses the issue. Whenever I issue a `DeleteRange`, I also issue a `Delete` for the key at the start of the range and a key near the end of the range. For example, instead of `DeleteRange("e000000000", "e000010000")`, I issue `DeleteRange("e000000000", "e000010000"); Delete("e000000000"); Delete("e000009999")`. This has identical semantics if they are issued in the same batch, but allows the compaction job to generate sstables with good boundaries. With this hack in place, running the same test as before, the L4->L5 compaction generates two sstables. The second sstable is then used for the L5->L6 compaction.
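The workaround's "identical semantics" claim can be sketched on a plain dict (this is a simulation, not RocksDB): a delete-range over `[start, end)` plus point deletes for keys inside that range leaves exactly the same visible data, since the point-deleted keys were already covered by the range. The point deletes matter only as real keys that give the compaction output files natural cut points.

```python
def apply_delete_range(db, start, end):
    """Delete every key in [start, end) from the dict."""
    for k in [k for k in db if start <= k < end]:
        del db[k]

def apply_delete(db, key):
    """Point-delete a single key; deleting an absent key is a no-op."""
    db.pop(key, None)

db1 = {"d000000001": 1, "e000000005": 2, "f000000001": 3}
db2 = dict(db1)

# Range delete alone.
apply_delete_range(db1, "e000000000", "e000010000")

# Range delete plus the workaround's point deletes, as in the same batch.
apply_delete_range(db2, "e000000000", "e000010000")
apply_delete(db2, "e000000000")   # start of the range
apply_delete(db2, "e000009999")   # a key near the end of the range
```

Both dicts end up in the same state, which is the sense in which the extra `Delete` calls are semantically free while still materializing keys at the range boundaries.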