
db: sstable output size during compactions should consider range tombstones #167

Closed
petermattis opened this issue Jun 29, 2019 · 6 comments · Fixed by #382
Comments

petermattis (Collaborator) commented Jun 29, 2019

See facebook/rocksdb#3977 for the similar issue in RocksDB. The TL;DR is that the sstable output size logic does not truncate range tombstones. This can result in an sstable covering an excessively large number of sstables in the level below, which in turn leads to an excessively large compaction. There is a comment in the code about this:

	// TODO(peter,rangedel): Need to incorporate the range tombstones in the
	// shouldStopBefore decision.

I'm not clear on exactly how I'd want to incorporate range tombstones into the shouldStopBefore decision. See the RocksDB issue for a problematic scenario. It is possible that we need to truncate and split range tombstones to avoid creating sstables which overlap too much of the grandparent level.
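
To make "truncating" concrete, here is a minimal, self-contained sketch of clamping a range tombstone to an output file's boundary. The rangeTombstone type and truncate helper are illustrative stand-ins, not Pebble's actual rangedel API:

	package main

	import (
		"bytes"
		"fmt"
	)

	// rangeTombstone is a simplified stand-in for Pebble's internal
	// range tombstone representation.
	type rangeTombstone struct {
		start, end []byte
	}

	// truncate clamps a tombstone to the half-open interval
	// [fileStart, fileLimit), returning false if nothing remains. This
	// mirrors the idea of not letting a tombstone in an output sstable
	// extend past the point where the file was cut.
	func truncate(t rangeTombstone, fileStart, fileLimit []byte) (rangeTombstone, bool) {
		if bytes.Compare(t.start, fileStart) < 0 {
			t.start = fileStart
		}
		if fileLimit != nil && bytes.Compare(t.end, fileLimit) > 0 {
			t.end = fileLimit
		}
		if bytes.Compare(t.start, t.end) >= 0 {
			return rangeTombstone{}, false
		}
		return t, true
	}

	func main() {
		t := rangeTombstone{start: []byte("a"), end: []byte("z")}
		// An output file cut at "m" should only carry the tombstone up to "m".
		if tt, ok := truncate(t, []byte("a"), []byte("m")); ok {
			fmt.Printf("truncated tombstone: [%s, %s)\n", tt.start, tt.end)
		}
	}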

Ryanfsdf self-assigned this Aug 9, 2019
petermattis assigned itsbilal and unassigned Ryanfsdf Sep 5, 2019
petermattis assigned ajkr and unassigned itsbilal Sep 30, 2019

ajkr commented Nov 5, 2019

I was thinking the iterator underlying compaction iterator could present the range tombstone start keys inline with the point keys. When compaction iterator encounters such a key, it would pass the corresponding end key to shouldStopBefore. If shouldStopBefore returns true, then it stops and cuts the file immediately (i.e., before the start key of the range tombstone whose end key we want to exclude).

Will take a look at what's been done already and whether this is feasible given the current code structure.
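
A minimal, self-contained sketch of that idea; the kv type and this shouldStopBefore are hypothetical simplifications, not the actual compaction iterator API:

	package main

	import (
		"bytes"
		"fmt"
	)

	// kv is a stand-in for an entry surfaced by the compaction's input
	// iterator; rangeDelEnd is set when the entry is a range tombstone's
	// start key presented inline with the point keys.
	type kv struct {
		key         []byte
		rangeDelEnd []byte // nil for point keys
	}

	// shouldStopBefore is a stand-in for the real grandparent-overlap
	// check: here it simply says to cut at any key >= "m".
	func shouldStopBefore(key []byte) bool {
		return bytes.Compare(key, []byte("m")) >= 0
	}

	func main() {
		input := []kv{
			{key: []byte("a")},
			{key: []byte("c"), rangeDelEnd: []byte("z")}, // rangedel [c, z)
			{key: []byte("d")},
		}
		for _, e := range input {
			// For a range tombstone, consult shouldStopBefore with its
			// end key; if it says stop, cut the file before the start key
			// so the file excludes the overlapping tombstone entirely.
			if e.rangeDelEnd != nil && shouldStopBefore(e.rangeDelEnd) {
				fmt.Printf("cut output file before %s\n", e.key)
			}
			fmt.Printf("add %s to output\n", e.key)
		}
	}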


ajkr commented Nov 5, 2019

shouldStopBefore might be due for a remodel. Every time we encounter a key, it recomputes how many bytes of grandparent files are now overlapped, but that's not really necessary. We could instead compute once, at the start of each file, where that file's limit should be according to grandparent overlap. The exclusive limit would be the start of the first grandparent file that would cause us to exceed the size limit if we included it. Then, when we pass this limit to finishOutput(), it looks like it gets passed along to Tombstones(), which automatically truncates our tombstones to that point. Will try this out.
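
A sketch of what computing the limit once per output file might look like; grandparentFile, fileLimit, and the 64 MB threshold here are illustrative assumptions, not Pebble's actual code:

	package main

	import "fmt"

	// grandparentFile is a stand-in for a grandparent-level sstable's
	// metadata: its smallest user key and its size in bytes.
	type grandparentFile struct {
		smallest []byte
		size     uint64
	}

	// fileLimit computes, once per output file, the exclusive key at
	// which the file should be cut: the smallest key of the first
	// grandparent file whose inclusion would push the overlap past
	// maxOverlapBytes. Returns nil when no limit is needed. (The real
	// logic would also account for the output file's start key.)
	func fileLimit(grandparents []grandparentFile, maxOverlapBytes uint64) []byte {
		var overlap uint64
		for _, g := range grandparents {
			overlap += g.size
			if overlap > maxOverlapBytes {
				return g.smallest
			}
		}
		return nil
	}

	func main() {
		gp := []grandparentFile{
			{smallest: []byte("c"), size: 40 << 20},
			{smallest: []byte("m"), size: 40 << 20},
			{smallest: []byte("t"), size: 40 << 20},
		}
		// With a 64 MB overlap limit, the second file tips us over,
		// so the output file should be cut before "m".
		fmt.Printf("cut file before %s\n", fileLimit(gp, 64<<20))
	}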


ajkr commented Nov 5, 2019

I tried out @petermattis's original suggestion today:

> What I think we may want to change is the shouldStopBefore return value. Rather than truncating before the supplied key, I think we want to allow truncation between the previous key that was output for an sstable and the current key. And that determination will be made using the grandparents level.

It is pretty close to correct but this edge case is troubling me:

  • input level file has point key a, rangedel a-z
  • grandparent level indicates cut should happen somewhere between a and z, let's say at m
  • the compaction iterator only sees key a twice (once as the sentinel key for the rangedel). Neither one meets shouldStopBefore's criteria, so there's no file cutting. Then the full rangedel is output, leading to a larger-than-desired future compaction.

petermattis (Collaborator, Author) commented:

> the compaction iterator only sees key a twice (once as the sentinel key for the rangedel). Neither one meets shouldStopBefore's criteria, so there's no file cutting. Then the full rangedel is output, leading to a larger-than-desired future compaction.

Is the concern that there is no other key being output in the compaction? I think there needs to be a final call into shouldStopBefore when we reach the end of the compaction input so that we can place range tombstones into sstables aligned with the grandparent boundary. This actually probably needs to happen in a loop. In your scenario, you could imagine grandparent boundaries at m, q, s, u, x, etc. I think this actually applies to any separation between sstables.

@ajkr
Copy link
Contributor

ajkr commented Nov 6, 2019

Yes, there is no other key to pass to shouldStopBefore. Hm, yeah, we can just make it take an empty or nil key and in that case simply return the next file limit. Yes, I agree it needs to happen in a loop (not just the shouldStopBefore at the end; there could also be multiple file limits between two point keys earlier in the compaction). OK, this is sounding cleaner than I was thinking yesterday.
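
A self-contained sketch of that loop for the a/rangedel-a-z scenario above; nextFileLimit stands in for what shouldStopBefore would do when given no next point key (the nil-key convention): simply yield the next grandparent boundary. All names here are illustrative, not Pebble's actual API:

	package main

	import (
		"bytes"
		"fmt"
	)

	// nextFileLimit returns the first grandparent boundary after
	// prevLimit; nil means there are no further boundaries. In the real
	// code this corresponds to calling shouldStopBefore with a nil key.
	func nextFileLimit(boundaries [][]byte, prevLimit []byte) []byte {
		for _, b := range boundaries {
			if prevLimit == nil || bytes.Compare(b, prevLimit) > 0 {
				return b
			}
		}
		return nil
	}

	func main() {
		// A rangedel [a, z) with no further point keys: keep cutting
		// output files at each grandparent boundary until the
		// tombstone's end key is reached.
		boundaries := [][]byte{[]byte("m"), []byte("q"), []byte("s"), []byte("u"), []byte("x")}
		tombstoneEnd := []byte("z")

		var limit []byte
		for {
			limit = nextFileLimit(boundaries, limit)
			if limit == nil || bytes.Compare(limit, tombstoneEnd) >= 0 {
				fmt.Printf("final file carries tombstone up to %s\n", tombstoneEnd)
				return
			}
			fmt.Printf("cut file at %s; tombstone fragment ends there\n", limit)
		}
	}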

ajkr added a commit to ajkr/pebble that referenced this issue Dec 2, 2019
Previously range tombstones could cause output files that would later
undergo excessively large compactions due to overlap with the
grandparent level. This happened because only point keys were considered
for cutting output files according to grandparent overlap. Thus if a
range tombstone extended over a region devoid of point keys, compaction
would never split files within that region.

To fix this we now split output files at grandparent boundaries
regardless of the presence of point keys. That means a region covered by
a range tombstone and devoid of point keys can be split at grandparent
boundaries if needed to prevent future huge compactions.

Fixes cockroachdb#167.
ajkr added a commit to ajkr/pebble that referenced this issue Dec 3, 2019
ajkr closed this as completed in #382 Dec 4, 2019
ajkr added a commit that referenced this issue Dec 4, 2019