Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax VersionStorageInfo::GetOverlappingInputs check #4050

Closed

Conversation

petermattis
Copy link
Contributor

Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.

Note that this conflicts with
#2769 and cases
DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically to
fail.

@petermattis
Copy link
Contributor Author

@ajkr In #2769 you said:

One of the core assumptions of DeleteRange is that files containing portions of the same range tombstone are treated as a single unit from the perspective of compaction picker. Need better tests for this. This PR adds the tests for manual compaction.

I'm not seeing why that invariant (treating all sstables which overlap a range tombstone as a single unit) is important. The only test which fails with this PR is the one you added in #2769, but that test is simply checking for the invariant you describe, not some problematic outcome of violating the invariant.

If we don't treat the sstables as a single unit, then a range tombstone can exist in multiple levels, though I'm not seeing why that is problematic as compaction already needs to deal with the same range tombstone appearing in multiple adjacent sstables. What am I missing?

@petermattis
Copy link
Contributor Author

Cc @benesch who has been thinking about range tombstones a lot.

@ajkr
Copy link
Contributor

ajkr commented Jun 25, 2018

Hm, looking at it more there might've been some circular reasoning behind that assumption :/. It was like:

  1. For implementation convenience we assumed that, when a range tombstone spans multiple files, those files will be compacted as a single unit since that's how compaction picking already worked. Under this assumption, we don't need to break a range tombstone into smaller pieces before putting it in the meta-block. So the range tombstone meta-block actually contains range tombstones that extend outside the SST file's range.
  2. Given what we decided to do in 1 (not shorten tombstones in meta-block), compaction picker must treat all files containing the same tombstone as a single unit. If it does not we run into this situation described by @zhangjinpeng1987 in bug with DeleteRange #2752 (copying relevant part below):
  • Write a delete range [a, d)
  • After compaction, this delete range may be added to multiple sst files: a.sst, b.sst, c.sst, but the range of these sst files are [a, b), [b, c), [c, d).
  • When a.sst and b.sst reach the bottommost level, the delete range of these two sst files will be dropped.
  • Write new keys in range [a, c).
  • When the new written keys [a, c) reach the bottommost level, their sequence number will be set to zero.
  • When the front c.sst compact with these keys [a, c) that sequence number are zero, these new written keys will be dropped wrongly.

@ajkr
Copy link
Contributor

ajkr commented Jun 25, 2018

In summary I think a prerequisite to this PR would be shortening the range tombstones in the meta-block. Besides that I can't think of any problem.

@petermattis
Copy link
Contributor Author

@ajkr Shortening the range tombstones in the meta-block was something I had a question about. Is there anything preventing that from happening? Feels strange to both @benesch and I that an sstable can contain a range tombstone that extends past the boundaries of the sstable.

@ajkr
Copy link
Contributor

ajkr commented Jun 25, 2018

I think nothing prevents it. We just didn't need to do it before because the compaction picking always picked all files that a range tombstone spanned.

@petermattis
Copy link
Contributor Author

I think nothing prevents it. We just didn't need to do it before because the compaction picking always picked all files that a range tombstone spanned.

Roger. I'll work on a PR to trim range tombstones to sstable boundaries and then revisit this PR. Also, thanks for the explanation of why the current invariant that all of the sstables which overlap a range tombstone must be treated as a single unit. That was not at all obvious.

@ajkr ajkr self-requested a review July 9, 2018 19:54
petermattis added a commit to cockroachdb/rocksdb that referenced this pull request Jul 9, 2018
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having two process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
@petermattis petermattis force-pushed the pmattis/get-overlapping-inputs branch 2 times, most recently from 6af5b58 to a1ea122 Compare July 9, 2018 21:29
@petermattis petermattis changed the title [DO NOT MERGE] Relax VersionStorageInfo::GetOverlappingInputs check Relax VersionStorageInfo::GetOverlappingInputs check Jul 9, 2018
@petermattis
Copy link
Contributor Author

Depends on #4081.

petermattis added a commit to cockroachdb/rocksdb that referenced this pull request Jul 10, 2018
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having two process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
petermattis added a commit to cockroachdb/rocksdb that referenced this pull request Jul 10, 2018
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having two process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
@petermattis petermattis force-pushed the pmattis/get-overlapping-inputs branch from a1ea122 to ac857ae Compare July 11, 2018 22:24
// be careful to use SerializeEndKey(), allocates new memory
InternalKey SerializeEndKey() const {
return InternalKey(end_key_, seq_, kTypeRangeDeletion);
return InternalKey(end_key_, kMaxSequenceNumber, kTypeRangeDeletion);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr This was a fun bug to track down. The only usage of SerializeEndKey is from RangeDelAggregator::AddToBuilder. The last sstable output by a compaction could have it's largest key set to RangeTombstone::SerializeEndKey(), but the tombstone end-key is exclusive so that could cause the sstable to overlap with the next sstable. I need to think through whether this was a bug prior to this PR or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh got it... I suspect it wasn't a problem before this PR. In the compaction you described, the "next sstable" should've been pulled into a single compaction due to key-range overlap. The single compaction should be able to output files with non-overlapping key-ranges.

@petermattis petermattis force-pushed the pmattis/get-overlapping-inputs branch from ac857ae to d1d5935 Compare July 12, 2018 13:55
@petermattis petermattis force-pushed the pmattis/get-overlapping-inputs branch from d1d5935 to 2ae3dec Compare July 12, 2018 19:26
petermattis added a commit to petermattis/rocksdb that referenced this pull request Jul 13, 2018
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having two process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
petermattis added a commit to petermattis/rocksdb that referenced this pull request Jul 13, 2018
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having two process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having two process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
@petermattis petermattis force-pushed the pmattis/get-overlapping-inputs branch from 2ae3dec to fd766ca Compare July 13, 2018 20:23
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is set to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.

This previous behavior of GetOverlappingInputs was necessary due to the
following scenario:

  * Write a delete range [a, d).
  * After compaction, this deleted range may be added to multiple sst
    files: a.sst, b.sst, c.sst, but the boundaries of these sst files
    are [a, b), [b, c), [c, d).
  * When a.sst and b.sst reach the bottommost level, the delete range of
    the sst files will be dropped.
  * Write a new key in the range [a, c).
  * When the newly written key [a, c) reaches the bottommost level, its
    sequence number will be set to zero.
  * When the front c.sst compacts with the key in the range [a, c) the
    sequence number of that key is zero, and the key will be incorrectly
    dropped.

That scenario no longer occurs because we are truncating range deletion
tombstones to sstable boundaries when adding them to RangeDelAggregator.
@petermattis petermattis force-pushed the pmattis/get-overlapping-inputs branch from fd766ca to 7363c90 Compare July 13, 2018 20:25
Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@@ -946,22 +949,111 @@ TEST_F(DBRangeDelTest, CompactionTreatsSplitInputLevelDeletionAtomically) {
if (i == 0) {
ASSERT_OK(db_->CompactFiles(
CompactionOptions(), {meta.levels[1].files[0].name}, 2 /* level */));
ASSERT_EQ(0, NumTableFilesAtLevel(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why does CompactFiles pull in all the adjacent files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to. In the test, we tell CompactFiles to compact sstable 17. CompactionPicker::SanitizeCompactionInputFilesForAllLevels pulls in sstables 20, 19, 18, and 7 as well. Looks like that method has logic to expand the input file set similar to VersionSet::GetOverlappingInputs. Probably worthwhile to fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can follow up on that sometime, though I'll land this part now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Glad to see this get merged.

// be careful to use SerializeEndKey(), allocates new memory
InternalKey SerializeEndKey() const {
return InternalKey(end_key_, seq_, kTypeRangeDeletion);
return InternalKey(end_key_, kMaxSequenceNumber, kTypeRangeDeletion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh got it... I suspect it wasn't a problem before this PR. In the compaction you described, the "next sstable" should've been pulled into a single compaction due to key-range overlap. The single compaction should be able to output files with non-overlapping key-ranges.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@petermattis petermattis deleted the pmattis/get-overlapping-inputs branch July 20, 2018 11:49
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.

Note that this conflicts with
facebook#2769 and cases
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: facebook#4050

Differential Revision: D8844423

Pulled By: ajkr

fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants