cut compaction outputs at grandparent boundaries #382
Conversation
It's a bit WIP as multiple tests are failing, one for unknown reasons (looks like an unrelated delete range bug at first glance).
Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion
compaction.go, line 101 at r1 (raw file):
overlappedBytes uint64 // bytes of overlap with grandparent tables
// TODO(ajkr): seenKey should be true if a range tombstone has been passed that
I checked and compactionIter
was not returning sentinel keys for the range tombstones output by the compaction -- should it have been?
Reviewable status: 0 of 4 files reviewed, 2 unresolved discussions
compaction.go, line 1263 at r1 (raw file):
limit := c.nextLimitBefore(nil)
for limit != nil {
	if err := finishOutput(*limit); err != nil {
I should also add a test for an empty region (i.e., devoid of both point keys and range tombstones) in the middle of a compaction that spans multiple grandparent limits. Need to make sure finishOutput()
works fine when called on empty files.
The structure looks like what I was imagining. It's actually a smaller change than I thought it would be, though handling tombstones at the beginning of an sstable will change that.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr)
compaction.go, line 101 at r1 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
I checked and
compactionIter
was not returning sentinel keys for the range tombstones output by the compaction -- should it have been?
Not sure I understand the question. compactionIter
doesn't return range tombstone sentinel keys. Those are generated by the finishOutput
closure when constructing the sstable boundary keys. I'm pretty sure RocksDB compaction mechanics are similar.
It is true that compactionIter.Next
doesn't return range tombstones. Those are buffered in compactionIter.tombstones
. Would it be sufficient to provide another accessor on compactionIter
that returned the first buffered tombstone (if it exists)?
compaction.go, line 1237 at r1 (raw file):
limit := c.nextLimitBefore(key)
for limit != nil {
Nit: I'd structure this as
for limit := c.nextLimitBefore(nil); limit != nil; limit = c.nextLimitBefore(nil) {
Or perhaps:
for {
limit := c.nextLimitBefore(key)
if limit == nil {
break
}
if err := finishOutput(*limit); err != nil {
return nil, pendingOutputs, err
}
}
Then you have a single call to c.nextLimitBefore
for the loop. Ditto for the loop below. Could also pull this loop out into a closure so it can be shared, but given it is only a few lines of code I don't feel strongly about that.
testdata/manual_compaction, line 85 at r1 (raw file):
compact a-e L1 ----
I think you should change the test to use Version.DebugString()
to output the LSM structure which will display internal keys for the boundaries rather than just the user keys. This will make the presence of range tombstone sentinel keys clearer. In case you're not aware, the --rewrite
flag will make this change easy.
testdata/manual_compaction, line 129 at r1 (raw file):
11:[b-d] 12:[d-f] 13:[f-g]
Neat!
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 101 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Not sure I understand the question.
compactionIter
doesn't return range tombstone sentinel keys. Those are generated by the finishOutput
closure when constructing the sstable boundary keys. I'm pretty sure RocksDB compaction mechanics are similar. It is true that
compactionIter.Next
doesn't return range tombstones. Those are buffered in compactionIter.tombstones
. Would it be sufficient to provide another accessor on compactionIter
that returned the first buffered tombstone (if it exists)?
It is hard because Tombstones()
isn't called and compactionIter.tombstones
isn't populated until finishOutput()
, which is too late for us to use its result to truncate the first file. We could use the first range tombstone returned by compactionIter.iter
, but that might end up being elided from the output, and we can't really tell ahead of time since elision depends on how it's fragmented, which we again don't know until Tombstones()
is called.
Going back a step, seenKey
feels inconsistent already. It lets a prefix of grandparent files be omitted from the overlappedBytes
calculation if the first output file doesn't overlap with them. But what if the second output file doesn't overlap with grandparent files between the first output file and the second output file? Currently those intermediate files are counted towards the second output file's overlappedBytes
regardless.
We could just drop seenKey
for simplicity and consistency. Actually I tried to do that at first but got scared by all the test failures. May be time to try it again.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 101 at r1 (raw file):
We could just drop seenKey for simplicity and consistency. Actually I tried to do that at first but got scared by all the test failures. May be time to try it again.
Didn't work out as another problem surfaced. compactionIter
adds range tombstones to a Fragmenter
and then skips over their sentinel keys. By the time it reaches the next point key the Fragmenter
may have already flushed some range tombstones due to non-overlap. Then, nextLimitBefore()
may tell us to truncate to a point before the point to which the Fragmenter
already flushed, which leads to a panic.
Back to plan A, I think this problem could be avoided if compactionIter
emits sentinel range tombstone start keys.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 101 at r1 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
We could just drop seenKey for simplicity and consistency. Actually I tried to do that at first but got scared by all the test failures. May be time to try it again.
Didn't work out as another problem surfaced.
compactionIter
adds range tombstones to a Fragmenter
and then skips over their sentinel keys. By the time it reaches the next point key the Fragmenter
may have already flushed some range tombstones due to non-overlap. Then, nextLimitBefore()
may tell us to truncate to a point before the point to which the Fragmenter
already flushed, which leads to a panic. Back to plan A, I think this problem could be avoided if
compactionIter
emits sentinel range tombstone start keys.
Plan A doesn't work either since it's still possible the Fragmenter
is flushed beyond the file's truncation point. In particular since we add a range tombstone to the Fragmenter
before compactionIter
returns its start key as a sentinel key, the Fragmenter
may flush range tombstones we want to truncate before we call nextLimitBefore
/ FlushTo
.
Another thing to try is changing the Fragmenter
to only flush fragments when FlushTo
is called rather than when a range tombstone is added that comes after all pending tombstones (truncateAndFlush
). Though that feels complicated. I wonder if there's an easier way.
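A sketch of that alternative: a fragmenter that only buffers on Add and emits on FlushTo, rather than eagerly flushing when a later non-overlapping span arrives. This is a simplified illustration over string keys that ignores fragmentation of overlapping spans; it is not the real rangedel.Fragmenter:

```go
package main

import (
	"fmt"
	"sort"
)

type span struct{ start, end string }

// bufFragmenter buffers every added span and only emits on FlushTo,
// instead of flushing eagerly when a later, non-overlapping span is
// added (the truncateAndFlush behavior discussed above).
type bufFragmenter struct {
	pending []span
}

func (f *bufFragmenter) Add(start, end string) {
	f.pending = append(f.pending, span{start, end})
}

// FlushTo emits the parts of the pending spans that lie before limit,
// truncating any span that straddles it and keeping the remainder.
func (f *bufFragmenter) FlushTo(limit string) []span {
	var out, keep []span
	for _, s := range f.pending {
		switch {
		case s.end <= limit:
			out = append(out, s)
		case s.start < limit:
			out = append(out, span{s.start, limit})
			keep = append(keep, span{limit, s.end})
		default:
			keep = append(keep, s)
		}
	}
	f.pending = keep
	sort.Slice(out, func(i, j int) bool { return out[i].start < out[j].start })
	return out
}

func main() {
	var f bufFragmenter
	f.Add("a", "c")
	f.Add("e", "g") // eager flushing would have emitted [a,c) here
	for _, s := range f.FlushTo("f") {
		fmt.Printf("[%s,%s) ", s.start, s.end)
	}
	fmt.Println()
}
```

Because nothing is emitted before FlushTo, a truncation point chosen later can still split any pending span.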
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr)
compaction.go, line 101 at r1 (raw file):
Another thing to try is changing the Fragmenter to only flush fragments when FlushTo is called rather than when a range tombstone is added that comes after all pending tombstones (truncateAndFlush). Though that feels complicated. I wonder if there's an easier way.
This sounds desirable. The behavior of compactionIter.Tombstones()
differs from the handling of point operations. With point operations, we get to decide which sstable the operation goes into. But a range tombstone gets added eagerly to compactionIter.Tombstones()
which feels wrong. Perhaps the general structure of compactionIter
should be to return all of the records (both point and range) that will be added to an sstable, and the caller of compactionIter
can use the range tombstones for sstable boundary decisions, but would otherwise ignore them. Or perhaps the rangedel.Fragmenter
should be pulled out of compactionIter
and passed in by reference. This would be quite a bit of refactoring, but it is the hiding of range tombstones from the compactionIter
processing that is causing problems.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 101 at r1 (raw file):
Perhaps the general structure of compactionIter should be to return all of the records (both point and range) that will be added to an sstable, and the caller of compactionIter can use the range tombstones for sstable boundary decisions, but would otherwise ignore them.
Agreed with this, but additionally we will need to change Fragmenter
to buffer flushed fragments all the way until FlushTo()
is called, right? Doing this but leaving the Fragmenter
the same seems to not work as Fragmenter
might have already flushed fragments beyond the FlushTo()
truncation point before we call it. That happens just because compactionIter
calls Fragmenter.Add()
before returning a range tombstone sentinel key.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 101 at r1 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Perhaps the general structure of compactionIter should be to return all of the records (both point and range) that will be added to an sstable, and the caller of compactionIter can use the range tombstones for sstable boundary decisions, but would otherwise ignore them.
Agreed with this, but additionally we will need to change
Fragmenter
to buffer flushed fragments all the way until FlushTo()
is called, right? Doing this but leaving the Fragmenter
the same seems to not work as Fragmenter
might have already flushed fragments beyond the FlushTo()
truncation point before we call it. That happens just because compactionIter
calls Fragmenter.Add()
before returning a range tombstone sentinel key.
There is also the option of waiting to call Fragmenter.Add()
until the Next()
following a range tombstone sentinel key. It feels a bit confusing.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr)
compaction.go, line 101 at r1 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
There is also the option of waiting to call
Fragmenter.Add()
until the Next()
following a range tombstone sentinel key. It feels a bit confusing.
I was imagining the compaction code would call Fragmenter.Add()
, not compactionIter
. compactionIter
would only call Fragmenter.Deleted
. There might be challenges in actually doing this, though.
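A rough sketch of that division of responsibility, with heavily simplified stand-ins (it ignores sequence numbers and snapshot stripes; the real logic lives in compaction.go and compaction_iter.go). Note that for a range tombstone record the value holds the end key, which the stand-in mirrors:

```go
package main

import "fmt"

type kind int

const (
	kindSet kind = iota
	kindRangeDel
)

type record struct {
	key, value string
	kind       kind
}

// frag is a stand-in for rangedel.Fragmenter: the compaction loop
// calls Add, while the iterator side only queries Deleted.
type frag struct {
	spans []record
}

func (f *frag) Add(r record) { f.spans = append(f.spans, r) }

func (f *frag) Deleted(key string) bool {
	for _, s := range f.spans {
		if s.key <= key && key < s.value { // value is the end key here
			return true
		}
	}
	return false
}

// compact sketches the proposed loop: range tombstones are surfaced
// to the caller, which feeds them to the fragmenter; point keys
// covered by a buffered tombstone are elided.
func compact(in []record, f *frag) []record {
	var out []record
	for _, r := range in {
		if r.kind == kindRangeDel {
			f.Add(r) // the caller, not the iterator, populates the fragmenter
			continue
		}
		if f.Deleted(r.key) {
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	in := []record{
		{key: "a", value: "c", kind: kindRangeDel},
		{key: "b", value: "v1", kind: kindSet}, // covered by [a,c), elided
		{key: "d", value: "v2", kind: kindSet}, // survives
	}
	var f frag
	for _, r := range compact(in, &f) {
		fmt.Println(r.key, r.value)
	}
}
```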
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 101 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I was imagining the compaction code would call
Fragmenter.Add()
, not compactionIter
. compactionIter
would only call Fragmenter.Deleted
. There might be challenges in actually doing this, though.
Right, I wrote that comment before I understood your point. Makes sense.
Force-pushed from 662b236 to b7afd57.
TFTR. RFAL.
Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @petermattis)
compaction.go, line 101 at r1 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Right, I wrote that comment before I understood your point. Makes sense.
Done.
compaction.go, line 1237 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: I'd structure this as
for limit := c.nextLimitBefore(nil); limit != nil; limit = c.nextLimitBefore(nil) {
. Or perhaps:
for {
	limit := c.nextLimitBefore(key)
	if limit == nil {
		break
	}
	if err := finishOutput(*limit); err != nil {
		return nil, pendingOutputs, err
	}
}
Then you have a single call to
c.nextLimitBefore
for the loop. Ditto for the loop below. Could also pull this loop out into closure so it can be shared, but given it is only a few lines of code I don't feel strongly about that.
Done.
compaction.go, line 1263 at r1 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
I should also add a test for an empty region (i.e., devoid of both point keys and range tombstones) in the middle of a compaction that spans multiple grandparent limits. Need to make sure
finishOutput()
works fine when called on empty files.
Not important, the function simply returns nil when nothing's been added. A more important case that the metamorphic tests found is when the grandparent level causes multiple files to be cut at the same user key and a non-first file among them has only range tombstones.
testdata/manual_compaction, line 85 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think you should change the test to use
Version.DebugString()
to output the LSM structure which will display internal keys for the boundaries rather than just the user keys. This will make the presence of range tombstone sentinel keys clearer. In case you're not aware, the --rewrite
flag will make this change easy.
Done.
Force-pushed from b7afd57 to b315597.
Generally looks good. I need to do another pass on the test cases, but I had a few questions first.
Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)
compaction.go, line 1179 at r2 (raw file):
// to this function's truncation logic.
n := len(ve.NewFiles)
if n > 0 && key.UserKey != nil && d.cmp(key.UserKey, ve.NewFiles[n-1].Meta.Largest.UserKey) <= 0 {
I'm not sure what is going on here. How can the key passed to finishOutput
be less than or equal to the last sstable's largest key? I see mention of range tombstones in the comment above, yet the keys involved here are not necessarily range tombstones. Extending the comment with an example might help.
compaction.go, line 1312 at r2 (raw file):
// But do add them to the `Fragmenter` to make them visible to
// `compactionIter` so covered keys in the same snapshot stripe
// can be elided.
Nit: The first sentence in the comment reads a little bit strange to me as we have never been adding range tombstones to the output table in this loop. I think that sentence could be removed and then the second one adjusted to state: Range tombstones are handled specially. They are fragmented and written later during finishOutput() ...
.
pacer.go, line 118 at r2 (raw file):
func (p *compactionPacer) maybeThrottle(bytesIterated uint64) error {
	if bytesIterated == 0 {
		return nil
What is this change about? Did something about your change cause us to call maybeThrottle(0)
?
internal/rangedel/fragmenter.go, line 254 at r2 (raw file):
}
if flush {
Can you explain this part of the change? It isn't obvious to me why it is needed.
still reading compactionIter
-- some initial comments
Reviewable status: 0 of 10 files reviewed, 8 unresolved discussions (waiting on @ajkr and @petermattis)
compaction_iter.go, line 212 at r2 (raw file):
i.rangeDelFrag.Add(i.cloneKey(i.key), i.iterValue)
}
if i.rangeDelFrag.Deleted(i.key, i.curSnapshotSeqNum) {
I am glad that the new code does not fall through here to call Deleted()
with the range tombstone itself. It is harmless but made my head hurt.
compaction_iter.go, line 287 at r2 (raw file):
i.value = make([]byte, len(i.iterValue))
copy(i.value, i.iterValue)
i.iterNext()
this call to iterNext()
is possibly a bug.
My understanding is that the consistency of iterKey, iterValue, curSnapshotIdx, curSnapshotSeqNum are established by First() and maintained by nextInStripe(). This call uses iterNext() directly that does not maintain that consistency. Consider a sequence like [a, d)#7, a#6 KeyKindInvalid, a#4 where there was a snapshot at 5. The a#4 will get deleted I think because we step to a#4 but leave the curSnapshotSeqNum equal to 6. This should be easy to test.
compaction_iter.go, line 337 at r2 (raw file):
// as i.key. Adding tombstones earlier can violate a rangedel.Fragmenter
// invariant and lead to too many tombstones being added to an sstable.
i.rangeDelFrag.Add(i.cloneKey(*key), i.iterValue)
this code was dubious in that it moved to the next key without setting curSnapshotIdx
and curSnapshotSeqNum
. Glad that we are doing the cleaner thing now.
Reviewable status: 0 of 10 files reviewed, 10 unresolved discussions (waiting on @ajkr and @petermattis)
compaction_iter.go, line 128 at r2 (raw file):
iter internalIterator
err  error
key  InternalKey
I had a hard time reasoning about key
. Maybe a comment like the following would help?
// key.UserKey is in one of two states:
// - Set to i.iterKey.UserKey: this happens within Next() and is a transient state before the code
// decides to save the key.
// - Set to keyBuf: caused by saving i.iterKey.UserKey. This is the case on return from all public methods -- these methods return key.
// Additionally, it is the internal state when the code is moving to the next key so it can determine
// whether the user key has changed from the previous key.
compaction_iter.go, line 201 at r2 (raw file):
}
if i.skip {
I found the iterator positioning hard to understand. How about a comment like the following
// Prior to this call to Next() we are in one of two situations wrt iterKey and related state:
// - skip is false, and iterKey is already at the next key.
// - skip is true and we are at the key that has been returned. Doing the skip will move us forward.
Reviewable status: 0 of 10 files reviewed, 12 unresolved discussions (waiting on @ajkr and @petermattis)
compaction_iter.go, line 379 at r2 (raw file):
case InternalKeyKindSet:
	if i.rangeDelFrag.Deleted(*key, i.curSnapshotSeqNum) {
This seems incorrect in the way it is using i.curSnapshotSeqNum
. There are two questions here (a) is this point deleted, (b) should the point not be emitted despite the snapshot constraints. (b)=>(a) but (a) does not imply (b). This code is looking at !(b) and using it to merge. But (b) could be false and (a) true. For example, if the original merge was at #25, a tombstone at #21, a snapshot at #19 and now a set at #17. The curSnapshotSeqNum
is 17 so it needs to be emitted, but it should not be merged either. So I think the code should be something like
// Current code
if (b) {
...
}
if (a) {
i.skip = false
return
}
// merge -- current code
Did I miss something?
compaction_iter.go, line 444 at r2 (raw file):
default:
	i.err = fmt.Errorf("invalid internal key kind: %d", i.iterKey.Kind())
Our handling of InternalKeyKindInvalid
seems inconsistent. In this file, Next()
will just return it without setting an error while singleDeleteNext()
and mergeNext()
set the error. Why is that? Also, it seems inconsistent how these keys come to be: Iterator.decodeKey
sets the UserKey
to nil when this happens, while DecodeInternalKey()
in internal.go sets the UserKey
to the full encoded key.
Reviewable status: 0 of 10 files reviewed, 12 unresolved discussions (waiting on @ajkr and @petermattis)
compaction_iter.go, line 379 at r2 (raw file):
Previously, sumeerbhola wrote…
This seems incorrect in the way it is using
i.curSnapshotSeqNum
. There are two questions here (a) is this point deleted, (b) should the point not be emitted despite the snapshot constraints. (b)=>(a) but (a) does not imply (b). This code is looking at !(b) and using it to merge. But (b) could be false and (a) true. For example, if the original merge was at #25, a tombstone at #21, a snapshot at #19 and now a set at #17. The curSnapshotSeqNum
is 17 so it needs to be emitted, but it should not be merged either. So I think the code should be something like
// Current code
if (b) {
	...
}
if (a) {
	i.skip = false
	return
}
// merge -- current code
Did I miss something?
Never mind. The !i.nextInStripe()
earlier in the loop would prevent this. We already know this set is in the same stripe.
Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @ajkr and @petermattis)
compaction_iter.go, line 202 at r2 (raw file):
if i.skip {
	i.skip = false
skip
is somewhat subtle. As far as I can tell, it is used because a point being output is going to mask the remaining points (Set, Merge, Delete, SingleDelete that turns into a Delete) with the same UserKey
. The old code used to process range tombstones internally, so it would have the desired behavior in that a preceding skip=true would go through all the remaining keys with the same UserKey
(in the stripe) in this call to skipStripe()
and set skip=false
. The new code will interrupt the skipStripe()
due to a range tombstone and set skip=false
. The next point key could be at the same UserKey
but skip is false. But it may also be deleted by the range tombstone, so we won't unnecessarily emit it -- except if it has the same sequence number.
Example: a#10, [a, c)#7, a#7, a#6, [a, d)#5, a#4
Old code (bullets are calls to Next())
- return a#10, skip is true
- iterate over all the a's and feed the tombstone to the fragmenter.
New code:
- return a#10, skip is true
- return [a, c)#7, skip is false
- a#7 is not covered by tombstone, so return a#7, set skip to true
- skip over a#6, return [a, d)#5, skip is false
- a#4 is covered by tombstone, so don't return it.
Is my understanding correct?
In general I think we need more code comments in this file that explain the purpose of each construct and some invariants.
Thanks a lot for all the detailed reviews. Still working on replying to everything.
Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)
compaction.go, line 1179 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm not sure what is going on here. How can the key passed to
finishOutput
be less than or equal to the last sstable's largest key? I see mention of range tombstones in the comment above, yet the keys involved here are not necessarily range tombstones. Extending the comment with an example might help.
The key passed to finishOutput()
can have user key equal to the last file's largest. It should never be smaller. I can change the <=
to an ==
for clarity.
Here is the simplest test case I can come up with:
define target-file-sizes=(1, 1, 1, 1)
L1
a.SET.2:v
c.SET.2:v
L2
a.RANGEDEL.3:c
L3
b.SET.2:v
L3
b.SET.1:v
L3
b.SET.0:v
----
1:
4:[a-c]
2:
5:[a-c]
3:
6:[b-b]
7:[b-b]
8:[b-b]
compact a-c L1
----
2:
9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
10:[b#0,RANGEDEL-c#2,SET]
3:
6:[b#2,SET-b#2,SET]
7:[b#1,SET-b#1,SET]
8:[b#0,SET-b#0,SET]
Without this change the failure is:
--- FAIL: TestManualCompaction (0.01s)
compaction_test.go:728:
testdata/manual_compaction:230:
expected:
2:
9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
10:[b#0,RANGEDEL-c#2,SET]
3:
6:[b#2,SET-b#2,SET]
7:[b#1,SET-b#1,SET]
8:[b#0,SET-b#0,SET]
found:
2:
9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
10:[b#0,RANGEDEL-b#72057594037927935,RANGEDEL]
11:[b#0,RANGEDEL-c#2,SET]
3:
6:[b#2,SET-b#2,SET]
7:[b#1,SET-b#1,SET]
8:[b#0,SET-b#0,SET]
FAIL
FAIL github.com/cockroachdb/pebble 0.015s
In particular note the unexpected presence of 10:[b#0,RANGEDEL-b#72057594037927935,RANGEDEL]
. That showed up because we cut two files at the same key.UserKey
(b
), and the second one had no point keys in it. Then the range tombstone a.RANGEDEL.3:c
got added to 10.sst, and since it had no point keys we used the range tombstone to determine both the file's smallest and largest boundary.
The fix here is to skip writing a file when we know its smallest/largest user key will be equal and it does not contain any point keys. We know here the file doesn't contain point keys since tw
is still nil when finishOutput()
is called.
BTW, this case was found by the metamorphic test.
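The guard described above can be sketched roughly like this. The function and parameter names are illustrative, not the real pebble code; in compaction.go the "no point keys" condition corresponds to the sstable writer tw still being nil inside the finishOutput closure:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSkipOutput reports whether finishOutput should skip writing a
// file: there are no point keys (writerNil), a previous output file
// exists, and the new file's boundary would collapse onto the previous
// file's largest user key, producing a file like
// [b#0,RANGEDEL-b#MaxSeqNum,RANGEDEL].
func shouldSkipOutput(writerNil bool, prevLargest, limit string) bool {
	return writerNil && prevLargest != "" &&
		strings.Compare(limit, prevLargest) == 0
}

func main() {
	// Two files cut at the same user key "b"; the second holds only a
	// range tombstone fragment, so it should be skipped.
	fmt.Println(shouldSkipOutput(true, "b", "b"))
	// With point keys present, the file must still be written.
	fmt.Println(shouldSkipOutput(false, "b", "b"))
}
```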
compaction.go, line 1312 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: The first sentence in the comment reads a little bit strange to me as we have never been adding range tombstones to the output table in this loop. I think that sentence could be removed and then the second one adjusted to state:
Range tombstones are handled specially. They are fragmented and written later during finishOutput() ...
.
We have never been, but the assumption is the reader cares about consistency of the current state of the code, not the history. The inconsistency that would be apparent to me as a reader is point keys are added to tw
immediately in this loop while range tombstones are not. The phrasing you're suggesting works for me too, so I'll change it.
compaction_iter.go, line 128 at r2 (raw file):
Previously, sumeerbhola wrote…
I had a hard time reasoning about
key
. Maybe a comment like the following would help?
// key.UserKey is in one of two states:
// - Set to i.iterKey.UserKey: this happens within Next() and is a transient state before the code
//   decides to save the key.
// - Set to keyBuf: caused by saving i.iterKey.UserKey. This is the case on return from all public methods -- these methods return key.
//   Additionally, it is the internal state when the code is moving to the next key so it can determine
//   whether the user key has changed from the previous key.
Is the transient state of having key.UserKey
assigned to i.iterKey.UserKey
necessary? I am wondering if we should just get rid of it and use i.iterKey
directly until we decide to save the key. It seems error-prone right now.
compaction_iter.go, line 287 at r2 (raw file):
Previously, sumeerbhola wrote…
this call to
iterNext()
is possibly a bug.
My understanding is that the consistency of iterKey, iterValue, curSnapshotIdx, curSnapshotSeqNum are established by First() and maintained by nextInStripe(). This call uses iterNext() directly that does not maintain that consistency. Consider a sequence like [a, d)#7, a#6 KeyKindInvalid, a#4 where there was a snapshot at 5. The a#4 will get deleted I think because we step to a#4 but leave the curSnapshotSeqNum equal to 6. This should be easy to test.
Yes, good catch. In fact I had to debug this exact problem already since my code for returning range tombstones from compactionIter
was copied from here. We should always use nextInStripe()
for advancing the iterator.
Taking a step back, though, should compaction simply fail if there's a key parsing error?
compaction_iter.go, line 444 at r2 (raw file):
Previously, sumeerbhola wrote…
Our handling of
InternalKeyKindInvalid
seems inconsistent. In this file, Next()
will just return it without setting an error while singleDeleteNext()
and mergeNext()
set the error. Why is that? Also, it seems inconsistent how these keys come to be: Iterator.decodeKey
sets the UserKey
to nil when this happens, while DecodeInternalKey()
in internal.go sets the UserKey
to the full encoded key.
My opinion:
1. compactionIter::Next() should set i.err in the same way as {singleDelete,merge}Next(). I think we shouldn't allow a compaction that writes out a corrupt key to succeed.
2. DecodeInternalKey() should set a nil user key when the trailer cannot be parsed. We should avoid exposing the user key of invalid keys in the first place, either to user reads or compactions. We might as well prevent this possibility from the lowest level by making the user key nil in this case.
pacer.go, line 118 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What is this change about? Did something about your change cause us to call
maybeThrottle(0)
?
Yes, it is called in the main compaction loop, which now sees keys for range tombstones. When we have only seen range tombstones, c.bytesIterated
(the argument passed to maybeThrottle()
) is zero.
internal/rangedel/fragmenter.go, line 254 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Can you explain this part of the change? It isn't obvious to me why it is needed.
It was flushing pending range tombstones that were fully before the point key being checked for deletion. But that point key could turn out to be (a) not deleted, and (b) beyond the truncation point according to nextLimitBefore()
. If nextLimitBefore()
returned a truncation point that was not just before this point key but also before the end of earlier range tombstones, we need to be able to truncate them to that point. We can't do that if they've already been flushed.
Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction_iter.go, line 201 at r2 (raw file):
Previously, sumeerbhola wrote…
I found the iterator positioning hard to understand. How about a comment like the following:
// Prior to this call to Next() we are in one of two situations wrt iterKey and related state:
// - skip is false, and iterKey is already at the next key.
// - skip is true and we are at the key that has been returned. Doing the skip will move us forward.
Yes, it is very confusing that skip=false implies we're already at the next key. Would adding a variable atNext be helpful? We could have the following possibilities.

- !skip && atNext: this means we already advanced to the next key after the one we'll return, like in mergeNext().
- !skip && !atNext: this means we need to call nextInStripe() at the top of the following Next(). It'd be intuitive (for me) for this to be the setting when returning a range tombstone, rather than calling nextInStripe() before returning. It also allows us to not worry about whether we need to copy the value before returning, like we're doing right now for invalid keys.
- skip && !atNext: this means we need to call skipStripe() at the top of the following Next(). It'd be the case when we see a SET, DELETE, or other key kind that covers the remainder of the stripe.
- skip && atNext: this is invalid.
compaction_iter.go, line 202 at r2 (raw file):
Previously, sumeerbhola wrote…
skip is somewhat subtle. As far as I can tell, it is used because a point being output is going to mask the remaining points (Set, Merge, Delete, SingleDelete that turns into a Delete) with the same UserKey. The old code used to process range tombstones internally, so it would have the desired behavior in that a preceding skip=true would go through all the remaining keys with the same UserKey (in the stripe) in this call to skipStripe() and set skip=false. The new code will interrupt the skipStripe() due to a range tombstone and set skip=false. The next point key could be at the same UserKey but skip is false. But it may also be deleted by the range tombstone, so we won't unnecessarily emit it -- except if it has the same sequence number.
Example: a#10, [a, c)#7, a#7, a#6, [a, d)#5, a#4
Old code (bullets are calls to Next()):
- return a#10, skip is true
- iterate over all the a's and feed the tombstone to the fragmenter.
New code:
- return a#10, skip is true
- return [a, c)#7, skip is false
- a#7 is not covered by tombstone, so return a#7, set skip to true
- skip over a#6, return [a, d)#5, skip is false
- a#4 is covered by tombstone, so don't return it.
Is my understanding correct?
In general I think we need more code comments in this file that explain the purpose of each construct and some invariants.
Yes, your understanding is correct.
"a#7 is not covered by tombstone, so return a#7, set skip to true" - this is a bug, as we don't need to output a#7 since a#10 already covers it. Well, it's a bug in the sense of outputting one extra key in an unlikely scenario, not a correctness bug.
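The coverage rule this trace relies on can be sketched as a tiny helper (names are illustrative, not pebble's actual API): a range tombstone deletes a point key only when the key falls in [Start, End) and the tombstone's seqnum is strictly greater, which is why a#7 survives [a, c)#7 but a#4 does not survive [a, d)#5.

```go
package main

import "fmt"

// tombstone is a simplified range deletion over [Start, End) written at
// sequence number Seq.
type tombstone struct {
	Start, End string
	Seq        uint64
}

// covers reports whether t deletes a point key at userKey with sequence
// number seq. The end key is exclusive, and an equal seqnum does NOT
// cover: a#7 is not deleted by [a, c)#7.
func covers(t tombstone, userKey string, seq uint64) bool {
	return t.Start <= userKey && userKey < t.End && t.Seq > seq
}

func main() {
	ac7 := tombstone{Start: "a", End: "c", Seq: 7}
	ad5 := tombstone{Start: "a", End: "d", Seq: 5}
	fmt.Println(covers(ac7, "a", 7)) // same seqnum: not covered
	fmt.Println(covers(ad5, "a", 4)) // a#4 is covered by [a, d)#5
}
```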
Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction_iter.go, line 202 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Yes, your understanding is correct.
"a#7 is not covered by tombstone, so return a#7, set skip to true" - this is a bug, as we don't need to output a#7 since a#10 already covers it. Well, it's a bug in the sense of outputting one extra key in an unlikely scenario, not a correctness bug.
I think we can solve this case and some of the confusion about skipStripe() as follows.

- rename skipStripe() to skipCoveredKeysInStripe()
- keep track of what snapshot we were at before calling skipCoveredKeysInStripe()
- if it returns and we are still in the same stripe (e.g., because it found a range tombstone), we will leave skip set to true when we return
- if it returns and we are in a different stripe (e.g., because it hit a new user key or a different stripe), we reset skip to false before returning

This way, when there's something like a range tombstone that interrupts us while skipping to the next stripe, we can preserve skip=true so we don't accidentally return extra keys.
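For reference, the "stripes" in this discussion are the bands between snapshot seqnums: keys in the same stripe are visible to the same set of snapshots, so a newer key can shadow older ones within a stripe, while keys in different stripes must all be preserved. A minimal sketch of how a key's stripe index might be computed (the real code tracks curSnapshotIdx incrementally rather than recomputing it per key):

```go
package main

import (
	"fmt"
	"sort"
)

// snapshotIndex returns the stripe a key with sequence number seq belongs
// to, given ascending snapshot seqnums. Two keys with the same user key
// can only shadow one another when this index is equal for both.
func snapshotIndex(seq uint64, snapshots []uint64) int {
	// The stripe is the count of snapshots at or below seq.
	return sort.Search(len(snapshots), func(i int) bool {
		return snapshots[i] > seq
	})
}

func main() {
	snapshots := []uint64{5, 8} // snapshots taken at seqnums 5 and 8
	fmt.Println(snapshotIndex(4, snapshots))  // 0: below both snapshots
	fmt.Println(snapshotIndex(7, snapshots))  // 1: between the snapshots
	fmt.Println(snapshotIndex(10, snapshots)) // 2: above both snapshots
}
```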
Force-pushed from b315597 to 13c282a
Reviewable status: 0 of 10 files reviewed, 14 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction.go, line 1179 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
The key passed to finishOutput() can have a user key equal to the last file's largest. It should never be smaller. I can change the <= to an == for clarity.
Here is the simplest test case I can come up with:

```
define target-file-sizes=(1, 1, 1, 1)
L1
a.SET.2:v
c.SET.2:v
L2
a.RANGEDEL.3:c
L3
b.SET.2:v
L3
b.SET.1:v
L3
b.SET.0:v
----
1:
  4:[a-c]
2:
  5:[a-c]
3:
  6:[b-b]
  7:[b-b]
  8:[b-b]

compact a-c L1
----
2:
  9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
  10:[b#0,RANGEDEL-c#2,SET]
3:
  6:[b#2,SET-b#2,SET]
  7:[b#1,SET-b#1,SET]
  8:[b#0,SET-b#0,SET]
```

Without this change the failure is:

```
--- FAIL: TestManualCompaction (0.01s)
    compaction_test.go:728:
        testdata/manual_compaction:230:
        expected:
          2:
            9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
            10:[b#0,RANGEDEL-c#2,SET]
          3:
            6:[b#2,SET-b#2,SET]
            7:[b#1,SET-b#1,SET]
            8:[b#0,SET-b#0,SET]
        found:
          2:
            9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
            10:[b#0,RANGEDEL-b#72057594037927935,RANGEDEL]
            11:[b#0,RANGEDEL-c#2,SET]
          3:
            6:[b#2,SET-b#2,SET]
            7:[b#1,SET-b#1,SET]
            8:[b#0,SET-b#0,SET]
FAIL
FAIL    github.com/cockroachdb/pebble   0.015s
```

In particular, note the unexpected presence of 10:[b#0,RANGEDEL-b#72057594037927935,RANGEDEL]. That showed up because we cut two files at the same key.UserKey (b), and the second one had no point keys in it. Then the range tombstone a.RANGEDEL.3:c got added to 10.sst, and since it had no point keys we used the range tombstone to determine both the file's smallest and largest boundary.
The fix here is to skip writing a file when we know its smallest/largest user key will be equal and it does not contain any point keys. We know here the file doesn't contain point keys since tw is still nil when finishOutput() is called.
BTW, this case was found by the metamorphic test.
Done (changed <= to ==, updated comment, added test case).
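The guard described above can be sketched as a standalone predicate (a hypothetical helper; the real check is inlined in finishOutput, using tw == nil to detect the absence of point keys): an output with no point keys whose bounds would collapse to a single user key adds nothing and is skipped.

```go
package main

import (
	"bytes"
	"fmt"
)

// shouldEmitOutput sketches the finishOutput guard: a file containing
// only range tombstone fragments whose smallest and largest user keys
// would be equal is degenerate (like the spurious
// 10:[b#0,RANGEDEL-b#...,RANGEDEL] above), so it is not written.
func shouldEmitOutput(hasPointKeys bool, smallest, largest []byte) bool {
	if hasPointKeys {
		return true // point keys always justify the file
	}
	return !bytes.Equal(smallest, largest)
}

func main() {
	fmt.Println(shouldEmitOutput(false, []byte("b"), []byte("b"))) // skip degenerate tombstone-only file
	fmt.Println(shouldEmitOutput(false, []byte("b"), []byte("c"))) // keep: tombstone spans a real range
}
```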
compaction.go, line 1312 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
We have never been, but the assumption is the reader cares about consistency of the current state of the code, not the history. The inconsistency that would be apparent to me as a reader is that point keys are added to tw immediately in this loop while range tombstones are not. The phrasing you're suggesting works for me too, so I'll change it.
Done.
compaction_iter.go, line 128 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Is the transient state of having key.UserKey assigned to i.iterKey.UserKey necessary? I am wondering if we should just get rid of it and use i.iterKey directly until we decide to save the key. It seems error-prone right now.
Done (added the comment). Opened #402 for future refactoring.
compaction_iter.go, line 201 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Yes, it is very confusing that skip=false implies we're already at the next key. Would adding a variable atNext be helpful? We could have the following possibilities.

- !skip && atNext: this means we already advanced to the next key after the one we'll return, like in mergeNext().
- !skip && !atNext: this means we need to call nextInStripe() at the top of the following Next(). It'd be intuitive (for me) for this to be the setting when returning a range tombstone, rather than calling nextInStripe() before returning. It also allows us to not worry about whether we need to copy the value before returning, like we're doing right now for invalid keys.
- skip && !atNext: this means we need to call skipStripe() at the top of the following Next(). It'd be the case when we see a SET, DELETE, or other key kind that covers the remainder of the stripe.
- skip && atNext: this is invalid.

Done (named it advanced instead of atNext, as the advancing that happens at the top of Next() does not necessarily land on the key that Next() will return).
compaction_iter.go, line 202 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
I think we can solve this case and some of the confusion about skipStripe() as follows.

- rename skipStripe() to skipCoveredKeysInStripe()
- keep track of what snapshot we were at before calling skipCoveredKeysInStripe()
- if it returns and we are still in the same stripe (e.g., because it found a range tombstone), we will leave skip set to true when we return
- if it returns and we are in a different stripe (e.g., because it hit a new user key or a different stripe), we reset skip to false before returning

This way, when there's something like a range tombstone that interrupts us while skipping to the next stripe, we can preserve skip=true so we don't accidentally return extra keys.
Done (called it skipInStripe() rather than skipCoveredKeysInStripe() for brevity).
compaction_iter.go, line 287 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Yes, good catch. In fact I had to debug this exact problem already, since my code for returning range tombstones from compactionIter was copied from here. We should always use nextInStripe() for advancing the iterator.
Taking a step back, though, should compaction simply fail if there's a key parsing error?
Done (removed this special case for InternalKeyKindInvalid so it'll fall through to the error case).
compaction_iter.go, line 444 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
My opinion:
- compactionIter::Next() should set i.err in the same way as {singleDelete,merge}Next(). I think we shouldn't allow a compaction that writes out a corrupt key to succeed.
- DecodeInternalKey() should set a nil user key when the trailer cannot be parsed. We should avoid exposing the user key for invalid keys in the first place, either to user reads or compactions. We might as well prevent this possibility at the lowest level by making the user key nil in this case.
Opened #401.
compaction_iter.go, line 348 at r3 (raw file):

```go
	}
	return origSnapshotIdx == i.curSnapshotIdx && i.iterKey != nil &&
		i.cmp(i.key.UserKey, i.iterKey.UserKey) == 0
```

This key comparison is not desirable.
testdata/compaction_iter, line 920 at r3 (raw file):

```
tombstones
----
a#3,7:
```

This change is not desirable.
Reviewed 1 of 4 files at r3.
Reviewable status: 1 of 10 files reviewed, 15 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)
compaction_iter.go, line 249 at r3 (raw file):

```go
// set `skip` to true here. That's because the next key might be
// a point key at the same seqnum, which should be considered newer
// than, and thus not covered by, the current range tombstone.
```

The comment after "Subtly ..." needs to be adjusted.
compaction_iter.go, line 348 at r3 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
This key comparison is not desirable.
nextInStripe() could return one of 3 values: TRUE, TRUE_NOT_SKIPPABLE, FALSE, since it is already doing the key comparison to distinguish between the first 2 cases.
Force-pushed from 13c282a to a5377f0
TFTR.
Reviewable status: 0 of 10 files reviewed, 14 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction_iter.go, line 249 at r3 (raw file):
Previously, sumeerbhola wrote…
The comment after "Subtly ..." needs to be adjusted.
Done.
compaction_iter.go, line 348 at r3 (raw file):
Previously, sumeerbhola wrote…
nextInStripe() could return one of 3 values: TRUE, TRUE_NOT_SKIPPABLE, FALSE, since it is already doing the key comparison to distinguish between the first 2 cases.
Done.
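The three-valued return can be sketched as follows (names and semantics here are illustrative, not the final API): folding the user-key comparison that nextInStripe() already performs into its result means the caller no longer repeats the comparison.

```go
package main

import "fmt"

// stripeResult is a sketch of a three-valued nextInStripe() return,
// distinguishing a stripe change from movement within a stripe, and
// folding in the user-key comparison the function already performs.
type stripeResult int

const (
	newStripe         stripeResult = iota // crossed into a different snapshot stripe
	sameStripeSameKey                     // same stripe, same user key: candidate for skipping
	sameStripeNewKey                      // same stripe, different user key: must not be skipped
)

// classify is a toy stand-in for the comparisons nextInStripe() makes:
// the snapshot stripe index and user-key equality of the previous and
// current keys.
func classify(prevIdx, curIdx int, prevKey, curKey string) stripeResult {
	if prevIdx != curIdx {
		return newStripe
	}
	if prevKey == curKey {
		return sameStripeSameKey
	}
	return sameStripeNewKey
}

func main() {
	fmt.Println(classify(1, 1, "a", "a") == sameStripeSameKey)
	fmt.Println(classify(1, 2, "a", "a") == newStripe)
	fmt.Println(classify(1, 1, "a", "b") == sameStripeNewKey)
}
```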
Force-pushed from c130d3a to 6a18cc6
Reviewed 1 of 9 files at r2, 1 of 1 files at r4.
Reviewable status: 2 of 10 files reviewed, 15 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)
compaction.go, line 394 at r4 (raw file):

```go
if key != nil && base.InternalCompare(c.cmp, *key, g.Largest) <= 0 {
	break
}
```

Has the seenKey optimization, which omitted a prefix of grandparent files that did not overlap at all, been replaced by something that I don't spot here? I also realized now that the original optimization also applied to gaps between files, since when returning true the key would become the starting point of the next file and we have already skipped over the files preceding that key.
As far as I can tell, this new logic is using a new key (point or range tombstone start key) to try to prevent extending previously added range tombstones into the head of grandparents. Is that good enough? Say we had a tombstone [a, z)#10 followed by y#9, and grandparents were [a#8, b#7], [c#8, d#7], [e#8, f#7], ..., [x#8, y#7], [y#5, z#8]; then when we encounter y#9 this code will pick y#7 as the key to return. Is my understanding correct? It prevents the tombstone from extending into [y#5, z#8], but it has picked up a large number of earlier files. And it confuses me to see code that picks a limit before key -- the current logic was simpler to reason about due to the monotonicity. How about the following alternative: if key is a tombstone, use its end key to decide a grandparent key > key where we must stop this sstable. If we have picked such a future key, use that to decide when to stop instead of the key provided in shouldStopBefore. We would actually remember two optional things, futureStopKey and tombstoneEndKey (both can be nil). We would stop <= futureStopKey (since the sst size constraint may make us stop before). And tombstoneEndKey would keep getting extended as we add more tombstones (since they may extend beyond the tombstones we have already added). When we stop this sstable, if we have a tombstoneEndKey that extends into the next sstable, use that to compute a future stop, and so on. This is very rough -- I think one can construct a tight version of this.
Reviewable status: 2 of 10 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction.go, line 394 at r4 (raw file):
Has the seenKey optimization that omitted a prefix of grandparent files that did not overlap at all been replaced by something that I don't spot here? I also realized now that the original optimization also applied to gaps between files, since when returning true the key would become the starting point of the next file and we have already skipped over the files preceding that key.
It is not replaced here. My understanding is its purpose was to not count files at the head of the grandparents towards c.overlappedBytes if they were completely before all keys written by the compaction. The consequence of dropping the seenKey optimization is that the first compaction output file would be cut earlier than it needs to be if it does not overlap with some files at the grandparent head.
I think that is a good point about how the optimization implicitly applied to non-first compaction output files before my change.
Reviewable status: 2 of 10 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction.go, line 394 at r4 (raw file):
Say we had a tombstone [a, z)#10 followed by y#9. And grandparents were [a#8, b#7], [c#8, d#7], [e#8, f#7], ..., [x#8, y#7], [y#5, z#8], then when we encounter y#9 this code will pick y#7 as the key to return. Is my understanding correct?
It depends how many of the grandparent files are needed to reach c.maxOverlapBytes. If [a#8, b#7] alone is larger than c.maxOverlapBytes then it'll cut at b#7. Or it could cut at y#7 if adding [x#8, y#7]'s size to c.overlappedBytes causes it to cross over c.maxOverlapBytes.
How about the following alternative: if key is a tombstone, use its end key to decide a grandparent key > key where we must stop this sstable. If we have picked such a future key, use that to decide when to stop instead of the key provided in shouldStopBefore.
I think it's a good idea. In fact it is a nice hybrid of two of our original plans (#167 (comment) and #244 (review)), neither of which worked out well independently.
@petermattis you mentioned in our 1:1 that some problematic case was discovered in this PR, but I can't seem to think of what it is. Do you mind elaborating? edit: maybe it was one of the much earlier resolved points? Or perhaps there is some misunderstanding about the
Must have been some sort of miscommunication. I don't recall saying that and don't recall there being any problematic case with this PR.
Got it, thanks for the clarification.
Force-pushed from 6a18cc6 to a2f1ba3
RFAL. In case it helps, I split out two of the precursor refactorings into #421 and #422. If those are reviewed first, then I can rebase this on top and the non-test code change becomes very small. Will leave it up to the reviewer whether to take that approach or just review everything at once in this PR.
Reviewable status: 0 of 8 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)
Force-pushed from a2f1ba3 to fc71424
I reviewed the other PRs. Please merge them and rebase this one once those are merged and I'll take another look.
Reviewable status: 0 of 5 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)
Force-pushed from fc71424 to 7736852
Rebased. I see the expected changes in Github but not in Reviewable. Might be user error.
Reviewable status: 0 of 6 files reviewed, 16 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction.go, line 394 at r4 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Say we had a tombstone [a, z)#10 followed by y#9. And grandparents were [a#8, b#7], [c#8, d#7], [e#8, f#7], ..., [x#8, y#7], [y#5, z#8], then when we encounter y#9 this code will pick y#7 as the key to return. Is my understanding correct?
It depends how many of the grandparent files are needed to reach c.maxOverlapBytes. If [a#8, b#7] alone is larger than c.maxOverlapBytes then it'll cut at b#7. Or it could cut at y#7 if adding [x#8, y#7]'s size to c.overlappedBytes causes it to cross over c.maxOverlapBytes.
How about the following alternative: if key is a tombstone, use its end key to decide a grandparent key > key where we must stop this sstable. If we have picked such a future key, use that to decide when to stop instead of the key provided in shouldStopBefore.
I think it's a good idea. In fact it is a nice hybrid of two of our original plans (#167 (comment) and #244 (review)), neither of which worked out well independently.
Done. The optimization is brought back in findGrandparentLimit(). Choosing the limit also now happens before compactionIter moves past it, as we now call findGrandparentLimit() just once before starting a file.
compaction.go, line 398 at r5 (raw file):

```go
for upper := lower; upper < len(c.grandparents); upper++ {
	overlappedBytes += c.grandparents[upper].Size
	if overlappedBytes > c.maxOverlapBytes &&
		c.cmp(start, c.grandparents[upper].Largest.UserKey) < 0 {
```

This is a hack to ensure forward progress is made. Otherwise it gets into an infinite loop when a file contains only range tombstones and the limit is the same as the start key. The drawback is that valid cases of limiting to the start key are affected (e.g., the change made in testdata/compaction_find_grandparent_limit is a casualty of this hack).
pacer.go, line 118 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Yes, it is called in the main compaction loop, which now sees keys for range tombstones. When we have only seen range tombstones, c.bytesIterated (the argument passed to maybeThrottle()) is zero.
I do not know why reviewable is showing this change as part of the full diff even after rebasing. It was landed already in #421.
Reviewed 1 of 8 files at r5.
Reviewable status: 1 of 6 files reviewed, 19 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)
compaction.go, line 1300 at r5 (raw file):

```go
	grandparentLimit = c.findGrandparentLimit(ve.NewFiles[n-1].Meta.Largest.UserKey)
}
```

will this work out cleanly if the grandparent has a sequence of files that start and end with the same user key (say [a#5500, a#5000], [a#4999, a#4500], [a#4500, a#4000], ...) and these files are hefty enough that the grandparent limit keeps returning a (because the files being compacted also have a's)? We don't care about actually chopping at a since these are anyway going to be one compaction group, so just wondering whether the algorithm will not have some odd behavior like an empty output file. I suspect it is fine because of c.cmp(...) <= 0 but maybe a code comment would help.
compaction.go, line 1326 at r5 (raw file):

```go
	return nil, pendingOutputs, err
}
grandparentLimit = c.findGrandparentLimit(key.UserKey)
```

why this call to findGrandparentLimit here given that the limit has already been initialized for this file?
compaction.go, line 1342 at r5 (raw file):

```go
} else {
	limit = grandparentLimit
}
```

```go
if key != nil && (grandparentLimit == nil || c.cmp(key.UserKey, grandparentLimit) <= 0) {
	// Size limit reached.
	limit = key.UserKey
} else {
	// Run out of keys or grandparent limit reached.
	if grandParentLimit != nil {
		// This may possibly flush everything.
		limit = grandParentLimit
	} else {
		// No grandparentLimit. Flush everything.
	}
}
```

Not sure if the above is simpler, but some comments in the code would definitely help.
Reviewable status: 1 of 6 files reviewed, 19 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction.go, line 1300 at r5 (raw file):
Previously, sumeerbhola wrote…
will this work out cleanly if the grandparent has a sequence of files that start and end with the same user key (say [a#5500, a#5000], [a#4999, a#4500], [a#4500, a#4000], ...) and these files are hefty enough that the grandparent limit keeps returning a (because the files being compacted also have a's)? We don't care about actually chopping at a since these are anyway going to be one compaction group, so just wondering whether the algorithm will not have some odd behavior like an empty output file. I suspect it is fine because of c.cmp(...) <= 0 but maybe a code comment would help.
The <= here would be sufficient to prevent the case you're talking about if we were cutting only based on point keys. But imagine findGrandparentLimit("a") == "a" and we have the following data to write: DELRANGE("a", "b") and PUT("b", 1). Then we would execute the outer loop forever, generating files containing only the DELRANGE, as the inner loop would never execute after advancing to "b".
The solution was to change findGrandparentLimit() to always return a user key strictly greater than the start key passed to it. I will add an explanatory comment here.
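A sketch of the limit computation under the strictly-greater constraint (simplified to string keys, with no lower-bound skip of non-overlapping head files, so this is not pebble's actual implementation). The key property is that any returned limit is strictly greater than start, which is what guarantees forward progress for tombstone-only regions like the DELRANGE("a", "b") example:

```go
package main

import "fmt"

// fileMeta is a pared-down grandparent file: its largest user key and size.
type fileMeta struct {
	Largest string
	Size    uint64
}

// findGrandparentLimit returns the user key at which the current output
// should be cut, or "" if no cut is needed. It accumulates grandparent
// sizes and cuts at the first largest key strictly greater than start
// once the overlap budget is exceeded; never returning start itself
// prevents the infinite loop when a file holds only range tombstones
// beginning at the limit.
func findGrandparentLimit(grandparents []fileMeta, start string, maxOverlapBytes uint64) string {
	var overlappedBytes uint64
	for _, g := range grandparents {
		overlappedBytes += g.Size
		if overlappedBytes > maxOverlapBytes && g.Largest > start {
			return g.Largest
		}
	}
	return "" // overlap stays within budget: no cut needed
}

func main() {
	gp := []fileMeta{{"a", 10}, {"c", 10}, {"e", 10}}
	// Even with a tiny budget, the limit after "a" must be > "a".
	fmt.Println(findGrandparentLimit(gp, "a", 5))   // "c", never "a"
	fmt.Println(findGrandparentLimit(gp, "a", 100)) // empty: within budget
}
```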
compaction.go, line 1326 at r5 (raw file):
Previously, sumeerbhola wrote…
why this call to findGrandparentLimit here given that the limit has already been initialized for this file?
Oops, good catch. Will remove.
compaction.go, line 1342 at r5 (raw file):
Previously, sumeerbhola wrote…

```go
if key != nil && (grandparentLimit == nil || c.cmp(key.UserKey, grandparentLimit) <= 0) {
	// Size limit reached.
	limit = key.UserKey
} else {
	// Run out of keys or grandparent limit reached.
	if grandParentLimit != nil {
		// This may possibly flush everything.
		limit = grandParentLimit
	} else {
		// No grandparentLimit. Flush everything.
	}
}
```

Not sure if the above is simpler, but some comments in the code would definitely help.
Yeah, this can be written in quite a lot of different ways. Will comment at least.
Force-pushed from 7736852 to 303f51e
TFTR! RFAL.
Reviewable status: 0 of 6 files reviewed, 18 unresolved discussions (waiting on @petermattis and @sumeerbhola)
compaction.go, line 1300 at r5 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
The <= here would be sufficient to prevent the case you're talking about if we were cutting only based on point keys. But imagine findGrandparentLimit("a") == "a" and we have the following data to write: DELRANGE("a", "b") and PUT("b", 1). Then we would execute the outer loop forever, generating files containing only the DELRANGE, as the inner loop would never execute after advancing to "b".
The solution was to change findGrandparentLimit() to always return a user key strictly greater than the start key passed to it. I will add an explanatory comment here.
Done (added comments around findGrandparentLimit()).
compaction.go, line 1326 at r5 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Oops, good catch. Will remove.
Done.
compaction.go, line 1342 at r5 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
Yeah this can be written in quite a lot of different ways. Will comment at least.
Done (took your suggestion).
Reviewed 6 of 8 files at r5, 1 of 1 files at r6.
Reviewable status: all files reviewed, 19 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)
compaction.go, line 1296 at r6 (raw file):

```go
	// to a grandparent file largest key, or nil. Taken together, these
	// progress guarantees ensure that eventually the input iterator will be
	// exhausted and the range tombstone fragments will all be flushed.
```

these loop comments are very helpful!
internal/rangedel/fragmenter.go, line 254 at r2 (raw file):
Previously, ajkr (Andrew Kryczka) wrote…
It was flushing pending range tombstones that were fully before the point key being checked for deletion. But that point key could turn out to be (a) not deleted, and (b) beyond the truncation point according to nextLimitBefore(). If nextLimitBefore() returned a truncation point that was not just before this point key but also before the end of earlier range tombstones, we need to be able to truncate them to that point. We can't do that if they've already been flushed.
Does this change need a unit test for specifically this case or is it covered by the manual_compaction test?
TFTR!
Reviewable status: all files reviewed, 19 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)
internal/rangedel/fragmenter.go, line 254 at r2 (raw file):
Previously, sumeerbhola wrote…
Does this change need a unit test for specifically this case or is it covered by the manual_compaction test?
Without this change, the test beginning at line 140 in "testdata/manual_compaction" fails with this message:

```
$ go test ./ -run TestManualCompaction -count 1
--- FAIL: TestManualCompaction (0.01s)
    compaction_test.go:728:
        testdata/manual_compaction:165:
        expected:
          2:
            9:[a#3,SET-b#72057594037927935,RANGEDEL]
            10:[b#0,RANGEDEL-d#72057594037927935,RANGEDEL]
            11:[d#0,RANGEDEL-f#72057594037927935,RANGEDEL]
            12:[f#0,RANGEDEL-h#3,SET]
          3:
            6:[a#0,SET-b#0,SET]
            7:[c#0,SET-d#0,SET]
            8:[e#0,SET-f#1,SET]
        found:
          2:
            9:[a#3,SET-b#72057594037927935,RANGEDEL]
            10:[h#3,SET-h#3,SET]
          3:
            6:[a#0,SET-b#0,SET]
            7:[c#0,SET-d#0,SET]
            8:[e#0,SET-f#1,SET]
FAIL
```

I guess the SET(h) is reached by compactionIter while writing the first output file (9.sst), causing it to flush tombstone DELRANGE(a, g). It gets cut to b due to grandparent overlap, and then future files do not include that range tombstone as it's already been flushed. So the range [b, g) reappears once compaction finishes.
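The behavior being fixed, truncating a pending tombstone at the output split point while carrying the remainder into the next file, can be sketched with toy types (this is not the fragmenter's actual API, and it assumes the tombstone starts at or before the limit):

```go
package main

import "fmt"

// span is a toy range tombstone over [Start, End).
type span struct {
	Start, End string
}

// truncateAtLimit splits a pending tombstone at an output file split
// point: the prefix [Start, limit) is written into the finished sstable,
// and the remainder [limit, End) must be retained for the next output.
// Flushing the whole tombstone when the first file is cut would lose the
// remainder, which is the failure above: [b, g) vanished after 9.sst was
// cut at b.
func truncateAtLimit(t span, limit string) (emit span, remainder *span) {
	if t.End <= limit {
		return t, nil // tombstone ends at or before the split point
	}
	rest := span{Start: limit, End: t.End}
	return span{Start: t.Start, End: limit}, &rest
}

func main() {
	emit, rest := truncateAtLimit(span{"a", "g"}, "b")
	fmt.Println(emit.Start, emit.End) // a b: written to the finished file
	fmt.Println(rest.Start, rest.End) // b g: carried into the next file
}
```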
Previously range tombstones could cause output files that would later
undergo excessively large compactions due to overlap with the
grandparent level. This happened because only point keys were considered
for cutting output files according to grandparent overlap. Thus if a
range tombstone extended over a region devoid of point keys, compaction
would never split files within that region.
To fix this we now split output files at grandparent boundaries
regardless of the presence of point keys. That means a region covered by
a range tombstone and devoid of point keys can be split at grandparent
boundaries if needed to prevent future huge compactions.
Fixes #167.