cut compaction outputs at grandparent boundaries #382

Merged: 1 commit merged on Dec 4, 2019

Conversation

ajkr
Contributor

@ajkr ajkr commented Nov 6, 2019

Previously range tombstones could cause output files that would later
undergo excessively large compactions due to overlap with the
grandparent level. This happened because only point keys were considered
for cutting output files according to grandparent overlap. Thus if a
range tombstone extended over a region devoid of point keys, compaction
would never split files within that region.

To fix this we now split output files at grandparent boundaries
regardless of the presence of point keys. That means a region covered by
a range tombstone and devoid of point keys can be split at grandparent
boundaries if needed to prevent future huge compactions.

Fixes #167.
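
For illustration, here is a rough, self-contained sketch of the idea (hypothetical types and a simplified findGrandparentLimit, not the PR's actual code): walk the grandparent files that lie past the output file's start key, accumulate their sizes, and cut at the first grandparent boundary at which the accumulated overlap would exceed the configured maximum, whether or not any point key falls in that region. In this sketch the limit is always a grandparent boundary strictly past the start key, which is the forward-progress property discussed later in the review.

package main

import (
	"bytes"
	"fmt"
)

// grandparentFile is a hypothetical stand-in for a grandparent-level table;
// only its largest user key and size matter for choosing output boundaries.
type grandparentFile struct {
	largestUserKey []byte
	size           uint64
}

// findGrandparentLimit returns the user key at which the current output file
// should be cut, or nil if the remaining grandparent overlap never exceeds
// maxOverlapBytes. The returned key is always strictly greater than start, so
// an output that contains only a range tombstone still makes forward progress.
func findGrandparentLimit(grandparents []grandparentFile, start []byte, maxOverlapBytes uint64) []byte {
	var overlapped uint64
	for _, g := range grandparents {
		// Ignore grandparent files that end at or before the output's start key.
		if bytes.Compare(g.largestUserKey, start) <= 0 {
			continue
		}
		overlapped += g.size
		if overlapped > maxOverlapBytes {
			return g.largestUserKey
		}
	}
	return nil
}

func main() {
	grandparents := []grandparentFile{
		{largestUserKey: []byte("c"), size: 4 << 20},
		{largestUserKey: []byte("f"), size: 4 << 20},
		{largestUserKey: []byte("j"), size: 4 << 20},
	}
	// Even if ["a", "j") is covered by a range tombstone and holds no point
	// keys, the output file is still cut once enough grandparent bytes overlap.
	fmt.Printf("cut at %q\n", findGrandparentLimit(grandparents, []byte("a"), 6<<20)) // cut at "f"
}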

@petermattis
Collaborator

This change is Reviewable

Contributor Author

@ajkr ajkr left a comment

It's a bit WIP as multiple tests are failing, one for unknown reasons (looks like an unrelated delete range bug at first glance).

Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion


compaction.go, line 101 at r1 (raw file):

	overlappedBytes uint64 // bytes of overlap with grandparent tables

	// TODO(ajkr): seenKey should be true if a range tombstone has been passed that

I checked and compactionIter was not returning sentinel keys for the range tombstones output by the compaction -- should it have been?

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 2 unresolved discussions


compaction.go, line 1263 at r1 (raw file):

	limit := c.nextLimitBefore(nil)
	for limit != nil {
		if err := finishOutput(*limit); err != nil {

I should also add a test for an empty region (i.e., devoid of both point keys and range tombstones) in the middle of a compaction that spans multiple grandparent limits. Need to make sure finishOutput() works fine when called on empty files.

Collaborator

@petermattis petermattis left a comment

The structure looks like what I was imagining. It's actually a smaller change than I thought it would be, though handling tombstones at the beginning of an sstable will change that.

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr)


compaction.go, line 101 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I checked and compactionIter was not returning sentinel keys for the range tombstones output by the compaction -- should it have been?

Not sure I understand the question. compactionIter doesn't return range tombstone sentinel keys. Those are generated by the finishOutput closure when constructing the sstable boundary keys. I'm pretty sure RocksDB compaction mechanics are similar.

It is true that compactionIter.Next doesn't return range tombstones. Those are buffered in compactionIter.tombstones. Would it be sufficient to provide another accessor on compactionIter that returned the first buffered tombstone (if it exists)?


compaction.go, line 1237 at r1 (raw file):

		limit := c.nextLimitBefore(key)
		for limit != nil {

Nit: I'd structure this as for limit := c.nextLimitBefore(nil); limit != nil; limit = c.nextLimitBefore(nil) {. Or perhaps:

for {
  limit := c.nextLimitBefore(key)
  if limit == nil {
    break
  }
  if err := finishOutput(*limit); err != nil {
    return nil, pendingOutputs, err
  }
}

Then you have a single call to c.nextLimitBefore for the loop. Ditto for the loop below. Could also pull this loop out into closure so it can be shared, but given it is only a few lines of code I don't feel strongly about that.


testdata/manual_compaction, line 85 at r1 (raw file):


compact a-e L1
----

I think you should change the test to use Version.DebugString() to output the LSM structure which will display internal keys for the boundaries rather than just the user keys. This will make the presence of range tombstone sentinel keys clearer. In case you're not aware, the --rewrite flag will make this change easy.


testdata/manual_compaction, line 129 at r1 (raw file):

  11:[b-d]
  12:[d-f]
  13:[f-g]

Neat!

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 101 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not sure I understand the question. compactionIter doesn't return range tombstone sentinel keys. Those are generated by the finishOutput closure when constructing the sstable boundary keys. I'm pretty sure RocksDB compaction mechanics are similar.

It is true that compactionIter.Next doesn't return range tombstones. Those are buffered in compactionIter.tombstones. Would it be sufficieint to provide another accessor on compactionIter that returned the first buffered tombstone (if it exists)?

It is hard because Tombstones() isn't called and compactionIter.tombstones isn't populated until finishOutput(), which is too late for us to use its result to truncate the first file. We could use the first range tombstone returned by compactionIter.iter, but that might end up being elided from the output, and we can't really tell ahead of time since elision depends on how it's fragmented, which we again don't know until Tombstones() is called.

Going back a step, seenKey feels inconsistent already. It lets a prefix of grandparent files be omitted from the overlappedBytes calculation if the first output file doesn't overlap with them. But what if the second output file doesn't overlap with grandparent files between the first output file and the second output file? Currently those intermediate files are counted towards the second output file's overlappedBytes regardless.

We could just drop seenKey for simplicity and consistency. Actually I tried to do that at first but got scared by all the test failures. May be time to try it again.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 101 at r1 (raw file):

We could just drop seenKey for simplicity and consistency. Actually I tried to do that at first but got scared by all the test failures. May be time to try it again.

Didn't work out as another problem surfaced. compactionIter adds range tombstones to a Fragmenter and then skips over their sentinel keys. By the time it reaches the next point key the Fragmenter may have already flushed some range tombstones due to non-overlap. Then, nextLimitBefore() may tell us to truncate to a point before the point to which the Fragmenter already flushed, which leads to a panic.

Back to plan A, I think this problem could be avoided if compactionIter emits sentinel range tombstone start keys.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 101 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

We could just drop seenKey for simplicity and consistency. Actually I tried to do that at first but got scared by all the test failures. May be time to try it again.

Didn't work out as another problem surfaced. compactionIter adds range tombstones to a Fragmenter and then skips over their sentinel keys. By the time it reaches the next point key the Fragmenter may have already flushed some range tombstones due to non-overlap. Then, nextLimitBefore() may tell us to truncate to a point before the point to which the Fragmenter already flushed, which leads to a panic.

Back to plan A, I think this problem could be avoided if compactionIter emits sentinel range tombstone start keys.

Plan A doesn't work either since it's still possible the Fragmenter is flushed beyond the file's truncation point. In particular since we add a range tombstone to the Fragmenter before compactionIter returns its start key as a sentinel key, the Fragmenter may flush range tombstones we want to truncate before we call nextLimitBefore / FlushTo.

Another thing to try is changing the Fragmenter to only flush fragments when FlushTo is called rather than when a range tombstone is added that comes after all pending tombstones (truncateAndFlush). Though that feels complicated. I wonder if there's an easier way.
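
To make that idea concrete, here is a much-simplified, hypothetical sketch (not the real rangedel.Fragmenter; it ignores fragmentation by overlap and seqnum, and all names are invented): Add only buffers tombstones, and they are emitted, truncated as needed, only when FlushTo is called, so nothing can be flushed past a truncation point that is chosen later.

package main

import (
	"bytes"
	"fmt"
)

// span is a hypothetical, simplified range tombstone covering [start, end).
type span struct {
	start, end []byte
}

// bufferingFragmenter is a much-simplified model of the proposal above: Add
// only buffers, and tombstones are emitted solely by FlushTo, so nothing can
// be written past a truncation point that is chosen later.
type bufferingFragmenter struct {
	pending []span
	emit    func(span)
}

// Add buffers a tombstone. Unlike an eager truncate-and-flush, it emits nothing.
func (f *bufferingFragmenter) Add(s span) {
	f.pending = append(f.pending, s)
}

// FlushTo emits the parts of the pending tombstones that lie before limit and
// keeps the remainder, truncated to start at limit, for the next output file.
func (f *bufferingFragmenter) FlushTo(limit []byte) {
	var rest []span
	for _, s := range f.pending {
		switch {
		case bytes.Compare(s.start, limit) >= 0:
			rest = append(rest, s) // entirely at or after the limit
		case bytes.Compare(s.end, limit) <= 0:
			f.emit(s) // entirely before the limit
		default:
			f.emit(span{s.start, limit})            // head goes into the current file
			rest = append(rest, span{limit, s.end}) // tail carries over
		}
	}
	f.pending = rest
}

func main() {
	f := &bufferingFragmenter{emit: func(s span) { fmt.Printf("emit [%s, %s)\n", s.start, s.end) }}
	f.Add(span{[]byte("a"), []byte("z")})
	f.FlushTo([]byte("m")) // emits [a, m); keeps [m, z) for the next file
}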

Collaborator

@petermattis petermattis left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr)


compaction.go, line 101 at r1 (raw file):

Another thing to try is changing the Fragmenter to only flush fragments when FlushTo is called rather than when a range tombstone is added that comes after all pending tombstones (truncateAndFlush). Though that feels complicated. I wonder if there's an easier way.

This sounds desirable. The behavior of compactionIter.Tombstones() differs from the handling of point operations. With point operations, we get to decide which sstable the operation goes into. But a range tombstone gets added eagerly to compactionIter.Tombstones() which feels wrong. Perhaps the general structure of compactionIter should be to return all of the records (both point and range) that will be added to an sstable, and the caller of compactionIter can use the range tombstones for sstable boundary decisions, but would otherwise ignore them. Or perhaps the rangedel.Fragmenter should be pulled out of compactionIter and passed as reference. This would be quite a bit of refactoring, but it is the hiding of range tombstones from the compactionIter processing that is causing problems.
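
A rough sketch of that shape, using entirely hypothetical types rather than a proposed pebble API: the iterator yields range tombstone start keys alongside point keys, and the compaction loop, not compactionIter, feeds the tombstones to the fragmenter, using them only for sstable boundary decisions.

package main

import "fmt"

type keyKind int

const (
	kindPoint         keyKind = iota // a point key/value
	kindRangeDelStart                // the sentinel start key of a range tombstone
)

// entry is a hypothetical record yielded by the compaction iterator: either a
// point key with its value, or a range tombstone's start key with its end key.
type entry struct {
	kind       keyKind
	key, value string
}

func main() {
	// Both point keys and range tombstone start keys are visible to the caller.
	stream := []entry{
		{kindRangeDelStart, "a", "f"}, // tombstone [a, f)
		{kindPoint, "b", "v1"},
		{kindPoint, "g", "v2"},
	}

	var buffered []entry // stands in for calls to a fragmenter's Add
	for _, e := range stream {
		switch e.kind {
		case kindRangeDelStart:
			// The compaction loop adds the tombstone to the fragmenter itself;
			// its start key can also participate in output-boundary decisions.
			buffered = append(buffered, e)
		case kindPoint:
			// Point keys go straight to the sstable writer (printed here).
			fmt.Printf("write point %s=%s\n", e.key, e.value)
		}
	}
	fmt.Printf("buffered %d tombstone(s) to write during finishOutput\n", len(buffered))
}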

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 101 at r1 (raw file):

Perhaps the general structure of compactionIter should be to return all of the records (both point and range) that will be added to an sstable, and the caller of compactionIter can use the range tombstones for sstable boundary decisions, but would otherwise ignore them.

Agreed with this, but additionally we will need to change Fragmenter to buffer flushed fragments all the way until FlushTo() is called, right? Doing this but leaving the Fragmenter the same seems to not work as Fragmenter might have already flushed fragments beyond the FlushTo() truncation point before we call it. That happens just because compactionIter calls Fragmenter.Add() before returning a range tombstone sentinel key.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 101 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Perhaps the general structure of compactionIter should be to return all of the records (both point and range) that will be added to an sstable, and the caller of compactionIter can use the range tombstones for sstable boundary decisions, but would otherwise ignore them.

Agreed with this, but additionally we will need to change Fragmenter to buffer flushed fragments all the way until FlushTo() is called, right? Doing this but leaving the Fragmenter the same seems to not work as Fragmenter might have already flushed fragments beyond the FlushTo() truncation point before we call it. That happens just because compactionIter calls Fragmenter.Add() before returning a range tombstone sentinel key.

There is also the option of waiting to call Fragmenter.Add() until the Next() following a range tombstone sentinel key. It feels a bit confusing.

Collaborator

@petermattis petermattis left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr)


compaction.go, line 101 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

There is also the option of waiting to call Fragmenter.Add() until the Next() following a range tombstone sentinel key. It feels a bit confusing.

I was imagining that the compaction code would call Fragmenter.Add(), not compactionIter. compactionIter would only call Fragmenter.Deleted. There might be challenges in actually doing this, though.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 101 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I was imagining the the compaction code would call Fragmenter.Add(), not compactionIter. compactionIter would only call Fragmenter.Deleted. There might be challenges in actually doing this, though.

Right, I wrote that comment before I understood your point. Makes sense.

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch 5 times, most recently from 662b236 to b7afd57 Compare November 13, 2019 23:00
Contributor Author

@ajkr ajkr left a comment

TFTR. RFAL.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @petermattis)


compaction.go, line 101 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Right, I wrote that comment before I understood your point. Makes sense.

Done.


compaction.go, line 1237 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: I'd structure this as for limit := c.nextLimitBefore(nil); limit != nil; limit = c.nextLimitBefore(nil) {. Or perhaps:

for {
  limit := c.nextLimitBefore(key)
  if limit == nil {
    break
  }
  if err := finishOutput(*limit); err != nil {
    return nil, pendingOutputs, err
  }
}

Then you have a single call to c.nextLimitBefore for the loop. Ditto for the loop below. Could also pull this loop out into closure so it can be shared, but given it is only a few lines of code I don't feel strongly about that.

Done.


compaction.go, line 1263 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I should also add a test for an empty region (i.e., devoid of both point keys and range tombstones) in the middle of a compaction that spans multiple grandparent limits. Need to make sure finishOutput() works fine when called on empty files.

Not important: the function simply returns nil when nothing's been added. A more important case, which the metamorphic tests found, is when the grandparent level causes multiple files to be cut at the same user key and a non-first file among them contains only range tombstones.


testdata/manual_compaction, line 85 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think you should change the test to use Version.DebugString() to output the LSM structure which will display internal keys for the boundaries rather than just the user keys. This will make the presence of range tombstone sentinel keys clearer. In case you're not aware, the --rewrite flag will make this change easy.

Done.

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch from b7afd57 to b315597 Compare November 13, 2019 23:11
Collaborator

@petermattis petermattis left a comment

Generally looks good. I need to do another pass on the test cases, but I had a few questions first.

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction.go, line 1179 at r2 (raw file):

				// to this function's truncation logic.
				n := len(ve.NewFiles)
				if n > 0 && key.UserKey != nil && d.cmp(key.UserKey, ve.NewFiles[n-1].Meta.Largest.UserKey) <= 0 {

I'm not sure what is going on here. How can the key passed to finishOutput be less than or equal to the last sstable's largest key? I see mention of range tombstones in the comment above, yet the keys involved here are not necessarily range tombstones. Extending the comment with an example might help.


compaction.go, line 1312 at r2 (raw file):

			// But do add them to the `Fragmenter` to make them visible to
			// `compactionIter` so covered keys in the same snapshot stripe
			// can be elided.

Nit: The first sentence in the comment reads a little bit strange to me as we have never been adding range tombstones to the output table in this loop. I think that sentence could be removed and then the second one adjusted to state: Range tombstones are handled specially. They are fragmented and written later during finishOutput() ....


pacer.go, line 118 at r2 (raw file):

func (p *compactionPacer) maybeThrottle(bytesIterated uint64) error {
	if bytesIterated == 0 {
		return nil

What is this change about? Did something about your change cause us to call maybeThrottle(0)?


internal/rangedel/fragmenter.go, line 254 at r2 (raw file):

	}

	if flush {

Can you explain this part of the change? It isn't obvious to me why it is needed.

Collaborator

@sumeerbhola sumeerbhola left a comment

still reading compactionIter -- some initial comments

Reviewable status: 0 of 10 files reviewed, 8 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 212 at r2 (raw file):

			i.rangeDelFrag.Add(i.cloneKey(i.key), i.iterValue)
		}
		if i.rangeDelFrag.Deleted(i.key, i.curSnapshotSeqNum) {

I am glad that the new code does not fall through here to call Deleted() with the range tombstone itself. It is harmless but made my head hurt.


compaction_iter.go, line 287 at r2 (raw file):

			i.value = make([]byte, len(i.iterValue))
			copy(i.value, i.iterValue)
			i.iterNext()

this call to iterNext() is possibly a bug.
My understanding is that the consistency of iterKey, iterValue, curSnapshotIdx, curSnapshotSeqNum are established by First() and maintained by nextInStripe(). This call uses iterNext() directly that does not maintain that consistency. Consider a sequence like [a, d)#7, a#6 KeyKindInvalid, a#4 where there was a snapshot at 5. The a#4 will get deleted I think because we step to a#4 but leave the curSnapshotSeqNum equal to 6. This should be easy to test.


compaction_iter.go, line 337 at r2 (raw file):

			// as i.key. Adding tombstones earlier can violate a rangedel.Fragmenter
			// invariant and lead to too many tombstones being added to an sstable.
			i.rangeDelFrag.Add(i.cloneKey(*key), i.iterValue)

this code was dubious in that it moved to the next key without setting curSnapshotIdx and curSnapshotSeqNum. Glad that we are doing the cleaner thing now.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 10 files reviewed, 10 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 128 at r2 (raw file):

	iter  internalIterator
	err   error
	key   InternalKey

I had a hard time reasoning about key. Maybe a comment like the following would help?

	// key.UserKey is in one of two states:
	// - Set to i.iterKey.UserKey: this happens within Next() and is a transient state before the code
	//   decides to save the key.
	// - Set to keyBuf: caused by saving i.iterKey.UserKey. This is the case on return from all public methods -- these methods return key.
	//   Additionally, it is the internal state when the code is moving to the next key so it can determine
	//   whether the user key has changed from the previous key.

compaction_iter.go, line 201 at r2 (raw file):

	}

	if i.skip {

I found the iterator positioning hard to understand. How about a comment like the following

	// Prior to this call to Next() we are in one of two situations wrt iterKey and related state:
	// - skip is false, and iterKey is already at the next key.
	// - skip is true and we are at the key that has been returned. Doing the skip will move us forward.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 10 files reviewed, 12 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 379 at r2 (raw file):

		case InternalKeyKindSet:
			if i.rangeDelFrag.Deleted(*key, i.curSnapshotSeqNum) {

This seems incorrect in the way it is using i.curSnapshotSeqNum. There are two questions here (a) is this point deleted, (b) should the point not be emitted despite the snapshot constraints. (b)=>(a) but (a) does not imply (b). This code is looking at !(b) and using it to merge. But (b) could be false and (a) true. For example, if the original merge was at #25, a tombstone at #21, a snapshot at #19 and now a set at #17. The curSnapshotSeqNum is 17 so it needs to be emitted, but it should not be merged either. So I think the code should be something like

// Current code
if (b) {
  ...
}
if (a) {
  i.skip = false
  return
}
// merge -- current code

Did I miss something?


compaction_iter.go, line 444 at r2 (raw file):

		default:
			i.err = fmt.Errorf("invalid internal key kind: %d", i.iterKey.Kind())

Our handling of InternalKeyKindInvalid seems inconsistent. In this file, Next() will just return it without setting an error while singleDeleteNext() and mergeNext() set the error. Why is that? Also, it seems inconsistent how these keys come to be: Iterator.decodeKey sets the UserKey to nil when this happens, while DecodeInternalKey() in internal.go sets the UserKey to the full encoded key.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 10 files reviewed, 12 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 379 at r2 (raw file):

Previously, sumeerbhola wrote…

This seems incorrect in the way it is using i.curSnapshotSeqNum. There are two questions here (a) is this point deleted, (b) should the point not be emitted despite the snapshot constraints. (b)=>(a) but (a) does not imply (b). This code is looking at !(b) and using it to merge. But (b) could be false and (a) true. For example, if the original merge was at #25, a tombstone at #21, a snapshot at #19 and now a set at #17. The curSnapshotSeqNum is 17 so it needs to be emitted, but it should not be merged either. So I think the code should be something like

// Current code
if (b) {
  ...
}
if (a) {
  i.skip = false
  return
}
// merge -- current code

Did I miss something?

never mind. The !i.nextInStripe() earlier in the loop would prevent this. We already know this set is in the same stripe.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 202 at r2 (raw file):

	if i.skip {
		i.skip = false

skip is somewhat subtle. As far as I can tell, it used because a point being output is going to mask the remaining points (Set, Merge, Delete, SingleDelete that turns into a Delete) with the same UserKey. The old code used to process range tombstones internally, so it would have the desired behavior in that a preceding skip=true would go through all the remaining keys with the same UserKey (in the stripe) in this call to skipStripe() and set skip=false. The new code will interrupt the skipStripe() due to a range tombstone and set skip=false. The next point key could be at the same UserKey but skip is false. But it may also be deleted by the range tombstone, so we won't unnecessarily emit it -- except if it has the same sequence number.

Example: a#10, [a, c)#7, a#7, a#6, [a, d)#5, a#4

Old code (bullets are calls to Next())

  • return a#10, skip is true
  • iterate over all the a's and feed the tombstone to the fragmenter.

New code:

Is my understanding correct?

In general I think we need more code comments in this file that explain the purpose of each construct and some invariants.

Contributor Author

@ajkr ajkr left a comment

Thanks a lot for all the detailed reviews. Still working on replying to everything.

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)


compaction.go, line 1179 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not sure what is going on here. How can the key passed to finishOutput be less than or equal to the last sstable's largest key? I see mention of range tombstones in the comment above, yet the keys involved here are not necessarily range tombstones. Extending the comment with an example might help.

The key passed to finishOutput() can have user key equal to the last file's largest. It should never be smaller. I can change the <= to an == for clarity.

Here is the simplest test case I can come up with:

define target-file-sizes=(1, 1, 1, 1)
L1
  a.SET.2:v
  c.SET.2:v
L2
  a.RANGEDEL.3:c
L3
  b.SET.2:v
L3
  b.SET.1:v
L3
  b.SET.0:v
----
1:
  4:[a-c]
2:
  5:[a-c]
3:
  6:[b-b]
  7:[b-b]
  8:[b-b]

compact a-c L1
----
2:
  9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
  10:[b#0,RANGEDEL-c#2,SET]
3:
  6:[b#2,SET-b#2,SET]
  7:[b#1,SET-b#1,SET]
  8:[b#0,SET-b#0,SET]

Without this change the failure is:

--- FAIL: TestManualCompaction (0.01s)
    compaction_test.go:728:
        testdata/manual_compaction:230:
        expected:
        2:
          9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
          10:[b#0,RANGEDEL-c#2,SET]
        3:
          6:[b#2,SET-b#2,SET]
          7:[b#1,SET-b#1,SET]
          8:[b#0,SET-b#0,SET]

        found:
        2:
          9:[a#3,RANGEDEL-b#72057594037927935,RANGEDEL]
          10:[b#0,RANGEDEL-b#72057594037927935,RANGEDEL]
          11:[b#0,RANGEDEL-c#2,SET]
        3:
          6:[b#2,SET-b#2,SET]
          7:[b#1,SET-b#1,SET]
          8:[b#0,SET-b#0,SET]
FAIL
FAIL    github.com/cockroachdb/pebble   0.015s

In particular note the unexpected presence of 10:[b#0,RANGEDEL-b#72057594037927935,RANGEDEL]. That showed up because we cut two files at the same key.UserKey (b), and the second one had no point keys in it. Then the range tombstone a.RANGEDEL.3:c got added to 10.sst, and since it had no point keys we used the range tombstone to determine both the file's smallest and largest boundary.

The fix here is to skip writing a file when we know its smallest and largest user keys will be equal and it does not contain any point keys. We know the file doesn't contain point keys because tw is still nil when finishOutput() is called.

BTW, this case was found by the metamorphic test.
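
A minimal sketch of that guard, with hypothetical names rather than the PR's actual finishOutput() code, assuming the caller knows the previous output file's largest user key and whether a table writer was ever created for the current file:

package main

import (
	"bytes"
	"fmt"
)

// shouldElideOutput sketches the check described above: elide the output if no
// point keys were written (no sstable writer was created) and the boundary key
// passed to finishOutput() equals the previous output file's largest user key,
// since the file would otherwise span a single user key and contain only range
// tombstones.
func shouldElideOutput(prevLargestUserKey, limitUserKey []byte, haveTableWriter bool) bool {
	return !haveTableWriter && prevLargestUserKey != nil && limitUserKey != nil &&
		bytes.Equal(limitUserKey, prevLargestUserKey)
}

func main() {
	// Two files cut at the same user key "b": the second holds only a range
	// tombstone, so it can be skipped entirely.
	fmt.Println(shouldElideOutput([]byte("b"), []byte("b"), false)) // true
	fmt.Println(shouldElideOutput([]byte("b"), []byte("c"), true))  // false
}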


compaction.go, line 1312 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: The first sentence in the comment reads a little bit strange to me as we have never been adding range tombstones to the output table in this loop. I think that sentence could be removed and then the second one adjusted to state: Range tombstones are handled specially. They are fragmented and written later during finishOutput() ....

We have never been, but the assumption is the reader cares about consistency of the current state of the code, not the history. The inconsistency that would be apparent to me as a reader is point keys are added to tw immediately in this loop while range tombstones are not. The phrasing you're suggesting works for me too, so I'll change it.


compaction_iter.go, line 128 at r2 (raw file):

Previously, sumeerbhola wrote…

I had a hard time reasoning about key. Maybe a comment like the following would help?

	// key.UserKey is in one of two states:
	// - Set to i.iterKey.UserKey: this happens within Next() and is a transient state before the code
	//   decides to save the key.
	// - Set to keyBuf: caused by saving i.iterKey.UserKey. This is the case on return from all public methods -- these methods return key.
	//   Additionally, it is the internal state when the code is moving to the next key so it can determine
	//   whether the user key has changed from the previous key.

Is the transient state of having key.UserKey assigned to i.iterKey.UserKey necessary? I am wondering if we should just get rid of it and use i.iterKey directly until we decide to save the key. It seems error-prone right now.


compaction_iter.go, line 287 at r2 (raw file):

Previously, sumeerbhola wrote…

this call to iterNext() is possibly a bug.
My understanding is that the consistency of iterKey, iterValue, curSnapshotIdx, curSnapshotSeqNum are established by First() and maintained by nextInStripe(). This call uses iterNext() directly that does not maintain that consistency. Consider a sequence like [a, d)#7, a#6 KeyKindInvalid, a#4 where there was a snapshot at 5. The a#4 will get deleted I think because we step to a#4 but leave the curSnapshotSeqNum equal to 6. This should be easy to test.

Yes, good catch. In fact I had to debug this exact problem already since my code for returning range tombstones from compactionIter was copied from here. We should always use nextInStripe() for advancing the iterator.

Taking a step back, though, should compaction simply fail if there's a key parsing error?


compaction_iter.go, line 444 at r2 (raw file):

Previously, sumeerbhola wrote…

Our handling of InternalKeyKindInvalid seems inconsistent. In this file, Next() will just return it without setting an error while singleDeleteNext() and mergeNext() set the error. Why is that? Also, it seems inconsistent how these keys come to be: Iterator.decodeKey sets the UserKey to nil when this happens, while DecodeInternalKey() in internal.go sets the UserKey to the full encoded key.

My opinion:

  • compactionIter::Next() should set i.err in the same way as {singleDelete,merge}Next(). I think we shouldn't allow a compaction that writes out a corrupt key to succeed.
  • DecodeInternalKey() should set nil user key when the trailer cannot be parsed. We should avoid exposing user key for invalid keys in the first place, either to user reads or compactions. We might as well prevent this possibility from the lowest level by making user key nil in this case.

pacer.go, line 118 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What is this change about? Did something about your change cause us to call maybeThrottle(0)?

Yes, it is called in the main compaction loop, which now sees keys for range tombstones. When we have only seen range tombstones, c.bytesIterated (the argument passed to maybeThrottle()) is zero.


internal/rangedel/fragmenter.go, line 254 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you explain this part of the change? It isn't obvious to me why it is needed.

It was flushing pending range tombstones that were fully before the point key being checked for deletion. But that point key could turn out to be (a) not deleted, and (b) beyond the truncation point according to nextLimitBefore(). If nextLimitBefore() returned a truncation point that was not just before this point key but also before the end of earlier range tombstones, we need to be able to truncate them to that point. We can't do that if they've already been flushed.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction_iter.go, line 201 at r2 (raw file):

Previously, sumeerbhola wrote…

I found the iterator positioning hard to understand. How about a comment like the following

	// Prior to this call to Next() we are in one of two situations wrt iterKey and related state:
	// - skip is false, and iterKey is already at the next key.
	// - skip is true and we are at the key that has been returned. Doing the skip will move us forward.

Yes, it is very confusing that skip=false implies we're already at the next key. Would adding a variable atNext be helpful? We could have the following possibilities.

  • !skip && atNext: this means we already advanced to the next key after the one we'll return, like in mergeNext().
  • !skip && !atNext: this means we need to call nextInStripe() at the top of the following Next(). It'd be intuitive (for me) for this to be the setting when returning a range tombstone, rather than calling nextInStripe() before returning. It also allows us to not worry about whether we need to copy the value before returning, like we're doing right now for invalid keys.
  • skip && !atNext: this means we need to call skipStripe() at the top of the following Next(). It'd be the case when we see a SET, DELETE, or other key kind that covers the remainder of the stripe.
  • skip && atNext: this is invalid.

compaction_iter.go, line 202 at r2 (raw file):

Previously, sumeerbhola wrote…

skip is somewhat subtle. As far as I can tell, it used because a point being output is going to mask the remaining points (Set, Merge, Delete, SingleDelete that turns into a Delete) with the same UserKey. The old code used to process range tombstones internally, so it would have the desired behavior in that a preceding skip=true would go through all the remaining keys with the same UserKey (in the stripe) in this call to skipStripe() and set skip=false. The new code will interrupt the skipStripe() due to a range tombstone and set skip=false. The next point key could be at the same UserKey but skip is false. But it may also be deleted by the range tombstone, so we won't unnecessarily emit it -- except if it has the same sequence number.

Example: a#10, [a, c)#7, a#7, a#6, [a, d)#5, a#4

Old code (bullets are calls to Next())

  • return a#10, skip is true
  • iterate over all the a's and feed the tombstone to the fragmenter.

New code:

Is my understanding correct?

In general I think we need more code comments in this file that explain the purpose of each construct and some invariants.

Yes your understanding is correct.

"a#7 is not covered by tombstone, so return a#7, set skip to true" - this is a bug as we don't need to output a#7 since a#10 already covers it. Well it's a bug in the sense of outputting one extra key in an unlikely scenario, not a correctness bug.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 10 files reviewed, 13 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction_iter.go, line 202 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Yes your understanding is correct.

"a#7 is not covered by tombstone, so return a#7, set skip to true" - this is a bug as we don't need to output a#7 since a#10 already covers it. Well it's a bug in the sense of outputting one extra key in an unlikely scenario, not a correctness bug.

I think we can solve this case and some of the confusion about skipStripe() as follows.

  • rename skipStripe() to skipCoveredKeysInStripe()
  • keep track of what snapshot we were at before calling skipCoveredKeysInStripe()
  • if it returns and we are still in the same stripe (e.g., because it found a range tombstone) we will leave skip set to true when we return
  • if it returns and we are in a different stripe (e.g., because it hit a new user key or a different stripe), we reset skip to false before returning

This way, when there's something like a range tombstone that interrupts us while skipping to the next stripe, we can preserve skip=true so we don't accidentally return extra keys.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 0 of 10 files reviewed, 14 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction.go, line 1179 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote… (the finishOutput() explanation and test case quoted in full earlier in this thread)

Done (changed <= to ==, updated comment, added test case).


compaction.go, line 1312 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

We have never been, but the assumption is the reader cares about consistency of the current state of the code, not the history. The inconsistency that would be apparent to me as a reader is point keys are added to tw immediately in this loop while range tombstones are not. The phrasing you're suggesting works for me too, so I'll change it.

Done.


compaction_iter.go, line 128 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Is the transient state of having key.UserKey assigned to i.iterKey.UserKey necessary? I am wondering if we should just get rid of it and use i.iterKey directly until we decide to save the key. It seems error-prone right now.

Done (added the comment). Opened #402 for future refactoring.


compaction_iter.go, line 201 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Yes, it is very confusing that skip=false implies we're already at the next key. Would adding a variable atNext be helpful? We could have the following possibilities.

  • !skip && atNext: this means we already advanced to the next key after the one we'll return, like in mergeNext().
  • !skip && !atNext: this means we need to call nextInStripe() at the top of the following Next(). It'd be intuitive (for me) for this to be the setting when returning a range tombstone, rather than calling nextInStripe() before returning. It also allows us to not worry about whether we need to copy the value before returning, like we're doing right now for invalid keys.
  • skip && !atNext: this means we need to call skipStripe() at the top of the following Next(). It'd be the case when we see a SET, DELETE, or other key kind that covers the remainder of the stripe.
  • skip && atNext: this is invalid.

Done (named it advanced instead of atNext as the advancing that happens at the top of Next() does not necessarily land on the key that Next() will return).


compaction_iter.go, line 202 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I think we can solve this case and some of the confusion about skipStripe() as follows.

  • rename skipStripe() to skipCoveredKeysInStripe()
  • keep track of what snapshot we were at before calling skipCoveredKeysInStripe()
  • if it returns and we are still in the same stripe (e.g., because it found a range tombstone) we will leave skip set to true when we return
  • if it returns and we are in a different stripe (e.g., because it hit a new user key or a different stripe), we reset skip to false before returning

This way, when there's something like a range tombstone that interrupts us while skipping to the next stripe, we can preserve skip=true so we don't accidentally return extra keys.

Done (called it skipInStripe() rather than skipCoveredKeysInStripe() for brevity).


compaction_iter.go, line 287 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Yes, good catch. In fact I had to debug this exact problem already since my code for returning range tombstones from compactionIter was copied from here. We should always use nextInStripe() for advancing the iterator.

Taking a step back, though, should compaction simply fail if there's a key parsing error?

Done (removed this special case for InternalKeyKindInvalid so it'll fall through to the error case).


compaction_iter.go, line 444 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

My opinion:

  • compactionIter::Next() should set i.err in the same way as {singleDelete,merge}Next(). I think we shouldn't allow a compaction that writes out a corrupt key to succeed.
  • DecodeInternalKey() should set nil user key when the trailer cannot be parsed. We should avoid exposing user key for invalid keys in the first place, either to user reads or compactions. We might as well prevent this possibility from the lowest level by making user key nil in this case.

Opened #401.


compaction_iter.go, line 348 at r3 (raw file):

	}
	return origSnapshotIdx == i.curSnapshotIdx && i.iterKey != nil &&
		i.cmp(i.key.UserKey, i.iterKey.UserKey) == 0

This key comparison is not desirable.


testdata/compaction_iter, line 920 at r3 (raw file):

tombstones
----
a#3,7:

This change is not desirable.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewed 1 of 4 files at r3.
Reviewable status: 1 of 10 files reviewed, 15 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)


compaction_iter.go, line 249 at r3 (raw file):

			// set `skip` to true here. That's because the next key might be
			// a point key at the same seqnum, which should be considered newer
			// than, and thus not covered by, the current range tombstone.

The comment after "Subtly ..." needs to be adjusted.


compaction_iter.go, line 348 at r3 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

This key comparison is not desirable.

nextInStripe() could return one of 3 values: TRUE, TRUE_NOT_SKIPPABLE, FALSE, since it is already doing the key comparison to distinguish between the first 2 cases.
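
For concreteness, a tiny sketch of that suggestion (the names are invented for illustration, not taken from pebble): nextInStripe() returns one of three values, so the caller no longer needs its own user-key comparison.

package main

import "fmt"

// stripeChangeType is a hypothetical three-valued result for nextInStripe(),
// replacing a bool plus a separate user-key comparison in the caller.
type stripeChangeType int

const (
	newStripe              stripeChangeType = iota // new user key or new snapshot stripe
	sameStripeSkippable                            // same stripe and the key can be skipped
	sameStripeNonSkippable                         // same stripe, but e.g. a range tombstone: do not skip
)

func (s stripeChangeType) String() string {
	return [...]string{"newStripe", "sameStripeSkippable", "sameStripeNonSkippable"}[s]
}

func main() {
	fmt.Println(newStripe, sameStripeSkippable, sameStripeNonSkippable)
}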

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch from 13c282a to a5377f0 Compare November 19, 2019 06:47
Contributor Author

@ajkr ajkr left a comment

TFTR.

Reviewable status: 0 of 10 files reviewed, 14 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction_iter.go, line 249 at r3 (raw file):

Previously, sumeerbhola wrote…

The comment after "Subtly ..." needs to be adjusted.

Done.


compaction_iter.go, line 348 at r3 (raw file):

Previously, sumeerbhola wrote…

nextInStripe() could return one of 3 values: TRUE, TRUE_NOT_SKIPPABLE, FALSE, since it is already doing the key comparison to distinguish between the first 2 cases.

Done.

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch 3 times, most recently from c130d3a to 6a18cc6 Compare November 19, 2019 06:58
Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewed 1 of 9 files at r2, 1 of 1 files at r4.
Reviewable status: 2 of 10 files reviewed, 15 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)


compaction.go, line 394 at r4 (raw file):

		if key != nil && base.InternalCompare(c.cmp, *key, g.Largest) <= 0 {
			break
		}

Has the seenKey optimization that omitted a prefix of grandparent files that did not overlap at all been replaced by something that I don't spot here? I also realized now that the original optimization also applied to gaps between files, since when returning true the key would become the starting point of the next file and we have already skipped over the files preceding that key.

As far as I can tell, this new logic is using a new key (point or range tombstone start key) to try to prevent extending previously added range tombstones into the head of the grandparents. Is that good enough? Say we had a tombstone [a, z)#10 followed by y#9, and the grandparents were [a#8, b#7], [c#8, d#7], [e#8, f#7], ... [x#8, y#7], [y#5, z#8]. Then when we encounter y#9 this code will pick y#7 as the key to return. Is my understanding correct? It prevents the tombstone from extending into [y#5, z#8], but it has picked up a large number of earlier files. And it confuses me to see code that picks a limit before key -- the current logic was simpler to reason about due to the monotonicity.

How about the following alternative: if key is a tombstone, use its end key to decide a grandparent key > key where we must stop this sstable. If we have picked such a future key, use that to decide when to stop instead of the key provided in shouldStopBefore. We would actually remember two optional things, futureStopKey and tombstoneEndKey (both can be nil). We would stop <= futureStopKey (since the sst size constraint may make us stop before). And tombstoneEndKey would keep getting extended as we add more tombstones (since they may extend beyond the tombstones we have already added). When we stop this sstable, if we have a tombstoneEndKey that extends into the next sstable, use that to compute a future stop, and so on. This is very rough -- I think one can construct a tight version of this.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 2 of 10 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction.go, line 394 at r4 (raw file):

is the seenKey optimization that omitted a prefix of grandparent files that did not overlap at all been replaced by something that I don't spot here? I also realized now that the original optimization also applied to gaps between files since when returning true the key would become the starting point of the next file and we have already skipped over the files preceding that key.

It is not replaced here. My understanding is its purpose was to not count files at the head of grandparents towards c.overlappedBytes if they were completely before all keys written by the compaction. The consequence of dropping the seenKey optimization is the first compaction output file would be cut earlier than it needs to be if it does not overlap with some files at the grandparent head.

I think that is a good point about how the optimization implicitly applied to non-first compaction output files before my change.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 2 of 10 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction.go, line 394 at r4 (raw file):

Say we had a tombstone [a, z)#10 followed by y#9. And grandparents were [a#8, b#7], [c#8, d#7], [e#8, f#7], ...[x#8, y#7], [y#5, z#8], then when we encounter y#9 this code will pick y#7 as the key to return. Is my understanding correct?

It depends how many of the grandparent files are needed to reach c.maxOverlapBytes. If [a#8, b#7] alone is larger than c.maxOverlapBytes then it'll cut at b#7. Or it could cut at y#7 if adding [x#8, y#7]'s size to c.overlappedBytes causes it to cross over c.maxOverlapBytes.

How about the following alternative: if key is a tombstone, use its end key to decide a grandparent key > key where we must stop this sstable. If we have picked such a future key use that to decide when to stop instead of the key provided in shouldStopBefore.

I think it's a good idea. In fact it is a nice hybrid of two of our original plans (#167 (comment) and #244 (review)), neither of which worked out well independently.

@ajkr
Contributor Author

ajkr commented Nov 25, 2019

@petermattis you mentioned in our 1:1 that some problematic case was discovered in this PR, but I can't seem to think of what it is. Do you mind elaborating?

edit: maybe it was one of the much earlier resolved points? Or perhaps there is some misunderstanding about the seenKey optimization and that dropping it only causes our accounting to start earlier, not the actual file boundary.

@petermattis
Collaborator

@petermattis you mentioned in our 1:1 that some problematic case was discovered in this PR, but I can't seem to think of what it is. Do you mind elaborating?

Must have been some sort of miscommunication. I don't recall saying that and don't recall there being any problematic case with this PR.

@ajkr
Contributor Author

ajkr commented Nov 26, 2019

Must have been some sort of miscommunication. I don't recall saying that and don't recall there being any problematic case with this PR.

Got it, thanks for the clarification.

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch from 6a18cc6 to a2f1ba3 Compare December 2, 2019 01:40
Contributor Author

@ajkr ajkr left a comment

RFAL. In case it helps, I split out two of the precursor refactorings into #421 and #422. If those are reviewed first, then I can rebase this on top and the non-test code change becomes very small. Will leave it up to the reviewer whether to take that approach or just review everything at once in this PR.

Reviewable status: 0 of 8 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch from a2f1ba3 to fc71424 Compare December 2, 2019 20:26
Collaborator

@petermattis petermattis left a comment

I reviewed the other PRs. Please merge them and rebase this one once those are merged and I'll take another look.

Reviewable status: 0 of 5 files reviewed, 15 unresolved discussions (waiting on @petermattis and @sumeerbhola)

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch from fc71424 to 7736852 Compare December 2, 2019 22:15
Contributor Author

@ajkr ajkr left a comment

Rebased. I see the expected changes in GitHub but not in Reviewable. Might be user error.

Reviewable status: 0 of 6 files reviewed, 16 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction.go, line 394 at r4 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Say we had a tombstone [a, z)#10 followed by y#9. And grandparents were [a#8, b#7], [c#8, d#7], [e#8, f#7], ...[x#8, y#7], [y#5, z#8], then when we encounter y#9 this code will pick y#7 as the key to return. Is my understanding correct?

It depends how many of the grandparent files are needed to reach c.maxOverlapBytes. If [a#8, b#7] alone is larger than c.maxOverlapBytes then it'll cut at b#7. Or it could cut at y#7 if adding [x#8, y#7]'s size to c.overlappedBytes causes it to cross over c.maxOverlapBytes.

How about the following alternative: if key is a tombstone, use its end key to decide a grandparent key > key where we must stop this sstable. If we have picked such a future key use that to decide when to stop instead of the key provided in shouldStopBefore.

I think it's a good idea. In fact it is a nice hybrid of two of our original plans (#167 (comment) and #244 (review)), neither of which worked out well independently.

Done. The optimization is brought back in findGrandparentLimit(). Choosing the limit also now happens before compactionIter moves past it, as we now call findGrandparentLimit() just once before starting a file.


compaction.go, line 398 at r5 (raw file):

	for upper := lower; upper < len(c.grandparents); upper++ {
		overlappedBytes += c.grandparents[upper].Size
		if overlappedBytes > c.maxOverlapBytes && c.cmp(start, c.grandparents[upper].Largest.UserKey) < 0 {

This is a hack to ensure forward progress is made. Otherwise it gets into an infinite loop when a file contains only range tombstones and the limit is the same as the start key. The drawback is valid cases of limiting to the start key are affected (e.g., the change made in testdata/compaction_find_grandparent_limit is a casualty of this hack).


pacer.go, line 118 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Yes, it is called in the main compaction loop, which now sees keys for range tombstones. When we have only seen range tombstones, c.bytesIterated (the argument passed to maybeThrottle()) is zero.

I do not know why Reviewable is showing this change as part of the full diff even after rebasing. It already landed in #421.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewed 1 of 8 files at r5.
Reviewable status: 1 of 6 files reviewed, 19 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)


compaction.go, line 1300 at r5 (raw file):

			grandparentLimit = c.findGrandparentLimit(ve.NewFiles[n-1].Meta.Largest.UserKey)
		}

Will this work out cleanly if the grandparent has a sequence of files that start and end with the same user key (say [a#5500, a#5000], [a#4999, a#4500], [a#4500, a#4000], ...) and these files are hefty enough that the grandparent limit keeps returning a (because the files being compacted also have a's)? We don't care about actually chopping at a since these are going to be one compaction group anyway; I'm just wondering whether the algorithm will have some odd behavior like an empty output file. I suspect it is fine because of the c.cmp(...) <= 0, but maybe a code comment would help.


compaction.go, line 1326 at r5 (raw file):

				return nil, pendingOutputs, err
			}
			grandparentLimit = c.findGrandparentLimit(key.UserKey)

Why is there a call to findGrandparentLimit() here, given that the limit has already been initialized for this file?


compaction.go, line 1342 at r5 (raw file):

		} else {
			limit = grandparentLimit
		}
if key != nil && (grandparentLimit == nil || c.cmp(key.UserKey, grandparentLimit) <= 0) {
	// Size limit reached.
	limit = key.UserKey
} else {
	// Ran out of keys or grandparent limit reached.
	if grandparentLimit != nil {
		// This may possibly flush everything.
		limit = grandparentLimit
	} else {
		// No grandparentLimit. Flush everything.
	}
}

Not sure if the above is simpler, but some comments in the code would definitely help.

Contributor Author

@ajkr ajkr left a comment

Reviewable status: 1 of 6 files reviewed, 19 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction.go, line 1300 at r5 (raw file):

Previously, sumeerbhola wrote…

Will this work out cleanly if the grandparent level has a sequence of files that start and end with the same user key (say [a#5500, a#5000], [a#4999, a#4500], [a#4500, a#4000], ...) and these files are hefty enough that the grandparent limit keeps returning a (because the files being compacted also have a's)? We don't care about actually chopping at a since these will end up in one compaction group anyway, so I'm just wondering whether the algorithm might have some odd behavior like an empty output file. I suspect it is fine because of the c.cmp(...) <= 0, but maybe a code comment would help.

The <= here would be sufficient to prevent the case you're talking about if we were cutting only based on point keys. But imagine findGrandparentLimit("a") == "a", and we have the following data to write: DELRANGE("a", "b") and PUT("b", 1). Then we would execute the outer loop forever, generating files containing only the DELRANGE, as the inner loop would never execute after advancing to "b".

The solution was to change findGrandparentLimit() to always return a user key strictly greater than the start key passed to it. I will add an explanatory comment here.
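
To make the forward-progress argument concrete, here is a runnable toy version of the outer/inner loop shape being discussed. The iterator, the write step, and the stand-in limit function are all illustrative, not Pebble's code; the point is only that a limit strictly greater than the start key forces each output file to consume at least one key.

package main

import "fmt"

// toyIter is a stand-in for the compaction iterator, yielding user keys.
type toyIter struct {
	keys []string
	i    int
}

func (it *toyIter) First() *string { it.i = 0; return it.cur() }
func (it *toyIter) Next() *string  { it.i++; return it.cur() }
func (it *toyIter) cur() *string {
	if it.i >= len(it.keys) {
		return nil
	}
	return &it.keys[it.i]
}

// limitAfter is a stand-in for findGrandparentLimit that always returns a
// key strictly greater than start (here simply start + "x"), mimicking the
// fix. If it could return start itself, the inner loop below would break
// immediately without consuming a key and the outer loop would spin forever
// on the same output file.
func limitAfter(start string) string { return start + "x" }

func main() {
	iter := &toyIter{keys: []string{"a", "b"}}
	for key := iter.First(); key != nil; {
		limit := limitAfter(*key) // chosen once, before starting the output file
		for ; key != nil; key = iter.Next() {
			if *key >= limit {
				break // cut the output file at the grandparent boundary
			}
			fmt.Printf("write %q to current output file\n", *key)
		}
		// finishOutput(limit) would go here: flush range tombstone
		// fragments truncated at limit.
		fmt.Printf("finish output file at limit %q\n", limit)
	}
}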


compaction.go, line 1326 at r5 (raw file):

Previously, sumeerbhola wrote…

Why is there a call to findGrandparentLimit() here, given that the limit has already been initialized for this file?

Oops, good catch. Will remove.


compaction.go, line 1342 at r5 (raw file):

Previously, sumeerbhola wrote…
if key != nil && (grandparentLimit == nil || c.cmp(key.UserKey, grandparentLimit) <= 0) {
	// Size limit reached.
	limit = key.UserKey
} else {
	// Ran out of keys or grandparent limit reached.
	if grandparentLimit != nil {
		// This may possibly flush everything.
		limit = grandparentLimit
	} else {
		// No grandparentLimit. Flush everything.
	}
}

Not sure if the above is simpler, but some comments in the code would definitely help.

Yeah this can be written in quite a lot of different ways. Will comment at least.

@ajkr ajkr force-pushed the cut-sstable-within-range-tombstone branch from 7736852 to 303f51e on December 3, 2019 at 16:26
Contributor Author

@ajkr ajkr left a comment

TFTR! RFAL.

Reviewable status: 0 of 6 files reviewed, 18 unresolved discussions (waiting on @petermattis and @sumeerbhola)


compaction.go, line 1300 at r5 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

The <= here would be sufficient to prevent the case you're talking about if we were cutting only based on point keys. But imagine findGrandparentLimit("a") == "a", and we have the following data to write: DELRANGE("a", "b") and PUT("b", 1). Then we would execute the outer loop forever, generating files containing only the DELRANGE, as the inner loop would never execute after advancing to "b".

The solution was to change findGrandparentLimit() to always return a user key strictly greater than the start key passed to it. I will add an explanatory comment here.

Done (added comments around findGrandparentLimit()).


compaction.go, line 1326 at r5 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Oops, good catch. Will remove.

Done.


compaction.go, line 1342 at r5 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Yeah this can be written in quite a lot of different ways. Will comment at least.

Done (took your suggestion).

Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 6 of 8 files at r5, 1 of 1 files at r6.
Reviewable status: all files reviewed, 19 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)


compaction.go, line 1296 at r6 (raw file):

	// to a grandparent file largest key, or nil. Taken together, these
	// progress guarantees ensure that eventually the input iterator will be
	// exhausted and the range tombstone fragments will all be flushed.

These loop comments are very helpful!


internal/rangedel/fragmenter.go, line 254 at r2 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

It was flushing pending range tombstones that were fully before the point key being checked for deletion. But that point key could turn out to be (a) not deleted, and (b) beyond the truncation point according to nextLimitBefore(). If nextLimitBefore() returned a truncation point that was not just before this point key but also before the end of earlier range tombstones, we need to be able to truncate them to that point. We can't do that if they've already been flushed.

Does this change need a unit test specifically for this case, or is it covered by the manual_compaction test?

Contributor Author

@ajkr ajkr left a comment

TFTR!

Reviewable status: all files reviewed, 19 unresolved discussions (waiting on @ajkr, @petermattis, and @sumeerbhola)


internal/rangedel/fragmenter.go, line 254 at r2 (raw file):

Previously, sumeerbhola wrote…

Does this change need a unit test specifically for this case, or is it covered by the manual_compaction test?

Without this change, the test beginning at line 140 in "testdata/manual_compaction" fails with this message:

$ go test ./ -run TestManualCompaction -count 1
--- FAIL: TestManualCompaction (0.01s)
    compaction_test.go:728: 
        testdata/manual_compaction:165: 
        expected:
        2:
          9:[a#3,SET-b#72057594037927935,RANGEDEL]
          10:[b#0,RANGEDEL-d#72057594037927935,RANGEDEL]
          11:[d#0,RANGEDEL-f#72057594037927935,RANGEDEL]
          12:[f#0,RANGEDEL-h#3,SET]
        3:
          6:[a#0,SET-b#0,SET]
          7:[c#0,SET-d#0,SET]
          8:[e#0,SET-f#1,SET]
        
        found:
        2:
          9:[a#3,SET-b#72057594037927935,RANGEDEL]
          10:[h#3,SET-h#3,SET]
        3:
          6:[a#0,SET-b#0,SET]
          7:[c#0,SET-d#0,SET]
          8:[e#0,SET-f#1,SET]
FAIL

I guess the SET(h) is reached by compactionIter while writing the first output file (9.sst), causing it to flush the tombstone DELRANGE(a, g). That tombstone gets cut at b due to grandparent overlap, and later output files do not include it since it has already been flushed. So keys in the range [b, g) that should have been deleted reappear once the compaction finishes.
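
A toy illustration of the truncation the fix relies on (this is not the rangedel fragmenter's API, just a hypothetical helper): a tombstone that spans an output-file boundary must be cut at the limit, with the remainder carried into later files; flushing it in full up front is exactly what loses the [b, g) piece above.

package main

import (
	"bytes"
	"fmt"
)

// tombstone is a toy stand-in for a range deletion over [Start, End).
type tombstone struct {
	Start, End []byte
}

// splitAt cuts t at an output-file limit assumed to lie in (Start, End]:
// cur is the piece written to the current sstable, rest is the remainder
// that must be re-emitted in later output files.
func splitAt(t tombstone, limit []byte) (cur, rest tombstone, hasRest bool) {
	if bytes.Compare(limit, t.End) >= 0 {
		return t, tombstone{}, false // fits entirely in the current file
	}
	cur = tombstone{Start: t.Start, End: limit}
	rest = tombstone{Start: limit, End: t.End}
	return cur, rest, true
}

func main() {
	// DELRANGE(a, g) cut at grandparent limit b: [a, b) goes into 9.sst and
	// [b, g) must be carried into the next output file.
	cur, rest, ok := splitAt(tombstone{[]byte("a"), []byte("g")}, []byte("b"))
	fmt.Printf("current file: [%s, %s)\n", cur.Start, cur.End)
	if ok {
		fmt.Printf("carried over: [%s, %s)\n", rest.Start, rest.End)
	}
}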
