
Implement SingleDelete #249

Merged
merged 1 commit into cockroachdb:master from hueypark:single-delete on Sep 6, 2019

Conversation

hueypark (Contributor) commented Aug 28, 2019

Informs #222.

In Pebble, when a SingleDelete meets a Delete or Merge during compaction, it is converted to a Delete. But since SingleDelete is only suitable for special workloads, the code comments do not mention this. If we documented it, someone might try to rely on that unusual behavior.
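
For context, here is a minimal sketch of the write-once/delete-once pattern SingleDelete targets. The import path, the Open signature, and the pebble.Sync write option are assumptions based on the API as it existed around this PR, not verbatim from the change:

package main

import "github.com/petermattis/pebble" // the repo moved to cockroachdb/pebble shortly after this PR

func main() {
	d, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		panic(err)
	}
	defer d.Close()

	key := []byte("queue/item/1")
	// Write the key exactly once...
	if err := d.Set(key, []byte("payload"), pebble.Sync); err != nil {
		panic(err)
	}
	// ...and delete it exactly once. When the SingleDelete and the Set meet
	// in a compaction they annihilate each other, so neither record is
	// written to the compaction output.
	if err := d.SingleDelete(key, pebble.Sync); err != nil {
		panic(err)
	}
}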


petermattis (Collaborator) commented Aug 28, 2019

@hueypark Thanks for the contribution. I'm on vacation this week and won't be able to review until I return (next week).

ajkr (Collaborator) left a comment

This looks great! Thanks for your contribution.

Reviewed 11 of 11 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @hueypark)


compaction_iter.go, line 417 at r1 (raw file):

		case InternalKeyKindDelete:
			// We've hit Delete. SingleDelete transform to Delete.
			i.key.SetKind(InternalKeyKindDelete)

Makes sense.


compaction_iter.go, line 436 at r1 (raw file):

			// point and then skip entries until the next snapshot stripe.
			i.skip = true
			return &i.key, i.value

If there's a SingleDelete followed by a range deletion tombstone, will this cause the SingleDelete to be output? Should we set i.valid = false?

ajkr (Collaborator) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @hueypark)


compaction_iter.go, line 436 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

If there's a SingleDelete followed by a range deletion tombstone, will this cause the SingleDelete to be output? Should we set i.valid = false?

This might also require changing the caller of singleDeleteNext() to skip the rest of the stripe if !i.valid && i.skip.
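
A hypothetical sketch of the caller-side handling being suggested here; the surrounding switch and the skipStripe call are assumptions from the review context, not quoted from the PR:

case InternalKeyKindSingleDelete:
	key, val := i.singleDeleteNext()
	if !i.valid && i.skip {
		// The SingleDelete was cancelled out; also skip the rest of the
		// records it covers in this snapshot stripe.
		i.skipStripe()
		continue
	}
	return key, val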

hueypark (Contributor, Author) left a comment

@ajkr Thanks for your review.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 436 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

This might also require changing the caller of singleDeleteNext() to skip the rest of the stripe if !i.valid && i.skip.

I do not know what case you are concerned about. Can you share a test? Please share it in the format of the following example (as in testdata/compaction_iter).

define
a.RANGEDEL.3:c
b.SINGLEDEL.2:
a.SET.1:val
----

iter
first
tombstones
----
.
a-c#3
.

hueypark (Contributor, Author) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ajkr)


compaction_iter.go, line 417 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Makes sense.

Thanks.

ajkr (Collaborator) left a comment

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @hueypark)


compaction_iter.go, line 436 at r1 (raw file):

Previously, hueypark (Jaewan Park) wrote…

I do not know what case you are concerned about. Can you share a test? Please share it in the format of the following example (as in testdata/compaction_iter).

define
a.RANGEDEL.3:c
b.SINGLEDEL.2:
a.SET.1:val
----

iter
first
tombstones
----
.
a-c#3
.

Oops, I meant to say the SingleDelete is newer than the range tombstone.

Test:

define
a.SINGLEDEL.3:
a.RANGEDEL.2:c
a.SET.1:val
----

iter
first
tombstones
----
.
a-c#2
.

According to the current API requirement (SingleDelete covers exactly one Set), this case can only happen under misuse of API, so it's not a big deal.

While rereading the code I thought of a more important test case to investigate:

define
a.SINGLEDEL.4:
a.SET.3:val
a.SINGLEDEL.2:
a.SET.1:val
----

iter snapshots=2
first
next
----
a#2,7:
a#1,1:val
.

I think it is important to preserve the a#2 SingleDelete otherwise the a#1 Set can reappear to readers at snapshots newer than seqnum 2.

hueypark force-pushed the hueypark:single-delete branch from f50536d to c05c553 on Aug 31, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 9 of 11 files reviewed, 1 unresolved discussion (waiting on @ajkr)


compaction_iter.go, line 436 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Oops, I meant to say the SingleDelete is newer than the range tombstone.

Test:

define
a.SINGLEDEL.3:
a.RANGEDEL.2:c
a.SET.1:val
----

iter
first
tombstones
----
.
a-c#2
.

According to the current API requirement (SingleDelete covers exactly one Set), this case can only happen under misuse of API, so it's not a big deal.

While rereading the code I thought of a more important test case to investigate:

define
a.SINGLEDEL.4:
a.SET.3:val
a.DEL.2:
a.SET.1:val
----

iter snapshots=2
first
next
next
----
a#2,0:
a#1,1:val
.

I think it is important to preserve the a#2 Delete otherwise the a#1 Set can reappear after compaction.

Thank you for your detailed description.
I added relevant tests and improved the singleDeleteNext implementation.

petermattis (Collaborator) left a comment

Thanks for the contribution! Generally looks good. I have a number of comments about the comments, a few coding nits, and one big missing area of testing.

Reviewable status: 9 of 11 files reviewed, 11 unresolved discussions (waiting on @ajkr and @hueypark)


batch.go, line 529 at r1 (raw file):

}

// SingleDelete adds an action to the batch that single deletes the entry for key.

Let's add another sentence here linking back to the comment for Writer.SingleDelete. Something like: "See Writer.SingleDelete for more details on the semantics of SingleDelete".


compaction_iter.go, line 416 at r1 (raw file):

		switch key.Kind() {
		case InternalKeyKindDelete:
			// We've hit Delete. SingleDelete transform to Delete.

Let's reword this comment slightly: We've hit a Delete, transform the SingleDelete into a full Delete.


compaction_iter.go, line 425 at r1 (raw file):

		case InternalKeyKindMerge:
			// We've hit Merge. SingleDelete transform to Delete.

Let's reword this comment slightly: We've hit a Merge, transform the SingleDelete into a full Delete.


compaction_iter.go, line 436 at r1 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Thank you for your detailed description.
I added relevant tests and improved the singleDeleteNext implementation.

For the Set(), RangeDel(), SingleDel() sequence, I think we'd want the SingleDel to still combine with the Set and cause both to be elided from the compaction output. That might be difficult to accomplish, though, and this may be a rare case.


db.go, line 94 at r1 (raw file):

	// SingleDelete deletes the value for the given key. Requires that the key
	// exist and was not overwritten. This feature is performance optimization
	// for a very specific workload.

This comment should be expanded to describe the semantics of SingleDelete and contrast how it differs from Delete. I'd include some of the language from the issue which describes SingleDelete as anti-matter in contrast with normal Delete which acts as a tombstone. Also worthwhile to call out the behavior of SingleDelete in the presence of Merges and Deletes.

I'd like to see some big warning language about the potential problems from incorrect usage. For example, Set("a"), Set("a"), SingleDelete("a") can cause key "a" to temporarily be deleted and then to reappear at a later time after compactions have been performed. The behavior is non-deterministic, and affected by the ordering of the compactions.

Here is what I have in mind:

SingleDelete is similar to Delete in that it deletes the value for the given key. Like Delete, 
it is a blind operation that will succeed even if the given key does not exist.

WARNING: Undefined (non-deterministic) behavior will result if a key is overwritten and
then deleted using SingleDelete. The record may appear deleted immediately, but be
resurrected at a later time after compactions have been performed. Or the record may
be deleted permanently. A Delete operation lays down a "tombstone" which shadows all 
previous versions of a key. The SingleDelete operation is akin to "anti-matter" and will 
only delete the most recently written version for a key. These different semantics allow 
the DB to avoid propagating a SingleDelete operation during a compaction as soon as the 
corresponding Set operation is encountered. However, the semantics come with 
non-deterministic behavior as the series of operations `Set, Set, SingleDelete` can be 
processed during compactions as either `(Set, Set) + (SingleDelete)` or `(Set) + (Set, SingleDelete)`.
These semantics require extreme care to handle properly. Only use if you have a workload
where the performance gain is critical and you can guarantee that a record is written once
and then deleted once.

SingleDelete is internally transformed into a Delete if the most recent record for a key is either
a Merge or Delete record.

It is safe to modify the contents of the arguments after SingleDelete returns.
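
To make the warning concrete, a hedged illustration of the overwrite misuse; demoMisuse is a hypothetical helper and only the operation names come from this PR:

// demoMisuse shows why SingleDelete after an overwrite is non-deterministic.
// d is an open *pebble.DB; k, v1, v2 are arbitrary byte slices.
func demoMisuse(d *pebble.DB, k, v1, v2 []byte) {
	// Correct usage: exactly one Set, then one SingleDelete. The two
	// records elide each other when they meet in a compaction.
	_ = d.Set(k, v1, nil)
	_ = d.SingleDelete(k, nil)

	// Misuse: the key is overwritten before the SingleDelete. Depending on
	// how compactions group the records this is processed as either
	//   (Set v1, Set v2) + (SingleDelete) -> k stays deleted, or
	//   (Set v1) + (Set v2, SingleDelete) -> k resurrects with v1.
	_ = d.Set(k, v1, nil)
	_ = d.Set(k, v2, nil)
	_ = d.SingleDelete(k, nil)
}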

db_test.go, line 572 at r2 (raw file):

	verifyGetNotFound(t, d, key)
	verifyGetNotFound(t, d, key2)

These tests are a bit hard to follow and to extend to more cases. I think they are ok as high-level tests, but I think you should also add test cases to testdata/iterator to test all of the cases for SINGLEDEL just as you've done for testdata/compaction_iter.


iterator.go, line 66 at r1 (raw file):

			continue

		case InternalKeyKindSingleDelete:

This is the same behavior as for InternalKeyKindDelete. Can we use the same case-statement:

case InternalKeyKindDelete, InternalKeyKindSingleDelete:
  ...

iterator.go, line 140 at r1 (raw file):

			continue

		case InternalKeyKindSingleDelete:

See comment above about combining this with the InternalKeyKindDelete case.


iterator.go, line 244 at r1 (raw file):

			return true

		case InternalKeyKindSingleDelete:

See comment above about combining this with the InternalKeyKindDelete case.


mem_table.go, line 118 at r2 (raw file):

	case InternalKeyKindDelete:
		return nil, ErrNotFound
	case InternalKeyKindSingleDelete:

case InternalKeyKindDelete, InternalKeyKindSingleDelete:


testdata/compaction_iter, line 762 at r2 (raw file):

.
a-c#2
.

I'd like to see a test case for the undefined behavior:

a.SINGLEDEL.3:
a.SET.2:b
a.SET.1:c

I think the behavior should be similar to what Iterator does and to emit nothing.

ajkr (Collaborator) left a comment

Reviewable status: 9 of 11 files reviewed, 11 unresolved discussions (waiting on @ajkr, @hueypark, and @petermattis)


compaction_iter.go, line 436 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

For the Set(), RangeDel(), SingleDel() sequence, I think we'd want the SingleDel to still combine with the Set and cause both to be elided from the compaction output. That might be difficult to accomplish, though, and this may be a rare case.

@petermattis We should clarify whether it's a rare case or an API misuse. I think supporting this case properly implies we also need to support SingleDel() covering zero Set()s. For example the RangeDel() and Set() may be compacted together and dropped in a compaction to bottommost level, while the SingleDel() is still in an upper level. Then we'd be left with a SingleDel() covering nothing.

I think there are extra complexities around the corner if we go in this direction. It could make sense if we need SingleDel() to support covering zero or one Set()s, whereas today it is required to cover exactly one Set() (with no intervening deletes, range deletes, etc.). But then we'll need to add logic for eliding SingleDel()s during compaction to bottom level, remove the logic for transforming SingleDel() into regular tombstone when it meets a merge or tombstone (because we might be writing that regular tombstone to the bottom level and then it'd be stuck there indefinitely), and perhaps more.

hueypark force-pushed the hueypark:single-delete branch from c05c553 to ba4587b on Sep 1, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 5 of 12 files reviewed, 11 unresolved discussions (waiting on @ajkr and @petermattis)


batch.go, line 529 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's add another sentence here linking back to the comment for Writer.SingleDelete. Something like: "See Writer.SingleDelete for more details on the semantics of SingleDelete".

Done.


compaction_iter.go, line 416 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's reword this comment slightly: We've hit a Delete, transform the SingleDelete into a full Delete.

Done.


compaction_iter.go, line 425 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's reword this comment slightly: We've hit a Merge, transform the SingleDelete into a full Delete.

Done.


compaction_iter.go, line 436 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

@petermattis We should clarify whether it's a rare case or an API misuse. I think supporting this case properly implies we also need to support SingleDel() covering zero Set()s. For example the RangeDel() and Set() may be compacted together and dropped in a compaction to bottommost level, while the SingleDel() is still in an upper level. Then we'd be left with a SingleDel() covering nothing.

I think there are extra complexities around the corner if we go in this direction. It could make sense if we need SingleDel() to support covering zero or one Set()s, whereas today it is required to cover exactly one Set() (with no intervening deletes, range deletes, etc.). But then we'll need to add logic for eliding SingleDel()s during compaction to bottom level, remove the logic for transforming SingleDel() into regular tombstone when it meets a merge or tombstone (because we might be writing that regular tombstone to the bottom level and then it'd be stuck there indefinitely), and perhaps more.

There is no need to consider this issue. Since the RangeDel() already elides the Set(), it passes the following test:

define
a.SINGLEDEL.3:
a.RANGEDEL.2:c
a.SET.1:val
----

iter
first
tombstones
----
.
a-c#2
.

db.go, line 94 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote… (the expanded SingleDelete doc comment, quoted in full above)

Thank you for your detailed explanation. I only removed the part about Set, Set, SingleDelete.


db_test.go, line 572 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

These tests are a bit hard to follow and to extend to more cases. I think they are ok as high-level tests, but I think you should also add test cases to testdata/iterator to test all of the cases for SINGLEDEL just as you've done for testdata/compaction_iter.

Done.


iterator.go, line 66 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is the same behavior as for InternalKeyKindDelete. Can we use the same case-statement:

case InternalKeyKindDelete, InternalKeyKindSingleDelete:
  ...

Done.


iterator.go, line 140 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See comment above about combining this with the InternalKeyKindDelete case.

Done.


iterator.go, line 244 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See comment above about combining this with the InternalKeyKindDelete case.

Done.


mem_table.go, line 118 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

case InternalKeyKindDelete, InternalKeyKindSingleDelete:

Done.


testdata/compaction_iter, line 762 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'd like to see a test case for the undefined behavior:

a.SINGLEDEL.3:
a.SET.2:b
a.SET.1:c

I think the behavior should be similar to what Iterator does and to emit nothing.

At first I implemented it as emitting nothing, but after thinking about it, I decided to remove only the one SET immediately before the SINGLEDEL. The reasons are as follows:

  1. Processing multiple SETs in front of a SINGLEDEL is not an essential requirement for SINGLEDEL, which is designed for specific use cases.

  2. If we make SINGLEDEL swallow all the SETs before it, singleDeleteNext must traverse all the entries (not just one) in front of it, complicating the handling of other edge cases.

petermattis (Collaborator) left a comment

Reviewable status: 5 of 12 files reviewed, 5 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 436 at r1 (raw file):
Ah, got it. I was missing something here: the RANGEDEL causes the SET to be elided, and also the SINGLEDEL to be elided. LGTM.

Previously, ajkr (Andrew Kryczka) wrote…

@petermattis We should clarify whether it's a rare case or an API misuse. I think supporting this case properly implies we also need to support SingleDel() covering zero Set()s. For example the RangeDel() and Set() may be compacted together and dropped in a compaction to bottommost level, while the SingleDel() is still in an upper level. Then we'd be left with a SingleDel() covering nothing.

I think there are extra complexities around the corner if we go in this direction. It could make sense if we need SingleDel() to support covering zero or one Set()s, whereas today it is required to cover exactly one Set() (with no intervening deletes, range deletes, etc.). But then we'll need to add logic for eliding SingleDel()s during compaction to bottom level, remove the logic for transforming SingleDel() into regular tombstone when it meets a merge or tombstone (because we might be writing that regular tombstone to the bottom level and then it'd be stuck there indefinitely), and perhaps more.

I was presuming that a bottom-level SINGLEDEL would be handled the same as a bottom-level DEL: we'd elide it. I've added a comment above where this elision needs to take place.

Requiring SINGLEDEL to cover exactly one SET and misbehaving badly when that requirement is violated (i.e. accumulating SINGLEDEL records at the bottom level) feels like poor API design. Yeah, we're requiring a lot from the user to use SINGLEDEL correctly, but we might as well prevent more badness when possible. Not allowing a RANGEDEL to lie between a SINGLEDEL and a SET further burdens the user. Perhaps I'm misjudging how much effort this will be.


compaction_iter.go, line 210 at r3 (raw file):

		i.key = *i.iterKey
		switch i.key.Kind() {
		case InternalKeyKindDelete:

I just noticed that we could be calling i.rangeDelFrag.Deleted(...) here to determine if the record is covered by a range deletion tombstone. Interestingly, adding that block of code doesn't change any of the tests, which makes me suspect we're missing some test coverage here. I'll see about adding this in a separate PR.


compaction_iter.go, line 225 at r3 (raw file):

			return &i.key, i.value

		case InternalKeyKindSingleDelete:

We need the block of code above for eliding deletion tombstones:

// If we're at the last snapshot stripe and the tombstone can be elided
// skip to the next stripe (which will be the next user key).
if i.curSnapshotIdx == 0 && i.elideTombstone(i.key.UserKey) {
	i.saveKey()
	i.skipStripe()
	continue
}

db.go, line 351 at r3 (raw file):

}

// SingleDelete is part of Writer.

Nit: have this comment mirror the structure of the other operation comments (e.g. Set, Delete, etc). I think you can pull the comment directly from Batch.SingleDelete.


testdata/compaction_iter, line 762 at r2 (raw file):

Previously, hueypark (Jaewan Park) wrote…

At first I implemented it as emitting nothing, but after thinking about it, I decided to remove only the one SET immediately before the SINGLEDEL. The reasons are as follows:

  1. Processing multiple SETs in front of a SINGLEDEL is not an essential requirement for SINGLEDEL, which is designed for specific use cases.

  2. If we make SINGLEDEL swallow all the SETs before it, singleDeleteNext must traverse all the entries (not just one) in front of it, complicating the handling of other edge cases.

Ok, I can see why having SINGLEDEL only elide an adjacent SET makes the implementation simpler.

I'm not seeing where you added this test case, though. Did I miss it?

@ajkr I wonder if we should log a fatal error if we discover a SINGLEDEL covering 2 SETS for the same key.


testdata/iterator, line 1313 at r3 (raw file):

first
----
.

Please add test cases demonstrating the behavior of SINGLEDEL when it shadows multiple SET operations for a given key. Ditto for multiple MERGE operations. In both cases, I believe SINGLEDEL should behave like DEL.
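
As a sketch, in the compaction_iter testdata format used elsewhere in this PR, the multiple-MERGE case would be expected to come out as follows (the a#3,0: output assumes the SINGLEDEL-to-DEL transform plus skip behavior described above):

define
a.SINGLEDEL.3:
a.MERGE.2:b
a.MERGE.1:a
----

iter
first
next
----
a#3,0:
.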

petermattis (Collaborator) left a comment

Reviewable status: 5 of 12 files reviewed, 5 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 210 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I just noticed that we could be calling i.rangeDelFrag.Deleted(...) here to determine if the record is covered by a range deletion tombstone. Interestingly, adding that block of code doesn't change any of the tests, which makes me suspect we're missing some test coverage here. I'll see about adding this in a separate PR.

See #252

hueypark force-pushed the hueypark:single-delete branch from ba4587b to 04ccff4 on Sep 1, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 5 of 12 files reviewed, 4 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 210 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See #252

That's cool! I removed the RangeDelete-related processing from SingleDelete.


compaction_iter.go, line 225 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We need the block of code above for eliding deletion tombstones:

// If we're at the last snapshot stripe and the tombstone can be elided
// skip to the next stripe (which will be the next user key).
if i.curSnapshotIdx == 0 && i.elideTombstone(i.key.UserKey) {
	i.saveKey()
	i.skipStripe()
	continue
}

Done.


db.go, line 351 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: have this comment mirror the structure of the other operation comments (e.g. Set, Delete, etc). I think you can pull the comment directly from Batch.SingleDelete.

Done.


testdata/compaction_iter, line 762 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok, I can see why having SINGLEDEL only elide an adjacent SET makes the implementation simpler.

I'm not seeing where you added this test case, though. Did I miss it?

@ajkr I wonder if we should log a fatal error if we discover a SINGLEDEL covering 2 SETS for the same key.

I added a related test.


testdata/iterator, line 1313 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Please add test cases demonstrating the behavior of SINGLEDEL when it shadows multiple SET operations for a given key. Ditto for multiple MERGE operations. In both cases, I believe SINGLEDEL should behave like DEL.

Done.

ajkr (Collaborator) commented Sep 2, 2019

Requiring SINGLEDEL to cover exactly one SET and misbehaving badly when that requirement is violated (i.e. accumulating SINGLEDEL records at the bottom level) feels like poor API design.

This doesn't make sense to me. The whole SingleDelete optimization is to drop a SINGLEDEL when it encounters the SET it covers. What if the SINGLEDEL covers two SETs? The outcome is non-deterministic (which IMO is a worse outcome than a tombstone stuck in the bottom level). What if the SINGLEDEL covers a MERGE or DELETE? According to this PR it transforms the SINGLEDEL into a DELETE even if it's writing to the bottom level, which is the same bad behavior you're asking to avoid.

Yeah, we're requiring the lot from the user to use SINGLEDEL correctly, but might as well prevent more badness when possible. Not allowing a RANGEDEL to lie between a SINGLEDEL and a SET further burdens the user. Perhaps I'm misjudging how much effort this will be.

I don't think I'm arguing for burdening the user. I'm arguing that if our requirement for SingleDelete is that it deletes "at most one SET" rather than "exactly one SET", we should clarify that and do all the work to make it happen. I do not understand how not half-supporting an explicitly unsupported use case (SINGLEDEL covering zero SETs) is "poor API design".

EDIT: Also just to clarify further, if we do not need support for SINGLEDEL covering zero SETs, I am not arguing for anything. It doesn't matter much to me whether the implementation has no/partial/full support for a use case that the API will continue to explicitly not support. We should consider whether general API misuse should cause the DB to go into read-only mode or crash, but that's separate from the question of whether SINGLEDEL covering zero SETs should be supported or not.

petermattis (Collaborator) commented Sep 2, 2019

Requiring SINGLEDEL to cover exactly one SET and misbehaving badly when that requirement is violated (i.e. accumulating SINGLEDEL records at the bottom level) feels like poor API design.

This doesn't make sense to me. The whole SingleDelete optimization is to drop a SINGLEDEL when it encounters the SET it covers. What if the SINGLEDEL covers two SETs? The outcome is non-deterministic (which IMO is a worse outcome than a tombstone stuck in the bottom level). What if the SINGLEDEL covers a MERGE or DELETE? According to this PR it transforms the SINGLEDEL into a DELETE even if it's writing to the bottom level, which is the same bad behavior you're asking to avoid.

Heh, I suppose referring to good API design anywhere near SINGLEDEL is problematic as the API is fundamentally fragile.

Let me take a step back. I'd like SINGLEDEL to provide reasonable behavior when possible. A SINGLEDEL covering two SETs is going to be non-deterministic, as that is fundamental to the SINGLEDEL operation. But what about a SINGLEDEL covering 0 records? Or one covering a MERGE or a DELETE? I think we can define reasonable semantics for all of these cases, and I think those semantics are not onerous to implement. I don't believe we (CRDB) need those semantics, but I think we (Pebble) should implement something to cover those cases, whether it be a fatal error, a panic, or something else.

The semantics I'm arguing for are that a SINGLEDEL acts like a DEL except that when it is compacted with a SET (in the same snapshot stripe) it is elided from the output. A SINGLEDEL at the bottom level of the tree should be elided. A SINGLEDEL covering a MERGE/DELETE should be transformed into a DEL. My intuition is that implementing these semantics will be as much work as implementing and testing any other semantics (including crashing).

I don't think I'm arguing for burdening the user. I'm arguing that if our requirement for SingleDelete is that it deletes "at most one SET" rather than "exactly one SET", we should clarify that and do all the work to make it happen.

Ack, I think we should clarify the semantics, and I think they should be "SingleDelete deletes at most one SET", though the phrasing on that is slightly problematic: SINGLEDEL might delete more than one SET (depending on the order of compactions). Perhaps: "SINGLEDEL can delete zero or one SET records. If there are two or more SET records (i.e. a record was overwritten), the behavior of SINGLEDEL is non-deterministic."

We should consider whether general API misuse should cause the DB to go into read-only mode or crash, but that's separate from the question of whether SINGLEDEL covering zero SETs should be supported or not.

When possible I think we should provide guard rails. Crashing is reasonable in some cases, or returning an error. For example, Pebble returns an error if you close a DB with open iterators as this is indicative of an iterator leak. And operations on a closed DB return an error. Unfortunately, with SINGLEDEL there is a misuse (SINGLEDEL covering 2 or more SETs) which we can't efficiently detect.

petermattis (Collaborator) left a comment

Reviewable status: 5 of 12 files reviewed, 6 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 412 at r4 (raw file):

	case InternalKeyKindSet:
		i.nextInStripe()

Rather than calling nextInStripe() which will skip over the SET, we could call skipStripe() which will skip over all of the remaining records in the stripe. This would make the operation of SINGLEDEL and DEL within a stripe nearly equivalent: they would shadow all of the remaining records in the stripe.


compaction_iter.go, line 413 at r4 (raw file):

	case InternalKeyKindSet:
		i.nextInStripe()
		return i.Next()

I wonder if we should manually collapse this tail recursion and return an extra parameter indicating to Next whether we have a valid record to return or whether processing should continue looking for the next record.
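
One possible shape for the collapsed version; this is a sketch of the suggested direction, with the boolean contract assumed rather than taken from the merged code:

// singleDeleteNext reports whether i.key/i.value now hold a record for the
// caller to return; when it reports false, the caller's loop continues
// scanning instead of recursing into Next.
func (i *compactionIter) singleDeleteNext() bool {
	// ... examine the next entry in the stripe and decide, as above ...
}

// Caller, inside compactionIter.Next's loop:
case InternalKeyKindSingleDelete:
	if i.singleDeleteNext() {
		return &i.key, i.value
	}
	continue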


compaction_iter.go, line 415 at r4 (raw file):

		return i.Next()

	case InternalKeyKindMerge:

We can merge this case with InternalKeyKindDelete.


compaction_iter.go, line 418 at r4 (raw file):

		// We've hit a Merge, transform the SingleDelete into a full Delete.
		i.key.SetKind(InternalKeyKindDelete)
		i.nextInStripe()

Rather than calling nextInStripe here, I think this should be setting i.skip = true.


testdata/compaction_iter, line 672 at r4 (raw file):

----
a#3,0:
a#1,2:a

Why did this second merge operation stick around? I would think it would be skipped due to the SINGLEDEL being transformed into a DEL.

hueypark force-pushed the hueypark:single-delete branch from 04ccff4 to 6f55045 on Sep 2, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 5 of 12 files reviewed, 6 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 412 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than calling nextInStripe() which will skip over the SET, we could call skipStripe() which will skip over all of the remaining records in the stripe. This would make the operation of SINGLEDEL and DEL within a stripe nearly equivalent: they would shadow all of the remaining records in the stripe.

Done.


compaction_iter.go, line 413 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if we should manually collapse this tail recursion and return an extra parameter indicating to Next whether we have a valid record to return or whether processing should continue looking for the next record.

Done.


compaction_iter.go, line 415 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We can merge this case with InternalKeyKindDelete.

Done.


compaction_iter.go, line 418 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than calling nextInStripe here, I think this should be setting i.skip = true.

Done.


testdata/compaction_iter, line 672 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why did this second merge operation stick around? I would think it would be skipped due to the SINGLEDEL being transformed into a DEL.

Done.

petermattis (Collaborator) left a comment

@hueypark I think this is nearly done, though @ajkr and I need to resolve our discussion about the semantics of SingleDelete.

Reviewable status: 5 of 12 files reviewed, 10 unresolved discussions (waiting on @ajkr, @hueypark, and @petermattis)


compaction_iter.go, line 228 at r5 (raw file):

			case InternalKeyKindSingleDelete:
				key, value := i.singleDeleteNext()

At this point, I wonder if it would be better to inline singleDeleteNext():

// For a SingleDelete, examine the next record in the stripe to determine the
// action to take.

// Save the current key.
i.saveKey()
i.value = i.iterValue
i.valid = true

if !i.nextInStripe() {
	i.skip = false
	return &i.key, i.value
}

switch i.iterKey.Kind() {
case InternalKeyKindDelete, InternalKeyKindMerge:
	// We've hit a Delete or Merge, transform the SingleDelete into a full Delete.
	i.key.SetKind(InternalKeyKindDelete)
	i.skip = true
	return &i.key, i.value

case InternalKeyKindSet:
	i.skipStripe()

case InternalKeyKindSingleDelete:
	i.nextInStripe()

case InternalKeyKindRangeDelete:

default:
	i.err = fmt.Errorf("invalid internal key kind: %d", i.iterKey.Kind())
	return nil, nil
}
continue

I mention this possibility because using i.valid to communicate between singleDeleteNext() and Next() is subtle.


compaction_iter.go, line 233 at r5 (raw file):

				}

				return i.Next()

Rather than calling i.Next() (recursively), can you continue here?


compaction_iter.go, line 423 at r5 (raw file):

	case InternalKeyKindSingleDelete:
		i.nextInStripe()
		return &i.key, i.value

Do we need to be preserving the SINGLEDEL record in this case? I think setting i.valid = false here would have better behavior. Consider SINGLEDEL.3, SINGLEDEL.2, SET.1. With the current code I think this will output SINGLEDEL.3, SET.1 which seems wrong. Better to elide the SINGLEDEL.3 so that the SINGLEDEL.2 and SET.1 can combine and be elided as well. Am I misunderstanding something here?


testdata/compaction_iter, line 569 at r5 (raw file):

----

iter

Can you also add test cases for iter elide-tombstones=true? This will test the elision of single delete tombstones at the bottom level of the LSM. You'll want to test the various cases where a SINGLEDEL record gets transformed into a DEL record.
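
A hedged sketch of one such variant; the "." (empty) output assumes the elision check runs before the SINGLEDEL-to-DEL transform, as clarified later in this thread:

define
a.SINGLEDEL.2:
a.MERGE.1:v
----

iter elide-tombstones=true
first
----
.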

hueypark force-pushed the hueypark:single-delete branch from 6f55045 to 9effe71 on Sep 3, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 5 of 12 files reviewed, 10 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 228 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

At this point, I wonder if it would be better to inline singleDeleteNext():

// For a SingleDelete, examine the next record in the stripe to determine the
// action to take.

// Save the current key.
i.saveKey()
i.value = i.iterValue
i.valid = true

if !i.nextInStripe() {
	i.skip = false
	return &i.key, i.value
}

switch i.iterKey.Kind() {
case InternalKeyKindDelete, InternalKeyKindMerge:
	// We've hit a Delete or Merge, transform the SingleDelete into a full Delete.
	i.key.SetKind(InternalKeyKindDelete)
	i.skip = true
	return &i.key, i.value

case InternalKeyKindSet:
	i.skipStripe()

case InternalKeyKindSingleDelete:
	i.nextInStripe()

case InternalKeyKindRangeDelete:

default:
	i.err = fmt.Errorf("invalid internal key kind: %d", i.iterKey.Kind())
	return nil, nil
}
continue

I mention this possibility because using i.valid to communicate between singleDeleteNext() and Next() is subtle.

I removed the subtle use of i.valid but left singleDeleteNext().


compaction_iter.go, line 233 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than calling i.Next() (recursively), can you continue here?

Done.


compaction_iter.go, line 423 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Do we need to be preserving the SINGLEDEL record in this case? I think setting i.valid = false here would have better behavior. Consider SINGLEDEL.3, SINGLEDEL.2, SET.1. With the current code I think this will output SINGLEDEL.3, SET.1 which seems wrong. Better to elide the SINGLEDEL.3 so that the SINGLEDEL.2 and SET.1 can combine and be elided as well. Am I misunderstanding something here?

Done. And I added a test for this.


testdata/compaction_iter, line 569 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you also add test cases for iter elide-tombstones=true? This will test the elision of single delete tombstones at the bottom level of the LSM. You'll want to test the various cases where a SINGLEDEL record gets transformed into a DEL record.

Done.

petermattis (Collaborator) left a comment

:lgtm:

Reviewable status: 5 of 12 files reviewed, 4 unresolved discussions (waiting on @ajkr and @hueypark)


db_test.go, line 888 at r6 (raw file):

	benchmark := func(useSingleDelete bool) {
		d, err := Open(

This benchmark includes the time for opening and closing the DB. You could pass in *testing.B to this function and call b.StopTimer() and b.StartTimer() to avoid including those regions in the benchmark timing.

I'm curious if this benchmark shows any difference between SingleDelete and Delete. My guess is no, and that this is primarily exercising the commit pipeline.
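
A sketch of the suggested timer handling; openTestDB is a hypothetical helper standing in for the Open call in the current benchmark:

benchmark := func(b *testing.B, useSingleDelete bool) {
	b.StopTimer()
	d := openTestDB(b) // hypothetical: open a fresh DB, calling b.Fatal on error
	b.StartTimer()

	// ... the timed Set plus Delete/SingleDelete work goes here ...

	b.StopTimer()
	if err := d.Close(); err != nil {
		b.Fatal(err)
	}
}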


testdata/compaction_iter, line 609 at r6 (raw file):

----

iter

Add an elide-tombstones=true variant of this test, to verify the DEL tombstone gets elided.


testdata/compaction_iter, line 621 at r6 (raw file):

----

iter

Add an elide-tombstones=true variant of this test, to verify the DEL tombstone gets elided.

ajkr (Collaborator) left a comment

A SINGLEDEL at the bottom level of the tree should be elided. SINGEDEL covering a MERGE/DELETE should be transformed into a DEL.

Got it. I didn't realize before that the elision check comes before the possible SINGLEDEL->DELETE transformation. This way, SINGLEDEL covering MERGE/DELETE cannot cause us to write out a DELETE that should be elided to the bottom level.

"SINGLEDEL can delete zero or one SET records. If there are two or more SET records (i.e. a record was overwritten), the behavior of SINGLEDEL is non-deterministic."

Yes, this sounds good to me.

Crashing is reasonable in some cases, or returning an error. For example, Pebble returns an error if you close a DB with open iterators as this is indicative of an iterator leak. And operations on a closed DB return an error. Unfortunately, with SINGLEDEL there is a misuse (SINGLEDEL covering 2 or more SETs) which we can't efficiently detect.

Maybe we can return an error if it's detected in the foreground and crash if it's detected in the background. In RocksDB, background errors would lead to read-only mode, which would lead to writers seeing errors in the foreground later; but I think we don't support dynamically changing to read-only mode yet.

Reviewable status: 5 of 12 files reviewed, 5 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 418 at r6 (raw file):

		case InternalKeyKindSet:
			i.skipStripe()

I don't think we should skip the rest of the stripe because it causes this test to fail:

define
a.SINGLEDEL.4:
a.SET.3:val
a.SINGLEDEL.2:
a.SET.1:val
----

iter snapshots=2
first
next
----
a#2,7:
a#1,1:val
.

testdata/compaction_iter, line 762 at r2 (raw file):

@ajkr I wonder if we should log a fatal error if we discover a SINGLEDEL covering 2 SETS for the same key.

Yes I like that. Keep the DB in the bad state until we can debug, rather than continue with corruption and possibly lose evidence of the corruption.

petermattis (Collaborator) left a comment

Reviewable status: 5 of 12 files reviewed, 5 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 418 at r6 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

I don't think we should skip the rest of the stripe because it causes this test to fail:

define
a.SINGLEDEL.4:
a.SET.3:val
a.SINGLEDEL.2:
a.SET.1:val
----

iter snapshots=2
first
next
----
a#2,7:
a#1,1:val
.

Good find. Yes, the stripe skipping is problematic. This needs to call nextInStripe instead to skip over the SET.

hueypark force-pushed the hueypark:single-delete branch from 9effe71 to bd12fab on Sep 3, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 4 of 12 files reviewed, 5 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 418 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Good find. Yes, the stripe skipping is problematic. This needs to call nextInStripe instead to skip over the SET.

Done.


db_test.go, line 888 at r6 (raw file):

I'm curious if this benchmark shows any difference between SingleDelete and Delete.

I corrected it to a meaningful benchmark.

My guess is no, and that this is primarily exercising the commit pipeline.

Can you explain in more detail? I do not exactly understand what "primarily exercising the commit pipeline" means.


testdata/compaction_iter, line 609 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Add an elide-tombstones=true variant of this test, to verify the DEL tombstone gets elided.

Done.


testdata/compaction_iter, line 621 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Add an elide-tombstones=true variant of this test, to verify the DEL tombstone gets elided.

Done.

petermattis (Collaborator) left a comment

Reviewable status: 4 of 12 files reviewed, 4 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 419 at r7 (raw file):

		case InternalKeyKindSet:
			i.nextInStripe()
			i.valid = false

Is setting i.valid = false still necessary here and below?


db_test.go, line 888 at r6 (raw file):

Can you explain in more detail? I do not exactly understand the meaning of primarily exercising the commit pipeline.

The performance of DB.Set, DB.Delete, and DB.SingleDelete will be dominated by the internal call to DB.Commit. The performance benefit of SingleDelete over Delete will only become apparent when compactions occur. If you're trying to show that benefit, I'd structure the benchmark differently. I'd probably do something targeted that creates a compactionIter on top of a data set of SingleDelete,Set pairs or Delete,Set pairs, and benchmarks the time for creating the corresponding sstable (using sstable.Writer). The SingleDelete,Set dataset should be much faster to compact because no data will be output. I can flesh this idea out a little more if you'd like, though I'm also fine with just removing this benchmark for now.

hueypark (Contributor, Author) left a comment

Reviewable status: 4 of 12 files reviewed, 4 unresolved discussions (waiting on @ajkr and @petermattis)


compaction_iter.go, line 419 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is setting i.valid = false still necessary here and below?

Yes, we need it: setting i.valid = false ignores the SET combined with the SINGLEDEL.
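
Spelled out with comments, the SET case being discussed reads roughly like this; a sketch of the described behavior, not a verbatim quote of the final code:

case InternalKeyKindSet:
	// The SINGLEDEL and the SET it covers annihilate: step past the SET and
	// mark the saved SINGLEDEL as invalid so that neither record is emitted
	// to the compaction output.
	i.nextInStripe()
	i.valid = false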


db_test.go, line 888 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you explain in more detail? I do not exactly understand what "primarily exercising the commit pipeline" means.

The performance of DB.Set, DB.Delete, and DB.SingleDelete will be dominated by the internal call to DB.Commit. The performance benefit of SingleDelete over Delete will only become apparent when compactions occur. If you're trying to show that benefit, I'd structure the benchmark differently. I'd probably do something targeted that creates a compactionIter on top of a data set of SingleDelete,Set pairs or Delete,Set pairs, and benchmarks the time for creating the corresponding sstable (using sstable.Writer). The SingleDelete,Set dataset should be much faster to compact because no data will be output. I can flesh this idea out a little more if you'd like, though I'm also fine with just removing this benchmark for now.

In the last commit (where maybeScheduleFlush() is called), there is a sufficient benchmark difference. In the future we may be able to create benchmarks that are better suited to CockroachDB, but at this point I'm satisfied.

Benchmark result:

BenchmarkDelete/delete-16         	       3	 449213904 ns/op
BenchmarkDelete/single-delete-16  	      20	 104266859 ns/op

petermattis (Collaborator) left a comment

Reviewable status: 4 of 12 files reviewed, 4 unresolved discussions (waiting on @ajkr and @hueypark)


compaction_iter.go, line 419 at r7 (raw file):

Previously, hueypark (Jaewan Park) wrote…

Yes, we need it: setting i.valid = false ignores the SET combined with the SINGLEDEL.

Ah, I was confused. Carry on.


db_test.go, line 888 at r6 (raw file):

Previously, hueypark (Jaewan Park) wrote…

In the last commit (where maybeScheduleFlush() is called), there is a sufficient benchmark difference. In the future we may be able to create benchmarks that are better suited to CockroachDB, but at this point I'm satisfied.

Benchmark result:

BenchmarkDelete/delete-16         	       3	 449213904 ns/op
BenchmarkDelete/single-delete-16  	      20	 104266859 ns/op

Ah, the flushing is a side-effect due to the size and number of values inserted. I think you should make this more explicit:

diff --git a/db_test.go b/db_test.go
index f1f6653..103b3d4 100644
--- a/db_test.go
+++ b/db_test.go
@@ -882,7 +882,7 @@ func BenchmarkDelete(b *testing.B) {
        for i := 0; i < keyCount; i++ {
                keys[i] = []byte(strconv.Itoa(rng.Int()))
        }
-       val := bytes.Repeat([]byte("x"), 1000)
+       val := bytes.Repeat([]byte("x"), 10)

        benchmark := func(b *testing.B, useSingleDelete bool) {
                d, err := Open(
@@ -908,6 +908,12 @@ func BenchmarkDelete(b *testing.B) {
                                d.Delete(key, nil)
                        }
                }
+               // Manually flush as it is flushing/compaction where SingleDelete
+               // performance shows up. With SingleDelete, we can elide all of the
+               // SingleDelete and Set records.
+               if err := d.Flush(); err != nil {
+                       b.Fatal(err)
+               }
                b.StopTimer()
        }
hueypark force-pushed the hueypark:single-delete branch from bd12fab to 3b18283 on Sep 5, 2019

hueypark (Contributor, Author) left a comment

Reviewable status: 4 of 12 files reviewed, 4 unresolved discussions (waiting on @ajkr, @hueypark, and @petermattis)


db_test.go, line 888 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ah, the flushing is a side-effect due to the size and number of values inserted. I think you should make this more explicit:

diff --git a/db_test.go b/db_test.go
index f1f6653..103b3d4 100644
--- a/db_test.go
+++ b/db_test.go
@@ -882,7 +882,7 @@ func BenchmarkDelete(b *testing.B) {
        for i := 0; i < keyCount; i++ {
                keys[i] = []byte(strconv.Itoa(rng.Int()))
        }
-       val := bytes.Repeat([]byte("x"), 1000)
+       val := bytes.Repeat([]byte("x"), 10)

        benchmark := func(b *testing.B, useSingleDelete bool) {
                d, err := Open(
@@ -908,6 +908,12 @@ func BenchmarkDelete(b *testing.B) {
                                d.Delete(key, nil)
                        }
                }
+               // Manually flush as it is flushing/compaction where SingleDelete
+               // performance shows up. With SingleDelete, we can elide all of the
+               // SingleDelete and Set records.
+               if err := d.Flush(); err != nil {
+                       b.Fatal(err)
+               }
                b.StopTimer()
        }

Done. Thank you for the detailed explanation.

petermattis (Collaborator) commented Sep 6, 2019

@hueypark Thanks again for the contribution. Apologies for all of the back and forth, but I think this ended up in a good place. I'm going to go ahead and merge this PR.

Heads up that I'm planning to move this repo to github.com/cockroachdb/pebble today.

petermattis merged commit c5b20fc into cockroachdb:master on Sep 6, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

hueypark (Contributor, Author) commented Sep 6, 2019

@petermattis @ajkr Thanks again for your kind review of this PR.

And I want to contribute to CockroachDB in the long term. Feel free to assign me whenever there is an appropriate issue.
