Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: add CanDeterministicallySingleDelete #3077

Merged
merged 1 commit into from
Nov 21, 2023

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Nov 15, 2023

Add a new specialized CanDeterministicallySingleDelete function that allows a
user to inspect hidden internal keys within the current user key to determine
whether the behavior of a SingleDelete of the current Key would be
deterministic, assuming no writes are made to the key between Iterator creation
and commit of the SingleDelete. The CanDeterministicallySingleDelete func may
only be called on an Iterator oriented in the forward direction and does not
change the external Iterator position.

During intent resolution, CockroachDB will write a SINGLEDEL tombstone to
delete the intent if the intent's value indicates that it must have only been
written once. Subtle logic that cuts across the stack makes this optimization
possible, and one possible explanation for a recent instance of replica
divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate
using CanDeterministicallySingleDelete within CockroachDB's intent resolution
to validate this logic at runtime.

There is some potential for this function to drive the decision to use
SingleDelete when creating batches for local Engine application only.

Informs cockroachdb/cockroach#114492.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens force-pushed the internal-next branch 2 times, most recently from 92ea423 to 3f35cf1 Compare November 16, 2023 15:50
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @jbowens)


iterator.go line 1921 at r1 (raw file):

// InternalNext advances internal Iterator state forward to expose the
// InternalKeyKind of the next internal key with a user key equal to Key().

So InternalNext behaves as if nothing was actually done to the iterator, yes?
This looks fine.

We may have to do something about stability of the key/value pair already returned, if we want the caller to be able to use them after calling InternalNext. I guess the key is already stable since we are not fiddling with i.key. And cloning the LazyValue does feel unnecessary since CockroachDB eagerly extracts the value and parses the MVCCMetadata proto. In which case we could just document the instability of the value.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is RFAL

Reviewable status: 0 of 9 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)


iterator.go line 1921 at r1 (raw file):

Previously, sumeerbhola wrote…

So InternalNext behaves as if nothing was actually done to the iterator, yes?
This looks fine.

We may have to do something about stability of the key/value pair already returned, if we want the caller to be able to use them after calling InternalNext. I guess the key is already stable since we are not fiddling with i.key. And cloning the LazyValue does feel unnecessary since CockroachDB eagerly extracts the value and parses the MVCCMetadata proto. In which case we could just document the instability of the value.

Done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)


iterator.go line 1931 at r2 (raw file):

// InternalNext, unlike all other Iterator methods, exposes internal LSM state.
// InternalNext advances the Iterator's internal iterator to the next shadowed
// key with the a user key equal to Key(). When a key is overwritten or deleted,

nit: ... with a user ...


iterator.go line 1935 at r2 (raw file):

// allows the caller to see whether an obsolete internal key exists with the
// current Key(), and what it's key kind is. Note that the existence of an
// internal key is nondeterministic and depedent on internal LSM state. These

dependent


iterator.go line 1956 at r2 (raw file):

		return InternalNextExhausted, base.InternalKeyKindInvalid
	}
	i.lastPositioningOp = unknownLastPositionOp

I was thinking that we will lose the monotonic seek optimization during intent resolution of a batch of intents because of this, but that may be ok. But then I noticed the loop with the call to ltIter.NextEngineKey in https://github.com/cockroachdb/cockroach/blob/e23591614d7ae4447a1d788c0e23e68a24018878/pkg/storage/mvcc.go#L4912 which suggests we already don't hit that optimization.


iterator.go line 1965 at r2 (raw file):

			// variant so that the caller does not need to check i.iter.Error()
			// in the common case that the next internal key has a new user key.
			if i.err = firstError(i.err, i.iter.Error()); i.err != nil {

isn't i.err guaranteed to be nil prior to this assignment, so we could do i.err = i.iter.Error()?


metamorphic/ops.go line 1294 at r2 (raw file):

	_, _ = i.InternalNext()
	// The return value of an InternalNext is dependent on internal LSM state
	// and non-deterministic, so we don't record it. Including the operation

should we also have a test option that makes iterInternalNextOp a noop?


metamorphic/ops.go line 1298 at r2 (raw file):

	// not change the result of any other Iterator operation that should be
	// deterministic, regardless of the outcome of InternalNext.
	h.Recordf("%s // %v", o, i.Error())

why is the error deterministic?
Is it because the only errors we happen to encounter here in the metamorphic test are caused by i.err already being true prior to the call to i.InternalNext() or errors.New("switching from reverse to forward via peeking is prohibited")?


testdata/iter_histories/internal_next line 109 at r2 (raw file):

flush
----

Can you add a comment stating that we have two versions for each user key.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 6 of 9 files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)


iterator.go line 1935 at r2 (raw file):

Previously, sumeerbhola wrote…

dependent

Done.


iterator.go line 1956 at r2 (raw file):

Previously, sumeerbhola wrote…

I was thinking that we will lose the monotonic seek optimization during intent resolution of a batch of intents because of this, but that may be ok. But then I noticed the loop with the call to ltIter.NextEngineKey in https://github.com/cockroachdb/cockroach/blob/e23591614d7ae4447a1d788c0e23e68a24018878/pkg/storage/mvcc.go#L4912 which suggests we already don't hit that optimization.

Hmm, I don't see any reason why we couldn't leave lastPositioningOp untouched, but these things can be subtle.


iterator.go line 1965 at r2 (raw file):

Previously, sumeerbhola wrote…

isn't i.err guaranteed to be nil prior to this assignment, so we could do i.err = i.iter.Error()?

Yeah, Done.


metamorphic/ops.go line 1294 at r2 (raw file):

Previously, sumeerbhola wrote…

should we also have a test option that makes iterInternalNextOp a noop?

I think we should omit it because it allows us to assert that errors are deterministic. Some of the InternalNext branches are already just about noops, and the metamorphic tests are good at constructing a variety of internal LSM states that'll hit them.


metamorphic/ops.go line 1298 at r2 (raw file):

Previously, sumeerbhola wrote…

why is the error deterministic?
Is it because the only errors we happen to encounter here in the metamorphic test are caused by i.err already being true prior to the call to i.InternalNext() or errors.New("switching from reverse to forward via peeking is prohibited")?

Yeah, that's right. Added a comment here.


testdata/iter_histories/internal_next line 109 at r2 (raw file):

Previously, sumeerbhola wrote…

Can you add a comment stating that we have two versions for each user key.

Done.

@jbowens jbowens changed the title db: add InternalNext iterator operation db: add CanDeterministicallySingleDelete Nov 20, 2023
@jbowens jbowens force-pushed the internal-next branch 3 times, most recently from 6af9173 to fa1f4d6 Compare November 20, 2023 23:11
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeerbhola — I reworked this to keep the "internal next" operation internal, instead exposing a purpose-built function pebble.CanDeterministicallySingleDelete that also handles the knowledge of what internal key kind sequences are okay. PTAL!

Reviewable status: 0 of 9 files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)

@jbowens jbowens requested review from nvanbenschoten and a team November 20, 2023 23:16
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 9 of 9 files at r3, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @jbowens and @nvanbenschoten)


iterator.go line 2862 at r3 (raw file):

//
// If CanDeterministicallySingleDelete returns true AND the key at the iterator
// position is not modified between the creation ofi the Iterator and the commit

of


iterator.go line 2864 at r3 (raw file):

// position is not modified between the creation ofi the Iterator and the commit
// of a batch containing a SingleDelete over the key, then the caller can be
// assured that SingleDelete is equivalent to Delete.

How about:
... is equivalent to Delete on the local engine. For batches that will be replicated, CanDeterministicallySingleDelete can return true on one engine, but it may not be true on another engine, since the first engine may have had multiple SETs collapsed into one.


iterator.go line 2880 at r3 (raw file):

			//   SINGLEDEL; SET; SINGLEDEL; SET;
			// is not deterministic, because the first SET may consume the
			// second SINGLEDEL resulting in two SETs shadowed by one SINGLEDEL.

It is safe because we have SETWITHDEL. And hopefully we did a similar check when writing the SINGLEDEL so we don't have to continue now.

@jbowens jbowens force-pushed the internal-next branch 2 times, most recently from e14bd8c to aaaebc5 Compare November 21, 2023 02:03
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 8 of 9 files reviewed, 10 unresolved discussions (waiting on @nvanbenschoten and @sumeerbhola)


iterator.go line 2862 at r3 (raw file):

Previously, sumeerbhola wrote…

of

Done.


iterator.go line 2864 at r3 (raw file):

Previously, sumeerbhola wrote…

How about:
... is equivalent to Delete on the local engine. For batches that will be replicated, CanDeterministicallySingleDelete can return true on one engine, but it may not be true on another engine, since the first engine may have had multiple SETs collapsed into one.

Done.


iterator.go line 2880 at r3 (raw file):

Previously, sumeerbhola wrote…

It is safe because we have SETWITHDEL. And hopefully we did a similar check when writing the SINGLEDEL so we don't have to continue now.

Ah, right!

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 9 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @jbowens and @sumeerbhola)


-- commits line 18 at r4:
Is the thinking that we will add calls to CanDeterministicallySingleDelete in testing builds of CockroachDB? Behind a buildutil.CrdbTestBuild?

If so, where will we add those? Just during evaluation of ResolveIntent(Range) requests above Raft on the leaseholder? Or also before applying WriteBatches that contain SingleDelete operations on all replicas below Raft?

The second check would be more comprehensive, but also more difficult to introduce without adding logic to iterate over every WriteBatch during raft application to search for SingleDelete operations. It makes me wonder whether this kind of assertion would be more natural to push into Pebble. For example, I could imagine a mode where Pebble could be configured to check for determinism whenever applying any SingleDelete operation. We might not want to be reading on the apply path though, so that may not work.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @nvanbenschoten and @sumeerbhola)


-- commits line 18 at r4:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is the thinking that we will add calls to CanDeterministicallySingleDelete in testing builds of CockroachDB? Behind a buildutil.CrdbTestBuild?

If so, where will we add those? Just during evaluation of ResolveIntent(Range) requests above Raft on the leaseholder? Or also before applying WriteBatches that contain SingleDelete operations on all replicas below Raft?

The second check would be more comprehensive, but also more difficult to introduce without adding logic to iterate over every WriteBatch during raft application to search for SingleDelete operations. It makes me wonder whether this kind of assertion would be more natural to push into Pebble. For example, I could imagine a mode where Pebble could be configured to check for determinism whenever applying any SingleDelete operation. We might not want to be reading on the apply path though, so that may not work.

The intention is to call it in production builds just in evaluation. We expect the regression introduced to be small, because we already advance over these internal keys in NextEngineKey. Although this function doesn’t change the externally-visible iterator position, it does reduce the work the subsequent Next needs to perform.

Long-term I would like to explore pushing the single deletion decision point into Pebble (some thoughts in cockroachdb/cockroach#114492). I think it’d be nice to either completely obscure the single deletion optimization to the user, or make it simply a suggestion that Pebble may ignore if it’s unsafe or too expensive to prove it’s safe. Doing so might regress in terms of resource utilization per intent resolved since we need to introduce a read somewhere there wasn’t previously, but that seems like it should be offset by its wider applicability, like the raft log.

@jbowens
Copy link
Collaborator Author

jbowens commented Nov 21, 2023

-- commits line 18 at r4:

Previously, jbowens (Jackson Owens) wrote…

The intention is to call it in production builds just in evaluation. We expect the regression introduced to be small, because we already advance over these internal keys in NextEngineKey. Although this function doesn’t change the externally-visible iterator position, it does reduce the work the subsequent Next needs to perform.

Long-term I would like to explore pushing the single deletion decision point into Pebble (some thoughts in cockroachdb/cockroach#114492). I think it’d be nice to either completely obscure the single deletion optimization to the user, or make it simply a suggestion that Pebble may ignore if it’s unsafe or too expensive to prove it’s safe. Doing so might regress in terms of resource utilization per intent resolved since we need to introduce a read somewhere there wasn’t previously, but that seems like it should be offset by its wider applicability, like the raft log.

To be clear, we intend to call it in production during evaluation only as an invariant assertion.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @nvanbenschoten and @sumeerbhola)


iterator.go line 2880 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Ah, right!

On second thought, maybe we should still check beneath the observed SINGLEDEL because in the context of MVCCResolveWriteIntentRange we're going to iterate over those internal keys during the next call to NextEngineKey anyways: may as well validate that the already-written SINGLEDEL obeyed invariants too?

Add a new specialized CanDeterministicallySingleDelete function that allows a
user to inspect hidden internal keys within the current user key to determine
whether the behavior of a SingleDelete of the current Key would be
deterministic, assuming no writes are made to the key between Iterator creation
and commit of the SingleDelete. The CanDeterministicallySingleDelete func may
only be called on an Iterator oriented in the forward direction and does not
change the external Iterator position.

During intent resolution, CockroachDB will write a SINGLEDEL tombstone to
delete the intent if the intent's value indicates that it must have only been
written once. Subtle logic that cuts across the stack makes this optimization
possible, and one possible explanation for a recent instance of replica
divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate
using CanDeterministicallySingleDelete within CockroachDB's intent resolution
to validate this logic at runtime.

There is some potential for this function to drive the decision to use
SingleDelete when creating batches for local Engine application only.

Informs cockroachdb/cockroach#114492.
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r5.
Reviewable status: all files reviewed (commit messages unreviewed), 11 unresolved discussions (waiting on @nvanbenschoten)

@jbowens
Copy link
Collaborator Author

jbowens commented Nov 21, 2023

TFTR!

@jbowens jbowens merged commit d228f9d into cockroachdb:master Nov 21, 2023
11 checks passed
@jbowens jbowens deleted the internal-next branch November 21, 2023 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants