internal/base: add doc comment discussing TrySeekUsingNext #3329

jbowens · 2024-02-21T20:42:55Z

No description provided.

jbowens

This is still a work-in-progress, but I'm having trouble structuring it in a coherent way

Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @sumeerbhola)

sumeerbhola

this is definitely much better than what we currently have, so I am good with merging this

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens)

internal/base/doc.go line 30 at r2 (raw file):

// beneath the range deletion. However in doing so, a TrySeekUsingNext flag
// passed by the merging iterator's client no longer transitively holds for
// subsequent seeks of child level iterators in all cases. The merging iterator

do we currently do this? I only see the usual test coverage case in merging_iter.go.

	if invariants.Enabled && flags.TrySeekUsingNext() && !m.forceEnableSeekOpt &&
		disableSeekOpt(key, uintptr(unsafe.Pointer(m))) {
		flags = flags.DisableTrySeekUsingNext()
	}

internal/base/doc.go line 72 at r2 (raw file):

// The pebble levelIter makes use of the TrySeekUsingNext flag to avoid a naive
// seek among a level's file metadatas. When TrySeekUsingNext is passed by the
// caller, the relevant key must fall within the current file or later.

if it's a later file we call DisableTrySeekUsingNext(), so isn't this just for the current file?

internal/base/doc.go line 79 at r2 (raw file):

//
// The sstable iterators use the TrySeekUsingNext flag to avoid naive seeks
// through a table's index structures:

perhaps point to the long comment in reader_iter.go

internal/base/iterator.go line 234 at r2 (raw file):

// instead focuses on the contract expected of the caller.

// TrySeekUsingNext is set when the caller has knowledge that no action has been

that it has performed no action ...

internal/base/iterator.go line 246 at r2 (raw file):

// not return a key less than the current iterator position even if a naive seek
// would land there.
//

We should probably also say that the above promise from the caller is the same for SeekPrefixGE. That is, the prefixes of k1 and k2 can be different. The callee must remember if it did not position itself for k1 (e.g. an sstable iterator that did not position itself for k1 due to the bloom filter not matching), and that it needs to do the full work for k2.
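
The bookkeeping asked for here can be sketched as follows. This is a minimal illustration, not Pebble's sstable iterator: `toyTableIter`, its `prefixes` map (standing in for a bloom filter), and the `positioned` and `fullSeeks` fields are all hypothetical names. The point is only that a callee which skipped positioning for k1 cannot honor TrySeekUsingNext for k2 and has to fall back to a full seek.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toyTableIter is a hypothetical stand-in for an sstable iterator.
type toyTableIter struct {
	keys       []string        // sorted keys in the table
	prefixes   map[string]bool // stand-in for a bloom filter over prefixes
	pos        int             // index of the current key when positioned
	positioned bool            // false if the last seek skipped positioning
	fullSeeks  int             // number of seeks that could not reuse pos
}

// SeekPrefixGE returns the first key >= key that has the given prefix.
// trySeekUsingNext is the caller's promise that key is >= the previous seek
// key; the prefix itself may differ from the previous call's prefix.
func (t *toyTableIter) SeekPrefixGE(prefix, key string, trySeekUsingNext bool) (string, bool) {
	if !t.prefixes[prefix] {
		// Filter miss: return nothing and remember that we never
		// positioned, so the next seek must do the full work.
		t.positioned = false
		return "", false
	}
	start := 0
	if trySeekUsingNext && t.positioned {
		start = t.pos // safe: the previous seek key was <= key
	} else {
		t.fullSeeks++
	}
	t.pos = start + sort.SearchStrings(t.keys[start:], key)
	t.positioned = true
	if t.pos < len(t.keys) && strings.HasPrefix(t.keys[t.pos], prefix) {
		return t.keys[t.pos], true
	}
	return "", false
}

func main() {
	it := &toyTableIter{
		keys:     []string{"a1", "c1", "c2"},
		prefixes: map[string]bool{"a": true, "c": true},
	}
	_, ok := it.SeekPrefixGE("b", "b1", false) // filter miss, left unpositioned
	fmt.Println(ok, it.fullSeeks)
	k, _ := it.SeekPrefixGE("c", "c1", true) // flag set, but a full seek is needed
	fmt.Println(k, it.fullSeeks)
	k, _ = it.SeekPrefixGE("c", "c2", true) // positioned now, so the flag is usable
	fmt.Println(k, it.fullSeeks)
}
```
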

sumeerbhola

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @jbowens)

internal/base/doc.go line 68 at r2 (raw file):
hmm, I assume this relates to the other statement

If true, the callee should not return a key less than the current iterator position even if a naive seek would land there.

doesn't that remove the optionality on the iterator to ignore TrySeekUsingNext? We do that when the bloom filter did not match.

jbowens

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)

internal/base/doc.go line 30 at r2 (raw file):

Previously, sumeerbhola wrote…

do we currently do this? I only see the usual test coverage case in merging_iter.go.
	if invariants.Enabled && flags.TrySeekUsingNext() && !m.forceEnableSeekOpt &&
		disableSeekOpt(key, uintptr(unsafe.Pointer(m))) {
		flags = flags.DisableTrySeekUsingNext()
	}

we don't currently do this, but it's one of the options that was considered to address @RaduBerinde's problem in #3324.

internal/base/doc.go line 68 at r2 (raw file):

Previously, sumeerbhola wrote…

hmm, I assume this relates to the other statement

If true, the callee should not return a key less than the current iterator position even if a naive seek would land there.

doesn't that remove the optionality on the iterator to ignore TrySeekUsingNext? We do that when the bloom filter did not match.

it does remove the optionality, but in a way that the bloom filter use is still compliant.

The contract is that if the previous call to SeekPrefixGE(k1) returned some key k, then SeekPrefixGE(k2, TrySeekUsingNext()=true) must return some key ≥ k and ≥ k2. In the bloom filter case, the bloom filter exclusion causes the iterator to return no key at all. This ensures we won't position other levels' range deletions according to some key k that's > k2 during the previous seek.

internal/base/doc.go line 72 at r2 (raw file):

Previously, sumeerbhola wrote…

if it's a later file we call DisableTrySeekUsingNext(), so isn't this just for the current file?

yeah. this is trying to say that the seek among the file metadatas (not within the files themselves) is constrained to [current file, +∞) rather than (-∞,+∞). I expanded the comment to clarify
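
A rough sketch of that constraint, under simplifying assumptions: `fileMeta` and `findFile` are hypothetical names, keys are plain strings, and the search is a binary search, which may not match what levelIter actually does. With the flag set, the search over file metadata starts at the current file instead of at the first file, and the flag is only worth passing down to the file's own iterator when the seek lands in the same file.

```go
package main

import (
	"fmt"
	"sort"
)

// fileMeta is a hypothetical, simplified file metadata entry.
type fileMeta struct {
	smallest, largest string
}

// findFile returns the index of the first file whose largest key is >= key
// (len(files) if there is none). With trySeekUsingNext set, only files in
// [cur, +inf) are considered. The second result reports whether the flag can
// still be passed to the file's iterator, i.e. the seek stayed in file cur.
func findFile(files []fileMeta, cur int, key string, trySeekUsingNext bool) (int, bool) {
	lo := 0
	if trySeekUsingNext && cur >= 0 && cur <= len(files) {
		lo = cur
	}
	idx := lo + sort.Search(len(files)-lo, func(i int) bool {
		return files[lo+i].largest >= key
	})
	return idx, trySeekUsingNext && idx == cur
}

func main() {
	files := []fileMeta{{"a", "c"}, {"d", "f"}, {"g", "i"}}
	fmt.Println(findFile(files, 1, "e", true))  // stays in the current file
	fmt.Println(findFile(files, 1, "h", true))  // moves to a later file
	fmt.Println(findFile(files, 1, "b", false)) // unconstrained seek
}
```
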

internal/base/doc.go line 79 at r2 (raw file):

Previously, sumeerbhola wrote…

perhaps point to the long comment in reader_iter.go

Done.

internal/base/iterator.go line 234 at r2 (raw file):

Previously, sumeerbhola wrote…

that it has performed no action ...

Done.

internal/base/iterator.go line 246 at r2 (raw file):

Previously, sumeerbhola wrote…

We should probably also say that the above promise from the caller is the same for SeekPrefixGE. That is, the prefixes of k1 and k2 can be different. The callee must remember if it did not position itself for k1 (e.g. an sstable iterator that did not position itself for k1 due to the bloom filter not matching), and that it needs to do the full work for k2.

Done.

jbowens

Reviewable status: 0 of 3 files reviewed, 6 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)

internal/base/doc.go line 68 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

it does remove the optionality, but in a way that the bloom filter use is still compliant.

The contract is that if the previous call to SeekPrefixGE(k1) returned some key k, then SeekPrefixGE(k2, TrySeekUsingNext()=true) must return some key ≥ k and ≥ k2. In the bloom filter case, the bloom filter exclusion causes the iterator to return no key at all. This ensures we won't position other levels' range deletions according to some key k that's > k2 during the previous seek.

Hrm, maybe a better way of thinking about it is that the mergingIter is violating the contract. You might get a sequence like:

mergingIter.SeekPrefixGE(k1)
  # mergingIter observes a range deletion deleting the span [k1,k3)
  > levelIter.SeekPrefixGE(k3)
mergingIter.SeekPrefixGE(k2, TrySeekUsingNext()=true)
  # mergingIter does not observe the [k1,k3) range deletion because the
  # relevant iterator has already been positioned to a key ≥ k3
  > levelIter.SeekPrefixGE(k2, TrySeekUsingNext()=true)

The caller obeyed the contract, passing TrySeekUsingNext()=true because k2 ≥ k1. The merging iterator violated it, because in the second seek k2 < k3, and yet it passed TrySeekUsingNext()=true.

The semantics that the merging iterator is relying on are: if TrySeekUsingNext()=true, an iter.Seek[Prefix]GE(k) should be interpreted as iter.Seek[Prefix]GE(max(k, iter.Key())).
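
To make that interpretation concrete, here is a small sketch with a hypothetical `toyIter` over sorted string keys (none of these names come from Pebble). Restricting the search to the current position onward yields the same result as seeking to max(k, iter.Key()), without comparing k against the current key.

```go
package main

import (
	"fmt"
	"sort"
)

// toyIter is a hypothetical leaf iterator over sorted string keys.
type toyIter struct {
	keys []string
	pos  int // index of the current key; len(keys) when exhausted
}

// SeekGE returns the first key >= key. With trySeekUsingNext set, it only
// looks at keys from the current position onward, which is equivalent to
// seeking to max(key, current key) and never moves the iterator backward.
func (it *toyIter) SeekGE(key string, trySeekUsingNext bool) (string, bool) {
	start := 0
	if trySeekUsingNext {
		start = it.pos
	}
	it.pos = start + sort.SearchStrings(it.keys[start:], key)
	if it.pos < len(it.keys) {
		return it.keys[it.pos], true
	}
	return "", false
}

func main() {
	it := &toyIter{keys: []string{"k1", "k2", "k3", "k4"}}
	// The parent skipped ahead to k3, e.g. past a range deletion [k1,k3).
	fmt.Println(it.SeekGE("k3", false))
	// A later seek to k2 with the flag set must not land before k3.
	fmt.Println(it.SeekGE("k2", true))
	// The same seek without the flag is a naive seek and moves backward.
	fmt.Println(it.SeekGE("k2", false))
}
```
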

RaduBerinde · 2024-02-24T15:44:25Z

The semantics that the merging iterator is relying on are: if TrySeekUsingNext()=true, an iter.Seek[Prefix]GE(k) should be interpreted as iter.Seek[Prefix]GE(max(k, iter.Key())).

The max is something the merging iterator could do instead of requiring all implementations to tolerate it, no?

jbowens · 2024-02-29T15:21:34Z

The max is something the merging iterator could do instead of requiring all implementations to tolerate it, no?

That's true, but the original design of TrySeekUsingNext strived to limit the number of key comparisons by performing the comparison at the top-level, and propagating the flag indicating the possibility of applying the optimization to its children. It's possible to avoid performing any kind of max key comparison within the leaf implementations. I think there's a tradeoff between a complicated, subtle interface and eking out performance. I'm not sure the performance warrants the complexity, but it still feels like there's some possible refactor or recharacterization that lets us get the best of both worlds...
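
For comparison, the alternative raised above could look roughly like this: the parent takes the max itself and the leaf needs no flag at all. `plainIter` and `seekChild` are hypothetical names, and this is only a sketch of the tradeoff, not a proposal for mergingIter. The cost is one extra key comparison per child on every seek that carries the flag.

```go
package main

import (
	"fmt"
	"sort"
)

// plainIter is a hypothetical leaf iterator with no TrySeekUsingNext support.
type plainIter struct {
	keys       []string
	pos        int
	positioned bool
}

// SeekGE always performs a full (naive) seek.
func (it *plainIter) SeekGE(key string) (string, bool) {
	it.pos = sort.SearchStrings(it.keys, key)
	it.positioned = true
	return it.Key()
}

// Key returns the current key, if the iterator is positioned on one.
func (it *plainIter) Key() (string, bool) {
	if it.positioned && it.pos < len(it.keys) {
		return it.keys[it.pos], true
	}
	return "", false
}

// seekChild is what a parent could do instead of passing the flag down:
// it clamps the seek key to the child's current key before seeking.
func seekChild(child *plainIter, key string, trySeekUsingNext bool) (string, bool) {
	if trySeekUsingNext && child.positioned {
		cur, ok := child.Key()
		if !ok {
			return "", false // exhausted: a forward-only seek stays exhausted
		}
		if cur > key {
			key = cur // the extra comparison paid by the parent
		}
	}
	return child.SeekGE(key)
}

func main() {
	child := &plainIter{keys: []string{"k1", "k2", "k3", "k4"}}
	fmt.Println(seekChild(child, "k3", false)) // child positioned at k3
	fmt.Println(seekChild(child, "k2", true))  // clamped to k3 by the parent
}
```
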

jbowens

Bumping this

Reviewable status: 0 of 3 files reviewed, 6 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)

sumeerbhola

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)

internal/base/doc.go line 68 at r2 (raw file):
The subtleties in the second and third bullets are related to SeekPrefixGE. I am getting increasingly unsure about (a) what exactly is our current behavior and what makes it correct, (b) what we think is the minimal behavior needed to make things correct (which may suggest a direction for future optimizations, or serve as a way to guarantee correctness of in-progress changes/optimizations).

Can you write down both (a) and (b), and attempt a small proof sketch?

For instance, in the example you described above, I am trying to figure out what causes the mergingIter to have moved past the [k1,k3) span in the first seek. Say that levelIter ignored the seek to k3, due to the bloom filter, and returned nil. In that case it has not moved to k3. Now, does the fact that the mergingIter never adds a key that does not match the prefix to the heap mean that it cannot have moved past the span?
If we realized that k3 has a different prefix and did not seek the others, the span would still be there on the next seek, if the next seek was to a key < k3. So we would be correct I think. I think this relates to your comment in the other PR

b) maintain an invariant that all the higher levels remain in the heap and their l.tombstones cascade down to l.tombstone.End such that a subsequent SeekGE(k2) will end up adjusting its seek key to the original l.tombstone.End by the time it reaches the unpositioned levels.

I don't know if they need to remain in the heap. Won't they be readded to the heap on the next SeekPrefixGE?

I am very wary of any further code changes in this area without fully reasoning about correctness.

internal/base/doc.go line 31 at r3 (raw file):

// passed by the merging iterator's client no longer transitively holds for
// subsequent seeks of child level iterators in all cases. The merging iterator
// assumes responsibility for ensuring that SeekPrefixGE is propagated to its

nit: for ensuring that TrySeekUsingNext ...?

jbowens

Reviewable status: 2 of 3 files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)

internal/base/doc.go line 68 at r2 (raw file):

Previously, sumeerbhola wrote…

The subtleties in the second and third bullets are related to SeekPrefixGE. I am getting increasingly unsure about (a) what exactly is our current behavior and what makes it correct, (b) what we think is the minimal behavior needed to make things correct (which may suggest a direction for future optimizations, or serve as a way to guarantee correctness of in-progress changes/optimizations).

Can you write down both (a) and (b), and attempt a small proof sketch?

For instance, in the example you described above, I am trying to figure out what causes the mergingIter to have moved past the [k1,k3) span in the first seek. Say that levelIter ignored the seek to k3, due to the bloom filter, and returned nil. In that case it has not moved to k3. Now, does the fact that the mergingIter never adds a key that does not match the prefix to the heap mean that it cannot have moved past the span?
If we realized that k3 has a different prefix and did not seek the others, the span would still be there on the next seek, if the next seek was to a key < k3. So we would be correct I think. I think this relates to your comment in the other PR

b) maintain an invariant that all the higher levels remain in the heap and their l.tombstones cascade down to l.tombstone.End such that a subsequent SeekGE(k2) will end up adjusting its seek key to the original l.tombstone.End by the time it reaches the unpositioned levels.

I don't know if they need to remain in the heap. Won't they be readded to the heap on the next SeekPrefixGE?

I am very wary of any further code changes in this area without fully reasoning about correctness.

I've pushed an update to the comment that clarifies a few invariants although there's something around not iterating beyond the iteration prefix that I still need to codify.

sumeerbhola

Reviewable status: 2 of 3 files reviewed, 8 unresolved discussions (waiting on @jbowens and @RaduBerinde)

internal/base/doc.go line 59 at r5 (raw file):

// f_i among files f_0, f_1, ..., f_n and the key k_i among keys k_0, k_1, ...,
// k_n known to the internal iterator. We maintain the following merging
// iterator invariants:

I'm reading this for now by only thinking about SeekGE.

Terminology question:

are f_0, ..., f_n the files that each level is currently positioned at? M2 suggests these are files in a single level. The use of the same n is confusing if it is the latter. Same question for k_0, k_1, ...

The property I think that we want is that if a level_i has moved to key k_i for various reasons (before SeekGE(s2)), and there is any visible RANGEDEL [s0, s1) in level_i where s1 <= k_i, and s1 > s, then all lower levels must be moved to a key >= s1. This is because we may have moved past the file containing the RANGEDEL in level_i, and the effect of that has to have been enforced. M1 ensures that if this invariant was true before calling SeekGE(s2), then propagating TrySeekUsingNext to everyone will continue to enforce it.
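
Read literally, the property can be written as a predicate over a snapshot of the levels. The sketch below makes several assumptions that are not stated in the comment: levels are ordered newest first, so "lower levels" are those at larger indexes; the undefined `s` is read as the most recent seek key; every level is positioned on a key; and keys are plain strings. `span`, `levelState`, and `invariantHolds` are hypothetical names.

```go
package main

import "fmt"

// span is a hypothetical range deletion covering [start, end).
type span struct{ start, end string }

// levelState is a hypothetical snapshot of one level of a merging iterator.
type levelState struct {
	pos       string // key the level is currently positioned at
	rangeDels []span // visible range deletions in this level
}

// invariantHolds reports whether, for every level i and every range deletion
// [s0, s1) in it with seekKey < s1 <= levels[i].pos, all lower levels
// (larger index, by assumption) are positioned at a key >= s1.
func invariantHolds(levels []levelState, seekKey string) bool {
	for i, l := range levels {
		for _, rd := range l.rangeDels {
			if !(rd.end > seekKey && rd.end <= l.pos) {
				continue // this range deletion places no constraint
			}
			for _, lower := range levels[i+1:] {
				if lower.pos < rd.end {
					return false
				}
			}
		}
	}
	return true
}

func main() {
	upper := levelState{pos: "m", rangeDels: []span{{"b", "f"}}}
	fmt.Println(invariantHolds([]levelState{upper, {pos: "g"}}, "a"))
	fmt.Println(invariantHolds([]levelState{upper, {pos: "d"}}, "a"))
}
```
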

This also means that:

- The level iterator must remember its file position. With SeekGE a file must be open, or the level is exhausted.
- When a file is open, that file must respect TrySeekUsingNext. I don't think this is captured by M2.

SeekPrefixGE is tricky in that we want to allow the file iter to choose not to position itself if the bloom filter does not match. Also, a levelIter should be allowed to not load a file if that file's start key is beyond the prefix. Which would mean that for correctness we can't move level_i past a RANGEDEL [s0, s1) unless s1 matches the prefix.

Hmm, maybe this is also an insufficient statement. Say we have SeekPrefixGE(b@100) and there is a RANGEDEL [b@200, b@50) and SET b@40, in two consecutive files in level_1. And level_2 has a file with bounds [a, c] that is loaded but does not match the bloom filter, so it leaves itself unpositioned. If we then do SeekPrefixGE(b@80), what happens in this level_2 file? Ah yes, its bloom filter did not match, so it can't have b@. So a table iterator can ignore that RANGEDEL precisely because it knows it has no key covered by it that may be uncovered later.

Apologies if I am rambling -- I am starting to swap this back in. Also, if I am slowing you and @RaduBerinde down in making progress on this and the other PR, feel free to treat my comments as drive-by commentary and proceed at your own pace.

jbowens

Reviewable status: 2 of 3 files reviewed, 8 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)

internal/base/doc.go line 31 at r3 (raw file):

Previously, sumeerbhola wrote…

nit: for ensuring that TrySeekUsingNext ...?

didn't follow this comment?

internal/base/doc.go line 59 at r5 (raw file):

M2 suggests these are files in a single level. The use of the same n is confusing if it is the latter. Same question for k_0, k_1, ...

Same level; not sure how to express it without defining an overwhelming number of variables

Which would mean that for correctness we can't move level_i past a RANGEDEL [s0, s1) unless s1 matches the prefix.

I think we can move a level past s1, we just can't propagate it out of the mergingIter (eg, we need to not rely on the outcome of isNextEntryDeleted on a subsequent iterator seek)

jbowens force-pushed the tryseekusingnext-doc branch from 303d580 to 404f675 Compare February 21, 2024 20:49

jbowens marked this pull request as ready for review February 21, 2024 22:03

jbowens requested review from sumeerbhola and a team February 21, 2024 22:03

jbowens commented Feb 21, 2024

jbowens force-pushed the tryseekusingnext-doc branch from 404f675 to bdaab6d Compare February 22, 2024 17:29

sumeerbhola approved these changes Feb 22, 2024

sumeerbhola requested changes Feb 22, 2024

jbowens force-pushed the tryseekusingnext-doc branch from bdaab6d to a48efa2 Compare February 23, 2024 17:55

jbowens commented Feb 23, 2024

jbowens force-pushed the tryseekusingnext-doc branch from a48efa2 to 6191567 Compare February 23, 2024 17:56

jbowens commented Feb 23, 2024

RaduBerinde mentioned this pull request Feb 24, 2024

sstable: fixes to prefix replacing iterator #3344

Closed

RaduBerinde mentioned this pull request Feb 24, 2024

db: fix incorrect SeekPrefixGE call in merging iterator #3324

Closed

jbowens commented Aug 12, 2024

sumeerbhola requested a review from RaduBerinde August 15, 2024 14:53

sumeerbhola requested changes Aug 15, 2024

jbowens force-pushed the tryseekusingnext-doc branch from 6191567 to 50ad0ad Compare August 15, 2024 21:34

jbowens commented Aug 15, 2024

internal/base: add doc comment discussing TrySeekUsingNext

098c6f3

jbowens force-pushed the tryseekusingnext-doc branch from 50ad0ad to 098c6f3 Compare August 15, 2024 21:52

sumeerbhola requested changes Aug 16, 2024

jbowens commented Aug 26, 2024

sumeerbhola mentioned this pull request Sep 18, 2024

storage,admission: investigate read-only batch latency during high-volume snapshot ingest cockroachdb/cockroach#89788

Open
