sstable: fixes to prefix replacing iterator #3344

RaduBerinde · 2024-02-24T02:40:06Z

We add more logic inside the prefix replacing iterators:

pass through SeekGE calls even when not strictly necessary, to
enable optimizations indicated by the flags.
keep track of whether we are positioned before/after the synthetic
prefix range, to enable expected behaviors like SeekGE after the
range followed by Prev (or SeekLT to before the range followed
by Next)

cockroach-teamcity · 2024-02-24T02:40:13Z

This change is

RaduBerinde

It's a bit hard to know for sure this is correct without some of the semantics being clear (see discussion in #3329) but this passes more iterations of the metamorphic test (w prefix replacement added) than what we have. I haven't stressed it for a long time yet but I plan to do that before merging.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @dt, @itsbilal, and @jbowens)

sstable/prefix_replacing_iterator.go line 157 at r1 (raw file):

			// We still want to seek the underlying iterator to potentially enable
			// optimizations passed through flags.
			p.i.SeekGE(p.rewriteArg(p.lower), flags)

@jbowens do you think this is enough to retain any seek optimizations?

itsbilal · 2024-02-26T19:19:48Z

sstable/prefix_replacing_iterator.go

+const (
+	// inSync indicates that the prefix replacing iterator is "in sync" with the
+	// underlying iterator; any Next/Prev calls can be passed through.
+	inSync iteratorState = iota


nit: this can be inRange, or alternatively iterPosCur if we want to really mimic wording in other iterators.

itsbilal · 2024-02-26T19:29:52Z

sstable/prefix_replacing_iterator.go

 		return nil, base.LazyValue{}
+	case beforeRange:
+		p.state = inSync
+		return p.rewriteResult(p.i.First())


This could be a violation of the internalIterator contract, right? We don't allow First() to be called if there's a lower bound, instead we want SeekGE(lower) to be called. Same with Last() and the upper bound. You might need to save the bounds in SetBounds so you can do the appropriate translation here instead of passing through a first/last to the child iter.

Good point, done. TFTR!

itsbilal

LGTM save for NextPrefix fix

itsbilal · 2024-02-26T20:28:20Z

sstable/prefix_replacing_iterator.go

+	case beforeRange:
+		p.state = inRange
+		if p.lowerBound != nil {
+			return p.rewriteResult(p.i.SeekGE(p.lowerBound, base.SeekGEFlagsNone))


this might need to be SeekGE(succKey) as the lower bound might not have the prefix we want. And same for the First() call below; we can actually unconditionally call SeekGE(succKey) here.

I need to think about this some more. What I did is incorrect, we can't just pass a lower bound below without rewriting its prefix.. We need to basically figure out the intersection of the given bounds and the vtable bounds (most of that is happening today below us).

We add more logic inside the prefix replacing iterators: - pass through `SeekGE` calls even when not strictly necessary, to enable optimizations indicated by the flags. - keep track of whether we are positioned before/after the synthetic prefix range, to enable expected behaviors like `SeekGE` after the range followed by `Prev` (or `SeekLT` to before the range followed by `Next`)

RaduBerinde · 2024-02-26T20:46:47Z

Looks like this requires a bunch more work. I wonder if I should hold off and see if we can switch to the block-level prefix synthesis instead @dt @msbutler

itsbilal · 2024-02-26T20:47:51Z

Does it require that much more work? If it's just NextPrefix that's affected, the lazy solution is to just call the prefix replacing iter's own SeekGE(succKey) in that case.

sumeerbhola

Reviewable status: 0 of 3 files reviewed, 14 unresolved discussions (waiting on @dt, @itsbilal, @jbowens, and @RaduBerinde)

sstable/prefix_replacing_iterator.go line 157 at r1 (raw file):

Previously, RaduBerinde wrote…

@jbowens do you think this is enough to retain any seek optimizations?

I'm a little confused here. I think all we need is to track whether we did the absolute positioning or ignored the SeekPrefixGE (see other comment below).
So we would remember positioned/unpositioned. The ignore case here is the only one which would result in a state transition to unpositioned. Any relative positioning call when unpositioned will return nil. Any absolute positioning call where the state was unpositioned would turn off the TrySeekUsingNext. This is following the same pattern as the next lower layer where we do

		if !i.lastBloomFilterMatched {
			// Iterator is not positioned based on last seek.
			flags = flags.DisableTrySeekUsingNext()
		}

I could be wrong, since we need to flesh out the reasoning in https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1708716407912329?thread_ts=1708549343.720779&cid=CAC6K3SLU

sstable/prefix_replacing_iterator.go line 26 at r2 (raw file):

	syntheticPrefix []byte

	// lower is a valid key that has syntheticPreficx and is a lower bound for all

nit: syntheticPrefix

Aren't we requiring that i only produce keys with contentPrefix. So how can a key starting with syntheticPrefix be a lower bound for i?

sstable/prefix_replacing_iterator.go line 63 at r2 (raw file):

var _ Iterator = (*prefixReplacingIterator)(nil)

// newPrefixReplacingIterator wraps an iterator over keys that have

nit: over point keys ...

sstable/prefix_replacing_iterator.go line 68 at r2 (raw file):

// `contentPrefix`.
//
// lower is a valid key that has syntheticPreficx and is a lower bound for all

syntheticPrefix

Same question about "lower bound for all keys produced by i`.

sstable/prefix_replacing_iterator.go line 201 at r2 (raw file):

Previously, RaduBerinde wrote…

Good point, done. TFTR!

I suspect we are trying to be too clever with this code.

The underlying iterator i already has the bounds, but they are in terms of ContentPrefix. So our difficulty arises when someone tries SeekGE above the upper bound or SeekLT below the lower bound, since we can't translate them into seek keys in terms of ContentPrefix. But levelIter shouldn't be doing that, since it loads the relevant file with the first key >= seek key in the case of SeekGE and < seek key in case of SeekLT. So how about we treat those seek cases as simply calls to Last(); Next() and First(); Prev(), since they are not performance sensitive (unless I am mistaken).

I think this eliminates the need for beforeRange and afterRange and lowerBound, upperBound.

sstable/prefix_replacing_iterator.go line 49 at r3 (raw file):

	// inRange indicates that the prefix replacing iterator is "in sync" with the
	// underlying iterator; any Next/Prev calls can be passed through.
	inRange prefixReplacingIteratorState = iota

is there a reason to default to inRange instead of making the iota value panic-worthy?

sstable/prefix_replacing_iterator.go line 56 at r3 (raw file):

	// prefix range. A Next() call should return the first key/span in the range.
	beforeRange
	empty

comment for empty?

sstable/prefix_replacing_iterator.go line 136 at r3 (raw file):

	p.state = inRange
	if !bytes.HasPrefix(key, p.syntheticPrefix) {
		if p.cmp(key, p.lower) > 0 {

this is an odd use of p.lower that is worth a comment. One would expect this check to happen against upper, but this works because lower and upper share SyntheticPrefix.

sstable/prefix_replacing_iterator.go line 142 at r3 (raw file):

		// Key must be before the range; seek to the lower bound instead.
		// We don't use First because we may miss out on optimizations passed
		// through SeekEFlags.

SeekGEFlags

sstable/prefix_replacing_iterator.go line 143 at r3 (raw file):

		// We don't use First because we may miss out on optimizations passed
		// through SeekEFlags.
		key = p.lower

Presumably prefixReplacingIterator is being used in the performance sensitive case after the sstable has been ingested, so this is in a level and being iterated over using levelIter. Is that accurate?

sstable/prefix_replacing_iterator.go line 226 at r3 (raw file):

		return p.rewriteResult(p.i.First())
	case inRange:
		return p.rewriteResult(p.i.NextPrefix(succKey))

what is the contract regarding succKey. Are we guaranteed that succKey has SyntheticPrefix? I believe not since levelIter will just pass through. So we have to be able to handle a succKey that is greater than the upper, i.e.,

if !bytes.HasPrefix(succKey, p.syntheticPrefix) {
		if p.cmp(key, p.lower) > 0 {
                   // Go past the end of this sstable
                   ...
                   return
                 }  else {
                   //panic. Can't be calling NextPrefix on this iterator if it is already positioned after `succKey`
                 }
}
// succKey has synthetic prefix. So rewrite succKey and call on i.

sstable/prefix_replacing_iterator.go line 268 at r3 (raw file):

	// is going to rewrite them itself or pass them to something e.g. vState that
	// will rewrite them.
	if x, ok := p.i.(interface{ SetBoundsWithSyntheticPrefix() bool }); ok && x.SetBoundsWithSyntheticPrefix() {

this seems error-prone. What if someone has a typo and causes us to fall into the path below. I guess tests will catch it quickly enough, but what iterator type do we wrap that does not implement SetBoundsWithSyntheticPrefix?

sstable/reader_virtual.go line 137 at r2 (raw file):

	i, err := v.reader.newIterWithBlockPropertyFiltersAndContext(
		ctx, transforms, lower, upper, filterer, useFilterBlock,
		stats, categoryAndQoS, statsCollector, rp, &v.vState)

Seems like since virtualState.{lower,upper} also have SyntheticPrefix, the narrowed bound in virtualState.constrainBounds also have SyntheticPrefix, so the inverted bounds will have ContentPrefix. Is my understanding correct?

RaduBerinde

Reviewable status: 0 of 3 files reviewed, 14 unresolved discussions (waiting on @dt, @itsbilal, @jbowens, and @sumeerbhola)

sstable/prefix_replacing_iterator.go line 157 at r1 (raw file):