
[WIP] *: separated lock table #55664

Closed

sumeerbhola wants to merge 1 commit into cockroachdb:master from sumeerbhola:slt00

Conversation

@sumeerbhola
Collaborator

The change here is large and will be broken up into multiple
PRs after discussion (details below).

The largest and messiest part of this change is the MVCC engine
API (Engine/Reader/Writer/Iterator) change.
I ended up here after 2 failed attempts. Our current
use of the MVCC API is quite indiscriminate. We use it for
real MVCC keys, including intents, for inline meta (timeseries),
and non-MVCC key space. For the last category we use
MakeMVCCMetadataKey to construct an MVCCKey without any timestamps.
This PR tries to clean this up in the following manner:

  • Adds additional methods to the API to differentiate between
    these use cases.
  • Introduces a StorageKey that is a generalization of an MVCCKey
    (see below for why).
    Most callers that are using MVCCKey for the non-versioned key
    space continue to use MVCCKey, since making them switch to an
    API with roachpb.Keys would make the refactor even larger.

I initially tried to keep the API change partly as an addition
in a separate interface, and make callsites wrap Reader/Writer
when they needed the additional interface, but that resulted
in changes in even more places and left me unsure whether
wrapping had been forgotten somewhere. Now the Pebble-based
implementations of Engine/Reader/Writer/Iterator directly
implement these interfaces.

The changes here accommodate the transition from interleaved to
separated intents. Various methods in Reader/Writer/Iterator
know how to make separated intents appear as interleaved and
to make an interleaved intent become separated when rewritten
or vice versa. These are used by mvcc.go and friends -- the
changes there are relatively straightforward and we are not
touching any of the tricky code in mvcc.go.

Another aspect of the change here is the lock table key space
and StorageKey. The latter is a generalization of MVCCKey
that permits us to stuff a UUID into the suffix of the key
(not just hlc.Timestamp), so that prefix iteration can continue
to use a bloom filter when searching for a separated intent.
I think we cannot afford to lose the performance characteristics
of prefix iteration (though we will need to revisit this if
we ever introduce replicated span shared locks, since prefix
iteration will not be sufficient to find those locks).
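The StorageKey idea above could be sketched roughly as follows. This is a hypothetical, simplified illustration (the types, encoding, and sentinel byte are stand-ins, not the PR's actual encoding): the user key carries an opaque suffix that is either an encoded timestamp or a 16-byte txn UUID, and because the user-key prefix is shared, prefix iteration and bloom filters keep working for both.

```go
package main

import (
	"bytes"
	"fmt"
)

// StorageKey (hypothetical sketch): a roachpb.Key-like byte slice paired
// with an opaque suffix. For an MVCC key the suffix is an encoded
// hlc.Timestamp; for a lock table key it is a 16-byte txn UUID.
type StorageKey struct {
	Key    []byte
	Suffix []byte
}

// Encode appends the suffix after the key, terminated by the suffix
// length (simplified here; the real encoding differs).
func (k StorageKey) Encode() []byte {
	buf := append([]byte{}, k.Key...)
	buf = append(buf, 0x00) // sentinel separating key from suffix
	buf = append(buf, k.Suffix...)
	buf = append(buf, byte(len(k.Suffix)+1)) // trailing length byte
	return buf
}

// SamePrefix reports whether two keys share the same user key, which is
// what lets prefix iteration (and bloom filters) find both timestamped
// versions and UUID-suffixed lock table entries for a key.
func SamePrefix(a, b StorageKey) bool {
	return bytes.Equal(a.Key, b.Key)
}

func main() {
	mvcc := StorageKey{Key: []byte("a"), Suffix: []byte{1, 2, 3, 4, 5, 6, 7, 8}} // ~timestamp
	lock := StorageKey{Key: []byte("a"), Suffix: bytes.Repeat([]byte{9}, 16)}    // ~txn UUID
	fmt.Println(SamePrefix(mvcc, lock))
}
```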

The lock table has the following changes:

  • It waits on locks found using a storage.Iterator (this does
    not include the optimization for limited scans which will
    be part of the real PR).
  • finalizedTxnCache is moved into lockTableImpl. When doing
    ScanAndEnqueue if an unreplicated lock is encountered that
    is held by a txn in this cache, it is immediately removed.
    If a replicated lock is encountered a LockUpdate is created
    and the caller does not wait on that lock. At the end of
    ScanAndEnqueue if there are no locks to wait on, the caller
    is given this list of LockUpdates that it must resolve, and
    lockTableGuard.ShouldWait returns true. The caller will release
    latches, resolve these locks and then do another scan. Compared to
    the current batching of resolution that can only batch for a single
    span, this can batch across all spans of the request.
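The finalizedTxnCache behavior above can be sketched as follows. This is a minimal illustration of the described flow, not the lockTableImpl API; the names (`scanAndEnqueue`, `lock`) are hypothetical.

```go
package main

import "fmt"

type lock struct {
	txnFinalized bool // the holder's txn is in the finalizedTxnCache
	replicated   bool
}

// scanAndEnqueue walks the locks covering a request's spans. Locks held
// by finalized txns are not waited on: unreplicated ones are removed
// immediately, replicated ones are accumulated as deferred resolutions
// that the caller must resolve (after releasing latches) before rescanning.
func scanAndEnqueue(locks []lock) (wait []lock, toResolve []lock) {
	for _, l := range locks {
		if l.txnFinalized {
			if l.replicated {
				toResolve = append(toResolve, l)
			}
			// Unreplicated locks of finalized txns are dropped in place.
			continue
		}
		wait = append(wait, l)
	}
	return wait, toResolve
}

func main() {
	locks := []lock{
		{txnFinalized: true, replicated: true},  // deferred resolution
		{txnFinalized: true, replicated: false}, // dropped immediately
		{txnFinalized: false, replicated: true}, // must wait
	}
	wait, toResolve := scanAndEnqueue(locks)
	// If wait is empty, ShouldWait still returns true and the caller
	// resolves toResolve (batched across all spans), then rescans.
	fmt.Println(len(wait), len(toResolve))
}
```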

There are still many TODOs here; some I know how to fix, and for
others (especially in the callers of the engine API, since there
are so many) I will need some input.

My current thinking is that this will split into the following
PRs. For this review I am looking for a sanity check and any
alternative ideas for this MVCC engine API refactor, so I can
work on the second bullet below - unfortunately it will touch
a very large number of files, including tests (tests were not
changed in this WIP PR). The other PRs would be small in comparison.

  • add lock table key changes to keys package and StorageKey.
  • interface changes to storage package and all integration changes.
    The interface implementations will behave as if separated
    intents do not exist. We may not catch integration mistakes in
    calling the new API, but these will not be real bugs yet since
    there are no separated locks/intents.
  • introduce alternative implementations that can handle
    separated locks. These can be done in separate PRs for Engine,
    Reader, Writer, Iterator and will include unit tests of the
    new behavior (separated, interleaved and mix of both).
    From an integration perspective, the separated locks will
    still be turned off.
  • turn on separated locks for all CockroachDB tests
    and fix bugs. Also (somehow) run these tests with mix of
    interleaved and separated intents. This will happen over
    multiple PRs (N tests at a time and corresponding fixes).
  • Lock table changes to observe separated locks including
    optimization for limited scans (the former is included in
    this WIP, but not the latter).

Release note: None

@sumeerbhola sumeerbhola requested a review from a team as a code owner October 17, 2020 20:47
@sumeerbhola sumeerbhola requested review from adityamaru and removed request for a team October 17, 2020 20:47
@cockroach-teamcity
Member

This change is Reviewable

@sumeerbhola sumeerbhola removed request for a team and adityamaru October 17, 2020 20:47
Collaborator

@petermattis petermattis left a comment


I mainly looked at storage/engine.go. My biggest concern is that the Iterator and Writer interfaces were already large, and now they are even larger. Is it possible to have more, but smaller, interfaces? My thinking is that this would make the external usage of these interfaces clearer. For example, if you have a storage.Iterator it isn't clear whether you're iterating over MVCC keys or storage keys. A storage.MVCCIterator would make this clear. I see you mentioned trying some other approaches with separate interfaces, so perhaps this is a non-starter for some reason.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)


pkg/storage/engine.go, line 82 at r1 (raw file):

	// UnsafeStorageKey returns the same value as StorageKey, but the memory is
	// invalidated on the next call to {Next,NextKey,Prev,SeekGE,SeekLT,Close}.
	UnsafeStorageKey() StorageKey

I was hoping at some point that we could adopt Pebble's internal API for returning the key and value from the positioning calls. The advantage to doing that is a reduction in indirect calls that had a measurable perf impact. But the distinction between MVCCKey and StorageKey seems to preclude making that API change. Just thinking out loud here.

Thinking about this more, I think SeekGE and Next could eventually return (StorageKey, []byte) given that there is an easy transformation from StorageKey to MVCCKey.

It is somewhat odd that UnsafeStorageKey() is a method on SimpleIterator while there are no positioning routines on SimpleIterator which take a StorageKey. Could this method be moved to Iterator? Alternately, should SeekStorageGE be move from Iterator to SimpleIterator?


pkg/storage/engine.go, line 116 at r1 (raw file):

	// SeekStorageGE advances the iterator to the first key in the engine which
	// is >= the provided key.
	SeekStorageGE(key StorageKey)

The duplication of the positioning routines really bloats up these interfaces. Do you ever use both sets of positioning routines on the same iterator? I'm wondering if there could be separate MVCCIterator and StorageIterator interfaces (they might both be implemented by the same struct).
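One way the suggested split could look, as a hedged sketch (the interface and struct names are illustrative, not the storage package's actual types): two narrow interfaces satisfied by a single concrete iterator, so a function signature documents which key space it iterates over.

```go
package main

import "fmt"

type MVCCKey struct{ Key []byte }
type StorageKey struct{ Key, Suffix []byte }

// MVCCIterator exposes only MVCC-keyed positioning.
type MVCCIterator interface {
	SeekGE(key MVCCKey)
	UnsafeKey() MVCCKey
}

// StorageIterator exposes only storage-keyed positioning.
type StorageIterator interface {
	SeekStorageGE(key StorageKey)
	UnsafeStorageKey() StorageKey
}

// pebbleIter implements both; callers pick the view they need, gaining
// type safety without duplicating the underlying implementation.
type pebbleIter struct{ cur StorageKey }

func (i *pebbleIter) SeekGE(k MVCCKey)             { i.cur = StorageKey{Key: k.Key} }
func (i *pebbleIter) UnsafeKey() MVCCKey           { return MVCCKey{Key: i.cur.Key} }
func (i *pebbleIter) SeekStorageGE(k StorageKey)   { i.cur = k }
func (i *pebbleIter) UnsafeStorageKey() StorageKey { return i.cur }

func main() {
	it := &pebbleIter{}
	var mi MVCCIterator = it // compile-time proof both views exist
	var si StorageIterator = it
	mi.SeekGE(MVCCKey{Key: []byte("a")})
	fmt.Println(string(si.UnsafeStorageKey().Key))
}
```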


pkg/storage/engine.go, line 293 at r1 (raw file):

	// Get returns the value for the given key, nil otherwise.
	//
	// Deprecated: use MVCCGet instead.

Should we do the work to actually remove usage of these deprecated methods?


pkg/storage/engine.go, line 334 at r1 (raw file):

// NB: In this file "intent" refers to non-inline meta. We also refer to it as
// MVCC-meta.
type PrecedingIntentState int

Does this have to be exported? (I haven't looked at the rest of this change). It would be preferable if it could be an implementation detail of the storage package.


pkg/storage/engine.go, line 360 at r1 (raw file):

	//
	// It is safe to modify the contents of the arguments after Clear returns.
	Clear(key MVCCKey) error

Perhaps rename to ClearMVCCKey. Then ClearKeyWithEmptyTimestamp could become ClearRawKey.


pkg/storage/pebble.go, line 60 at r1 (raw file):

//
// StorageKeyCompare compares cockroach keys, including the MVCC timestamps.
func StorageKeyCompare(a, b []byte) int {

Perhaps we should call this simply storage.Compare. Ditto for storage.Merger.


pkg/storage/storage_key.go, line 11 at r1 (raw file):

)

type StorageKey struct {

Externally, this will show up as storage.StorageKey, which stutters. I don't have a better name to offer at the moment, but I suspect there is one.


pkg/storage/storage_key.go, line 82 at r1 (raw file):

}

func IsMVCCKey(skey StorageKey) bool {

Should this be a method on StorageKey?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


My biggest concern is that the Iterator and Writer interfaces were already large, and now they are even larger. Is it possible to have more, but smaller interfaces?

I think Iterator could be narrowed into one interface that iterates over StorageKeys and another over MVCCKeys, since we can differentiate based on NewIterator(opts IterOptions, iterKind IterKind) Iterator. I came to this differentiation a bit late, and was getting exhausted, so decided to send this out. One question there is whether we want a different interface for iterating over parts of the key space that aren't MVCC at all (they just pretend to be, and always have an empty hlc.Timestamp).

The Writer situation is more problematic. I was initially trying to wrap Writer in the various call-sites based on whether it operated on real MVCC values with intents, or only inline MVCCMetadata. But it was resulting in too many changes in the caller and places I was forgetting to wrap, or a chain of callers causing multiple wrappings. The reason I mention this is that to have separate interfaces for different kinds of Writer we need to somewhere recognize what is needed and "construct" an appropriate Writer. But unlike Iterator, the Engine itself is a Writer (so "construction" would really be "wrapping"). Though maybe we don't really use Engine for most of the Writer calls since it is not an efficient way to do things compared to a Batch. IIUC, Batch is being created in replica_write.go (newBatchedEngine), which is too early to know what the purpose of this batch is.
One way to simplify would be if we could say that all "low-level" writes to the MVCC key space should happen via mvcc.go and other functions in the storage package. So make most of Writer, as exported to other packages, be about writing StorageKey. The package internal interface would be broader, but at least it is not exposed.
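That simplification could be sketched like this; a hypothetical illustration only (the interface names, the `batch` type, and the method set are invented for the example, not the storage package's actual surface): the exported Writer deals only in StorageKeys, while a broader writer stays package-internal for mvcc.go and friends.

```go
package main

import "fmt"

type StorageKey struct{ Key, Suffix []byte }

// Writer is what would be exported from the storage package: callers in
// other packages can only write raw StorageKeys.
type Writer interface {
	PutStorage(key StorageKey, value []byte) error
}

// mvccWriter is the broader, package-internal surface (note the
// unexported method) used for intent-aware MVCC writes.
type mvccWriter interface {
	Writer
	putMVCCMeta(key []byte, value []byte) error
}

type batch struct{ n int }

func (b *batch) PutStorage(StorageKey, []byte) error { b.n++; return nil }
func (b *batch) putMVCCMeta([]byte, []byte) error    { b.n++; return nil }

func main() {
	b := &batch{}
	var w Writer = b      // external packages see only this
	var mw mvccWriter = b // the storage package sees the full surface
	_ = w.PutStorage(StorageKey{Key: []byte("a")}, nil)
	_ = mw.putMVCCMeta([]byte("a"), nil)
	fmt.Println(b.n)
}
```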

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)


pkg/storage/engine.go, line 82 at r1 (raw file):

Thinking about this more, I think SeekGE and Next could eventually return (StorageKey, []byte) given that there is an easy transformation from StorageKey to MVCCKey.

I was somewhat concerned about the performance implications of doing a 2-step conversion instead of parsing the int values directly when it is an hlc.Timestamp.
But this may not matter if we split the Iterator interface.

Contributor

@nvb nvb left a comment


So far, I've only taken a high-level pass, but I intend to give the changes in pkg/kv/kvserver/concurrency more scrutiny once they're a little further along. This all matched my expectations. I appreciate that you didn't shy away from cutting up the Engine interfaces. It's a big change, but I think things would have been trickier to follow if you had tried to preserve the interfaces instead of being explicit. And I'm hopeful that with a clearer split between StorageKey and MVCCKey, and once the migration that removes all handling of interleaved intents is complete, we'll end up in a cleaner state than where we started.

Is there anything, in particular, that you'd like input on about the KV side of this change?

Reviewed 88 of 88 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @itsbilal, @petermattis, and @sumeerbhola)


pkg/keys/constants.go, line 169 at r1 (raw file):

	// reference to a provisional MVCC value.
	LocalRangeLockTablePrefix             = roachpb.Key(makeKey(localPrefix, roachpb.RKey("l")))
	LocalRangeLockTableExclusiveTxnSuffix = roachpb.RKey("exc-")

I'm interested in why you decided to keep each lock strength for a given (key,txn) pair in a separate key. You mention above that the protos may be different for different strengths because an exclusive strength lock needs to do double duty as a reference to a provisional MVCC value. Does this double-duty really necessitate a different proto, instead of a proto with some optional fields?

Is the idea that you'd like to avoid decoding protos when searching for contending locks? So for instance, this encoding scheme allows you to ignore locks with a shared lock strength when acquiring another shared lock strength? If so, this seems smart, as I think it means that all uncontended lock acquisitions can be O(1), even if other compatible locks already exist on the key.

We might want to consider how this impacts ranged locks. As is, I don't think we'd get the same benefit. But maybe we could if we partitioned the entire lock table keyspace by lock strength (i.e. moved the strength to the key prefix)? This comes with a trade-off though, as it pessimizes single-key operations.


pkg/keys/doc.go, line 170 at r1 (raw file):

	// 		`localRangeIDUnreplicatedInfix`.
	// 	  - Range local keys all share `LocalRangePrefix`.
	//    - Range lock (which are also local keys) all share `LocalRangeLockTablePrefix`.

nit: spacing looks off here.


pkg/keys/doc.go, line 217 at r1 (raw file):

	LockTableKeyExclusive,

	//   4. Store local keys: These contain metadata about an individual store.

5.


pkg/keys/keys.go, line 411 at r1 (raw file):

	// doubly-local lock table keys. For example, local range descriptor keys can
	// be locked during split and merge transactions.
	buf := make(roachpb.Key, 0, len(LocalRangeLockTablePrefix)+len(key)+1)

Should this be adding four bytes to accommodate for the suffix, which is always 4 bytes? This question appears to apply to MakeRangeKeyPrefix as well.


pkg/kv/kvserver/replica_raftstorage.go, line 1026 at r1 (raw file):

) error {
	getKeyRanges := func(desc *roachpb.RangeDescriptor) []rditer.KeyRange {
		ranges := make([]rditer.KeyRange, 0, 4)

Should this be 3? Also, I guess we don't need the [:] below now that we're returning a slice.


pkg/kv/kvserver/store_snapshot.go, line 212 at r1 (raw file):

// 1. Replicated range-id local key range
// 2. Range-local key range
// 3. Lock table keys

Could you mention that this will be two separate key ranges?


pkg/kv/kvserver/batcheval/intent_test.go, line 41 at r1 (raw file):

		ie.onNewIterator(opts)
	}
	return ie.Engine.NewIterator(opts, storage.MVCCKeyAndIntentsIterKind)

Is this correct? Should it be passing iterKind?


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 205 at r1 (raw file):

		return nil, m.twq.MaybeWaitForQuery(ctx, t)
	default:
		// TODO(nvanbenschoten): in the future, use this hook to update the lock

Just a heads-up that this is where I was picturing us hooking up "virtual intent resolution" once we move to make intent resolution not acquire latches.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 130 at r1 (raw file):

	toResolve := guard.ResolveBeforeEvaluation()
	if len(toResolve) > 0 {
		w.resolveDeferredIntents(ctx, &err, &toResolve)

Should this be deferred? Or is the intent for it to happen before waiting for other locks?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 147 at r1 (raw file):

	// can't hold reservations), any other requests that slip ahead will simply
	// re-discover the intent(s) during evaluation and resolve them themselves.
	var deferredResolution []roachpb.LockUpdate

Do we still need this?


pkg/storage/engine.go, line 99 at r1 (raw file):

//
// Implementations of Iterator have one of the following behaviors:
// - Only look at the MVCC key space (including inline-meta)

It would be useful to reference the parameter that each of these behaviors corresponds to.


pkg/storage/engine.go, line 116 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The duplication of the positioning routines really bloats up these interfaces. Do you ever use both sets of positioning routines on the same iterator? I'm wondering if there could be separate MVCCIterator and StorageIterator interfaces (they might both be implemented by the same struct).

I like this idea. It also provides some extra type safety.


pkg/storage/engine.go, line 334 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Does this have to be exported? (I haven't looked at the rest of this change). It would be preferable if it could be an implementation detail of the storage package.

I suspect that it does need to be exported so that implementors of Writer can use it, but Sumeer should confirm.


pkg/storage/engine.go, line 360 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps rename to ClearMVCCKey. Then ClearKeyWithEmptyTimestamp could become ClearRawKey.

I like that. We'd want to do the same thing with SingleClear and Put. I expect that if you wanted to make that change, you'd want to hold off for now, because the diff would be large.


pkg/storage/engine.go, line 369 at r1 (raw file):

	// REQUIRES: state is ExistingIntentInterleaved or ExistingIntentSeparated
	ClearMVCCMeta(
		key roachpb.Key, state PrecedingIntentState, possiblyUpdated bool,

s/possiblyUpdated/precedingPossiblyUpdated/ to stay consistent with every other use.


pkg/storage/engine.go, line 444 at r1 (raw file):

	// a key.
	PutMVCCMeta(
		key roachpb.Key, value []byte, state PrecedingIntentState, precedingPossiblyUpdated bool,

Could you include detail about this precedingPossiblyUpdated bool in the comment?


pkg/storage/intent_interleaving_iter.go, line 21 at r1 (raw file):

)

type intentInterleavingIter struct {

I know this is a WIP, but this would be a great place to host a comment explaining how this iter impl fits into the picture.


pkg/storage/intent_interleaving_iter.go, line 26 at r1 (raw file):

	intentIter Iterator
	// The decoded key from the lock table.
	intentKey roachpb.Key

Is it clear from the context that this is unsafe? If not, consider adding that to the name or at least to the comment.


pkg/storage/intent_interleaving_iter.go, line 56 at r1 (raw file):

}

// TODO: fix this to be correct when seeks are not only to keys with empty timestamp.

Is this complete already?


pkg/storage/intent_interleaving_iter.go, line 109 at r1 (raw file):

}

func (i *intentInterleavingIter) tryDecodeLockedKey() error {

nit: "locked" key or "lock" key?


pkg/storage/intent_interleaving_iter.go, line 182 at r1 (raw file):

	}
	if i.intentCmp <= 0 {
		// Currently, there is at most 1 intent for a key, so doing Next() is correct.

So this won't be correct when other locking strengths are added? Or when "virtual intent resolution" is added.

Going back to my earlier comment about partitioning the entire lock table keyspace by lock strength, this might be more motivation to make that change. If exclusive locks were split out then we wouldn't have to worry about them at all in this file. I don't have a good enough understanding to know whether that would be a real win or not.


pkg/storage/meta_reader_writer.go, line 21 at r1 (raw file):

)

// Configuration information used here to decide how to wrap Reader and Writer.

I like how you pulled these decisions into this file. It makes it easier to follow.


pkg/storage/pebble.go, line 1291 at r1 (raw file):

	var iter *pebbleIterator
	if iterKind == StorageKeyIterKind {

I found this to be surprising. Is this really the only case where we create iterators with this kind? If so, should we make the kind more specific and directly reference the lock table?


pkg/storage/rocksdb.go, line 207 at r1 (raw file):

func (r *RocksDB) ClearKeyWithEmptyTimestamp(key roachpb.Key) error {
	panic("implement me")

Or delete me!


pkg/storage/storage_key.go, line 11 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Externally, this will show up as storage.StorageKey, which stutters. I don't have a better name to offer at the moment, but I suspect there is one.

Maybe EngineKey or LSMKey.


pkg/storage/storage_key.go, line 83 at r1 (raw file):

func IsMVCCKey(skey StorageKey) bool {
	return len(skey.Suffix) != 16

We'll want to spell out what the various suffix lengths mean somewhere.

Also, I don't want to distract from this PR, but want to confirm that we're keeping #41720 (comment) in mind while making the decision to use the suffix byte length as the indicator of key type.
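The suffix-length discrimination being discussed could be sketched as follows. A hypothetical illustration mirroring the quoted `IsMVCCKey`: a 16-byte suffix is a txn UUID (lock table key), anything else is an encoded timestamp or empty (MVCC key). The `IsLockTableKey` helper is invented for the example.

```go
package main

import "fmt"

type StorageKey struct{ Key, Suffix []byte }

// IsMVCCKey reports whether the key is an MVCC key: a 16-byte suffix is
// reserved for the txn UUID of a lock table key, so any other length
// (zero, or an encoded timestamp) means MVCC.
func (k StorageKey) IsMVCCKey() bool { return len(k.Suffix) != 16 }

// IsLockTableKey is the complementary check (hypothetical helper).
func (k StorageKey) IsLockTableKey() bool { return len(k.Suffix) == 16 }

func main() {
	versioned := StorageKey{Key: []byte("a"), Suffix: make([]byte, 12)} // ~encoded timestamp
	lock := StorageKey{Key: []byte("a"), Suffix: make([]byte, 16)}      // ~txn UUID
	fmt.Println(versioned.IsMVCCKey(), lock.IsLockTableKey())
}
```

Nathan's caution applies: this only works so long as no future suffix kind also happens to be 16 bytes, which is why padding with a discriminating character is mentioned as a fallback below.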


pkg/storage/enginepb/mvcc.proto, line 76 at r1 (raw file):

  optional util.hlc.LegacyTimestamp merge_timestamp = 7;

  // Set to true iff we can guarantee that the transaction did not do more

🎆

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTRs!
I've addressed some of the comments and added TODOs for some of the others (and I will keep track of the remaining).

Some of my responses have questions that need resolution. I think once that is done, some of the refactor can start. I expect to do some mechanical renaming first.
@itsbilal do you have a rough timeline for merging #55509? I would like to refactor after that is merged.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)


pkg/keys/constants.go, line 169 at r1 (raw file):

I'm interested in why you decided to keep each lock strength for a given (key,txn) pair in a separate key. You mention above that the protos may be different for different strengths because an exclusive strength lock needs to do double duty as a reference to a provisional MVCC value. Does this double-duty really necessitate a different proto, instead of a proto with some optional fields?

I'm not a fan of stuffing things into the same proto if they serve a different purpose. And if the lifetimes of the locks are different and they can mostly go through just a (Put, SingleClear) sequence, that seems better than reading and mutating locks shared by different transactions. Ah, you are not suggesting sharing between different txns, just sharing between different strengths on the same key for the same txn. Since the strength lattice is totally ordered, we shouldn't need multiple strength locks active for the same txn, and separating them may allow SingleClear to be used more frequently (maybe).

Is the idea that you'd like to avoid decoding protos when searching for contending locks? So for instance, this encoding scheme allows you to ignore locks with a shared lock strength when acquiring another shared lock strength? If so, this seems smart, as I think it means that all uncontended lock acquisitions can be O(1), even if other compatible locks already exist on the key.

Do you mean O(1) in the sense of being able to use a bloom filter, or in the sense of avoiding decoding protos? For the former, the shared lock will still need to look for both upgrade and exclusive locks so that doesn't seem to help. Hmm, what do you think about me moving a byte into the part that currently only has the UUID, that specifies the strength? That way we can use a prefix iterator to find all per-key locks for an MVCC Get.

We might want to consider how this impacts ranged locks. As is, I don't think we'd get the same benefit. But maybe we could if we partitioned the entire lock table keyspace by lock strength (i.e. moved the strength to the key prefix)? This comes with a trade-off though, as it pessimizes single-key operations.

I suspect we may need to move ranged locks to a separate section of the key space and maintain an in-memory summary (which would need to be re-initialized on a leaseholder change) in order to not pessimize single-key operations. I could grab another character following the LocalRangeLockTablePrefix character for single key locks, to prepare for this future. So there would be two characters preceding the encoding of the roachpb.Key. How does that sound to you? I don't want to separate by lock strength since that would mean seeking into each lock strength that could conflict and that would require more pebble Iterators that we need to keep coordinated.


pkg/keys/doc.go, line 170 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: spacing looks off here.

Odd. Seems to be a reviewable artifact -- github shows it fine.


pkg/keys/doc.go, line 217 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

5.

Done


pkg/keys/keys.go, line 411 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be adding four bytes to accommodate for the suffix, which is always 4 bytes? This question appears to apply to MakeRangeKeyPrefix as well.

Those 4 bytes are not part of the roachpb.Key. I think I forgot to update some of the comments when I made that change (at least one stale comment in constants.go, which I've now fixed).


pkg/kv/kvserver/replica_raftstorage.go, line 1026 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be 3? Also, I guess we don't need the [:] below now that we're returning a slice.

MakeRangeLockTableKeyRanges returns two KeyRanges. I've changed the interface to return an array instead of a slice to make this explicit.


pkg/kv/kvserver/store_snapshot.go, line 212 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you mention that this will be two separate key ranges?

Done


pkg/kv/kvserver/batcheval/intent_test.go, line 41 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this correct? Should it be passing iterKind?

Done.
This was leftover from a goland refactor of the NewIterator interface -- there are probably other places like this that are broken.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 205 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Just a heads-up that this is where I was picturing us hooking up "virtual intent resolution" once we move to make intent resolution not acquire latches.

Ack


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 130 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be deferred? Or is the intent for it to happen before waiting for other locks?

The resolution should happen before waiting again since the ScanAndEnqueue has not found any reason to wait other than these intents.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 147 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we still need this?

Removed.


pkg/storage/engine.go, line 99 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It would be useful to reference the parameter that each of these behaviors corresponds to.

I've added a TODO which I'll fix when redoing this refactor.


pkg/storage/engine.go, line 116 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I like this idea. It also provides some extra type safety.

I'll split into two.


pkg/storage/engine.go, line 334 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I suspect that it does need to be exported so that implementors of Writer can use it, but Sumeer should confirm.

It could probably stay internal to the storage package -- I'll try that out.


pkg/storage/engine.go, line 360 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I like that. We'd want to do the same thing with SingleClear and Put. I expect that if you wanted to make that change, you'd want to hold off for now, because the diff would be large.

I'd like alternatives to ClearRawKey since we already use "raw" to refer to the underlying Pebble key. The things I can think of are ClearRoachPBKey, ClearNonMVCCKey, which are not great. And there will be a ClearStorageKey (and of course the same will apply to Put, SingleClear). Any suggestions?

If we can agree on names I can produce some initial mechanical PRs that just change the existing names to lay the groundwork for the more complex refactorings.


pkg/storage/engine.go, line 369 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/possiblyUpdated/precedingPossiblyUpdated/ to stay consistent with every other use.

Done


pkg/storage/engine.go, line 444 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you include detail about this precedingPossiblyUpdated bool in the comment?

added a TODO


pkg/storage/intent_interleaving_iter.go, line 21 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I know this is a WIP, but this would be a great place to host a comment explaining how this iter impl fits into the picture.

added a TODO.


pkg/storage/intent_interleaving_iter.go, line 26 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it clear from the context that this is unsafe? If not, consider adding that to the name or at least to the comment.

Done


pkg/storage/intent_interleaving_iter.go, line 56 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this complete already?

Yes, forgot to remove.


pkg/storage/intent_interleaving_iter.go, line 109 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: "locked" key or "lock" key?

Changed.


pkg/storage/intent_interleaving_iter.go, line 182 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So this won't be correct when other locking strengths are added? Or when "virtual intent resolution" is added.

Going back to my earlier comment about partitioning the entire lock table keyspace by lock strength, this might be more motivation to make that change. If exclusive locks were split out then we wouldn't have to worry about them at all in this file. I don't have a good enough understanding to know whether that would be a real win or not.

I assume mostly there won't be any locks, so having a single lock space for all strengths allows us to confirm that with a single iterator on that lock table (and a single seek -- let's ignore range locks for now). The downside is that if there are many non-exclusive replicated locks, the interleaving iterator will need to step through them when evaluating limited scans (I am thinking post-transition, when there are no interleaved intents).
I'm leaning towards the first. Thoughts?
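The interleaving idea under discussion can be sketched as a two-way merge. This is a deliberately simplified illustration (plain sorted strings instead of lock table and MVCC iterators): an intent for key k is surfaced just before k's versioned values, so the caller sees separated intents as if they were interleaved.

```go
package main

import "fmt"

// interleave merges a sorted list of intent keys (from the lock table
// iterator) with a sorted list of versioned keys (from the MVCC
// iterator). On a tie, the intent comes first, mirroring how an
// interleaved intent sorts before its provisional value. This relies on
// there being at most one intent per key.
func interleave(intents, versions []string) []string {
	var out []string
	i, j := 0, 0
	for i < len(intents) || j < len(versions) {
		if j == len(versions) || (i < len(intents) && intents[i] <= versions[j]) {
			out = append(out, "intent:"+intents[i])
			i++
		} else {
			out = append(out, "value:"+versions[j])
			j++
		}
	}
	return out
}

func main() {
	fmt.Println(interleave([]string{"b"}, []string{"a", "b", "c"}))
}
```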


pkg/storage/pebble.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps we should call this simply storage.Compare. Ditto for storage.Merger.

I've added a TODO.


pkg/storage/pebble.go, line 1291 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I found this to be surprising. Is this really the only case where we create iterators with this kind? If so, should we make the kind more specific and directly reference the lock table?

Yes, this can be a bit confusing. replica_data_iter.go also creates StorageKeyIterKind iterators. But IIUC, this iterator reuse optimization is for the BatchRequest evaluation path. I've added a comment in pebbleReadOnly and renamed these to say {normal,prefix}StorageIter.


pkg/storage/storage_key.go, line 11 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Maybe EngineKey or LSMKey.

I'm disinclined towards LSMKey since that suggests a raw key ([]byte). I can change to EngineKey (added a TODO).


pkg/storage/storage_key.go, line 82 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should this be a method on StorageKey?

Done


pkg/storage/storage_key.go, line 83 at r1 (raw file):

We'll want to spell out what the various suffix lengths mean somewhere.

added a TODO.

Also, I don't want to distract from this PR, but want to confirm that we're keeping #41720 (comment) in mind while making the decision to use the suffix byte length as the indicator of key type.

I'm hoping we can continue distinguishing based on suffix length -- if future additions make that not naively possible we could pad with a character.

@petermattis petermattis left a comment

Some of my responses have questions that need resolution. I think once that is done, some of the refactor can start. I expect to do some mechanical renaming first.

I think I addressed the comments you had questions on, but if I missed something, ping me on the code that needs my attention.

I'm thumbs-up on doing separate PRs with mechanical renaming and introducing the StorageKey (nee EngineKey) interfaces. Let's keep that separate from the separated lock table stuff so we can properly bikeshed the interface adjustments. This PR has been useful to see how it will all tie together, though.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)


pkg/storage/engine.go, line 82 at r1 (raw file):

Previously, sumeerbhola wrote…

Thinking about this more, I think SeekGE and Next could eventually return (StorageKey, []byte) given that there is an easy transformation from StorageKey to MVCCKey.

I was somewhat concerned about the performance implications of doing a 2-step conversion instead of parsing the int values directly when it is an hlc.Timestamp.
But this may not matter if we split the Iterator interface.

Ack on the perf implications (which we need to be sensitive to). I think that won't be a problem with several Iterator interfaces for iterating over MVCCKeys vs StorageKeys.


pkg/storage/engine.go, line 360 at r1 (raw file):

Previously, sumeerbhola wrote…

I'd like alternatives to ClearRawKey since we already use "raw" to refer to the underlying Pebble key. The things I can think of are ClearRoachPBKey, ClearNonMVCCKey, which are not great. And there will be a ClearStorageKey (and of course the same will apply to Put, SingleClear). Any suggestions?

If we can agree on names I can produce some initial mechanical PRs that just change the existing names to lay the groundwork for the more complex refactorings.

ClearUnversionedKey?


pkg/storage/engine.go, line 479 at r2 (raw file):

// Engine is the interface that wraps the core operations of a key/value store.
type Engine interface {
	ReadWriter

Related to all this work, and something that came up in the storage meeting today: should we work to change the Engine interface to only implement Writer. That would force all write operations to use NewBatch or NewWriteOnlyBatch. I think it may also allow NewReadOnly to return a Reader instead of a ReadWriter, which would be a fantastic outcome.

@sumeerbhola sumeerbhola left a comment

@nvanbenschoten @petermattis

The question (a summary of what was in my earlier responses) left to be answered is about the arrangement of the lock table key space. I am currently inclined towards:

  • grab a 2 byte prefix: the usual 1 byte + a second byte to represent that these are single key locks (not range locks). So range locks, if/when we add them, will not be intermingled with single key locks. Range locks do not permit the bloom filter optimization, so we may consider optimizations like a new leaseholder reading that part of the key space and maintaining an in-memory summary -- this reading would be cheaper if these were not mixed in with intents.

  • instead of a 16 byte suffix in the versioned part for the UUID, use 17 bytes. The first byte would be the strength, which would be followed by the UUID. This would allow a Get to use bloom filters even when we introduce multiple lock strengths for single key locks.



pkg/storage/engine.go, line 360 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

ClearUnversionedKey?

I like that.


pkg/storage/engine.go, line 479 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Related to all this work, and something that came up in the storage meeting today: should we work to change the Engine interface to only implement Writer. That would force all write operations to use NewBatch or NewWriteOnlyBatch. I think it may also allow NewReadOnly to return a Reader instead of a ReadWriter, which would be a fantastic outcome.

I assume you meant not implement Writer, except for ApplyBatchRepr which we need for commit? I've added a TODO to explore that.

pebbleReadOnly only implementing Reader seems unrelated -- IIUC, it is because the interfaces in KV are using a ReadWriter because the code paths for a read-only batch and a read-write batch are the same.

@nvanbenschoten nvanbenschoten left a comment

Reviewed 2 of 16 files at r2, 2 of 2 files at r3.


pkg/keys/constants.go, line 169 at r1 (raw file):

Since the strength lattice is totally ordered we shouldn't need multiple strength locks active for the same txn, and separating them may allow for SingleClear to be used more frequently (maybe).

I could still see an argument for keeping them separate in order to support ROLLBACK operations. Or if not, we'll probably need a sequence history in each value, similar to what we have today in the lock table.

Do you mean O(1) in the sense of being able to use a bloom filter, or in the sense of avoiding decoding protos? For the former, the shared lock will still need to look for both upgrade and exclusive locks so that doesn't seem to help.

I meant O(1) in terms of never needing to search through multiple lock keys that do not contend with the current operation. If we partitioned the lock table entirely by lock strength, then the existence of any key in a given key span would be enough to indicate lock contention.

But I think I convinced myself above that this is never necessary, even without a lock keyspace partitioned by strength. The reason for this is that the existence of even 1 shared lock proves that no upgrade/exclusive lock exists on the same key, so a read will never need to search through a collection of shared locks in order to determine whether there is a conflicting exclusive lock.

This argument may not quite apply when we start exploring "disabling" locks during latch-less intent resolution.

Hmm, what do you think about me moving a byte that specifies the strength into the part that currently only has the UUID? That way we can use a prefix iterator to find all per-key locks for an MVCC Get.

You're saying that you would move the existing LocalRangeLockTableExclusiveTxnSuffix byte into the StorageKey.Suffix so that it would be ignored by bloom filters, enabling a single lookup in a bloom filter to search for a lock of any strength on a key? And so this wouldn't actually be changing the size of the entire key, just the size of the storage key suffix? I like that idea.

I do feel like I'm missing some terminology to discuss this concept. How is the key suffix that is ignored by bloom filters usually discussed? As "versions"?

I suspect we may need to move ranged locks to a separate section of the key space and maintain an in-memory summary (which would need to be re-initialized on a leaseholder change) in order to not pessimize single-key operations. I could grab another character following the LocalRangeLockTablePrefix character for single key locks, to prepare for this future. So there would be two characters preceding the encoding of the roachpb.Key. How does that sound to you? I don't want to separate by lock strength since that would mean seeking into each lock strength that could conflict and that would require more pebble Iterators that we need to keep coordinated.

That makes sense to me. It also means that if we restricted ranged locks to only a subset of locking strengths, we could avoid the need for some operations to search this "ranged" section. For instance, if we only allowed Shared ranged locks, non-locking reads would not need to consult this part of the lock table at all.


pkg/keys/keys.go, line 411 at r1 (raw file):

Previously, sumeerbhola wrote…

Those 4 bytes are not part of the roachpb.Key. I think I forgot to update some of the comments when I made that change (at least one stale comment in constants.go, which I've now fixed).

Oh, I see. I thought the +1 was to make room for the suffix that is appended one level up. But it's actually for the bytesMarker prefixed in EncodeBytesAscending.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 130 at r1 (raw file):

Previously, sumeerbhola wrote…

The resolution should happen before waiting again since the ScanAndEnqueue has not found any reason to wait other than these intents.

Ack.


pkg/storage/intent_interleaving_iter.go, line 182 at r1 (raw file):

Previously, sumeerbhola wrote…

I assume mostly there won't be any locks, so having a single lock space for all strengths allows us to confirm that with a single iterator on that lock table (and a single seek -- let's ignore range locks for now). The downside is that if there are many non-exclusive replicated locks, the interleaving iterator will need to step through the others when evaluating limited scans (I am thinking post-transition, when there are no interleaved intents).
I'm leaning towards the first. Thoughts?

I think I'm leaning towards the first as well. I assume we could have a fast-path similar to the one we have in MVCC that would iterate a few times before seeking to skip past all of the shared locks on a given key. This works because we know that if shared locks exist on a key, no exclusive locks can also exist on that key. This would at least place an upper bound on how disruptive a build-up of shared locks on a single key can be to other readers.


pkg/storage/storage_key.go, line 83 at r1 (raw file):

Previously, sumeerbhola wrote…

We'll want to spell out what the various suffix lengths mean somewhere.

added a TODO.

Also, I don't want to distract from this PR, but want to confirm that we're keeping #41720 (comment) in mind while making the decision to use the suffix byte length as the indicator of key type.

I'm hoping we can continue distinguishing based on suffix length -- if future additions make that not naively possible we could pad with a character.

Ack. So adding the extra strength byte for lock keys also helps for this, right? Because then we've freed up length 16 again?

@sumeerbhola sumeerbhola left a comment



pkg/keys/constants.go, line 169 at r1 (raw file):

How is the key suffix that is ignored by bloom filters usually discussed? As "versions"?

I think we call it the timestamp now. I was going to use "versions" even though it isn't quite appropriate, so happy to hear alternatives.


pkg/keys/keys.go, line 411 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Oh, I see. I thought the +1 was to make room for the suffix that is appended one level up. But it's actually for the bytesMarker prefixed in EncodeBytesAscending.

Yes, but this seems dubious since it doesn't account for the \x00\x01 terminator (I'd reused the logic from MakeRangeKeyPrefix without looking carefully).


pkg/storage/intent_interleaving_iter.go, line 182 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think I'm leaning towards the first as well. I assume we could have a fast-path similar to the one we have in MVCC that would iterate a few times before seeking to skip past all of the shared locks on a given key. This works because we know that if shared locks exist on a key, no exclusive locks can also exist on that key. This would at least place an upper bound on how disruptive a build-up of shared locks on a single key can be to other readers.

Ack


pkg/storage/storage_key.go, line 83 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Ack. So adding the extra strength byte for lock keys also helps for this, right? Because then we've freed up length 16 again?

Yep

The change here is large and will be broken up into multiple
PRs after discussion (details below).

The largest and messiest part of this change is the MVCC engine
API (Engine/Reader/Writer/Iterator) change.
I ended up here after 2 failed attempts. Our current
use of the MVCC API is quite indiscriminate. We use it for
real MVCC keys, including intents, for inline meta (timeseries),
and non-MVCC key space. For the last category we use
MakeMVCCMetadataKey to construct an MVCCKey without any timestamps.
This PR tries to clean this up in the following manner:
- Adds additional methods to the API to differentiate between
  these use cases.
- Introduces a StorageKey that is a generalization of an MVCCKey
  (see below for why).
Most callers that are using MVCCKey for the non-versioned key
space continue to use MVCCKey, since making them switch to an
API with roachpb.Keys would make the refactor even larger.

I initially tried to keep the API change partly as an addition
in a separate interface, and make callsites wrap Reader/Writer
when they needed the additional interface, but that resulted
in changes in even more places, and not being sure if
wrapping had been forgotten. Now the Pebble-based implementations
of Engine/Reader/Writer/Iterator directly implement these
interfaces.

The changes here accommodate the transition from interleaved to
separated intents. Various methods in Reader/Writer/Iterator
know how to make separated intents appear as interleaved and
to make an interleaved intent become separated when rewritten
or vice versa. These are used by mvcc.go and friends -- the
changes there are relatively straightforward and we are not
touching any of the tricky code in mvcc.go.

Another aspect of the change here is the lock table key space
and StorageKey. The latter is a generalization of MVCCKey
that permits us to stuff a UUID into the suffix of the key
(not just hlc.Timestamp), so that prefix iteration can continue
to use a bloom filter when searching for a separated intent.
I think we cannot afford to lose the performance characteristics
of prefix iteration (though we will need to revisit this if
we ever introduce replicated span shared locks, since prefix
iteration will not be sufficient to find those locks).

The lock table has the following changes:
- It waits on locks found using a storage.Iterator (this does
  not include the optimization for limited scans which will
  be part of the real PR).
- finalizedTxnCache is moved into lockTableImpl. When doing
  ScanAndEnqueue if an unreplicated lock is encountered that
  is held by a txn in this cache, it is immediately removed.
  If a replicated lock is encountered a LockUpdate is created
  and the caller does not wait on that lock. At the end of
  ScanAndEnqueue if there are no locks to wait on, the caller
  is given this list of LockUpdates that it must resolve, and
  lockTableGuard.ShouldWait returns true. The caller will release
  latches, resolve these locks and then do another scan. Compared to
  the current batching of resolution that can only batch for a single
  span, this can batch across all spans of the request.

There are still many TODOs here, some of which I know how to
fix, and some of which (especially in the callers of the engine API,
since there are so many) I will need input on.

My current thinking is that this will split into the following
PRs. For this review I am looking for a sanity check and any
alternative ideas for this MVCC engine API refactor, so I can
work on the second bullet below - unfortunately it will touch
a very large number of files, including tests (tests were not
changed in this WIP PR). The other PRs would be small in comparison.
- add lock table key changes to keys package and StorageKey.
- interface changes to storage package and all integration changes.
  The interface implementations will behave as if separated
  intents do not exist. We may not catch integration mistakes in
  calling the new API, but these will not be real bugs yet since
  there are no separated locks/intents.
- introduce alternative implementations that can handle
  separated locks. These can be done in separate PRs for Engine,
  Reader, Writer, Iterator and will include unit tests of the
  new behavior (separated, interleaved and mix of both).
  From an integration perspective, the separated locks will
  still be turned off.
- turn on separated locks for all CockroachDB tests
  and fix bugs. Also (somehow) run these tests with mix of
  interleaved and separated intents. This will happen over
  multiple PRs (N tests at a time and corresponding fixes).
- Lock table changes to observe separated locks including
  optimization for limited scans (the former is included in
  this WIP, but not the latter).

Release note: None
@irfansharif irfansharif removed their request for review February 11, 2021 18:14
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@knz knz marked this pull request as draft May 7, 2021 09:08
@knz knz removed the X-noremind Bots won't notify about PRs with X-noremind label May 7, 2021