storage/concurrency: lock table for tracking replicated and unreplicated locks and sequencing requests #43740
…totype SFU

Informs cockroachdb#41720.

This commit creates a new concurrency package that provides a concurrency manager structure that encapsulates the details of concurrency control and contention handling for serializable key-value transactions. Any reader of this commit should start at `concurrency_control.go` and move out from there.

The new package has a few primary objectives:
1. centralize the handling of request synchronization and transaction contention handling in a single location, allowing for the topic to be documented and understood in isolation.
2. rework contention handling to react to intent state transitions directly. This simplifies the transaction queueing story, reduces the frequency of transaction push RPCs, and allows waiters to proceed after intent resolution as soon as possible.
3. create a framework that naturally permits "update" locking, which is required for kv-level SELECT FOR UPDATE support (cockroachdb#6583).
4. provide stronger guarantees around fairness when transactions conflict, in order to reduce tail latencies under contended scenarios.
5. create a structure that can extend to address the long-term goals of a fully centralized lock-table laid out in cockroachdb#41720.

WARNING: this is still a WIP. Notably, the lockTableImpl is mocked out with a working but incomplete implementation. See cockroachdb#43740 for a more complete strawman.

Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
```go
nextSeqNum uint64

// Containers for lockState structs.
locks [spanset.NumSpanScope]btree.BTree
```
I looked into replacing `btree.BTree` with something akin to the `spanlatch.btree`. I don't think it directly applies. `spanlatch.btree` is making an implicit queue such that one is waiting behind all the conflicting latches already in the btree, computed by taking a snapshot of the btree. In the `lockTableImpl` the ordering is defined by the `seqNum`, and a `scanAndEnqueue` call can cause a request to be inserted before an existing request in a queue. So we need explicit queues per lock (where the locks are on points, not intervals). The explicit queues are simple wrt synchronization -- just use a per-`lockState` mutex to reduce contention. Iterating over a btree needs a `RWMutex` for the btree to be held in read-mode. Even a lock release can make do with holding this in read-mode -- any `lockState` that needs to be garbage collected can be added to a list of potential garbage that is periodically handled by a background thread that acquires the btree `RWMutex` in write-mode.

The user-facing activities that need the btree `RWMutex` in write-mode are (a) acquiring a lock that was not already in the btree and (b) adding a discovered lock. If we are worried about read-write contention on the btree mutex, we could eliminate it via copy-on-write: when doing `scanAndEnqueue` the request is holding spanlatches, so no other conflicting request can be acquiring or releasing locks or adding to a queue in the `lockTable`. So take a snapshot of the btree and release the btree mutex. Then iterate over this snapshot, add to queues, and make reservations (using the per-`lockState` mutex). If the request successfully reserved everywhere, it can proceed to evaluation. Otherwise it uses this snapshot for iterating and waiting. The next call to `scanAndEnqueue` will acquire a new snapshot.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, sumeerbhola wrote…
> I looked into replacing `btree.BTree` with something akin to the `spanlatch.btree`. I don't think it directly applies. `spanlatch.btree` is making an implicit queue such that one is waiting behind all the conflicting latches already in the btree, computed by taking a snapshot of the btree. In the `lockTableImpl` the ordering is defined by the `seqNum`, and a `scanAndEnqueue` call can cause a request to be inserted before an existing request in a queue. So we need explicit queues per lock (where the locks are on points, not intervals). The explicit queues are simple wrt synchronization -- just use a per-`lockState` mutex to reduce contention. Iterating over a btree needs a `RWMutex` for the btree to be held in read-mode. Even a lock release can make do with holding this in read-mode -- any `lockState` that needs to be garbage collected can be added to a list of potential garbage that is periodically handled by a background thread that acquires the btree `RWMutex` in write-mode.
>
> The user-facing activities that need the btree `RWMutex` in write-mode are (a) acquiring a lock that was not already in the btree and (b) adding a discovered lock. If we are worried about read-write contention on the btree mutex, we could eliminate it via copy-on-write: when doing `scanAndEnqueue` the request is holding spanlatches, so no other conflicting request can be acquiring or releasing locks or adding to a queue in the `lockTable`. So take a snapshot of the btree and release the btree mutex. Then iterate over this snapshot, add to queues, and make reservations (using the per-`lockState` mutex). If the request successfully reserved everywhere, it can proceed to evaluation. Otherwise it uses this snapshot for iterating and waiting. The next call to `scanAndEnqueue` will acquire a new snapshot.
I wasn't terribly thrilled with `btree.BTree` when I looked at using it in Pebble at one point. Something about the code diverging too much from the original C++ btree code it was based on, though the precise reasons escape me. See cockroachdb/pebble@d99331e for the old Pebble btree code which was the basis for `spanlatch.btree`. I know Nathan made a number of changes to `spanlatch.btree` when he forked it from the Pebble code. It may be better to fork `spanlatch.btree` and adapt it to the needs of the code here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
I think this is missing the bigger point that we talked about yesterday which is that there is a strong desire to be able to capture a read-only snapshot of the lock table state (but not necessarily the wait-queue on each lock) that is held throughout request evaluation. This is what enables a number of improvements in #41720 including being able to resolve intents without waiting for a full consensus round.
> It may be better to fork `spanlatch.btree` and adapt it to the needs of the code here.
I was thinking the same thing. We've done this dance one too many times and I'm thinking about making the dive and using https://github.com/mmatczuk/go_generics (used by https://github.com/scylladb/go-set). Thoughts?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
> I was thinking the same thing. We've done this dance one too many times and I'm thinking about making the dive and using https://github.com/mmatczuk/go_generics (used by https://github.com/scylladb/go-set). Thoughts?
Worth a try.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
> I was thinking the same thing. We've done this dance one too many times and I'm thinking about making the dive and using https://github.com/mmatczuk/go_generics (used by https://github.com/scylladb/go-set). Thoughts?
>
> Worth a try.
We may be agreeing for the most part: the copy-on-write alternative I outlined at the end is a fork of `spanlatch.btree` and not `btree.BTree`. If we can have a generic one, I'd be happy to use that instead of forking: the lockTable currently does not need the ability to store spans but eventually may (if/when there are read locks on spans).

I didn't understand the part about "missing the bigger point": (a) I don't see how a setup where queues can be created after the first call to `scanAndEnqueue()` by a request, that this request eventually realizes it needs to wait on, allows a snapshot taken in that first call to `scanAndEnqueue()` to be reused for subsequent calls to `scanAndEnqueue()` -- I think one needs to snapshot again. (b) The lifetime of the snapshot shouldn't affect when the intents can be resolved. (c) Your prototype in #41720 appears to use `btree.BTree` and simple synchronization -- if that is working well for a contended YCSB workload, are we prematurely optimizing the btree implementation by discussing snapshots and copy-on-write?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
My “missing the bigger point” came off stronger than I was intending. You're definitely not missing the point here 😃
> I don't see how a setup where queues can be created after the first call to `scanAndEnqueue()` by a request, that this request eventually realizes it needs to wait on, allows for a snapshot taken in that first call to `scanAndEnqueue()` to be reused for subsequent calls to `scanAndEnqueue()` -- I think one needs to snapshot again
Yes, exactly: we'll need to take another snapshot after we've re-acquired latches, before subsequent calls to `scanAndEnqueue`.
> The lifetime of the snapshot shouldn't affect when the intents can be resolved
I don't think I understand this comment. The idea of the snapshot is that it allows us to capture the state of the lock table and reference it for free throughout the evaluation of a request. This gives us the opportunity to "virtualize" the resolution of an intent without needing to physically resolve it. See #41720 (comment) for more on this idea.
> your prototype in #41720 appears to use `btree.BTree` and simple synchronization -- if that is working well for a contended YCSB workload are we prematurely optimizing the btree implementation by discussing snapshots and copy-on-write?
I wouldn't want to merge that though. See the warning in the commit and PR description. The latch manager made pretty large improvements to throughput on large machines because it reduced synchronization between non-conflicting requests on a Range. I wouldn't want to regress that here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
> The lifetime of the snapshot shouldn't affect when the intents can be resolved
>
> I don't think I understand this comment. The idea of the snapshot is that it allows us to capture the state of the lock table and reference it for free throughout the evaluation of a request. This gives us the opportunity to "virtualize" the resolution of an intent without needing to physically resolve it. See #41720 (comment) for more on this idea.
I must have misunderstood what you were saying earlier -- I think we are consistent wrt "virtualizing" the intent resolution: once this `lockTable` is also managing the segregated replicated locks there would be a (bad names) `releaseButDoNotRemoveReplicatedLock()` method (in addition to a `releaseReplicatedLock()` which would clean up the persistent state), which would just mark the in-memory `lockState` as released and process the queued entries.

Should I hold off on forking the `spanlatch.btree`, or if you are busy do you want me to look at using go generics for that fork?
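The "release but do not remove" idea could be sketched roughly as follows (the type and method names here are invented for illustration; only the two-path release distinction comes from the comment above):

```go
package main

import "fmt"

// Sketch of the two release paths: one clears only the in-memory view so
// queued entries can proceed, the other also cleans up persistent state.

type lockState struct {
	key       string
	held      bool // in-memory view: is the lock still blocking waiters?
	persisted bool // is there still a replicated (persistent) intent?
	waitQueue []string
}

// releaseButDoNotRemove marks the in-memory lock as released and hands back
// the queued entries to process, leaving the persistent intent in place.
func (ls *lockState) releaseButDoNotRemove() []string {
	ls.held = false
	woken := ls.waitQueue
	ls.waitQueue = nil
	return woken
}

// release also cleans up the persistent state.
func (ls *lockState) release() []string {
	ls.persisted = false
	return ls.releaseButDoNotRemove()
}

func main() {
	ls := &lockState{key: "a", held: true, persisted: true, waitQueue: []string{"req1", "req2"}}
	woken := ls.releaseButDoNotRemove()
	fmt.Println(len(woken), ls.persisted) // waiters proceed; intent still persisted
}
```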
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 269 at r1 (raw file):
```go
// corresponding intent has been applied -- at that point the lockTable could
// GC these locks since the persistent state will allow it to discover them
// again when there is contention.
addDiscoveredExclusiveLock(key roachpb.Key, txn *enginepb.TxnMeta, ts hlc.Timestamp) error
```
It's interesting that you don't immediately enter the wait-queue (but don't wait) for this discovered lock. That's how #43775 works, and I think the justification was that it helped ensure a degree of fairness between the first request to observe an intent and other requests that followed. Does the request seq num help avoid these kinds of inversions?
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
> I must have misunderstood what you were saying earlier -- I think we are consistent wrt "virtualizing" the intent resolution: once this `lockTable` is also managing the segregated replicated locks there would be a (bad names) `releaseButDoNotRemoveReplicatedLock()` method (in addition to a `releaseReplicatedLock()` which would clean up the persistent state), which would just mark the in-memory `lockState` as released and process the queued entries.
Yes, that's exactly right.
> Should I hold off on forking the `spanlatch.btree` or if you are busy do you want me to look at using go generics for that fork?
I'm hoping to have something out tonight that does this which I can hand off to you, so feel free to hold off. It's actually all working fairly well.
Besides that, my major takeaways from this prototype so far are:

- We can remove the persistence aspects of this prototype for now, as we came to the decision during our conversation earlier this week that we aren't intending to persist the unreplicated portion of the lock table, at least not in the initial draft.
- I'm interested in seeing how you envision the `queueHandle` structure fitting in with pushing transaction records to check for txn liveness and deadlock detection. This would correspond to the `lockWaitQueueWaiter` structure in [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775. We discussed earlier this week that we may be able to avoid a distinguished waiter entirely.
- Similarly, I think it would be worthwhile trying to fit this strawman into the interface and type defs provided in cbe5c2e#diff-d323684eb8ef832308d443e5fd7e55ba. This will reveal whether our two approaches have any impedance mismatches and whether the `lockTable` interface that the rest of [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775 works with is missing anything.
…-Tree See cockroachdb#43740. This PR pulls the existing immutable interval B-Tree structure out of the spanlatch package and into a utility library. In order to avoid a loss of both type safety and performance, it opts for code generation by using https://github.com/mmatczuk/go_generics. The new util library exposes a script that is meant to be invoked by go generate to create specializations of the ordered tree structure for individual use cases. This ensures that each use can achieve the performance of being hand-written without needing to continue the pattern of copying the btree code and manually modifying it each time we want to use it.
Force-pushed from 9940538 to 6eba50d.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 269 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> It's interesting that you don't immediately enter the wait-queue (but don't wait) for this discovered lock. That's how #43775 works, and I think the justification was that it helped ensure a degree of fairness between the first request to observe an intent and other requests that followed. Does the request seq num help avoid these kinds of inversions?
As we discussed yesterday, the comment was stale -- it does enter the wait-queue immediately.
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> I must have misunderstood what you were saying earlier -- I think we are consistent wrt "virtualizing" the intent resolution: once this `lockTable` is also managing the segregated replicated locks there would be a (bad names) `releaseButDoNotRemoveReplicatedLock()` method (in addition to a `releaseReplicatedLock()` which would clean up the persistent state), which would just mark the in-memory `lockState` as released and process the queued entries.
>
> Yes, that's exactly right.
>
> Should I hold off on forking the `spanlatch.btree` or if you are busy do you want me to look at using go generics for that fork?
>
> I'm hoping to have something out tonight that does this which I can hand off to you, so feel free to hold off. It's actually all working fairly well.
>
> Besides that, my major takeaways from this prototype so far are:
>
> - We can remove the persistence aspects of this prototype for now, as we came to the decision during our conversation earlier this week that we aren't intending to persist the unreplicated portion of the lock table, at least not in the initial draft.
> - I'm interested in seeing how you envision the `queueHandle` structure fitting in with pushing transaction records to check for txn liveness and deadlock detection. This would correspond to the `lockWaitQueueWaiter` structure in [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775. We discussed earlier this week that we may be able to avoid a distinguished waiter entirely.
> - Similarly, I think it would be worthwhile trying to fit this strawman into the interface and type defs provided in cbe5c2e#diff-d323684eb8ef832308d443e5fd7e55ba. This will reveal whether our two approaches have any impedance mismatches and whether the `lockTable` interface that the rest of [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775 works with is missing anything.
After spending a few hours rewriting parts of this code yesterday and this morning to introduce read reservations, and noticing a previously overlooked issue, I am having self-doubt about the value of the fairness enforced by the queuing.

First, the previously overlooked issue. The code here has the following queueing behavior: a request queues if it conflicts with anyone ahead of it. The ahead-of relationship is defined based on the seq-num, and the lock holder is always ahead of everyone. If it does not queue, it makes sure it is remembered by adding a reservation. A reservation is just a marker that says "don't go ahead of me if you conflict, unless you can break my reservation" -- it prevents the sequencing of requests from being decided by who wins the race to acquire spanlatches when a lock is released. With these semantics a request that is queued is not necessarily waiting-for the lock holder or reservation holders.

Say there is a write reservation with ts=20, the first queued request is a write at ts=10, and the next queued request is a read at ts=12. The read is not blocked by the write reservation. From a deadlock detection perspective we can pretend the read at ts=12 is blocked by the write at ts=20, in that we can prove that if all txns in the system are blocked we will detect a cycle, but that theoretical guarantee is not very useful since it can delay detecting the useful cycles when the system is not yet completely stalled. And more importantly, pushing the write at ts=20 is not going to help this reader. So the wait-for tracking needs to track who the read at ts=12 is actually conflicting with, which is the write at ts=10. This is mildly annoying wrt code but is doable. Note that the bad case of someone waiting-for a read by a transaction T that does not have a TxnMeta does not occur, since readers don't wait for other readers and a writer waiting-for a read-by-T must also be waiting-for a writer before the read-by-T.

But I want to make sure the above queueing behavior is what we desire.

In the current system an intent at ts=20 would not prevent the read at ts=12 from proceeding, since it would not know about the write at ts=10. So we are reducing concurrency by adding a queue. I tried to work out the code consequences of a different behavior where a request waits only if it conflicts with the lock holder or existing reservations (that it cannot break) -- there are some complexities, like "break reservations only if you can break the reservations of everyone you conflict with"; whenever a waited-on request (reserver or lock holder) goes away there is churn, in that one starts with the smallest waiter and sees who it will evict, then the next one, and so on. So this could be expensive and I am not keen on this option either.
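To make the wait-for subtlety concrete, here is a toy model (all types and the conflict rule encoding are invented for illustration) computing the actual push target for the example above: the read at ts=12 conflicts with the queued write at ts=10, not with the reservation at ts=20.

```go
package main

import "fmt"

// Toy model of a lock's queue under the seq-num ordering described above.
// A read conflicts with a write at or below its timestamp; a write conflicts
// with any write ahead of it.

type kind int

const (
	read kind = iota
	write
)

type entry struct {
	name string
	k    kind
	ts   int
}

func conflicts(req, ahead entry) bool {
	switch {
	case req.k == read && ahead.k == write:
		// A read conflicts with a write at or below its timestamp.
		return ahead.ts <= req.ts
	case req.k == write && ahead.k == write:
		return true
	case req.k == write && ahead.k == read:
		// A write conflicts with reads at or above its timestamp.
		return ahead.ts >= req.ts
	default:
		// Reads don't conflict with reads.
		return false
	}
}

// pushTargetFor returns the first entry ahead of req that req actually
// waits for, or nil if there is no conflict.
func pushTargetFor(req entry, ahead []entry) *entry {
	for i := range ahead {
		if conflicts(req, ahead[i]) {
			return &ahead[i]
		}
	}
	return nil
}

func main() {
	ahead := []entry{
		{"write-reservation", write, 20},
		{"queued-write", write, 10},
	}
	req := entry{"read", read, 12}
	fmt.Println(pushTargetFor(req, ahead).name) // queued-write
}
```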
This commit introduces a new `concurrency/lock` package that provides type definitions for locking-related concepts used by concurrency control in the key-value layer. The package initially provides a lock `Strength` enumeration consisting of `Shared`, `Upgrade`, and `Exclusive` and a lock `Durability` enumeration consisting of `Unreplicated` and `Replicated`. These definitions are currently unused but will be referenced soon by cockroachdb#43740 and cockroachdb#43775. The changes around SELECT FOR UPDATE in the optimizer (cockroachdb#44015) are also approaching the point where they'll soon need the type definitions to interface with the KV API. Release note: None
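A sketch of what those definitions might look like: the enumeration value names (`Shared`, `Upgrade`, `Exclusive`; `Unreplicated`, `Replicated`) come from the commit message, while the underlying types and `String` methods are illustrative.

```go
package main

import "fmt"

// Strength is the degree of protection a lock provides.
type Strength int

const (
	Shared Strength = iota
	Upgrade
	Exclusive
)

func (s Strength) String() string {
	return [...]string{"Shared", "Upgrade", "Exclusive"}[s]
}

// Durability is the level of robustness of a lock against failures.
type Durability int

const (
	Unreplicated Durability = iota
	Replicated
)

func (d Durability) String() string {
	return [...]string{"Unreplicated", "Replicated"}[d]
}

func main() {
	fmt.Println(Exclusive, Replicated) // Exclusive Replicated
}
```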
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, sumeerbhola wrote…
> After spending a few hours rewriting parts of this code yesterday and this morning to introduce read reservations, and noticing a previously overlooked issue, I am having self-doubt about the value of the fairness enforced by the queuing.
>
> First, the previously overlooked issue. The code here has the following queueing behavior: a request queues if it conflicts with anyone ahead of it. The ahead-of relationship is defined based on the seq-num, and the lock holder is always ahead of everyone. If it does not queue, it makes sure it is remembered by adding a reservation. A reservation is just a marker that says "don't go ahead of me if you conflict, unless you can break my reservation" -- it prevents the sequencing of requests from being decided by who wins the race to acquire spanlatches when a lock is released. With these semantics a request that is queued is not necessarily waiting-for the lock holder or reservation holders.
>
> Say there is a write reservation with ts=20, the first queued request is a write at ts=10, and the next queued request is a read at ts=12. The read is not blocked by the write reservation. From a deadlock detection perspective we can pretend the read at ts=12 is blocked by the write at ts=20, in that we can prove that if all txns in the system are blocked we will detect a cycle, but that theoretical guarantee is not very useful since it can delay detecting the useful cycles when the system is not yet completely stalled. And more importantly, pushing the write at ts=20 is not going to help this reader. So the wait-for tracking needs to track who the read at ts=12 is actually conflicting with, which is the write at ts=10. This is mildly annoying wrt code but is doable. Note that the bad case of someone waiting-for a read by a transaction T that does not have a TxnMeta does not occur, since readers don't wait for other readers and a writer waiting-for a read-by-T must also be waiting-for a writer before the read-by-T. But I want to make sure the above queueing behavior is what we desire.
>
> In the current system an intent at ts=20 would not prevent the read at ts=12 from proceeding, since it would not know about the write at ts=10. So we are reducing concurrency by adding a queue. I tried to work out the code consequences of a different behavior where a request waits only if it conflicts with the lock holder or existing reservations (that it cannot break) -- there are some complexities, like "break reservations only if you can break the reservations of everyone you conflict with"; whenever a waited-on request (reserver or lock holder) goes away there is churn, in that one starts with the smallest waiter and sees who it will evict, then the next one, and so on. So this could be expensive and I am not keen on this option either.
(documenting the discussion with Nathan)
Regarding fairness: Based on SFU experiments, when there is contention, we desire fairness based on request arrival order, across different txns doing SFU and writes, since it reduces txn retries. This is because request arrival order typically correlates with txn timestamp order, while without queuing, requests race in an order that does not correlate well with txn timestamp order. Since SFU uses exclusive locks, the current experimental evidence does not require queueing-based fairness for (non-locking) reads.
So that gives us 2 simplification options for reads:
- Option 1: Reads wait in the queue for their turn but don't reserve when they reach the front. This does not simplify the earlier example of a write reservation at ts=20, followed by a queued write at ts=10, followed by a read at ts=12. The read needs to push the txn at ts=10 and not the one with the reservation.
- Option 2: Reads do not wait in the queue for their turn; they wait only if the lock is held in a conflicting manner. In this case reads only need to push the lock holder, and writes push the lock holder or the write reservation. This is simple.
We will adopt option 2 for now and revisit once we have more real-world experience.
This option also extends without much complexity when we add shared locks. Such requests will use a new `SpanAccess` value of `SpanReadLock`. These requests will conflict with writes regardless of timestamp, and so push either (a) the lock holder or reservation holder, if it is a writer, or (b) if the lock is held or reserved by readers, the first writer in the queue that is before this reader. `SpanReadLock` requests that are adjacent in the queue will get an opportunity to acquire the shared lock at the same time (since there will be read reservations), but if they are separated by a writer, a later read request can't jump in front of the writer. This is because jumping ahead in a naive way will create a deadlock in the lockTable (detectable, but we shouldn't create deadlocks due to our algorithmic choices): the example is req2 with seq-num=2 jumping ahead of req1 with seq-num=1 for a key where the former does `SpanReadLock` access and the latter does `SpanReadWrite`, but for another key where both requests are doing `SpanReadWrite`, req2 is behind req1.
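Option 2's push rule can be sketched compactly (hypothetical types and signatures; the real logic lives in the lockTable):

```go
package main

import "fmt"

// Option 2, as adopted above: a read waits (and pushes) only if the lock is
// held by a conflicting writer; a write waits on the lock holder or, failing
// that, on the write reservation.

type txn struct {
	name string
	ts   int
}

type lockInfo struct {
	holder      *txn // non-nil if the lock is held
	reservation *txn // non-nil if a writer holds a reservation
}

// pushTarget returns the transaction a request should push, or nil if it
// does not need to wait.
func pushTarget(isRead bool, reqTS int, l lockInfo) *txn {
	if isRead {
		// Reads ignore the queue and reservations entirely; they wait only
		// on a holder writing at or below their read timestamp.
		if l.holder != nil && l.holder.ts <= reqTS {
			return l.holder
		}
		return nil
	}
	// Writes wait on the lock holder, else on the write reservation.
	if l.holder != nil {
		return l.holder
	}
	return l.reservation
}

func main() {
	l := lockInfo{reservation: &txn{"w20", 20}}
	fmt.Println(pushTarget(true, 12, l) == nil) // read proceeds: lock not held
	fmt.Println(pushTarget(false, 10, l).name)  // write pushes the reservation
}
```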
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, sumeerbhola wrote…
> (documenting the discussion with Nathan)
>
> Regarding fairness: Based on SFU experiments, when there is contention, we desire fairness based on request arrival order, across different txns doing SFU and writes, since it reduces txn retries. This is because request arrival order typically correlates with txn timestamp order, while without queuing, requests race in an order that does not correlate well with txn timestamp order. Since SFU uses exclusive locks, the current experimental evidence does not require queueing-based fairness for (non-locking) reads.
>
> So that gives us 2 simplification options for reads:
>
> - Option 1: Reads wait in the queue for their turn but don't reserve when they reach the front. This does not simplify the earlier example of a write reservation at ts=20, followed by a queued write at ts=10, followed by a read at ts=12. The read needs to push the txn at ts=10 and not the one with the reservation.
> - Option 2: Reads do not wait in the queue for their turn; they wait only if the lock is held in a conflicting manner. In this case reads only need to push the lock holder, and writes push the lock holder or the write reservation. This is simple.
>
> We will adopt option 2 for now and revisit once we have more real-world experience.
>
> This option also extends without much complexity when we add shared locks. Such requests will use a new `SpanAccess` value of `SpanReadLock`. These requests will conflict with writes regardless of timestamp, and so push either (a) the lock holder or reservation holder, if it is a writer, or (b) if the lock is held or reserved by readers, the first writer in the queue that is before this reader. `SpanReadLock` requests that are adjacent in the queue will get an opportunity to acquire the shared lock at the same time (since there will be read reservations), but if they are separated by a writer, a later read request can't jump in front of the writer. This is because jumping ahead in a naive way will create a deadlock in the lockTable (detectable, but we shouldn't create deadlocks due to our algorithmic choices): the example is req2 with seq-num=2 jumping ahead of req1 with seq-num=1 for a key where the former does `SpanReadLock` access and the latter does `SpanReadWrite`, but for another key where both requests are doing `SpanReadWrite`, req2 is behind req1.
Trying to digest various elements of today's discussion:
- Is `IgnoredSeqNumRange` going to be in the `TxnMeta`, or does it need to be passed in explicitly? It seems to be the latter based on https://github.com/cockroachdb/cockroach/blob/43d3b6406467974f8c3bc9790561b110c4865c68/pkg/storage/engine/enginepb/mvcc3.proto, but I am not sure what the latest thinking is.
- Regarding replacing `changeLockTs(...)` with `updateLocks(span roachpb.Span, txn *enginepb.TxnMeta, ignored *IgnoredSeqNumRange, ts hlc.Timestamp)`: My understanding is that when a transaction is successfully pushed for a span, (a) the intents in that span will be made consistent with the pushed ts, which will involve removal or rewrites of the intents, and then (b) `updateLocks()` will be called. In (b), what we had discussed was:
  - For unreplicated locks, remove all locks from earlier epochs, and for the current epoch remove the locks at the ignored seqnums. The timestamp of the remaining locks will be bumped up. Question: if I recall correctly, intents were not removed when restarting at a higher epoch, since keeping those locks increased the chance of the retry succeeding -- do we want to do the same for unreplicated locks from earlier epochs by promoting them to the current epoch and bumping their timestamp? And how are we treating surviving intents from earlier epochs in the context of savepoints -- do we only keep the ones that were not rolled back?
  - For replicated locks, forget them, since they can always be discovered later. I think this was your preference because some intents may have been removed and some may have had their timestamp adjusted, and instead of telling the lockTable about the state change for each intent it is simpler to just remove them all. Was that the correct reason, or have I forgotten something?
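The unreplicated-lock policy in that discussion can be sketched in a few lines of Go. Everything here (type names, plain-integer timestamps in place of `hlc.Timestamp`, the `seqRange` stand-in for `enginepb.IgnoredSeqNumRange`) is illustrative, and the sketch implements only the stated policy: drop earlier-epoch locks, drop current-epoch locks at ignored seqnums, bump the timestamps of the rest:

```go
package main

import "fmt"

// unreplicatedLock is a stand-in for a per-txn unreplicated lock record.
type unreplicatedLock struct {
	epoch int32
	seq   int32
	ts    int64 // stand-in for hlc.Timestamp
}

// seqRange is a stand-in for enginepb.IgnoredSeqNumRange (inclusive bounds).
type seqRange struct{ start, end int32 }

func isIgnored(ranges []seqRange, seq int32) bool {
	for _, r := range ranges {
		if r.start <= seq && seq <= r.end {
			return true
		}
	}
	return false
}

// updateLocks applies the discussed policy after a successful push:
// earlier-epoch locks and ignored-seqnum locks are removed; survivors
// have their timestamp bumped to the pushed timestamp.
func updateLocks(locks []unreplicatedLock, curEpoch int32, ign []seqRange, pushedTS int64) []unreplicatedLock {
	var out []unreplicatedLock
	for _, l := range locks {
		if l.epoch < curEpoch || isIgnored(ign, l.seq) {
			continue // removed
		}
		if l.ts < pushedTS {
			l.ts = pushedTS // bumped
		}
		out = append(out, l)
	}
	return out
}

func main() {
	locks := []unreplicatedLock{{1, 4, 10}, {2, 1, 10}, {2, 5, 10}}
	// Only the current-epoch, non-ignored lock survives, bumped to ts 20.
	fmt.Println(updateLocks(locks, 2, []seqRange{{5, 5}}, 20))
}
```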
This commit introduces a new `concurrency/lock` package that provides type definitions for locking-related concepts used by concurrency control in the key-value layer. The package initially provides a lock `Strength` enumeration consisting of `Shared`, `Upgrade`, and `Exclusive` and a lock `Durability` enumeration consisting of `Unreplicated` and `Replicated`. These definitions are currently unused but will be referenced soon by cockroachdb#43740 and cockroachdb#43775. The changes around SELECT FOR UPDATE in the optimizer (cockroachdb#44015) are also approaching the point where they'll soon need the type definitions to interface with the KV API. Release note: None
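As a rough sketch of the definitions that commit describes (the real declarations live in the `concurrency/lock` package; the compatibility helper below is an extra illustration with a deliberately simplified matrix, not part of the package):

```go
package main

import "fmt"

// Strength models how strongly a lock conflicts with other lock attempts.
type Strength int

const (
	Shared Strength = iota // compatible with other Shared locks
	Upgrade                // locked for reading, with intent to upgrade later
	Exclusive              // conflicts with all other strengths
)

// Durability models whether a lock survives failures of its leaseholder.
type Durability int

const (
	Unreplicated Durability = iota // in-memory only
	Replicated                     // persisted as an intent in the keyspace
)

// conflicts is a simplified compatibility check in which only a pair of
// Shared locks is compatible; the real matrix is more nuanced.
func conflicts(a, b Strength) bool {
	return a != Shared || b != Shared
}

func main() {
	fmt.Println(conflicts(Shared, Shared))    // false
	fmt.Println(conflicts(Shared, Exclusive)) // true
}
```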
This is getting a lot closer. Thanks for all of your work on this.
Reviewed 5 of 5 files at r13, 3 of 3 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 107 at r10 (raw file):
Previously, sumeerbhola wrote…
Thanks. Removed it from the comment.
If we already dedup push attempts by tracking what is in-flight, why do we need a distinguished state -- all waiters could try to push immediately and only one would be allowed to. What am I missing?
Actually you're right: on the synchronous txn pushing path (which we'd be on here), we pass `false` for `skipIfInFlight`.
pkg/storage/concurrency/lock_table.go, line 140 at r10 (raw file):
Previously, sumeerbhola wrote…
Done
And also update the comment to be in the same order.
pkg/storage/concurrency/lock_table.go, line 667 at r10 (raw file):
Previously, sumeerbhola wrote…
Added a comment.
Thanks. This is all going to get more complicated when we allow multiple lock strengths, but for now it looks good.
pkg/storage/concurrency/lock_table.go, line 732 at r10 (raw file):
Previously, sumeerbhola wrote…
That is a good point. I've added a TODO in `lockState` listing all the places where we are searching in these two lists, and added your suggestion on how to make some of these faster. I'd prefer to wait until we have some profiling data -- my expectation is that inactive waiters will be rare, so `inQueue` will rarely be true. Is that ok?
SGTM
pkg/storage/concurrency/lock_table.go, line 740 at r10 (raw file):
Previously, sumeerbhola wrote…
I've updated the TODO that says to fix all the error messages to also include replacing "bug" errors with panics. Is it ok if I do this in a followup PR, since none of the error paths are currently tested?
SGTM
pkg/storage/concurrency/lock_table.go, line 1199 at r10 (raw file):
Previously, sumeerbhola wrote…
Done
Hm, I don't see that yet.
pkg/storage/concurrency/lock_table.go, line 1499 at r10 (raw file):
Previously, sumeerbhola wrote…
Do you mean that `g.spans` should not be accessed by `lockTableImpl`, or are you referring to the outer scope in this function? If the latter, this seems a minor issue with the generally very awkward iteration that happens here because of what `BTree` exposes. I am hoping this will be simpler when switching to our own btree.
SGTM, thanks.
pkg/storage/concurrency/lock_table.go, line 1516 at r10 (raw file):
Previously, sumeerbhola wrote…
I've added a comment that we don't need the read-lock, but holding it is harmless. The reason I released it here is that it was awkward to remember to release it in the various return paths below, and using `defer` would require me to use a closure. The `isEmptyLock` check in `tryActiveWait` would be needed even if we held the read-lock. It happens because the transition to empty for a lock happens without holding the write-lock on a tree (since we don't want to pessimistically hold the write-lock for the tree), so before the write-lock is acquired, read-locks can be acquired and see the empty lock.
👍 I guess that transition will also be possible with the COW btree.
pkg/storage/concurrency/lock_table.go, line 1542 at r10 (raw file):
Previously, sumeerbhola wrote…
This is mostly a performance optimization to avoid doing the string equality comparison. A single call to `AscendRange()` may call this func multiple times, so the first func call will set this to false. I've added a comment.
👍
pkg/storage/concurrency/lock_table.go, line 33 at r13 (raw file):
// The database is read and written using "requests". Transactions are composed // of one or more requests. Isolation id needed across requests. Additionally,
s/id/is/
pkg/storage/concurrency/lock_table.go, line 532 at r13 (raw file):
// https://www.postgresql.org/docs/12/sql-select.html#SQL-FOR-UPDATE-SHARE seqs []enginepb.TxnSeq // The timestamp at which the lock is held. If the transaction
Looks like this comment got cut off.
pkg/storage/concurrency/lock_table.go, line 819 at r13 (raw file):
if waitForTxn == nil { if l.reservation == g { // Already reserved by this request
nit: missing a period.
pkg/storage/concurrency/lock_table.go, line 951 at r13 (raw file):
return nil, errors.Errorf("caller violated contract") } if lastSeq == txn.Sequence {
I'm glad you thought of this case. You might want to just add `// Detect idempotent lock acquisition.` above it.
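As an illustration of the case being discussed, here is a hedged Go sketch of an idempotent-acquisition check; `heldLock` and `acquire` are stand-ins, not the lockTable's actual code, and the real implementation also handles epoch changes:

```go
package main

import "fmt"

// heldLock is an illustrative record of which sequence numbers of a
// transaction (at a given epoch) have acquired a lock.
type heldLock struct {
	epoch int32
	seqs  []int32 // ascending sequence numbers that acquired the lock
}

func acquire(l *heldLock, epoch, seq int32) error {
	last := l.seqs[len(l.seqs)-1]
	if epoch == l.epoch && last == seq {
		// Detect idempotent lock acquisition (e.g. a replayed request):
		// re-acquiring at the same epoch and sequence is a no-op.
		return nil
	}
	if epoch == l.epoch && last > seq {
		// A regressing sequence number violates the caller's contract.
		return fmt.Errorf("caller violated contract: seq %d < last %d", seq, last)
	}
	l.seqs = append(l.seqs, seq)
	return nil
}

func main() {
	l := &heldLock{epoch: 1, seqs: []int32{2}}
	fmt.Println(acquire(l, 1, 2), l.seqs) // idempotent: seqs unchanged
	fmt.Println(acquire(l, 1, 3), l.seqs) // new seq appended
}
```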
pkg/storage/concurrency/lock_table.go, line 1147 at r13 (raw file):
} // In increasing order of Start, and for equal Start in decreasing order of End.
I don't think we need this. We should be able to require the caller to provide us seq num ranges with these properties. See `cockroach/pkg/roachpb/data.proto`, line 497 in 6bf3a7b: `// A list of ignored seqnum ranges.`
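Given that contract (non-overlapping ranges sorted by `Start`), a membership check can binary-search rather than re-sort. This is an illustrative sketch with a stand-in for `enginepb.IgnoredSeqNumRange`:

```go
package main

import (
	"fmt"
	"sort"
)

// ignoredRange is a stand-in for enginepb.IgnoredSeqNumRange; bounds are
// inclusive, ranges are non-overlapping and sorted by Start (and hence
// also by End), matching the contract in data.proto.
type ignoredRange struct{ Start, End int32 }

// isIgnored reports whether seq falls in any of the sorted ranges.
func isIgnored(ignored []ignoredRange, seq int32) bool {
	// Find the first range whose End is >= seq; seq is ignored iff that
	// range also starts at or before seq.
	i := sort.Search(len(ignored), func(i int) bool {
		return ignored[i].End >= seq
	})
	return i < len(ignored) && ignored[i].Start <= seq
}

func main() {
	ranges := []ignoredRange{{2, 3}, {6, 8}}
	fmt.Println(isIgnored(ranges, 3)) // true
	fmt.Println(isIgnored(ranges, 5)) // false
}
```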
pkg/storage/concurrency/lock_table.go, line 1202 at r13 (raw file):
return nil, false, nil } if intent.Status == roachpb.COMMITTED || intent.Status == roachpb.ABORTED {
intent.Status.IsFinalized()
pkg/storage/concurrency/lock_table.go, line 1575 at r13 (raw file):
return key.Compare(s[i].Key) < 0 }) if i > 0 && ((len(s[i-1].EndKey) > 0 && key.Compare(s[i-1].EndKey) < 0) || key.Equal(s[i-1].Key)) {
`Span.ContainsKey`?
pkg/storage/concurrency/lock_table_test.go, line 371 at r13 (raw file):
return str + err.Error() } if state.stateKind == doneWaiting {
nit: move this into the switch statement.
pkg/storage/concurrency/lock_table_test.go, line 383 at r13 (raw file):
typeStr = "waitElsewhere" case waitSelf: return fmt.Sprintf("%sstate=waitSelf", str)
nit: `return str + "state=waitSelf"` like you have with `doneWaiting`.
pkg/storage/concurrency/testdata/lock_table, line 30 at r14 (raw file):
done r=req1 ----
Let's add a `print` here.
pkg/storage/concurrency/testdata/lock_table, line 45 at r14 (raw file):
done r=req2 ----
Let's add a `print` here. In general, I think it would be useful to add a few more of these, because they allow the reader to "checkpoint" their understanding of the state of the lockTable.
pkg/storage/concurrency/testdata/lock_table, line 205 at r14 (raw file):
global: num=5 lock: "a" holder: txn: 00000000-0000-0000-0000-000000000003, ts: 0.000000006,0
nit: it would be nice if we could chop off these leading zeros somehow. Maybe just add a TODO?
43850: util/interval: generate type-safe specializations of interval B-tree r=petermattis a=nvanbenschoten See #43740. This PR pulls the existing immutable interval B-Tree structure out of the spanlatch package and into a utility library. In order to avoid a loss of both type safety and performance, it opts for code generation by using https://github.com/mmatczuk/go_generics. The new util library exposes a script that is meant to be invoked by go generate to create specializations of the ordered tree structure for individual use cases. This ensures that each use can achieve the performance of being hand-written without needing to continue the pattern of copying the btree code and manually modifying it each time we want to use it. cc. @petermattis @sumeerbhola I'm curious to get your thoughts on this experiment. I'm hoping this also unblocks @sumeerbhola on #43740. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
force-pushed from d5d2163 to 8762818
This is getting a lot closer. Thanks for all of your work on this.
This has been fun!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 107 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Actually you're right, on the synchronous txn pushing path (which we'd be on here), we pass `false` for `skipIfInFlight`.
Thanks for the clarification.
pkg/storage/concurrency/lock_table.go, line 140 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
And also update the comment to be in the same order.
Done
pkg/storage/concurrency/lock_table.go, line 1199 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Hm, I don't see that yet.
I forgot to update my response. I subsequently removed the `done` field from `lockGuardImpl` since it wasn't needed with the deferred calls to `findNextLockAfter()` -- the holder of a `requestGuard` shouldn't be calling methods on a `requestGuard` that it has already passed to `lockTable.done()`.
pkg/storage/concurrency/lock_table.go, line 33 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/id/is/
Done
pkg/storage/concurrency/lock_table.go, line 532 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Looks like this comment got cut off.
I deleted the incomplete sentence. The epoch and `ts` monotonicity is mentioned in the comment earlier in the struct.
pkg/storage/concurrency/lock_table.go, line 819 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: missing a period.
Done
pkg/storage/concurrency/lock_table.go, line 951 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm glad you thought of this case. You might want to just add
// Detect idempotent lock acquisition.
above it.
Added. Also added a TODO to test this.
pkg/storage/concurrency/lock_table.go, line 1147 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think we need this. We should be able to require the caller to provide us seq num ranges with these properties. See `cockroach/pkg/roachpb/data.proto`, line 497 in 6bf3a7b: `// A list of ignored seqnum ranges.`
That is good. Removed.
pkg/storage/concurrency/lock_table.go, line 1202 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
intent.Status.IsFinalized()
Done
pkg/storage/concurrency/lock_table.go, line 1575 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
`Span.ContainsKey`?
I wasn't aware of its existence. It is doing `return bytes.Compare(key, s.Key) >= 0 && bytes.Compare(key, s.EndKey) < 0`, which only seems correct if the span does not have an empty `EndKey` -- some of the call sites look like they are calling with a replica range, which would satisfy this constraint.
pkg/storage/concurrency/lock_table.go, line 626 at r15 (raw file):
// // Read reservations are not permitted due to the complexities discussed in // the review for #43740. Additionally, reads do not queue for their turn at
I figured it was better to make this more self-contained, so I added some examples here.
pkg/storage/concurrency/lock_table.go, line 688 at r15 (raw file):
// breaks the reservation at A. // // Extension for Shared and Upgrade locks:
Added a comment on how this will extend with other lock strengths.
pkg/storage/concurrency/lock_table_test.go, line 371 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move this into the switch statement.
Done
pkg/storage/concurrency/lock_table_test.go, line 383 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: `return str + "state=waitSelf"` like you have with `doneWaiting`
Done
pkg/storage/concurrency/testdata/lock_table, line 30 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's add a
Done
pkg/storage/concurrency/testdata/lock_table, line 45 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's add a
Done (I've added prints for more commands).
pkg/storage/concurrency/testdata/lock_table, line 205 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: it would be nice if we could chop off these leading zeros somehow. Maybe just add a TODO?
Added a TODO
I think this is ready to merge once CI is resolved! Let's just make sure to squash all three commits into a single commit.
Reviewed 3 of 3 files at r15.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 1575 at r13 (raw file):
Previously, sumeerbhola wrote…
I wasn't aware of its existence. It is doing `return bytes.Compare(key, s.Key) >= 0 && bytes.Compare(key, s.EndKey) < 0`, which only seems correct if the span does not have an empty `EndKey` -- some of the call sites look like they are calling with a replica range, which would satisfy this constraint.
Good point.
pkg/storage/concurrency/lock_table.go, line 688 at r15 (raw file):
Previously, sumeerbhola wrote…
Added a comment on how this will extend with other lock strengths.
Nice comment.
pkg/storage/concurrency/testdata/lock_table, line 45 at r14 (raw file):
Previously, sumeerbhola wrote…
Done (I've added prints for more commands).
Thanks. This is much easier to read.
force-pushed from 8762818 to 8c2a3b6
…ted locks.
- Includes a basic test.
- TODOs planned for followup PRs:
  - Better error messages.
  - More tests (as listed in TODOs in the code).
  - Use the copy-on-write btree.

Release note: None
force-pushed from 8c2a3b6 to 2333372
bors r+
Build succeeded
Informs cockroachdb#41720. The commit creates a new concurrency package that provides a concurrency manager structure that encapsulates the details of concurrency control and contention handling for serializable key-value transactions. Interested readers should start at `concurrency_control.go` and move out from there. The new package has a few primary objectives: 1. centralize the handling of request synchronization and transaction contention handling in a single location, allowing for the topic to be documented, understood, and tested in isolation. 2. rework contention handling to react to intent state transitions directly. This simplifies the transaction queueing story, reduces the frequency of transaction push RPCs, and allows waiters to proceed immediately after intent resolution. 3. create a framework that naturally permits "update" locking, which is required for kv-level SELECT FOR UPDATE support (cockroachdb#6583). 4. provide stronger guarantees around fairness when transactions conflict, to reduce tail latencies under contended scenarios. 5. create a structure that can extend to address the long-term goals of a fully centralized lock-table laid out in cockroachdb#41720. This commit pulls over a lot of already reviewed code from cockroachdb#43775. The big differences are that it updates the lockTable interface to match the one introduced in cockroachdb#43740 and it addresses the remaining TODOs to document the rest of the concurrency control mechanisms in CockroachDB. At this point, a reader of this file should get a good idea of how KV transactions interact in CRDB... now we just need to make the system actually work this way.
44787: storage/concurrency: define concurrency control interfaces r=nvanbenschoten a=nvanbenschoten Informs #41720. The commit creates a new concurrency package that provides a concurrency manager structure that encapsulates the details of concurrency control and contention handling for serializable key-value transactions. Interested readers should start at `concurrency_control.go` and move out from there. The new package has a few primary objectives: 1. centralize the handling of request synchronization and transaction contention handling in a single location, allowing for the topic to be documented, understood, and tested in isolation. 2. rework contention handling to react to intent state transitions directly. This simplifies the transaction queueing story, reduces the frequency of transaction push RPCs, and allows waiters to proceed immediately after intent resolution. 3. create a framework that naturally permits "update" locking, which is required for kv-level SELECT FOR UPDATE support (#6583). 4. provide stronger guarantees around fairness when transactions conflict, to reduce tail latencies under contended scenarios. 5. create a structure that can extend to address the long-term goals of a fully centralized lock-table laid out in #41720. This commit pulls over a lot of already reviewed code from #43775. The big differences are that it updates the lockTable interface to match the one introduced in #43740 and it addresses the remaining TODOs to document the rest of the concurrency control mechanisms in CockroachDB. At this point, a reader of this file should get a good idea of how KV transactions interact in CRDB... now we just need to make the system actually work this way. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
No description provided.