storage/concurrency: lock table for tracking replicated and unreplicated locks and sequencing requests #43740
…totype SFU

Informs cockroachdb#41720.

This commit creates a new concurrency package that provides a concurrency manager structure that encapsulates the details of concurrency control and contention handling for serializable key-value transactions. Any reader of this commit should start at `concurrency_control.go` and move out from there.

The new package has a few primary objectives:
1. centralize the handling of request synchronization and transaction contention handling in a single location, allowing for the topic to be documented and understood in isolation.
2. rework contention handling to react to intent state transitions directly. This simplifies the transaction queueing story, reduces the frequency of transaction push RPCs, and allows waiters to proceed after intent resolution as soon as possible.
3. create a framework that naturally permits "update" locking, which is required for kv-level SELECT FOR UPDATE support (cockroachdb#6583).
4. provide stronger guarantees around fairness when transactions conflict, in order to reduce tail latencies under contended scenarios.
5. create a structure that can extend to address the long-term goals of a fully centralized lock-table laid out in cockroachdb#41720.

WARNING: this is still a WIP. Notably, the lockTableImpl is mocked out with a working but incomplete implementation. See cockroachdb#43740 for a more complete strawman.

Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
```go
nextSeqNum uint64

// Containers for lockState structs.
locks [spanset.NumSpanScope]btree.BTree
```
I looked into replacing `btree.BTree` with something akin to the `spanlatch.btree`. I don't think it directly applies. `spanlatch.btree` is making an implicit queue such that one is waiting behind all the conflicting latches already in the btree, computed by taking a snapshot of the btree. In the `lockTableImpl` the ordering is defined by the `seqNum`, and a `scanAndEnqueue` call can cause a request to be inserted before an existing request in a queue. So we need explicit queues per lock (where the locks are on points, not intervals). The explicit queues are simple wrt synchronization -- just use a per-`lockState` mutex to reduce contention. Iterating over a btree needs a `RWMutex` for the btree to be held in read-mode. Even a lock release can make do with holding this in read-mode -- any `lockState` that needs to be garbage collected can be added to a list of potential garbage that is periodically handled by a background thread that acquires the btree `RWMutex` in write-mode.

The user-facing activities that need the btree `RWMutex` in write-mode are (a) acquiring a lock that was not already in the btree and (b) adding a discovered lock. If we are worried about read-write contention on the btree mutex, we could eliminate it via copy-on-write: when doing `scanAndEnqueue` the request is holding spanlatches, so no other conflicting request can be acquiring or releasing locks or adding to a queue in the `lockTable`. So take a snapshot of the btree and release the btree mutex. Then iterate over this snapshot, add to queues, and make reservations (using the per-`lockState` mutex). If the request successfully reserved everywhere, it can proceed to evaluation. Otherwise it uses this snapshot for iterating and waiting. The next call to `scanAndEnqueue` will acquire a new snapshot.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, sumeerbhola wrote…
> I looked into replacing `btree.BTree` with something akin to the `spanlatch.btree`. I don't think it directly applies. `spanlatch.btree` is making an implicit queue such that one is waiting behind all the conflicting latches already in the btree, computed by taking a snapshot of the btree. In the `lockTableImpl` the ordering is defined by the `seqNum`, and a `scanAndEnqueue` call can cause a request to be inserted before an existing request in a queue. So we need explicit queues per lock (where the locks are on points, not intervals). The explicit queues are simple wrt synchronization -- just use a per-`lockState` mutex to reduce contention. Iterating over a btree needs a `RWMutex` for the btree to be held in read-mode. Even a lock release can make do with holding this in read-mode -- any `lockState` that needs to be garbage collected can be added to a list of potential garbage that is periodically handled by a background thread that acquires the btree `RWMutex` in write-mode.
>
> The user-facing activities that need the btree `RWMutex` in write-mode are (a) acquiring a lock that was not already in the btree and (b) adding a discovered lock. If we are worried about read-write contention on the btree mutex, we could eliminate it via copy-on-write: when doing `scanAndEnqueue` the request is holding spanlatches, so no other conflicting request can be acquiring or releasing locks or adding to a queue in the `lockTable`. So take a snapshot of the btree and release the btree mutex. Then iterate over this snapshot, add to queues, and make reservations (using the per-`lockState` mutex). If the request successfully reserved everywhere, it can proceed to evaluation. Otherwise it uses this snapshot for iterating and waiting. The next call to `scanAndEnqueue` will acquire a new snapshot.
I wasn't terribly thrilled with `btree.BTree` when I looked at using it in Pebble at one point. Something about the code diverging too much from the original C++ btree code it was based on, though the precise reasons escape me. See cockroachdb/pebble@d99331e for the old Pebble btree code which was the basis for `spanlatch.btree`. I know Nathan made a number of changes to `spanlatch.btree` when he forked it from the Pebble code. It may be better to fork `spanlatch.btree` and adapt it to the needs of the code here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
I think this is missing the bigger point that we talked about yesterday which is that there is a strong desire to be able to capture a read-only snapshot of the lock table state (but not necessarily the wait-queue on each lock) that is held throughout request evaluation. This is what enables a number of improvements in #41720 including being able to resolve intents without waiting for a full consensus round.
> It may be better to fork `spanlatch.btree` and adapt it to the needs of the code here.
I was thinking the same thing. We've done this dance one too many times and I'm thinking about making the dive and using https://github.com/mmatczuk/go_generics (used by https://github.com/scylladb/go-set). Thoughts?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
> I was thinking the same thing. We've done this dance one too many times and I'm thinking about making the dive and using https://github.com/mmatczuk/go_generics (used by https://github.com/scylladb/go-set). Thoughts?
Worth a try.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
> I was thinking the same thing. We've done this dance one too many times and I'm thinking about making the dive and using https://github.com/mmatczuk/go_generics (used by https://github.com/scylladb/go-set). Thoughts?
>
> Worth a try.
We may be agreeing for the most part: the copy-on-write alternative I outlined at the end is a fork of `spanlatch.btree` and not `btree.BTree`. If we can have a generic one, I'd be happy to use that instead of forking: the lockTable currently does not need the ability to store spans but eventually may (if/when there are read locks on spans).

I didn't understand the part about "missing the bigger point": (a) I don't see how a setup where queues can be created after the first call to `scanAndEnqueue()` by a request, that this request eventually realizes it needs to wait on, allows a snapshot taken in that first call to `scanAndEnqueue()` to be reused for subsequent calls to `scanAndEnqueue()` -- I think one needs to snapshot again. (b) The lifetime of the snapshot shouldn't affect when the intents can be resolved. (c) Your prototype in #41720 appears to use `btree.BTree` and simple synchronization -- if that is working well for a contended YCSB workload, are we prematurely optimizing the btree implementation by discussing snapshots and copy-on-write?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
My “missing the bigger point” came off stronger than I was intending. You're definitely not missing the point here 😃
> I don't see how a setup where queues can be created after the first call to `scanAndEnqueue()` by a request, that this request eventually realizes it needs to wait on, allows for a snapshot taken in that first call to `scanAndEnqueue()` to be reused for subsequent calls to `scanAndEnqueue()` -- I think one needs to snapshot again
Yes, exactly: we'll need to take another snapshot after we've re-acquired latches, before subsequent calls to `scanAndEnqueue`.
> The lifetime of the snapshot shouldn't affect when the intents can be resolved
I don't think I understand this comment. The idea of the snapshot is that it allows us to capture the state of the lock table and reference it for free throughout the evaluation of a request. This gives us the opportunity to "virtualize" the resolution of an intent without needing to physically resolve it. See #41720 (comment) for more on this idea.
> your prototype in #41720 appears to use `btree.BTree` and simple synchronization -- if that is working well for a contended YCSB workload are we prematurely optimizing the btree implementation by discussing snapshots and copy-on-write?
I wouldn't want to merge that though. See the warning in the commit and PR description. The latch manager made pretty large improvements to throughput on large machines because it reduced synchronization between non-conflicting requests on a Range. I wouldn't want to regress that here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
> The lifetime of the snapshot shouldn't affect when the intents can be resolved
>
> I don't think I understand this comment. The idea of the snapshot is that it allows us to capture the state of the lock table and reference it for free throughout the evaluation of a request. This gives us the opportunity to "virtualize" the resolution of an intent without needing to physically resolve it. See #41720 (comment) for more on this idea.
I must have misunderstood what you were saying earlier -- I think we are consistent wrt "virtualizing" the intent resolution: once this `lockTable` is also managing the segregated replicated locks there would be a (bad names) `releaseButDoNotRemoveReplicatedLock()` method (in addition to a `releaseReplicatedLock()` which would clean up the persistent state), which would just mark the in-memory `lockState` as released and process the queued entries.

Should I hold off on forking the `spanlatch.btree`, or if you are busy do you want me to look at using go generics for that fork?
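The "release but do not remove" idea could be sketched roughly as follows (the type and method names here are invented for illustration; only the two-path release distinction comes from the comment above):

```go
package main

import "fmt"

// Sketch of the two release paths: one clears only the in-memory view so
// queued entries can proceed, the other also cleans up persistent state.

type lockState struct {
	key       string
	held      bool // in-memory view: is the lock still blocking waiters?
	persisted bool // is there still a replicated (persistent) intent?
	waitQueue []string
}

// releaseButDoNotRemove marks the in-memory lock as released and hands back
// the queued entries to process, leaving the persistent intent in place.
func (ls *lockState) releaseButDoNotRemove() []string {
	ls.held = false
	woken := ls.waitQueue
	ls.waitQueue = nil
	return woken
}

// release also cleans up the persistent state.
func (ls *lockState) release() []string {
	ls.persisted = false
	return ls.releaseButDoNotRemove()
}

func main() {
	ls := &lockState{key: "a", held: true, persisted: true, waitQueue: []string{"req1", "req2"}}
	woken := ls.releaseButDoNotRemove()
	fmt.Println(len(woken), ls.persisted) // waiters proceed; intent still persisted
}
```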
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 269 at r1 (raw file):
```go
// corresponding intent has been applied -- at that point the lockTable could
// GC these locks since the persistent state will allow it to discover them
// again when there is contention.
addDiscoveredExclusiveLock(key roachpb.Key, txn *enginepb.TxnMeta, ts hlc.Timestamp) error
```
It's interesting that you don't immediately enter the wait-queue (but don't wait) for this discovered lock. That's how #43775 works, and I think the justification was that it helped ensure a degree of fairness between the first request to observe an intent and other requests that followed. Does the request seq num help avoid these kinds of inversions?
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
> I must have misunderstood what you were saying earlier -- I think we are consistent wrt "virtualizing" the intent resolution: once this `lockTable` is also managing the segregated replicated locks there would be a (bad names) `releaseButDoNotRemoveReplicatedLock()` method (in addition to a `releaseReplicatedLock()` which would clean up the persistent state), which would just mark the in-memory `lockState` as released and process the queued entries.
Yes, that's exactly right.
> Should I hold off on forking the `spanlatch.btree` or if you are busy do you want me to look at using go generics for that fork?
I'm hoping to have something out tonight that does this which I can hand off to you, so feel free to hold off. It's actually all working fairly well.
Besides that, my major takeaways from this prototype so far are:

- We can remove the persistence aspects of this prototype for now, as we came to the decision during our conversation earlier this week that we aren't intending to persist the unreplicated portion of the lock table, at least not in the initial draft.
- I'm interested in seeing how you envision the `queueHandle` structure fitting in with pushing transaction records to check for txn liveness and deadlock detection. This would correspond to the `lockWaitQueueWaiter` structure in [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775. We discussed earlier this week that we may be able to avoid a distinguished waiter entirely.
- Similarly, I think it would be worthwhile trying to fit this strawman into the interface and type defs provided in cbe5c2e#diff-d323684eb8ef832308d443e5fd7e55ba. This will reveal whether our two approaches have any impedance mismatches and whether the `lockTable` interface that the rest of [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775 works with is missing anything.
…-Tree See cockroachdb#43740. This PR pulls the existing immutable interval B-Tree structure out of the spanlatch package and into a utility library. In order to avoid a loss of both type safety and performance, it opts for code generation by using https://github.com/mmatczuk/go_generics. The new util library exposes a script that is meant to be invoked by go generate to create specializations of the ordered tree structure for individual use cases. This ensures that each use can achieve the performance of being hand-written without needing to continue the pattern of copying the btree code and manually modifying it each time we want to use it.
Force-pushed from 9940538 to 6eba50d.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 269 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> It's interesting that you don't immediately enter the wait-queue (but don't wait) for this discovered lock. That's how #43775 works, and I think the justification was that it helped ensure a degree of fairness between the first request to observe an intent and other requests that followed. Does the request seq num help avoid these kinds of inversions?
As we discussed yesterday, the comment was stale -- it does enter the wait-queue immediately.
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> I must have misunderstood what you were saying earlier -- I think we are consistent wrt "virtualizing" the intent resolution: once this `lockTable` is also managing the segregated replicated locks there would be a (bad names) `releaseButDoNotRemoveReplicatedLock()` method (in addition to a `releaseReplicatedLock()` which would clean up the persistent state), which would just mark the in-memory `lockState` as released and process the queued entries.
>
> Yes, that's exactly right.
>
> Should I hold off on forking the `spanlatch.btree` or if you are busy do you want me to look at using go generics for that fork?
>
> I'm hoping to have something out tonight that does this which I can hand off to you, so feel free to hold off. It's actually all working fairly well.
>
> Besides that, my major takeaways from this prototype so far are:
>
> - We can remove the persistence aspects of this prototype for now, as we came to the decision during our conversation earlier this week that we aren't intending to persist the unreplicated portion of the lock table, at least not in the initial draft.
> - I'm interested in seeing how you envision the `queueHandle` structure fitting in with pushing transaction records to check for txn liveness and deadlock detection. This would correspond to the `lockWaitQueueWaiter` structure in [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775. We discussed earlier this week that we may be able to avoid a distinguished waiter entirely.
> - Similarly, I think it would be worthwhile trying to fit this strawman into the interface and type defs provided in cbe5c2e#diff-d323684eb8ef832308d443e5fd7e55ba. This will reveal whether our two approaches have any impedance mismatches and whether the `lockTable` interface that the rest of [DNM] storage/concurrency: introduce concurrency control package, prototype SFU #43775 works with is missing anything.
After spending a few hours rewriting parts of this code yesterday and this morning to introduce read reservations, and noticing a previously overlooked issue, I am having self-doubt about the value of the fairness enforced by the queuing.

First, the previously overlooked issue. The code here has the following queueing behavior: a request queues if it conflicts with anyone ahead of it. The ahead-of relationship is defined based on the seq-num, and the lock holder is always ahead of everyone. If it does not queue, it makes sure it is remembered by adding a reservation. A reservation is just a marker that says "don't go ahead of me if you conflict, unless you can break my reservation" -- it prevents the sequencing of requests from being decided by who wins the race to acquire spanlatches when a lock is released. With these semantics a request that is queued is not necessarily waiting-for the lock holder or reservation holders.

Say there is a write reservation with ts=20, the first queued request is a write at ts=10, and the next queued request is a read at ts=12. The read is not blocked by the write reservation. From a deadlock detection perspective we can pretend the read at ts=12 is blocked by the write at ts=20, in that we can prove that if all txns in the system are blocked we will detect a cycle, but that theoretical guarantee is not very useful since it can delay detecting the useful cycles when the system is not yet completely stalled. And more importantly, pushing the write at ts=20 is not going to help this reader. So the wait-for tracking needs to track who the read at ts=12 is actually conflicting with, which is the write at ts=10. This is mildly annoying wrt code but is doable. Note that the bad case of someone waiting-for a read by a transaction T that does not have a TxnMeta does not occur, since readers don't wait for other readers and a writer waiting-for a read-by-T must also be waiting-for a writer before the read-by-T.

But I want to make sure the above queueing behavior is what we desire.

In the current system an intent at ts=20 would not prevent the read at ts=12 from proceeding, since it would not know about the write at ts=10. So we are reducing concurrency by adding a queue. I tried to work out the code consequences of a different behavior where a request waits only if it conflicts with the lock holder or existing reservations (that it cannot break) -- there are some complexities, like "break reservations only if you can break the reservations of everyone you conflict with"; whenever a waited-on request (reserver or lock holder) goes away there is churn, in that one starts with the smallest waiter and sees who it will evict, then the next one, and so on. So this could be expensive and I am not keen on this option either.
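To make the wait-for subtlety concrete, here is a toy model (all types and the conflict rule encoding are invented for illustration) computing the actual push target for the example above: the read at ts=12 conflicts with the queued write at ts=10, not with the reservation at ts=20.

```go
package main

import "fmt"

// Toy model of a lock's queue under the seq-num ordering described above.
// A read conflicts with a write at or below its timestamp; a write conflicts
// with any write ahead of it.

type kind int

const (
	read kind = iota
	write
)

type entry struct {
	name string
	k    kind
	ts   int
}

func conflicts(req, ahead entry) bool {
	switch {
	case req.k == read && ahead.k == write:
		// A read conflicts with a write at or below its timestamp.
		return ahead.ts <= req.ts
	case req.k == write && ahead.k == write:
		return true
	case req.k == write && ahead.k == read:
		// A write conflicts with reads at or above its timestamp.
		return ahead.ts >= req.ts
	default:
		// Reads don't conflict with reads.
		return false
	}
}

// pushTargetFor returns the first entry ahead of req that req actually
// waits for, or nil if there is no conflict.
func pushTargetFor(req entry, ahead []entry) *entry {
	for i := range ahead {
		if conflicts(req, ahead[i]) {
			return &ahead[i]
		}
	}
	return nil
}

func main() {
	ahead := []entry{
		{"write-reservation", write, 20},
		{"queued-write", write, 10},
	}
	req := entry{"read", read, 12}
	fmt.Println(pushTargetFor(req, ahead).name) // queued-write
}
```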
This commit introduces a new `concurrency/lock` package that provides type definitions for locking-related concepts used by concurrency control in the key-value layer. The package initially provides a lock `Strength` enumeration consisting of `Shared`, `Upgrade`, and `Exclusive` and a lock `Durability` enumeration consisting of `Unreplicated` and `Replicated`. These definitions are currently unused but will be referenced soon by cockroachdb#43740 and cockroachdb#43775. The changes around SELECT FOR UPDATE in the optimizer (cockroachdb#44015) are also approaching the point where they'll soon need the type definitions to interface with the KV API. Release note: None
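A sketch of what those definitions might look like: the enumeration value names (`Shared`, `Upgrade`, `Exclusive`; `Unreplicated`, `Replicated`) come from the commit message, while the underlying types and `String` methods are illustrative.

```go
package main

import "fmt"

// Strength is the degree of protection a lock provides.
type Strength int

const (
	Shared Strength = iota
	Upgrade
	Exclusive
)

func (s Strength) String() string {
	return [...]string{"Shared", "Upgrade", "Exclusive"}[s]
}

// Durability is the level of robustness of a lock against failures.
type Durability int

const (
	Unreplicated Durability = iota
	Replicated
)

func (d Durability) String() string {
	return [...]string{"Unreplicated", "Replicated"}[d]
}

func main() {
	fmt.Println(Exclusive, Replicated) // Exclusive Replicated
}
```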
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, sumeerbhola wrote…
> After spending a few hours rewriting parts of this code yesterday and this morning to introduce read reservations, and noticing a previously overlooked issue, I am having self-doubt about the value of the fairness enforced by the queuing.
>
> First, the previously overlooked issue. The code here has the following queueing behavior: a request queues if it conflicts with anyone ahead of it. The ahead-of relationship is defined based on the seq-num, and the lock holder is always ahead of everyone. If it does not queue, it makes sure it is remembered by adding a reservation. A reservation is just a marker that says "don't go ahead of me if you conflict, unless you can break my reservation" -- it prevents the sequencing of requests from being decided by who wins the race to acquire spanlatches when a lock is released. With these semantics a request that is queued is not necessarily waiting-for the lock holder or reservation holders.
>
> Say there is a write reservation with ts=20, the first queued request is a write at ts=10, and the next queued request is a read at ts=12. The read is not blocked by the write reservation. From a deadlock detection perspective we can pretend the read at ts=12 is blocked by the write at ts=20, in that we can prove that if all txns in the system are blocked we will detect a cycle, but that theoretical guarantee is not very useful since it can delay detecting the useful cycles when the system is not yet completely stalled. And more importantly, pushing the write at ts=20 is not going to help this reader. So the wait-for tracking needs to track who the read at ts=12 is actually conflicting with, which is the write at ts=10. This is mildly annoying wrt code but is doable. Note that the bad case of someone waiting-for a read by a transaction T that does not have a TxnMeta does not occur, since readers don't wait for other readers and a writer waiting-for a read-by-T must also be waiting-for a writer before the read-by-T. But I want to make sure the above queueing behavior is what we desire.
>
> In the current system an intent at ts=20 would not prevent the read at ts=12 from proceeding, since it would not know about the write at ts=10. So we are reducing concurrency by adding a queue. I tried to work out the code consequences of a different behavior where a request waits only if it conflicts with the lock holder or existing reservations (that it cannot break) -- there are some complexities, like "break reservations only if you can break the reservations of everyone you conflict with"; whenever a waited-on request (reserver or lock holder) goes away there is churn, in that one starts with the smallest waiter and sees who it will evict, then the next one, and so on. So this could be expensive and I am not keen on this option either.
(documenting the discussion with Nathan)
Regarding fairness: Based on SFU experiments, when there is contention, we desire fairness based on request arrival order, across different txns doing SFU and writes, since it reduces txn retries. This is because request arrival order typically correlates with txn timestamp order, while without queuing, requests race in an order that does not correlate well with txn timestamp order. Since SFU uses exclusive locks, the current experimental evidence does not require queueing-based fairness for (non-locking) reads.
So that gives us 2 simplification options for reads:
- Option 1: Reads wait in the queue for their turn but don't reserve when they reach the front. This does not simplify the earlier example of a write reservation at ts=20, followed by a queued write at ts=10, followed by a read at ts=12. The read needs to push the txn at ts=10 and not the one with the reservation.
- Option 2: Reads do not wait in the queue for their turn; they wait only if the lock is held in a conflicting manner. In this case reads only need to push the lock holder, and writes push the lock holder or the write reservation. This is simple.
We will adopt option 2 for now and revisit once we have more real-world experience.
This option also extends without much complexity when we add shared locks. Such requests will use a new `SpanAccess` value of `SpanReadLock`. These requests will conflict with writes regardless of timestamp, and so push either (a) the lock holder or reservation holder, if it is a writer, or (b) if the lock is held or reserved by readers, the first writer in the queue that is before this reader. `SpanReadLock` requests that are adjacent in the queue will get an opportunity to acquire the shared lock at the same time (since there will be read reservations), but if they are separated by a writer, a later read request can't jump in front of the writer. This is because jumping ahead in a naive way will create a deadlock in the lockTable (detectable, but we shouldn't create deadlocks due to our algorithmic choices): the example is req2 with seq-num=2 jumping ahead of req1 with seq-num=1 for a key where the former does `SpanReadLock` access and the latter does `SpanReadWrite`, but for another key where both requests are doing `SpanReadWrite`, req2 is behind req1.
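Option 2's push rule can be sketched compactly (hypothetical types and signatures; the real logic lives in the lockTable):

```go
package main

import "fmt"

// Option 2, as adopted above: a read waits (and pushes) only if the lock is
// held by a conflicting writer; a write waits on the lock holder or, failing
// that, on the write reservation.

type txn struct {
	name string
	ts   int
}

type lockInfo struct {
	holder      *txn // non-nil if the lock is held
	reservation *txn // non-nil if a writer holds a reservation
}

// pushTarget returns the transaction a request should push, or nil if it
// does not need to wait.
func pushTarget(isRead bool, reqTS int, l lockInfo) *txn {
	if isRead {
		// Reads ignore the queue and reservations entirely; they wait only
		// on a holder writing at or below their read timestamp.
		if l.holder != nil && l.holder.ts <= reqTS {
			return l.holder
		}
		return nil
	}
	// Writes wait on the lock holder, else on the write reservation.
	if l.holder != nil {
		return l.holder
	}
	return l.reservation
}

func main() {
	l := lockInfo{reservation: &txn{"w20", 20}}
	fmt.Println(pushTarget(true, 12, l) == nil) // read proceeds: lock not held
	fmt.Println(pushTarget(false, 10, l).name)  // write pushes the reservation
}
```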
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 238 at r2 (raw file):
Previously, sumeerbhola wrote…
> (documenting the discussion with Nathan)
>
> Regarding fairness: Based on SFU experiments, when there is contention, we desire fairness based on request arrival order, across different txns doing SFU and writes, since it reduces txn retries. This is because request arrival order typically correlates with txn timestamp order, while without queuing, requests race in an order that does not correlate well with txn timestamp order. Since SFU uses exclusive locks, the current experimental evidence does not require queueing-based fairness for (non-locking) reads.
>
> So that gives us 2 simplification options for reads:
>
> - Option 1: Reads wait in the queue for their turn but don't reserve when they reach the front. This does not simplify the earlier example of a write reservation at ts=20, followed by a queued write at ts=10, followed by a read at ts=12. The read needs to push the txn at ts=10 and not the one with the reservation.
> - Option 2: Reads do not wait in the queue for their turn; they wait only if the lock is held in a conflicting manner. In this case reads only need to push the lock holder, and writes push the lock holder or the write reservation. This is simple.
>
> We will adopt option 2 for now and revisit once we have more real-world experience.
>
> This option also extends without much complexity when we add shared locks. Such requests will use a new `SpanAccess` value of `SpanReadLock`. These requests will conflict with writes regardless of timestamp, and so push either (a) the lock holder or reservation holder, if it is a writer, or (b) if the lock is held or reserved by readers, the first writer in the queue that is before this reader. `SpanReadLock` requests that are adjacent in the queue will get an opportunity to acquire the shared lock at the same time (since there will be read reservations), but if they are separated by a writer, a later read request can't jump in front of the writer. This is because jumping ahead in a naive way will create a deadlock in the lockTable (detectable, but we shouldn't create deadlocks due to our algorithmic choices): the example is req2 with seq-num=2 jumping ahead of req1 with seq-num=1 for a key where the former does `SpanReadLock` access and the latter does `SpanReadWrite`, but for another key where both requests are doing `SpanReadWrite`, req2 is behind req1.
Trying to digest various elements of today's discussion:
- Is `IgnoredSeqNumRange` going to be in the `TxnMeta`, or does it need to be passed in explicitly? It seems to be the latter based on https://github.com/cockroachdb/cockroach/blob/43d3b6406467974f8c3bc9790561b110c4865c68/pkg/storage/engine/enginepb/mvcc3.proto, but I am not sure what the latest thinking is.
- Regarding replacing `changeLockTs(...)` with `updateLocks(span roachpb.Span, txn *enginepb.TxnMeta, ignored *IgnoredSeqNumRange, ts hlc.Timestamp)`: My understanding is that when a transaction is successfully pushed for a span, (a) the intents in that span will be made consistent with the pushed ts, which will involve removal or rewrites of the intents, and then (b) `updateLocks()` will be called. In (b), what we had discussed was:
  - For unreplicated locks, remove all locks from earlier epochs, and for the current epoch remove the locks at the ignored seqnums. The timestamp of the remaining locks will be bumped up. Question: if I recall correctly, intents were not removed when restarting at a higher epoch, since keeping those locks increased the chance of the retry succeeding -- do we want to do the same for unreplicated locks from earlier epochs by promoting them to the current epoch and bumping their timestamp? And how are we treating surviving intents from earlier epochs in the context of savepoints -- do we only keep the ones that were not rolled back?
  - For replicated locks, forget them, since they can always be discovered later. I think this was your preference because some intents may have been removed and some may have had their timestamp adjusted, and instead of telling the lockTable about the state change for each intent it is simpler to just remove them all. Was that the correct reason, or have I forgotten something?
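The unreplicated-lock policy in that discussion can be sketched in a few lines of Go. Everything here (type names, plain-integer timestamps in place of `hlc.Timestamp`, the `seqRange` stand-in for `enginepb.IgnoredSeqNumRange`) is illustrative, and the sketch implements only the stated policy: drop earlier-epoch locks, drop current-epoch locks at ignored seqnums, bump the timestamps of the rest:

```go
package main

import "fmt"

// unreplicatedLock is a stand-in for a per-txn unreplicated lock record.
type unreplicatedLock struct {
	epoch int32
	seq   int32
	ts    int64 // stand-in for hlc.Timestamp
}

// seqRange is a stand-in for enginepb.IgnoredSeqNumRange (inclusive bounds).
type seqRange struct{ start, end int32 }

func isIgnored(ranges []seqRange, seq int32) bool {
	for _, r := range ranges {
		if r.start <= seq && seq <= r.end {
			return true
		}
	}
	return false
}

// updateLocks applies the discussed policy after a successful push:
// earlier-epoch locks and ignored-seqnum locks are removed; survivors
// have their timestamp bumped to the pushed timestamp.
func updateLocks(locks []unreplicatedLock, curEpoch int32, ign []seqRange, pushedTS int64) []unreplicatedLock {
	var out []unreplicatedLock
	for _, l := range locks {
		if l.epoch < curEpoch || isIgnored(ign, l.seq) {
			continue // removed
		}
		if l.ts < pushedTS {
			l.ts = pushedTS // bumped
		}
		out = append(out, l)
	}
	return out
}

func main() {
	locks := []unreplicatedLock{{1, 4, 10}, {2, 1, 10}, {2, 5, 10}}
	// Only the current-epoch, non-ignored lock survives, bumped to ts 20.
	fmt.Println(updateLocks(locks, 2, []seqRange{{5, 5}}, 20))
}
```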
This commit introduces a new `concurrency/lock` package that provides type definitions for locking-related concepts used by concurrency control in the key-value layer. The package initially provides a lock `Strength` enumeration consisting of `Shared`, `Upgrade`, and `Exclusive` and a lock `Durability` enumeration consisting of `Unreplicated` and `Replicated`. These definitions are currently unused but will be referenced soon by cockroachdb#43740 and cockroachdb#43775. The changes around SELECT FOR UPDATE in the optimizer (cockroachdb#44015) are also approaching the point where they'll soon need the type definitions to interface with the KV API. Release note: None
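As a rough sketch of the definitions that commit describes (the real declarations live in the `concurrency/lock` package; the compatibility helper below is an extra illustration with a deliberately simplified matrix, not part of the package):

```go
package main

import "fmt"

// Strength models how strongly a lock conflicts with other lock attempts.
type Strength int

const (
	Shared Strength = iota // compatible with other Shared locks
	Upgrade                // locked for reading, with intent to upgrade later
	Exclusive              // conflicts with all other strengths
)

// Durability models whether a lock survives failures of its leaseholder.
type Durability int

const (
	Unreplicated Durability = iota // in-memory only
	Replicated                     // persisted as an intent in the keyspace
)

// conflicts is a simplified compatibility check in which only a pair of
// Shared locks is compatible; the real matrix is more nuanced.
func conflicts(a, b Strength) bool {
	return a != Shared || b != Shared
}

func main() {
	fmt.Println(conflicts(Shared, Shared))    // false
	fmt.Println(conflicts(Shared, Exclusive)) // true
}
```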
This is getting a lot closer. Thanks for all of your work on this.
Reviewed 5 of 5 files at r13, 3 of 3 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 107 at r10 (raw file):
Previously, sumeerbhola wrote…
Thanks. Removed it from the comment.
If we already dedup push attempts by tracking what is in-flight, why do we need a distinguished state -- all waiters could try to push immediately and only one would be allowed to. What am I missing?
Actually you're right: on the synchronous txn pushing path (which we'd be on here), we pass `false` for `skipIfInFlight`.
pkg/storage/concurrency/lock_table.go, line 140 at r10 (raw file):
Previously, sumeerbhola wrote…
Done
And also update the comment to be in the same order.
pkg/storage/concurrency/lock_table.go, line 667 at r10 (raw file):
Previously, sumeerbhola wrote…
Added a comment.
Thanks. This is all going to get more complicated when we allow multiple lock strengths, but for now it looks good.
pkg/storage/concurrency/lock_table.go, line 732 at r10 (raw file):
Previously, sumeerbhola wrote…
That is a good point. I've added a TODO in `lockState` listing all the places where we are searching in these two lists, and added your suggestion on how to make some of these faster. I'd prefer to wait until we have some profiling data -- my expectation is that inactive waiters will be rare, so `inQueue` will rarely be true. Is that ok?
SGTM
pkg/storage/concurrency/lock_table.go, line 740 at r10 (raw file):
Previously, sumeerbhola wrote…
I've updated the TODO that says to fix all the error messages to also include replacing "bug" errors with panics. Is it ok if I do this in a followup PR, since none of the error paths are currently tested?
SGTM
pkg/storage/concurrency/lock_table.go, line 1199 at r10 (raw file):
Previously, sumeerbhola wrote…
Done
Hm, I don't see that yet.
pkg/storage/concurrency/lock_table.go, line 1499 at r10 (raw file):
Previously, sumeerbhola wrote…
Do you mean that `g.spans` should not be accessed by `lockTableImpl`, or are you referring to the outer scope in this function? If the latter, this seems a minor issue with the generally very awkward iteration that happens here because of what `BTree` exposes. I am hoping this will be simpler when switching to our own btree.
SGTM, thanks.
pkg/storage/concurrency/lock_table.go, line 1516 at r10 (raw file):
Previously, sumeerbhola wrote…
I've added a comment that we don't need the read-lock, but holding it is harmless. The reason I released it here is that it was awkward to remember to release it in the various return paths below, and using `defer` would require me to use a closure. The `isEmptyLock` check in `tryActiveWait` would be needed even if we held the read-lock. It happens because the transition to empty for a lock happens without holding the write-lock on a tree (since we don't want to pessimistically hold the write-lock for the tree), so before the write-lock is acquired, read-locks can be acquired and see the empty lock.
👍 I guess that transition will also be possible with the COW btree.
pkg/storage/concurrency/lock_table.go, line 1542 at r10 (raw file):
Previously, sumeerbhola wrote…
This is mostly a performance optimization to avoid doing the string equality comparison. A single call to `AscendRange()` may call this func multiple times, so the first func call will set this to false. I've added a comment.
👍
pkg/storage/concurrency/lock_table.go, line 33 at r13 (raw file):
// The database is read and written using "requests". Transactions are composed // of one or more requests. Isolation id needed across requests. Additionally,
s/id/is/
pkg/storage/concurrency/lock_table.go, line 532 at r13 (raw file):
// https://www.postgresql.org/docs/12/sql-select.html#SQL-FOR-UPDATE-SHARE seqs []enginepb.TxnSeq // The timestamp at which the lock is held. If the transaction
Looks like this comment got cut off.
pkg/storage/concurrency/lock_table.go, line 819 at r13 (raw file):
if waitForTxn == nil { if l.reservation == g { // Already reserved by this request
nit: missing a period.
pkg/storage/concurrency/lock_table.go, line 951 at r13 (raw file):
return nil, errors.Errorf("caller violated contract") } if lastSeq == txn.Sequence {
I'm glad you thought of this case. You might want to just add `// Detect idempotent lock acquisition.` above it.
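As an illustration of the case being discussed, here is a hedged Go sketch of an idempotent-acquisition check; `heldLock` and `acquire` are stand-ins, not the lockTable's actual code, and the real implementation also handles epoch changes:

```go
package main

import "fmt"

// heldLock is an illustrative record of which sequence numbers of a
// transaction (at a given epoch) have acquired a lock.
type heldLock struct {
	epoch int32
	seqs  []int32 // ascending sequence numbers that acquired the lock
}

func acquire(l *heldLock, epoch, seq int32) error {
	last := l.seqs[len(l.seqs)-1]
	if epoch == l.epoch && last == seq {
		// Detect idempotent lock acquisition (e.g. a replayed request):
		// re-acquiring at the same epoch and sequence is a no-op.
		return nil
	}
	if epoch == l.epoch && last > seq {
		// A regressing sequence number violates the caller's contract.
		return fmt.Errorf("caller violated contract: seq %d < last %d", seq, last)
	}
	l.seqs = append(l.seqs, seq)
	return nil
}

func main() {
	l := &heldLock{epoch: 1, seqs: []int32{2}}
	fmt.Println(acquire(l, 1, 2), l.seqs) // idempotent: seqs unchanged
	fmt.Println(acquire(l, 1, 3), l.seqs) // new seq appended
}
```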
pkg/storage/concurrency/lock_table.go, line 1147 at r13 (raw file):
} // In increasing order of Start, and for equal Start in decreasing order of End.
I don't think we need this. We should be able to require the caller to provide us seq num ranges with these properties. See `cockroach/pkg/roachpb/data.proto`, line 497 in 6bf3a7b: `// A list of ignored seqnum ranges.`
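Given that contract (non-overlapping ranges sorted by `Start`), a membership check can binary-search rather than re-sort. This is an illustrative sketch with a stand-in for `enginepb.IgnoredSeqNumRange`:

```go
package main

import (
	"fmt"
	"sort"
)

// ignoredRange is a stand-in for enginepb.IgnoredSeqNumRange; bounds are
// inclusive, ranges are non-overlapping and sorted by Start (and hence
// also by End), matching the contract in data.proto.
type ignoredRange struct{ Start, End int32 }

// isIgnored reports whether seq falls in any of the sorted ranges.
func isIgnored(ignored []ignoredRange, seq int32) bool {
	// Find the first range whose End is >= seq; seq is ignored iff that
	// range also starts at or before seq.
	i := sort.Search(len(ignored), func(i int) bool {
		return ignored[i].End >= seq
	})
	return i < len(ignored) && ignored[i].Start <= seq
}

func main() {
	ranges := []ignoredRange{{2, 3}, {6, 8}}
	fmt.Println(isIgnored(ranges, 3)) // true
	fmt.Println(isIgnored(ranges, 5)) // false
}
```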
pkg/storage/concurrency/lock_table.go, line 1202 at r13 (raw file):
return nil, false, nil } if intent.Status == roachpb.COMMITTED || intent.Status == roachpb.ABORTED {
intent.Status.IsFinalized()
pkg/storage/concurrency/lock_table.go, line 1575 at r13 (raw file):
return key.Compare(s[i].Key) < 0 }) if i > 0 && ((len(s[i-1].EndKey) > 0 && key.Compare(s[i-1].EndKey) < 0) || key.Equal(s[i-1].Key)) {
`Span.ContainsKey`?
pkg/storage/concurrency/lock_table_test.go, line 371 at r13 (raw file):
return str + err.Error() } if state.stateKind == doneWaiting {
nit: move this into the switch statement.
pkg/storage/concurrency/lock_table_test.go, line 383 at r13 (raw file):
typeStr = "waitElsewhere" case waitSelf: return fmt.Sprintf("%sstate=waitSelf", str)
nit: `return str + "state=waitSelf"` like you have with `doneWaiting`.
pkg/storage/concurrency/testdata/lock_table, line 30 at r14 (raw file):
done r=req1 ----
Let's add a `print` here.
pkg/storage/concurrency/testdata/lock_table, line 45 at r14 (raw file):
done r=req2 ----
Let's add a `print` here. In general, I think it would be useful to add a few more of these, because they allow the reader to "checkpoint" their understanding of the state of the lockTable.
pkg/storage/concurrency/testdata/lock_table, line 205 at r14 (raw file):
global: num=5 lock: "a" holder: txn: 00000000-0000-0000-0000-000000000003, ts: 0.000000006,0
nit: it would be nice if we could chop off these leading zeros somehow. Maybe just add a TODO?
43850: util/interval: generate type-safe specializations of interval B-tree r=petermattis a=nvanbenschoten See #43740. This PR pulls the existing immutable interval B-Tree structure out of the spanlatch package and into a utility library. In order to avoid a loss of both type safety and performance, it opts for code generation by using https://github.com/mmatczuk/go_generics. The new util library exposes a script that is meant to be invoked by go generate to create specializations of the ordered tree structure for individual use cases. This ensures that each use can achieve the performance of being hand-written without needing to continue the pattern of copying the btree code and manually modifying it each time we want to use it. cc. @petermattis @sumeerbhola I'm curious to get your thoughts on this experiment. I'm hoping this also unblocks @sumeerbhola on #43740. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
force-pushed from d5d2163 to 8762818
This is getting a lot closer. Thanks for all of your work on this.
This has been fun!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/storage/concurrency/lock_table.go, line 107 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Actually you're right, on the synchronous txn pushing path (which we'd be on here), we pass `false` for `skipIfInFlight`.
Thanks for the clarification.
pkg/storage/concurrency/lock_table.go, line 140 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
And also update the comment to be in the same order.
Done
pkg/storage/concurrency/lock_table.go, line 1199 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Hm, I don't see that yet.
I forgot to update my response. I subsequently removed the `done` field from `lockGuardImpl` since it wasn't needed with the deferred calls to `findNextLockAfter()` -- the holder of a `requestGuard` shouldn't be calling methods on a `requestGuard` that it has already passed to `lockTable.done()`.
pkg/storage/concurrency/lock_table.go, line 33 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/id/is/
Done
pkg/storage/concurrency/lock_table.go, line 532 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Looks like this comment got cut off.
I deleted the incomplete sentence. The epoch and `ts` monotonicity is mentioned in the comment earlier in the struct.
pkg/storage/concurrency/lock_table.go, line 819 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: missing a period.
Done
pkg/storage/concurrency/lock_table.go, line 951 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm glad you thought of this case. You might want to just add
// Detect idempotent lock acquisition.
above it.
Added. Also added a TODO to test this.
pkg/storage/concurrency/lock_table.go, line 1147 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think we need this. We should be able to require the caller to provide us seq num ranges with these properties. See `cockroach/pkg/roachpb/data.proto`, line 497 in 6bf3a7b: `// A list of ignored seqnum ranges.`
That is good. Removed.
pkg/storage/concurrency/lock_table.go, line 1202 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
intent.Status.IsFinalized()
Done
pkg/storage/concurrency/lock_table.go, line 1575 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
`Span.ContainsKey`?
I wasn't aware of its existence. It is doing `return bytes.Compare(key, s.Key) >= 0 && bytes.Compare(key, s.EndKey) < 0`, which only seems correct if the span does not have an empty `EndKey` -- some of the call sites look like they are calling with a replica range, which would satisfy this constraint.
pkg/storage/concurrency/lock_table.go, line 626 at r15 (raw file):
// // Read reservations are not permitted due to the complexities discussed in // the review for #43740. Additionally, reads do not queue for their turn at
I figured it was better to make this more self-contained, so I added some examples here.
pkg/storage/concurrency/lock_table.go, line 688 at r15 (raw file):
// breaks the reservation at A. // // Extension for Shared and Upgrade locks:
Added a comment on how this will extend with other lock strengths.
pkg/storage/concurrency/lock_table_test.go, line 371 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move this into the switch statement.
Done
pkg/storage/concurrency/lock_table_test.go, line 383 at r13 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: `return str + "state=waitSelf"` like you have with `doneWaiting`
Done
pkg/storage/concurrency/testdata/lock_table, line 30 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's add a
Done
pkg/storage/concurrency/testdata/lock_table, line 45 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's add a
Done (I've added prints for more commands).
pkg/storage/concurrency/testdata/lock_table, line 205 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: it would be nice if we could chop off these leading zeros somehow. Maybe just add a TODO?
Added a TODO
I think this is ready to merge once CI is resolved! Let's just make sure to squash all three commits into a single commit.
Reviewed 3 of 3 files at r15.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)
pkg/storage/concurrency/lock_table.go, line 1575 at r13 (raw file):
Previously, sumeerbhola wrote…
I wasn't aware of its existence. It is doing `return bytes.Compare(key, s.Key) >= 0 && bytes.Compare(key, s.EndKey) < 0`, which only seems correct if the span does not have an empty `EndKey` -- some of the call sites look like they are calling with a replica range, which would satisfy this constraint.
Good point.
pkg/storage/concurrency/lock_table.go, line 688 at r15 (raw file):
Previously, sumeerbhola wrote…
Added a comment on how this will extend with other lock strengths.
Nice comment.
pkg/storage/concurrency/testdata/lock_table, line 45 at r14 (raw file):
Previously, sumeerbhola wrote…
Done (I've added prints for more commands).
Thanks. This is much easier to read.
force-pushed from 8762818 to 8c2a3b6
…ted locks.
- Includes a basic test.
- TODOs planned for followup PRs:
  - Better error messages.
  - More tests (as listed in TODOs in the code).
  - Use the copy-on-write btree.

Release note: None
force-pushed from 8c2a3b6 to 2333372
bors r+
Build succeeded
Informs cockroachdb#41720. The commit creates a new concurrency package that provides a concurrency manager structure that encapsulates the details of concurrency control and contention handling for serializable key-value transactions. Interested readers should start at `concurrency_control.go` and move out from there. The new package has a few primary objectives: 1. centralize the handling of request synchronization and transaction contention handling in a single location, allowing for the topic to be documented, understood, and tested in isolation. 2. rework contention handling to react to intent state transitions directly. This simplifies the transaction queueing story, reduces the frequency of transaction push RPCs, and allows waiters to proceed immediately after intent resolution. 3. create a framework that naturally permits "update" locking, which is required for kv-level SELECT FOR UPDATE support (cockroachdb#6583). 4. provide stronger guarantees around fairness when transactions conflict, to reduce tail latencies under contended scenarios. 5. create a structure that can extend to address the long-term goals of a fully centralized lock-table laid out in cockroachdb#41720. This commit pulls over a lot of already reviewed code from cockroachdb#43775. The big differences are that it updates the lockTable interface to match the one introduced in cockroachdb#43740 and it addresses the remaining TODOs to document the rest of the concurrency control mechanisms in CockroachDB. At this point, a reader of this file should get a good idea of how KV transactions interact in CRDB... now we just need to make the system actually work this way.
44787: storage/concurrency: define concurrency control interfaces r=nvanbenschoten a=nvanbenschoten Informs #41720. The commit creates a new concurrency package that provides a concurrency manager structure that encapsulates the details of concurrency control and contention handling for serializable key-value transactions. Interested readers should start at `concurrency_control.go` and move out from there. The new package has a few primary objectives: 1. centralize the handling of request synchronization and transaction contention handling in a single location, allowing for the topic to be documented, understood, and tested in isolation. 2. rework contention handling to react to intent state transitions directly. This simplifies the transaction queueing story, reduces the frequency of transaction push RPCs, and allows waiters to proceed immediately after intent resolution. 3. create a framework that naturally permits "update" locking, which is required for kv-level SELECT FOR UPDATE support (#6583). 4. provide stronger guarantees around fairness when transactions conflict, to reduce tail latencies under contended scenarios. 5. create a structure that can extend to address the long-term goals of a fully centralized lock-table laid out in #41720. This commit pulls over a lot of already reviewed code from #43775. The big differences are that it updates the lockTable interface to match the one introduced in #43740 and it addresses the remaining TODOs to document the rest of the concurrency control mechanisms in CockroachDB. At this point, a reader of this file should get a good idea of how KV transactions interact in CRDB... now we just need to make the system actually work this way. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
No description provided.