kv/concurrency: avoid redundant txn pushes and batch intent resolution #49218
Conversation
76d0b90 to 639ecb8
Generally looks good. I haven't yet looked carefully at the tests.
Reviewed 3 of 10 files at r1, 6 of 8 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/concurrency/concurrency_control.go, line 636 at r2 (raw file):
```go
WaitOnLock(context.Context, Request, *roachpb.Intent) *Error

// ClearCaches wipes all caches maintained by the lockTableWaiter.
```
Is this also for the situation where the lockTable is disabled? Could you add a code comment?
And is this just to recover memory -- a stale cache of committed/aborted transactions is presumably harmless for correctness?
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 136 at r2 (raw file):
```go
// batching of intent resolution while cleaning up after abandoned txns.
var deferredResolution []roachpb.LockUpdate
defer w.resolveDeferredIntents(ctx, &err, &deferredResolution)
```
If there are N cache hits we would build up N intents in this slice and then a miss would cause this request to wait without doing the resolution. Should the deferred batch be flushed based on a timer? Perhaps worth a code comment.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 162 at r2 (raw file):
```go
// had a cache of aborted transaction IDs that allowed us to notice
// and quickly resolve abandoned intents then we might be able to
// get rid of this state.
```
We now have a cache. But don't we still need the short timer to quickly add the aborted txn to the cache? If so, probably should reword this comment.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 196 at r2 (raw file):
```go
// This allows us to accumulate a group of intents to resolve
// and send them together as a batch.
if livenessPush {
```
I think it is worth commenting again here that if the lock is held there will be at least one waiter (the distinguished waiter) that will have `livenessPush=true`, and so will do this work that will benefit all waiters.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 207 at r2 (raw file):
```go
// the future (before we observe the corresponding MVCC state).
// This is safe because we already handle cases where locks
// exist only in the MVCC keyspace and not in the lockTable.
```
This is justifying why this is correct for replicated locks. How about adding something like: for unreplicated locks, the `lockTable` is the source of truth, and removing the locks held by an already committed/aborted transaction does not affect correctness.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 592 at r2 (raw file):
```go
type finalizedTxnCache struct {
	mu   syncutil.Mutex
	txns [8]*roachpb.Transaction // [MRU, ..., LRU]
```
How big, wrt memory footprint, are these `Transaction` objects? I am trying to understand the reason for the choice of 8.
LGTM as someone looking at this for the first time.
Reviewed 9 of 10 files at r1, 8 of 8 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 344 at r1 (raw file):
```go
} else {
	// Disable all queues - the concurrency manager will no longer be
	// informed about all state transitions to locks and transactions.
```
Comment here could be updated to say we're also clearing out the cache, and why (also, unclear to me why we need this).
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 96 at r1 (raw file):
```go
// For now, we don't do this because we don't share any state between
// separate concurrency.Manager instances.
knownTxns finalizedTxnCache
```
[nit] I think `knownTxns` is a bit vague, but I don't have a better suggestion. Perhaps just `finalizedTxnCache`? The type itself is a generic txn LRU cache, and the comment block above the type definition is better suited here, I think.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 119 at r1 (raw file):
```go
ResolveIntent(context.Context, roachpb.LockUpdate, intentresolver.ResolveOptions) *Error

// ResolveIntent synchronously resolves the provided batch of intents.
```
s/ResolveIntent/ResolveIntents
639ecb8 to 9d39d20
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)
pkg/kv/kvserver/concurrency/concurrency_control.go, line 636 at r2 (raw file):
> Is this also for the situation where the lockTable is disabled? Could you add a code comment?
Done.
> And is this just to recover memory -- a stale cache of committed/aborted transactions is presumably harmless for correctness?

Yes, that's correct. A stale cache entry is harmless; it just consumes memory and can also interact with tests.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 344 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
> Comment here could be updated to say we're also clearing out the cache, and why (also, unclear to me why we need this).
Done.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 96 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
> [nit] I think `knownTxns` is a bit vague, but I don't have a better suggestion. Perhaps just `finalizedTxnCache`? The type itself is a generic txn LRU cache, and the comment block above the type definition is better suited here, I think.
Good point! Done.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 119 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
> s/ResolveIntent/ResolveIntents
Done.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 136 at r2 (raw file):
Previously, sumeerbhola wrote…
> If there are N cache hits we would build up N intents in this slice and then a miss would cause this request to wait without doing the resolution. Should the deferred batch be flushed based on a timer? Perhaps worth a code comment.
I added a comment. I don't think there's a strong need to flush deferred intent resolution on a timer. There is a subtle interaction here in cases where the request that deferred the intent resolution does not acquire a reservation on the key, but flushing wouldn't address this anyway.
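The accumulate-then-flush shape being discussed can be sketched as follows. This is a toy model, not the real code: `lockUpdate`, `resolveBatch`, and `waitOn` are stand-ins, and the model ignores the waiting itself, showing only how cache hits defer resolution into one batch that is flushed via `defer` on return:

```go
package main

import "fmt"

// lockUpdate stands in for roachpb.LockUpdate; none of these names are
// the real CockroachDB types.
type lockUpdate struct{ key string }

// resolveBatch models a single round of intent resolution that cleans
// up all accumulated intents together instead of one RPC per intent.
func resolveBatch(batch []lockUpdate) {
	if len(batch) > 0 {
		fmt.Printf("resolving %d intents in one batch\n", len(batch))
	}
}

// waitOn models the waiter loop: locks held by known-finalized txns are
// deferred rather than pushed, and the whole batch is flushed once via
// defer when the function returns. It reports how many pushes were
// performed and how many resolutions were deferred.
func waitOn(locks []string, finalized map[string]bool) (pushes, deferredCount int) {
	var deferred []lockUpdate
	defer func() {
		deferredCount = len(deferred)
		resolveBatch(deferred)
	}()
	for _, l := range locks {
		if finalized[l] {
			// Cache hit: skip the liveness delay and txn push entirely.
			deferred = append(deferred, lockUpdate{key: l})
			continue
		}
		pushes++ // cache miss: this lock still requires a push
	}
	return
}

func main() {
	finalized := map[string]bool{"a": true, "b": true, "c": true}
	p, d := waitOn([]string{"a", "b", "c", "d"}, finalized)
	fmt.Println(p, d)
}
```

As the review exchange notes, a miss in the middle of the loop would leave already-deferred intents unresolved while the request waits; the sketch makes it easy to see that the flush only happens when the waiter returns.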
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 162 at r2 (raw file):
Previously, sumeerbhola wrote…
> We now have a cache. But don't we still need the short timer to quickly add the aborted txn to the cache? If so, probably should reword this comment.
Done. I just removed it because I don't think the cache is effective enough at avoiding redundant pushes to warrant risking multiple 100ms delays. Right now, it's not even shared between ranges. Even if it was, there's no way to share it between nodes. We really don't want to slow down the initial push if we have scenarios where this delay can compound.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 196 at r2 (raw file):
Previously, sumeerbhola wrote…
> I think it is worth commenting again here that if the lock is held there will be at least one waiter (the distinguished waiter) that will have `livenessPush=true`, and so will do this work that will benefit all waiters.
Done.
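The distinguished-waiter invariant referenced above can be illustrated with a toy model (names here are illustrative, not the actual lockTable types): among the waiters queued on a held lock, exactly one is distinguished, and only it performs the liveness push whose result benefits the whole queue:

```go
package main

import "fmt"

// waiter models an entry in a lockWaitQueue. Only the distinguished
// waiter performs the liveness push; the remaining waiters simply
// listen for the resulting lock state transition.
type waiter struct {
	id            int
	distinguished bool
}

// livenessPushers returns the IDs of the waiters that would push the
// lock holder's txn record: exactly the distinguished one.
func livenessPushers(ws []waiter) []int {
	var out []int
	for _, w := range ws {
		if w.distinguished {
			out = append(out, w.id)
		}
	}
	return out
}

func main() {
	queue := []waiter{{1, true}, {2, false}, {3, false}}
	fmt.Println(livenessPushers(queue)) // only the distinguished waiter pushes
}
```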
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 207 at r2 (raw file):
Previously, sumeerbhola wrote…
> This is justifying why this is correct for replicated locks. How about adding something like: for unreplicated locks, the `lockTable` is the source of truth, and removing the locks held by an already committed/aborted transaction does not affect correctness.
Done.
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 592 at r2 (raw file):
Previously, sumeerbhola wrote…
> How big, wrt memory footprint, are these `Transaction` objects? I am trying to understand the reason for the choice of 8.
They are 288 bytes large, so 8 puts us just over 2KB per leaseholder range. Even on large nodes, we tend to top out at about 10k leaseholders per node, so that puts the maximum size of this cache across an entire process at ~20MB, which seems fine for now. There's a TODO to move this to a per-store cache, which would allow us to increase its size significantly and remove the num_leaseholder factor from the size.
Reviewed 10 of 10 files at r3, 8 of 8 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)
bors r+
Build failed (retrying...)
Reviewed 4 of 10 files at r1, 2 of 10 files at r3, 8 of 8 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/concurrency/testdata/concurrency_manager/clear_abandoned_intents, line 69 at r4 (raw file):
```
[3] sequence req1: pushing timestamp of txn 00000002 above 0.000000010,1
[3] sequence req1: blocked on select in concurrency_test.(*cluster).PushTransaction
```
Could you add another debug-lock-table here and add a comment that since txn1 is the distinguished waiter for "a" it will try to push and that is the first intent resolution of "a" below (due to the abort of txn2), and then the cache kicks in to cause a batch resolution of the others.
bors r-
Canceled
Fixes cockroachdb#48790. Informs cockroachdb#36876. Closes cockroachdb#31664.

This commit adds a per-Range LRU cache of transactions that are known to be aborted or committed. We use this cache in the lockTableWaiter for two purposes:

1. When we see a lock held by a known-finalized txn, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (10 ms) nor push the transaction's record (an RPC to the leaseholder of the pushee's txn record range).
2. We use the existence of a transaction in the cache as an indication that it may have abandoned multiple intents, perhaps due to a failure of the transaction coordinator node, so we begin deferring intent resolution to enable batching.

Together, these two changes make us much more effective at cleaning up after failed transactions that have abandoned a large number of intents. The following example demonstrates this:

```sql
--- BEFORE
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 2m50.801304266s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 3m26.874571045s

--- AFTER
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.138220753s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 48.763541138s
```

Notice that we are still not as fast at cleaning up intents on the insertion path as we are on the retrieval path. This is because we only batch the resolution of intents observed by a single request at a time. For the scanning case, a single ScanRequest notices all 10,000 intents and cleans them all up together. For the insertion case, each of the 10,000 PutRequests notices a single intent, and each intent is cleaned up individually. So this case benefits only from the first part of this change (no liveness delay or txn record push) and not from the second part (intent resolution batching). For this reason, we still haven't solved all of cockroachdb#36876. To completely address that, we'll need to defer propagation of WriteIntentError during batch evaluation, like we do for WriteTooOldErrors. Or we can wait for the future LockTable changes - once we remove all cases where an intent is not "discovered", the changes here will effectively address cockroachdb#36876.

This was a partial regression in v20.1, so we'll want to backport this to that release branch. This change is on the larger side, but I feel ok about it because the mechanics aren't too tricky. I'll wait a week before backporting just to see if anything falls out.

Release note (bug fix): Abandoned intents due to failed transaction coordinators are now cleaned up much faster. This resolves a regression in v20.1.0 compared to prior releases.
These structs are 288 bytes large - a little too large to copy around unnecessarily when we already have pointers to their original, immutable instance on the heap.
9d39d20 to 3420bf8
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @sumeerbhola)
pkg/kv/kvserver/concurrency/testdata/concurrency_manager/clear_abandoned_intents, line 69 at r4 (raw file):
Previously, sumeerbhola wrote…
> Could you add another debug-lock-table here and add a comment that since txn1 is the distinguished waiter for "a" it will try to push, and that is the first intent resolution of "a" below (due to the abort of txn2), and then the cache kicks in to cause a batch resolution of the others.
Done.
Build succeeded
…y to 50ms

This change increases the default value for `kv.lock_table.coordinator_liveness_push_delay`. Increasing this delay is a big win for contended workloads because it allows transactions to queue locally to conflicting locks (in the lockTable's lockWaitQueue) instead of remotely next to the conflicting lock holders' txn records (in the txnWaitQueue) in more cases. This has a big effect on high contention / high concurrency workloads.

As we'll see in the following experiment, YCSB workload A would change regimes somewhere between 200 and 256 concurrent threads. Operations would slow down just enough that waiting txns would start performing liveness pushes and become much less effective at queuing and responding to transaction commits. This would cause the rest of the operations to slow down, and suddenly everyone was pushing and the front of every lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they overlap) shows YCSB workload A run with 256 threads and a 10ms liveness push delay. We expect ~36k qps, but we're only sustaining between 5-15k qps. When we bump this delay to 50ms on the right, we see much higher throughput, stabilizing at ~36k. The workload is able to stay in the efficient regime (no liveness pushes, listening directly to lock state transitions) for far longer. I've tested up to 512 concurrent workers on the same workload and never seen us enter the slow regime.

<todo throughput graph>
<todo pushes graph>

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node. This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in cockroachdb#49218) gets us part of the way here. It allows us to pay the liveness push delay cost once per abandoned transaction per range instead of once per each of an abandoned transaction's locks. This helped us to feel comfortable increasing the default delay from the original 10ms to the current 50ms. Still, to feel comfortable increasing this further, we'd want to improve this cache (e.g. lifting it to the store level) to reduce the cost to once per abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run the same tests that we ran in cockroachdb#49218:

```sql
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.574853s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 33.155231s

--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.547465s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are now delayed by 50ms, up from 10ms. This allows high contention workloads to sustain high throughput up to much larger concurrency levels.
48306: sql: implement JSON{,B}_{OBJECT,}_AGG r=yuzefovich a=C0rWin

Fixes #19978

This PR adds a built-in aggregate function that aggregates a query result set into a JSON object. It implements the json{,b}_object_agg built-in by adding a new construct that tracks memory usage, and wires the function up to support queries that aggregate name/value pairs into a JSON object.

Release note (sql change): enable aggregate queries which collect key/value pairs into a JSON object.

50133: opt: refactor and test the JoinMultiplicity library r=DrewKimball a=DrewKimball

Previously, the logic in the JoinMultiplicity library was complicated and untested. Multiple bugs have been both found and introduced during refactoring of the original JoinFiltersMatchAllLeftRows logic. This patch refactors the code to make it easier to understand and introduces unit testing for the library.

This patch also makes the following changes to the behavior of multiplicity_builder:
1. Fixed an issue that caused filtersMatchAllLeftRows to return false positives.
2. Foreign key columns only invalidate a foreign key if they were nullable in the base table. Since we derive the foreign key from a left not-null column in the first place, we know that columns from this table haven't been null-extended, so a not-null column will stay not-null. This is useful when foreign key columns are not in the output set.
3. All columns from a table (including non-output columns) are added to the UnfilteredCols field. This allows foreign key relations to be validated even when the foreign key columns are not in the output set.

Fixes #50059

Release note: None

50161: kv/concurrency: increase kv.lock_table.coordinator_liveness_push_delay to 50ms r=nvanbenschoten a=nvanbenschoten

This change increases the default value for `kv.lock_table.coordinator_liveness_push_delay`. Increasing this delay is a big win for contended workloads because it allows transactions to queue locally to conflicting locks (in the lockTable's lockWaitQueue) instead of remotely next to the conflicting lock holders' txn records (in the txnWaitQueue) in more cases. This has a big effect on high contention / high concurrency workloads.

As we'll see in the following experiment, YCSB workload A would change regimes somewhere between 200 and 256 concurrent threads. Operations would slow down just enough that waiting txns would start performing liveness pushes and become much less effective at queuing and responding to transaction commits. This would cause the rest of the operations to slow down, and suddenly everyone was pushing and the front of every lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they overlap) shows YCSB workload A run with 256 threads and a 10ms liveness push delay. We expect ~36k qps, but we're only sustaining between 5-15k qps. When we bump this delay to 50ms on the right, we see much higher throughput, stabilizing at ~36k. The workload is able to stay in the efficient regime (no liveness pushes, listening directly to lock state transitions) for far longer. I've tested up to 512 concurrent workers on the same workload and never seen us enter the slow regime.

<img width="978" alt="Screen Shot 2020-06-12 at 5 08 30 PM" src="https://user-images.githubusercontent.com/5438456/84550421-886d4f80-acd8-11ea-9982-2f2f7a3a4d6b.png">
<img width="1057" alt="Screen Shot 2020-06-12 at 5 18 27 PM" src="https://user-images.githubusercontent.com/5438456/84550428-8c996d00-acd8-11ea-8045-d2db4357e9da.png">

_10ms delay on left, 50ms delay on right_

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node. This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in #49218) gets us part of the way here. It allows us to pay the liveness push delay cost once per abandoned transaction per range instead of once per each of an abandoned transaction's locks. This helped us to feel comfortable increasing the default delay from the original 10ms to the current 50ms. Still, to feel comfortable increasing this further, we'd want to improve this cache (e.g. lifting it to the store level) to reduce the cost to once per abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run the same tests that we ran in #49218:

```sql
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.574853s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 33.155231s

--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.547465s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are now delayed by 50ms, up from 10ms. This allows high contention workloads to sustain high throughput up to much larger concurrency levels.

50245: ui: remove compactor queue graphs r=jbowens a=jbowens

Remove the compactor queue graphs from the Admin UI. They were commonly confused with storage-engine level compactions, and the compactor queue itself will (after future work) no longer be used for Pebble storage engines at all. I left the underlying metrics, assuming they'll still be useful for diagnosing problems on nodes running with RocksDB.

Release note (admin ui change): The 'Queues' dashboard in the Admin UI no longer shows a 'Compaction Queue' graph, and the 'Queue Processing Failures' and 'Queue Processing Times' graphs no longer include the 'Compaction' queue metrics, because these were commonly confused.

Co-authored-by: Artem Barger <bartem@il.ibm.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
…uests

Fixes cockroachdb#50390. Related to cockroachdb#60585. Related to cockroachdb#103126. Closes cockroachdb#64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.

```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH;
SELECT count(*) FROM keys;
```

Without this change, the scan took 70.078s. With this change, the scan took 15.958s. This 78% speed-up checks out. Each encountered intent is resolved serially (see cockroachdb#103126), so the per-intent latency drops from 7ms to 1.6ms. This improvement of about 5ms agrees with `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after cockroachdb#102014, this is still not quite enough to resolve cockroachdb#103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in cockroachdb#49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in cockroachdb#49218, because the intents are each written by different transactions. The speedup for this scenario was presented in cockroachdb#64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.
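The latency trade-off this commit message describes can be reduced to a small sketch. The names mirror the ones in the message (`sendImmediately`, `defaultIntentResolutionBatchWait`), but the types and signatures here are illustrative stand-ins, not the actual `IntentResolver` API:

```go
package main

import (
	"fmt"
	"time"
)

// resolveOptions models an options struct carrying a flag that bypasses
// the request batcher for latency-sensitive callers. Illustrative only.
type resolveOptions struct {
	sendImmediately bool
}

// batchWait models defaultIntentResolutionBatchWait: how long a request
// may sit in the batcher waiting for companions before being sent.
const batchWait = 10 * time.Millisecond

// resolveIntentDelay returns the batching-induced delay a request would
// incur before being sent: zero when sent immediately, up to batchWait
// when it waits for the batch to fill.
func resolveIntentDelay(opts resolveOptions) time.Duration {
	if opts.sendImmediately {
		return 0 // foreground path: no batching-induced latency
	}
	return batchWait // background path: wait for the batch to fill
}

func main() {
	fmt.Println(resolveIntentDelay(resolveOptions{sendImmediately: true}))
	fmt.Println(resolveIntentDelay(resolveOptions{}))
}
```

The design choice is the usual throughput/latency split: background cleanup keeps batching for throughput, while the foreground `lockTableWaiter` path opts out to shave the batching wait off each blocking resolution.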
103265: kv: resolve conflicting intents immediately for latency-sensitive requests r=nvanbenschoten a=nvanbenschoten

Fixes #50390.
Related to #60585.
Related to #103126.

Closes #64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.

```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH;
SELECT count(*) FROM keys;
```

Without this change, the scan took **70.078s**. With this change, the scan took **15.958s**. This **78%** speed-up checks out. Each encountered intent is resolved serially (see #103126), so the per-intent latency drops from **7ms** to **1.6ms**. This improvement of about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after #102014, this is still not quite enough to resolve #103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in #49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in #49218, because the intents are each written by different transactions. The speedup for this scenario was presented in #64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.

103362: sql: validate primary / secondary region localities at end of txn r=fqazi a=fqazi

Previously, if a database was restored with skip_localities, there was no way to modify this database to set the primary region, since validation would kick in too early during the statement. This meant fixing the regions in a restored database was impossible if the primary region was no longer valid. To address this, this patch delays locality validation till the end of the transaction.

Fixes: #103290

Release note (bug fix): SET PRIMARY REGION and SET SECONDARY REGION did not validate transactionally, which could prevent cleaning up removed regions after a restore.

103373: concurrency: small refactors in preparation for reservation removal r=nvanbenschoten a=arulajmani

See individual commits for details.

Informs: #103361

103538: go.mod: bump Pebble to 6f2788660198, rework shared storage wrapper r=RaduBerinde a=RaduBerinde

```
6f278866 shared: improve interface for more efficient reading
9eb2c407 db: log events to testing.T in unit tests
f32e7dc6 db: add reserved Pebblev4 sstable format
5a6b91b8 objstorage: improve test and add read ahead test
2bc4319e objstorage: remove genericFileReadable
8143ffb9 objstorage: fix readaheadState initialization
06d08888 db: add Compact.Duration metric
e7213de0 db: add Uptime metric
e9005aed db: don't delete files during ingest application
222b43ec internal/arenaskl: fix Skiplist doc typo
```

Release note: None
Epic: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
104784: kv/concurrency: batch intent resolution of pushed intents from same txn r=arulajmani a=nvanbenschoten

Fixes #103126.

This commit extends the infrastructure introduced in #49218 for transaction timestamp pushes. It avoids redundant txn pushes of PENDING transactions and batches the resolution of PENDING intents. This breaks the O(num_intents) work performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of transactions that are PENDING and are known to have been pushed to higher timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp that is held by a pushed txn above our read timestamp, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that it may have written multiple intents, so we begin deferring intent resolution to enable batching.

Together, these two changes make us much more effective at pushing transactions with a large number of intents. The following example (from #103126) demonstrates this:

```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before #103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution is only currently enabled for non-locking readers without uncertainty intervals. Readers with uncertainty intervals must contend with the possibility of pushing a conflicting intent up into their uncertainty interval and causing more work for themselves, which is avoided with care by the lockTableWaiter but difficult to coordinate through the txnStatusCache. This limitation is acceptable because the most important case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction, which is that non-locking readers can ignore known pending intents without the need to even resolve those intents (see #94730). This will require a request-scoped cache of pending, pushed transactions, which does not have the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work proportional to the number of pending intents that they encounter, so they are over 100x faster when encountering long-running, bulk writing transactions.

Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Fixes #48790.
Informs #36876.
Closes #31664.
This commit adds a per-Range LRU cache of transactions that are known to be aborted or committed. We use this cache in the lockTableWaiter for two purposes:

1. when we see a lock held by a known-finalized transaction, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (10 ms) nor push the transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that it may have abandoned multiple intents, so we begin deferring intent resolution to enable batching.

Together, these two changes make us much more effective at cleaning up after failed transactions that have abandoned a large number of intents. The following example demonstrates this:
Notice that we are still not as fast at cleaning up intents on the insertion path as we are at doing so on the retrieval path. This is because we only batch the resolution of intents observed by a single request at a time. For the scanning case, a single ScanRequest notices all 10,000 intents and cleans them all up together. For the insertion case, each of the 10,000 PutRequests notices a single intent, and each intent is cleaned up individually. So this case is only benefited by the first part of this change (no liveness delay or txn record push) and not the second part of this change (intent resolution batching).
For this reason, we still haven't solved all of #36876. To completely address that, we'll need to defer propagation of `WriteIntentError` during batch evaluation, as we do for `WriteTooOldError`s. Or we can wait out the future LockTable changes - once we remove all cases where an intent is not "discovered", the changes here will effectively address #36876.

This was a partial regression in v20.1, so we'll want to backport this to that release branch. This change is on the larger side, but I feel ok about it because the mechanics aren't too tricky. I'll wait a week before backporting just to see if anything falls out.
Release note (bug fix): Abandoned intents due to failed transaction coordinators are now cleaned up much faster. This resolves a regression in v20.1.0 compared to prior releases.
@irfansharif I'm adding you as a reviewer because there's not really anyone else on KV that knows this code, so we should change that.