kv/concurrency: avoid redundant txn pushes and batch intent resolution #49218

Merged

Conversation

nvanbenschoten
Member

Fixes #48790.
Informs #36876.
Closes #31664.

This commit adds a per-Range LRU cache of transactions that are known to be aborted or committed. We use this cache in the lockTableWaiter for two purposes (see the sketch after this list):

  1. when we see a lock held by a known-finalized txn, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (10 ms) nor push the transaction's record (an RPC to the leaseholder of the pushee's txn record range).
  2. we use the existence of a transaction in the cache as an indication that it may have abandoned multiple intents, perhaps due to a failure of the transaction coordinator node, so we begin deferring intent resolution to enable batching.
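
As a rough illustration of the data structure being described, here is a minimal, self-contained Go sketch of a tiny fixed-size cache ordered from most to least recently used. All names and types below are hypothetical stand-ins for this example; the actual cache in this PR stores `*roachpb.Transaction` entries inside the lockTableWaiter and is not reproduced here.

```go
package main

import "fmt"

type txnID string

// txn is a stand-in for a transaction record; only the fields the sketch
// needs are included.
type txn struct {
	id        txnID
	finalized bool // committed or aborted
}

// finalizedTxnCache keeps a small, fixed number of recently seen finalized
// transactions, ordered from most to least recently used.
type finalizedTxnCache struct {
	txns [8]*txn // [MRU, ..., LRU]
}

// add inserts the txn at the MRU position, evicting the LRU entry if needed.
func (c *finalizedTxnCache) add(t *txn) {
	if idx := c.indexOf(t.id); idx >= 0 {
		// Already present: shift the entries in front of it down by one.
		copy(c.txns[1:idx+1], c.txns[:idx])
	} else {
		// Not present: shift everything down, dropping the LRU entry.
		copy(c.txns[1:], c.txns[:len(c.txns)-1])
	}
	c.txns[0] = t
}

// get returns the cached txn and refreshes its recency, if present.
func (c *finalizedTxnCache) get(id txnID) (*txn, bool) {
	idx := c.indexOf(id)
	if idx < 0 {
		return nil, false
	}
	t := c.txns[idx]
	c.add(t) // move to the MRU position
	return t, true
}

func (c *finalizedTxnCache) indexOf(id txnID) int {
	for i, t := range c.txns {
		if t != nil && t.id == id {
			return i
		}
	}
	return -1
}

func main() {
	var c finalizedTxnCache
	c.add(&txn{id: "a", finalized: true})
	c.add(&txn{id: "b", finalized: true})
	if t, ok := c.get("a"); ok {
		fmt.Println("cache hit:", t.id) // a hit lets the waiter skip the push
	}
}
```

In these terms, a lookup hit on a finalized txn is what lets the waiter skip both the liveness push delay and the txn record push described in item 1.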

Together, these two changes make us much more effective at cleaning up after failed transactions that have abandoned a large number of intents. The following example demonstrates this:

--- BEFORE

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 2m50.801304266s


CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 3m26.874571045s



--- AFTER

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.138220753s


CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 48.763541138s

Notice that we are still not as fast at cleaning up intents on the insertion path as we are at doing so on the retrieval path. This is because we only batch the resolution of intents observed by a single request at a time. For the scanning case, a single ScanRequest notices all 10,000 intents and cleans them all up together. For the insertion case, each of the 10,000 PutRequests notices a single intent, and each intent is cleaned up individually. So this case benefits only from the first part of this change (no liveness delay or txn record push) and not from the second part (intent resolution batching); the sketch below illustrates the per-request scope of the batching.
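
To illustrate that per-request scope, here is a hypothetical, self-contained Go sketch of the defer-and-flush pattern: intents from known-finalized txns are queued while a single request waits on locks, and are resolved together in one batch when that request is done waiting. The function and type names are invented for this example and do not match the real lockTableWaiter or IntentResolver APIs.

```go
package main

import "fmt"

// lockUpdate is a stand-in for the per-key intent resolution work item.
type lockUpdate struct{ key string }

// resolveBatch stands in for a single batched intent-resolution request.
func resolveBatch(updates []lockUpdate) {
	fmt.Printf("resolving %d intents in one batch\n", len(updates))
}

// waitOnLocks models one request waiting on a set of conflicting locks whose
// holders are known to be finalized.
func waitOnLocks(abandonedKeys []string) {
	var deferred []lockUpdate
	// Flush whatever accumulated, once, when this request is done waiting.
	defer func() {
		if len(deferred) > 0 {
			resolveBatch(deferred)
		}
	}()

	for _, k := range abandonedKeys {
		// Cache hit on a finalized txn: skip the liveness delay and the push,
		// and defer the intent resolution so it can be batched.
		deferred = append(deferred, lockUpdate{key: k})
	}
}

func main() {
	// A scan-like request that notices many abandoned intents batches them all.
	waitOnLocks([]string{"a", "b", "c"})
	// A put-like request that notices a single intent still resolves it alone.
	waitOnLocks([]string{"d"})
}
```

A ScanRequest corresponds to one such call covering many keys, while each PutRequest corresponds to its own call covering a single key, which is why the insertion path sees no batching benefit.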

For this reason, we still haven't solved all of #36876. To completely address that, we'll need to defer propagation of WriteIntentError during batch evaluation, as we do for WriteTooOldErrors. Alternatively, we can wait for the future LockTable changes - once we remove all cases where an intent is not "discovered", the changes here will effectively address #36876.

This was a partial regression in v20.1, so we'll want to backport this to that release branch. This change is on the larger side, but I feel ok about it because the mechanics aren't too tricky. I'll wait a week before backporting just to see if anything falls out.

Release note (bug fix): Abandoned intents due to failed transaction coordinators are now cleaned up much faster. This resolves a regression in v20.1.0 compared to prior releases.

@irfansharif I'm adding you as a reviewer because there's not really anyone else on KV that knows this code, so we should change that.


@nvanbenschoten nvanbenschoten force-pushed the nvanbenschotenn/lockTableWaiterCache branch from 76d0b90 to 639ecb8 Compare May 19, 2020 14:00
Collaborator

@sumeerbhola sumeerbhola left a comment


Generally looks good. I haven't yet looked carefully at the tests.

Reviewed 3 of 10 files at r1, 6 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/concurrency/concurrency_control.go, line 636 at r2 (raw file):

	WaitOnLock(context.Context, Request, *roachpb.Intent) *Error

	// ClearCaches wipes all caches maintained by the lockTableWaiter.

Is this also for the situation where the lockTable is disabled? Could you add a code comment?
And is this just to recover memory -- a stale cache of committed/aborted transactions is presumably harmless for correctness?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 136 at r2 (raw file):

	// batching of intent resolution while cleaning up after abandoned txns.
	var deferredResolution []roachpb.LockUpdate
	defer w.resolveDeferredIntents(ctx, &err, &deferredResolution)

If there are N cache hits we would build up N intents in this slice and then a miss would cause this request to wait without doing the resolution. Should the deferred batch be flushed based on a timer? Perhaps worth a code comment.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 162 at r2 (raw file):

				// had a cache of aborted transaction IDs that allowed us to notice
				// and quickly resolve abandoned intents then we might be able to
				// get rid of this state.

We now have a cache. But don't we still need the short timer to quickly add the aborted txn to the cache? If so, probably should reword this comment.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 196 at r2 (raw file):

				// This allows us to accumulate a group of intents to resolve
				// and send them together as a batch.
				if livenessPush {

I think it is worth commenting again here that if the lock is held there will be at least one waiter (the distinguished waiter) that will have livenessPush=true, so will do this work that will benefit all waiters.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 207 at r2 (raw file):

						// the future (before we observe the corresponding MVCC state).
						// This is safe because we already handle cases where locks
						// exist only in the MVCC keyspace and not in the lockTable.

This is justifying why this is correct for replicated locks. How about adding something like:
For unreplicated locks, the lockTable is the source of truth, and removing the locks held by an already committed/aborted transaction does not affect correctness.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 592 at r2 (raw file):

type finalizedTxnCache struct {
	mu   syncutil.Mutex
	txns [8]*roachpb.Transaction // [MRU, ..., LRU]

How big wrt memory footprint are these Transaction objects? I am trying to understand the reason for the choice of 8.

Contributor

@irfansharif irfansharif left a comment


LGTM as someone looking at this for the first time.

Reviewed 9 of 10 files at r1, 8 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 344 at r1 (raw file):

	} else {
		// Disable all queues - the concurrency manager will no longer be
		// informed about all state transitions to locks and transactions.

Comment here could be updated to say we're also clearing out the cache, and why (also, unclear to me why we need this).


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 96 at r1 (raw file):

	// For now, we don't do this because we don't share any state between
	// separate concurrency.Manager instances.
	knownTxns finalizedTxnCache

[nit] I think knownTxns is a bit vague, but I don't have a better suggestion. Perhaps just finalizedTxnCache? The type itself is a generic txn LRU cache, and the comment block above the type definition is better suited here, I think.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 119 at r1 (raw file):

	ResolveIntent(context.Context, roachpb.LockUpdate, intentresolver.ResolveOptions) *Error

	// ResolveIntent synchronously resolves the provided batch of intents.

s/ResolveIntent/ResolveIntents

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschotenn/lockTableWaiterCache branch from 639ecb8 to 9d39d20 Compare May 23, 2020 05:42
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/kv/kvserver/concurrency/concurrency_control.go, line 636 at r2 (raw file):

Is this also for the situation where the lockTable is disabled? Could you add a code comment?

Done.

And is this just to recover memory -- a stale cache of committed/aborted transactions is presumably harmless for correctness?

Yes, that's correct. A stale cache entry is harmless, it just consumes memory and can also interact with tests.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 344 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Comment here could be updated to say we're also clearing out the cache, and why (also, unclear to me why we need this).

Done.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 96 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] I think knownTxns is a bit vague, but I don't have a better suggestion. Perhaps just finalizedTxnCache? The type itself is a generic txn LRU cache, and the comment block above the type definition is better suited here, I think.

Good point! Done.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 119 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/ResolveIntent/ResolveIntents

Done.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 136 at r2 (raw file):

Previously, sumeerbhola wrote…

If there are N cache hits we would build up N intents in this slice and then a miss would cause this request to wait without doing the resolution. Should the deferred batch be flushed based on a timer? Perhaps worth a code comment.

I added a comment. I don't think there's a strong need to flush deferred intent resolution on a timer. There is a subtle interaction here in cases where the request that deferred the intent resolution does not acquire a reservation on the key, but flushing wouldn't address this anyway.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 162 at r2 (raw file):

Previously, sumeerbhola wrote…

We now have a cache. But don't we still need the short timer to quickly add the aborted txn to the cache? If so, probably should reword this comment.

Done. I just removed it because I don't think the cache is effective enough at avoiding redundant pushes to warrant risking multiple 100ms delays. Right now, it's not even shared between ranges. Even if it was, there's no way to share it between nodes. We really don't want to slow down the initial push if we have scenarios where this delay can compound.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 196 at r2 (raw file):

Previously, sumeerbhola wrote…

I think it is worth commenting again here that if the lock is held there will be at least one waiter (the distinguished waiter) that will have livenessPush=true, so will do this work that will benefit all waiters.

Done.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 207 at r2 (raw file):

Previously, sumeerbhola wrote…

This is justifying why this is correct for replicated locks. How about adding something like:
For unreplicated locks, the lockTable is the source of truth, and removing the locks held by an already committed/aborted transaction does not affect correctness.

Done.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 592 at r2 (raw file):

Previously, sumeerbhola wrote…

How big wrt memory footprint are these Transaction objects? I am trying to understand the reason for the choice of 8.

They are 288 bytes large, so 8 puts us just over 2KB per leaseholder range. Even on large nodes, we tend to top out at about 10k leaseholders per node, so that puts the maximum size of this cache across an entire process at ~20MB, which seems fine for now. There's a TODO to move this to a per-store cache, which would allow us to increase its size significantly and remove the num_leaseholder factor from the size.
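
As a quick back-of-the-envelope check of those figures (illustrative only; the 288-byte size and the 8-entry capacity are taken from the discussion above, and the 10k leaseholder count is the rough per-node ceiling mentioned there):

```go
package main

import "fmt"

func main() {
	const txnSizeBytes = 288 // approximate size of a cached Transaction, per the comment above
	const cacheEntries = 8   // capacity of the per-Range cache
	const leaseholders = 10000

	perRange := txnSizeBytes * cacheEntries
	perNode := perRange * leaseholders
	// Prints 2304 bytes (just over 2 KB) per range and roughly 22 MiB per
	// node, matching the "just over 2KB" and "~20MB" figures quoted above.
	fmt.Printf("per range: %d bytes (~%.2f KB)\n", perRange, float64(perRange)/1024)
	fmt.Printf("per node: %d bytes (~%.0f MiB)\n", perNode, float64(perNode)/(1<<20))
}
```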

Contributor

@irfansharif irfansharif left a comment


Reviewed 10 of 10 files at r3, 8 of 8 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented May 26, 2020

Build failed (retrying...)

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 4 of 10 files at r1, 2 of 10 files at r3, 8 of 8 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/clear_abandoned_intents, line 69 at r4 (raw file):

[3] sequence req1: pushing timestamp of txn 00000002 above 0.000000010,1
[3] sequence req1: blocked on select in concurrency_test.(*cluster).PushTransaction

Could you add another debug-lock-table here and add a comment that since txn1 is the distinguished waiter for "a" it will try to push and that is the first intent resolution of "a" below (due to the abort of txn2), and then the cache kicks in to cause a batch resolution of the others.

@nvanbenschoten
Member Author

bors r-

@craig
Contributor

craig bot commented May 26, 2020

Canceled

These structs are 288 bytes large - a little too large to copy around
unnecessarily when we already have pointers to their original, immutable
instance on the heap.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschotenn/lockTableWaiterCache branch from 9d39d20 to 3420bf8 Compare May 26, 2020 18:52
Member Author

@nvanbenschoten nvanbenschoten left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @sumeerbhola)


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/clear_abandoned_intents, line 69 at r4 (raw file):

Previously, sumeerbhola wrote…

Could you add another debug-lock-table here and add a comment that since txn1 is the distinguished waiter for "a" it will try to push and that is the first intent resolution of "a" below (due to the abort of txn2), and then the cache kicks in to cause a batch resolution of the others.

Done.

@craig
Contributor

craig bot commented May 26, 2020

Build succeeded

@craig craig bot merged commit 6f92e58 into cockroachdb:master May 26, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschotenn/lockTableWaiterCache branch May 27, 2020 19:55
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 12, 2020
kv/concurrency: increase kv.lock_table.coordinator_liveness_push_delay to 50ms

This change increases the default value for
`kv.lock_table.coordinator_liveness_push_delay`. Increasing this delay
is a big win for contended workloads because it allows transactions to
queue locally to conflicting locks (in the lockTable's lockWaitQueue)
instead of remotely next to the conflicting lock holders' txn records
(in the txnWaitQueue) in more cases.

This has a big effect on high contention / high concurrency workloads.
As we'll see in the following experiment, YCSB workload A would change
regimes somewhere between 200 and 256 concurrent threads. Operations
would slow down just enough that waiting txns would start performing
liveness pushes and become much less effective at queuing and responding
to transaction commits. This would cause the rest of the operations to
slow down and suddenly everyone was pushing and the front of every
lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they
overlap) shows YCSB workload A run with 256 threads and a 10ms liveness
push delay. We expect ~36k qps, but we're only sustaining between 5-15k
qps. When we bump this delay to 50ms on the right, we see much higher
throughput, stabilizing at ~36k. The workload is able to stay in the
efficient regime (no liveness pushes, listening directly to lock state
transitions) for far longer. I've tested up to 512 concurrent workers on
the same workload and never seen us enter the slow regime.

<todo throughput graph>

<todo pushes graph>

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node. This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in cockroachdb#49218) gets us part of the way
here. It allows us to pay the liveness push delay cost once per
abandoned transaction per range instead of once per each of an abandoned
transaction's locks. This helped us to feel comfortable increasing the
default delay from the original 10ms to the current 50ms. Still, to feel
comfortable increasing this further, we'd want to improve this cache
(e.g. lifting it to the store level) to reduce the cost to once per
abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run
the same tests that we ran in cockroachdb#49218:
```
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.574853s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 33.155231s

--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.547465s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are
now delayed by 50ms, up from 10ms. This allows high contention workloads
to sustain high throughput up to much larger concurrency levels.
craig bot pushed a commit that referenced this pull request Jun 15, 2020
48306: sql: implement JSON{,B}_{OBJECT,}_AGG r=yuzefovich a=C0rWin

Fixes #19978

This PR adds a built-in aggregate function that aggregates a query result set into a JSON object.

It implements the json{,b}_object_agg built-in functions by adding a new aggregate construct that keeps track of memory usage, and wires the built-ins up so that queries can aggregate name/value pairs into a JSON object.

Release note (sql change): enable aggregate queries that collect key/value pairs into a JSON object.

50133: opt: refactor and test the JoinMultiplicity library r=DrewKimball a=DrewKimball

Previously, the logic in the JoinMultiplicity library was complicated
and untested. Multiple bugs have been both found and introduced
during refactoring of the original JoinFiltersMatchAllLeftRows logic.

This patch refactors the code to make it easier to understand and
introduces unit testing for the library.
This patch also makes the following changes to the behavior of
multiplicity_builder:
1. Fixed an issue that caused filtersMatchAllLeftRows to return false
   positives.
2. Foreign key columns only invalidate a foreign key if they were
   nullable in the base table. Since we derive the foreign key
   from a left not-null column in the first place, we know that
   columns from this table haven't been null-extended, so a not-null
   column will stay not-null. This is useful when foreign key
   columns are not in the output set.
3. All columns from a table (including non-output columns) are
   added to the UnfilteredCols field. This allows foreign key
   relations to be validated even when the foreign key columns are
   not in the output set.

Fixes #50059

Release note: None

50161: kv/concurrency: increase kv.lock_table.coordinator_liveness_push_delay to 50ms r=nvanbenschoten a=nvanbenschoten

This change increases the default value for `kv.lock_table.coordinator_liveness_push_delay`. Increasing this delay is a big win for contended workloads because it allows transactions to queue locally to conflicting locks (in the lockTable's lockWaitQueue) instead of remotely next to the conflicting lock holders' txn records (in the txnWaitQueue) in more cases.

This has a big effect on high contention / high concurrency workloads. As we'll see in the following experiment, YCSB workload A would change regimes somewhere between 200 and 256 concurrent threads. Operations would slow down just enough that waiting txns would start performing liveness pushes and become much less effective at queuing and responding to transaction commits. This would cause the rest of the operations to slow down and suddenly everyone was pushing and the front of every lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they overlap) shows YCSB workload A run with 256 threads and a 10ms liveness push delay. We expect ~36k qps, but we're only sustaining between 5-15k qps. When we bump this delay to 50ms on the right, we see much higher throughput, stabilizing at ~36k. The workload is able to stay in the efficient regime (no liveness pushes, listening directly to lock state transitions) for far longer. I've tested up to 512 concurrent workers on the same workload and never seen us enter the slow regime.

<img width="978" alt="Screen Shot 2020-06-12 at 5 08 30 PM" src="https://user-images.githubusercontent.com/5438456/84550421-886d4f80-acd8-11ea-9982-2f2f7a3a4d6b.png">

<img width="1057" alt="Screen Shot 2020-06-12 at 5 18 27 PM" src="https://user-images.githubusercontent.com/5438456/84550428-8c996d00-acd8-11ea-8045-d2db4357e9da.png">

_10ms delay on left, 50ms delay on right_

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node. This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in #49218) gets us part of the way here. It allows us to pay the liveness push delay cost once per abandoned transaction per range instead of once per each of an abandoned transaction's locks. This helped us to feel comfortable increasing the default delay from the original 10ms to the current 50ms. Still, to feel comfortable increasing this further, we'd want to improve this cache (e.g. lifting it to the store level) to reduce the cost to once per abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run the same tests that we ran in #49218:
```
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.574853s


CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 33.155231s



--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.547465s


CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are now delayed by 50ms, up from 10ms. This allows high contention workloads to sustain high throughput up to much larger concurrency levels.

50245: ui: remove compactor queue graphs r=jbowens a=jbowens

Remove the compactor queue graphs from the Admin UI. They were commonly
confused with storage-engine level compactions, and the compactor queue
itself will (after future work) no longer be used for Pebble
storage engines at all.

I left the underlying metrics, assuming they'll still be useful for
diagnosing problems on nodes running with RocksDB.

Release note (admin ui change): The 'Queues' dashboard in the Admin UI
no longer shows a 'Compaction Queue' graph and the 'Queue Processing
Failures' and 'Queue Processing Times' graphs no longer include the
'Compaction' queue metrics because these were commonly confused.

Co-authored-by: Artem Barger <bartem@il.ibm.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 14, 2023
kv: resolve conflicting intents immediately for latency-sensitive requests

Fixes cockroachdb#50390.
Related to cockroachdb#60585.
Related to cockroachdb#103126.

Closes cockroachdb#64500, which was an earlier attempt to solve this issue using a similar
approach. This commit addresses the comments on that PR, which focused on the
pagination of intent resolution when bypassing the request batcher. We still
don't try to solve this issue, and instead limit the cases where intent
resolution bypasses the request batcher to those where pagination is not
necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver`
`ResolveOptions`, which instructs the `IntentResolver` to send the provided
intent resolution requests immediately, instead of adding them to a batch and
waiting up to 10ms (defaultIntentResolutionBatchWait) for that batch to fill up
with other intent resolution requests. This can be used to avoid any
batching-induced latency and should be used only by foreground traffic that is
willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated
by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are
performed by background tasks that are more than happy to trade increased
latency for increased throughput, the `lockTableWaiter` calls into the
`IntentResolver` on behalf of a foreground operation. It wants intent resolution
to complete as soon as possible, so it is willing to trade reduced throughput
for reduced latency.
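
To make the shape of this option concrete, here is a hypothetical, self-contained Go sketch of a resolver that either sends a request right away or hands it to a batching layer, depending on a send-immediately flag. The names (`resolveOptions`, `intentResolver`, `flush`) are invented for this example and are not the actual `IntentResolver` API.

```go
package main

import "fmt"

// resolveOptions mirrors the idea of an options struct with a flag that lets
// latency-sensitive callers bypass batching.
type resolveOptions struct {
	sendImmediately bool
}

// intentResolver queues work for a batcher unless asked to send immediately.
type intentResolver struct {
	pending []string // keys waiting for the next batched resolution
}

func (ir *intentResolver) resolveIntent(key string, opts resolveOptions) {
	if opts.sendImmediately {
		// Foreground path: pay one request now instead of waiting for the
		// batcher's idle timeout to fire.
		fmt.Println("resolving immediately:", key)
		return
	}
	// Background path: accumulate and resolve later as one batch.
	ir.pending = append(ir.pending, key)
}

// flush models the batcher firing after its idle timeout.
func (ir *intentResolver) flush() {
	if len(ir.pending) > 0 {
		fmt.Printf("resolving %d intents in one batch\n", len(ir.pending))
		ir.pending = nil
	}
}

func main() {
	ir := &intentResolver{}
	ir.resolveIntent("a", resolveOptions{sendImmediately: true}) // latency-sensitive waiter
	ir.resolveIntent("b", resolveOptions{})                      // background cleanup
	ir.resolveIntent("c", resolveOptions{})
	ir.flush() // one batched request for b and c
}
```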

I tested this out by writing 10,000 different intents in a normal-priority transaction
and then scanning over the table using a high-priority transaction. The test was
performed on a 3-node roachprod cluster to demonstrate the effect with realistic
network and disk latencies.
```
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;
```

Without this change, the scan took 70.078s. With this change, the scan took
15.958s. This 78% speed up checks out. Each encountered intent is resolved
serially (see cockroachdb#103126), so the per-intent latency drops from 7ms to 1.6ms. This
improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`,
which delays each resolution request that passes through a RequestBatcher. With
this change, these resolve intent requests are issued immediately and this delay
is not experienced.

While this is a significant improvement and will be important for Read Committed
transactions after cockroachdb#102014, this is still not quite enough to resolve cockroachdb#103126.
For that, we need to batch the resolve intent requests themselves using a form
of deferred intent resolution like we added in cockroachdb#49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents
from many different transactions. This scenario bypasses the deferred intent
resolution path added in cockroachdb#49218, because the intents are each written by
different transactions. The speedup for this scenario was presented in cockroachdb#64500.

Release note (performance improvement): SQL statements that must clean up
intents from many different previously abandoned transactions now do so
moderately more efficiently.
craig bot pushed a commit that referenced this pull request May 18, 2023
103265: kv: resolve conflicting intents immediately for latency-sensitive requests r=nvanbenschoten a=nvanbenschoten

Fixes #50390.
Related to #60585.
Related to #103126.

Closes #64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.
```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;
```

Without this change, the scan took **70.078s**. With this change, the scan took **15.958s**. This **78%** speed-up checks out. Each encountered intent is resolved serially (see #103126), so the **per-intent latency** drops from **7ms** to **1.6ms.** This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after #102014, this is still not quite enough to resolve #103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in #49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in #49218, because the intents are each written by different transactions. The speedup for this scenario was presented in #64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.


103362: sql: validate primary / secondary region localities at end of txn r=fqazi a=fqazi

Previously, if a database was restored with skip_localities, there was no way to modify this database to set the primary region since validation would kick in too early during the statement. This meant fixing the regions in a restored database was impossible if the primary region was no longer valid. To address this, this patch, delays locality validation till the end of the transaction.

Fixes: #103290

Release note (bug fix): SET PRIMARY REGION and SET SECONDARY REGION did not validate transactionally, which could prevent cleaning up removed regions after a restore.

103373: concurrency: small refactors in preparation for reservation removal r=nvanbenschoten a=arulajmani

See individual commits for details. 

Informs: #103361

103538: go.mod: bump Pebble to 6f2788660198, rework shared storage wrapper r=RaduBerinde a=RaduBerinde

6f278866 shared: improve interface for more efficient reading
9eb2c407 db: log events to testing.T in unit tests
f32e7dc6 db: add reserved Pebblev4 sstable format
5a6b91b8 objstorage: improve test and add read ahead test
2bc4319e objstorage: remove genericFileReadable
8143ffb9 objstorage: fix readaheadState initialization
06d08888 db: add Compact.Duration metric
e7213de0 db: add Uptime metric
e9005aed db: don't delete files during ingest application
222b43ec internal/arenaskl: fix Skiplist doc typo

Release note: None
Epic: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 13, 2023
Fixes cockroachdb#103126.

This commit extends the infrastructure introduced in cockroachdb#49218 for transaction
timestamp pushes. It avoids redundant txn pushes of PENDING transactions and
batches the resolution of PENDING intents. This breaks the O(num_intents) work
performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into
something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of
transactions that are PENDING and are known to have been pushed to higher
timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp
   that is held by a pushed txn above our read timestamp, we neither wait out the
   kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the
   transaction's record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have written multiple intents, so we begin deferring intent resolution
   to enable batching.

Together, these two changes make us much more effective at pushing transactions
with a large number of intents. The following example (from cockroachdb#103126) demonstrates
this:
```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before cockroachdb#103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution
is only currently enabled for non-locking readers without uncertainty
intervals. Readers with uncertainty intervals must contend with the
possibility of pushing a conflicting intent up into their uncertainty
interval and causing more work for themselves, which is avoided with care
by the lockTableWaiter but difficult to coordinate through the
txnStatusCache. This limitation is acceptable because the most important
case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction,
which is that non-locking readers can ignore known pending intents without
the need to even resolve those intents (see cockroachdb#94730). This will require a
request-scoped cache of pending, pushed transactions, which does not have
the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work
proportional to the number of pending intents that they encounter, so they are
over 100x faster when encountering long-running, bulk writing transactions.
craig bot pushed a commit that referenced this pull request Jun 20, 2023
104784: kv/concurrency: batch intent resolution of pushed intents from same txn r=arulajmani a=nvanbenschoten

Fixes #103126.

This commit extends the infrastructure introduced in #49218 for transaction timestamp pushes. It avoids redundant txn pushes of PENDING transactions and batches the resolution of PENDING intents. This breaks the O(num_intents) work performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of transactions that are PENDING and are known to have been pushed to higher timestamps. We use this cache for two purposes:

1. when we are a non-locking read and we see a lock at a conflicting timestamp who is held by a pushed txn above our read timestamp, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the transactions record (RPC to leaseholder of pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that it may have written multiple intents, so we begin deferring intent resolution to enable batching.

Together, these two changes make us much more effective at pushing transactions with a large number of intents. The following example (from #103126) demonstrates this:
```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before #103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution is only currently enabled for non-locking readers without uncertainty intervals. Readers with uncertainty intervals must contend with the possibility of pushing a conflicting intent up into their uncertainty interval and causing more work for themselves, which is avoided with care by the lockTableWaiter but difficult to coordinate through the txnStatusCache. This limitation is acceptable because the most important case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction, which is that non-locking readers can ignore known pending intents without the need to even resolve those intents (see #94730). This will require a request-scoped cache of pending, pushed transactions, which does not have the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work proportional to the number of pending intents that they encounter, so they are over 100x faster when encountering long-running, bulk writing transactions.

Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Successfully merging this pull request may close these issues.

sql: deleting rows from table A hangs after failed insert transaction to related table B