[DNM] storage/kv: remember interacting transaction dispositions #31664

Conversation

nvanbenschoten
Member

Informs #22349.

This is inspired by the trace in #18684 (comment).

The change adds new state to `roachpb.Transaction`, allowing it to remember
the authoritative dispositions of a limited number of ABORTED and COMMITTED
transactions. The transaction can use this new knowledge for two purposes:

  1. when a write runs into a conflicting intent for a transaction that it
    knows about, it can immediately resolve the intent and write its new
    intent, all in the same WriteBatch and Raft proposal.
  2. when a scan runs into an intent for a transaction it knows about, it
    still throws a WriteIntentError, but the intentResolver doesn't need
    to push the transaction before resolving the intents.
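
For illustration, here is a minimal sketch of the kind of bounded disposition
memory this describes. The names (`KnownTxns`, `Record`, `Disposition`, the
limit of 8) are hypothetical, not the actual fields this change adds to
`roachpb.Transaction`:

```go
// Hypothetical sketch of per-transaction "disposition memory"; not actual
// CockroachDB code or the real roachpb.Transaction fields.
package txnsketch

// TxnID stands in for a transaction's UUID.
type TxnID string

// TxnStatus mirrors the two finalized statuses worth remembering.
type TxnStatus int

const (
	Aborted TxnStatus = iota
	Committed
)

// knownTxnLimit bounds how much extra state piggybacks on the transaction proto.
const knownTxnLimit = 8

// KnownTxns remembers the authoritative dispositions of a limited number of
// other transactions that this transaction has interacted with. It would ride
// along with the transaction and be returned to the client on every response.
type KnownTxns struct {
	order    []TxnID // insertion order, oldest first
	statuses map[TxnID]TxnStatus
}

// Record stores another transaction's disposition, evicting the oldest entry
// once the limit is reached.
func (k *KnownTxns) Record(id TxnID, status TxnStatus) {
	if k.statuses == nil {
		k.statuses = make(map[TxnID]TxnStatus, knownTxnLimit)
	}
	if _, ok := k.statuses[id]; !ok {
		if len(k.order) == knownTxnLimit {
			oldest := k.order[0]
			k.order = k.order[1:]
			delete(k.statuses, oldest)
		}
		k.order = append(k.order, id)
	}
	k.statuses[id] = status
}

// Disposition reports whether this transaction already knows the fate of id.
// A write hitting id's intent could then resolve it in the same WriteBatch,
// and a scan could skip the PushTxn before resolving.
func (k *KnownTxns) Disposition(id TxnID) (TxnStatus, bool) {
	s, ok := k.statuses[id]
	return s, ok
}
```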

Transactions use this local "memory" to remember the state of intents
that they have run into in the past. This change is founded on the conjecture
that transactions which contend at one location have a high probability of
contending in other locations. The reasoning for this is that clients
typically have a fixed set of queries they run, each of which takes a set
of parameters. If one or more of the parameters line up for the same
transaction type between two transactions, one or more of their statements
will touch the same rows.

It is therefore beneficial for transactions to carry around some memory
about their related transactions so that they can optimize for interactions
with them. For instance, a transaction A that pushes a transaction B after
seeing one of B's intents and finds that B is ABORTED should not need to
push transaction B again to know that it can clean up any other intents
that B has abandoned.


To test this, I ran `kv0 --batch=100 --splits=5` for 10 seconds with async
intent resolution disabled and a 1ms delay added to `PushTxnRequest`s to
simulate transaction records living on different nodes than their external
intents. I then ran the same command again with the exact same write sequence.
The effect of this is that the second run hit all of the intents abandoned by
the first run and had to clean them up as it went. This is the situation
we saw in the linked trace. Here are the results:

First run:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           4080          407.8     19.5     17.8     33.6     75.5    142.6
```

Second run without this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0            454           45.4    173.2    167.8    260.0    302.0    352.3
```

Second run with this change:
```
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   10.0s        0           2383          238.2     33.5     30.4     60.8    104.9    209.7
```

Remembering the status of related transactions allows us to improve throughput
in this degenerate case by 425% (45.4 → 238.2 ops/sec) and reduce average
latency by 81% (173.2 ms → 33.5 ms). Of course, this is the best-case
scenario for this change.


I don't necessarily think we even need to implement something like this, but we should
start thinking about this problem. For instance, another alternative would be to introduce
an in-memory LRU cache on each `Store` that held similar information. This
has a number of trade-offs compared to the state being local to transactions.

cc. @spencerkimball @tschottdorf

Release note: None

@nvanbenschoten nvanbenschoten requested a review from a team October 20, 2018 03:54
@cockroach-teamcity
Member

This change is Reviewable

@bdarnell
Contributor

I think it's probably worth doing something like this. My instinct is that it would be better to try an LRU cache on the Store instead of adding state to the Transaction, mainly because I suspect that a lot of workloads that run into this problem are not using transactional retries correctly, so they'd benefit from a store-level cache but not a transaction-level one. The Transaction also has more complex concurrency and persistence semantics (I don't think there would be any actual problem here, but better not to pile on more complexity).

@nvanbenschoten
Member Author

Yeah, there's a trade-off here between making this cache transaction scoped or making it store scoped.

The benefits of the txn-scoped cache approach are:

  • the cache will travel around with the transaction, so once a transaction sees the garbage of another one, it never needs to look it up again. This is true even if the transaction sees the garbage of another one on a new node
  • the cache is nicely accessible in MVCCPut, allowing us to trivially resolve intents for other txns whenever possible. We can even cleanly resolve intents in the same WriteBatch as a new intent without threading a new dependency into mvcc code
  • the cache has a simple eviction policy and a simple lifetime
  • there are no synchronization concerns to access the cache - no locking, etc.

The benefits of a store-level cache are:

  • multiple transactions can have access to the same information, which means that it will probably save space. It also means that transactions may never need to look up the state of an interfering transaction in the first place because a different transaction already looked the information up
  • it doesn't increase the size of the transaction object itself
  • the implementation is trivial. We could just hang an LRU cache off of the intentResolver

I'm still weighing these tradeoffs. It does seem like they target different workloads, and I'm not sure the two solutions are even mutually exclusive. The txn-scoped cache is good for cases where there is a lot of overlap between small groups of transactions. The store-level cache is good for cases where there is less overlap between larger groups of transactions.
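
To make the store-scoped option concrete, here is a rough, mutex-guarded LRU
sketch. The type and method names (`finalizedTxnCache`, `Add`, `Lookup`) are
hypothetical rather than the intentResolver's actual API, and the locking is
exactly the synchronization concern that the txn-scoped variant avoids:

```go
// Rough sketch of the store-scoped alternative: a small mutex-guarded LRU of
// finalized transactions shared by every request on the Store. Hypothetical
// names; not CockroachDB's actual intentResolver code.
package storesketch

import (
	"container/list"
	"sync"
)

type TxnID string

type TxnStatus int

const (
	Aborted TxnStatus = iota
	Committed
)

type cacheEntry struct {
	id     TxnID
	status TxnStatus
}

// finalizedTxnCache is a hypothetical LRU that could hang off the
// intentResolver on each Store.
type finalizedTxnCache struct {
	mu    sync.Mutex
	limit int
	order *list.List              // front = most recently used
	byID  map[TxnID]*list.Element // element values are *cacheEntry
}

func newFinalizedTxnCache(limit int) *finalizedTxnCache {
	return &finalizedTxnCache{
		limit: limit,
		order: list.New(),
		byID:  make(map[TxnID]*list.Element),
	}
}

// Add records a transaction that some request just learned is finalized, so
// that later requests on this Store never need to push it again.
func (c *finalizedTxnCache) Add(id TxnID, status TxnStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.byID[id]; ok {
		el.Value.(*cacheEntry).status = status
		c.order.MoveToFront(el)
		return
	}
	c.byID[id] = c.order.PushFront(&cacheEntry{id: id, status: status})
	if c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.byID, oldest.Value.(*cacheEntry).id)
	}
}

// Lookup is consulted before pushing a conflicting transaction.
func (c *finalizedTxnCache) Lookup(id TxnID) (TxnStatus, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.byID[id]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*cacheEntry).status, true
}
```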

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 19, 2020
Fixes cockroachdb#48790.
Informs cockroachdb#36876.
Closes cockroachdb#31664.

This commit adds a per-Range LRU cache of transactions that are known to
be aborted or committed. We use this cache in the lockTableWaiter for
two purposes:
1. when we see a lock held by a known-finalized txn, we neither wait out
   the `kv.lock_table.coordinator_liveness_push_delay` (10 ms) nor push the
   transaction's record (RPC to the leaseholder of the pushee's txn record range).
2. we use the existence of a transaction in the cache as an indication that
   it may have abandoned multiple intents, perhaps due to a failure of the
   transaction coordinator node, so we begin deferring intent resolution to
   enable batching.
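
(Purely illustrative: a simplified sketch of the flow described by the two
points above, with hypothetical names rather than the real lockTableWaiter
code.)

```go
// Hypothetical sketch of how a waiter could consult a per-Range cache of
// finalized transactions; not the real lockTableWaiter implementation.
package waitersketch

import "time"

type TxnID string

type TxnStatus int

const (
	Pending TxnStatus = iota
	Aborted
	Committed
)

// lock stands in for a discovered lock/intent held by another transaction.
type lock struct {
	key    string
	holder TxnID
}

// finalizedCache is the per-Range LRU of finalized transactions.
type finalizedCache interface {
	Lookup(TxnID) (TxnStatus, bool)
}

type waiter struct {
	cache             finalizedCache
	livenessPushDelay time.Duration // kv.lock_table.coordinator_liveness_push_delay
	deferred          []lock        // intents whose resolution is batched
}

// onConflict is invoked when a request runs into a lock held by another txn.
func (w *waiter) onConflict(l lock) {
	if status, ok := w.cache.Lookup(l.holder); ok && status != Pending {
		// Purpose 1: the holder is known-finalized, so skip both the
		// liveness push delay and the PushTxn RPC.
		// Purpose 2: a finalized holder may have abandoned many intents,
		// so defer resolution and flush the accumulated batch later.
		w.deferred = append(w.deferred, l)
		return
	}
	// Normal path: wait out the liveness push delay, then push the holder's
	// transaction record and resolve the intent individually.
	time.Sleep(w.livenessPushDelay)
	w.pushAndResolve(l)
}

// pushAndResolve stands in for the PushTxn RPC plus single intent resolution.
func (w *waiter) pushAndResolve(l lock) {}
```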

Together, these two changes make us much more effective at cleaning up
after failed transactions that have abandoned a large number of intents.
The following example demonstrates this:
```sql
--- BEFORE

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 2m50.801304266s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 3m26.874571045s

--- AFTER

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.138220753s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 48.763541138s
```

Notice that we are still not as fast at cleaning up intents on the
insertion path as we are at doing so on the retrieval path. This is
because we only batch the resolution of intents observed by a single
request at a time. For the scanning case, a single ScanRequest notices
all 10,000 intents and cleans them all up together. For the insertion
case, each of the 10,000 PutRequests notices a single intent, and each
intent is cleaned up individually. So this case only benefits from
the first part of this change (no liveness delay or txn record push)
and not from the second part (intent resolution batching).

For this reason, we still haven't solved all of cockroachdb#36876. To completely
address that, we'll need to defer propagation of WriteIntentError during
batch evaluation, like we do for WriteTooOldErrors. Or we can wait out
the future LockTable changes - once we remove all cases where an intent
is not "discovered", the changes here will effectively address cockroachdb#36876.

This was a partial regression in v20.1, so we'll want to backport this
to that release branch. This change is on the larger side, but I feel ok
about it because the mechanics aren't too tricky. I'll wait a week before
backporting just to see if anything falls out.

Release note (bug fix): Abandoned intents due to failed transaction
coordinators are now cleaned up much faster. This resolves a regression
in v20.1.0 compared to prior releases.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 23, 2020
craig bot pushed a commit that referenced this pull request May 26, 2020
49218: kv/concurrency: avoid redundant txn pushes and batch intent resolution r=nvanbenschoten a=nvanbenschoten

@irfansharif I'm adding you as a reviewer because there's not really anyone else on KV that knows this code, so we should change that.

49557: kvserver: remove migration to remove preemptive snapshots r=nvanbenschoten a=ajwerner

This migration ran in 20.1 to remove pre-emptive snapshots which may have
existed from before 19.2 was finalized. This migration is no longer relevant.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig craig bot closed this in 67c6bdb May 26, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/knownTxns branch May 27, 2020 19:55
jbowens pushed a commit to jbowens/cockroach that referenced this pull request Jun 1, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 3, 2020