
kv: make long running transactions less expensive (improve txn heartbeats) #45013

Open
ajwerner opened this issue Feb 12, 2020 · 2 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team

@ajwerner
Contributor

ajwerner commented Feb 12, 2020

Is your feature request related to a problem? Please describe.

Currently the txnHeartbeater periodically heartbeats the transaction record of a transaction. It performs this heartbeating once a second (see here), with an expiration of 5s (see here).

This heartbeating acts as a performance cliff. As soon as transactions take longer than 1s, they each perform an additional write, leading to more load, which increases latency, which in turn causes more transactions to send heartbeats.

Another thing to note is that only long-lived transactions actually heartbeat, but every transaction kicks off a goroutine to attempt to heartbeat (#35009). The reason for this heartbeating is so that when another transaction encounters an intent, it can determine whether the intent's transaction has expired and can be aborted.
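
To make the expiration semantics concrete, here is a minimal, hedged sketch of the staleness check that heartbeating enables. The type and function names are hypothetical, not the actual CockroachDB code; the 1s/5s values are the current defaults mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// txnRecord is a hypothetical, simplified stand-in for a transaction record.
type txnRecord struct {
	lastHeartbeat time.Time
}

const (
	heartbeatInterval = 1 * time.Second // how often the txnHeartbeater writes today
	txnExpiration     = 5 * time.Second // how stale a record may get before it can be aborted
)

// isExpired reports whether a contending transaction that found an intent may
// treat the intent's owner as abandoned, based purely on heartbeat staleness.
func isExpired(rec txnRecord, now time.Time) bool {
	return now.Sub(rec.lastHeartbeat) > txnExpiration
}

func main() {
	rec := txnRecord{lastHeartbeat: time.Now().Add(-6 * time.Second)}
	fmt.Println("expired:", isExpired(rec, time.Now())) // expired: true
}
```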

A key point is that nowhere in the system do we encourage or rely on holding open long-running transactions intentionally. There are good reasons why we might want to use long-running transactions (discussed below in additional context).

@nvanbenschoten I've come around to your perspective on this issue.

Describe the solution you'd like

One approach to making this heartbeating cheaper is to batch the heartbeat requests sent from one node to another. This would likely fix the performance cliff. Say that each node has T heartbeating transactions whose transaction records are held on R ranges. Currently we'll issue T writes per second. With coalescing we'd move to O(R) writes per second (the exact number will depend on the batching settings). R is probably okay in the cases we care about.

The biggest work item in making this change would be to extend the HeartbeatTxnRequest to include the Txn rather than using the Txn off of the BatchRequest.Header. This would allow multiple heartbeats to be coalesced. At that point the existing requestbatcher (https://godoc.org/github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher) could be used to batch the coalesced heartbeats.
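
As a rough illustration of the coalescing math, here is a small sketch assuming a hypothetical map of due heartbeats keyed by the range that holds each transaction record. It is not the requestbatcher API, just the grouping that turns O(T) per-transaction writes into O(R) per-range batches.

```go
package main

import "fmt"

type rangeID int
type txnID string

// coalesce groups the transactions that are due for a heartbeat by the range
// holding their transaction record, so one batched write per range suffices.
func coalesce(due map[txnID]rangeID) map[rangeID][]txnID {
	batches := make(map[rangeID][]txnID)
	for txn, r := range due {
		batches[r] = append(batches[r], txn)
	}
	return batches
}

func main() {
	due := map[txnID]rangeID{"a": 1, "b": 1, "c": 2}
	// Three per-transaction heartbeat writes collapse into two per-range
	// batches: O(R) instead of O(T) writes per heartbeat interval.
	fmt.Println(coalesce(due))
}
```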

Another, somewhat simpler, change would be to make transaction heartbeats configurable. This change complements the above suggestion nicely. The heartbeat interval could be controlled at the cluster level with a cluster setting, and transactions which are intentionally held open for a long time, and for which longer timeouts can be tolerated, could set their expiration even further into the future.
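
A minimal sketch of what configurability might look like, assuming a hypothetical cluster-wide default plus a per-transaction override; neither of these names exists in the codebase.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// clusterHeartbeatInterval is a hypothetical cluster-wide default, stored in
// nanoseconds so it can be changed at runtime the way a cluster setting would.
var clusterHeartbeatInterval atomic.Int64

func init() { clusterHeartbeatInterval.Store(int64(time.Second)) }

// txnOpts carries a hypothetical per-transaction override; 0 means "use the
// cluster default".
type txnOpts struct {
	heartbeatInterval time.Duration
}

// effectiveInterval picks the per-transaction override if set, otherwise the
// cluster-wide value. An intentionally long-running transaction that can
// tolerate a slow abort could opt into a much larger interval (with a
// correspondingly later expiration).
func effectiveInterval(o txnOpts) time.Duration {
	if o.heartbeatInterval > 0 {
		return o.heartbeatInterval
	}
	return time.Duration(clusterHeartbeatInterval.Load())
}

func main() {
	fmt.Println(effectiveInterval(txnOpts{}))                                   // 1s cluster default
	fmt.Println(effectiveInterval(txnOpts{heartbeatInterval: 5 * time.Minute})) // long-lived txn opt-in
}
```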

A further optimization would be to centralize the scheduling of the heartbeats as opposed to spinning off a goroutine per transaction (#35009).
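
A sketch of that centralization, under the assumption of a single hypothetical scheduler goroutine that tracks per-transaction deadlines rather than parking one goroutine per transaction:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scheduler drives heartbeats for all registered transactions from one
// goroutine instead of one goroutine per transaction.
type scheduler struct {
	mu   sync.Mutex
	next map[string]time.Time // txn ID -> next heartbeat deadline
}

func newScheduler() *scheduler {
	return &scheduler{next: make(map[string]time.Time)}
}

func (s *scheduler) register(txn string, interval time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next[txn] = time.Now().Add(interval)
}

// run wakes up periodically and heartbeats whichever transactions are due.
func (s *scheduler) run(tick, interval time.Duration, heartbeat func(txn string)) {
	for range time.Tick(tick) {
		s.mu.Lock()
		now := time.Now()
		for txn, due := range s.next {
			if now.After(due) {
				heartbeat(txn)
				s.next[txn] = now.Add(interval)
			}
		}
		s.mu.Unlock()
	}
}

func main() {
	s := newScheduler()
	s.register("txn-1", 50*time.Millisecond)
	go s.run(10*time.Millisecond, 50*time.Millisecond, func(txn string) {
		fmt.Println("heartbeat", txn)
	})
	time.Sleep(200 * time.Millisecond)
}
```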

Describe alternatives you've considered

Another option is to add a layer of indirection: instead of heartbeating transaction records, each gateway could heartbeat some other record. This would work precisely like node liveness works at the KV layer. The problems with such an approach are twofold: (1) the intent resolution protocol becomes more complicated, and (2) the transactions would not learn that they had been aborted, so we'd need to add another RPC to inform them, further compounding (1).

There is something nice about this approach: it eliminates the poorly handled asynchrony of the heartbeat loop and replaces it with explicit coordination. A further downside, however, is that it moves the reads required to resolve an intent to a third location, away from the transaction record.
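
To make the indirection concrete, here is a hedged sketch of how intent resolution might consult a per-gateway liveness record instead of the transaction record. All of the types here are hypothetical and do not correspond to actual CockroachDB code.

```go
package main

import (
	"fmt"
	"time"
)

// gatewayLiveness is the single record each gateway would heartbeat, much like
// a node liveness record.
type gatewayLiveness struct {
	expiration time.Time
}

// intent points at the gateway that vouches for its transaction's liveness,
// rather than at a heartbeated transaction record.
type intent struct {
	txnID   string
	gateway string
}

// canAbort consults the owning gateway's liveness record: an extra read at a
// third location, but only one heartbeat per gateway instead of one per txn.
func canAbort(in intent, liveness map[string]gatewayLiveness, now time.Time) bool {
	rec, ok := liveness[in.gateway]
	return !ok || now.After(rec.expiration)
}

func main() {
	liveness := map[string]gatewayLiveness{
		"n1": {expiration: time.Now().Add(4 * time.Second)},
	}
	fmt.Println(canAbort(intent{txnID: "a", gateway: "n1"}, liveness, time.Now())) // false: gateway alive
	fmt.Println(canAbort(intent{txnID: "b", gateway: "n2"}, liveness, time.Now())) // true: no live record
}
```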

Lastly, it is probably the case that when there are a large number of long-running transactions, their transaction records land on the same range. If they don't land on the same range and there are a lot of them, then the user is probably holding it wrong and should fix something.

Additional context

Currently we don't ever really run long-running transactions as part of the normal course of business. We do, however, have a variety of different and bespoke leasing systems above the KV layer. These leases take two approaches: they either heartbeat explicitly (SQL schema leases, table descriptor schema change leases) or are epoch-based (job leases). Each of these has its own problems and is super bespoke, storing records and determining status based on logic that interprets fields in protocol buffers. The purpose of these leases is mutual exclusion (locking). It turns out we've invested quite a bit in building distributed locking into our KV layer. We should make the locking in the KV layer work for the higher-level abstractions rather than continuously reinventing the wheel above the KV layer. In this vision of the world, the process of holding a lock is likely to map to holding open a long-running transaction. Hence my desire to make long-running transactions cheaper.

Also of import is the desire to prevent TPC-C runs from falling off a cliff as soon as latencies approach the current 1s heartbeat interval.

This issue would be a really nice starter issue for a new member of the KV team.

Jira issue: CRDB-5182

@ajwerner ajwerner added A-kv-transactions Relating to MVCC and the transactional model. A-kv Anything in KV that doesn't belong in a more specific category. labels Feb 12, 2020
@lunevalex lunevalex added this to To do in KV 20.2 Jul 13, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2020
@lunevalex lunevalex moved this from To do to In progress in KV 20.2 Aug 5, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 7, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 18, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 20, 2020
craig bot pushed a commit that referenced this issue Aug 20, 2020
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865.
Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back online, we began aggressively unquiescing when nodes moved from not-live to live.

This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
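
The decision rule described in the commit message above can be sketched roughly as follows; this is a hedged illustration with hypothetical names, not the actual kvserver types or the real `lagging_quiescence` encoding.

```go
package main

import "fmt"

type nodeID int

// quiescedState is what each replica would remember from the quiescence
// request, including the set of followers that were lagging at quiescence time.
type quiescedState struct {
	laggingFollowers map[nodeID]bool
}

// shouldUnquiesce reports whether a liveness change for node n requires waking
// the range: only lagging followers need to be caught up, so liveness blips on
// up-to-date nodes can be ignored.
func shouldUnquiesce(s quiescedState, n nodeID, becameLive bool) bool {
	return becameLive && s.laggingFollowers[n]
}

func main() {
	s := quiescedState{laggingFollowers: map[nodeID]bool{3: true}}
	fmt.Println(shouldUnquiesce(s, 2, true)) // false: node 2 was up to date when the range quiesced
	fmt.Println(shouldUnquiesce(s, 3, true)) // true: node 3 was lagging and must be caught up
}
```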
@irfansharif
Contributor

@madelineliao worked on her prototype of this before she wrapped up her internship (#52415, #52705). I'll take it over the finish line.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 11, 2020
See cockroachdb#45013. Picking up from where cockroachdb#52705 left off.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Madeline Liao <mliao@cockroachlabs.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 11, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 14, 2020
@nvanbenschoten nvanbenschoten added this to To be considered in KV 21.1 Stability Period via automation Sep 29, 2020
@nvanbenschoten nvanbenschoten moved this from To be considered to NH in KV 21.1 Stability Period Sep 29, 2020
@irfansharif irfansharif moved this from In progress to To do in KV 20.2 Jan 4, 2021
@lunevalex lunevalex removed this from To do in KV 20.2 Jan 5, 2021
@tbg tbg moved this from SH to Unlikely/Nope in KV 21.1 Stability Period Mar 4, 2021
@tbg tbg added this to Incoming in KV via automation Mar 9, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@mwang1026 mwang1026 moved this from Incoming to Prioritized in KV Aug 20, 2021
@lunevalex lunevalex moved this from Prioritized to Quick Wins in KV Nov 2, 2021
@irfansharif irfansharif removed their assignment Nov 12, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Sep 25, 2023
KV automation moved this from Quick Wins to Closed Sep 25, 2023
KV automation moved this from Closed to Incoming Sep 25, 2023