
kv: make long running transactions less expensive (improve txn heartbeats) #45013

Open
ajwerner opened this issue Feb 12, 2020 · 2 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team

@ajwerner
Contributor

ajwerner commented Feb 12, 2020

Is your feature request related to a problem? Please describe.

Currently the txnHeartbeater periodically heartbeats the transaction record of a transaction. It performs this heartbeating once a second (see here), with an expiration of 5s (see here).

This heartbeating acts as a performance cliff. As soon as transactions take longer than 1s, they each perform an additional write, leading to more load, which increases latency, which in turn causes more transactions to send heartbeats.

Another thing to note is that only long-lived transactions actually heartbeat, but every transaction kicks off a goroutine to attempt to heartbeat (#35009). The reason for this heartbeating is so that when another transaction encounters an intent, it can determine whether the intent's transaction has expired and can be aborted.
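
To make the expiration semantics concrete, here is a minimal, hedged sketch of the staleness check that heartbeating enables. The type and function names are hypothetical, not the actual CockroachDB code; the 1s/5s values are the current defaults mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// txnRecord is a hypothetical, simplified stand-in for a transaction record.
type txnRecord struct {
	lastHeartbeat time.Time
}

const (
	heartbeatInterval = 1 * time.Second // how often the txnHeartbeater writes today
	txnExpiration     = 5 * time.Second // how stale a record may get before it can be aborted
)

// isExpired reports whether a contending transaction that found an intent may
// treat the intent's owner as abandoned, based purely on heartbeat staleness.
func isExpired(rec txnRecord, now time.Time) bool {
	return now.Sub(rec.lastHeartbeat) > txnExpiration
}

func main() {
	rec := txnRecord{lastHeartbeat: time.Now().Add(-6 * time.Second)}
	fmt.Println("expired:", isExpired(rec, time.Now())) // expired: true
}
```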

A key point is that nowhere in the system do we encourage or rely on holding open long-running transactions intentionally. There are good reasons why we might want to use long-running transactions (discussed below in additional context).

@nvanbenschoten I've come around to your perspective on this issue.

Describe the solution you'd like

One approach to making this heartbeating cheaper is to batch the heartbeat requests sent from one node to another. This would likely fix the performance cliff. Say that each node has T heartbeating transactions whose transaction records are held on R ranges. Currently we'll issue T writes per second. With coalescing we'd move to O(R) writes per second (the exact number will depend on the batching settings). R is probably okay in the cases we care about.

The biggest work item in making this change would be to extend the HeartbeatTxnRequest to include the Txn rather than using the Txn off of the BatchRequest.Header. This would allow multiple heartbeats to be coalesced. At that point the existing requestbatcher (https://godoc.org/github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher) could be used to batch the coalesced heartbeats.
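
As a rough illustration of the coalescing math, here is a small sketch assuming a hypothetical map of due heartbeats keyed by the range that holds each transaction record. It is not the requestbatcher API, just the grouping that turns O(T) per-transaction writes into O(R) per-range batches.

```go
package main

import "fmt"

type rangeID int
type txnID string

// coalesce groups the transactions that are due for a heartbeat by the range
// holding their transaction record, so one batched write per range suffices.
func coalesce(due map[txnID]rangeID) map[rangeID][]txnID {
	batches := make(map[rangeID][]txnID)
	for txn, r := range due {
		batches[r] = append(batches[r], txn)
	}
	return batches
}

func main() {
	due := map[txnID]rangeID{"a": 1, "b": 1, "c": 2}
	// Three per-transaction heartbeat writes collapse into two per-range
	// batches: O(R) instead of O(T) writes per heartbeat interval.
	fmt.Println(coalesce(due))
}
```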

Another, somewhat simpler, change would be to make transaction heartbeats configurable. This change complements the above suggestion nicely. The heartbeat interval could be controlled at the cluster level with a cluster setting, and transactions which are intentionally held open for a long time, and for which longer timeouts can be tolerated, could set their expiration even further into the future.
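
A minimal sketch of what configurability might look like, assuming a hypothetical cluster-wide default plus a per-transaction override; neither of these names exists in the codebase.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// clusterHeartbeatInterval is a hypothetical cluster-wide default, stored in
// nanoseconds so it can be changed at runtime the way a cluster setting would.
var clusterHeartbeatInterval atomic.Int64

func init() { clusterHeartbeatInterval.Store(int64(time.Second)) }

// txnOpts carries a hypothetical per-transaction override; 0 means "use the
// cluster default".
type txnOpts struct {
	heartbeatInterval time.Duration
}

// effectiveInterval picks the per-transaction override if set, otherwise the
// cluster-wide value. An intentionally long-running transaction that can
// tolerate a slow abort could opt into a much larger interval (with a
// correspondingly later expiration).
func effectiveInterval(o txnOpts) time.Duration {
	if o.heartbeatInterval > 0 {
		return o.heartbeatInterval
	}
	return time.Duration(clusterHeartbeatInterval.Load())
}

func main() {
	fmt.Println(effectiveInterval(txnOpts{}))                                   // 1s cluster default
	fmt.Println(effectiveInterval(txnOpts{heartbeatInterval: 5 * time.Minute})) // long-lived txn opt-in
}
```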

A further optimization would be to centralize the scheduling of the heartbeats as opposed to spinning off a goroutine per transaction (#35009).
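
A sketch of that centralization, under the assumption of a single hypothetical scheduler goroutine that tracks per-transaction deadlines rather than parking one goroutine per transaction:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scheduler drives heartbeats for all registered transactions from one
// goroutine instead of one goroutine per transaction.
type scheduler struct {
	mu   sync.Mutex
	next map[string]time.Time // txn ID -> next heartbeat deadline
}

func newScheduler() *scheduler {
	return &scheduler{next: make(map[string]time.Time)}
}

func (s *scheduler) register(txn string, interval time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next[txn] = time.Now().Add(interval)
}

// run wakes up periodically and heartbeats whichever transactions are due.
func (s *scheduler) run(tick, interval time.Duration, heartbeat func(txn string)) {
	for range time.Tick(tick) {
		s.mu.Lock()
		now := time.Now()
		for txn, due := range s.next {
			if now.After(due) {
				heartbeat(txn)
				s.next[txn] = now.Add(interval)
			}
		}
		s.mu.Unlock()
	}
}

func main() {
	s := newScheduler()
	s.register("txn-1", 50*time.Millisecond)
	go s.run(10*time.Millisecond, 50*time.Millisecond, func(txn string) {
		fmt.Println("heartbeat", txn)
	})
	time.Sleep(200 * time.Millisecond)
}
```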

Describe alternatives you've considered

Another option is to add a layer of indirection: instead of heartbeating transaction records, each gateway could heartbeat some other record. This would work precisely like node liveness works at the KV layer. The problems with such an approach are twofold: (1) the intent resolution protocol becomes more complicated, and (2) the transactions would not learn that they had been aborted, so we'd need to add another RPC to inform them, further compounding (1).

There is something nice about this approach: it eliminates the poorly handled asynchrony of the heartbeat loop and replaces it with explicit coordination. A further downside, however, is that it moves the reads required to resolve an intent to a third location, away from the transaction record.
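
To make the indirection concrete, here is a hedged sketch of how intent resolution might consult a per-gateway liveness record instead of the transaction record. All of the types here are hypothetical and do not correspond to actual CockroachDB code.

```go
package main

import (
	"fmt"
	"time"
)

// gatewayLiveness is the single record each gateway would heartbeat, much like
// a node liveness record.
type gatewayLiveness struct {
	expiration time.Time
}

// intent points at the gateway that vouches for its transaction's liveness,
// rather than at a heartbeated transaction record.
type intent struct {
	txnID   string
	gateway string
}

// canAbort consults the owning gateway's liveness record: an extra read at a
// third location, but only one heartbeat per gateway instead of one per txn.
func canAbort(in intent, liveness map[string]gatewayLiveness, now time.Time) bool {
	rec, ok := liveness[in.gateway]
	return !ok || now.After(rec.expiration)
}

func main() {
	liveness := map[string]gatewayLiveness{
		"n1": {expiration: time.Now().Add(4 * time.Second)},
	}
	fmt.Println(canAbort(intent{txnID: "a", gateway: "n1"}, liveness, time.Now())) // false: gateway alive
	fmt.Println(canAbort(intent{txnID: "b", gateway: "n2"}, liveness, time.Now())) // true: no live record
}
```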

Lastly, it is probably the case that when there are a large number of long-running transactions, their transaction records land on the same range. If they don't land on the same range and there are a lot of them, then the user is probably holding it wrong and should fix something.

Additional context

Currently we don't ever really run long-running transactions as part of the normal course of business. We do, however, have a variety of different and bespoke leasing systems above the KV layer. These leases take two approaches: they either heartbeat explicitly (SQL schema leases, table descriptor schema change leases) or are epoch-based (job leases). Each of these has its own problems and is super bespoke, storing records and determining status based on logic that interprets fields in protocol buffers. The purpose of these leases is mutual exclusion (locking). It turns out we've invested quite a bit in building distributed locking into our KV layer. We should make the locking in the KV layer work for the higher-level abstractions rather than continuously reinventing the wheel above the KV layer. In this vision of the world, the process of holding a lock is likely to map to holding open a long-running transaction. Hence my desire to make long-running transactions cheaper.

Also of import is the desire to prevent TPC-C runs from falling off a cliff as soon as latencies approach the current 1s heartbeat interval.

This issue would be a really nice starter issue for a new member of the KV team.

Jira issue: CRDB-5182

@ajwerner ajwerner added A-kv-transactions Relating to MVCC and the transactional model. A-kv Anything in KV that doesn't belong in a more specific category. labels Feb 12, 2020
@lunevalex lunevalex added this to To do in KV 20.2 Jul 13, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2020
@lunevalex lunevalex moved this from To do to In progress in KV 20.2 Aug 5, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 7, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 18, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 20, 2020
craig bot pushed a commit that referenced this issue Aug 20, 2020
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865.
Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back online, we began aggressively unquiescing when nodes moved from not-live to live.

This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
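
The decision rule described in the commit message above can be sketched roughly as follows; this is a hedged illustration with hypothetical names, not the actual kvserver types or the real `lagging_quiescence` encoding.

```go
package main

import "fmt"

type nodeID int

// quiescedState is what each replica would remember from the quiescence
// request, including the set of followers that were lagging at quiescence time.
type quiescedState struct {
	laggingFollowers map[nodeID]bool
}

// shouldUnquiesce reports whether a liveness change for node n requires waking
// the range: only lagging followers need to be caught up, so liveness blips on
// up-to-date nodes can be ignored.
func shouldUnquiesce(s quiescedState, n nodeID, becameLive bool) bool {
	return becameLive && s.laggingFollowers[n]
}

func main() {
	s := quiescedState{laggingFollowers: map[nodeID]bool{3: true}}
	fmt.Println(shouldUnquiesce(s, 2, true)) // false: node 2 was up to date when the range quiesced
	fmt.Println(shouldUnquiesce(s, 3, true)) // true: node 3 was lagging and must be caught up
}
```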
@irfansharif
Contributor

@madelineliao worked on her prototype of this before she wrapped up her internship (#52415, #52705). I'll take it over the finish line.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 11, 2020
See cockroachdb#45013. Picking up from where cockroachdb#52705 left off.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Madeline Liao <mliao@cockroachlabs.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 11, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 14, 2020
@nvanbenschoten nvanbenschoten added this to To be considered in KV 21.1 Stability Period via automation Sep 29, 2020
@nvanbenschoten nvanbenschoten moved this from To be considered to NH in KV 21.1 Stability Period Sep 29, 2020
@irfansharif irfansharif moved this from In progress to To do in KV 20.2 Jan 4, 2021
@lunevalex lunevalex removed this from To do in KV 20.2 Jan 5, 2021
@tbg tbg moved this from SH to Unlikely/Nope in KV 21.1 Stability Period Mar 4, 2021
@tbg tbg added this to Incoming in KV via automation Mar 9, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@mwang1026 mwang1026 moved this from Incoming to Prioritized in KV Aug 20, 2021
@lunevalex lunevalex moved this from Prioritized to Quick Wins in KV Nov 2, 2021
@irfansharif irfansharif removed their assignment Nov 12, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Sep 25, 2023
KV automation moved this from Quick Wins to Closed Sep 25, 2023
KV automation moved this from Closed to Incoming Sep 25, 2023