
kvserver: move raft.proto to proto3 #51816

Merged: 1 commit merged into cockroachdb:master on Jul 30, 2020

Conversation

nvanbenschoten (Member)

Along with storage_services.proto.

The only message in these files that is written to the Raft log
is ConfChangeContext. The two fields in this message are never
empty and do not depend on the absent-vs-zero distinction, so
there are no migration concerns.

All other message types are serialized above Raft and never
persisted.

The motivations for this change are minor:

1. it moves the last remaining two proto2 files in the entire
   `pkg/kv/...` package tree over to proto3, a tree that already
   contains twelve other proto3 files. This avoids questions about
   proto2 files importing proto3 protos.
2. it shaves 2 bytes off the serialized size of RaftHeartbeat and
   RaftMessageRequest, because the Quiesce field is no longer encoded
   when false, which is almost always the case for RaftHeartbeat and
   is always the case on the wire for RaftMessageRequest (a sketch of
   the wire-level effect follows this list).

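For illustration, here is a minimal, self-contained sketch of where those 2 bytes come from, written against the low-level `protowire` package rather than the actual generated RaftHeartbeat code; the field number 5 is illustrative, not necessarily the real one:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	// proto2: `optional bool quiesce = 5` has explicit presence, so a
	// value set to false is still written as a tag byte plus a varint
	// value byte: 2 bytes total for field numbers 1 through 15.
	var p2 []byte
	p2 = protowire.AppendTag(p2, 5, protowire.VarintType)
	p2 = protowire.AppendVarint(p2, 0) // false
	fmt.Println(len(p2)) // 2

	// proto3: a plain `bool quiesce = 5` has no field presence, so a
	// false value is not encoded at all. These are the 2 bytes shaved
	// off RaftHeartbeat and RaftMessageRequest.
	var p3 []byte
	fmt.Println(len(p3)) // 0
}
```

The same encoding rule is why the absent-vs-zero caveat matters: in proto3 an unset scalar and a zero-valued one are indistinguishable on the wire, which is only safe for ConfChangeContext because its two fields are never empty.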
@nvanbenschoten nvanbenschoten requested a review from tbg July 23, 2020 01:17
@cockroach-teamcity (Member)

This change is Reviewable

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from
not-live to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. Doing
so through a new `lagging_quiescence` set attached to quiescence
requests allows all quiesced replicas to agree on which node liveness
events need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcast a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
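As a rough sketch of that decision rule, with entirely hypothetical names (not CockroachDB's actual types; the real plumbing for the `lagging_quiescence` set lives in the kvserver quiescence protos and replica state):

```go
// Hypothetical sketch of the decision described above; not the actual
// CockroachDB implementation.
package quiesce

// ReplicaID identifies a replica within a Range.
type ReplicaID int

// QuiesceRequest is what a leader broadcasts when quiescing a Range.
// LaggingReplicas socializes which replicas were behind (or dead) at
// quiescence time, so every quiesced replica can later reach the same
// unquiesce decision independently.
type QuiesceRequest struct {
	LaggingReplicas map[ReplicaID]struct{}
}

// ShouldUnquiesce reports whether a node-liveness change for the given
// replica needs to wake the Range: only replicas that were lagging at
// quiescence time matter, so liveness blips of up-to-date replicas are
// ignored.
func ShouldUnquiesce(req QuiesceRequest, becameLive ReplicaID) bool {
	_, wasLagging := req.LaggingReplicas[becameLive]
	return wasLagging
}
```

The point of carrying the set in the quiesce request itself is that every replica can evaluate this predicate locally, so no replica needs to contact the others to decide whether a liveness event can be ignored.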
@nvanbenschoten (Member, Author)

bors r+

craig bot commented Jul 29, 2020

Build failed (retrying...):

@nvanbenschoten (Member, Author)

bors r+

craig bot commented Jul 30, 2020

Build succeeded:

@craig craig bot merged commit 4d29d15 into cockroachdb:master Jul 30, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftProto3 branch August 3, 2020 15:13
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 18, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from
not-live to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. Doing
so through a new `lagging_followers_on_quiesce` set attached to
quiescence requests allows all quiesced replicas to agree on which
node liveness events need to result in the range being unquiesced
and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcast a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it
reduces the set of ranges with a replica on a given dead node that
unquiesce when that node becomes live to only those ranges that had
write activity after the node went down. This is typically only a small
fraction of the total number of ranges on each node (operating on the
usual assumption that in an average large cluster, most data is
write-cold), especially when the outage is short. However, if all ranges
had write activity and subsequently quiesced before the outage resolved,
we would still have to unquiesce all ranges upon the node coming back
up. For this reason, the change is primarily targeted towards reducing
the impact of node liveness blips and not reducing the impact of
bringing nodes back up after sustained node outages. This is
intentional, as the former category of outage is the one more likely to
be caused by overload due to unquiescence traffic itself (as we see
in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 20, 2020