
kvserver: move raft.proto to proto3 #51816

Merged: 1 commit merged into cockroachdb:master on Jul 30, 2020

Conversation

nvanbenschoten (Member)

Along with storage_services.proto.

The only message in these files that is written to the Raft log
is ConfChangeContext. The two fields in this message are never
empty and do not depend on the absent-vs-zero distinction, so
there are no migration concerns.

All other message types are serialized above Raft and never
persisted.

The motivations for this change are minor:

1. it moves the last remaining two proto2 files in the entire
   `pkg/kv/...` package tree over to proto3, a tree that already
   contains twelve other proto3 files. This avoids questions about
   proto2 files importing proto3 protos.
2. it shaves 2 bytes off the serialized size of RaftHeartbeat and
   RaftMessageRequest, because the Quiesce field is no longer encoded
   when false, which is almost always the case for RaftHeartbeat and
   is always the case on the wire for RaftMessageRequest (a sketch of
   the wire-level effect follows this list).

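For illustration, here is a minimal, self-contained sketch of where those 2 bytes come from, written against the low-level `protowire` package rather than the actual generated RaftHeartbeat code; the field number 5 is illustrative, not necessarily the real one:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	// proto2: `optional bool quiesce = 5` has explicit presence, so a
	// value set to false is still written as a tag byte plus a varint
	// value byte: 2 bytes total for field numbers 1 through 15.
	var p2 []byte
	p2 = protowire.AppendTag(p2, 5, protowire.VarintType)
	p2 = protowire.AppendVarint(p2, 0) // false
	fmt.Println(len(p2)) // 2

	// proto3: a plain `bool quiesce = 5` has no field presence, so a
	// false value is not encoded at all. These are the 2 bytes shaved
	// off RaftHeartbeat and RaftMessageRequest.
	var p3 []byte
	fmt.Println(len(p3)) // 0
}
```

The same encoding rule is why the absent-vs-zero caveat matters: in proto3 an unset scalar and a zero-valued one are indistinguishable on the wire, which is only safe for ConfChangeContext because its two fields are never empty.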
@nvanbenschoten nvanbenschoten requested a review from tbg July 23, 2020 01:17
@cockroach-teamcity (Member)

This change is Reviewable

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from
not-live to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. Doing
so through a new `lagging_quiescence` set attached to quiescence
requests allows all quiesced replicas to agree on which node liveness
events need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcast a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
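As a rough sketch of that decision rule, with entirely hypothetical names (not CockroachDB's actual types; the real plumbing for the `lagging_quiescence` set lives in the kvserver quiescence protos and replica state):

```go
// Hypothetical sketch of the decision described above; not the actual
// CockroachDB implementation.
package quiesce

// ReplicaID identifies a replica within a Range.
type ReplicaID int

// QuiesceRequest is what a leader broadcasts when quiescing a Range.
// LaggingReplicas socializes which replicas were behind (or dead) at
// quiescence time, so every quiesced replica can later reach the same
// unquiesce decision independently.
type QuiesceRequest struct {
	LaggingReplicas map[ReplicaID]struct{}
}

// ShouldUnquiesce reports whether a node-liveness change for the given
// replica needs to wake the Range: only replicas that were lagging at
// quiescence time matter, so liveness blips of up-to-date replicas are
// ignored.
func ShouldUnquiesce(req QuiesceRequest, becameLive ReplicaID) bool {
	_, wasLagging := req.LaggingReplicas[becameLive]
	return wasLagging
}
```

The point of carrying the set in the quiesce request itself is that every replica can evaluate this predicate locally, so no replica needs to contact the others to decide whether a liveness event can be ignored.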
@nvanbenschoten (Member, Author)

bors r+

craig bot commented Jul 29, 2020

Build failed (retrying...):

@nvanbenschoten (Member, Author)

bors r+

craig bot commented Jul 30, 2020

Build succeeded:

@craig craig bot merged commit 4d29d15 into cockroachdb:master Jul 30, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftProto3 branch August 3, 2020 15:13
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 18, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from
not-live to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. Doing
so through a new `lagging_followers_on_quiesce` set attached to
quiescence requests allows all quiesced replicas to agree on which
node liveness events need to result in the range being unquiesced
and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcast a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it
reduces the set of ranges with a replica on a given dead node that
unquiesce when that node becomes live to only those ranges that had
write activity after the node went down. This is typically only a small
fraction of the total number of ranges on each node (operating on the
usual assumption that in an average large cluster, most data is
write-cold), especially when the outage is short. However, if all ranges
had write activity and subsequently quiesced before the outage resolved,
we would still have to unquiesce all ranges upon the node coming back
up. For this reason, the change is primarily targeted towards reducing
the impact of node liveness blips and not reducing the impact of
bringing nodes back up after sustained node outages. This is
intentional, as the former category of outage is the one more likely to
be caused by overload due to unquiescence traffic itself (as we see
in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 20, 2020