storage: excessively large raft logs #27772

petermattis · 2018-07-19T23:46:21Z

We've seen a handful of occurrences of excessively large Raft logs in clusters in the wild without being able to identify the root cause. A large Raft log should normally not occur as the proposal quota mechanism should limit the size of the Raft log if a range is healthy. If one node in the range is down the proposal quota mechanism will drop that replica from consideration, but at the same time allow Raft log truncation to occur, so the size of the Raft log should be limited to whatever Raft log truncation dictates (default 4MiB).

While each follower in a range maintains a Raft log, control of what gets written to the Raft log is in the hands of the leader. No leader and there are no Raft log writes. What happens if there is a leader but the Raft log entries it proposes are never getting applied? This can happen when the range is below quorum. Prior to 2.0, the Raft CheckQuorum mechanism would kick in and the leader would quickly step down. In 2.0 we turned on PreVote and disabled CheckQuorum (@bdarnell are they incompatible?). With CheckQuorum disabled and PreVote enabled a leader can remain the leader forever when the range is below quorum. The PreVote mechanism prevents another range from calling an election and without CheckQuorum enabled the leader won't step down.

Note that the proposal quota mechanism still applies in this scenario and incoming operations will eventually block waiting for quota. There is another mechanism at work. The periodic Raft proposal refresh (which is necessary to deal with dropped proposals) interacts badly with this scenario. In particular, we refresh pending proposals every leader election timeout period (3s?). Refreshing a proposal only results in reproposing if the lease index is still compatible, but if a range is below quorum no other commands will be applied so it seems like we would see reproposals.

The above is a theory based on a reading of the code. It matches the conditions in the clusters that have experienced problems with large Raft logs (i.e. ranges that have gone through long periods of unavailability). @nvanbenschoten is going to work on writing a test to reproduce which should be straightforward to do if the above is correct.

Fixing this problem shouldn't be too difficult. @bdarnell says:

we want to stop refreshRaftProposals(reasonTicks) if liveness indicates that we don't have a quorum of live followers

and

i think we might also want to only do reasonTicks refreshes if we are a follower

the reason that path exists is due to raft's unacknowledged proposal forwarding

if we handled ErrProposalDropped (#21849) and did our own forwarding, i think we might be able to get rid of reasonTicks refreshes completely

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2018-07-20T00:36:32Z

I wrote up a test case that proves this theory to be valid. See #27774.

nvanbenschoten · 2018-07-20T01:12:26Z

Because we use a raft.RawNode instead of the standard raft.node, the locking should be sufficient for us to determine whether we're a leader, a candidate, or a follower whenever we propose a command. One solution to this would then be to mark ProposalData as "non-reproposable" immediately before they have been proposed if the local raft group is the leader. We could then just ignore these proposals in refreshProposalsLocked whenver the reason is reasonTicks.

On a side note, paging this back in/learning new stuff about re-proposals has made me much more excited about #21849. I think we should begin handling raft.ErrProposalDropped soon even if we don't go the extra step of doing our own reliable proposal forwarding. It should be really simple and I fear we're leaving latency on the table for no real reason by ignoring these errors.

nvanbenschoten · 2018-07-20T01:21:58Z

The following is enough to get the new test (and all others in the storage package) to pass:

--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -3663,6 +3663,7 @@ func defaultSubmitProposalLocked(r *Replica, p *ProposalData) error {
                // We're proposing a command so there is no need to wake the leader if we
                // were quiesced.
                r.unquiesceLocked()
+               p.proposedByLeader = raftGroup.Status().RaftState == raft.StateLeader
                return false /* unquiesceAndWakeLeader */, raftGroup.Propose(encode(p.idKey, data))
        })
 }
@@ -4595,6 +4596,8 @@ func (r *Replica) refreshProposalsLocked(refreshAtDelta int, reason refreshRaftR
                } else if reason == reasonTicks && p.proposedAtTicks > r.mu.ticks-refreshAtDelta {
                        // The command was proposed too recently, don't bother reproprosing it
                        // yet.
+               } else if reason == reasonTicks && p.proposedByLeader {
+                       // The command could not have been lost.
                } else {
                        // The proposal can still apply according to its MaxLeaseIndex. But it's
                        // likely that the proposal was dropped, so we're going to repropose it
diff --git a/pkg/storage/replica_proposal.go b/pkg/storage/replica_proposal.go
index 9cd31af752..4f69e2a886 100644
--- a/pkg/storage/replica_proposal.go
+++ b/pkg/storage/replica_proposal.go
@@ -65,6 +65,10 @@ type ProposalData struct {
        // last (re-)proposed.
        proposedAtTicks int

+       // proposedByLeader indicates that the node that created the proposal was
+       // the leader when the proposal was proposed.
+       proposedByLeader bool
+
        // command is serialized and proposed to raft. In the event of
        // reproposals its MaxLeaseIndex field is mutated.
        command *storagebase.RaftCommand

bdarnell · 2018-07-20T02:40:15Z

In 2.0 we turned on PreVote and disabled CheckQuorum (@bdarnell are they incompatible?).

No, PreVote and CheckQuorum are compatible. We treated them as mutually exclusive because they are partially redundant (our original interest in CheckQuorum was for its ability to minimize disruption when recovering from partitions, which is also addressed in a better way by PreVote). The reason we disabled CheckQuorum when we enabled PreVote is that CheckQuorum is not quite compatible with quiesced replicas, and required hacks like the (now removed) tickQuiesced method.

I think we should begin handling raft.ErrProposalDropped soon even if we don't go the extra step of doing our own reliable proposal forwarding. It should be really simple and I fear we're leaving latency on the table for no real reason by ignoring these errors.

What specifically would you do with this error to improve latency? We can't do anything until we know the leader, and when we know we'll generate a reasonNewLeader refresh.

As for the fix, I wasn't thinking we'd need to track whether a particular proposal had been proposed by a leader. All we need is to check whether this replica is currently the leader, and if so ignore reasonTicks. The theory is that if we were not the leader at the time of the previous proposal, we'd get a reasonNewLeader when we became leader, and are therefore not relying on reasonTicks.

Alternately, we could use NodeLiveness, and ignore reasonTicks (regardless of our current state) whenever NodeLiveness indicates that there is not a quorum of live nodes. This is a little more expensive and a little more code than checking the state, but it feels like a narrower fix and might be more suitable to backport (I'm vaguely concerned about missing an edge case if we base everything on raft leadership).

petermattis · 2018-07-20T10:33:44Z

No, PreVote and CheckQuorum are compatible. We treated them as mutually exclusive because they are partially redundant (our original interest in CheckQuorum was for its ability to minimize disruption when recovering from partitions, which is also addressed in a better way by PreVote). The reason we disabled CheckQuorum when we enabled PreVote is that CheckQuorum is not quite compatible with quiesced replicas, and required hacks like the (now removed) tickQuiesced method.

Ah, so CheckQuorum was providing an unexpected benefit in preventing quorum-less leaders. We should all do some day dreaming about whether there are other scenarios where a quorum-less leader can cause a problem.

nvanbenschoten · 2018-07-20T17:29:26Z

What specifically would you do with this error to improve latency? We can't do anything until we know the leader, and when we know we'll generate a reasonNewLeader refresh.

I didn't fully understand the interactions here when posting that. I think I do a little better now. Is it safe to think about ErrProposalDropped errors as an indication that proposals are queueing and that we're waiting for a reasonNewLeader to send them all out?

Alternately, we could use NodeLiveness, and ignore reasonTicks (regardless of our current state) whenever NodeLiveness indicates that there is not a quorum of live nodes. This is a little more expensive and a little more code than checking the state, but it feels like a narrower fix and might be more suitable to backport (I'm vaguely concerned about missing an edge case if we base everything on raft leadership).

I think this would also adress the 5x replication factor issue discussion mentioned in #27774 (comment). Do you see any downsides to tying this to node liveness though? For instance, I'm worried we might run into issues on the node liveness range itself when a node that is currently down is trying to indicate that it is now live. Maybe that's fine because node liveness updates are in their own retry loop anyway.

bdarnell · 2018-07-20T19:10:40Z

Is it safe to think about ErrProposalDropped errors as an indication that proposals are queueing and that we're waiting for a reasonNewLeader to send them all out?

There is no queuing at the raft level. The raft Propose method does one of three things:

If leader, append to log
If follower of known leader, send (enqueue) MsgProp to the leader
If leader unknown, return ErrProposalDropped

ErrProposalDropped tells us that the leader is unknown, so we could infer that no proposals will succeed until we learn of the new leader.

Do you see any downsides to tying this to node liveness though? For instance, I'm worried we might run into issues on the node liveness range itself when a node that is currently down is trying to indicate that it is now live. Maybe that's fine because node liveness updates are in their own retry loop anyway.

Yeah, as I said on the other issue, there might be trouble with the liveness range itself. The fact that liveness updates have short timeouts and get retried themselves may mitigate it, though.

nvanbenschoten · 2018-07-20T19:34:27Z

There is no queuing at the raft level.

Right, I didn't mean actual queuing, I was just making an analogy for the sake of my own understanding. It is safe to ignore an ErrProposalDropped because we are still holding the proposal in Replica.mu.proposals and waiting until we learn about a new leader, at which point we re-propose in MaxLeaseIndex order.

Fixes cockroachdb#27772. This change adds safeguards to prevent cases where a raft log would grow without bound during loss of quorum scenarios. It also adds a new test that demonstrates that the raft log does not grow without bound in these cases. There are two cases that need to be handled to prevent the unbounded raft log growth observed in cockroachdb#27772. 1. When the leader proposes a command and cannot establish a quorum. In this case, we know the leader has the entry in its log, so there's no need to refresh it with `reasonTicks`. To avoid this, we no longer use `refreshTicks` as a leader. 2. When a follower proposes a command that is forwarded to the leader who cannot establish a quorum. In this case, the follower can't be sure (currently) that the leader got the proposal, so it needs to refresh using `reasonTicks`. However, the leader now detects duplicate forwarded proposals and avoids appending redundant entries to its log. It does so by maintaining a set of in-flight forwarded proposals that it has received during its term as leader. This set is reset after every leadership change. Both of these cases are tested against in the new TestLogGrowthWhenRefreshingPendingCommands. Without both of the safeguards introduced in this commit, the test fails. Release note (bug fix): Prevent loss of quorum situations from allowing unbounded growth of a Range's Raft log.

Fixes cockroachdb#27772. This change adds safeguards to prevent cases where a raft log would grow without bound during loss of quorum scenarios. It also adds a new test that demonstrates that the raft log does not grow without bound in these cases. There are two cases that need to be handled to prevent the unbounded raft log growth observed in cockroachdb#27772. 1. When the leader proposes a command and cannot establish a quorum. In this case, we know the leader has the entry in its log, so there's no need to refresh it with `reasonTicks`. To avoid this, we no longer use `refreshTicks` as a leader. 2. When a follower proposes a command that is forwarded to the leader who cannot establish a quorum. In this case, the follower can't be sure (currently) that the leader got the proposal, so it needs to refresh using `reasonTicks`. However, the leader now detects duplicate forwarded proposals and avoids appending redundant entries to its log. It does so by maintaining a set of in-flight forwarded proposals that it has received during its term as leader. This set is reset after every leadership change. Both of these cases are tested against in the new TestLogGrowthWhenRefreshingPendingCommands. Without both of the safeguards introduced in this commit, the test fails. Release note (bug fix): Prevent unbounded growth of the raft log caused by a loss of quorum.

27774: storage: prevent unbounded raft log growth without quorum r=nvanbenschoten a=nvanbenschoten Fixes #27772. This change adds safeguards to prevent cases where a raft log would grow without bound during loss of quorum scenarios. It also adds a new test that demonstrates that the raft log does not grow without bound in these cases. There are two cases that need to be handled to prevent the unbounded raft log growth observed in #27772. 1. When the leader proposes a command and cannot establish a quorum. In this case, we know the leader has the entry in its log, so there's no need to refresh it with `reasonTicks`. To avoid this, we no longer use `refreshTicks` as a leader. 2. When a follower proposes a command that is forwarded to the leader who cannot establish a quorum. In this case, the follower can't be sure (currently) that the leader got the proposal, so it needs to refresh using `reasonTicks`. However, the leader now detects duplicate forwarded proposals and avoids appending redundant entries to its log. It does so by maintaining a set of in-flight forwarded proposals that it has received during its term as leader. This set is reset after every leadership change. Both of these cases are tested against in the new TestLogGrowthWhenRefreshingPendingCommands. Without both of the safeguards introduced in this commit, the test fails. Release note (bug fix): Prevent loss of quorum situations from allowing unbounded growth of a Range's Raft log. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

Fixes cockroachdb#27772. This change adds safeguards to prevent cases where a raft log would grow without bound during loss of quorum scenarios. It also adds a new test that demonstrates that the raft log does not grow without bound in these cases. There are two cases that need to be handled to prevent the unbounded raft log growth observed in cockroachdb#27772. 1. When the leader proposes a command and cannot establish a quorum. In this case, we know the leader has the entry in its log, so there's no need to refresh it with `reasonTicks`. To avoid this, we no longer use `refreshTicks` as a leader. 2. When a follower proposes a command that is forwarded to the leader who cannot establish a quorum. In this case, the follower can't be sure (currently) that the leader got the proposal, so it needs to refresh using `reasonTicks`. However, the leader now detects duplicate forwarded proposals and avoids appending redundant entries to its log. It does so by maintaining a set of in-flight forwarded proposals that it has received during its term as leader. This set is reset after every leadership change. Both of these cases are tested against in the new TestLogGrowthWhenRefreshingPendingCommands. Without both of the safeguards introduced in this commit, the test fails. Release note (bug fix): Prevent unbounded growth of the raft log caused by a loss of quorum.

27868: backport-2.0: storage: prevent unbounded raft log growth without quorum r=nvanbenschoten a=nvanbenschoten Backport 2/2 commits from #27774. /cc @cockroachdb/release --- Fixes #27772. This change adds safeguards to prevent cases where a raft log would grow without bound during loss of quorum scenarios. It also adds a new test that demonstrates that the raft log does not grow without bound in these cases. There are two cases that need to be handled to prevent the unbounded raft log growth observed in #27772. 1. When the leader proposes a command and cannot establish a quorum. In this case, we know the leader has the entry in its log, so there's no need to refresh it with `reasonTicks`. To avoid this, we no longer use `refreshTicks` as a leader. 2. When a follower proposes a command that is forwarded to the leader who cannot establish a quorum. In this case, the follower can't be sure (currently) that the leader got the proposal, so it needs to refresh using `reasonTicks`. However, the leader now detects duplicate forwarded proposals and avoids appending redundant entries to its log. It does so by maintaining a set of in-flight forwarded proposals that it has received during its term as leader. This set is reset after every leadership change. Both of these cases are tested against in the new TestLogGrowthWhenRefreshingPendingCommands. Without both of the safeguards introduced in this commit, the test fails. Release note (bug fix): Prevent loss of quorum situations from allowing unbounded growth of a Range's Raft log. 28225: release-2.0: importccl: Preserve '\r\n' during CSV import r=dt a=dt Backport 1/1 commits from #28181. /cc @cockroachdb/release --- See #25344. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: neeral <neeral@users.noreply.github.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com>

The suggested pattern for Raft proposals is that they be retried periodically until they succeed. This turns out to be an issue when a leader cannot commit entries because the leader will continue to append re-proposed entries to its log without committing anything. This can result in the uncommitted tail of a leader's log growing without bound until it is able to commit entries. This change add a safeguard to protect against this case where a leader's log can grow without bound during loss of quorum scenarios. It does so by introducing a new, optional `MaxUncommittedEntries` configuration. This config limits the max number of uncommitted entries that may be appended to a leader's log. Once this limit is exceeded, proposals will begin to return ErrProposalDropped errors. See cockroachdb/cockroach#27772

The suggested pattern for Raft proposals is that they be retried periodically until they succeed. This turns out to be an issue when a leader cannot commit entries because the leader will continue to append re-proposed entries to its log without committing anything. This can result in the uncommitted tail of a leader's log growing without bound until it is able to commit entries. This change add a safeguard to protect against this case where a leader's log can grow without bound during loss of quorum scenarios. It does so by introducing a new, optional ``MaxUncommittedEntriesSize configuration. This config limits the max aggregate size of uncommitted entries that may be appended to a leader's log. Once this limit is exceeded, proposals will begin to return ErrProposalDropped errors. See cockroachdb/cockroach#27772

petermattis added S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting A-kv-client Relating to the KV client and the KV interface. labels Jul 19, 2018

petermattis added this to the 2.0.x milestone Jul 19, 2018

petermattis assigned nvanbenschoten and petermattis Jul 19, 2018

nvanbenschoten mentioned this issue Jul 20, 2018

storage: prevent unbounded raft log growth without quorum #27774

Merged

tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 22, 2018

craig bot closed this as completed in #27774 Jul 23, 2018

craig bot added this to Finished (milestone 3, ends 7/20) in KV via automation Jul 23, 2018

nvanbenschoten mentioned this issue Jul 23, 2018

backport-2.0: storage: prevent unbounded raft log growth without quorum #27868

Merged

nvanbenschoten mentioned this issue Oct 9, 2018

raft: provide protection against unbounded Raft log growth etcd-io/etcd#10167

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: excessively large raft logs #27772

storage: excessively large raft logs #27772

petermattis commented Jul 19, 2018

nvanbenschoten commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018 •

edited

Loading

bdarnell commented Jul 20, 2018

petermattis commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018

bdarnell commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018

storage: excessively large raft logs #27772

storage: excessively large raft logs #27772

Comments

petermattis commented Jul 19, 2018

nvanbenschoten commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018 • edited Loading

bdarnell commented Jul 20, 2018

petermattis commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018

bdarnell commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018

nvanbenschoten commented Jul 20, 2018 •

edited

Loading