This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Update commit index and apply configs eagerly #465

Closed
wants to merge 4 commits

Conversation

cole-miller
Contributor

Fixes #386, using the approach suggested by @freeekanayaka.

Signed-off-by: Cole Miller cole.miller@canonical.com

Signed-off-by: Cole Miller <cole.miller@canonical.com>
Signed-off-by: Cole Miller <cole.miller@canonical.com>
Signed-off-by: Cole Miller <cole.miller@canonical.com>
Signed-off-by: Cole Miller <cole.miller@canonical.com>
@cole-miller
Contributor Author

please test downstream

@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Merging #465 (a8eb490) into master (0ea24be) will decrease coverage by 0.09%.
The diff coverage is 67.34%.

@@            Coverage Diff             @@
##           master     #465      +/-   ##
==========================================
- Coverage   76.72%   76.63%   -0.09%     
==========================================
  Files          51       51              
  Lines        9686     9696      +10     
  Branches     2476     2480       +4     
==========================================
- Hits         7432     7431       -1     
- Misses       1088     1094       +6     
- Partials     1166     1171       +5     
Files Changed Coverage Δ
src/replication.c 68.28% <62.50%> (-0.54%) ⬇️
src/convert.c 84.17% <66.66%> (-0.39%) ⬇️
src/fixture.c 92.76% <100.00%> (+0.02%) ⬆️

... and 1 file with indirect coverage changes

@cole-miller
Contributor Author

Clean set of downstream checks: https://github.com/canonical/raft/actions/runs/5766410935

@freeekanayaka (Contributor) left a comment

Nice work, looks good to me, just a nit and a clarification request.

if (rv != 0) {
goto err_after_acquire_entries;
}
/* Don't try to take a snapshot while holding onto log entries. */

Contributor

Can you expand a bit on this comment?

It's not entirely clear to me what it means, or what would happen instead if we did call replicationMaybeTakeSnapshot() here (and conversely why it's safe to call replicationMaybeTakeSnapshot() above, in the (n == 0) branch for empty AppendEntries).

Contributor Author

Yes, will do (if we don't end up deciding that this is the wrong approach...)

Contributor Author

After some digging, the specific problem I was seeing is a bad interaction with the raft I/O fixture. If the takeSnapshot in replicationAppend fails, the in-memory log entries from the append request will be un-referenced, but the I/O fixture will still have the append request in its queue and will eventually try to flush it, triggering a use-after-free. The right fix here is to make the fixture smarter (thanks for prompting me to work this out). The only reason we weren't seeing this before is that the shouldTakeSnapshot check is based on the last_applied index, and previously replicationAppend didn't update that index, so no snapshot would ever be triggered.

goto err_after_acquire_entries;
}
}
}

Contributor

I think it would be a tad simpler and more explicit to call membershipUncommittedChange() in the for loop above, right after the logAppend() call here.

That will save a bit of code (since the two loops should be the same) and make it clearer that configuration changes are applied immediately upon appending them to the in-memory log.
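
Roughly something like this (just a sketch to illustrate the shape; the signatures and surrounding code are assumptions, not the actual diff):

unsigned i;
int rv;
for (i = 0; i < args->n_entries; i++) {
    const struct raft_entry *entry = &args->entries[i];
    rv = logAppend(&r->log, entry->term, entry->type, &entry->buf, entry->batch);
    if (rv != 0) {
        goto err_after_acquire_entries;
    }
    if (entry->type == RAFT_CHANGE) {
        /* Apply the new configuration as soon as its entry is in the
         * in-memory log, rather than in a separate loop further down. */
        rv = membershipUncommittedChange(r, logLastIndex(&r->log), entry);
        if (rv != 0) {
            goto err_after_acquire_entries;
        }
    }
}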

Contributor Author

I agree that this is a better design. The reason I had this loop further down originally is that I wanted to get all the potential failure points out of the way before applying the new config/s; otherwise there's a question of what to do if (for example) we apply the configs and then the raft_io.append call fails. membershipRollback is not equipped to handle this situation, but I can add a little bit of new code to save the old (current, committed) configuration pair before the append loop and restore it if something goes wrong.
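
For example (very rough sketch of the idea only; configurationCopy is a stand-in for whatever helper we end up using to duplicate a configuration, and most error handling is elided):

/* Before the append loop: save the configuration state we might need to
 * restore. */
struct raft_configuration saved_configuration;
raft_index saved_committed = r->configuration_committed_index;
raft_index saved_uncommitted = r->configuration_uncommitted_index;
rv = configurationCopy(&r->configuration, &saved_configuration);
if (rv != 0) {
    goto err_after_acquire_entries;
}

/* ... append loop, applying any RAFT_CHANGE entries eagerly ... */

/* If something goes wrong afterwards (e.g. the raft_io append fails),
 * restore the saved state; on success, just release the saved copy. */
if (rv != 0) {
    raft_configuration_close(&r->configuration);
    r->configuration = saved_configuration;
    r->configuration_committed_index = saved_committed;
    r->configuration_uncommitted_index = saved_uncommitted;
}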

Contributor

I didn't look much into what type of failures could happen in this specific spot, but as a general rule I'd say we have two types of failures in raft:

  • Failures that make it totally impossible to proceed (for example failing to allocate memory that we absolutely need): in this case we should basically stop-the-line and transition to RAFT_UNAVAILABLE (user code will typically abort the process).

  • Failures that can be retried (that's typically I/O): in this case we should have some mechanism to retry (for example if the disk is currently full, but eventually the operator frees some of it). We currently handle that fine for network I/O, but not that much for disk I/O, and it's an area that should be improved.

In both cases, I don't think there's any need to restore the old (current, committed) configuration: we should continue operating with the uncommitted one and deal with I/O errors by retrying.

Note though, that I'm not totally convinced anymore that it's a good idea to immediately apply the new configuration at all, before persisting it. See the example scenario here and the follow-up notes.

@MathieuBordere
Contributor

replicationApply and replicationMaybeTakeSnapshot are often called one after the other; it feels like they could be combined.

@MathieuBordere
Contributor

I'd probably also prefer a couple more clean jepsen runs.

@MathieuBordere
Contributor

please test downstream

@freeekanayaka
Contributor

replicationApply and replicationMaybeTakeSnapshot are often called one after the other; it feels like they could be combined.

I thought that too, and that's the reason why they are currently combined in master. However I believe it's not possible anymore, since now you also need a way to apply entries without triggering snapshots, IIUC.

This is probably not bad per se, but it could be a tiny smell (see also my other comment).
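
To illustrate the smell: keeping them combined would now need something like an extra flag, e.g. a hypothetical helper along these lines (not actual code, signatures assumed):

static int applyAndMaybeSnapshot(struct raft *r, bool may_snapshot)
{
    int rv = replicationApply(r);
    if (rv != 0) {
        return rv;
    }
    if (may_snapshot) {
        /* Skip this when the caller is still holding onto log entries. */
        replicationMaybeTakeSnapshot(r);
    }
    return 0;
}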

@cole-miller
Contributor Author

Second clean downstream run: https://github.com/canonical/raft/actions/runs/5782948278

please test downstream

@freeekanayaka
Contributor

One situation that could happen is:

  • Server S is operating with configuration C_old, where S is a stand-by.
  • Server S receives configuration C_new, where S is a voter.
  • Server S immediately starts operating with C_new, even if the associated entry is not yet persisted
  • Server S receives a RequestVote request and grants its vote
  • Server S crashes before the configuration entry is persisted
  • Server S restarts. It finds it has granted a vote, although it operates with C_old in which it is not a voter

Practically speaking this particular scenario is probably going to be fine and not create any issue, but I just wanted to illustrate subtle behavior that might happen. There might be other similar scenarios which are actually problematic, I don't know, I haven't spent time analyzing them.

I had thought about this scenario even when I approved this PR (and when I proposed this approach in the first place), but I considered it ok. It probably is ok, however maybe we want to look at this type of issues from a broader perspective, see the next comment.

@freeekanayaka
Contributor

I'd like to expand a bit on @MathieuBordere's comment #464 (comment) about waiting for the entries to persist before converting to candidate.

There is currently a bit of "tension" between struct raft->log (the in-memory cache of the log) and the on-disk log, which typically lags behind the in-memory one. At the moment neither of them is a "single-source-of-truth" of what we consider the "current state", and instead we opportunistically use one or the other depending on the situation.

This approach is probably slightly more efficient (avoiding unnecessary lags in some cases), but it's also more tricky, as it requires thinking carefully about subtle situations.

Maybe we should consider making the on-disk log a hard "single-source-of-truth", and treat the in-memory cache purely as an optimization to be used in a couple of cases, but that in no situation gets treated as "current state".

When operating as leader, the in-memory cache would be used only to fetch entries that need to be sent to followers.

When operating as follower, the in-memory cache would be used only as a "buffer" for incoming entries.

When operating as candidate, we don't receive or send any entry, so the in-memory cache is not used (see below).

For any actual "decision" that the engine takes, the "current state" to refer to would be always the on-disk one, never the in-memory one, which becomes purely an implementation detail.

On top of that, we could also say that all state transitions (not only follower->candidate, all of them) must occur only when disk I/O has completed, and so the "current state" has become stable.

Such an approach would probably lead to slightly increased latency in some cases, and possibly to a bit of extra complexity in a few spots (e.g. we wouldn't be able to transition state immediately when the election timer expires). However it might also lead to some reduced overall complexity and, more importantly, make it easier to reason about corner cases, or eliminate some of them altogether.

There are a few important details about such an approach that are still not clear in my mind (and which might well end up making this approach not that appealing after all), but I wanted to get some feedback first.

@cole-miller
Contributor Author

One situation that could happen is

Yeah, the async design makes it easy to get into "inconsistent state" situations like this one. The question of where/whether to send an AppendEntriesResult message in a different term than the original request (see #421) has some of the same characteristics: there's a potentially long-running operation, we want to take some action at the end of this operation, but many assumptions about the shared state of the raft server can be invalidated in the intervening time. And so far we've been tackling these issues one-by-one instead of thinking globally about what bits of state are allowed to change when, what things can change in between the beginning of an async operation and its completion. I think it would be great to have a general plan for this!

@cole-miller
Contributor Author

On this specific issue, I would be just fine with returning to the approach from #464, which sounds like what @freeekanayaka is proposing as part of a broader look at how state transitions are handled.

@freeekanayaka
Contributor

One situation that could happen is

Yeah, the async design makes it easy to get into "inconsistent state" situations like this one. The question of where/whether to send an AppendEntriesResult message in a different term than the original request (see #421) has some of the same characteristics: there's a potentially long-running operation, we want to take some action at the end of this operation, but many assumptions about the shared state of the raft server can be invalidated in the intervening time. And so far we've been tackling these issues one-by-one instead of thinking globally about what bits of state are allowed to change when, what things can change in between the beginning of an async operation and its completion. I think it would be great to have a general plan for this!

Right, the important thing is to have a guideline that makes reasoning and decisions easier.

@freeekanayaka
Contributor

freeekanayaka commented Aug 7, 2023

On this specific issue, I would be just fine with returning to the approach from #464, which sounds like what @freeekanayaka is proposing as part of a broader look at how state transitions are handled.

Before choosing one approach or the other, I'd suggest understanding the details of both first. If there's a feeling that using the on-disk state as single-source-of-truth might be a good idea, we can explore that a bit more.

@freeekanayaka
Contributor

I've tried to put together a slightly more detailed analysis of the current situation and of the "wait for disk writes to settle before converting to candidate" approach. Hope it helps.

Election

  • When receiving a RequestVote message we always compare to the in-memory log, not to the on-disk log, which might be a few entries behind.

    Benefit: we don't let the requesting server win the election when it is actually slightly behind us, and we potentially avoid losing a pending transaction and hence requiring the client to re-submit it

    Safety: if we crash before persisting the in-flight entries or install-snapshot and then restart, we will effectively have a shorter log, and might start to grant our vote to the same candidate that we initially rejected, but that's fine.

    Proposal: nothing in particular needs to change.

  • When sending a RequestVote message, we obviously always use information from the on-disk log. Currently we convert to candidate and send RequestVote messages right away when the election timer expires, and we don't wait for pending writes to finish (writes could be new entries, or a snapshot being installed), so the on-disk log might be behind and the candidate has slightly lower chances of winning the election.

    Proposal: wait until pending entries or install-snapshot are finished, then convert to candidate and start the election.

    Benefit: the in-memory log will match the on-disk log for the whole duration of the candidate state, and the candidate should have a better chance of winning the election.

    Safety: no particular edge case should happen that we need to worry about.

    Question: should we also wait for take-snapshot operations to be finished? probably not, since technically that doesn't modify the length of the on-disk log, but we need to think if that leads to any weird edge case.

    Alternative: keep things as they are now, but if we finish persisting entries during the candidate state, we should apply any pending configuration change that was contained in the entries that have just been persisted (currently we just ignore the new configuration, see the code here, which is the cause of recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed #386)

Replication:

  • When receiving an AppendEntries/InstallSnapshot message, always use the information from the in-memory log.

    Benefit: the in-memory log accurately reflects what we have received so far, so when replying to an AppendEntries message we should use it to filter out duplicate entries or to inform the leader of where we are.

    Safety: if the server crashes before persisting in-flight entries or install-snapshot and then restarts, it will effectively have a shorter log, but that's fine as the leader will cope with that.

    Proposal: nothing in particular needs to change.

  • When sending an AppendEntries/InstallSnapshot message, it's not relevant whether the leader has pending entries not yet persisted to disk.

    Proposal: nothing in particular needs to change.

    Question: when stepping down from leader and becoming follower, should we wait for disk writes to settle? I believe that's not required in principle, but might need a bit more investigation.

  • When sending an AppendEntries/InstallSnapshot result, we obviously need to use the information from the on-disk log.

    Proposal: nothing in particular needs to change.

Commitment:

  • On followers we currently look at the on-disk log (last_stored) when updating the commit index and deciding whether to apply new entries to the FSM.

    Proposal: apply entries to the FSM as soon as we see the new commit index in the AppendEntries message handler.

    Question: as shown by this PR, that leads to a bit more complication, because we then need to take snapshots only when the associated entries have been persisted. So it might not be a net win, needs more investigation.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

@freeekanayaka Thanks for writing this! A couple of responses:

About the handling of config changes: I feel pretty negatively about the alternative you list of allowing candidates to change configuration during their election; I think it makes it harder to have a good mental model of how elections work, and I am not confident in being able to implement it without weird edge cases. Also, if we neither apply configs eagerly nor wait for them to be applied before becoming a candidate, we can potentially change our config while we are a leader, which seems similarly fraught.

I also wanted to point out that a conflict arises if we try to do both of (1) update the commit index eagerly and (2) wait for configs to be persisted before applying them (as uncommitted) -- because then we can potentially have a committed config that hasn't even been applied as uncommitted yet. I think this would cause issues if implemented naively; it pretty much breaks our model with a "current" config index that's always ahead of a fallback "last committed" config index. Maybe this model could be rethought; I'd need to ponder some more what a suitable replacement would be.

As for the question about snapshotting in replicationAppend, it seems like the problem I ran into there is in the raft I/O fixture (though I'm still working on making sure I have a clear picture of how snapshotting interacts with the eager commit index update) -- see my comment above. Looking beyond that specific bug, I think I do need to put more thought into how snapshotting interacts with the eager commit index update, because it becomes possible for us to kick off a snapshot that includes entries we haven't persisted yet. I was hoping to avoid that complication by moving the snapshot step to the append callback, but that doesn't really work, because we could have a sequence of events like

  • we accept an append request like (prev_index=1, num_entries=4, commit_index=5), update our own commit index to 5, apply all the entries, and schedule the callback
  • we accept a second append request (prev_index=5, num_entries=3, commit_index=8), and so on
  • the callback for the first request runs, and decides to take a snapshot through index 8, of which the last three entries haven't been persisted

So the "eager commit" change needs a careful look at snapshotting regardless of where we put the takeSnapshot call.

@cole-miller marked this pull request as draft August 8, 2023 17:55
@freeekanayaka
Contributor

freeekanayaka commented Aug 8, 2023

@freeekanayaka Thanks for writing this! A couple of responses:

About the handling of config changes: I feel pretty negatively about the alternative you list of allowing candidates to change configuration during their election; I think it makes it harder to have a good mental model of how elections work, and I am not confident in being able to implement it without weird edge cases. Also, if we neither apply configs eagerly nor wait for them to be applied before becoming a candidate, we can potentially change our config while we are a leader, which seems similarly fraught.

Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.

I also wanted to point out that a conflict arises if we try to do both of (1) update the commit index eagerly and (2) wait for configs to be persisted before applying them (as uncommitted) -- because then we can potentially have a committed config that hasn't even been applied as uncommitted yet. I think this would cause issues if implemented naively; it pretty much breaks our model with a "current" config index that's always ahead of a fallback "last committed" config index. Maybe this model could be rethought; I'd need to ponder some more what a suitable replacement would be.

Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).

As for the question about snapshotting in replicationAppend, it seems like the problem I ran into there is in the raft I/O fixture (though I'm still working on making sure I have a clear picture of how snapshotting interacts with the eager commit index update) -- see my comment above. Hopefully it will turn out that there aren't any more unexpected interactions.

Ok, that's a good data point, thanks. I didn't think much about snapshots as well yet, but I'd like too.

@cole-miller
Contributor Author

Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.

No, I meant while we're a leader, since we can potentially convert to candidate and win an election all while waiting for the entries to be persisted. Or am I missing something?

@freeekanayaka
Contributor

Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.

No, I meant while we're a leader, since we can potentially convert to candidate and win an election all while waiting for the entries to be persisted. Or am I missing something?

Ah ok. But we change configuration as leaders anyway, when accepting new configurations. There would be virtually no difference IIUC.

@freeekanayaka
Contributor

It should be safe to change the operating configuration at any time, as long as you have at most one uncommitted configuration.

@cole-miller
Contributor Author

Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).

In the case of installing a snapshot we update both config indices at once to the same value (and apply the corresponding config), so they stay coupled/in sync, and in particular the configuration_committed_index doesn't run past the configuration_uncommitted_index. This happens discontinuously, outside the normal flow by which the config indices are updated and the current config is changed, and in a context where we have total control over all those bits of state. In other places we update just one of these indices, and we have to decide between

A. Don't allow configuration_committed_index to run ahead of configuration_uncommitted_index, breaking the property that configuration_committed_index is the index of the last entry we have locally that's known to be committed; or
B. Allow configuration_committed_index to run ahead of configuration_uncommitted_index, in which case we need to do something about rollbacks

In either case the current names for these indices cease to make sense, unfortunately.

@freeekanayaka
Contributor

Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).

In the case of installing a snapshot we update both config indices at once to the same value (and apply the corresponding config), so they stay coupled/in sync, and in particular the configuration_committed_index doesn't run past the configuration_uncommitted_index. This happens discontinuously, outside the normal flow by which the config indices are updated and the current config is changed, and in a context where we have total control over all those bits of state. In other places we update just one of these indices, and we have to decide between

A. Don't allow configuration_committed_index to run ahead of configuration_uncommitted_index, breaking the property that configuration_committed_index is the index of the last entry we have locally that's known to be committed; or B. Allow configuration_committed_index to run ahead of configuration_uncommitted_index, in which case we need to do something about rollbacks

In either case the current names for these indices cease to make sense, unfortunately.

I'm not entirely sure I see what you mean. Can you give an example of a sequence of events that would lead to a situation that violates expectations? (either formal Raft expectations, or internal libraft expectations).

@freeekanayaka
Contributor

Perhaps something like:

  • Current commit_index is 4
  • Current configuration_committed_index is 3
  • Current configuration_uncommitted_index is 0

Then:

  • Receive C_new at index 5 and start persisting it
  • Learn that new commit index is 5
  • Apply C_new, configuration_committed_index is now 5 and configuration_uncommitted_index is still 0
  • Finish persisting C_new
  • We need to be careful and notice that C_new is committed, and leave configuration_uncommitted_index at 0

The main thing that concerns me here is that this is quite similar to the "apply configuration changes immediately" approach, in that we can end up operating with a configuration that is not persisted locally, and so do things like vote for somebody, crash, restart, and find out that we're not a voter.

@freeekanayaka
Contributor

freeekanayaka commented Aug 8, 2023

One thing that comes to mind is that it's probably fine to update commit_index eagerly; however, we should not apply entries that are not yet persisted locally (either FSM entries or config entries).

In other words:

  • commit_index immediately reflects the most recent commit index we know of
  • last_stored is the last log entry in the on-disk log
  • last_applied is the last entry that was applied (either FSM or configuration change)

We'd have the invariant commit_index >= last_stored >= last_applied.

EDIT: actually commit_index >= last_applied and last_stored >= last_applied, as I guess that commit_index can be either greater or smaller than last_stored.
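
Or, written as assertions over the struct raft fields we've been discussing (just to pin the idea down):

assert(r->last_applied <= r->commit_index);
assert(r->last_applied <= r->last_stored);
/* commit_index and last_stored are not ordered with respect to each other:
 * we can learn of a commit index we haven't persisted up to yet, and we can
 * also have persisted entries that are not yet known to be committed. */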

@cole-miller
Contributor Author

I'm not entirely sure I see what you mean. Can you give an example of a sequence of events that would lead to a situation that violates expectations? (either formal Raft expectations, or internal libraft expectations).

I guess I don't have a specific scenario in mind; it's just that I'm not sure what configuration_committed_index and configuration_uncommitted_index should mean (and therefore what the rules should be for updating them) in a world where the commit index is updated eagerly while config entries are applied not-eagerly. I'm sure there's a way to make it work, though.

@cole-miller
Contributor Author

One thing that comes to mind is that it's probably fine to update commit_index eagerly; however, we should not apply entries that are not yet persisted locally (either FSM entries or config entries).

Worth noting that leaders already do this:

/* A leader with slow disk commits an entry that it hasn't persisted yet,
* because enough followers to have a majority have aknowledged that they have
* appended the entry. The leader's last_stored field hence lags behind its
* commit_index. A new leader gets elected, with a higher commit index and sends
* first a new entry than a heartbeat to the old leader, that needs to update
* its commit_index taking into account its lagging last_stored. */
TEST(replication, lastStoredLaggingBehindCommitIndex, setUp, tearDown, 0, NULL)

That's part of what convinced me that it would likewise be okay for followers to apply entries before persisting them. But it also makes sense to me as a conservative choice to not apply anything until it has been persisted locally (like you say, treating the disk as the source of truth).

@freeekanayaka
Contributor

I'm not entirely sure I see what you mean. Can you give an example of a sequence of events that would lead to a situation that violates expectations? (either formal Raft expectations, or internal libraft expectations).

I guess I don't have a specific scenario in mind; it's just that I'm not sure what configuration_committed_index and configuration_uncommitted_index should mean (and therefore what the rules should be for updating them) in a world where the commit index is updated eagerly while config entries are applied not-eagerly. I'm sure there's a way to make it work, though.

I'd say configuration_committed_index and configuration_uncommitted_index maintain the same meaning. The fact that commit_index gets updated eagerly should not make any difference, as long as we continue applying FSM and config entries non-eagerly.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

I'd say configuration_committed_index and configuration_uncommitted_index maintain the same meaning.

Should configuration_uncommitted_index mean the index of the last configuration we have in our log, or the last one we've applied? And how do rollbacks work if the "fallback" index can be greater than the "current" index?

@freeekanayaka
Contributor

One thing that comes to mind is that it's probably fine to update commit_index eagerly; however, we should not apply entries that are not yet persisted locally (either FSM entries or config entries).

Worth noting that leaders already do this:

/* A leader with slow disk commits an entry that it hasn't persisted yet,
* because enough followers to have a majority have aknowledged that they have
* appended the entry. The leader's last_stored field hence lags behind its
* commit_index. A new leader gets elected, with a higher commit index and sends
* first a new entry than a heartbeat to the old leader, that needs to update
* its commit_index taking into account its lagging last_stored. */
TEST(replication, lastStoredLaggingBehindCommitIndex, setUp, tearDown, 0, NULL)

That's part of what convinced me that it would likewise be okay for followers to apply entries before persisting them. But it also makes sense to me as a conservative choice to not apply anything until it has been persisted locally (like you say, treating the disk as the source of truth).

Right, exactly. Leaders already advance their commit index when there is a quorum of followers, even if they didn't finish persisting those entries themselves yet. Followers should do the same. However, neither of them should apply anything (FSM or config) until it is persisted locally. So last_applied <= commit_index.

I believe this would make the situation more symmetric between followers and leaders, letting commit_index be updated eagerly, as it should be.

@freeekanayaka
Contributor

I'd say configuration_committed_index and configuration_uncommitted_index maintain the same meaning.

Should configuration_uncommitted_index mean the index of the last configuration we have in our log, or the last one we've applied?

The last one we've applied, i.e. the last one we have in our on-disk log. The last one we have in our in-memory log is basically "non-existent": in that case the in-memory log is just a buffer or a queue waiting to be processed. The same is true for FSM entries (e.g. entries in the in-memory log that are still being persisted to disk).

And how do rollbacks work if the "fallback" index can be greater than the "current" index?

What do you mean?

All the invariants described in this docstring should remain true.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

Oh, so you're saying that configuration_committed_index would not be updated to point to committed entries that are only in our in-memory log, but not stored on disk? That was the point of confusion for me.

@freeekanayaka
Contributor

Oh, so you're saying that configuration_committed_index would not be updated to point to committed entries that are only in our in-memory log, but not stored on disk? That was the point of confusion for me.

Correct. Updating commit_index doesn't mean that we have to update configuration_committed_index too at the same time.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

Okay, thanks for clearing that up. I disagree slightly that there's no need to update the docstring; I think this paragraph would need changing:

At all times configuration_committed_index is either zero or is the index of the most recent log entry of type RAFT_CHANGE that we know to be committed.

Part of my confusion was because the code currently does update configuration_committed_index in tandem with commit_index, and it wasn't clear to me that you were proposing to change that.

@freeekanayaka
Contributor

Okay, thanks for clearing that up. I disagree slightly that there's no need to update the docstring; I think this paragraph would need changing:

At all times configuration_committed_index is either zero or is the index of the most recent log entry of type RAFT_CHANGE that we know to be committed.

True, that would need a tweak.

Part of my confusion was because the code currently does update configuration_committed_index in tandem with commit_index, and it wasn't clear to me that you were proposing to change that.

Yeah, I didn't quite look at the code and I don't recall all the details of the current code; I was just trying to get a sound plan first. We could update #465 (comment) with all the additional insights from these last comments.

@cole-miller
Contributor Author

I'm still not 100% convinced that there are no problems with allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, and would prefer the more conservative approach of delaying the follower->candidate transition. I will try to think of an example where this causes trouble, or report if I can't do it :)

@freeekanayaka
Contributor

I'm still not 100% convinced that there are no problems with allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, and would prefer the more conservative approach of delaying the follower->candidate transition. I will try to think of an example where this causes trouble, or report if I can't do it :)

FWIW candidates would not apply config (or FSM) entries, since we'd wait for writes to settle before converting to candidate. So, yes, we would delay the follower->candidate transition. Everything we said in the last comments is about updating commit_index eagerly, which is an unrelated change that can be done separately (if at all).

@cole-miller
Contributor Author

FWIW candidates would not apply config (or FSM) entries, since we'd wait for writes to settle before converting to candidate. So, yes, we would delay the follower->candidate transition.

Oh, I mistook your position on this! Glad we're on the same page after all.

@freeekanayaka
Contributor

One situation that could happen is:

  • Server S is operating with configuration C_old, where S is a stand-by.
  • Server S receives configuration C_new, where S is a voter.
  • Server S immediately starts operating with C_new, even if the associated entry is not yet persisted
  • Server S receives a RequestVote request and grants its vote
  • Server S crashes before the configuration entry is persisted
  • Server S restarts. It finds it has granted a vote, although it operates with C_old in which it is not a voter

Practically speaking this particular scenario is probably going to be fine and not create any issue, but I just wanted to illustrate subtle behavior that might happen. There might be other similar scenarios which are actually problematic, I don't know, I haven't spent time analyzing them.

I had thought about this scenario even when I approved this PR (and when I proposed this approach in the first place), but I considered it ok. It probably is ok, however maybe we want to look at this type of issues from a broader perspective, see the next comment.

Actually I believe this scenario is kind of normal and expected, regardless of whether we apply config changes eagerly or not.

Consider the following scenario:

  • Server S is operating with configuration C_cur, where S is a voter.
  • Server S receives a RequestVote request and grants its vote for term T.
  • During term T, server S receives configuration C_new, where S is a stand-by.
  • Server S persists C_new and then crashes.
  • Server S restarts. It finds it has granted a vote for its current term T, although it operates with C_new in which it is not a voter.

This is to say that it should be normal for a server to restart and find a mismatch between its vote and the configuration it operates with.

It's also kind of similar to the situation where server S is the leader, but it's not part of the latest configuration. Quoting Section 4.2.2 of the Raft dissertation:

This approach leads to two implications about decision-making that are not particularly harmful
but may be surprising. First, there will be a period of time (while it is committing Cnew) when a
leader can manage a cluster that does not include itself; it replicates log entries but does not count
itself in majorities. Second, a server that is not part of its own latest configuration should still start
new elections, as it might still be needed until the Cnew entry is committed (as in Figure 4.6). It does
not count its own vote in elections unless it is part of its latest configuration

This is to say that, as pointed out in this comment, the argument described in Figure 4.3 of the dissertation means that in terms of safety the only thing that should count is that there must be at most one pending configuration change. All sorts of weird situations might arise, but they should be fine (including allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, as pointed out by @cole-miller here, although of course that would not be possible if we wait for writes to complete before converting to candidate).

I'm going to update the summary of the conversation with the insights we got so far.

@freeekanayaka
Contributor

freeekanayaka commented Aug 9, 2023

Updated analysis:

Election

  • When receiving a RequestVote message we always compare to the in-memory log, not to the on-disk log, which might be a few entries behind.

    Benefit: we don't let the requesting server win the election when it is actually slightly behind us, and we potentially avoid losing a pending transaction and hence requiring the client to re-submit it

    Safety: if we crash before persisting the in-flight entries or install-snapshot and then restart, we will effectively have a shorter log, and might start to grant our vote to the same candidate that we initially rejected, but that's fine.

    Proposal: nothing in particular needs to change.

  • When sending a RequestVote message, we obviously always use information from the on-disk log. Currently we convert to candidate and send RequestVote messages right away when the election timer expires, and we don't wait for pending writes to finish (writes could be new entries, or a snapshot being installed), so the on-disk log might be behind and the candidate has slightly lower chances of winning the election.

    Proposal: wait until pending entries or install-snapshot are finished, then convert to candidate and start the election.

    Benefit: the in-memory log will match the on-disk log for the whole duration of the candidate state, the candidate should have a better chance of winning the election, and there is less risk of losing a transaction and forcing the client to retry it.

    Safety: no particular edge case should happen that we need to worry about.

    Question: should we also wait for take-snapshot operations to be finished? probably not, since technically that doesn't modify the length of the on-disk log, but we need to think if that leads to any weird edge case.

    Alternative: keep things as they are now, but if we finish persisting entries during the candidate state, we should apply any pending configuration change that was contained in the entries that have just been persisted (currently we just ignore the new configuration, see the code here, which is the cause of recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed #386). Update: This should be safe and fix the bug at hand; however, it feels less palatable than waiting for writes to complete, which has the additional benefits described above.

Replication:

  • When receiving an AppendEntries/InstallSnapshot message, always use the information from the in-memory log.

    Benefit: the in-memory log accurately reflects what we have received so far, so when replying to an AppendEntries message we should use it to filter out duplicate entries or to inform the leader of where we are.

    Safety: if the server crashes before persisting in-flight entries or install-snapshot and then restarts, it will effectively have a shorter log, but that's fine as the leader will cope with that.

    Proposal: nothing in particular needs to change.

  • When sending an AppendEntries/InstallSnapshot message, it's not relevant whether the leader has pending entries not yet persisted to disk.

    Proposal: nothing in particular needs to change.

    Question: when stepping down from leader and becoming follower, should we wait for disk writes to settle? I believe that's not required in principle, but might need a bit more investigation.

  • When sending an AppendEntries/InstallSnapshot result, we obviously need to use the information from the on-disk log.

    Proposal: nothing in particular needs to change.

Commitment:

  • On followers we currently look at the on-disk log (last_stored) when updating the commit index and deciding whether to apply new entries to the FSM.

    Proposal: apply entries to the FSM as soon as we see the new commit index in the AppendEntries message handler.

    Question: as shown by this PR, that leads to a bit more complication, because we then need to take snapshots only when the associated entries have been persisted. So it might not be a net win, needs more investigation. Update: it seems that taking snapshots should actually be fine (although I didn't quite understand the whole story here). It also seems that situations like the one outlined here are actually fine, as per this argument.

Conclusion

  • In order to solve this particular bug, it seems that waiting for writes to settle before converting from follower to candidate is the best option, although there are still a few aspects to be cleared up (like whether we should wait for take-snapshot writes to complete).

  • Applying config entries eagerly, as well as updating commit_index eagerly and applying entries (either FSM or config) without waiting for them to be persisted, also seems fine and desirable, but might need a bit more investigation, and can be done separately as it's not strictly necessary to fix the bug at hand.

@cole-miller
Contributor Author

Thanks @freeekanayaka, that summary/plan of action looks good to me.

Question: when stepping down from leader and becoming follower, should we wait for disk writes to settle? I believe that's not required in principle, but might need a bit more investigation.

Ah, this is a tricky one, especially because leader->follower can happen for more than one reason. Let me try to write some tests that exercise those codepaths, just to suss out any easily-triggered misbehavior.

Question: should we also wait for take-snapshot operations to be finished? probably not, since technically that doesn't modify the length of the on-disk log, but we need to think if that leads to any weird edge case.

I think I'd support waiting in this case; I can't think of a specific bug that could be caused by not waiting, but it makes the whole system easier to reason about if this operation can't span the follower->candidate (and candidate->leader) boundary.

Successfully merging this pull request may close these issues.

recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed
3 participants