This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Update commit index and apply configs eagerly #465

Closed
wants to merge 4 commits

Conversation

cole-miller
Contributor

Fixes #386, using the approach suggested by @freeekanayaka.

Signed-off-by: Cole Miller cole.miller@canonical.com

Signed-off-by: Cole Miller <cole.miller@canonical.com>
Signed-off-by: Cole Miller <cole.miller@canonical.com>
Signed-off-by: Cole Miller <cole.miller@canonical.com>
Signed-off-by: Cole Miller <cole.miller@canonical.com>
@cole-miller
Contributor Author

please test downstream

@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Merging #465 (a8eb490) into master (0ea24be) will decrease coverage by 0.09%.
The diff coverage is 67.34%.

@@            Coverage Diff             @@
##           master     #465      +/-   ##
==========================================
- Coverage   76.72%   76.63%   -0.09%     
==========================================
  Files          51       51              
  Lines        9686     9696      +10     
  Branches     2476     2480       +4     
==========================================
- Hits         7432     7431       -1     
- Misses       1088     1094       +6     
- Partials     1166     1171       +5     
Files Changed Coverage Δ
src/replication.c 68.28% <62.50%> (-0.54%) ⬇️
src/convert.c 84.17% <66.66%> (-0.39%) ⬇️
src/fixture.c 92.76% <100.00%> (+0.02%) ⬆️

... and 1 file with indirect coverage changes

@cole-miller
Contributor Author

Clean set of downstream checks: https://github.com/canonical/raft/actions/runs/5766410935

@freeekanayaka (Contributor) left a comment

Nice work, looks good to me, just a nit and a clarification request.

if (rv != 0) {
goto err_after_acquire_entries;
}
/* Don't try to take a snapshot while holding onto log entries. */

Contributor

Can you expand a bit on this comment?

It's not entirely clear to me what it means, or what would happen instead if we did call replicationMaybeTakeSnapshot() here (and conversely why it's safe to call replicationMaybeTakeSnapshot() above, in the (n == 0) branch for empty AppendEntries).

Contributor Author

Yes, will do (if we don't end up deciding that this is the wrong approach...)

Contributor Author

After some digging, the specific problem I was seeing is a bad interaction with the raft I/O fixture. If the takeSnapshot in replicationAppend fails, the in-memory log entries from the append request will be un-referenced, but the I/O fixture will still have the append request in its queue and will eventually try to flush it, triggering a use-after-free. The right fix here is to make the fixture smarter (thanks for prompting me to work this out). The only reason we weren't seeing this before is that the shouldTakeSnapshot check is based on the last_applied index, and previously replicationAppend didn't update that index, so no snapshot would ever be triggered.

goto err_after_acquire_entries;
}
}
}

Contributor

I think it would be a tad simpler and more explicit to call membershipUncommittedChange() in the for loop above, right after the logAppend() call here.

That will save a bit of code (since the two loops should be the same) and make it clearer that configuration changes are applied immediately upon appending them to the in-memory log.
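
Roughly something like this (just a sketch to illustrate the shape; the signatures and surrounding code are assumptions, not the actual diff):

unsigned i;
int rv;
for (i = 0; i < args->n_entries; i++) {
    const struct raft_entry *entry = &args->entries[i];
    rv = logAppend(&r->log, entry->term, entry->type, &entry->buf, entry->batch);
    if (rv != 0) {
        goto err_after_acquire_entries;
    }
    if (entry->type == RAFT_CHANGE) {
        /* Apply the new configuration as soon as its entry is in the
         * in-memory log, rather than in a separate loop further down. */
        rv = membershipUncommittedChange(r, logLastIndex(&r->log), entry);
        if (rv != 0) {
            goto err_after_acquire_entries;
        }
    }
}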

Contributor Author

I agree that this is a better design. The reason I had this loop further down originally is that I wanted to get all the potential failure points out of the way before applying the new config/s; otherwise there's a question of what to do if (for example) we apply the configs and then the raft_io.append call fails. membershipRollback is not equipped to handle this situation, but I can add a little bit of new code to save the old (current, committed) configuration pair before the append loop and restore it if something goes wrong.
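
For example (very rough sketch of the idea only; configurationCopy is a stand-in for whatever helper we end up using to duplicate a configuration, and most error handling is elided):

/* Before the append loop: save the configuration state we might need to
 * restore. */
struct raft_configuration saved_configuration;
raft_index saved_committed = r->configuration_committed_index;
raft_index saved_uncommitted = r->configuration_uncommitted_index;
rv = configurationCopy(&r->configuration, &saved_configuration);
if (rv != 0) {
    goto err_after_acquire_entries;
}

/* ... append loop, applying any RAFT_CHANGE entries eagerly ... */

/* If something goes wrong afterwards (e.g. the raft_io append fails),
 * restore the saved state; on success, just release the saved copy. */
if (rv != 0) {
    raft_configuration_close(&r->configuration);
    r->configuration = saved_configuration;
    r->configuration_committed_index = saved_committed;
    r->configuration_uncommitted_index = saved_uncommitted;
}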

Contributor

I didn't look much into what type of failures could happen in this specific spot, but as a general rule I'd say we have two types of failures in raft:

  • Failures that make it totally impossible to proceed (for example failing to allocate memory that we absolutely need): in this case we should basically stop-the-line and transition to RAFT_UNAVAILABLE (user code will typically abort the process).

  • Failures that can be retried (that's typically I/O): in this case we should have some mechanism to retry (for example if the disk is currently full, but eventually the operator frees some of it). We currently handle that fine for network I/O, but not that much for disk I/O, and it's an area that should be improved.

In both cases, I don't think there's any need to restore the old (current, committed) configuration: we should continue operating with the uncommitted one and deal with I/O errors by retrying.

Note though, that I'm not totally convinced anymore that it's a good idea to immediately apply the new configuration at all, before persisting it. See the example scenario here and the follow-up notes.

@MathieuBordere
Contributor

replicationApply and replicationMaybeTakeSnapshot are often called one after the other; it feels like they could be combined.

@MathieuBordere
Contributor

I'd probably also prefer a couple more clean jepsen runs.

@MathieuBordere
Contributor

please test downstream

@freeekanayaka
Contributor

replicationApply and replicationMaybeTakeSnapshot are often called one after the other; it feels like they could be combined.

I thought that too, and that's the reason why they are currently combined in master. However I believe it's not possible anymore, since now you also need a way to apply entries without triggering snapshots, IIUC.

This is probably not bad per se, but it could be a tiny smell (see also my other comment).
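
To illustrate the smell: keeping them combined would now need something like an extra flag, e.g. a hypothetical helper along these lines (not actual code, signatures assumed):

static int applyAndMaybeSnapshot(struct raft *r, bool may_snapshot)
{
    int rv = replicationApply(r);
    if (rv != 0) {
        return rv;
    }
    if (may_snapshot) {
        /* Skip this when the caller is still holding onto log entries. */
        replicationMaybeTakeSnapshot(r);
    }
    return 0;
}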

@cole-miller
Contributor Author

Second clean downstream run: https://github.com/canonical/raft/actions/runs/5782948278

please test downstream

@freeekanayaka
Contributor

One situation that could happen is:

  • Server S is operating with configuration C_old, where S is a stand-by.
  • Server S receives configuration C_new, where S is a voter.
  • Server S immediately starts operating with C_new, even if the associated entry is not yet persisted
  • Server S receives a RequestVote request and grants its vote
  • Server S crashes before the configuration entry is persisted
  • Server S restarts. It finds it has granted a vote, although it operates with C_old in which it is not a voter

Practically speaking this particular scenario is probably going to be fine and not create any issue, but I just wanted to illustrate subtle behavior that might happen. There might be other similar scenarios which are actually problematic, I don't know, I haven't spent time analyzing them.

I had thought about this scenario even when I approved this PR (and when I proposed this approach in the first place), but I considered it ok. It probably is ok, however maybe we want to look at this type of issues from a broader perspective, see the next comment.

@freeekanayaka
Contributor

I'd like to expand a bit on @MathieuBordere's comment #464 (comment) about waiting for the entries to persist before converting to candidate.

There is currently a bit of "tension" between struct raft->log (the in-memory cache of the log) and the on-disk log, which typically lags behind the in-memory one. At the moment neither of them is a "single-source-of-truth" of what we consider the "current state", and instead we opportunistically use one or the other depending on the situation.

This approach is probably slightly more efficient (avoiding unnecessary lags in some cases), but it's also more tricky, as it requires thinking carefully about subtle situations.

Maybe we should consider making the on-disk log a hard "single-source-of-truth", and treat the in-memory cache purely as an optimization to be used in a couple of cases, but that in no situation gets treated as "current state".

When operating as leader, the in-memory cache would be used only to fetch entries that need to be sent to followers.

When operating as follower, the in-memory cache would be used only as a "buffer" for incoming entries.

When operating as candidate, we don't receive or send any entry, so the in-memory cache is not used (see below).

For any actual "decision" that the engine takes, the "current state" to refer to would be always the on-disk one, never the in-memory one, which becomes purely an implementation detail.

On top of that, we could also say that all state transitions (not only follower->candidate, all of them) must occur only when disk I/O has completed, and so the "current state" has become stable.

Such an approach would probably lead to slightly increased latency in some cases, and possibly to a bit of extra complexity in a few spots (e.g. we wouldn't be able to transition state immediately when the election timer expires). However it might also lead to some reduced overall complexity and, more importantly, make it easier to reason about corner cases, or eliminate some of them altogether.

There are a few important details about such an approach that are still not clear in my mind (and which might well end up making this approach not that appealing after all), but I wanted to get some feedback first.

@cole-miller
Contributor Author

One situation that could happen is

Yeah, the async design makes it easy to get into "inconsistent state" situations like this one. The question of where/whether to send an AppendEntriesResult message in a different term than the original request (see #421) has some of the same characteristics: there's a potentially long-running operation, we want to take some action at the end of this operation, but many assumptions about the shared state of the raft server can be invalidated in the intervening time. And so far we've been tackling these issues one-by-one instead of thinking globally about what bits of state are allowed to change when, what things can change in between the beginning of an async operation and its completion. I think it would be great to have a general plan for this!

@cole-miller
Contributor Author

On this specific issue, I would be just fine with returning to the approach from #464, which sounds like what @freeekanayaka is proposing as part of a broader look at how state transitions are handled.

@freeekanayaka
Contributor

One situation that could happen is

Yeah, the async design makes it easy to get into "inconsistent state" situations like this one. The question of where/whether to send an AppendEntriesResult message in a different term than the original request (see #421) has some of the same characteristics: there's a potentially long-running operation, we want to take some action at the end of this operation, but many assumptions about the shared state of the raft server can be invalidated in the intervening time. And so far we've been tackling these issues one-by-one instead of thinking globally about what bits of state are allowed to change when, what things can change in between the beginning of an async operation and its completion. I think it would be great to have a general plan for this!

Right, the important thing is to have a guideline that makes reasoning and decisions easier.

@freeekanayaka
Contributor

freeekanayaka commented Aug 7, 2023

On this specific issue, I would be just fine with returning to the approach from #464, which sounds like what @freeekanayaka is proposing as part of a broader look at how state transitions are handled.

Before choosing one approach or the other, I'd suggest understanding the details of both first. If there's a feeling that using the on-disk state as single-source-of-truth might be a good idea, we can explore that a bit more.

@freeekanayaka
Contributor

I've tried to put together a slightly more detailed analysis of the current situation and of the "wait for disk writes to settle before converting to candidate" approach. Hope it helps.

Election

  • When receiving a RequestVote message we always compare to the in-memory log, not to the on-disk log, which might be a few entries behind.

    Benefit: we don't let the requesting server win the election when it is actually slightly behind us, and we potentially avoid losing a pending transaction and hence requiring the client to re-submit it

    Safety: if we crash before persisting the in-flight entries or install-snapshot and then restart, we will effectively have a shorter log, and might start to grant our vote to the same candidate that we initially rejected, but that's fine.

    Proposal: nothing in particular needs to change.

  • When sending a RequestVote message, we obviously always use information from the on-disk log. Currently we convert to candidate and send RequestVote messages right away when the election timer expires, and we don't wait for pending writes to finish (writes could be new entries, or a snapshot being installed), so the on-disk log might be behind and the candidate has slightly lower chances of winning the election.

    Proposal: wait until pending entries or install-snapshot are finished, then convert to candidate and start the election.

    Benefit: the in-memory log will match the on-disk log for the whole duration of the candidate state, and the candidate should have a better chance of winning the election.

    Safety: no particular edge case should happen that we need to worry about.

    Question: should we also wait for take-snapshot operations to be finished? probably not, since technically that doesn't modify the length of the on-disk log, but we need to think if that leads to any weird edge case.

    Alternative: keep things as they are now, but if we finish persisting entries during the candidate state, we should apply any pending configuration change that was contained in the entries that have just been persisted (currently we just ignore the new configuration, see the code here, which is the cause of recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed #386)

Replication:

  • When receiving an AppendEntries/InstallSnapshot message, always use the information from the in-memory log.

    Benefit: the in-memory log accurately reflects what we have received so far, so when replying to an AppendEntries message we should use it to filter out duplicate entries or to inform the leader of where we are.

    Safety: if the server crashes before persisting in-flight entries or install-snapshot and then restarts, it will effectively have a shorter log, but that's fine as the leader will cope with that.

    Proposal: nothing in particular needs to change.

  • When sending an AppendEntries/InstallSnapshot message, it's not relevant whether the leader has pending entries not yet persisted to disk.

    Proposal: nothing in particular needs to change.

    Question: when stepping down from leader and becoming follower, should we wait for disk writes to settle? I believe that's not required in principle, but might need a bit more investigation.

  • When sending an AppendEntries/InstallSnapshot result, we obviously need to use the information from the on-disk log.

    Proposal: nothing in particular needs to change.

Commitment:

  • On followers we currently look at the on-disk log (last_stored) when updating the commit index and deciding whether to apply new entries to the FSM.

    Proposal: apply entries to the FSM as soon as we see the new commit index in the AppendEntries message handler.

    Question: as shown by this PR, that leads to a bit more complication, because we then need to take snapshots only when the associated entries have been persisted. So it might not be a net win, needs more investigation.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

@freeekanayaka Thanks for writing this! A couple of responses:

About the handling of config changes: I feel pretty negatively about the alternative you list of allowing candidates to change configuration during their election; I think it makes it harder to have a good mental model of how elections work, and I am not confident in being able to implement it without weird edge cases. Also, if we neither apply configs eagerly nor wait for them to be applied before becoming a candidate, we can potentially change our config while we are a leader, which seems similarly fraught.

I also wanted to point out that a conflict arises if we try to do both of (1) update the commit index eagerly and (2) wait for configs to be persisted before applying them (as uncommitted) -- because then we can potentially have a committed config that hasn't even been applied as uncommitted yet. I think this would cause issues if implemented naively; it pretty much breaks our model with a "current" config index that's always ahead of a fallback "last committed" config index. Maybe this model could be rethought; I'd need to ponder some more what a suitable replacement would be.

As for the question about snapshotting in replicationAppend, it seems like the problem I ran into there is in the raft I/O fixture (though I'm still working on making sure I have a clear picture of how snapshotting interacts with the eager commit index update) -- see my comment above. Looking beyond that specific bug, I think I do need to put more thought into how snapshotting interacts with the eager commit index update, because it becomes possible for us to kick off a snapshot that includes entries we haven't persisted yet. I was hoping to avoid that complication by moving the snapshot step to the append callback, but that doesn't really work, because we could have a sequence of events like

  • we accept an append request like (prev_index=1, num_entries=4, commit_index=5), update our own commit index to 5, apply all the entries, and schedule the callback
  • we accept a second append request (prev_index=5, num_entries=3, commit_index=8), and so on
  • the callback for the first request runs, and decides to take a snapshot through index 8, of which the last three entries haven't been persisted

So the "eager commit" change needs a careful look at snapshotting regardless of where we put the takeSnapshot call.

@cole-miller marked this pull request as draft August 8, 2023 17:55
@freeekanayaka
Contributor

freeekanayaka commented Aug 8, 2023

@freeekanayaka Thanks for writing this! A couple of responses:

About the handling of config changes: I feel pretty negatively about the alternative you list of allowing candidates to change configuration during their election; I think it makes it harder to have a good mental model of how elections work, and I am not confident in being able to implement it without weird edge cases. Also, if we neither apply configs eagerly nor wait for them to be applied before becoming a candidate, we can potentially change our config while we are a leader, which seems similarly fraught.

Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.

I also wanted to point out that a conflict arises if we try to do both of (1) update the commit index eagerly and (2) wait for configs to be persisted before applying them (as uncommitted) -- because then we can potentially have a committed config that hasn't even been applied as uncommitted yet. I think this would cause issues if implemented naively; it pretty much breaks our model with a "current" config index that's always ahead of a fallback "last committed" config index. Maybe this model could be rethought; I'd need to ponder some more what a suitable replacement would be.

Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).

As for the question about snapshotting in replicationAppend, it seems like the problem I ran into there is in the raft I/O fixture (though I'm still working on making sure I have a clear picture of how snapshotting interacts with the eager commit index update) -- see my comment above. Hopefully it will turn out that there aren't any more unexpected interactions.

Ok, that's a good data point, thanks. I didn't think much about snapshots as well yet, but I'd like too.

@cole-miller
Contributor Author

Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.

No, I meant while we're a leader, since we can potentially convert to candidate and win an election all while waiting for the entries to be persisted. Or am I missing something?

@freeekanayaka
Contributor

Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.

No, I meant while we're a leader, since we can potentially convert to candidate and win an election all while waiting for the entries to be persisted. Or am I missing something?

Ah ok. But we change configuration as leaders anyway, when accepting new configurations. There would be virtually no difference IIUC.

@freeekanayaka
Contributor

It should be safe to change the operating configuration at any time, as long as you have at most one uncommitted configuration.

@cole-miller
Contributor Author

Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).

In the case of installing a snapshot we update both config indices at once to the same value (and apply the corresponding config), so they stay coupled/in sync, and in particular the configuration_committed_index doesn't run past the configuration_uncommitted_index. This happens discontinuously, outside the normal flow by which the config indices are updated and the current config is changed, and in a context where we have total control over all those bits of state. In other places we update just one of these indices, and we have to decide between

A. Don't allow configuration_committed_index to run ahead of configuration_uncommitted_index, breaking the property that configuration_committed_index is the index of the last entry we have locally that's known to be committed; or
B. Allow configuration_committed_index to run ahead of configuration_uncommitted_index, in which case we need to do something about rollbacks

In either case the current names for these indices cease to make sense, unfortunately.

@freeekanayaka
Contributor

Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).

In the case of installing a snapshot we update both config indices at once to the same value (and apply the corresponding config), so they stay coupled/in sync, and in particular the configuration_committed_index doesn't run past the configuration_uncommitted_index. This happens discontinuously, outside the normal flow by which the config indices are updated and the current config is changed, and in a context where we have total control over all those bits of state. In other places we update just one of these indices, and we have to decide between

A. Don't allow configuration_committed_index to run ahead of configuration_uncommitted_index, breaking the property that configuration_committed_index is the index of the last entry we have locally that's known to be committed; or B. Allow configuration_committed_index to run ahead of configuration_uncommitted_index, in which case we need to do something about rollbacks

In either case the current names for these indices cease to make sense, unfortunately.

I'm not entirely sure I see what you mean. Can you give an example of a sequence of events that would lead to a situation that violates expectations? (either formal Raft expectations, or internal libraft expectations).

@freeekanayaka
Contributor

Perhaps something like:

  • Current commit_index is 4
  • Current configuration_committed_index is 3
  • Current configuration_uncommitted_index is 0

Then:

  • Receive C_new at index 5 and start persisting it
  • Learn that new commit index is 5
  • Apply C_new, configuration_committed_index is now 5 and configuration_uncommitted_index is still 0
  • Finish persisting C_new
  • We need to be careful and notice that C_new is committed, and leave configuration_uncommitted_index at 0

The main thing that concerns me here is that this is quite similar to the "apply configuration changes immediately" approach, in that we can end up operating with a configuration that is not persisted locally, and so do things like vote for somebody, crash, restart, and find out that we're not a voter.

@freeekanayaka
Contributor

freeekanayaka commented Aug 8, 2023

One thing that comes to mind is that it's probably fine to update commit_index eagerly; however, we should not apply entries that are not yet persisted locally (either FSM entries or config entries).

In other words:

  • commit_index immediately reflects the most recent commit index we know of
  • last_stored is the last log entry in the on-disk log
  • last_applied is the last entry that was applied (either FSM or configuration change)

We'd have the invariant commit_index >= last_stored >= last_applied.

EDIT: actually commit_index >= last_applied and last_stored >= last_applied, as I guess that commit_index can be either greater or smaller than last_stored.
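
Or, written as assertions over the struct raft fields we've been discussing (just to pin the idea down):

assert(r->last_applied <= r->commit_index);
assert(r->last_applied <= r->last_stored);
/* commit_index and last_stored are not ordered with respect to each other:
 * we can learn of a commit index we haven't persisted up to yet, and we can
 * also have persisted entries that are not yet known to be committed. */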

@cole-miller
Contributor Author

I'm not entirely sure I see what you mean. Can you give an example of a sequence of events that would lead to a situation that violates expectations? (either formal Raft expectations, or internal libraft expectations).

I guess I don't have a specific scenario in mind; it's just that I'm not sure what configuration_committed_index and configuration_uncommitted_index should mean (and therefore what the rules should be for updating them) in a world where the commit index is updated eagerly while config entries are applied not-eagerly. I'm sure there's a way to make it work, though.

@cole-miller
Contributor Author

One thing that comes to mind is that it's probably fine to update commit_index eagerly; however, we should not apply entries that are not yet persisted locally (either FSM entries or config entries).

Worth noting that leaders already do this:

/* A leader with slow disk commits an entry that it hasn't persisted yet,
* because enough followers to have a majority have aknowledged that they have
* appended the entry. The leader's last_stored field hence lags behind its
* commit_index. A new leader gets elected, with a higher commit index and sends
* first a new entry than a heartbeat to the old leader, that needs to update
* its commit_index taking into account its lagging last_stored. */
TEST(replication, lastStoredLaggingBehindCommitIndex, setUp, tearDown, 0, NULL)

That's part of what convinced me that it would likewise be okay for followers to apply entries before persisting them. But it also makes sense to me as a conservative choice to not apply anything until it has been persisted locally (like you say, treating the disk as the source of truth).

@freeekanayaka
Contributor

I'm not entirely sure I see what you mean. Can you give an example of a sequence of events that would lead to a situation that violates expectations? (either formal Raft expectations, or internal libraft expectations).

I guess I don't have a specific scenario in mind; it's just that I'm not sure what configuration_committed_index and configuration_uncommitted_index should mean (and therefore what the rules should be for updating them) in a world where the commit index is updated eagerly while config entries are applied not-eagerly. I'm sure there's a way to make it work, though.

I'd say configuration_committed_index and configuration_uncommitted_index maintain the same meaning. The fact that commit_index gets updated eagerly should not make any difference, as long as we continue applying FSM and config entries non-eagerly.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

I'd say configuration_committed_index and configuration_uncommitted_index maintain the same meaning.

Should configuration_uncommitted_index mean the index of the last configuration we have in our log, or the last one we've applied? And how do rollbacks work if the "fallback" index can be greater than the "current" index?

@freeekanayaka
Contributor

One thing that comes to mind is that it's probably fine to update commit_index eagerly; however, we should not apply entries that are not yet persisted locally (either FSM entries or config entries).

Worth noting that leaders already do this:

/* A leader with slow disk commits an entry that it hasn't persisted yet,
* because enough followers to have a majority have aknowledged that they have
* appended the entry. The leader's last_stored field hence lags behind its
* commit_index. A new leader gets elected, with a higher commit index and sends
* first a new entry than a heartbeat to the old leader, that needs to update
* its commit_index taking into account its lagging last_stored. */
TEST(replication, lastStoredLaggingBehindCommitIndex, setUp, tearDown, 0, NULL)

That's part of what convinced me that it would likewise be okay for followers to apply entries before persisting them. But it also makes sense to me as a conservative choice to not apply anything until it has been persisted locally (like you say, treating the disk as the source of truth).

Right, exactly. Leaders already advance their commit index when there is a quorum of followers, even if they didn't finish persisting those entries themselves yet. Followers should do the same. However, neither of them should apply anything (FSM or config) until it is persisted locally. So last_applied <= commit_index.

I believe this would make the situation more symmetric between followers and leaders, letting commit_index be updated eagerly, as it should be.

@freeekanayaka
Contributor

I'd say configuration_committed_index and configuration_uncommitted_index maintain the same meaning.

Should configuration_uncommitted_index mean the index of the last configuration we have in our log, or the last one we've applied?

The last one we've applied, i.e. the last one we have in our on-disk log. The last one we have in our in-memory log is basically "non-existent": in that case the in-memory log is just a buffer or a queue waiting to be processed. The same is true for FSM entries (e.g. entries in the in-memory log that are still being persisted to disk).

And how do rollbacks work if the "fallback" index can be greater than the "current" index?

What do you mean?

All the invariants described in this docstring should remain true.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

Oh, so you're saying that configuration_committed_index would not be updated to point to committed entries that are only in our in-memory log, but not stored on disk? That was the point of confusion for me.

@freeekanayaka
Contributor

Oh, so you're saying that configuration_committed_index would not be updated to point to committed entries that are only in our in-memory log, but not stored on disk? That was the point of confusion for me.

Correct. Updating commit_index doesn't mean that we have to update configuration_committed_index too at the same time.

@cole-miller
Contributor Author

cole-miller commented Aug 8, 2023

Okay, thanks for clearing that up. I disagree slightly that there's no need to update the docstring; I think this paragraph would need changing:

At all times configuration_committed_index is either zero or is the index of the most recent log entry of type RAFT_CHANGE that we know to be committed.

Part of my confusion was because the code currently does update configuration_committed_index in tandem with commit_index, and it wasn't clear to me that you were proposing to change that.

@freeekanayaka
Contributor

Okay, thanks for clearing that up. I disagree slightly that there's no need to update the docstring; I think this paragraph would need changing:

At all times configuration_committed_index is either zero or is the index of the most recent log entry of type RAFT_CHANGE that we know to be committed.

True, that would need a tweak.

Part of my confusion was because the code currently does update configuration_committed_index in tandem with commit_index, and it wasn't clear to me that you were proposing to change that.

Yeah, I didn't quite look at the code and I don't recall all the details of the current code; I was just trying to get a sound plan first. We could update #465 (comment) with all the additional insights from these last comments.

@cole-miller
Contributor Author

I'm still not 100% convinced that there are no problems with allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, and would prefer the more conservative approach of delaying the follower->candidate transition. I will try to think of an example where this causes trouble, or report if I can't do it :)

@freeekanayaka
Contributor

I'm still not 100% convinced that there are no problems with allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, and would prefer the more conservative approach of delaying the follower->candidate transition. I will try to think of an example where this causes trouble, or report if I can't do it :)

FWIW candidates would not apply config (or FSM) entries, since we'd wait for writes to settle before converting to candidate. So, yes, we would delay the follower->candidate transition. Everything we said in the last comments is about updating commit_index eagerly, which is an unrelated change that can be done separately (if at all).

@cole-miller
Contributor Author

FWIW candidates would not apply config (or FSM) entries, since we'd wait for writes to settle before converting to candidate. So, yes, we would delay the follower->candidate transition.

Oh, I mistook your position on this! Glad we're on the same page after all.

@freeekanayaka
Contributor

One situation that could happen is:

  • Server S is operating with configuration C_old, where S is a stand-by.
  • Server S receives configuration C_new, where S is a voter.
  • Server S immediately starts operating with C_new, even if the associated entry is not yet persisted
  • Server S receives a RequestVote request and grants its vote
  • Server S crashes before the configuration entry is persisted
  • Server S restarts. It finds it has granted a vote, although it operates with C_old in which it is not a voter

Practically speaking this particular scenario is probably going to be fine and not create any issue, but I just wanted to illustrate subtle behavior that might happen. There might be other similar scenarios which are actually problematic, I don't know, I haven't spent time analyzing them.

I had thought about this scenario even when I approved this PR (and when I proposed this approach in the first place), but I considered it ok. It probably is ok, however maybe we want to look at this type of issues from a broader perspective, see the next comment.

Actually I believe this scenario is kind of normal and expected, regardless of whether we apply config changes eagerly or not.

Consider the following scenario:

  • Server S is operating with configuration C_cur, where S is a voter.
  • Server S receives a RequestVote request and grants its vote for term T.
  • During term T, server S receives configuration C_new, where S is a stand-by.
  • Server S persists C_new and then crashes.
  • Server S restarts. It finds it has granted a vote for its current term T, although it operates with C_new in which it is not a voter.

This is to say that it should be normal for a server to restart and find a mismatch between its vote and the configuration it operates with.

It's also kind of similar to the situation where server S is the leader, but it's not part of the latest configuration. Quoting Section 4.2.2 of the Raft dissertation:

This approach leads to two implications about decision-making that are not particularly harmful
but may be surprising. First, there will be a period of time (while it is committing Cnew) when a
leader can manage a cluster that does not include itself; it replicates log entries but does not count
itself in majorities. Second, a server that is not part of its own latest configuration should still start
new elections, as it might still be needed until the Cnew entry is committed (as in Figure 4.6). It does
not count its own vote in elections unless it is part of its latest configuration

This is to say that, as pointed out in this comment, the argument described in Figure 4.3 of the dissertation means that in terms of safety the only thing that should count is that there must be at most one pending configuration change. All sorts of weird situations might arise, but they should be fine (including allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, as pointed out by @cole-miller here, although of course that would not be possible if we wait for writes to complete before converting to candidate).

I'm going to update the summary of the conversation with the insights we got so far.

@freeekanayaka
Contributor

freeekanayaka commented Aug 9, 2023

Updated analysis:

Election

  • When receiving a RequestVote message we always compare to the in-memory log, not to the on-disk log, which might be a few entries behind.

    Benefit: we don't let the requesting server win the election when it is actually slightly behind us, and we potentially avoid losing a pending transaction and hence requiring the client to re-submit it

    Safety: if we crash before persisting the in-flight entries or install-snapshot and then restart, we will effectively have a shorter log, and might start to grant our vote to the same candidate that we initially rejected, but that's fine.

    Proposal: nothing in particular needs to change.

  • When sending a RequestVote message, we obviously always use information from the on-disk log. Currently we convert to candidate and send RequestVote messages right away when the election timer expires, and we don't wait for pending writes to finish (writes could be new entries, or a snapshot being installed), so the on-disk log might be behind and the candidate has slightly lower chances of winning the election.

    Proposal: wait until pending entries or install-snapshot are finished, then convert to candidate and start the election.

    Benefit: the in-memory log will match the on-disk log for the whole duration of the candidate state, the candidate should have a better chance of winning the election, and there is less risk of losing a transaction and forcing the client to retry it.

    Safety: no particular edge case should happen that we need to worry about.

    Question: should we also wait for take-snapshot operations to be finished? probably not, since technically that doesn't modify the length of the on-disk log, but we need to think if that leads to any weird edge case.

    Alternative: keep things as they are now, but if we finish persisting entries during the candidate state, we should apply any pending configuration change that was contained in the entries that have just been persisted (currently we just ignore the new configuration, see the code here, which is the cause of recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed #386). Update: This should be safe and fix the bug at hand; however, it feels less palatable than waiting for writes to complete, which has the additional benefits described above.

Replication:

  • When receiving an AppendEntries/InstallSnapshot message, always use the information from the in-memory log.

    Benefit: the in-memory log accurately reflects what we have received so far, so when replying to an AppendEntries message we should use it to filter out duplicate entries or to inform the leader of where we are.

    Safety: if the server crashes before persisting in-flight entries or install-snapshot and then restarts, it will effectively have a shorter log, but that's fine as the leader will cope with that.

    Proposal: nothing in particular needs to change.

  • When sending an AppendEntries/InstallSnapshot message, it's not relevant whether the leader has pending entries not yet persisted to disk.

    Proposal: nothing in particular needs to change.

    Question: when stepping down from leader and becoming follower, should we wait for disk writes to settle? I believe that's not required in principle, but might need a bit more investigation.

  • When sending an AppendEntries/InstallSnapshot result, we obviously need to use the information from the on-disk log.

    Proposal: nothing in particular needs to change.

Commitment:

  • On followers we currently look at the on-disk log (last_stored) when updating the commit index and deciding whether to apply new entries to the FSM.

    Proposal: apply entries to the FSM as soon as we see the new commit index in the AppendEntries message handler.

    Question: as shown by this PR, that leads to a bit more complication, because we then need to take snapshots only when the associated entries have been persisted. So it might not be a net win, needs more investigation. Update: it seems that taking snapshots should actually be fine (although I didn't quite understand the whole story here). It also seems that situations like the one outlined here are actually fine, as per this argument.

Conclusion

  • In order to solve this particular bug, it seems that waiting for writes to settle before converting from follower to candidate is the best option, although there are still a few aspects to be cleared up (like whether we should wait for take-snapshot writes to complete).

  • Applying config entries eagerly, as well as updating commit_index eagerly and applying entries (either FSM or config) without waiting for them to be persisted, also seems fine and desirable, but might need a bit more investigation, and can be done separately as it's not strictly necessary to fix the bug at hand.

@cole-miller
Contributor Author

Thanks @freeekanayaka, that summary/plan of action looks good to me.

Question: when stepping down from leader and becoming follower, should we wait for disk writes to settle? I believe that's not required in principle, but might need a bit more investigation.

Ah, this is a tricky one, especially because leader->follower can happen for more than one reason. Let me try to write some tests that exercise those codepaths, just to suss out any easily-triggered misbehavior.

Question: should we also wait for take-snapshot operations to be finished? probably not, since technically that doesn't modify the length of the on-disk log, but we need to think if that leads to any weird edge case.

I think I'd support waiting in this case; I can't think of a specific bug that could be caused by not waiting, but it makes the whole system easier to reason about if this operation can't span the follower->candidate (and candidate->leader) boundary.

Successfully merging this pull request may close these issues.

recvAppendEntries: Assertion r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE failed
3 participants