Update commit index and apply configs eagerly #465
Conversation
Force-pushed from 5c69472 to 2f29521
Force-pushed from 2f29521 to a8eb490
please test downstream
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #465      +/-   ##
==========================================
- Coverage   76.72%   76.63%   -0.09%
==========================================
  Files          51       51
  Lines        9686     9696      +10
  Branches     2476     2480       +4
==========================================
- Hits         7432     7431       -1
- Misses       1088     1094       +6
- Partials     1166     1171       +5
```
Clean set of downstream checks: https://github.com/canonical/raft/actions/runs/5766410935
Nice work, looks good to me, just a nit and a clarification request.
```c
if (rv != 0) {
    goto err_after_acquire_entries;
}
/* Don't try to take a snapshot while holding onto log entries. */
```
Can you expand this comment a bit? It's not entirely clear to me what it means, and what would instead happen if we did call replicationMaybeTakeSnapshot() here (and conversely why it's safe to call replicationMaybeTakeSnapshot() above, in the (n == 0) branch for empty AppendEntries).
Yes, will do (if we don't end up deciding that this is the wrong approach...)
After some digging, the specific problem I was seeing is a bad interaction with the raft I/O fixture. If the takeSnapshot in replicationAppend fails, the in-memory log entries from the append request will be un-referenced, but the I/O fixture will still have the append request in its queue and will eventually try to flush it, triggering a use-after-free. The right fix here is to make the fixture smarter (thanks for prompting me to work this out). The only reason we weren't seeing this before is that the shouldTakeSnapshot check is based on the last_applied index, and previously replicationAppend didn't update that index, so no snapshot would ever be triggered.
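To make the bug class concrete, here is a minimal, self-contained C sketch of the pattern described above. All names here are hypothetical; this is not the actual raft or fixture code, just an illustration of a queue holding a borrowed pointer that an error path frees before the eventual flush:

```c
#include <stdio.h>
#include <stdlib.h>

struct entry {
    char *buf;
};

struct io_queue {
    struct entry *pending; /* borrowed pointer: the queue does not own it */
};

static void io_submit(struct io_queue *q, struct entry *e)
{
    q->pending = e; /* the fixture keeps the append request in its queue */
}

static void io_flush(struct io_queue *q)
{
    if (q->pending != NULL) {
        /* If the entries were freed in the meantime, this dereferences
         * dangling memory: the use-after-free described above. */
        printf("flushing: %s\n", q->pending->buf);
        q->pending = NULL;
    }
}

int main(void)
{
    struct io_queue queue = {NULL};
    struct entry *e = malloc(sizeof *e);
    if (e == NULL) {
        return 1;
    }
    e->buf = "some entry";
    io_submit(&queue, e); /* request is now queued by the fixture */
    free(e);              /* error path releases the entries early */
    io_flush(&queue);     /* eventual flush touches freed memory */
    return 0;
}
```

Making the fixture "smarter", in these terms, means dropping or completing the queued request before the entries it points at can be freed.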
```c
            goto err_after_acquire_entries;
        }
    }
}
```
I think it would be a tad simpler and more explicit to call membershipUncommittedChange() in the for loop above, right after the logAppend() call here. That will save a bit of code (since the two loops should be the same) and make it clearer that configuration changes are applied immediately upon appending them to the in-memory log.
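A self-contained sketch of that shape, with hypothetical names and simplified types (not the actual replicationAppend internals): configuration-change entries are handed to the membership logic right after the append, inside the single loop:

```c
#include <stdio.h>

enum entry_type { ENTRY_COMMAND, ENTRY_CONFIG_CHANGE };

struct entry {
    enum entry_type type;
    int payload;
};

/* Stand-ins for logAppend() and membershipUncommittedChange(). */
static int log_append(const struct entry *e)
{
    printf("appended entry %d\n", e->payload);
    return 0; /* 0 on success */
}

static int apply_uncommitted_config(const struct entry *e)
{
    printf("operating with uncommitted config %d\n", e->payload);
    return 0;
}

int main(void)
{
    struct entry batch[] = {{ENTRY_COMMAND, 1}, {ENTRY_CONFIG_CHANGE, 2}};
    size_t n = sizeof(batch) / sizeof(batch[0]);
    for (size_t i = 0; i < n; i++) {
        /* Append, then apply a config change immediately, in one loop. */
        if (log_append(&batch[i]) != 0) {
            return 1; /* real code would goto the error path */
        }
        if (batch[i].type == ENTRY_CONFIG_CHANGE &&
            apply_uncommitted_config(&batch[i]) != 0) {
            return 1;
        }
    }
    return 0;
}
```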
I agree that this is a better design. The reason I had this loop further down originally is that I wanted to get all the potential failure points out of the way before applying the new config(s); otherwise there's a question of what to do if (for example) we apply the configs and then the raft_io.append call fails. membershipRollback is not equipped to handle this situation, but I can add a little bit of new code to save the old (current, committed) configuration pair before the append loop and restore it if something goes wrong.
I didn't look much into what type of failures could happen in this specific spot, but as a general rule I'd say we have two types of failures in raft:

- Failures that make it totally impossible to proceed (for example failing to allocate memory that we absolutely need): in this case we should basically stop the line and transition to RAFT_UNAVAILABLE (user code will typically abort the process).
- Failures that can be retried (that's typically I/O): in this case we should have some mechanism to retry (for example if the disk is currently full, but eventually the operator frees some of it). We currently handle that fine for network I/O, but not so much for disk I/O, and it's an area that should be improved.

In both cases, I don't think there's any need to restore the old (current, committed) configuration: we should continue operating with the uncommitted one and deal with I/O errors by retrying.

Note, though, that I'm not totally convinced anymore that it's a good idea to immediately apply the new configuration at all before persisting it. See the example scenario here and the follow-up notes.
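A tiny self-contained sketch of that failure taxonomy (hypothetical error names; the real library's error codes and states differ): fatal failures stop the line, transient ones get retried:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical error taxonomy mirroring the two categories above. */
enum err { ERR_NONE, ERR_NOMEM, ERR_DISK_FULL, ERR_NET_TIMEOUT };

static bool is_fatal(enum err e)
{
    switch (e) {
    case ERR_NOMEM:
        return true;  /* cannot proceed: stop the line */
    case ERR_DISK_FULL:
    case ERR_NET_TIMEOUT:
        return false; /* transient: retry, e.g. once disk space is freed */
    default:
        return false;
    }
}

int main(void)
{
    enum err e = ERR_DISK_FULL;
    if (is_fatal(e)) {
        printf("transition to unavailable\n"); /* user code likely aborts */
    } else {
        printf("schedule a retry\n");
    }
    return 0;
}
```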
I'd probably also prefer a couple more clean Jepsen runs.

please test downstream
I thought that too, and that's the reason why they are currently combined. This is probably not bad per se, but it could be a tiny smell (see also my other comment).
Second clean downstream run: https://github.com/canonical/raft/actions/runs/5782948278

please test downstream
One situation that could happen is:
Practically speaking this particular scenario is probably going to be fine and not create any issue, but I just wanted to illustrate subtle behavior that might happen. There might be other similar scenarios which are actually problematic; I don't know, I haven't spent time analyzing them. I had thought about this scenario even when I approved this PR (and when I proposed this approach in the first place), but I considered it ok. It probably is ok, however maybe we want to look at this type of issue from a broader perspective, see the next comment.
I'd like to expand a bit on @MathieuBordere's comment #464 (comment) about waiting for the entries to persist before converting to candidate.

There is currently a bit of "tension" between the in-memory log and the on-disk log. The current approach is probably slightly more efficient (avoiding unnecessary lags in some cases), but it's also more tricky, as it requires thinking carefully about subtle situations.

Maybe we should consider making the on-disk log a hard "single source of truth", and treat the in-memory cache purely as an optimization to be used in a couple of cases, but that in no situation gets treated as "current state". When operating as leader, the in-memory cache would be used only to fetch entries that need to be sent to followers. When operating as follower, the in-memory cache would be used only as a "buffer" for incoming entries. When operating as candidate, we don't receive or send any entry, so the in-memory cache is not used (see below). For any actual "decision" that the engine takes, the "current state" to refer to would always be the on-disk one, never the in-memory one, which becomes purely an implementation detail.

On top of that, we could also say that all state transitions (not only follower->candidate, all of them) must occur only when disk I/O has completed, and so the "current state" has become stable. Such an approach would probably lead to slightly increased latency in some cases, and possibly to a bit of extra complexity in a few spots (e.g. we wouldn't be able to transition state immediately when the election timer expires). However it might also lead to some reduced overall complexity and, more importantly, make it easier to reason around corner cases, or eliminate some of them altogether.

There are a few important details about such an approach that are still not clear in my mind (and which might well end up making this approach not that appealing after all), but I wanted to get some feedback first.
Yeah, the async design makes it easy to get into "inconsistent state" situations like this one. The question of where/whether to send an AppendEntriesResult message in a different term than the original request (see #421) has some of the same characteristics: there's a potentially long-running operation, we want to take some action at the end of this operation, but many assumptions about the shared state of the raft server can be invalidated in the intervening time. And so far we've been tackling these issues one-by-one instead of thinking globally about what bits of state are allowed to change when, and what things can change between the beginning of an async operation and its completion. I think it would be great to have a general plan for this!

On this specific issue, I would be just fine with returning to the approach from #464, which sounds like what @freeekanayaka is proposing as part of a broader look at how state transitions are handled.

Right, the important thing is to have a guideline that makes reasoning and decisions easier.

Before choosing one approach or the other, I'd suggest understanding the details of both first. If there's a feeling that using the on-disk state as single source of truth might be a good idea, we can explore that a bit more.
I've tried to put together a slightly more detailed analysis of the current situation and of the "wait for disk writes to settle before converting to candidate" approach. Hope it helps.

Election:
Replication:
Commitment:
@freeekanayaka Thanks for writing this! A couple of responses:

About the handling of config changes: I feel pretty negatively about the alternative you list of allowing candidates to change configuration during their election; I think it makes it harder to have a good mental model of how elections work, and I am not confident in being able to implement it without weird edge cases. Also, if we neither apply configs eagerly nor wait for them to be applied before becoming a candidate, we can potentially change our config while we are a leader, which seems similarly fraught.

I also wanted to point out that a conflict arises if we try to do both of (1) updating the commit index eagerly and (2) waiting for configs to be persisted before applying them (as uncommitted) -- because then we can potentially have a committed config that hasn't even been applied as uncommitted yet. I think this would cause issues if implemented naively; it pretty much breaks our model, with a "current" config index that's always ahead of a fallback "last committed" config index. Maybe this model could be rethought; I'd need to ponder some more what a suitable replacement would be.

As for the question about snapshotting in replicationAppend, it seems like the problem I ran into there is in the raft I/O fixture (though I'm still working on making sure I have a clear picture of how snapshotting interacts with the eager commit index update) -- see my comment above. Looking beyond that specific bug, I think I do need to put more thought into how snapshotting interacts with the eager commit index update, because it becomes possible for us to kick off a snapshot that includes entries we haven't persisted yet. I was hoping to avoid that complication by moving the snapshot step to the append callback, but that doesn't really work, because we could have a sequence of events like:
So the "eager commit" change needs a careful look at snapshotting regardless of where we put the takeSnapshot call. |
Do you mean "while we are candidate"? It would certainly feel a bit uncomfortable, even if it turns out to be correct.
Didn't think much about it, but don't we already have cases like that? For example if you install a snapshot, the configuration in that snapshot would become operational immediately without having been uncommitted (a configuration contained in a snapshot is by definition committed).
Ok, that's a good data point, thanks. I haven't thought much about snapshots yet either, but I'd like to.
No, I meant while we're a leader, since we can potentially convert to candidate and win an election all while waiting for the entries to be persisted. Or am I missing something?

Ah ok. But we change configuration as leaders anyway, when accepting new configurations. There would be virtually no difference IIUC.
It should be safe to change the operating configuration at any time, as long as you have at most one uncommitted configuration.
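A toy model of that invariant (hypothetical names, not the raft API): a new configuration change is accepted only while no other change is still uncommitted:

```c
#include <stdio.h>

struct config_state {
    unsigned long committed_index;   /* last committed config entry */
    unsigned long uncommitted_index; /* pending config entry, 0 = none */
};

/* Accept a new configuration change only if none is already pending. */
static int submit_config_change(struct config_state *s, unsigned long index)
{
    if (s->uncommitted_index != 0) {
        return -1; /* at most one pending change at a time */
    }
    s->uncommitted_index = index; /* start operating with the new config */
    return 0;
}

int main(void)
{
    struct config_state s = {5, 0};
    printf("first change: %d\n", submit_config_change(&s, 7));  /* 0 */
    printf("second change: %d\n", submit_config_change(&s, 8)); /* -1 */
    return 0;
}
```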
In the case of installing a snapshot we update both config indices at once to the same value (and apply the corresponding config), so they stay coupled/in sync, and in particular the configuration_committed_index doesn't run past the configuration_uncommitted_index. This happens discontinuously, outside the normal flow by which the config indices are updated and the current config is changed, and in a context where we have total control over all those bits of state.

In other places we update just one of these indices, and we have to decide between (A) not allowing configuration_committed_index to run ahead of configuration_uncommitted_index, breaking the property that configuration_committed_index is the index of the last entry we have locally that's known to be committed, or (B) allowing it to run ahead. In either case the current names for these indices cease to make sense, unfortunately.
I'm not entirely sure I see what you mean. Can you make an example of a sequence of events that would lead to a situation that violates expectations? (Either formal Raft expectations, or internal libraft expectations.)
Perhaps something like:
Then:
The main thing that concerns me here is that this is quite similar to the "apply configuration changes immediately" approach, in that we can end up operating with a configuration that is not persisted locally, and so do things like vote for somebody, crash, restart, and find out that we're not voters.
One thing that comes to my mind would be that it's probably fine to update commit_index eagerly. In other words:
EDIT: actually
I guess I don't have a specific scenario in mind, it's just that I'm not sure what the rules for the configuration indices would be under this scheme.
Worth noting that leaders already do this: see raft/test/integration/test_replication.c, lines 1135 to 1141 (at 0ea24be).
That's part of what convinced me that it would likewise be okay for followers to apply entries before persisting them. But it also makes sense to me as a conservative choice to not apply anything until it has been persisted locally (like you say, treating the disk as the source of truth).
I'd say
Should configuration_uncommitted_index mean the index of the last configuration we have in our log, or the last one we've applied? And how do rollbacks work if the "fallback" index can be greater than the "current" index?
Right, exactly. Leaders already advance their commit index when there is a quorum of followers, even if they didn't finish persisting those entries themselves yet. Followers should do the same. However, neither of them should apply anything (FSM or config) until it is persisted locally. So I believe this would make the situation more symmetric between followers and leaders.
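A toy model of that rule (hypothetical names and simplified types, not the raft internals): commit_index may run ahead freely, while applying is bounded by what has been persisted locally:

```c
#include <stdio.h>

struct server {
    unsigned long commit_index;   /* known committed by the cluster */
    unsigned long last_persisted; /* highest index fsync'd locally */
    unsigned long last_applied;   /* highest index applied (FSM/config) */
};

/* Apply only entries that are both committed and persisted locally. */
static void maybe_apply(struct server *s)
{
    unsigned long bound = s->commit_index < s->last_persisted
                              ? s->commit_index
                              : s->last_persisted;
    while (s->last_applied < bound) {
        s->last_applied++;
        printf("applying entry %lu\n", s->last_applied);
    }
}

int main(void)
{
    struct server s = {0, 0, 0};
    s.commit_index = 5;   /* the cluster says 5 is committed... */
    s.last_persisted = 3; /* ...but only 3 entries are on disk */
    maybe_apply(&s);      /* applies 1..3 */
    s.last_persisted = 5; /* the disk write completes */
    maybe_apply(&s);      /* applies 4..5 */
    return 0;
}
```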
The last we've applied. I.e. the last that we have in our disk-log. The last we have in our in-memory log is basically "non existing", i.e. in that case the memory log is just a buffer or a queue waiting to be processed. The same is true for FSM entries (e.g. entries in the in-memory log that are still being persisted to disk).
What do you mean? All the invariants described in this docstring should remain true.

Oh, so you're saying that configuration_committed_index would not be updated to point to committed entries that are only in our in-memory log, but not stored on disk? That was the point of confusion for me.
Correct. Updating configuration_committed_index would happen only once the entry is stored on disk.
Okay, thanks for clearing that up. I disagree slightly that there's no need to update the docstring; I think this paragraph would need changing:
Part of my confusion was because the code currently does update configuration_committed_index in tandem with commit_index, and it wasn't clear to me that you were proposing to change that.
True, that would need a tweak.

Yeah, I didn't quite look at the code and I don't recall all the details of the current code; I was just trying to get a sound plan first. We could update #465 (comment) with all the additional insights from these last comments.
I'm still not 100% convinced that there are no problems with allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, and would prefer the more conservative approach of delaying the follower->candidate transition. I will try to think of an example where this causes trouble, or report if I can't do it :)
FWIW candidates would not apply config (or FSM) entries, since we'd wait for writes to settle before converting to candidate. So, yes, we would delay the follower->candidate transition. Everything we said in the last comments is about updating commit_index and configuration_committed_index.
Oh, I mistook your position on this! Glad we're on the same page after all.
Actually I believe this scenario is kind of normal and expected, regardless of whether we apply config changes eagerly or not. Consider the following scenario:
This is to say that it should be normal for a server to restart and find a mismatch between its vote and the configuration it operates. It's also kind of similar to the situation where server

This approach leads to two implications about decision-making that are not particularly harmful

This is to say that, as pointed out in this comment, the argument described in Figure 4.3 of the dissertation means that in terms of safety the only thing that should count is that there must be at most one pending configuration change. All sorts of weird situations might arise, but they should be fine (including allowing candidates and leaders to apply config change entries for which the append process started back when they were followers, as pointed out by @cole-miller here, although of course that would not be possible if we wait for writes to complete before converting to candidate).

I'm going to update the summary of the conversation with the insights we got so far.
Updated analysis:

Election:
Replication:
Commitment:
Conclusion
Thanks @freeekanayaka, that summary/plan of action looks good to me.
Ah, this is a tricky one, especially because leader->follower can happen for more than one reason. Let me try to write some tests that exercise those codepaths, just to suss out any easily-triggered misbehavior.
I think I'd support waiting in this case; I can't think of a specific bug that could be caused by not waiting, but it makes the whole system easier to reason about if this operation can't span the follower->candidate (and candidate->leader) boundary.
Fixes #386, using the approach suggested by @freeekanayaka.
Signed-off-by: Cole Miller <cole.miller@canonical.com>