Does etcd RAFT guarantee durability? #12589
Comments
3.3. Gets msgAppResp from N2 & N3 confirming that N2 & N3 stored the log in the stable storage.
The follower stores ready.Entries into the WAL (mustSync: true) and only then sends MsgAppResp to the leader.
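The persist-then-send ordering described in this comment can be sketched as follows. This is a minimal, self-contained illustration and not etcd's actual code: the `Entry`, `Message`, `Ready`, and `WAL` types and the `HandleReady` function are hypothetical stand-ins, and the "sync" is only simulated in memory.

```go
package main

import "fmt"

// Entry is a simplified raft log entry (hypothetical, not etcd's raftpb.Entry).
type Entry struct {
	Term, Index uint64
	Data        string
}

// Message is a simplified outgoing raft message (hypothetical).
type Message struct {
	Type  string
	To    uint64
	Index uint64
}

// Ready bundles the entries that must be persisted and the messages
// that may only be sent once those entries are durable (hypothetical).
type Ready struct {
	Entries  []Entry
	Messages []Message
	MustSync bool
}

// WAL is an in-memory stand-in for an append-only write-ahead log.
type WAL struct {
	entries []Entry
	synced  int // number of entries covered by the last simulated sync
}

// Save appends entries; a real WAL would fsync to disk when mustSync is true.
func (w *WAL) Save(ents []Entry, mustSync bool) {
	w.entries = append(w.entries, ents...)
	if mustSync {
		w.synced = len(w.entries)
	}
}

// HandleReady persists the entries first and only then hands back the
// messages (for example MsgAppResp) that are now safe to send.
func HandleReady(w *WAL, rd Ready) []Message {
	w.Save(rd.Entries, rd.MustSync)
	return rd.Messages
}

func main() {
	w := &WAL{}
	rd := Ready{
		Entries:  []Entry{{Term: 1, Index: 1, Data: "put k v"}},
		Messages: []Message{{Type: "MsgAppResp", To: 1, Index: 1}},
		MustSync: true,
	}
	for _, m := range HandleReady(w, rd) {
		fmt.Printf("send %s for index %d after %d entries synced\n", m.Type, m.Index, w.synced)
	}
}
```

The point of the sketch is only the ordering: the acknowledgement leaves the follower after the entries it acknowledges have been saved with a sync.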
@tangcong Line 1381 in a4570a6
Could you please let me know how etcd/RAFT handles overwrites of the WAL log, i.e. situations like those in Figure 7 of the RAFT paper?
If a follower's WAL log is in conflict with the leader, the follower sets the reject field of MsgAppResp to true and returns a RejectHint to the leader. The leader then sends log entries starting from the reject index to the follower, and the follower stores the new entries in the WAL (append only). The WAL will then hold two different values for the same raft log index, but when etcd replays the WAL, the entry written later overwrites the earlier one, so there is no problem. @ptabor
Good. It will help everyone better understand etcd durability. @ptabor
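The replay behaviour described above (a later record for the same index wins) can be sketched like this. It is a simplified illustration, not etcd's WAL code: the `Entry` type and the `Replay` function are hypothetical, and the sketch assumes the records carry contiguous indexes apart from the overwrites.

```go
package main

import "fmt"

// Entry is a simplified raft log entry (hypothetical, not etcd's raftpb.Entry).
type Entry struct {
	Term, Index uint64
	Data        string
}

// Replay rebuilds the in-memory log from append-only records.
// When a record's index is already present, the existing entry at that
// index and everything after it are dropped before the record is appended,
// so the record written later wins.
func Replay(records []Entry) []Entry {
	var log []Entry
	for _, e := range records {
		if len(log) > 0 {
			first := log[0].Index
			last := log[len(log)-1].Index
			switch {
			case e.Index < first:
				log = log[:0] // record precedes everything replayed so far
			case e.Index <= last:
				log = log[:e.Index-first] // truncate the conflicting suffix
			}
		}
		log = append(log, e)
	}
	return log
}

func main() {
	// Indexes 2 and 3 were first written in term 1, then rewritten in term 2
	// after the follower's log conflicted with the leader's.
	records := []Entry{
		{Term: 1, Index: 1, Data: "a"},
		{Term: 1, Index: 2, Data: "b"},
		{Term: 1, Index: 3, Data: "c"},
		{Term: 2, Index: 2, Data: "x"},
		{Term: 2, Index: 3, Data: "y"},
	}
	for _, e := range Replay(records) {
		fmt.Printf("index=%d term=%d data=%s\n", e.Index, e.Term, e.Data)
	}
}
```

On disk nothing is rewritten; the conflict is resolved only when the append-only records are read back in order.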
Please take a look at this document: I will translate it to markdown as soon as it's 'finalized'.
Wow, it is very detailed, awesome job. Thank you. @ptabor
The change makes it explicit that sending messages does not happen immediately and is subject to a proper persist-and-then-send protocol on the application side. See: etcd-io#12589 (comment) for more context.
The PR with comment changes is ready: #12588. Closing the issue.
Nice.
I wonder whether an etcd cluster guarantees full distributed durability, i.e. whether data that was submitted successfully to one node (e.g. by a Put operation), and for which the user got a successful response, is guaranteed not to be lost as long as a quorum of the servers survives or recovers from persistent storage.
I thought that the answer was obviously yes, but after a deep dive into the code I'm having doubts, and I hope I'm missing some important piece.
Scenario
3.1 Writes the transaction to its own unstable storage & stable storage.
3.2 Sends msgApp to N2 & N3.
3.3 Gets msgAppResp from N2 & N3 confirming that N2 & N3 stored the log in the unstable storage (etcd/raft/log.go, line 30 in a4570a6).
3.4 Applies the change into the mvcc/bbolt
3.5 Returns success to the user
Code-wise
The source of the problem is that in step 3.3 we consider an entry 'committed' when it has only been stored in the 'unstable' storage and not yet flushed to the hard drive in the WAL log.
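For context, the quorum rule that decides when an entry counts as committed can be sketched as below. This is a minimal illustration, not the code in etcd/raft/tracker: `CommittedIndex` is a hypothetical helper that takes the highest log index each node has acknowledged. Whether that rule gives durability depends on what an acknowledgement means, which is exactly the question raised here.

```go
package main

import (
	"fmt"
	"sort"
)

// CommittedIndex returns the largest log index that a majority of nodes
// has acknowledged. match holds one acknowledged index per node, the
// leader included (hypothetical helper, not etcd's tracker API).
func CommittedIndex(match []uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), match...)
	// Sort descending so that position quorum-1 holds the highest index
	// that at least quorum nodes have reached.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// N1 (leader) is at index 5, N2 acknowledged 3, N3 acknowledged 4.
	fmt.Println(CommittedIndex([]uint64{5, 3, 4}))
}
```

If an acknowledgement is sent only after the follower has synced the entry to its WAL, the returned index is backed by persistent copies on a majority of nodes.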
Please look at the comments in the commit ptabor@f80d5cd, created only to illustrate the code that is crucial to the problem.
In particular it shows that:
etcd/raft/raft.go, line 1381 in a4570a6
etcd/raft/log.go, line 101 in a4570a6
etcd/raft/raft.go, line 1141 in a4570a6
etcd/raft/tracker/tracker.go, line 178 in a4570a6
Handling of conflicts in the RAFT log
According to the RAFT protocol, the log is assumed to be 'persistent', but parts of it might need to be overwritten. See e.g. this explanation: https://youtu.be/vYp4LYbnnW8?t=2724 (or Figure 7 of the RAFT paper) of when such an 'out of sync' state can happen and how it should be reconciled.
As I saw that the etcd WAL implementation is append-only, I started to wonder how the 'overwrites' are handled. What I discovered is that there is an additional 'unstable_storage' log that allows for overwrites. But I suspect:
Please let me know if I'm missing any protection mechanisms against both the problems described above: