
Does etcd RAFT guarantee durability? #12589

Closed
ptabor opened this issue Dec 30, 2020 · 10 comments


@ptabor
Contributor

ptabor commented Dec 30, 2020

I wonder whether an etcd cluster guarantees full distributed durability, i.e. whether data that was submitted successfully to one node (e.g. a Put operation) and for which the user got a successful response is guaranteed not to be lost as long as a quorum of the servers survives / recovers from persistent storage.

I thought the answer was obviously yes, but after a deep dive into the code I'm having doubts, and I hope I'm missing some important piece.

Scenario

  1. A 3-node cluster (N1, N2, N3) has initial state S1. N1 is the leader.
  2. The user submits a transaction T1 on the leader.
  3. The leader:
    3.1 Writes the transaction to its own unstable storage & stable storage.
    3.2 Sends msgApp to N2 & N3.
    3.3 Gets msgAppResp from N2 & N3 confirming that N2 & N3 stored the log entry in their unstable storage, but NOT in the WAL.
    3.4 Applies the change to mvcc/bbolt.
    3.5 Returns success to the user.
  4. There is a 'power shutdown' affecting all 3 etcd nodes concurrently.
  5. Nodes N2 & N3 come back. [Let's assume N1 stays unavailable.]
  6. N2 & N3 form the new majority, and N2 is elected as the leader.
  7. The WAL of N2 & N3 does NOT contain T1, so T1 was in fact never committed.

Code-wise

The source of the problem is that in step 3.3 we consider an entry as 'committed' when it has only been stored in the 'unstable' storage and not yet flushed to the hard drive in the WAL log.

Please look at the comments in the commit ptabor@f80d5cd, created only to illustrate the code crucial to the problem.

In particular it shows that:

  1. msgAppResp is issued just after the write to unstable storage:

    etcd/raft/raft.go

    Line 1381 in a4570a6

    r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
    and it returns as its index the index of the last new entry:
    return lastnewi, true
  2. msgAppResp is used by the leader to update the follower's 'match' index:

    etcd/raft/raft.go

    Line 1141 in a4570a6

    if pr.MaybeUpdate(m.Index) {
  3. the leader uses the match indexes from the progress tracker to decide what got committed (see the sketch after this list):
    return uint64(p.Voters.CommittedIndex(matchAckIndexer(p.Progress)))
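For context, here is a minimal sketch of that quorum computation (illustrative only, assuming a plain single-majority configuration; the real logic lives in the quorum package):

```go
package main

import (
	"fmt"
	"sort"
)

// committedIndex returns the highest log index acknowledged (Match) by a
// quorum of voters, i.e. the index the leader may consider committed in a
// plain single-majority configuration.
func committedIndex(match []uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), match...)
	// Sort descending: sorted[q-1] is the largest index reached by at
	// least q = len/2+1 voters.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Leader's own Match is 9; followers acknowledged 9 and 7:
	// a majority (2 of 3) reached index 9, so 9 counts as committed.
	fmt.Println(committedIndex([]uint64{9, 9, 7})) // 9
	// If the followers only acknowledged up to 7, commit stays at 7.
	fmt.Println(committedIndex([]uint64{9, 7, 7})) // 7
}
```

The question is therefore whether the Match values feeding this computation reflect entries that are durably on disk, or only in unstable storage.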

Handling of conflicts in the RAFT log

According to the RAFT protocol, the log is assumed to be 'persistent', but parts of it might need to be overwritten. See e.g. this piece of explanation: https://youtu.be/vYp4LYbnnW8?t=2724 (or Figure 7 of the RAFT paper) for when such 'out of sync' situations can happen and how they should get reconciled.

As I have seen that the etcd WAL implementation is append-only, I started to wonder how the 'overwrites' are handled. The discovery was that there exists an additional 'unstable' storage log that allows for overrides. But I suspect:

  1. the unstable storage creates a time window in which there is no 'durability' guarantee in RAFT
  2. there might be a problem if the piece of log that needs to be overwritten has already been submitted to stable storage (the WAL)

Please let me know if I'm missing any protection mechanisms against the two problems described above:

  • losing a committed transaction
  • getting an inconsistent log due to the inability to overwrite entries in the WAL
@ptabor
Contributor Author

ptabor commented Dec 30, 2020

@xiang90 @gyuho @jpbetz -> I would appreciate your feedback on the issue above.

@tangcong
Contributor

tangcong commented Dec 31, 2020

Correction to 3.3: the leader gets msgAppResp from N2 & N3 confirming that N2 & N3 stored the log in the stable storage.

14:55:53 etcd1 |  W | {SoftState:<nil> HardState:{Term:0 Vote:0 Commit:0 XXX_unrecognized:[]} ReadStates:[] Entries:[{Term:2 Index:9 Type:EntryNormal Data:[34 6 10 1 97 18 1 98 162 6 10 8 132 196 221 185 212 213 221 180 50] XXX_unrecognized:[]}] Snapshot:{Data:[] Metadata:{ConfState:{Voters:[] Learners:[] VotersOutgoing:[] LearnersNext:[] AutoLeave:false XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} CommittedEntries:[] Messages:[{Type:MsgAppResp To:18249187646912138824 From:9372538179322589801 Term:2 LogTerm:0 Index:9 Entries:[] Commit:0 Snapshot:{Data:[] Metadata:{ConfState:{Voters:[] Learners:[] VotersOutgoing:[] LearnersNext:[] AutoLeave:false XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} Reject:false RejectHint:0 Context:[] XXX_unrecognized:[]}] MustSync:true}

The follower will store ready.Entries into the WAL (MustSync:true), and only then send MsgAppResp to the leader.
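
To make that ordering explicit, here is a simplified sketch of the application-side Ready loop that the raft package documentation prescribes; durableLog, transport and runReadyLoop are illustrative stand-ins, not etcd's actual wal/rafthttp code:

```go
package example

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// durableLog and transport are illustrative stand-ins for etcd's wal and
// rafthttp layers; Save is expected to fsync before returning.
type durableLog interface {
	Save(st raftpb.HardState, ents []raftpb.Entry) error
}

type transport interface {
	Send(msgs []raftpb.Message)
}

// runReadyLoop shows the persist-then-send contract: entries and hard state
// are made durable before any outgoing message (such as MsgAppResp) is handed
// to the transport. Snapshot handling is omitted for brevity.
func runReadyLoop(n raft.Node, log durableLog, tr transport, apply func(raftpb.Entry), done <-chan struct{}) {
	for {
		select {
		case rd := <-n.Ready():
			// 1. Persist HardState and Entries (fsync) first.
			if err := log.Save(rd.HardState, rd.Entries); err != nil {
				panic(err)
			}
			// 2. Only now send messages; a follower's MsgAppResp therefore
			//    acknowledges entries that are already on disk.
			tr.Send(rd.Messages)
			// 3. Apply the entries raft has marked committed.
			for _, ent := range rd.CommittedEntries {
				apply(ent)
			}
			// 4. Signal that this Ready has been fully processed.
			n.Advance()
		case <-done:
			return
		}
	}
}
```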

@ptabor
Contributor Author

ptabor commented Dec 31, 2020

@tangcong
Thank you. I assumed

etcd/raft/raft.go

Line 1381 in a4570a6

r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
is sending the message... but in practice it's only scheduling it to be sent in the same "Ready" processing as the WAL write. My mistake.

Could you please let me know how etcd/raft handles overwrites of the WAL log (i.e. the situations from Figure 7 of the RAFT paper), given that the WAL log is append-only?

@tangcong
Contributor

If a follower's WAL log is in conflict with the leader's, it will set MsgAppResp's reject field to true and return a RejectHint to the leader. The leader will then send log entries starting from the rejected index to the follower, and the follower will store the new entries into its WAL logs (append-only). The WAL logs will then hold two different values for the same raft log index, but when etcd replays the WAL log, the entry written later overwrites the earlier one, so there is no problem. @ptabor
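
A minimal sketch of that replay-overwrite idea (illustrative only, not the actual wal.ReadAll code; entry is a simplified stand-in for raftpb.Entry):

```go
package main

import "fmt"

// entry is a simplified stand-in for a raft log entry inside a WAL record.
type entry struct {
	Index uint64
	Term  uint64
	Data  string
}

// replay folds append-only WAL records into the effective log. When a record
// repeats an index that is already present, the suffix from that index onward
// is dropped and the newer record takes its place, so the last write wins.
func replay(first uint64, records []entry) []entry {
	var ents []entry
	for _, e := range records {
		if e.Index < first {
			continue
		}
		up := e.Index - first // position of this index within ents
		if up < uint64(len(ents)) {
			ents = append(ents[:up], e) // overwrite the conflicting suffix
		} else {
			ents = append(ents, e)
		}
	}
	return ents
}

func main() {
	// Index 9 was first appended at term 2, then rewritten at term 3 after
	// the leader resolved a conflict; replay keeps only the later record.
	recs := []entry{{8, 2, "x"}, {9, 2, "a"}, {9, 3, "b"}, {10, 3, "c"}}
	fmt.Println(replay(8, recs)) // [{8 2 x} {9 3 b} {10 3 c}]
}
```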

@ptabor
Contributor Author

ptabor commented Jan 1, 2021

@tangcong Thank you.

Indeed the slice [:up] makes it an overwrite:

etcd/wal/wal.go

Line 460 in 8a03d2e

ents = append(ents[:up], e)

I will prepare a PR that documents these subtle places.

@tangcong
Contributor

tangcong commented Jan 2, 2021

good. It will help everyone better understand etcd durability. @ptabor

@ptabor
Contributor Author

ptabor commented Jan 4, 2021

Please take a look at this document:

https://docs.google.com/document/d/1O2o1IApHWmSioXG3fez4eVlUHOrXICYGNVIzaqNS0IQ/edit?resourcekey=0-e6Iywgdkol0uiVBAaV1oww#

I will translate it to markdown as soon as it's 'finalized'.

@tangcong
Contributor

tangcong commented Jan 4, 2021

wow, it is very detailed, awesome job. thank you. @ptabor

ptabor added a commit to ptabor/etcd that referenced this issue Jan 5, 2021
The change makes it explicit that sending messages does not happen
immediately and is subject to the proper persist-and-then-send protocol
on the application side. See:

etcd-io#12589 (comment)

for more context.
@ptabor
Contributor Author

ptabor commented Jan 5, 2021

The PR with comment changes is ready: #12588.

Closing the issue.

@ptabor ptabor closed this as completed Jan 5, 2021
@yangxuanjia
Contributor

nice
