
Does etcd RAFT guarantee durability? #12589

Closed
ptabor opened this issue Dec 30, 2020 · 10 comments


@ptabor
Contributor

ptabor commented Dec 30, 2020

I wonder whether an etcd cluster guarantees full distributed durability, i.e. whether data that was submitted successfully to one node (e.g. a Put operation) and for which the user got a successful response is guaranteed not to be lost as long as a quorum of the servers survives / recovers from persistent storage.

I thought the answer was obviously yes, but after a deep dive into the code I'm having doubts, and I hope I'm missing some important piece.

Scenario

  1. A 3-node cluster (N1, N2, N3) has initial state S1. N1 is the leader.
  2. The user submits a transaction T1 on the leader.
  3. The leader:
    3.1 Writes the transaction to its own unstable storage & stable storage.
    3.2 Sends msgApp to N2 & N3.
    3.3 Gets msgAppResp from N2 & N3 confirming that N2 & N3 stored the log entry in their unstable storage, but NOT in the WAL.
    3.4 Applies the change to mvcc/bbolt.
    3.5 Returns success to the user.
  4. There is a 'power shutdown' affecting all 3 etcd nodes concurrently.
  5. Nodes N2 & N3 come back. [Let's assume N1 stays unavailable.]
  6. N2 & N3 form the new majority, and N2 is elected as the leader.
  7. The WAL of N2 & N3 does NOT contain T1, so T1 was in fact never committed.

Code-wise

The source of the problem is that in step 3.3 we consider an entry as 'committed' when it has only been stored in the 'unstable' storage and not yet flushed to the hard drive in the WAL log.

Please look at the comments in the commit ptabor@f80d5cd, created only to illustrate the code crucial to the problem.

In particular it shows that:

  1. msgAppResp is issued just after the write to unstable storage:

    etcd/raft/raft.go

    Line 1381 in a4570a6

    r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
    and it returns as its index the index of the last new entry:
    return lastnewi, true
  2. msgAppResp is used by the leader to update the follower's 'match' index:

    etcd/raft/raft.go

    Line 1141 in a4570a6

    if pr.MaybeUpdate(m.Index) {
  3. the leader uses the match indexes from the progress tracker to decide what got committed (see the sketch after this list):
    return uint64(p.Voters.CommittedIndex(matchAckIndexer(p.Progress)))
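For context, here is a minimal sketch of that quorum computation (illustrative only, assuming a plain single-majority configuration; the real logic lives in the quorum package):

```go
package main

import (
	"fmt"
	"sort"
)

// committedIndex returns the highest log index acknowledged (Match) by a
// quorum of voters, i.e. the index the leader may consider committed in a
// plain single-majority configuration.
func committedIndex(match []uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), match...)
	// Sort descending: sorted[q-1] is the largest index reached by at
	// least q = len/2+1 voters.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Leader's own Match is 9; followers acknowledged 9 and 7:
	// a majority (2 of 3) reached index 9, so 9 counts as committed.
	fmt.Println(committedIndex([]uint64{9, 9, 7})) // 9
	// If the followers only acknowledged up to 7, commit stays at 7.
	fmt.Println(committedIndex([]uint64{9, 7, 7})) // 7
}
```

The question is therefore whether the Match values feeding this computation reflect entries that are durably on disk, or only in unstable storage.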

Handling of conflicts in the RAFT log

According to the RAFT protocol, the log is assumed to be 'persistent', but parts of it might need to be overwritten. See e.g. this piece of explanation: https://youtu.be/vYp4LYbnnW8?t=2724 (or Figure 7 of the RAFT paper) for when such 'out of sync' situations can happen and how they should get reconciled.

As I have seen that the etcd WAL implementation is append-only, I started to wonder how the 'overwrites' are handled. The discovery was that there exists an additional 'unstable' storage log that allows for overrides. But I suspect:

  1. the unstable storage creates a time window in which there is no 'durability' guarantee in RAFT
  2. there might be a problem if the piece of log that needs to be overwritten has already been submitted to stable storage (the WAL)

Please let me know if I'm missing any protection mechanisms against the two problems described above:

  • losing a committed transaction
  • getting an inconsistent log due to the inability to overwrite entries in the WAL
@ptabor
Contributor Author

ptabor commented Dec 30, 2020

@xiang90 @gyuho @jpbetz -> I would appreciate your feedback on the issue above.

@tangcong
Contributor

tangcong commented Dec 31, 2020

Correction to 3.3: the leader gets msgAppResp from N2 & N3 confirming that N2 & N3 stored the log in the stable storage.

14:55:53 etcd1 |  W | {SoftState:<nil> HardState:{Term:0 Vote:0 Commit:0 XXX_unrecognized:[]} ReadStates:[] Entries:[{Term:2 Index:9 Type:EntryNormal Data:[34 6 10 1 97 18 1 98 162 6 10 8 132 196 221 185 212 213 221 180 50] XXX_unrecognized:[]}] Snapshot:{Data:[] Metadata:{ConfState:{Voters:[] Learners:[] VotersOutgoing:[] LearnersNext:[] AutoLeave:false XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} CommittedEntries:[] Messages:[{Type:MsgAppResp To:18249187646912138824 From:9372538179322589801 Term:2 LogTerm:0 Index:9 Entries:[] Commit:0 Snapshot:{Data:[] Metadata:{ConfState:{Voters:[] Learners:[] VotersOutgoing:[] LearnersNext:[] AutoLeave:false XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} Reject:false RejectHint:0 Context:[] XXX_unrecognized:[]}] MustSync:true}

The follower will store ready.Entries into the WAL (MustSync:true), and only then send MsgAppResp to the leader.
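
To make that ordering explicit, here is a simplified sketch of the application-side Ready loop that the raft package documentation prescribes; durableLog, transport and runReadyLoop are illustrative stand-ins, not etcd's actual wal/rafthttp code:

```go
package example

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// durableLog and transport are illustrative stand-ins for etcd's wal and
// rafthttp layers; Save is expected to fsync before returning.
type durableLog interface {
	Save(st raftpb.HardState, ents []raftpb.Entry) error
}

type transport interface {
	Send(msgs []raftpb.Message)
}

// runReadyLoop shows the persist-then-send contract: entries and hard state
// are made durable before any outgoing message (such as MsgAppResp) is handed
// to the transport. Snapshot handling is omitted for brevity.
func runReadyLoop(n raft.Node, log durableLog, tr transport, apply func(raftpb.Entry), done <-chan struct{}) {
	for {
		select {
		case rd := <-n.Ready():
			// 1. Persist HardState and Entries (fsync) first.
			if err := log.Save(rd.HardState, rd.Entries); err != nil {
				panic(err)
			}
			// 2. Only now send messages; a follower's MsgAppResp therefore
			//    acknowledges entries that are already on disk.
			tr.Send(rd.Messages)
			// 3. Apply the entries raft has marked committed.
			for _, ent := range rd.CommittedEntries {
				apply(ent)
			}
			// 4. Signal that this Ready has been fully processed.
			n.Advance()
		case <-done:
			return
		}
	}
}
```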

@ptabor
Contributor Author

ptabor commented Dec 31, 2020

@tangcong
Thank you. I assumed

etcd/raft/raft.go

Line 1381 in a4570a6

r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: mlastIndex})
is sending the message... but in practice it's only scheduling it to be sent in the same "Ready" processing as the WAL write. My mistake.

Could you please let me know how etcd/raft handles overwrites of the WAL log (i.e. the situations from Figure 7 of the RAFT paper), given that the WAL log is append-only?

@tangcong
Contributor

If a follower's WAL log is in conflict with the leader's, it will set MsgAppResp's reject field to true and return a RejectHint to the leader. The leader will then send log entries starting from the rejected index to the follower, and the follower will store the new entries into its WAL logs (append-only). The WAL logs will then hold two different values for the same raft log index, but when etcd replays the WAL log, the entry written later overwrites the earlier one, so there is no problem. @ptabor
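
A minimal sketch of that replay-overwrite idea (illustrative only, not the actual wal.ReadAll code; entry is a simplified stand-in for raftpb.Entry):

```go
package main

import "fmt"

// entry is a simplified stand-in for a raft log entry inside a WAL record.
type entry struct {
	Index uint64
	Term  uint64
	Data  string
}

// replay folds append-only WAL records into the effective log. When a record
// repeats an index that is already present, the suffix from that index onward
// is dropped and the newer record takes its place, so the last write wins.
func replay(first uint64, records []entry) []entry {
	var ents []entry
	for _, e := range records {
		if e.Index < first {
			continue
		}
		up := e.Index - first // position of this index within ents
		if up < uint64(len(ents)) {
			ents = append(ents[:up], e) // overwrite the conflicting suffix
		} else {
			ents = append(ents, e)
		}
	}
	return ents
}

func main() {
	// Index 9 was first appended at term 2, then rewritten at term 3 after
	// the leader resolved a conflict; replay keeps only the later record.
	recs := []entry{{8, 2, "x"}, {9, 2, "a"}, {9, 3, "b"}, {10, 3, "c"}}
	fmt.Println(replay(8, recs)) // [{8 2 x} {9 3 b} {10 3 c}]
}
```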

@ptabor
Contributor Author

ptabor commented Jan 1, 2021

@tangcong Thank you.

Indeed the slice [:up] makes it an overwrite:

etcd/wal/wal.go

Line 460 in 8a03d2e

ents = append(ents[:up], e)

I will prepare a PR that documents these subtle places.

@tangcong
Contributor

tangcong commented Jan 2, 2021

good. It will help everyone better understand etcd durability. @ptabor

@ptabor
Contributor Author

ptabor commented Jan 4, 2021

Please take a look at this document:

https://docs.google.com/document/d/1O2o1IApHWmSioXG3fez4eVlUHOrXICYGNVIzaqNS0IQ/edit?resourcekey=0-e6Iywgdkol0uiVBAaV1oww#

I will translate it to markdown as soon as it's 'finalized'.

@tangcong
Contributor

tangcong commented Jan 4, 2021

wow, it is very detailed, awesome job. thank you. @ptabor

ptabor added a commit to ptabor/etcd that referenced this issue Jan 5, 2021
The change makes it explicit that sending messages does not happen
immediately and is subject to the proper persist-and-then-send protocol
on the application side. See:

etcd-io#12589 (comment)

for more context.
@ptabor
Contributor Author

ptabor commented Jan 5, 2021

The PR with comment changes is ready: #12588.

Closing the issue.

@ptabor ptabor closed this as completed Jan 5, 2021
@yangxuanjia
Contributor

nice
