
[Second Solution] Fix the potential data loss for clusters with only one member (simpler solution) #14400

Merged (1 commit), Sep 5, 2022

Conversation

@ahrtr (Member) commented on Aug 30, 2022

Second solution to fix #14370

This solution is based on the following feedback:

  1. Durability API guarantee broken in single node cluster #14370 (comment) from @hasethuraman
  2. Fix the potential data loss for clusters with only one member #14394 (comment) from @lavacat
  3. Durability API guarantee broken in single node cluster #14370 (comment) from @ptabor

I compared the performance of this PR and #14394 for a one-member cluster; overall #14394 performs a little better (about 2.7% higher than this one). But this PR is much simpler: excluding the test and comments, it changes only about 20 lines of code.

cc @serathius @spzala @ptabor @liggitt @dims

@ahrtr (Member, Author) commented on Aug 30, 2022

The pipeline failures are caused by 70de5c8. I just delivered another PR, #14401, to fix it.

For a cluster with only one member, raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually only partially) the
applying workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data, including to BoltDB and the WAL, because:
   1. etcd commits the BoltDB transaction periodically instead of on each request;
   2. etcd saves WAL entries in parallel with applying the committed entries.
Accordingly, data loss may occur when etcd crashes immediately after
responding to the client but before BoltDB and the WAL have successfully
persisted the data to disk.
Note that this issue can only happen for clusters with only one member.

For clusters with multiple members, this isn't an issue, because etcd will
not commit & apply the data before it has been replicated to a majority of
members. When the client receives the response, the data must have been
applied, which in turn means it must have been committed.
Note: for clusters with multiple members, raft will never send identical
unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@ahrtr changed the title from "Fix the potential data loss for clusters with only one member (Second solution)" to "Fix the potential data loss for clusters with only one member (simpler solution)" on Aug 30, 2022
@ahrtr (Member, Author) commented on Aug 30, 2022

I suggest cherry-picking this PR or #14394 to 3.5 and 3.4.

We can continue to enhance the raft package implementation in the main branch only.

@ahrtr changed the title from "Fix the potential data loss for clusters with only one member (simpler solution)" to "[Second Solution] Fix the potential data loss for clusters with only one member (simpler solution)" on Aug 31, 2022
@ahrtr mentioned this pull request on Aug 31, 2022
@serathius (Member) left a comment:

Looks like the best intermediate solution for etcdserver, as proposed in #14370 (comment).

Merging this pull request may close: Durability API guarantee broken in single node cluster (#14370)