Possible cluster unavailability #6378

ramanala · 2016-09-07T14:33:59Z

Possible cluster unavailability

1 rename(source="/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp", dest="/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
2 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal", offset=88, count=4096)
3 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal", offset=4184, count=4242)
4 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")

I see the above sequence of system calls when etcd appends a user data item to its WAL file. If a crash happens just before the 4th operation (fdatasync), and the file system has reordered the appends in operations 2 and 3 (this reordering is possible on commonly used file systems, such as ext4 in ordered mode), then during recovery the server crashes with the following error in its debug log:

...timestamp... I | etcdmain: etcd Version: 2.3.0
...timestamp... I | etcdmain: Git SHA: 3719912
...timestamp... I | etcdmain: Go Version: go1.6
...timestamp... I | etcdmain: Go OS/Arch: linux/amd64
...timestamp... I | etcdmain: setting maximum number of CPUs to 40, total number of available CPUs is 40
...timestamp... N | etcdmain: the server is already initialized as member before, starting as etcd member...
...timestamp... I | etcdmain: listening for peers on http://172.17.0.2:2380
...timestamp... I | etcdmain: listening for client requests on http://172.17.0.2:2379
...timestamp... I | etcdserver: name = infra0
...timestamp... I | etcdserver: data dir = /data/etcd/infra0.etcd/
...timestamp... I | etcdserver: member dir = /data/etcd/infra0.etcd//member
...timestamp... I | etcdserver: heartbeat = 100ms
...timestamp... I | etcdserver: election = 1000ms
...timestamp... I | etcdserver: snapshot count = 10000
...timestamp... I | etcdserver: advertise client URLs = http://172.17.0.2:2379
...timestamp... C | etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired
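
To make the failure mode concrete, here is a minimal in-memory Go sketch of it. Everything in it is an assumption for illustration: the record framing (length, CRC32, payload), the names `encodeRecord`, `checkRecord` and `persistReordered`, and the choice to model the lost range as zeros. It is not etcd's WAL format or code; the 4096/4242 split is borrowed from the trace above only to keep the numbers familiar.

```go
package main

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io"
)

// errCRCMismatch is a hypothetical stand-in for the error seen in the log.
var errCRCMismatch = errors.New("crc mismatch")

// headerLen is 4 bytes of length plus 4 bytes of CRC (assumed framing, not etcd's).
const headerLen = 8

// encodeRecord frames payload as [length][crc][payload]. The CRC covers the
// length prefix and the payload, so an all-zero region does not validate.
func encodeRecord(payload []byte) []byte {
	buf := make([]byte, headerLen+len(payload))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(payload)))
	copy(buf[headerLen:], payload)
	crc := crc32.Update(crc32.ChecksumIEEE(buf[0:4]), crc32.IEEETable, payload)
	binary.BigEndian.PutUint32(buf[4:8], crc)
	return buf
}

// checkRecord verifies the record at the start of buf and returns its framed
// size. A short buffer is reported as io.ErrUnexpectedEOF (a torn tail); a
// complete record with a bad checksum is reported as errCRCMismatch.
func checkRecord(buf []byte) (int, error) {
	if len(buf) < headerLen {
		return 0, io.ErrUnexpectedEOF
	}
	n := int(binary.BigEndian.Uint32(buf[0:4]))
	if n > len(buf)-headerLen {
		return 0, io.ErrUnexpectedEOF
	}
	want := binary.BigEndian.Uint32(buf[4:8])
	got := crc32.Update(crc32.ChecksumIEEE(buf[0:4]), crc32.IEEETable, buf[headerLen:headerLen+n])
	if got != want {
		return 0, errCRCMismatch
	}
	return headerLen + n, nil
}

// persistReordered models a crash before fdatasync in which only the second
// of two appends reached disk: the image is as long as both appends, but the
// range of the first append still holds zeros (an assumption of this sketch).
func persistReordered(first, second []byte) []byte {
	img := make([]byte, len(first)+len(second))
	copy(img[len(first):], second)
	return img
}
```

With an 8330-byte payload, `encodeRecord` yields 8338 bytes. Splitting that as `rec[:4096]` and `rec[4096:]`, passing both halves to `persistReordered`, and calling `checkRecord` on the result returns `errCRCMismatch`. Calling `checkRecord` on `rec[:4096]` alone returns `io.ErrUnexpectedEOF`, which is the easier case to recognise as a torn tail.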

Two nodes in a three-node cluster can easily get into this state, leaving a majority of the servers unusable. The third node then cannot make progress alone because there is no majority, so the cluster becomes unavailable.
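
The majority arithmetic behind this can be written out as a small Go sketch. The function names `quorum` and `available` are hypothetical and are not taken from etcd.

```go
package main

// quorum returns the minimum number of live members a majority-based
// cluster of n members needs in order to make progress.
func quorum(n int) int {
	return n/2 + 1
}

// available reports whether a cluster of n members can still make progress
// when the given number of members have failed.
func available(n, failed int) bool {
	return n-failed >= quorum(n)
}
```

For three members the quorum is two, so `available(3, 1)` is true while `available(3, 2)` is false, which is the situation described here.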

Although the window of vulnerability is small, this is a potential problem that could be fixed in the recovery code etcd runs after a crash.
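
One possible shape for such a recovery step is sketched below in Go: scan the log, keep the longest prefix of records that verify, and drop the rest. This is only a sketch under stated assumptions, not etcd's repair logic. The framing (length, CRC32, payload) and the names `frame`, `validPrefix` and `dropUnsyncedTail` are hypothetical. Discarding the tail is only safe if the discarded bytes were never covered by a completed fdatasync, and this code cannot tell that apart from corruption of already-synced data.

```go
package main

import (
	"encoding/binary"
	"hash/crc32"
)

// hdr is 4 bytes of length plus 4 bytes of CRC (assumed framing, not etcd's).
const hdr = 8

// frame encodes payload as [length][crc][payload]; the CRC covers the length
// prefix and the payload, so a zero-filled hole does not validate.
func frame(payload []byte) []byte {
	buf := make([]byte, hdr+len(payload))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(payload)))
	copy(buf[hdr:], payload)
	crc := crc32.Update(crc32.ChecksumIEEE(buf[0:4]), crc32.IEEETable, payload)
	binary.BigEndian.PutUint32(buf[4:8], crc)
	return buf
}

// validPrefix returns the length of the longest prefix of wal that consists
// only of complete records whose CRC verifies.
func validPrefix(wal []byte) int {
	off := 0
	for len(wal)-off >= hdr {
		n := int(binary.BigEndian.Uint32(wal[off : off+4]))
		if n > len(wal)-off-hdr {
			break // record claims more bytes than exist: torn tail
		}
		end := off + hdr + n
		want := binary.BigEndian.Uint32(wal[off+4 : off+8])
		got := crc32.Update(crc32.ChecksumIEEE(wal[off:off+4]), crc32.IEEETable, wal[off+hdr:end])
		if got != want {
			break // checksum failure: stop at the last good record
		}
		off = end
	}
	return off
}

// dropUnsyncedTail returns wal truncated to its valid prefix. The caller is
// responsible for knowing that the dropped bytes were never acknowledged.
func dropUnsyncedTail(wal []byte) []byte {
	return wal[:validPrefix(wal)]
}
```

Given two intact records followed by a record whose first bytes were lost, `validPrefix` returns the combined length of the two intact records, and `dropUnsyncedTail` returns exactly those bytes.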

xiang90 · 2016-09-07T15:18:53Z

I think this is already fixed, no? @heyitsanthony

heyitsanthony · 2016-09-07T17:21:34Z

@xiang90 yeah all the O_APPEND stuff is gone as of 3.0.

xiang90 · 2016-09-07T17:26:44Z

@ramanala This should be fixed. FYI.

heyitsanthony closed this as completed Sep 7, 2016
