
Corrupted WAL and snapshot restoring process #10219

Closed
brk0v opened this issue Oct 26, 2018 · 11 comments · Fixed by #11888
@brk0v

brk0v commented Oct 26, 2018

Issue

A node that has been offline for more than max(SnapshotCount, DefaultSnapshotCatchUpEntries) entries corrupts its WAL with a bad HardState.Commit number if it is killed right after the HardState has been saved to non-volatile storage (failpoint: raftBeforeSaveSnap).
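For illustration, here is a minimal sketch of the write order I am describing. This is not the actual etcd code path; applyReadyUnsafe, walSave, and snapSave are stand-in names, and the gofail comment only marks where the raftBeforeSaveSnap failpoint fires:

package sketch

import (
    "go.etcd.io/etcd/raft"
    "go.etcd.io/etcd/raft/raftpb"
)

// applyReadyUnsafe shows the order that creates the corruption window: the
// HardState (whose Commit already points at the incoming snapshot's index)
// becomes durable before the snapshot itself does.
func applyReadyUnsafe(rd raft.Ready,
    walSave func(raftpb.HardState, []raftpb.Entry) error,
    snapSave func(raftpb.Snapshot) error) error {
    // HardState and entries hit the WAL first.
    if err := walSave(rd.HardState, rd.Entries); err != nil {
        return err
    }
    // gofail: var raftBeforeSaveSnap struct{}
    // A crash here leaves a durable Commit that is out of range of the WAL,
    // because the snapshot below was never persisted.
    if !raft.IsEmptySnap(rd.Snapshot) {
        if err := snapSave(rd.Snapshot); err != nil {
            return err
        }
    }
    return nil
}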

Specifics

Version: master
Environment: any (tested on Linux and macOS)

Steps to reproduce

  1. Patch the snapshot defaults to make the issue easier to trigger:
+++ b/etcdserver/server.go
@@ -65,14 +65,14 @@ import (
 )

 const (
-       DefaultSnapshotCount = 100000
+       DefaultSnapshotCount = 10

        // DefaultSnapshotCatchUpEntries is the number of entries for a slow follower
        // to catch-up after compacting the raft storage entries.
        // We expect the follower has a millisecond level latency with the leader.
        // The max throughput is around 10K. Keep a 5K entries is enough for helping
        // follower to catch up.
-       DefaultSnapshotCatchUpEntries uint64 = 5000
+       DefaultSnapshotCatchUpEntries uint64 = 10
  2. Compile with failpoints enabled to emulate a power failure during snapshot restoration:
FAILPOINTS="true" make build
  3. Create a Procfile with the raftBeforeSaveSnap failpoint for the etcd2 node:
etcd1: bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug 
etcd2: GOFAIL_HTTP="127.0.0.1:1111" GOFAIL_FAILPOINTS='go.etcd.io/etcd/etcdserver/raftBeforeSaveSnap=panic("raftBeforeSaveSnap")' bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug 
etcd3: bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug
  4. Start the cluster:
goreman start
  5. Start a write loop:
for i in {0..100000}; do ./bin/etcdctl put key$i value$i; echo $i; done
  6. Stop the etcd2 node for "maintenance":
goreman run stop etcd2
  7. Start the etcd2 node after at least 10 more entries have been written to the leader, so that it has to catch up from a snapshot:
sleep 10 && goreman run start etcd2
  8. You should get a failpoint panic, which emulates a power failure while restoring from a snapshot.

From this point on, the WAL on the etcd2 node is corrupted: it holds a HardState entry whose Commit number comes from the snapshot, but the snapshot itself was never saved to the WAL or to disk.

  9. Run the etcd2 node manually without the failpoint and observe the error:
bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug

Error:

panic: 91bc3c398fb3c146 state.commit 961 is out of range [572, 579]

goroutine 1 [running]:
go.etcd.io/etcd/vendor/go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001cc160, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/zapcore/entry.go:229 +0x515
go.etcd.io/etcd/vendor/go.uber.org/zap.(*SugaredLogger).log(0xc0001ce118, 0x4, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/sugar.go:234 +0xf6
go.etcd.io/etcd/vendor/go.uber.org/zap.(*SugaredLogger).Panicf(0xc0001ce118, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/sugar.go:159 +0x79
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf(0xc0000601f0, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/pkg/logutil/zap_raft.go:96 +0x61
go.etcd.io/etcd/raft.(*raft).loadState(0xc0000f8000, 0x2, 0x8211f1d0f64f3269, 0x3c1, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/raft.go:1459 +0x1b1
go.etcd.io/etcd/raft.newRaft(0xc000450360, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/raft.go:368 +0xd52
go.etcd.io/etcd/raft.RestartNode(0xc000450360, 0x1e98e20, 0xc0000601f0)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/node.go:232 +0x43
go.etcd.io/etcd/etcdserver.restartNode(0x7fff5fbff840, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0001aae80, 0x1, 0x1, 0xc0001ab180, ...)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdserver/raft.go:548 +0x7ad
go.etcd.io/etcd/etcdserver.NewServer(0x7fff5fbff840, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0001aae80, 0x1, 0x1, 0xc0001ab180, ...)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdserver/server.go:464 +0x3038
go.etcd.io/etcd/embed.StartEtcd(0xc0002b4000, 0xc0002b4500, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/embed/etcd.go:203 +0x8ed
go.etcd.io/etcd/etcdmain.startEtcd(0xc0002b4000, 0x1cde6b6, 0x6, 0xc0001ab401, 0x2)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/etcd.go:304 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/etcd.go:146 +0x2f37
go.etcd.io/etcd/etcdmain.Main()
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/main.go:47 +0x37
main.main()
        /Users/brk0v/go/src/go.etcd.io/etcd/main.go:28 +0x20
@gyuho
Contributor

gyuho commented Oct 26, 2018

Is this reproducible with current master?

@brk0v
Author

brk0v commented Oct 26, 2018

Yes, you can reproduce this with the latest commit 5837632.

@brk0v
Author

brk0v commented Dec 21, 2018

I would appreciate it if someone could confirm this and share ideas on how it could be fixed.
I tried rearranging the order in which data is saved to the WAL with some code changes, but it seems like we need to do all of the saves in one transaction, the way dgraph does: https://github.com/dgraph-io/dgraph/blob/master/raftwal/storage.go#L574
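For reference, a rough sketch of that "single transaction" idea; the badger usage and key scheme below are just illustrative assumptions, not dgraph's or etcd's actual storage code:

package sketch

import (
    "fmt"

    badger "github.com/dgraph-io/badger"
    "go.etcd.io/etcd/raft"
    "go.etcd.io/etcd/raft/raftpb"
)

// saveAtomically persists the snapshot, hard state, and entries in a single
// key-value transaction, so a crash can never expose a HardState whose Commit
// refers to a snapshot that was not persisted.
func saveAtomically(db *badger.DB, snap raftpb.Snapshot, hs raftpb.HardState, ents []raftpb.Entry) error {
    return db.Update(func(txn *badger.Txn) error {
        if !raft.IsEmptySnap(snap) {
            data, err := snap.Marshal()
            if err != nil {
                return err
            }
            if err := txn.Set([]byte("raft/snapshot"), data); err != nil {
                return err
            }
        }
        data, err := hs.Marshal()
        if err != nil {
            return err
        }
        if err := txn.Set([]byte("raft/hardstate"), data); err != nil {
            return err
        }
        for i := range ents {
            ed, err := ents[i].Marshal()
            if err != nil {
                return err
            }
            key := fmt.Sprintf("raft/entry/%020d", ents[i].Index)
            if err := txn.Set([]byte(key), ed); err != nil {
                return err
            }
        }
        return nil
    })
}

etcd's WAL is append-only rather than a transactional key-value store, so the alternative is to enforce a safe ordering of the individual saves instead.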

@xiang90
Contributor

xiang90 commented Dec 21, 2018

@brk0v

I am aware of this issue. Would you like to spend some time to get it fixed?

@brk0v
Author

brk0v commented Dec 24, 2018

Sure thing. But if you already have an idea to check or a direction to dig into, I could start there.

@brk0v
Author

brk0v commented Dec 27, 2018

I added code that passes my local failpoint tests. Could you please check whether this approach makes sense?
Thank you!

P.S. Unit tests are broken because of interface changes in the etcdserver.Storage code.

brk0v pushed a commit to brk0v/etcd that referenced this issue Jan 4, 2019
@brk0v
Author

brk0v commented Jan 9, 2019

Tests now pass.

To summarize what was done:

  • Added snapshotter.LoadIndex() to load an arbitrary snapshot;
  • Added etcdserver.storage.checkWALSnap() to check that a snapshot can be used to load the raft state from the WAL;
  • Added etcdserver.storage.Release() and etcdserver.storage.Sync() to provide a safe order of save operations;
  • Changed how raft.Ready handling treats a non-empty rd.Snapshot (see the sketch after this list):
    • save the snapshot to disk;
    • save the hard state;
    • fsync the WAL to make the hard state persistent;
    • release all WAL entries up to the snapshot index.
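Roughly, the new handling looks like the following. This is a simplified sketch rather than the exact code; the storage interface below just mirrors the methods listed above:

package sketch

import (
    "go.etcd.io/etcd/raft"
    "go.etcd.io/etcd/raft/raftpb"
)

// storage mirrors the Save/SaveSnap/Sync/Release methods described above;
// it is an illustrative interface, not the exact etcdserver.Storage definition.
type storage interface {
    SaveSnap(snap raftpb.Snapshot) error
    Save(hs raftpb.HardState, ents []raftpb.Entry) error
    Sync() error
    Release(snap raftpb.Snapshot) error
}

// persistReady applies the corrected ordering to a single raft.Ready.
func persistReady(st storage, rd raft.Ready) error {
    if !raft.IsEmptySnap(rd.Snapshot) {
        // 1. Persist the snapshot (snap file + WAL snapshot record) first, so
        //    the index it covers exists on disk before anything references it.
        if err := st.SaveSnap(rd.Snapshot); err != nil {
            return err
        }
    }
    // 2. Persist the HardState (whose Commit may point into the snapshot)
    //    together with any new entries.
    if err := st.Save(rd.HardState, rd.Entries); err != nil {
        return err
    }
    // 3. fsync the WAL so the HardState is durable before it is acted upon.
    if err := st.Sync(); err != nil {
        return err
    }
    if !raft.IsEmptySnap(rd.Snapshot) {
        // 4. Only now drop WAL files and entries up to the snapshot index.
        return st.Release(rd.Snapshot)
    }
    return nil
}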

@xiang90
Contributor

xiang90 commented Jan 9, 2019

@brk0v Thanks. I will give this a careful look over the next couple of weeks.

@xiang90 xiang90 self-assigned this Jan 9, 2019
brk0v pushed a commit to brk0v/etcd that referenced this issue Jan 23, 2019
@jingyih
Contributor

jingyih commented Jan 28, 2019

cc @jpbetz

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020
@jpbetz jpbetz reopened this May 13, 2020
@stale stale bot removed the stale label May 13, 2020
@jpbetz
Contributor

jpbetz commented May 14, 2020

I'm working on rebasing #10356 onto master and making a couple of adjustments to it (retaining the commits from @brk0v). I'll send out a PR shortly.

gyuho pushed a commit that referenced this issue May 15, 2020
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*: Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
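
To illustrate the "cross checking" item above: a .snap file only counts as valid if the WAL also holds a snapshot record with the same term and index. A minimal sketch of that check follows; the function name and shape are illustrative, not the actual wal package API:

package sketch

import (
    "go.etcd.io/etcd/raft/raftpb"
    "go.etcd.io/etcd/wal/walpb"
)

// validSnapshots keeps only the snapshot files that have a matching snapshot
// record (same index and term) in the WAL, so a half-written snapshot can
// never be selected during recovery.
func validSnapshots(walSnaps []walpb.Snapshot, snapFiles []raftpb.Snapshot) []raftpb.Snapshot {
    recorded := make(map[uint64]uint64, len(walSnaps)) // index -> term found in the WAL
    for _, ws := range walSnaps {
        recorded[ws.Index] = ws.Term
    }
    var valid []raftpb.Snapshot
    for _, s := range snapFiles {
        if term, ok := recorded[s.Metadata.Index]; ok && term == s.Metadata.Term {
            valid = append(valid, s)
        }
    }
    return valid
}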
gyuho added a commit that referenced this issue May 18, 2020
ref. #10219

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
wyrobnik added a commit to wyrobnik/etcd that referenced this issue Apr 11, 2022
Update raftexample to save the snapshot file and WAL snapshot entry
before hardstate to ensure the snapshot exists during recovery.
Otherwise, if there is a failure after storing the hard state, there may
be a reference to a non-existent snapshot.
This PR introduces the fix from etcd-io#10219 to the raftexample.