Fix state.commit is out of range on restart #11888

jpbetz · 2020-05-14T04:49:23Z

This is a rebase of #10356 (which fixes #10219), with a couple of differences:

- Write .snap files as the very first step of the snapshot process.
- On startup, cross-check the .snap files directly against the WAL snap entries to find the most recent valid snapshot to use, so that orphaned .snap files are ignored.
- Remove orphaned .snap.db files when they are obviously out of date.
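
The cross-check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `walSnap` and `snapFile` types and the `newestValid` function are hypothetical names, and it assumes a snapshot file is valid only if its (term, index) pair also appears among the WAL snap entries.

```go
package main

import (
	"fmt"
	"sort"
)

// walSnap is a hypothetical stand-in for a snapshot marker recorded in the WAL.
type walSnap struct {
	Term, Index uint64
}

// snapFile is a hypothetical stand-in for the (term, index) of a .snap file on disk.
type snapFile struct {
	Term, Index uint64
}

// newestValid returns the snap file with the highest index that also has a
// matching WAL snap entry. Files without a matching entry are treated as orphaned.
func newestValid(files []snapFile, wal []walSnap) (snapFile, bool) {
	known := make(map[walSnap]bool, len(wal))
	for _, w := range wal {
		known[w] = true
	}
	// Copy and sort newest first so the first match wins.
	sorted := append([]snapFile(nil), files...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Index > sorted[j].Index })
	for _, f := range sorted {
		if known[walSnap{Term: f.Term, Index: f.Index}] {
			return f, true
		}
	}
	return snapFile{}, false
}

func main() {
	// The file at index 300 has no WAL entry, so it is skipped as orphaned.
	files := []snapFile{{Term: 2, Index: 100}, {Term: 2, Index: 200}, {Term: 2, Index: 300}}
	wal := []walSnap{{Term: 2, Index: 100}, {Term: 2, Index: 200}}
	f, ok := newestValid(files, wal)
	fmt.Println(f.Index, ok)
}
```

Writing the .snap file before the WAL entry means a crash in between can only leave an extra file behind, which a check of this kind then ignores.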

cc @brk0v, @jingyih

jpbetz · 2020-05-14T16:26:56Z

cc @YoyinZyc

gyuho

Thanks for the fix.

etcdserver/server_test.go

wal/wal.go

wal/wal_test.go

etcdserver/storage.go

etcdserver/raft.go

wal/wal_test.go

gyuho · 2020-05-14T19:00:33Z

Overall looks great. Some minor changes requested. Could have been caught in our failpoints testing :)

/cc @xiang90 @wenjiaswe @spzala

jpbetz · 2020-05-14T20:07:18Z

Just for confirmation, I replicated this on master (without this fix in place):

13:06:05 etcd2 | panic: 91bc3c398fb3c146 state.commit 1065 is out of range [352, 354]
13:06:05 etcd2 | goroutine 1 [running]:
13:06:05 etcd2 | go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000170e70, 0x0, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/vendor/go.uber.org/zap/zapcore/entry.go:230 +0x545
13:06:05 etcd2 | go.uber.org/zap.(*SugaredLogger).log(0xc00000efc8, 0x4, 0x107c099, 0x2b, 0xc0001ec840, 0x4, 0x4, 0x0, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/vendor/go.uber.org/zap/sugar.go:234 +0x100
13:06:05 etcd2 | go.uber.org/zap.(*SugaredLogger).Panicf(...)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/vendor/go.uber.org/zap/sugar.go:159
13:06:05 etcd2 | go.etcd.io/etcd/v3/pkg/logutil.(*zapRaftLogger).Panicf(0xc00019fb90, 0x107c099, 0x2b, 0xc0001ec840, 0x4, 0x4)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/pkg/logutil/zap_raft.go:101 +0x7d
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.(*raft).loadState(0xc0001b4a00, 0x2, 0xfd422379fda50e48, 0x429, 0x0, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/raft.go:1576 +0x1be
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.newRaft(0xc0003dfb18, 0x1)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/raft.go:371 +0xd90
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.NewRawNode(0xc0003dfb18, 0xc00000efc8, 0xc0001e1620, 0xc0001e1680)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/rawnode.go:48 +0x2f
13:06:05 etcd2 | go.etcd.io/etcd/v3/raft.RestartNode(0xc0003dfb18, 0x1245740, 0xc00019fb90)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/raft/node.go:237 +0x43
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdserver.restartNode(0x7ffc42ddce86, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0000db680, 0x1, 0x1, 0xc0000db980, ...)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdserver/raft.go:505 +0x6ed
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdserver.NewServer(0x7ffc42ddce86, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0000db680, 0x1, 0x1, 0xc0000db980, ...)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdserver/server.go:446 +0x2c0f
13:06:05 etcd2 | go.etcd.io/etcd/v3/embed.StartEtcd(0xc0000c9b80, 0xc0001ee000, 0x0, 0x0)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/embed/etcd.go:203 +0xcd3
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdmain.startEtcd(0xc0000c9b80, 0x1051cca, 0x6, 0xc0000dbf01, 0x2)
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdmain/etcd.go:212 +0x40
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdmain.startEtcdOrProxyV2()
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdmain/etcd.go:107 +0x25f0
13:06:05 etcd2 | go.etcd.io/etcd/v3/etcdmain.Main()
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/etcdmain/main.go:46 +0x37
13:06:05 etcd2 | main.main()
13:06:05 etcd2 | 	/home/jpbetz/projects/etcd/main.go:28 +0x20
13:06:05 etcd2 | Terminating etcd2

To replicate, the Procfile change should be GOFAIL_HTTP="127.0.0.1:1111" GOFAIL_FAILPOINTS='etcdserver/raftBeforeSaveSnap=panic("raftBeforeSaveSnap")', which differs slightly from what is documented in #10219.
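
For readers unfamiliar with the panic message, here is a simplified sketch of the kind of range check that produces it. It is not the raft package's code: `checkCommit` and its parameter names are illustrative, and it assumes the persisted commit index must fall between the committed index and the last index of the log rebuilt on restart.

```go
package main

import "fmt"

// checkCommit is a simplified, illustrative range check: the persisted commit
// index must lie within [committed, lastIndex] of the log rebuilt on restart.
func checkCommit(commit, committed, lastIndex uint64) error {
	if commit < committed || commit > lastIndex {
		return fmt.Errorf("state.commit %d is out of range [%d, %d]", commit, committed, lastIndex)
	}
	return nil
}

func main() {
	// Values taken from the panic message in the trace above.
	fmt.Println(checkCommit(1065, 352, 354))
}
```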

jpbetz · 2020-05-14T20:37:39Z

> Overall looks great. Some minor changes requested.

Feedback applied. Thanks for the review!

> Could have been caught in our failpoints testing :)

Where do we test that, by the way? I'd like to review it and see if there is anything that can be done to make it test this case better.

gyuho · 2020-05-14T20:41:31Z

> Where do we test that, by the way?

We've been running failpoints in our CI https://github.com/etcd-io/etcd/blob/master/functional.yaml#L4, but maybe it wasn't enough.

lgtm.

I will merge after CIs and release them with new 3.3 and 3.4.

jpbetz · 2020-05-14T21:06:41Z

Thanks @gyuho. Hold off on merging for just a bit. I just noticed that .snap.db files might not be getting properly cleaned up if orphaned.

jpbetz · 2020-05-14T22:12:12Z

Added one more commit to remove orphaned .snap.db files during Release().

tedyu · 2020-05-15T22:46:45Z

etcdserver/api/snap/snapshotter.go

+			hexIndex := strings.TrimSuffix(filepath.Base(filename), ".snap.db")
+			index, err := strconv.ParseUint(hexIndex, 16, 64)
+			if err != nil {
+				return fmt.Errorf("failed to parse index from .snap.db filename '%s': %w", filename, err)


.snap.db seems to be redundant since it is the suffix of filename.

Also, we should continue the loop and try to release the other files.

Submitted #11899

We can't just call it .snap since there are already different files with that suffix. But the purpose of this PR isn't to rethink the file suffix anyway...

Continuing if we hit an error sounds fine. This is merged now though, so that would need a separate PR.
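
A rough sketch of the continue-on-error idea discussed above. This is not the code from this PR or from #11899: the function name `staleSnapDBs`, its signature, and the sample paths are hypothetical. It reuses the filename parsing from the diff hunk, but records a parse failure and moves on to the next file instead of returning early.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// staleSnapDBs returns the .snap.db files whose index is older than keepIndex.
// Files whose names cannot be parsed are reported in errs and skipped, so one
// bad filename does not stop the scan. Name and signature are hypothetical.
func staleSnapDBs(filenames []string, keepIndex uint64) (stale []string, errs []error) {
	for _, filename := range filenames {
		hexIndex := strings.TrimSuffix(filepath.Base(filename), ".snap.db")
		index, err := strconv.ParseUint(hexIndex, 16, 64)
		if err != nil {
			errs = append(errs, fmt.Errorf("failed to parse index from .snap.db filename '%s': %w", filename, err))
			continue // keep going so the remaining files are still examined
		}
		if index < keepIndex {
			stale = append(stale, filename)
		}
	}
	return stale, errs
}

func main() {
	// Sample paths are made up for illustration.
	names := []string{
		"member/snap/0000000000000064.snap.db",
		"member/snap/bogus.snap.db",
		"member/snap/00000000000003e8.snap.db",
	}
	stale, errs := staleSnapDBs(names, 0x200)
	fmt.Println(stale, len(errs))
}
```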

serathius · 2023-12-20T10:38:54Z

etcdserver/server.go

+		}
+		// snapshot files can be orphaned if etcd crashes after writing them but before writing the corresponding
+		// wal log entries
+		snapshot, err := ss.LoadNewestAvailable(walSnaps)


This change created a local snapshot variable. Now the snapshot declared on line 337 is always passed empty to validation on line 535.
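
The shadowing problem described here can be reproduced in isolation. The sketch below is illustrative only: the `snapshot` type, `load`, `shadowed`, and `assigned` are hypothetical names, not code from etcdserver/server.go. It shows how `:=` inside a block declares a new variable and leaves the outer one unset.

```go
package main

import "fmt"

type snapshot struct{ Index uint64 }

// load is a hypothetical loader standing in for a call such as LoadNewestAvailable.
func load() (*snapshot, error) { return &snapshot{Index: 42}, nil }

// shadowed declares a new inner variable with :=, so the outer one stays nil.
func shadowed(haveWAL bool) *snapshot {
	var snap *snapshot
	if haveWAL {
		snap, err := load() // new variables scoped to this block
		if err != nil {
			return nil
		}
		_ = snap // the inner value is visible only inside the block
	}
	return snap
}

// assigned declares err separately and uses =, so the outer variable is set.
func assigned(haveWAL bool) *snapshot {
	var snap *snapshot
	if haveWAL {
		var err error
		snap, err = load()
		if err != nil {
			return nil
		}
	}
	return snap
}

func main() {
	fmt.Println(shadowed(true) == nil, assigned(true) == nil)
}
```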

…stent index and closing database on defer

The `err` variable is shared throughout the NewServer function and is used on line 396 to defer the decision whether the backend should be closed when starting the server fails. The `snapshot` variable is first defined on line 407, redeclared locally on line 496, and used again on line 625. The creation of the local variable is a bug introduced in etcd-io#11888.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

Viacheslav Biriukov added 5 commits May 13, 2020 08:42

etcdserver/*, wal/*: changes to snapshots and wal logic to fix etcd-io#10219

9162cd6

etcdserver/*: changes to snapshots and wal logic to fix etcd-io#10219

3d2b565

etcdserver/*: fix tests

5435e76

etcdserver/*: rollback default settings

91efa67

etcdserver/*, wal/*: add Sync method

5051703

jpbetz mentioned this pull request May 14, 2020

etcdserver/*, wal/*: changes to snapshots and WAL logic to fix #10219 #10356

Closed

jpbetz force-pushed the out-of-range-fix branch 2 times, most recently from c98115a to b5630ae Compare May 14, 2020 16:19

etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries

bd16071

jpbetz force-pushed the out-of-range-fix branch from b5630ae to bd16071 Compare May 14, 2020 16:22

jpbetz changed the title ~~[WIP] Fix state.commit is out of range on restart~~ Fix state.commit is out of range on restart May 14, 2020

jpbetz requested a review from jingyih May 14, 2020 16:25

jpbetz added the backport/v3.3 label May 14, 2020

jpbetz self-assigned this May 14, 2020

jpbetz added the backport/v3.4 label May 14, 2020