
etcd3.5.0: panic: tocommit(458) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost? #13509

Closed
redriverhong opened this issue Nov 27, 2021 · 5 comments

@redriverhong

I got a panic with version 3.5.0 when I stopped a member, removed its data directory, and started it again.
panic: tocommit(458) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?

goroutine 163 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001663c0, 0x0, 0x0, 0x0)
/opt/buildtools/go_workpace/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc0004981e0, 0xc0006ba104, 0x55a3ccd20fe1, 0x5d, 0xc00007c4c0, 0x2, 0x2, 0x0, 0x0, 0x0)
/opt/buildtools/go_workpace/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x115
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/opt/buildtools/go_workpace/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc000682b80, 0x55a3ccd20fe1, 0x5d, 0xc00007c4c0, 0x2, 0x2)
/usr1/3.5.0/server/etcdserver/zap_raft.go:101 +0x7f
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0006e4310, 0x1ca)
/usr1/3.5.0/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc0004bcf20, 0x8, 0x5e92d99e003cce4, 0x507df051d12df981, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/usr1/3.5.0/raft/raft.go:1513 +0x56
go.etcd.io/etcd/raft/v3.stepFollower(0xc0004bcf20, 0x8, 0x5e92d99e003cce4, 0x507df051d12df981, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/usr1/3.5.0/raft/raft.go:1439 +0x498
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc0004bcf20, 0x8, 0x5e92d99e003cce4, 0x507df051d12df981, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/usr1/3.5.0/raft/raft.go:980 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc00069cf00)
/usr1/3.5.0/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
/usr1/3.5.0/raft/node.go:244 +0x330

Reproduce Procedure:

  1. Start etcd on three nodes using static bootstrapping (--initial-cluster).
    Key configuration items: (screenshot attached in the original issue)
  2. Stop one member and remove its data directory.
  3. Start this member again; it panics as shown above.

I think this is a normal operation for a three-node cluster. In 3.4.x and earlier versions, this method was often used to recover a node in the cluster whose data had been corrupted. However, it no longer works in 3.5.0.

@ahrtr
Member

ahrtr commented Nov 27, 2021

It's expected behavior. If an etcd node crashes and the local data is completely gone, then the operator needs to remove the member from the cluster and then add it back again. The commands are roughly as below:

etcdctl member remove <memberId>
etcdctl member add <name> --peer-urls=https://x.x.x.x:2380

Finally, start the member again; note that you need to set --initial-cluster-state to "existing" in this case. It makes sense to require operator/human intervention in such an extreme scenario.
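
For completeness, the same remove/add sequence can also be driven programmatically with the Go client. This is a minimal sketch, not an official recipe; the endpoint, the member name "infra2", and the peer URL are placeholders assumed for illustration (TLS configuration is omitted for brevity):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a healthy member of the cluster (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://x.x.x.x:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the ID of the failed member by name ("infra2" is a placeholder).
	list, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var failedID uint64
	for _, m := range list.Members {
		if m.Name == "infra2" {
			failedID = m.ID
		}
	}

	// Remove the failed member, then add it back with its peer URL.
	if _, err := cli.MemberRemove(ctx, failedID); err != nil {
		log.Fatal(err)
	}
	added, err := cli.MemberAdd(ctx, []string{"https://x.x.x.x:2380"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("re-added member, new ID: %x\n", added.Member.ID)
}

After the add succeeds, the wiped member is started with --initial-cluster-state=existing, exactly as described above.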

If a member crashes but the local data is still there, then it should be OK to start the member again directly.

@stale

stale bot commented Feb 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 26, 2022
@ahrtr ahrtr closed this as completed Feb 26, 2022
@zhangguanzhang
Contributor

etcdctl member remove <memberId>
etcdctl member add <name> --peer-urls=https://x.x.x.x:2380

@M1178475702

I noticed that a panic can occur when a crashed node restarts and receives a heartbeat message from the leader despite having lost the Raft log. While I understand that an entry cannot be committed if the last log index (lastIndex) is less than the index to be committed (toCommit), I am unclear as to why the node should panic and exit instead of rejecting the heartbeat message and waiting for further heartbeat messages with lower-index entries.
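
For context, the check that produces this panic is in raft's commitTo. The logic is roughly the following (a paraphrased sketch of raft/log.go, not guaranteed to match any specific 3.5.x release line for line):

// commitTo advances the commit index. A commit index beyond the last
// locally known entry means this follower is missing entries the leader
// believes it already has, which raft treats as unrecoverable.
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}

The leader only advances a follower's commit index up to entries it believes that follower has already matched, so reaching this branch means the follower has lost entries it previously acknowledged; raft treats that as a safety violation rather than something to reject and retry.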

@M1178475702

And there is a further problem: if the leader has committed some logs while the node was crashed, then restarting the node will always fail, no matter whether the logs were lost or not.
