
etcd3.5.0: panic: tocommit(458) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost? #13509

Closed
redriverhong opened this issue Nov 27, 2021 · 5 comments

@redriverhong

I got a panic with version 3.5.0 when I stopped a member, removed its data directory, and started it again.
panic: tocommit(458) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?

goroutine 163 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001663c0, 0x0, 0x0, 0x0)
/opt/buildtools/go_workpace/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc0004981e0, 0xc0006ba104, 0x55a3ccd20fe1, 0x5d, 0xc00007c4c0, 0x2, 0x2, 0x0, 0x0, 0x0)
/opt/buildtools/go_workpace/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x115
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/opt/buildtools/go_workpace/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc000682b80, 0x55a3ccd20fe1, 0x5d, 0xc00007c4c0, 0x2, 0x2)
/usr1/3.5.0/server/etcdserver/zap_raft.go:101 +0x7f
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0006e4310, 0x1ca)
/usr1/3.5.0/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc0004bcf20, 0x8, 0x5e92d99e003cce4, 0x507df051d12df981, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/usr1/3.5.0/raft/raft.go:1513 +0x56
go.etcd.io/etcd/raft/v3.stepFollower(0xc0004bcf20, 0x8, 0x5e92d99e003cce4, 0x507df051d12df981, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/usr1/3.5.0/raft/raft.go:1439 +0x498
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc0004bcf20, 0x8, 0x5e92d99e003cce4, 0x507df051d12df981, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/usr1/3.5.0/raft/raft.go:980 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc00069cf00)
/usr1/3.5.0/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
/usr1/3.5.0/raft/node.go:244 +0x330

Reproduce Procedure:

  1. Start etcd on three nodes using static bootstrapping (--initial-cluster).
    Key configuration items: (screenshot attached in the original issue)
  2. Stop one member and remove its data directory.
  3. Start this member again; it panics as shown above.

I think this is a normal operation for a three-node cluster. In 3.4.x and earlier versions, this method was often used to recover a node in the cluster whose data had been corrupted. However, it no longer works in 3.5.0.

@ahrtr
Member

ahrtr commented Nov 27, 2021

It's expected behavior. If an etcd node crashes and the local data is completely gone, then the operator needs to remove the member from the cluster and then add it back again. The commands are roughly as below:

etcdctl member remove <memberId>
etcdctl member add <name> --peer-urls=https://x.x.x.x:2380

Finally, start the member again; note that you need to set --initial-cluster-state to "existing" in this case. It makes sense to require operator/human intervention in such an extreme scenario.
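
For completeness, the same remove/add sequence can also be driven programmatically with the Go client. This is a minimal sketch, not an official recipe; the endpoint, the member name "infra2", and the peer URL are placeholders assumed for illustration (TLS configuration is omitted for brevity):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a healthy member of the cluster (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://x.x.x.x:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the ID of the failed member by name ("infra2" is a placeholder).
	list, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var failedID uint64
	for _, m := range list.Members {
		if m.Name == "infra2" {
			failedID = m.ID
		}
	}

	// Remove the failed member, then add it back with its peer URL.
	if _, err := cli.MemberRemove(ctx, failedID); err != nil {
		log.Fatal(err)
	}
	added, err := cli.MemberAdd(ctx, []string{"https://x.x.x.x:2380"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("re-added member, new ID: %x\n", added.Member.ID)
}

After the add succeeds, the wiped member is started with --initial-cluster-state=existing, exactly as described above.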

If a member crashes but the local data is still there, then it should be OK to start the member again directly.

@stale

stale bot commented Feb 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 26, 2022
@ahrtr ahrtr closed this as completed Feb 26, 2022
@zhangguanzhang
Contributor

etcdctl member remove <memberId>
etcdctl member add <name> --peer-urls=https://x.x.x.x:2380

@M1178475702

I noticed that a panic can occur when a crashed node restarts and receives a heartbeat message from the leader despite having lost the Raft log. While I understand that an entry cannot be committed if the last log index (lastIndex) is less than the index to be committed (toCommit), I am unclear as to why the node should panic and exit instead of rejecting the heartbeat message and waiting for further heartbeat messages with lower-index entries.
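
For context, the check that produces this panic is in raft's commitTo. The logic is roughly the following (a paraphrased sketch of raft/log.go, not guaranteed to match any specific 3.5.x release line for line):

// commitTo advances the commit index. A commit index beyond the last
// locally known entry means this follower is missing entries the leader
// believes it already has, which raft treats as unrecoverable.
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}

The leader only advances a follower's commit index up to entries it believes that follower has already matched, so reaching this branch means the follower has lost entries it previously acknowledged; raft treats that as a safety violation rather than something to reject and retry.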

@M1178475702

And there is a further problem: if the leader has committed some logs while the node was crashed, then restarting the node will always fail, no matter whether the logs were lost or not.
