panic due to race between wal rotation and snapshot #14252

jasondobryk · 2022-07-21T13:52:44Z

Hi, there is a problem on several etc clusters on windows server at the customer. process есв.ехе managed by the windows service. It looks like this: the cluster is working, but if you restart the etc service on one of the servers, we will get an error:

{"level":"fatal","ts":"2022-07-21T13:17:36.221+0300","caller":"etcdserver/storage.go:95","msg":"failed to open WAL","error":"wal: file not found","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.readWAL\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/storage.go:95\ngo.etcd.io/etcd/server/v3/etcdserver.restartNode\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/raft.go:481\ngo.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:533\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

What did you expect to happen?
I expect a normal restart of the etcd process

How can we reproduce it?
so far we have not been able to repeat the error on our test servers

Etcd version
etcd.exe --version
etcd Version: 3.5.4

etcdctl.exe version
etcdctl version: 3.5.4

Etcd configuration of one of the servers:
name: ck11-apache2.test.local
initial-advertise-peer-urls: http://10.X.X.178:2380
listen-peer-urls: http://10.X.X.178:2380
listen-client-urls: http://10.X.X.178:2379,http://127.0.0.1:2379
advertise-client-urls: http://10.X.X.178:2379
initial-cluster-token: etcd-cluster-123456789
initial-cluster: ck11-apache1.test.local=http://10.X.X.177:2380,ck11-apache2.test.local=http://10.X.X.178:2380,ck11-apache-nlb.test.local=http://10.X.X.179:2381
initial-cluster-state: 'new'
max-txn-ops: 12800
heartbeat-interval: 250
election-timeout: 1250
logger: zap
log-level: info

Etcd debug information

C:\Program Files\etcd>etcdctl -w table endpoint --cluster status
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.X.X.179:2379 | 90f53ed17cd0b92f | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | |
| http://10.X.X.178:2379 | 97a661acdde4fc12 | 3.5.4 | 1.7 GB | true | false | 2 | 2048771 | 2048771 | |
| http://10.X.X.177:2379 | a82db5fe4c233dac | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

I really need help with this problem. It is still unclear how to reproduce it, what to focus on

The text was updated successfully, but these errors were encountered:

serathius · 2022-07-21T14:08:27Z

I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but provided most of the needed information. Please use it next time.

jasondobryk · 2022-07-25T13:08:01Z

I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but provided most of the needed information. Please use it next time.

According to the logs, it is unclear what specific file is missing, there are files with wal in the folder. Perhaps some specific one is missing or damaged, is it possible to understand it somehow?

jasondobryk · 2022-07-27T05:21:10Z

I found the problem. If the rotation of wal files occurs faster than the server takes a snapshot, then the next restart of the server leads to an error.

jasondobryk · 2022-07-27T05:24:57Z

It would be nice to solve this problem in the next version

serathius · 2022-07-27T08:30:07Z

I found the problem. If the rotation of wal files occurs faster than the server takes a snapshot, then the next restart of the server leads to an error.

This seems like a bug, can you provide a minimal reproduction steps?

MIVerTFT · 2022-07-27T14:29:24Z

Yes, a very strange situation. Reproduced like this.
There is one node etcd 3.5.4 running on Windows 10 config file below.

name: node1
listen-peer-urls: http://localhost:2002
listen-client-urls: http://localhost:2001
initial-advertise-peer-urls: http://localhost:2002
advertise-client-urls: http://localhost:2001
initial-cluster-token: cluster_1'
initial-cluster: node1=http://localhost:2002
initial-cluster-state: 'new'
heartbeat-interval: 250
election-timeout: 1250
max-txn-ops: 12800
auto-compaction-mode: periodic
auto-compaction-retention: 1m
max-wals: 5
snapshot-count: 10000

I run a task that continuously writes records of 200,000 bytes each.
As a result, there is a fast rotation of the wal files and after 3000-4000 transaction, the node cannot restart due to the error that the wal files are deleted and there is no snapshot.

On Windows Server 2019, the situation is a little different. wal files are not immediately deleted, they continue to accumulate until the snapshot is created, but after that the node still does not restart due to the same error.

On debian, i couldn't reproduce this situation.

serathius · 2022-07-27T14:41:58Z

Windows is currently in Tier 3 support. https://etcd.io/docs/v3.5/op-guide/supported-platform/#current-support
Meaning we just build windows binary, but we never tested if it works.

Feel free to look into the issue, however it's not unexpected that there might be problems.

The current wal file lock check does not work on windows. Fixes etcd-io#14252

The current wal file lock check does not work on windows. Fixes etcd-io#14252 Signed-off-by: mivertft <894972+MIVerTFT@users.noreply.github.com>

ahrtr · 2022-08-02T21:06:25Z

Things to check:

etcd has permission to access the WAL directory and files.
Whether there are any WAL files?

MIVerTFT · 2022-08-03T06:45:25Z

Yes, I specifically run etcd with administrator rights
Yes, there are other wall files in the data folder

MIVerTFT · 2022-08-03T08:49:24Z

I wrote a very simple test.

Create file.
Put a lock on file.
Delete file.
Close process.

On windows 10, the file is deleted after step 3, but on windows server 2019 only after step 4, while there are no errors at step 3.
This explains the difference in behavior about which I wrote above. On windows server Etcd sends a command to delete files, but they remain until the lock is remove, or until the process is killed.

ahrtr · 2022-08-08T20:43:34Z

The file being deleting after step 4 is expected behavior on Linux as well. I suspect the issue is caused by TryLockFile. It actually failed to lock the file, but the lockFile returns nil, accodingly the purge goroutine regards it as a successful lock, then it deletes the file.

Could you double check it?

MIVerTFT · 2022-08-17T09:23:42Z

Yes, that's right, LockFileEx cannot lock the file, but the lockFile method returns nil due to an incorrect condition, well, as it is written here.

xrl · 2022-08-17T21:33:04Z

I am seeing this issue as well, 3.5.4 on linux/kubernetes using the official images. We are shipping our etcd using the bitnami helm chart.

stale · 2023-08-12T07:21:48Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

jasondobryk closed this as completed Jul 27, 2022

serathius reopened this Jul 27, 2022

serathius added the type/bug label Jul 27, 2022

serathius changed the title ~~failed to open WAL~~ panic due to race between wal rotation and snapshot Jul 27, 2022

serathius added the platform/windows label Jul 27, 2022

MIVerTFT added a commit to MIVerTFT/etcd that referenced this issue Aug 1, 2022

fix bug on windows wal lock etcd-io#14252

9ca1eed

MIVerTFT added a commit to MIVerTFT/etcd that referenced this issue Aug 1, 2022

fileutil: fixed work with locks on the windows

0fb5815

The current wal file lock check does not work on windows. Fixes etcd-io#14252

MIVerTFT mentioned this issue Aug 1, 2022

fileutil: fixed work with wal files lock on the windows #14293

Closed

MIVerTFT added a commit to MIVerTFT/etcd that referenced this issue Aug 1, 2022

fileutil: fixed work with locks on the windows

f2c59f4

The current wal file lock check does not work on windows. Fixes etcd-io#14252 Signed-off-by: mivertft <894972+MIVerTFT@users.noreply.github.com>

ahrtr mentioned this issue Aug 2, 2022

Enhance the WAL file related error #14300

Merged

serathius added the release/v3.5 label Nov 15, 2022

serathius removed the platform/windows label Jan 18, 2023

serathius mentioned this issue Jan 18, 2023

Plan for v3.5.7 release #15141

Closed

serathius added the help wanted label May 10, 2023

serathius mentioned this issue May 10, 2023

Plans for v3.5.9 release #15871

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panic due to race between wal rotation and snapshot #14252

panic due to race between wal rotation and snapshot #14252

jasondobryk commented Jul 21, 2022

serathius commented Jul 21, 2022

jasondobryk commented Jul 25, 2022

jasondobryk commented Jul 27, 2022

jasondobryk commented Jul 27, 2022

serathius commented Jul 27, 2022

MIVerTFT commented Jul 27, 2022

serathius commented Jul 27, 2022

ahrtr commented Aug 2, 2022

MIVerTFT commented Aug 3, 2022

MIVerTFT commented Aug 3, 2022

ahrtr commented Aug 8, 2022

MIVerTFT commented Aug 17, 2022 •

edited

xrl commented Aug 17, 2022

stale bot commented Aug 12, 2023

panic due to race between wal rotation and snapshot #14252

panic due to race between wal rotation and snapshot #14252

Comments

jasondobryk commented Jul 21, 2022

serathius commented Jul 21, 2022

jasondobryk commented Jul 25, 2022

jasondobryk commented Jul 27, 2022

jasondobryk commented Jul 27, 2022

serathius commented Jul 27, 2022

MIVerTFT commented Jul 27, 2022

serathius commented Jul 27, 2022

ahrtr commented Aug 2, 2022

MIVerTFT commented Aug 3, 2022

MIVerTFT commented Aug 3, 2022

ahrtr commented Aug 8, 2022

MIVerTFT commented Aug 17, 2022 • edited

xrl commented Aug 17, 2022

stale bot commented Aug 12, 2023

MIVerTFT commented Aug 17, 2022 •

edited