Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic due to race between wal rotation and snapshot #14252

Open
jasondobryk opened this issue Jul 21, 2022 · 14 comments
Open

panic due to race between wal rotation and snapshot #14252

jasondobryk opened this issue Jul 21, 2022 · 14 comments

Comments

@jasondobryk
Copy link

Hi, there is a problem on several etc clusters on windows server at the customer. process есв.ехе managed by the windows service. It looks like this: the cluster is working, but if you restart the etc service on one of the servers, we will get an error:

{"level":"fatal","ts":"2022-07-21T13:17:36.221+0300","caller":"etcdserver/storage.go:95","msg":"failed to open WAL","error":"wal: file not found","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.readWAL\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/storage.go:95\ngo.etcd.io/etcd/server/v3/etcdserver.restartNode\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/raft.go:481\ngo.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:533\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

What did you expect to happen?
I expect a normal restart of the etcd process

How can we reproduce it?
so far we have not been able to repeat the error on our test servers

Etcd version
etcd.exe --version
etcd Version: 3.5.4

etcdctl.exe version
etcdctl version: 3.5.4

Etcd configuration of one of the servers:
name: ck11-apache2.test.local
initial-advertise-peer-urls: http://10.X.X.178:2380
listen-peer-urls: http://10.X.X.178:2380
listen-client-urls: http://10.X.X.178:2379,http://127.0.0.1:2379
advertise-client-urls: http://10.X.X.178:2379
initial-cluster-token: etcd-cluster-123456789
initial-cluster: ck11-apache1.test.local=http://10.X.X.177:2380,ck11-apache2.test.local=http://10.X.X.178:2380,ck11-apache-nlb.test.local=http://10.X.X.179:2381
initial-cluster-state: 'new'
max-txn-ops: 12800
heartbeat-interval: 250
election-timeout: 1250
logger: zap
log-level: info

Etcd debug information

C:\Program Files\etcd>etcdctl -w table endpoint --cluster status
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.X.X.179:2379 | 90f53ed17cd0b92f | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | |
| http://10.X.X.178:2379 | 97a661acdde4fc12 | 3.5.4 | 1.7 GB | true | false | 2 | 2048771 | 2048771 | |
| http://10.X.X.177:2379 | a82db5fe4c233dac | 3.5.4 | 1.7 GB | false | false | 2 | 2048771 | 2048771 | |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

I really need help with this problem. It is still unclear how to reproduce it, what to focus on

@serathius
Copy link
Member

I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but provided most of the needed information. Please use it next time.

@jasondobryk
Copy link
Author

I would start from "error": "wal: file not found". Note: you didn't use a proper bug template, but provided most of the needed information. Please use it next time.

According to the logs, it is unclear what specific file is missing, there are files with wal in the folder. Perhaps some specific one is missing or damaged, is it possible to understand it somehow?

@jasondobryk
Copy link
Author

I found the problem. If the rotation of wal files occurs faster than the server takes a snapshot, then the next restart of the server leads to an error.

@jasondobryk
Copy link
Author

It would be nice to solve this problem in the next version

@serathius
Copy link
Member

I found the problem. If the rotation of wal files occurs faster than the server takes a snapshot, then the next restart of the server leads to an error.

This seems like a bug, can you provide a minimal reproduction steps?

@serathius serathius reopened this Jul 27, 2022
@serathius serathius changed the title failed to open WAL panic due to race between wal rotation and snapshot Jul 27, 2022
@MIVerTFT
Copy link

Yes, a very strange situation. Reproduced like this.
There is one node etcd 3.5.4 running on Windows 10 config file below.

name: node1
listen-peer-urls: http://localhost:2002
listen-client-urls: http://localhost:2001
initial-advertise-peer-urls: http://localhost:2002
advertise-client-urls: http://localhost:2001
initial-cluster-token: cluster_1'
initial-cluster: node1=http://localhost:2002
initial-cluster-state: 'new'
heartbeat-interval: 250
election-timeout: 1250
max-txn-ops: 12800
auto-compaction-mode: periodic
auto-compaction-retention: 1m
max-wals: 5
snapshot-count: 10000

I run a task that continuously writes records of 200,000 bytes each.
As a result, there is a fast rotation of the wal files and after 3000-4000 transaction, the node cannot restart due to the error that the wal files are deleted and there is no snapshot.

On Windows Server 2019, the situation is a little different. wal files are not immediately deleted, they continue to accumulate until the snapshot is created, but after that the node still does not restart due to the same error.

On debian, i couldn't reproduce this situation.

@serathius
Copy link
Member

Windows is currently in Tier 3 support. https://etcd.io/docs/v3.5/op-guide/supported-platform/#current-support
Meaning we just build windows binary, but we never tested if it works.

Feel free to look into the issue, however it's not unexpected that there might be problems.

MIVerTFT added a commit to MIVerTFT/etcd that referenced this issue Aug 1, 2022
MIVerTFT added a commit to MIVerTFT/etcd that referenced this issue Aug 1, 2022
The current wal file lock check does not work on windows.

Fixes etcd-io#14252
MIVerTFT added a commit to MIVerTFT/etcd that referenced this issue Aug 1, 2022
The current wal file lock check does not work on windows.

Fixes etcd-io#14252

Signed-off-by: mivertft <894972+MIVerTFT@users.noreply.github.com>
@ahrtr
Copy link
Member

ahrtr commented Aug 2, 2022

Things to check:

  1. etcd has permission to access the WAL directory and files.
  2. Whether there are any WAL files?

@MIVerTFT
Copy link

MIVerTFT commented Aug 3, 2022

  1. Yes, I specifically run etcd with administrator rights
  2. Yes, there are other wall files in the data folder

@MIVerTFT
Copy link

MIVerTFT commented Aug 3, 2022

I wrote a very simple test.

  1. Create file.
  2. Put a lock on file.
  3. Delete file.
  4. Close process.

On windows 10, the file is deleted after step 3, but on windows server 2019 only after step 4, while there are no errors at step 3.
This explains the difference in behavior about which I wrote above. On windows server Etcd sends a command to delete files, but they remain until the lock is remove, or until the process is killed.

@ahrtr
Copy link
Member

ahrtr commented Aug 8, 2022

The file being deleting after step 4 is expected behavior on Linux as well. I suspect the issue is caused by TryLockFile. It actually failed to lock the file, but the lockFile returns nil, accodingly the purge goroutine regards it as a successful lock, then it deletes the file.

Could you double check it?

@MIVerTFT
Copy link

MIVerTFT commented Aug 17, 2022

Yes, that's right, LockFileEx cannot lock the file, but the lockFile method returns nil due to an incorrect condition, well, as it is written here.

@xrl
Copy link

xrl commented Aug 17, 2022

I am seeing this issue as well, 3.5.4 on linux/kubernetes using the official images. We are shipping our etcd using the bitnami helm chart.

@stale
Copy link

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants