etcdserver: wait purge file loop to exit before stopping raft #11308
Codecov Report
@@ Coverage Diff @@
## master #11308 +/- ##
==========================================
- Coverage 64.05% 63.68% -0.38%
==========================================
Files 403 403
Lines 37953 37966 +13
==========================================
- Hits 24312 24179 -133
- Misses 11980 12142 +162
+ Partials 1661 1645 -16
To prevent the purge file loop from accidentally acquiring the file lock and removing the files during server shutdown.
f428fea to c447955
cc @wenjiaswe
lgtm good catch!
@gyuho I think we should backport this fix?
@jingyih Yes, let's backport!
LGTM. Thanks for jumping on this @jingyih. Yes, please backport. Severity is high.
…8-upstream-release-3.2 Automated cherry pick of #11308 on release-3.2
…8-upstream-release-3.3 Automated cherry pick of #11308 on release-3.3
…8-upstream-release-3.4 Automated cherry pick of #11308 on release-3.4
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
To prevent the purgeFile loop from accidentally acquiring the file lock and removing wal files during server shutdown.
Keyword: "open wal error: wal: file not found"
Issue
Normally, the server's raft node holds a file lock on all the needed wal files. The file lock is only released when a new snapshot is created and the old wal files are no longer needed [1]. During server shutdown, there is a chance that the raft node stops before the purge file loop exits. On stop, the raft node closes all the wal files and therefore releases the file locks. The purge file loop might then acquire the file lock [2] and remove some wal files that are still needed by the server. When this happens, the server cannot restart due to the error:
C | etcdserver: open wal error: wal: file not found
Ref:
[1] etcd/wal/wal.go, line 731 in 84e2788
[2] etcd/pkg/fileutil/purge.go, line 51 in 84e2788
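The race described above can be illustrated with a small in-memory model. This is only a sketch: walDir, newWALDir, purgeOnce and releaseAll are hypothetical names invented for illustration, not etcd APIs, and the model assumes the purge pass walks files oldest first and stops at the first file whose lock it cannot acquire.

```go
package main

import "fmt"

// walDir is a hypothetical in-memory model of a WAL directory.
type walDir struct {
	files  []string        // file names, oldest first
	locked map[string]bool // files whose lock the raft node currently holds
}

// newWALDir creates a directory in which the raft node holds a lock on every file.
func newWALDir(names ...string) *walDir {
	d := &walDir{locked: make(map[string]bool)}
	for _, n := range names {
		d.files = append(d.files, n)
		d.locked[n] = true
	}
	return d
}

// purgeOnce models one pass of the purge loop (an assumption about its shape):
// while more than max files remain, try to lock the oldest file and remove it,
// stopping at the first file whose lock is still held.
func (d *walDir) purgeOnce(max int) []string {
	var removed []string
	for len(d.files) > max {
		oldest := d.files[0]
		if d.locked[oldest] {
			break // lock held by raft: the file is still needed
		}
		removed = append(removed, oldest)
		d.files = d.files[1:]
	}
	return removed
}

// releaseAll models the raft node closing its WAL files on stop,
// which releases every file lock.
func (d *walDir) releaseAll() {
	d.locked = make(map[string]bool)
}

func main() {
	d := newWALDir("0.wal", "1.wal", "2.wal")
	fmt.Println("removed while raft holds locks:", d.purgeOnce(1))
	d.releaseAll() // raft stops first; the purge loop is still running
	fmt.Println("removed after raft stopped:", d.purgeOnce(1))
}
```

In this model the first pass removes nothing because every file is locked, while the pass after releaseAll removes the two oldest files even though the server still needs them.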
Fix
This PR solves the issue by making sure the purge file loop exits before the server signals the raft node to stop.
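A minimal sketch of that ordering, assuming the purge loop reports its exit through a done channel. The names startPurgeLoop and shutdownInOrder are hypothetical and chosen for illustration; this is not the code in the PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// startPurgeLoop is a hypothetical stand-in for the purge file loop. It calls
// purge on every tick until stop is closed, then closes the returned channel
// to signal that the goroutine has fully exited.
func startPurgeLoop(interval time.Duration, stop <-chan struct{}, purge func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				purge()
			case <-stop:
				return
			}
		}
	}()
	return done
}

// shutdownInOrder stops the purge loop, waits for it to exit, and only then
// marks raft as stopped. It reports whether purge ever ran after that point.
func shutdownInOrder() bool {
	var (
		mu              sync.Mutex
		raftStopped     bool
		purgedAfterStop bool
	)
	stop := make(chan struct{})
	done := startPurgeLoop(time.Millisecond, stop, func() {
		mu.Lock()
		defer mu.Unlock()
		if raftStopped {
			purgedAfterStop = true
		}
	})

	time.Sleep(5 * time.Millisecond) // let the loop tick a few times
	close(stop)                      // ask the purge loop to stop
	<-done                           // wait until it has actually exited

	mu.Lock()
	raftStopped = true // only now would raft stop and release its file locks
	mu.Unlock()

	time.Sleep(5 * time.Millisecond) // a still-running loop would purge here
	mu.Lock()
	defer mu.Unlock()
	return purgedAfterStop
}

func main() {
	fmt.Println("purge ran after raft stop:", shutdownInOrder())
}
```

Waiting on the done channel, rather than only closing stop, is what closes the window: closing stop merely requests an exit, and a pass that is already in flight could otherwise still run after the locks are released.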
Reproducing the original issue
Change purgeFileInterval to a very small number (e.g. 1 * time.Nanosecond) so that the issue is easier to reproduce. Put a lot of data into the server and stop it. There is a high probability that the purge file loop will remove some of the wal files that are still needed by the server.
Example server log of a local reproduction: