Potential file integrity issue #7158
Comments
Possibly @Zelldon might be interested in this too 😅
Some work was already done here regarding the snapshot files. The atomic move is correctly dealt with, and we also flush things properly. What's left here is dealing with segment files in the journal, notably when writing the descriptor.
I spent a bit of time trying to clean up the segment loading and creation process. To durably create segments, we have to do the following:
However, each of these steps can fail due to I/O errors, and since we don't use checked exceptions, it's difficult to trace how these errors are ultimately handled. Most I/O errors are retryable; however, they may land us in a state from which we can no longer recover. So, following the order above:
This has some impact on loading existing segments, since we need to be able to distinguish between partially written segments/descriptors and real, existing but corrupted segments, so we can try to recover safely. Additionally, as I mentioned, I found it hard to trace the error handling in the journal. There are many places where we throw errors, and others where errors are thrown unexpectedly (like the …)
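One common way to tell a partially written descriptor apart from a genuinely corrupted one is to prefix the descriptor with a checksum over its body: a short read means the write never completed, while a checksum mismatch on a full-length read means corruption. A minimal Python sketch — the field layout and names here are illustrative, not Zeebe's actual descriptor format:

```python
import struct
import zlib

# Illustrative descriptor layout (NOT Zeebe's real format):
# 4-byte big-endian CRC32 over the body, then (version, segment_size) as u32s.
DESCRIPTOR_BODY = ">II"

def encode_descriptor(version: int, segment_size: int) -> bytes:
    body = struct.pack(DESCRIPTOR_BODY, version, segment_size)
    return struct.pack(">I", zlib.crc32(body)) + body

def decode_descriptor(raw: bytes):
    """Return (version, segment_size), or None if the on-disk bytes are a
    partial write or a corrupted descriptor rather than a valid one."""
    full_size = 4 + struct.calcsize(DESCRIPTOR_BODY)
    if len(raw) < full_size:
        return None  # torn write: the descriptor never made it to disk whole
    checksum, = struct.unpack_from(">I", raw)
    body = raw[4:full_size]
    if zlib.crc32(body) != checksum:
        return None  # bit rot or an interrupted overwrite
    return struct.unpack(DESCRIPTOR_BODY, body)
```

On load, a `None` result can then be handled differently depending on whether the segment was the last one being written (safe to truncate/retry) or an older one (real corruption).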
Describe the bug
Depending on the platform and file system you deploy Zeebe on, it's currently possible to run into file integrity issues, notably ones related to our use of fsync and, indirectly, potentially msync.
It seems to be a known issue that when writing a new file in a directory, you not only need to fsync the file to ensure its integrity, you also need to fsync the directory itself so that the directory entry is persisted on disk. This is corroborated by SQLite (https://www.sqlite.org/src/doc/trunk/src/os_unix.c - see unixSync), by various blog posts from LWN.net (see this comment by Postgres developer Andres Freund describing the "fsync dance") and gnome.org, and affects ext4, the file system used in Camunda Cloud. This is further implied by the manpages:
See also this talk on file system integrity - Eat My Data - How most people get I/O wrong
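The "fsync dance" for a newly created file can be sketched at the POSIX level. This is a Python sketch of the idea, not Zeebe's actual code, and the function name is illustrative:

```python
import os

def create_file_durably(path: str, data: bytes) -> None:
    """Write a new file and fsync both the file and its parent directory.

    fsyncing the file alone persists its contents, but the directory entry
    pointing at it lives in the directory's own data; without a second
    fsync on the directory, a crash can leave the file missing entirely.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # persist the file's contents and metadata
    finally:
        os.close(fd)

    # Persist the directory entry itself. POSIX systems (Linux/ext4
    # included) allow opening a directory read-only just to fsync it.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the second fsync, the file's blocks may be durable while the directory entry referencing them is not, which is exactly the failure mode described above.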
Additionally, there are concerns related to the rename syscall and integrity: it may be necessary to fsync after a rename, both the files involved and the containing directory. The ext4 documentation references this "broken" behavior (their words 😅) regarding the rename operation, and notes that a specific configuration option is needed to avoid zero-length files; see the reference to auto_da_alloc, i.e.:

To Reproduce
These kinds of issues are very hard to reproduce, as we have poor control over the various buffers and flush mechanisms. Is the data in the JVM's I/O buffers (if using buffered I/O)? In some intermediate library buffer? In the OS page cache? In the on-disk/disk driver buffers?
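To make those layers concrete, here is a small Python sketch of where data can sit at each step; it is illustrative of the buffering stages, not of how Zeebe performs I/O:

```python
import os

def write_through_layers(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)          # 1. data may sit in a userspace buffer
        f.flush()              # 2. hand it to the kernel's page cache
        os.fsync(f.fileno())   # 3. ask the kernel to push it to the device
    # Even after fsync returns, a disk or driver cache that lies about
    # write completion can still hold the data, which is why this is so
    # hard to test reliably from application code.
```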
If anyone knows how to reproduce these or test these things, I'd be very happy to hear about them.
Expected behavior
We should strive to ensure we never lose data and always maintain data integrity, especially while we still have only manual recovery procedures. So persisting a snapshot should be an atomic operation, and the same goes for creating and writing a segment.
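The usual pattern for an atomic, durable replacement is to write to a temporary file, fsync it, rename it over the target, then fsync the directory so the rename itself is persisted. A Python sketch of that sequence under the POSIX semantics discussed above (the function and file names are illustrative):

```python
import os

def replace_file_atomically(path: str, data: bytes) -> None:
    """Atomically replace `path` with new contents via the rename dance."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # 1. flush the new contents BEFORE the rename,
                           #    or a crash can leave a zero-length file
    finally:
        os.close(fd)

    os.rename(tmp, path)   # 2. atomic within a single file system

    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)   # 3. persist the rename (the directory entry)
    finally:
        os.close(dir_fd)
```

Readers either see the complete old file or the complete new one, never a partial write, which is the property we want for snapshots and segment descriptors.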
Additional context
Here are some more interesting sources on file integrity, useful for figuring out how far we are from implementing it properly:
Environment: