
Potential file integrity issue #7158

Closed
npepinpe opened this issue Jun 1, 2021 · 5 comments · Fixed by #7496
Assignees
Labels
  • kind/bug - Categorizes an issue or PR as a bug
  • scope/broker - Marks an issue or PR to appear in the broker section of the changelog
  • severity/high - Marks a bug as having a noticeable impact on the user with no known workaround

Comments

@npepinpe
Member

npepinpe commented Jun 1, 2021

Describe the bug

Depending on the platform and file system Zeebe is deployed on, it's currently possible to run into file integrity issues, notably ones related to our usage of fsync and, indirectly, potentially msync.

It seems to be a known issue that when writing a new file in a directory, fsync-ing the file alone is not enough to ensure integrity: you also need to fsync the directory so that the directory entry itself is persisted on disk. This is corroborated by SQLite (https://www.sqlite.org/src/doc/trunk/src/os_unix.c - see unixSync), by various blog posts on LWN.net (see this comment by Postgres developer Andres Freund describing the "fsync dance") and gnome.org, and it affects ext4, which is the file system used in Camunda Cloud. This is further implied by the man pages:

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

See also this talk on file system integrity - Eat My Data - How most people get I/O wrong
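
For reference, here is a minimal sketch of how a directory can be flushed from Java on Linux. The assumption is that the platform allows opening a directory as a read-only FileChannel (which holds on Linux and is the trick used by e.g. Lucene); the class and method names are just placeholders:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class DirectoryFsync {
      // Flushes the directory entry of a newly created (and already fsync-ed) file.
      // On Linux, force(true) on a directory channel translates to fsync(2) on the
      // directory's file descriptor; on other platforms opening the directory may fail.
      static void fsyncDirectory(final Path directory) throws IOException {
        try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ)) {
          channel.force(true);
        }
      }
    }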

Additionally, there are also concerns related to the rename syscall and integrity: it may be necessary to fsync both the files and the directory after a rename. Reading the ext4 documentation, you can find a reference to this "broken" behavior (their words 😅) around the rename operation, and to the specific mount option you need to avoid zero-length files. See the reference to auto_da_alloc, i.e. (a sketch of a safer replace-via-rename sequence follows the excerpt):

auto_da_alloc(*)	Many broken applications don't use fsync() when 
noauto_da_alloc	replacing existing files via patterns such as
			fd = open("foo.new")/write(fd,..)/close(fd)/
			rename("foo.new", "foo"), or worse yet,
			fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
			If auto_da_alloc is enabled, ext4 will detect
			the replace-via-rename and replace-via-truncate
			patterns and force that any delayed allocation
			blocks are allocated such that at the next
			journal commit, in the default data=ordered
			mode, the data blocks of the new file are forced
			to disk before the rename() operation is
			committed.  This provides roughly the same level
			of guarantees as ext3, and avoids the
			"zero-length" problem that can happen when a
			system crashes before the delayed allocation
			blocks are forced to disk.
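
For illustration, a hedged sketch of the full replace-via-rename sequence (the "fsync dance") in Java; the file names are placeholders, and ATOMIC_MOVE support depends on the underlying file system:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    final class AtomicReplace {
      // Replaces `target` with new contents without ever exposing a zero-length file.
      static void replace(final Path target, final byte[] contents) throws IOException {
        final Path tmp = target.resolveSibling(target.getFileName() + ".tmp");

        // 1. write the new contents to a temporary file and fsync it
        try (FileChannel channel =
            FileChannel.open(
                tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
          channel.write(ByteBuffer.wrap(contents));
          channel.force(true);
        }

        // 2. atomically rename the temporary file over the target
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        // 3. fsync the parent directory so the rename itself is durable
        try (FileChannel dir = FileChannel.open(target.getParent(), StandardOpenOption.READ)) {
          dir.force(true);
        }
      }
    }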

To Reproduce

These kinds of issues are very hard to reproduce, as we have poor control over the various buffers and flush mechanisms. Is the data in the JVM I/O buffers (if using buffered I/O)? Is it in some intermediate library buffer? In the OS buffers? In the on-disk/disk driver buffers?

If anyone knows how to reproduce or test these things, I'd be very happy to hear about it.

Expected behavior

We should strive to ensure we never lose data and maintain data integrity, especially while we still only have manual recovery procedures. Persisting a snapshot should therefore be an atomic operation, and the same goes for creating/writing a segment.

Additional context

Here are some more interesting sources on file integrity to read up on, to figure out how far we are from implementing it properly:

Environment:

  • OS: Linux
  • Zeebe Version: 1.0.0
  • Configuration: default
@npepinpe npepinpe added the kind/bug Categorizes an issue or PR as a bug label Jun 1, 2021
@npepinpe
Member Author

npepinpe commented Jun 1, 2021

/cc @MiguelPires @deepthidevaki

@npepinpe npepinpe added Impact: Data scope/broker Marks an issue or PR to appear in the broker section of the changelog labels Jun 1, 2021
@npepinpe
Member Author

npepinpe commented Jun 1, 2021

Possibly @Zelldon might be interested in this too 😅

@npepinpe npepinpe added the severity/high Marks a bug as having a noticeable impact on the user with no known workaround label Jun 2, 2021
@npepinpe npepinpe added this to Ready in Zeebe Jun 2, 2021
@Zelldon
Member

Zelldon commented Jun 7, 2021

Thanks @npepinpe, interesting read 👍

Regarding

                        avoids the
			"zero-length" problem that can happen when a
			system crashes before the delayed allocation
			blocks are forced to disk.

Might this be related to #6979?

@npepinpe npepinpe self-assigned this Jun 30, 2021
@npepinpe npepinpe moved this from Ready to In progress in Zeebe Jul 2, 2021
@npepinpe
Member Author

npepinpe commented Jul 2, 2021

Some work was already done here regarding the snapshot files. The atomic move is correctly dealt with, and we also flush things properly. What's left here is dealing with segment files in the journal, notably when writing the descriptor.

@npepinpe
Member Author

npepinpe commented Jul 12, 2021

I spent a bit of time trying to clean up the segment loading and creation process. To durably create segments, we have to do the following (see the sketch after this list):

  1. Create the new segment file
  2. Write the descriptor
  3. Flush the descriptor
  4. Flush the journal directory to ensure the segment file entry is present

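A minimal sketch of those four steps using NIO (the method and class names are hypothetical, not the actual journal API; the descriptor is assumed to be a pre-filled buffer):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class SegmentCreationSketch {
      static void createSegment(final Path segmentFile, final ByteBuffer descriptor)
          throws IOException {
        // 1. create the new segment file; CREATE_NEW fails if it already exists,
        //    which is exactly where the recovery questions below come in
        try (FileChannel channel =
            FileChannel.open(segmentFile, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
          // 2. write the descriptor at the start of the segment
          channel.write(descriptor, 0);
          // 3. flush the descriptor (and the file's metadata) to disk
          channel.force(true);
        }
        // 4. flush the journal directory so the new directory entry is durable
        try (FileChannel dir = FileChannel.open(segmentFile.getParent(), StandardOpenOption.READ)) {
          dir.force(true);
        }
      }
    }
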
However, each of these steps can fail due to I/O errors, and since we don't use checked exceptions, it's difficult to trace how these errors are ultimately handled. Most I/O errors are retryable; however, they may land us in a state from which we cannot recover anymore. So, following the order above:

  • Say we fail on step 1; nothing to clean up, we can just retry, easy peasy 🍋 zeebe
  • Say we fail on step 2; we have to delete the segment file, otherwise on retry we have to decide what to do: should we overwrite an existing segment (since we think we should be creating it), or should we try to load it? We cannot load it since we failed to write the descriptor, so this will always fail. Our only option is to overwrite it - but we first need to distinguish between a segment where we failed to write the descriptor, and one which is really corrupted. So if we can determine that the segment was just garbage, then we can retry the operation correctly. Keep in mind that even if we delete the file on failure when writing the descriptor, deleting can also fail, so we still need to handle the case where we have an empty segment file, or one with a partial descriptor.
  • Say we fail on step 3; we cannot guarantee the segment data is on disk. We could go on anyway, since we will flush the next time we append (on a follower) or commit (on the leader). However, if that flush were also to fail, the OS isn't guaranteed to flush pages sequentially, so we could end up with a weird, partially flushed segment later on. That may be OK (since we wouldn't acknowledge on the follower if flushing failed, and we wouldn't increase the commit index on the leader if flushing failed), but to be safe we should just always flush sequentially.
  • Say we fail on step 4; we cannot guarantee the file will be visible after a crash. This is actually totally fine as far as the descriptor is concerned (if we only wrote the descriptor, then who cares - we can just recreate the segment on recovery), but it's a problem if we actually flushed some data blocks - then we've lost those! So while we don't need to roll back here, we do need to retry, even if only the directory flush. But how do we determine that that's the only missing thing? On retry, the file will be visible to our process (remember, it's just not guaranteed to survive a crash), so we can't distinguish it from an existing or a new segment. The safest approach is probably to detect that the file contains only the descriptor but no data, at which point we can verify that the descriptor is as expected (or just throw it away and recreate it all), and then flush the directory (a rough sketch of such a check follows this list).

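As a rough illustration of that last check, a sketch under the assumption that the descriptor has a known fixed length and that unwritten regions of the file read back as zeroes (both are assumptions here, not the actual segment format):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class SegmentRecoveryCheck {
      // Returns true if the segment file contains nothing beyond its descriptor,
      // i.e. every byte after the (hypothetical) fixed-size descriptor is zero.
      // In that case it is safe to recreate the segment and re-flush the directory.
      static boolean containsOnlyDescriptor(final Path segmentFile, final int descriptorLength)
          throws IOException {
        try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
          final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);
          long position = descriptorLength;
          while (position < channel.size()) {
            buffer.clear();
            final int read = channel.read(buffer, position);
            if (read <= 0) {
              break;
            }
            buffer.flip();
            while (buffer.hasRemaining()) {
              if (buffer.get() != 0) {
                return false; // found data past the descriptor
              }
            }
            position += read;
          }
          return true;
        }
      }
    }
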
This has some impact on loading existing segments, since we need to be able to distinguish partially written segments/descriptors from real, existing but corrupted segments, so we can try to recover safely.

Additionally, as I mentioned, I found it hard to trace the error handling in the journal. There are many places where we throw errors, and others where errors are thrown unexpectedly (like the InternalError on mapped buffer operations). I think we could try using the Either type here to help with tracing and to improve the general error handling (a rough sketch follows below). I would also propose splitting the segment loading/creation off into its own class and feeding an interface to the SegmentedJournal, mostly to improve testability of the journal implementation.
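
As a starting point, a hypothetical sketch of what an Either-based segment allocation API could look like; the Either type and the interface names are illustrative only, not the existing journal code:

    import java.io.IOException;
    import java.util.function.Function;

    // Minimal Either type for illustration: Left carries the error, Right the result.
    interface Either<L, R> {
      <T> T fold(Function<L, T> onError, Function<R, T> onSuccess);

      static <L, R> Either<L, R> left(final L error) {
        return new Either<L, R>() {
          @Override
          public <T> T fold(final Function<L, T> onError, final Function<R, T> onSuccess) {
            return onError.apply(error);
          }
        };
      }

      static <L, R> Either<L, R> right(final R value) {
        return new Either<L, R>() {
          @Override
          public <T> T fold(final Function<L, T> onError, final Function<R, T> onSuccess) {
            return onSuccess.apply(value);
          }
        };
      }
    }

    // Returning Either makes I/O failures explicit at the call site instead of
    // surfacing as unchecked exceptions deep inside the journal.
    interface SegmentAllocator {
      Either<IOException, Segment> createSegment(long segmentId);
    }

    // Placeholder for the real segment abstraction.
    interface Segment {}

A caller could then write something like allocator.createSegment(id).fold(this::handleIoError, this::onSegmentReady), which makes the failure path visible and testable at the point of use.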

@npepinpe npepinpe moved this from In progress to Review in progress in Zeebe Jul 14, 2021
@ghost ghost closed this as completed in 97dd956 Jul 19, 2021
Zeebe automation moved this from Review in progress to Done Jul 19, 2021
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.