
Potential file integrity issue #7158

Closed
npepinpe opened this issue Jun 1, 2021 · 5 comments · Fixed by #7496
Assignees
Labels
  • kind/bug - Categorizes an issue or PR as a bug
  • scope/broker - Marks an issue or PR to appear in the broker section of the changelog
  • severity/high - Marks a bug as having a noticeable impact on the user with no known workaround

Comments

@npepinpe
Member

npepinpe commented Jun 1, 2021

Describe the bug

Depending on the platform and file system Zeebe is deployed on, it's currently possible to run into file integrity issues, notably ones related to our usage of fsync and, indirectly, potentially msync.

It seems to be a known issue that when writing a new file in a directory, fsync-ing the file alone is not enough to ensure integrity: you also need to fsync the directory so that the directory entry itself is persisted on disk. This is corroborated by SQLite (https://www.sqlite.org/src/doc/trunk/src/os_unix.c - see unixSync), by various blog posts on LWN.net (see this comment by Postgres developer Andres Freund describing the "fsync dance") and gnome.org, and it affects ext4, which is the file system used in Camunda Cloud. This is further implied by the man pages:

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

See also this talk on file system integrity - Eat My Data - How most people get I/O wrong
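
For reference, here is a minimal sketch of how a directory can be flushed from Java on Linux. The assumption is that the platform allows opening a directory as a read-only FileChannel (which holds on Linux and is the trick used by e.g. Lucene); the class and method names are just placeholders:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class DirectoryFsync {
      // Flushes the directory entry of a newly created (and already fsync-ed) file.
      // On Linux, force(true) on a directory channel translates to fsync(2) on the
      // directory's file descriptor; on other platforms opening the directory may fail.
      static void fsyncDirectory(final Path directory) throws IOException {
        try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ)) {
          channel.force(true);
        }
      }
    }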

Additionally, there are also concerns related to the rename syscall and integrity: it may be necessary to fsync both the files and the directory after a rename. Reading the ext4 documentation, you can find a reference to this "broken" behavior (their words 😅) around the rename operation, and to the specific mount option you need to avoid zero-length files. See the reference to auto_da_alloc, i.e. (a sketch of a safer replace-via-rename sequence follows the excerpt):

auto_da_alloc(*)	Many broken applications don't use fsync() when 
noauto_da_alloc	replacing existing files via patterns such as
			fd = open("foo.new")/write(fd,..)/close(fd)/
			rename("foo.new", "foo"), or worse yet,
			fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
			If auto_da_alloc is enabled, ext4 will detect
			the replace-via-rename and replace-via-truncate
			patterns and force that any delayed allocation
			blocks are allocated such that at the next
			journal commit, in the default data=ordered
			mode, the data blocks of the new file are forced
			to disk before the rename() operation is
			committed.  This provides roughly the same level
			of guarantees as ext3, and avoids the
			"zero-length" problem that can happen when a
			system crashes before the delayed allocation
			blocks are forced to disk.
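
For illustration, a hedged sketch of the full replace-via-rename sequence (the "fsync dance") in Java; the file names are placeholders, and ATOMIC_MOVE support depends on the underlying file system:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    final class AtomicReplace {
      // Replaces `target` with new contents without ever exposing a zero-length file.
      static void replace(final Path target, final byte[] contents) throws IOException {
        final Path tmp = target.resolveSibling(target.getFileName() + ".tmp");

        // 1. write the new contents to a temporary file and fsync it
        try (FileChannel channel =
            FileChannel.open(
                tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
          channel.write(ByteBuffer.wrap(contents));
          channel.force(true);
        }

        // 2. atomically rename the temporary file over the target
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        // 3. fsync the parent directory so the rename itself is durable
        try (FileChannel dir = FileChannel.open(target.getParent(), StandardOpenOption.READ)) {
          dir.force(true);
        }
      }
    }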

To Reproduce

These kinds of issues are very hard to reproduce, as we have poor control over the various buffers and flush mechanisms. Is the data in the JVM I/O buffers (if using buffered I/O)? Is it in some intermediate library buffer? In the OS buffers? In the on-disk/disk driver buffers?

If anyone knows how to reproduce or test these things, I'd be very happy to hear about it.

Expected behavior

We should strive to ensure we never lose data and maintain data integrity, especially while we still only have manual recovery procedures. Persisting a snapshot should therefore be an atomic operation, and the same goes for creating/writing a segment.

Additional context

Here are some more interesting sources on file integrity to read up on, to figure out how far we are from implementing it properly:

Environment:

  • OS: Linux
  • Zeebe Version: 1.0.0
  • Configuration: default
@npepinpe npepinpe added the kind/bug Categorizes an issue or PR as a bug label Jun 1, 2021
@npepinpe
Member Author

npepinpe commented Jun 1, 2021

/cc @MiguelPires @deepthidevaki

@npepinpe npepinpe added Impact: Data scope/broker Marks an issue or PR to appear in the broker section of the changelog labels Jun 1, 2021
@npepinpe
Member Author

npepinpe commented Jun 1, 2021

Possibly @Zelldon might be interested in this too 😅

@npepinpe npepinpe added the severity/high Marks a bug as having a noticeable impact on the user with no known workaround label Jun 2, 2021
@npepinpe npepinpe added this to Ready in Zeebe Jun 2, 2021
@Zelldon
Member

Zelldon commented Jun 7, 2021

Thanks @npepinpe, interesting read 👍

Regarding

                        avoids the
			"zero-length" problem that can happen when a
			system crashes before the delayed allocation
			blocks are forced to disk.

Might this be related to #6979?

@npepinpe npepinpe self-assigned this Jun 30, 2021
@npepinpe npepinpe moved this from Ready to In progress in Zeebe Jul 2, 2021
@npepinpe
Member Author

npepinpe commented Jul 2, 2021

Some work was already done here regarding the snapshot files. The atomic move is correctly dealt with, and we also flush things properly. What's left here is dealing with segment files in the journal, notably when writing the descriptor.

@npepinpe
Member Author

npepinpe commented Jul 12, 2021

I spent a bit of time trying to clean up the segment loading and creation process. To durably create segments, we have to do the following (see the sketch after this list):

  1. Create the new segment file
  2. Write the descriptor
  3. Flush the descriptor
  4. Flush the journal directory to ensure the segment file entry is present

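A minimal sketch of those four steps using NIO (the method and class names are hypothetical, not the actual journal API; the descriptor is assumed to be a pre-filled buffer):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class SegmentCreationSketch {
      static void createSegment(final Path segmentFile, final ByteBuffer descriptor)
          throws IOException {
        // 1. create the new segment file; CREATE_NEW fails if it already exists,
        //    which is exactly where the recovery questions below come in
        try (FileChannel channel =
            FileChannel.open(segmentFile, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
          // 2. write the descriptor at the start of the segment
          channel.write(descriptor, 0);
          // 3. flush the descriptor (and the file's metadata) to disk
          channel.force(true);
        }
        // 4. flush the journal directory so the new directory entry is durable
        try (FileChannel dir = FileChannel.open(segmentFile.getParent(), StandardOpenOption.READ)) {
          dir.force(true);
        }
      }
    }
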
However, each of these steps can fail due to I/O errors, and since we don't use checked exceptions, it's difficult to trace how these errors are ultimately handled. Most I/O errors are retryable; however, they may land us in a state from which we cannot recover anymore. So, following the order above:

  • Say we fail on step 1; nothing to clean up, we can just retry, easy peasy 🍋 zeebe
  • Say we fail on step 2; we have to delete the segment file, otherwise on retry we have to decide what to do: should we overwrite an existing segment (since we think we should be creating it), or should we try to load it? We cannot load it since we failed to write the descriptor, so this will always fail. Our only option is to overwrite it - but we first need to distinguish between a segment where we failed to write the descriptor, and one which is really corrupted. So if we can determine that the segment was just garbage, then we can retry the operation correctly. Keep in mind that even if we delete the file on failure when writing the descriptor, deleting can also fail, so we still need to handle the case where we have an empty segment file, or one with a partial descriptor.
  • Say we fail on step 3; we cannot guarantee the segment data is on disk. We could go on anyway, since we will flush the next time we append (on a follower) or commit (on the leader). However, if that flush were also to fail, the OS isn't guaranteed to flush pages sequentially, so we could end up with a weird, partially flushed segment later on. That may be OK (since we wouldn't acknowledge on the follower if flushing failed, and we wouldn't increase the commit index on the leader if flushing failed), but to be safe we should just always flush sequentially.
  • Say we fail on step 4; we cannot guarantee the file will be visible after a crash. This is actually totally fine as far as the descriptor is concerned (if we only wrote the descriptor, then who cares - we can just recreate the segment on recovery), but it's a problem if we actually flushed some data blocks - then we've lost those! So while we don't need to roll back here, we do need to retry, even if only the directory flush. But how do we determine that that's the only missing thing? On retry, the file will be visible to our process (remember, it's just not guaranteed to survive a crash), so we can't distinguish it from an existing or a new segment. The safest approach is probably to detect that the file contains only the descriptor but no data, at which point we can verify that the descriptor is as expected (or just throw it away and recreate it all), and then flush the directory (a rough sketch of such a check follows this list).

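As a rough illustration of that last check, a sketch under the assumption that the descriptor has a known fixed length and that unwritten regions of the file read back as zeroes (both are assumptions here, not the actual segment format):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    final class SegmentRecoveryCheck {
      // Returns true if the segment file contains nothing beyond its descriptor,
      // i.e. every byte after the (hypothetical) fixed-size descriptor is zero.
      // In that case it is safe to recreate the segment and re-flush the directory.
      static boolean containsOnlyDescriptor(final Path segmentFile, final int descriptorLength)
          throws IOException {
        try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
          final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);
          long position = descriptorLength;
          while (position < channel.size()) {
            buffer.clear();
            final int read = channel.read(buffer, position);
            if (read <= 0) {
              break;
            }
            buffer.flip();
            while (buffer.hasRemaining()) {
              if (buffer.get() != 0) {
                return false; // found data past the descriptor
              }
            }
            position += read;
          }
          return true;
        }
      }
    }
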
This has some impact on loading existing segments, since we need to be able to distinguish partially written segments/descriptors from real, existing but corrupted segments, so we can try to recover safely.

Additionally, as I mentioned, I found it hard to trace the error handling in the journal. There are many places where we throw errors, and others where errors are thrown unexpectedly (like the InternalError on mapped buffer operations). I think we could try using the Either type here to help with tracing and to improve the general error handling (a rough sketch follows below). I would also propose splitting the segment loading/creation off into its own class and feeding an interface to the SegmentedJournal, mostly to improve testability of the journal implementation.
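
As a starting point, a hypothetical sketch of what an Either-based segment allocation API could look like; the Either type and the interface names are illustrative only, not the existing journal code:

    import java.io.IOException;
    import java.util.function.Function;

    // Minimal Either type for illustration: Left carries the error, Right the result.
    interface Either<L, R> {
      <T> T fold(Function<L, T> onError, Function<R, T> onSuccess);

      static <L, R> Either<L, R> left(final L error) {
        return new Either<L, R>() {
          @Override
          public <T> T fold(final Function<L, T> onError, final Function<R, T> onSuccess) {
            return onError.apply(error);
          }
        };
      }

      static <L, R> Either<L, R> right(final R value) {
        return new Either<L, R>() {
          @Override
          public <T> T fold(final Function<L, T> onError, final Function<R, T> onSuccess) {
            return onSuccess.apply(value);
          }
        };
      }
    }

    // Returning Either makes I/O failures explicit at the call site instead of
    // surfacing as unchecked exceptions deep inside the journal.
    interface SegmentAllocator {
      Either<IOException, Segment> createSegment(long segmentId);
    }

    // Placeholder for the real segment abstraction.
    interface Segment {}

A caller could then write something like allocator.createSegment(id).fold(this::handleIoError, this::onSegmentReady), which makes the failure path visible and testable at the point of use.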

@npepinpe npepinpe moved this from In progress to Review in progress in Zeebe Jul 14, 2021
@ghost ghost closed this as completed in 97dd956 Jul 19, 2021
Zeebe automation moved this from Review in progress to Done Jul 19, 2021
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.