Pre-allocate files to prevent no space errors during flush #7607

deepthidevaki · 2021-08-06T14:00:27Z

Since we memory mapped files, the files are not pre-allocated. When the buffer is flush there is no guarantee that there is enough space in disk. This can lead to errors which are hard to diagnose. We have seen before that it returns "internalError" which is not really useful to know the root cause.

To prevent such cases, we can pre-allocate the segment when it is created. It is like reserving that space for the file.

Followup #6221

npepinpe · 2021-10-18T08:45:43Z

Prioritized up as it's part of Ole's onboarding plan.

lenaschoenburg · 2021-10-18T15:27:16Z

Writing a test for pre-allocation will be difficult for two reasons:

Testing that pre-allocation works as expected

To test that pre-allocation works as expected we need to verify that the on-disk size of the segment file is as expected. Using Files.size(...) does not work because the OS can lie to us and report the file size as expected when in reality the file is sparse and does not use disk space. This is verifiable (on linux) by looking at a segment file with ls -lhs segmentFile. On filesystems that support sparse files the reported number of allocated 512 byte blocks will not match the reported file size.
We could write a test that uses linux syscalls to test the "real" file size but the same test would not work on macOS because macOS with APFS does not use sparse files in this case.

@npepinpe Do you think it is okay to have a platform specific tests for this? It'd only work on linux and not on all filesystems.

Testing that pre-allocation can fail

As discussed with @npepinpe, testing this is difficult because it relies on the filesystem to be full. We could, in theory, create this condition by using docker but this comes with considerable overhead and we could only check that Zeebe fails, not why. We decided that it is okay if we don't test this condition.

npepinpe · 2021-10-18T15:28:58Z

Let's check in tomorrow - I'm having second thoughts as to whether or not we should do this now instead of proceeding with the Java 17 migration, especially because writing the test I think will be more difficult/complex than expected.

lenaschoenburg · 2021-10-18T15:33:10Z

Okay, let's discuss this in detail. Just as a note to myself: We could also do this on a best-effort basis where we try to pre-allocate on systems which support fallocate but make no promises to do so. This would improve the situation in most cases without over-complicating things by trying to support every OS/FS.

Zelldon · 2022-06-23T07:39:02Z

Did someone already tried that and can share some references or code? Might be easier to start with.

npepinpe · 2022-06-23T08:37:27Z

I did, and I think Ole might have as well, but I didn't save anything. There's two portable ways of pre-allocating files: using a template file (so create a temporary blank file of the right size at startup/the first time, and copy it every other time), or using the same trick Agrona does of mmap'ing the file, then at every 4Kb page, setting one byte to 0, then flushing the whole thing at the end. The fast/non-portable way would be to use jni-ffr and LibC, and a simple call to fallocate. It's pretty easy to do:

public interface LibC {
  int fallocate(int fd, int mode, int offset, int len);
}

public static void main(String[] args) throws Exception {
  final var libc = LibraryLoader.create(LibC.class).load("c");
  try(final var file = new RandomAccessFile("/path/to/file", "rw")) {
    final var fd = MethodHandles.lookup().findVarHandle(int.class, "fd", FileDescriptor.getClass()).get(file.getFD());
    libc.fallocate(fd, 0, 0, 1024 * 1024 * 1024); // allocates a 1MB file
  }
}

Maybe there's a way to get the file descriptor without reflection, but I'm not aware of it OTOH :(

npepinpe · 2022-06-23T10:31:35Z

It seems there's a SharedSecrets class which you can use to get the file descriptor, but it is an internal class...

/** A repository of "shared secrets", which are a mechanism for
    calling implementation-private methods in another package without
    using reflection. A package-private class implements a public
    interface and provides the ability to call package-private methods
    within that package; the object implementing that interface is
    provided through a third package to which access is restricted.
    This framework avoids the primary disadvantage of using reflection
    for this purpose, namely the loss of compile-time checking. */

public class SharedSecrets {
...
}

However this is not exported by default, so you'd need to add --add-exports java.base/jdk.internal.access=ALL-UNNAMED at compile time. So I don't know if it's much nicer...

Anyway, if you do have it, you can do SharedSecrets.getJavaIOFileDescriptorAccess().get(new RandomFile("path", "rw").getFD()) and no reflection is involved. That's what the JDK uses internally 🤷

npepinpe · 2022-06-24T19:21:47Z

For testability, I think it might make sense to combine this issue with #7604

I would also omit using FFI to bind to libc for now. We can look into pre-allocating sparse files in a second iteration as a performance optimization.

lenaschoenburg · 2022-06-27T03:16:50Z

Are template files actually a portable and reliable mechanism? I'd imagine that for copy-on-write filesystems (btrfs, zfs) this will not have any effect.

npepinpe · 2022-06-27T08:04:53Z

To be portable you need to really write the other files, using something like FileChannel.transferTo. This is potentially less efficient but still relatively OK since it doesn't go through the JVM. That said, we can add optimizations after, I.e fallocate. It seems even RandomFile setLength might work for our Docker image, as setLength maps to ftruncate, which on ext3/ext4 does allocate when the size is greater. But fallocate is probably the best optimization for Linux deployments in general. It's just the error handling is awkward in Java (I.e. handling errno).

npepinpe · 2022-07-04T18:13:45Z

Alright, so I did a quick prototype today, and I would propose the following:

Add a allocate(Path, long) call to FileUtil, which pre-allocates files. Call this in SegmentedJournal#createSegment.
Initially, the implementation just zeroes the file by writing one byte to every page, then flushing. We can use Agrona's IoUtil#fill(FileChannel, long, long, byte) method for this. This is not efficient, much less than say using a template file. However, it's extremely simple, much more than using the template file. As we want to use this as a fallback for non Unix systems, and we're not creating segments every second, I think we could use this.
Second iteration adds a mapping to posix_fallocate using JNA. While jnr-ffi is technically more performant since it produces byte code at compile time (where JNA uses proxies/reflection), it's less documented, has a smaller community, and less convenience utilities (such as when dealing with C's errno when doing I/O). Plus, our allocate method will not be in the hot path, so I don't think the reflection overhead of JNA will be a big deal versus the compile time byte code generation from jnr-ffi.

I did a prototype already, which you can see #9688. I ran into some weird issues:

In order to get the raw file descriptor to use with the posix_fallocate binding, I need to use reflection and grab out from a FileDescriptor. Unfortunately since Java 11, this is not possible unless we add an --add-opens of the java.io module. This can be done via a manifest entry for the util module, but IntelliJ won't pick it up when you run the tests, so I add to add it to the surefire and failsafe configurations in the parent/pom.xml. I don't like it, but I don't really see how else to get the file descriptor other than use fopen directly (which could be an option, but the error handling would be more complicated then).
Testing coverage isn't super high. Essentially I use a fake LibC implementation and return the errors that I want, but it's not doing any integration test other than the happy path. The only thing I know which could manipulate the file system is https://github.com/scylladb/charybdefs, but there's only Python bindings, and adding the Java bindings as part of this PR seems overkill. That said, it might be useful for us to have these bindings for other types of tests, so who knows, maybe it's worth the investment later.
I added a feature flag to disable the optimization, but it's pretty hacky since FileUtil is a static utility class, so I can't really inject any feature flags. I think right now it's bad, but maybe one option would be to have the feature flags passed to the journal, and then the journal (or the atomix factory) keeps track of the native FS to use which can be passed to FileUtil afterwards. Still not super great though...

Anyway, for the real implementation, I would split this into 2 PRs - adding the allocation with the "dumb" way, and then another one to add the C binding to posix_fallocate. The feature flag could also be added as third PR if the second one is too large.

EDIT: jnr-ffi might be a better option in the end. While it has a smaller community (though it's arguably used by many large projects), it has just better support to variable-width types (e.g. off_t), and if we go with posix_fallocate then we don't need complex errno handling.

npepinpe · 2022-07-05T14:08:29Z

As discussed, we'll start with a first PR which adds a global feature flag for pre-allocation, and uses IoUtil.fill to pre-allocate.

A second PR can add an optimization to use posix_fallocate where possible, which is automatically disabled if it fails any time during runtime and will reuse the fallback.

9731: Preallocate segment files r=npepinpe a=npepinpe ## Description This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option. At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk. One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file. I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this. A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems. Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here. ## Related issues closes #6504 closes #8099 related to #7607 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>

9834: Restructure journal module r=npepinpe a=npepinpe ## Description This PR mostly restructures the journal module to better reflect its usage. Instead of the main API in the root package and everything else under the `file` package, we now have: ``` root <--- journal API and common classes meant to be consumed outside | |-- file <--- persistence layer, i.e. segmented journal implementation |-- record <-- record (de)serialization and manipulation |-- util <-- internal utilities (only ChecksumGenerator for now) ``` We could alternatively think about moving `file`, `record`, and `util` under a `impl` package as well to clearly indicate these aren't meant to be used outside of the module. This came out of #7607. ## Related issues related to #7607 Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>

deepthidevaki added the kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. label Aug 6, 2021

npepinpe added this to Planned in Zeebe Aug 9, 2021

npepinpe mentioned this issue Sep 8, 2021

InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code #6504

Closed

npepinpe mentioned this issue Oct 8, 2021

Handle i/o exceptions better in raft #7283

Closed

npepinpe assigned lenaschoenburg Oct 18, 2021

npepinpe moved this from Planned to Ready in Zeebe Oct 18, 2021

npepinpe moved this from Ready to Planned in Zeebe Oct 19, 2021

npepinpe unassigned lenaschoenburg Oct 19, 2021

npepinpe mentioned this issue Nov 1, 2021

SIGBUS error with 1.2.2 #8099

Closed

npepinpe moved this from Planned to Ready in Zeebe Dec 12, 2021

npepinpe mentioned this issue Jan 3, 2022

Always exit on unrecoverable VM errors #8327

Merged

9 tasks

KerstinHebel removed this from Ready in Zeebe Mar 23, 2022

npepinpe added team/distributed and removed target/8.1 labels Mar 29, 2022

npepinpe added the area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) label Apr 20, 2022

npepinpe self-assigned this Jun 24, 2022

npepinpe mentioned this issue Jul 4, 2022

Pre-allocate segment files #9688

Closed

10 tasks

npepinpe mentioned this issue Jul 7, 2022

Preallocate segment files #9731

Merged

10 tasks

menski removed the team/distributed label Jul 11, 2022

This was referenced Jul 18, 2022

Clean up journal segment pre-allocation #9833

Merged

Restructure journal module #9834

Merged

zeebe-bors-camunda bot closed this as completed in 9323a43 Jul 19, 2022

npepinpe added the version:8.1.0-alpha4 label Aug 2, 2022

Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022

rodrigo-lourenco-lopes mentioned this issue May 19, 2023

Test shouldNotPreallocateSegmentFiles does not access correct size on MAC_OS. #12664

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-allocate files to prevent no space errors during flush #7607

Pre-allocate files to prevent no space errors during flush #7607

deepthidevaki commented Aug 6, 2021

npepinpe commented Oct 18, 2021

lenaschoenburg commented Oct 18, 2021

npepinpe commented Oct 18, 2021

lenaschoenburg commented Oct 18, 2021

Zelldon commented Jun 23, 2022

npepinpe commented Jun 23, 2022 •

edited

npepinpe commented Jun 23, 2022 •

edited

npepinpe commented Jun 24, 2022 •

edited

lenaschoenburg commented Jun 27, 2022

npepinpe commented Jun 27, 2022

npepinpe commented Jul 4, 2022 •

edited

npepinpe commented Jul 5, 2022

Pre-allocate files to prevent no space errors during flush #7607

Pre-allocate files to prevent no space errors during flush #7607

Comments

deepthidevaki commented Aug 6, 2021

npepinpe commented Oct 18, 2021

lenaschoenburg commented Oct 18, 2021

Testing that pre-allocation works as expected

Testing that pre-allocation can fail

npepinpe commented Oct 18, 2021

lenaschoenburg commented Oct 18, 2021

Zelldon commented Jun 23, 2022

npepinpe commented Jun 23, 2022 • edited

npepinpe commented Jun 23, 2022 • edited

npepinpe commented Jun 24, 2022 • edited

lenaschoenburg commented Jun 27, 2022

npepinpe commented Jun 27, 2022

npepinpe commented Jul 4, 2022 • edited

npepinpe commented Jul 5, 2022

npepinpe commented Jun 23, 2022 •

edited

npepinpe commented Jun 23, 2022 •

edited

npepinpe commented Jun 24, 2022 •

edited

npepinpe commented Jul 4, 2022 •

edited