Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-allocate files to prevent no space errors during flush #7607

Closed
deepthidevaki opened this issue Aug 6, 2021 · 12 comments · Fixed by #9833
Closed

Pre-allocate files to prevent no space errors during flush #7607

deepthidevaki opened this issue Aug 6, 2021 · 12 comments · Fixed by #9833
Assignees
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. version:8.1.0-alpha4 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0

Comments

@deepthidevaki
Copy link
Contributor

Since we memory mapped files, the files are not pre-allocated. When the buffer is flush there is no guarantee that there is enough space in disk. This can lead to errors which are hard to diagnose. We have seen before that it returns "internalError" which is not really useful to know the root cause.

To prevent such cases, we can pre-allocate the segment when it is created. It is like reserving that space for the file.

Followup #6221

@deepthidevaki deepthidevaki added the kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. label Aug 6, 2021
@npepinpe npepinpe added this to Planned in Zeebe Aug 9, 2021
@npepinpe npepinpe moved this from Planned to Ready in Zeebe Oct 18, 2021
@npepinpe
Copy link
Member

Prioritized up as it's part of Ole's onboarding plan.

@lenaschoenburg
Copy link
Member

Writing a test for pre-allocation will be difficult for two reasons:

Testing that pre-allocation works as expected

To test that pre-allocation works as expected we need to verify that the on-disk size of the segment file is as expected. Using Files.size(...) does not work because the OS can lie to us and report the file size as expected when in reality the file is sparse and does not use disk space. This is verifiable (on linux) by looking at a segment file with ls -lhs segmentFile. On filesystems that support sparse files the reported number of allocated 512 byte blocks will not match the reported file size.
We could write a test that uses linux syscalls to test the "real" file size but the same test would not work on macOS because macOS with APFS does not use sparse files in this case.

@npepinpe Do you think it is okay to have a platform specific tests for this? It'd only work on linux and not on all filesystems.

Testing that pre-allocation can fail

As discussed with @npepinpe, testing this is difficult because it relies on the filesystem to be full. We could, in theory, create this condition by using docker but this comes with considerable overhead and we could only check that Zeebe fails, not why. We decided that it is okay if we don't test this condition.

@npepinpe
Copy link
Member

Let's check in tomorrow - I'm having second thoughts as to whether or not we should do this now instead of proceeding with the Java 17 migration, especially because writing the test I think will be more difficult/complex than expected.

@lenaschoenburg
Copy link
Member

Okay, let's discuss this in detail. Just as a note to myself: We could also do this on a best-effort basis where we try to pre-allocate on systems which support fallocate but make no promises to do so. This would improve the situation in most cases without over-complicating things by trying to support every OS/FS.

@npepinpe npepinpe moved this from Ready to Planned in Zeebe Oct 19, 2021
@npepinpe npepinpe moved this from Planned to Ready in Zeebe Dec 12, 2021
@KerstinHebel KerstinHebel removed this from Ready in Zeebe Mar 23, 2022
@npepinpe npepinpe added the area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) label Apr 20, 2022
@Zelldon
Copy link
Member

Zelldon commented Jun 23, 2022

Did someone already tried that and can share some references or code? Might be easier to start with.

@npepinpe
Copy link
Member

npepinpe commented Jun 23, 2022

I did, and I think Ole might have as well, but I didn't save anything. There's two portable ways of pre-allocating files: using a template file (so create a temporary blank file of the right size at startup/the first time, and copy it every other time), or using the same trick Agrona does of mmap'ing the file, then at every 4Kb page, setting one byte to 0, then flushing the whole thing at the end. The fast/non-portable way would be to use jni-ffr and LibC, and a simple call to fallocate. It's pretty easy to do:

public interface LibC {
  int fallocate(int fd, int mode, int offset, int len);
}

public static void main(String[] args) throws Exception {
  final var libc = LibraryLoader.create(LibC.class).load("c");
  try(final var file = new RandomAccessFile("/path/to/file", "rw")) {
    final var fd = MethodHandles.lookup().findVarHandle(int.class, "fd", FileDescriptor.getClass()).get(file.getFD());
    libc.fallocate(fd, 0, 0, 1024 * 1024 * 1024); // allocates a 1MB file
  }
}

Maybe there's a way to get the file descriptor without reflection, but I'm not aware of it OTOH :(

@npepinpe
Copy link
Member

npepinpe commented Jun 23, 2022

It seems there's a SharedSecrets class which you can use to get the file descriptor, but it is an internal class...

/** A repository of "shared secrets", which are a mechanism for
    calling implementation-private methods in another package without
    using reflection. A package-private class implements a public
    interface and provides the ability to call package-private methods
    within that package; the object implementing that interface is
    provided through a third package to which access is restricted.
    This framework avoids the primary disadvantage of using reflection
    for this purpose, namely the loss of compile-time checking. */

public class SharedSecrets {
...
}

However this is not exported by default, so you'd need to add --add-exports java.base/jdk.internal.access=ALL-UNNAMED at compile time. So I don't know if it's much nicer...

Anyway, if you do have it, you can do SharedSecrets.getJavaIOFileDescriptorAccess().get(new RandomFile("path", "rw").getFD()) and no reflection is involved. That's what the JDK uses internally 🤷

@npepinpe npepinpe self-assigned this Jun 24, 2022
@npepinpe
Copy link
Member

npepinpe commented Jun 24, 2022

For testability, I think it might make sense to combine this issue with #7604

I would also omit using FFI to bind to libc for now. We can look into pre-allocating sparse files in a second iteration as a performance optimization.

@lenaschoenburg
Copy link
Member

Are template files actually a portable and reliable mechanism? I'd imagine that for copy-on-write filesystems (btrfs, zfs) this will not have any effect.

@npepinpe
Copy link
Member

To be portable you need to really write the other files, using something like FileChannel.transferTo. This is potentially less efficient but still relatively OK since it doesn't go through the JVM. That said, we can add optimizations after, I.e fallocate. It seems even RandomFile setLength might work for our Docker image, as setLength maps to ftruncate, which on ext3/ext4 does allocate when the size is greater. But fallocate is probably the best optimization for Linux deployments in general. It's just the error handling is awkward in Java (I.e. handling errno).

@npepinpe npepinpe mentioned this issue Jul 4, 2022
10 tasks
@npepinpe
Copy link
Member

npepinpe commented Jul 4, 2022

Alright, so I did a quick prototype today, and I would propose the following:

  1. Add a allocate(Path, long) call to FileUtil, which pre-allocates files. Call this in SegmentedJournal#createSegment.
  2. Initially, the implementation just zeroes the file by writing one byte to every page, then flushing. We can use Agrona's IoUtil#fill(FileChannel, long, long, byte) method for this. This is not efficient, much less than say using a template file. However, it's extremely simple, much more than using the template file. As we want to use this as a fallback for non Unix systems, and we're not creating segments every second, I think we could use this.
  3. Second iteration adds a mapping to posix_fallocate using JNA. While jnr-ffi is technically more performant since it produces byte code at compile time (where JNA uses proxies/reflection), it's less documented, has a smaller community, and less convenience utilities (such as when dealing with C's errno when doing I/O). Plus, our allocate method will not be in the hot path, so I don't think the reflection overhead of JNA will be a big deal versus the compile time byte code generation from jnr-ffi.

I did a prototype already, which you can see #9688. I ran into some weird issues:

  • In order to get the raw file descriptor to use with the posix_fallocate binding, I need to use reflection and grab out from a FileDescriptor. Unfortunately since Java 11, this is not possible unless we add an --add-opens of the java.io module. This can be done via a manifest entry for the util module, but IntelliJ won't pick it up when you run the tests, so I add to add it to the surefire and failsafe configurations in the parent/pom.xml. I don't like it, but I don't really see how else to get the file descriptor other than use fopen directly (which could be an option, but the error handling would be more complicated then).
  • Testing coverage isn't super high. Essentially I use a fake LibC implementation and return the errors that I want, but it's not doing any integration test other than the happy path. The only thing I know which could manipulate the file system is https://github.com/scylladb/charybdefs, but there's only Python bindings, and adding the Java bindings as part of this PR seems overkill. That said, it might be useful for us to have these bindings for other types of tests, so who knows, maybe it's worth the investment later.
  • I added a feature flag to disable the optimization, but it's pretty hacky since FileUtil is a static utility class, so I can't really inject any feature flags. I think right now it's bad, but maybe one option would be to have the feature flags passed to the journal, and then the journal (or the atomix factory) keeps track of the native FS to use which can be passed to FileUtil afterwards. Still not super great though...

Anyway, for the real implementation, I would split this into 2 PRs - adding the allocation with the "dumb" way, and then another one to add the C binding to posix_fallocate. The feature flag could also be added as third PR if the second one is too large.

EDIT: jnr-ffi might be a better option in the end. While it has a smaller community (though it's arguably used by many large projects), it has just better support to variable-width types (e.g. off_t), and if we go with posix_fallocate then we don't need complex errno handling.

@npepinpe
Copy link
Member

npepinpe commented Jul 5, 2022

As discussed, we'll start with a first PR which adds a global feature flag for pre-allocation, and uses IoUtil.fill to pre-allocate.

A second PR can add an optimization to use posix_fallocate where possible, which is automatically disabled if it fails any time during runtime and will reuse the fallback.

@npepinpe npepinpe mentioned this issue Jul 7, 2022
10 tasks
zeebe-bors-camunda bot added a commit that referenced this issue Jul 11, 2022
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607  



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 11, 2022
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607  



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 12, 2022
9731: Preallocate segment files r=npepinpe a=npepinpe

## Description

This PR introduces segment file pre-allocation in the journal. This is on by default, but can be disabled via an experimental configuration option.

At the moment, the pre-allocation is done in a "dumb" fashion - we allocate a 4Kb blocks of zeroes, and write this until we've reached the expected file length. Note that this means there may be one extra block allocated on disk.

One thing to note, to verify this, we used [jnr-posix](https://github.com/jnr/jnr-posix). The reason behind this is we want to know the actual number of blocks on disk reserved for this file. `Files#size`, or `File#length`, return the reported file size, which is part of the file's metadata (on UNIX systems anyway). If you mmap a file with a size of 1Mb, write one byte, then flush it, the reported size will be 1Mb, but the actual size on disk will be a single block (on most modern UNIX systems anyway). By using [stat](https://linux.die.net/man/2/stat), we can get the actual file size in terms of 512-bytes allocated blocks, so we get a pretty accurate measurement of the actual disk space used by the file.

I would've like to capture this in a test utility, but since `test-util` depends on `util`, there wasn't an easy way to do this, so I just copied the method in two places. One possibility I thought of is moving the whole pre-allocation stuff in `journal`, since we only use it there. The only downside I can see there is about discovery and cohesion, but I'd like to hear your thoughts on this.

A follow-up PR will come which will optimize the pre-allocation by using the [posix_fallocate](https://man7.org/linux/man-pages/man3/posix_fallocate.3.html) on POSIX systems.

Finally, I opted for an experimental configuration option instead of a feature flag. My reasoning is that it isn't a "new" feature, but instead we want to option of disabling this (for performance reasons potentially). So it's more of an advanced option. But I'd also like to hear your thoughts here.

## Related issues

closes #6504
closes #8099
related to #7607  



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this issue Jul 19, 2022
9834: Restructure journal module r=npepinpe a=npepinpe

## Description

This PR mostly restructures the journal module to better reflect its usage. Instead of the main API in the root package and everything else under the `file` package, we now have:

```
root <--- journal API and common classes meant to be consumed outside
  |
  |-- file <--- persistence layer, i.e. segmented journal implementation
  |-- record <-- record (de)serialization and manipulation
  |-- util <-- internal utilities (only ChecksumGenerator for now)
```

We could alternatively think about moving `file`, `record`, and `util` under a `impl` package as well to clearly indicate these aren't meant to be used outside of the module. This came out of #7607.

## Related issues

related to #7607



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. version:8.1.0-alpha4 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Projects
None yet
5 participants