direct-io: add support for bypassing operating system I/O cache when logging entries #2932
The new modules only build with Gradle. I haven't added Maven support yet.
This is great stuff.
Can you please start an email thread on dev@bookkeeper.apache.org to introduce this work?
You would need to create a BP (a BookKeeper Proposal).
I've waffled a bit on how far to collapse these commits. All of the commits following "structured logger" logically fit together, but it gets unwieldy to follow some of the changes to common code that don't make sense as standalone commits themselves. I thought about pushing the JNA->JNI changes up the history, but that introduces so many point changes I didn't want to risk introducing "transitory" bugs.
Looking forward to this change.
except for the commit comments (i won't edit them unless this version is preferred): https://github.com/apache/bookkeeper/compare/master...mauricebarnum:directio-squashed?expand=1
Looking forward to this feature.
rerun failure checks
@mauricebarnum Would you please send a proposal for discussion to the dev@bookkeeper.apache.org mailing list?
We should definitely split this patch into separate changes.
Define the interface and contract for entrylogger. This is mostly taking the entrylogger methods used by other components and prettying them up a bit. Notable changes are:
- internalReadEntry is now readEntry. There is no 'validate' flag. Instead there are two overloads for the method, and validation only runs if ledgerId and entryId are passed in.
- shutdown has been renamed to close.
- the compaction entrylog methods have been put behind an interface. As it was, they were leaking implementation details. Ultimately compaction itself should be hidden behind the entrylogger, but that's a larger refactor.
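A minimal sketch of the two-overload idea described above, not the actual BookKeeper interface. The class name, the in-memory map, the parameter order, and the assumption that each stored entry begins with its ledgerId and entryId as two longs are all hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an entry logger; illustrates the overloads only. */
final class ReadEntrySketch {
    // Maps a (hypothetical) location to the stored entry bytes.
    private final Map<Long, ByteBuffer> entriesByLocation = new HashMap<>();

    /** Stores an entry whose first 16 bytes are ledgerId and entryId (an assumption). */
    void put(long location, long ledgerId, long entryId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(16 + payload.length);
        buf.putLong(ledgerId).putLong(entryId).put(payload);
        buf.flip();
        entriesByLocation.put(location, buf);
    }

    /** Reads without validation: the caller trusts the location. */
    ByteBuffer readEntry(long location) {
        ByteBuffer buf = entriesByLocation.get(location);
        if (buf == null) {
            throw new IllegalArgumentException("no entry at location " + location);
        }
        return buf.duplicate();
    }

    /** Reads and validates: runs the check only because the ids were passed in. */
    ByteBuffer readEntry(long ledgerId, long entryId, long location) {
        ByteBuffer buf = readEntry(location);
        if (buf.getLong(0) != ledgerId || buf.getLong(8) != entryId) {
            throw new IllegalStateException("entry at location " + location
                    + " does not match ledgerId=" + ledgerId + " entryId=" + entryId);
        }
        return buf;
    }
}
```

Replacing a boolean 'validate' flag with overloads means a caller cannot ask for validation without also supplying the ids to validate against.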
Utility to make it easier to add context to exception messages. Avoids having to do custom formatting. You just add key-value pairs to a builder. Use like:
```
exMsg("something failed").kv("filename", fn).kv("errno", errno).toString()
```
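For illustration, a builder of this shape could look roughly like the sketch below. Only the `exMsg(...).kv(...).toString()` call chain comes from the commit message; the class name and the `key=value` output format are assumptions, and the real utility may format things differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a key-value exception message builder. */
final class ExMsgSketch {
    private final String message;
    // LinkedHashMap keeps the key-value pairs in the order they were added.
    private final Map<String, Object> kvs = new LinkedHashMap<>();

    private ExMsgSketch(String message) {
        this.message = message;
    }

    static ExMsgSketch exMsg(String message) {
        return new ExMsgSketch(message);
    }

    /** Adds one piece of context and returns this builder for chaining. */
    ExMsgSketch kv(String key, Object value) {
        kvs.put(key, value);
        return this;
    }

    /** Renders "message key1=value1 key2=value2" (an assumed format). */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(message);
        for (Map.Entry<String, Object> e : kvs.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String msg = exMsg("something failed")
                .kv("filename", "/tmp/0.log")
                .kv("errno", 5)
                .toString();
        System.out.println(msg); // something failed filename=/tmp/0.log errno=5
    }
}
```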
Structured logging wrapper for slf4j.
Provide an interface to the POSIX IO API + extensions for O_DIRECT and Linux's fallocate API in order to support direct IO on supported systems. Direct IO will allow us to bypass cache layers in the underlying operating system, improving memory usage and reducing unnecessary I/O operations to fill caches that won't be used.
A utility buffer class to be used with JNA calls. Buffers are page aligned (4k pages). The wrapper mostly handles writes between ByteBuffers and ByteBufs. It also provides a method for padding the buffer to the next alignment, so writes can have an aligned size also (as required by direct I/O). The padding is done with 0xF0, so that if it is read as an integer or a long, the value will be negative (assuming the read is a Java read, and thus a signed int).
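The padding step can be sketched as follows. This is an illustration under stated assumptions rather than the class from the commit: the names are hypothetical, and a plain heap `ByteBuffer` is used for simplicity, which is not page aligned in memory the way the real buffer has to be.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of padding a buffer up to a 4K boundary with 0xF0. */
final class PaddingSketch {
    static final int ALIGNMENT = 4096;            // 4k pages, as in the commit message
    static final byte PADDING_BYTE = (byte) 0xF0;

    /** Rounds pos up to the next multiple of ALIGNMENT (which is a power of two). */
    static int nextAlignment(int pos) {
        return (pos + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
    }

    /** Fills from the current position up to the next aligned offset with the padding byte. */
    static void padToAlignment(ByteBuffer buf) {
        int target = nextAlignment(buf.position());
        while (buf.position() < target) {
            buf.put(PADDING_BYTE);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(2 * ALIGNMENT);
        buf.putInt(100);            // pretend entry size
        buf.put(new byte[100]);     // pretend entry payload
        padToAlignment(buf);

        System.out.println(buf.position());        // 4096
        // Any int or long read from the padded region has its sign bit set.
        System.out.println(buf.getInt(104) < 0);   // true
        System.out.println(buf.getLong(104) < 0);  // true
    }
}
```

Because 0xF0 has its top bit set, a signed read that starts on any padding byte is negative, regardless of the width of the read.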
GC support requires that the entrylogger provides a way to retrieve all entrylogs which have been completely flushed to disk. Previously this was done by returning the least unflushed log id. However, this is problematic as it doesn't support the log ids wrapping around. It also means that GC has to start checking for log id existence from zero every time it boots. This change replaces getLeastUnflushedLogId() with getFlushedLogIds(), to give the entrylogger full control of which logs should be considered for GC. It also changes the CompactableLedgerStorage interface, removing getEntryLogger() and adding injection of the entrylogger to the GarbageCollectionThread. This makes testing easier.
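A rough sketch of why returning the set of flushed ids is easier for GC to consume than a single "least unflushed" id. Only the method name `getFlushedLogIds()` comes from the commit message; the interface name, the return type, and the helper method are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of GC consuming an explicit set of flushed log ids. */
final class FlushedLogsSketch {
    /** Assumed shape of the entrylogger method; the real return type may differ. */
    interface FlushedLogSource {
        Collection<Long> getFlushedLogIds();
    }

    /** Returns the flushed logs GC has not scanned yet, in the order the logger reports them. */
    static List<Long> logsToScan(FlushedLogSource logger, Set<Long> alreadyScanned) {
        List<Long> result = new ArrayList<>();
        for (long id : logger.getFlushedLogIds()) {
            if (!alreadyScanned.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ids that have wrapped around: a "least unflushed id" cannot describe this set,
        // but an explicit collection can.
        FlushedLogSource logger = () -> Arrays.asList(2147483646L, 0L, 1L);
        Set<Long> scanned = new HashSet<>(Arrays.asList(0L));
        System.out.println(logsToScan(logger, scanned)); // [2147483646, 1]
    }
}
```

GC never probes ids starting from zero here; it only looks at what the entrylogger reports as flushed.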
The implementation uses JNI to do direct I/O to files via posix syscalls. Fallocate is used if running on linux, otherwise this is skipped (at the cost of more filesystem operations during writing). There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entrylog headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are actually written.

The entry log format isn't much changed from what is used by the existing entrylogger. The biggest difference is the padding. Direct I/O must write in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore. To enable, set dbStorage_directIOEntryLogger=true in the configuration.
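The three-way distinction described above (positive entry size, zero for preallocated space, negative for padding) could be expressed as in the sketch below. The enum, the method name, and the toy buffer layout are hypothetical; only the sign rule itself comes from the commit message.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: classify what a reader finds where an entry size is expected. */
final class RecordMarkerSketch {
    enum Marker { ENTRY, PREALLOCATED, PADDING }

    /** Applies the sign rule: positive = entry, zero = preallocated, negative = padding. */
    static Marker classify(int sizeField) {
        if (sizeField > 0) {
            return Marker.ENTRY;
        }
        if (sizeField == 0) {
            return Marker.PREALLOCATED;
        }
        return Marker.PADDING;
    }

    public static void main(String[] args) {
        // A freshly allocated buffer is zero-filled, standing in for preallocated space.
        ByteBuffer log = ByteBuffer.allocate(64);
        log.putInt(8).putLong(42L);   // a toy entry: size field 8, then 8 payload bytes
        log.putInt(0xF0F0F0F0);       // four bytes of 0xF0 padding

        System.out.println(classify(log.getInt(0)));   // ENTRY
        System.out.println(classify(log.getInt(12)));  // PADDING
        System.out.println(classify(log.getInt(16)));  // PREALLOCATED
    }
}
```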
This PR has been divided into 7 PRs, and they have already been merged into master. https://github.com/apache/bookkeeper/pulls?q=is%3Apr+bp-47+is%3Aclosed+author%3Ahangc0276 We can close the PR now. Thanks to @mauricebarnum for the great contribution.
Descriptions of the changes in this PR:
Motivation
BookKeeper's entry log writing is buffered in application code before being submitted to the operating system, which will buffer it again. The cost of this "double buffering" becomes a limiting factor for write throughput.
This set of changes adds optional support to bypass the operating system buffering on supported systems (currently Linux and macOS) by using the open(2) flag O_DIRECT. fallocate(2) is used, if available, to request that the filesystem allocate the required space before data is written.
Access to the I/O system calls is via a JNI binding included in bookkeeper/native-io.
Changes
dbStorage_directIOEntryLogger=true