Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

direct-io: add support for bypassing operating system I/O cache when logging entries #2932

Closed
wants to merge 7 commits into from

Conversation

mauricebarnum
Copy link
Contributor

Descriptions of the changes in this PR:

Motivation

BookKeeper's entry log writing is buffered in application code before being submitted to the operating system, which by will buffer it again. The cost of this "double buffering" becomes a limiting write throughput factor.

This set of changes adds optional support to bypass the operating system buffering on supported systems (currently Linux and MacOS) by using the open(2) flag O_DIRECT. fallocate(2) is used, if available, to request that the filesystem allocate the required space before data is written.

Access to the I/O system calls is via a JNI binding included in bookkeeper/native-io

Changes

  • Refactor EntryLogger by introducing an interface, EntryLoggerIface. The direct-io entry logger will implement this interface.
  • Add package bookkeeper-slogger to provide support for structured logging with a pluggable logging backend. Provide an implementation using SLF4J.
  • Add native-io package to providing JNI bindings to operating system I/O api.
  • Refactor garbage collection and compaction to allow the entry logger to control which files are available to be garbage collected.
  • Add EventLoggerIface implementation DirectEntryLogger, enabled with dbStorage_directIOEntryLogger=true

In order to uphold a high standard for quality for code contributions, Apache BookKeeper runs various precommit
checks for pull requests. A pull request can only be merged when it passes precommit checks.


Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

If this PR is a BookKeeper Proposal (BP):

  • Make sure the PR title is formatted like:
    <BP-#>: Description of bookkeeper proposal
    e.g. BP-1: 64 bits ledger is support
  • Attach the master issue link in the description of this PR.
  • Attach the google doc link if the BP is written in Google Doc.

Otherwise:

  • Make sure the PR title is formatted like:
    <Issue #>: Description of pull request
    e.g. Issue 123: Description ...
  • Make sure tests pass via mvn clean apache-rat:check install spotbugs:check.
  • Replace <Issue #> in the title with the actual Issue number.

@mauricebarnum mauricebarnum marked this pull request as draft December 8, 2021 19:20
@mauricebarnum
Copy link
Contributor Author

the new modules only build with gradle. i haven't added maven support yet.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great stuff.

Can you please start an email thread on dev@bookkeeper.apache.com to introduce this work?

You would need to create a BP, a BookKeeper Proposal

@mauricebarnum
Copy link
Contributor Author

I've waffled a bit on how far to collapse these commits. All of the commits following "structured logger" logically fit together, but it gets unwieldy to follow some of the changes to common code that don't make sense as standalone commits themselves. I thought about pushing the JNA->JNI changes up the history, but that introduces so many point changes I didn't want to risk introducing "transitory" bugs.

@jvrao
Copy link
Contributor

jvrao commented Dec 15, 2021

Looking forward for this change.

@mauricebarnum mauricebarnum marked this pull request as ready for review January 11, 2022 00:16
@mauricebarnum
Copy link
Contributor Author

I've waffled a bit on how far to collapse these commits.

except for the commit comments (i won't edit them unless this version is preferred): https://github.com/apache/bookkeeper/compare/master...mauricebarnum:directio-squashed?expand=1

@hangc0276
Copy link
Contributor

Looking forward this feature.

@mauricebarnum
Copy link
Contributor Author

rerun failure checks

1 similar comment
@hangc0276
Copy link
Contributor

rerun failure checks

@mauricebarnum mauricebarnum force-pushed the directio branch 2 times, most recently from e4e8a8c to 655f253 Compare March 15, 2022 20:55
@hangc0276
Copy link
Contributor

@mauricebarnum Would you please send a proposal discuss into dev@bookkeeper.apache.org mail list ?

@eolivelli
Copy link
Contributor

We should definitely split this patch into separate changes.
I know it is hard but it is worth

@mauricebarnum mauricebarnum force-pushed the directio branch 6 times, most recently from 0551cc6 to 0540e4e Compare March 29, 2022 04:49
@mauricebarnum mauricebarnum force-pushed the directio branch 4 times, most recently from 4a0715d to 703a042 Compare April 2, 2022 06:04
Ivan Kelly and others added 7 commits April 6, 2022 22:48
Define the interface and contract for entrylogger. This is mostly
taking the entrylogger methods used by other components and prettying
them up a bit.

Notable changes are:
- internalReadEntry is now readEntry. There is no 'validate'
  flag. Instead there are two overloads for the method, and validation
  only runs if ledgerId and entryId are passed in.
- shutdown has been renamed to close.
- the compaction entrylog methods have been put behind an
  interface. As it was they were leaking implementation
  details. Ultimitely compaction itself should be hidden behind the
  entrylogger, but that's a larger refactor.
Utility to make it easier to add context to exception messages. Avoids
having to do custom formatting. You just add keyvalues to a builder.

Use like:
```
exMsg("something failed").kv("filename", fn).kv("errno", errno).toString()
```
Structured logging wrapper for slf4j.
Provide an interface to the POSIX IO API + extensions for O_DIRECT and
Linux's fallocate API in order to support direct IO on supported
systems.  Direct IO will allow us to bypass cache layers in the
underlying operation system, improving memory usage and reducing
unnecessary I/O operations to fill caches that won't be used.
A utility buffer class to be used with JNA calls. Buffers are page
aligned (4k pages).

The wrapper mostly handles writes between ByteBuffers and
ByteBufs. It also provides a method for padding the buffer to the next
alignment, so writes can have an aligned size also (as required by
direct I/O). The padding is done with 0xF0, so that if it is read as
an integer, or long, the value will be negative (assuming the read is
a java read, and thus an signed int).
GC support requires that the entrylogger provides a way to retrieve
all entrylogs which have been completely flushed to disk. Previously
this was done by returning the least unflushed log id. However, this
is problematic as it doesn't support the log ids wrapping around. It
also means that GC has to start checking for log id existence from
zero every time it boots.

This change replaces getLeastUnflushedLogId() with getFlushedLogIds(),
to give the entrylogger full control of which logs should be
considered for GC.

It also changes the CompactableLedgerStorage interface, removing
getEntryLogger() and adding injection of the entrylogger to the
GarbageCollectionThread. This makes testing easier.
The implementation uses JNI to do direct I/O to files via posix
syscalls. Fallocate is used if running on linux, otherwise this is
skipped (at the cost of more filesystem operates during writing).

There are two calls to write, writeAt and writeDelimited. I expect
writeAt to be used for the entrylog headers, which entries will go
through writeDelimited. In both cases, the calls may return before the
syscalls occur. #flush() needs to be called to ensure things are
actually written.

The entry log format isn't much changed from what is used by the
existing entrylogger. The biggest difference is the padding. Direct
I/O must write in aligned blocked. The size of the alignment varies by
machine configuration, but 4K is a safe bet on most. As it is unlikely
that entry data will land exactly on the alignment boundary, we need
to add padding to writes. The existing entry logger has been changed
to take this padding into account. When read as a signed int/long/byte
the padding will aways parse to a negative value, which distinguishes
it from valid entry data (the entry size will always be positive) and
also from preallocated space (which is always 0).

Another difference in the format is that the header is now 4K rather
than 1K. Again, this is to allow aligned rights. No changes are
necessary to allow the existing entry logger to deal with the header
change, as we create a dummy entry in the extra header space that the
existing entry logger already knows to ignore.

To enable, set dbStorage_directIOEntryLogger=true in the configuration.
@hangc0276
Copy link
Contributor

This PR has been divided into 7 Prs, and they are already merged into the master. https://github.com/apache/bookkeeper/pulls?q=is%3Apr+bp-47+is%3Aclosed+author%3Ahangc0276

We can close the PR now. Thanks for @mauricebarnum 's great contribution.

@hangc0276 hangc0276 closed this Jul 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants