
Simplify detection of x86 CPU features #11419

Closed
wants to merge 9 commits into from

Conversation

pdillinger
Contributor

@pdillinger pdillinger commented Apr 29, 2023

Summary:
Background: runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, whose performance is dramatically affected by the availability of certain CPU instructions and of code using intrinsics for those instructions. Also, Java builds with the native library try to be broadly compatible yet performant.

What has changed is that CRC32c is no longer the most efficient checksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar results on an older Skylake system without AVX512):

PORTABLE=1 empty USE_SSE  : xxh3->8 GB/s   crc32c->0.8 GB/s  (no SSE4.2 nor AVX2 instructions)
PORTABLE=1 USE_SSE=1      : xxh3->19 GB/s  crc32c->16 GB/s   (with SSE4.2 and AVX2)
PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s  crc32c->16 GB/s   (also some AVX512)

Testing a ~10-year-old system with SSE4.2 but without AVX2, crc32c runs at a similar speed to the new systems, but xxh3 is only about half that speed (also ~8 GB/s, like the non-AVX2 compile above). Given that xxh3 has specific optimizations for AVX2, I think we can infer that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not >10x slower like unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases.

x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO.

This change does two related things:

  • Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2, but on the other hand we have to ensure nothing else does.)
  • Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detecting in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection on all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components, like xxhash, that do source code detection anyway.

In more detail:

  • HAVE_SSE42, HAVE_AVX2, and HAVE_PCLMUL are replaced by the standard macros __SSE4_2__, __AVX2__, and __PCLMUL__.
  • MSVC does not provide high-fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from __AVX__ or __AVX2__ in a compatibility header (see the sketch after this list). In rare cases of false-negative or false-positive feature detection, a build engineer should be able to set defines to work around the issue.
  • __POPCNT__ is another standard define, but we happen to need it only on MSVC, where it is set by that compatibility header or can be set by the build engineer.
  • PORTABLE can be set to a CPU type, e.g. "haswell", to compile for that CPU type.
  • USE_SSE is deprecated and is now equivalent to PORTABLE=haswell, which roughly approximates its old behavior.
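
For illustration, a minimal sketch of what such a compatibility header can look like; the guards and comments below are my own summary of the rule above, not necessarily the literal code in this PR:

// MSVC defines __AVX__ / __AVX2__ under /arch:AVX and /arch:AVX2, but not
// the SSE4.2 / PCLMUL / POPCNT macros that GCC and Clang provide, so we
// infer them: every AVX-capable x86 CPU also supports these features.
#if defined(_MSC_VER) && defined(__AVX__)
#ifndef __SSE4_2__
#define __SSE4_2__ 1
#endif
#ifndef __PCLMUL__
#define __PCLMUL__ 1
#endif
#ifndef __POPCNT__
#define __POPCNT__ 1
#endif
#endif

Call sites then test only the standard macros (e.g., guard the CRC32c fast path with #if defined(__SSE4_2__) && defined(__PCLMUL__)), and a build engineer can predefine these macros to override a false negative or false positive.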

Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation.

Test Plan: existing tests, CI

Manual performance tests after the change match the "before" numbers above (no difference expected with the make build).

We also verified that the AVX2-optimized Bloom filter code is enabled when expected, by injecting a compiler error. (The performance difference is not big on my current CPU.)

@pdillinger pdillinger requested a review from ajkr May 2, 2023 17:05
@facebook-github-bot
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@socketpair

Anyway, I still don't understand why we can't just add runtime detection. Why not:

  1. Have several .c files implementing the same algorithm, but with different implementations.
  2. -march=xxx should cut some of them off. I mean, if a given arch guarantees some instruction set, the slower implementations should be omitted.
  3. Each .c file from the list above should be compiled with different compiler options.
  4. The main function (when run the first time) should detect the best implementation and fill a global function pointer with one of the implementations.

In other words, setting -march=xxx should change (influence) the list of available implementations.

Why is this important? For common distributions, like Fedora, Debian, or Ubuntu. Unfortunately, they have to support relatively old hardware. Everything could look like this:

(I assume the compiler supports all high-end intrinsics, so we don't have to check for them at build stage.)

"Low-end", "middle-end", and "high-end" are just pseudocode labels that help explain my vision.

                     Low-end impl      Middle-end impl   High-end impl
PORTABLE=low-end     Assume supported  Check at runtime  Check at runtime
PORTABLE=middle-end  Omit              Assume supported  Check at runtime
PORTABLE=high-end    Omit              Omit              Assume supported

some-end-impl.c (an example for one of the algorithm implementations; each should be compiled with the appropriate -march regardless of PORTABLE=XXX):

int some_end_impl(int x) {
#if PORTABLE > SOME_END_ARCH  // pseudocode
    assert(0);  // unreachable: a faster implementation is guaranteed available
#else
    /* implementation for SOME_END_ARCH */
#endif
}

int some_end_supported() {
#if PORTABLE > SOME_END_ARCH
    return 0; // useless to include, since a faster implementation is guaranteed to be supported by choosing PORTABLE=XXX
#elif PORTABLE == SOME_END_ARCH
    return 1; // no need to check, since PORTABLE=XXX guarantees this is supported
#else
    /* may or may not be supported by the CPU; check at runtime */
#endif
}

general_code.c:

int algo(int x) {
    // Lazily resolved on first call (note: not thread-safe as written;
    // real code would use an atomic or a once-initializer).
    static int (*impl)(int x) = NULL;
    if (!impl) {
        if (high_end_supported())
            impl = high_end_impl;
        else if (middle_end_supported())
            impl = middle_end_impl;
        else if (low_end_supported())
            impl = low_end_impl;
        else
            assert(0);
    }
    return impl(x);
}

@pdillinger
Contributor Author

> Anyway, I still don't understand why we can't just add runtime detection. Why not:
> ...
> Why is this important? For common distributions, like Fedora, Debian, or Ubuntu. Unfortunately, they have to support relatively old hardware.

If crc32c were the default checksum, you would have a point. It's not the default. And I think you're dug into the out-of-date premise that this is a singularly performance-critical component. For most of our workloads, block checksums are less than 1% of RocksDB CPU. Essentially all are less than 5%. Doubling that cost in a portable build is not going to be the only overhead associated with a portable build. Lots of other things passively benefit from the compiler using newer, wider instructions.

And anyway we have never had runtime detection of AVX2 and switching between xxh3 implementations. So this change is not a regression in default behaviors under any build config.

Under this change, everyone (using default checksum) still gets reasonable performance in a portable build. If you are sensitive to RocksDB performance, you should build specifically for your hardware, and that relates to more than just checksum functions.

@mrambacher
Contributor

> > Anyway, I still don't understand why we can't just add runtime detection. Why not:
> > ...
> > Why is this important? For common distributions, like Fedora, Debian, or Ubuntu. Unfortunately, they have to support relatively old hardware.
>
> If crc32c were the default checksum, you would have a point. It's not the default. And I think you're dug into the out-of-date premise that this is a singularly performance-critical component. For most of our workloads, block checksums are less than 1% of RocksDB CPU. Essentially all are less than 5%. Doubling that cost in a portable build is not going to be the only overhead associated with a portable build. Lots of other things passively benefit from the compiler using newer, wider instructions.
>
> And anyway we have never had runtime detection of AVX2 and switching between xxh3 implementations. So this change is not a regression in default behaviors under any build config.
>
> Under this change, everyone (using default checksum) still gets reasonable performance in a portable build. If you are sensitive to RocksDB performance, you should build specifically for your hardware, and that relates to more than just checksum functions.

My concern is what happens in binary distributions (such as Java). Can those binary distributions still take advantage of the architectural efficiencies? Personally, I would prefer we did more "stuff" at run-time (AVX and other support, URING detection, run-time detection of compression libraries, etc) that would allow these features to be used and discovered in binary distributions.

@pdillinger
Contributor Author

I believe a single Java jar can contain various binary builds to be chosen at runtime. With kXXH3 as the default checksum, the most significant x86 CPU feature is probably AVX2. If you want to support pre-AVX2 hardware, you could probably just bundle two binaries: one for "haswell" and newer/greater, and one for the compiler minimum. Though you might need a small native library itself to identify which RocksDB library is compatible (see the sketch below). Runtime checking at that level (selecting which full library to use) should be much more reliable and maintainable than the stuff I'm trying to get rid of, even if there is extra weight required in compilation time and package size.

Although it's tempting to try to do clever stuff like compiling only certain translation units with various supported instruction sets, so that you don't have to include two full RocksDB libraries, I believe it's possible to run into ODR or linker issues when mixing different -march settings. Probably not in the current code (unless you're using Folly F14, in which case you will!), but someday you might, and better safe than sorry.
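
For illustration, a minimal sketch of such a bootstrap selector, assuming GCC/Clang's __builtin_cpu_supports builtin and hypothetical library names (not code from this PR):

// bootstrap.c (hypothetical): choose which full RocksDB library to load.
#include <dlfcn.h>
#include <stdio.h>

static const char* pick_rocksdb_lib(void) {
  __builtin_cpu_init();  // initialize CPU feature info for the checks below
  if (__builtin_cpu_supports("avx2")) {
    return "librocksdbjni-haswell.so";  // hypothetical "haswell" build
  }
  return "librocksdbjni-portable.so";   // hypothetical compiler-minimum build
}

int main(void) {
  const char* lib = pick_rocksdb_lib();
  void* handle = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
  if (!handle) {
    fprintf(stderr, "failed to load %s: %s\n", lib, dlerror());
    return 1;
  }
  printf("loaded %s\n", lib);
  return 0;
}

The same check could live behind a small JNI entry point, so the Java side never needs to know about CPU features.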

@mrambacher
Contributor

While I agree you could package multiple SOs for Java, you are then pushing the run-time determination of which one to use into Java, and I don't know why that would be any easier or better than making the same determination in a single library in C++.

For the machine intrinsics, my understanding is that you can call most (if not all) of those intrinsics even if the architecture does not support them; doing so will result in an Illegal Instruction exception, not a link error. If there are ODR issues, then the methods should be refactored anyway for testing purposes.

I fully support simplifying the rules here, but I feel that compile-time choices (instead of run-time ones) can lead to a bunch of other issues and make it hard to develop and deploy features that may be available on some hardware or installations.

@pdillinger
Contributor Author

> While I agree you could package multiple SOs for Java, you are then pushing the run-time determination of which one to use into Java, and I don't know why that would be any easier or better than making the same determination in a single library in C++.

Like I said, you could use a small bootstrap native library to make the determination.

> For the machine intrinsics, my understanding is that you can call most (if not all) of those intrinsics even if the architecture does not support them; doing so will result in an Illegal Instruction exception, not a link error. If there are ODR issues, then the methods should be refactored anyway for testing purposes.

But then you're limited to the benefits of the hand-coded optimizations implemented with runtime selection, not the opportunities found by your compiler, nor the compile-time-selected hand-coded optimizations in dependencies such as std or folly.

> I fully support simplifying the rules here, but I feel that compile-time choices (instead of run-time ones) can lead to a bunch of other issues and make it hard to develop and deploy features that may be available on some hardware or installations.

Can you be more specific? As evidenced by recurring bug-fix PRs, runtime detection is a testing nightmare, and it involves a level of compile-time detection as well. I don't see your point.

@ajkr ajkr left a comment
Contributor

LGTM!

HISTORY.md Outdated
@@ -7,6 +7,9 @@
### Public API Changes
* Add `MakeSharedCache()` construction functions to various cache Options objects, and deprecated the `NewWhateverCache()` functions with long parameter lists.

### Behavior changes
* For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware. See (PR link here).

Will it also likely cause a regression in MSVC + PORTABLE=1 builds on files written with ChecksumType::kCRC32c? I think that should be mentioned.

@facebook-github-bot
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pdillinger merged this pull request in 459969e.

aleksraiden added a commit to aleksraiden/kvrocks that referenced this pull request Jun 24, 2023
Change PORTABLE options into 0 instead of OFF (see: facebook/rocksdb#11419)
niklasf added a commit to niklasf/rust-rocksdb that referenced this pull request Jul 2, 2023