
Simplify detection of x86 CPU features #11419

Closed
wants to merge 9 commits into from

Conversation

pdillinger
Contributor

@pdillinger pdillinger commented Apr 29, 2023

Summary:
Background: runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, whose performance is dramatically affected by the availability of certain CPU instructions and of code using intrinsics for those instructions. Also, Java builds with the native library try to be broadly compatible yet performant.

What has changed is that CRC32c is no longer the most efficient checksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar results on an older Skylake system without AVX512):

PORTABLE=1 empty USE_SSE  : xxh3->8 GB/s   crc32c->0.8 GB/s  (no SSE4.2 nor AVX2 instructions)
PORTABLE=1 USE_SSE=1      : xxh3->19 GB/s  crc32c->16 GB/s   (with SSE4.2 and AVX2)
PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s  crc32c->16 GB/s   (also some AVX512)

Testing a ~10-year-old system with SSE4.2 but without AVX2, crc32c runs at a similar speed to the new systems, but xxh3 is only about half that speed (also ~8 GB/s, like the non-AVX2 compile above). Given that xxh3 has specific optimizations for AVX2, I think we can infer that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not >10x slower like unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases.

x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO.

This change does two related things:

  • Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2, but on the other hand we have to ensure nothing else does.)
  • Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detecting in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection on all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components, like xxhash, that do source code detection anyway.

In more detail:

  • HAVE_SSE42, HAVE_AVX2, and HAVE_PCLMUL are replaced by the standard macros __SSE4_2__, __AVX2__, and __PCLMUL__.
  • MSVC does not provide high-fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from __AVX__ or __AVX2__ in a compatibility header (see the sketch after this list). In rare cases of false-negative or false-positive feature detection, a build engineer should be able to set defines to work around the issue.
  • __POPCNT__ is another standard define, but we happen to need it only on MSVC, where it is set by that compatibility header or can be set by the build engineer.
  • PORTABLE can be set to a CPU type, e.g. "haswell", to compile for that CPU type.
  • USE_SSE is deprecated and is now equivalent to PORTABLE=haswell, which roughly approximates its old behavior.
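
For illustration, a minimal sketch of what such a compatibility header can look like; the guards and comments below are my own summary of the rule above, not necessarily the literal code in this PR:

// MSVC defines __AVX__ / __AVX2__ under /arch:AVX and /arch:AVX2, but not
// the SSE4.2 / PCLMUL / POPCNT macros that GCC and Clang provide, so we
// infer them: every AVX-capable x86 CPU also supports these features.
#if defined(_MSC_VER) && defined(__AVX__)
#ifndef __SSE4_2__
#define __SSE4_2__ 1
#endif
#ifndef __PCLMUL__
#define __PCLMUL__ 1
#endif
#ifndef __POPCNT__
#define __POPCNT__ 1
#endif
#endif

Call sites then test only the standard macros (e.g., guard the CRC32c fast path with #if defined(__SSE4_2__) && defined(__PCLMUL__)), and a build engineer can predefine these macros to override a false negative or false positive.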

Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation.

Test Plan: existing tests, CI

Manual performance tests after the change match the "before" numbers above (no difference expected with the make build).

We also verified that the AVX2-optimized Bloom filter code is enabled when expected, by injecting a compiler error. (The performance difference is not big on my current CPU.)

@pdillinger pdillinger requested a review from ajkr May 2, 2023 17:05
@facebook-github-bot
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@socketpair

Anyway, I still don't understand why we can't just add runtime detection. Why not:

  1. Have several .c files implementing the same algorithm, but with different implementations.
  2. -march=xxx should cut some of them off. I mean, if a given arch guarantees some instruction set, the slower implementations should be omitted.
  3. Each .c file from the list above should be compiled with different compiler options.
  4. The main function (when run the first time) should detect the best implementation and fill a global function pointer with one of the implementations.

In other words, setting -march=xxx should change (influence) the list of available implementations.

Why is this important? For common distributions, like Fedora, Debian, or Ubuntu. Unfortunately, they have to support relatively old hardware. Everything could look like this:

(I assume the compiler supports all high-end intrinsics, so we don't have to check for them at build stage.)

"Low-end", "middle-end", and "high-end" are just pseudocode labels that help explain my vision.

                     Low-end impl      Middle-end impl   High-end impl
PORTABLE=low-end     Assume supported  Check at runtime  Check at runtime
PORTABLE=middle-end  Omit              Assume supported  Check at runtime
PORTABLE=high-end    Omit              Omit              Assume supported

some-end-impl.c (an example for one of the algorithm implementations; each should be compiled with the appropriate -march regardless of PORTABLE=XXX):

int some_end_impl(int x) {
#if PORTABLE > SOME_END_ARCH  // pseudocode
    assert(0);  // unreachable: a faster implementation is guaranteed available
#else
    /* implementation for SOME_END_ARCH */
#endif
}

int some_end_supported() {
#if PORTABLE > SOME_END_ARCH
    return 0; // useless to include, since a faster implementation is guaranteed to be supported by choosing PORTABLE=XXX
#elif PORTABLE == SOME_END_ARCH
    return 1; // no need to check, since PORTABLE=XXX guarantees this is supported
#else
    /* may or may not be supported by the CPU; check at runtime */
#endif
}

general_code.c:

int algo(int x) {
    // Lazily resolved on first call (note: not thread-safe as written;
    // real code would use an atomic or a once-initializer).
    static int (*impl)(int x) = NULL;
    if (!impl) {
        if (high_end_supported())
            impl = high_end_impl;
        else if (middle_end_supported())
            impl = middle_end_impl;
        else if (low_end_supported())
            impl = low_end_impl;
        else
            assert(0);
    }
    return impl(x);
}

@pdillinger
Contributor Author

> Anyway, I still don't understand why we can't just add runtime detection. Why not:
> ...
> Why is this important? For common distributions, like Fedora, Debian, or Ubuntu. Unfortunately, they have to support relatively old hardware.

If crc32c were the default checksum, you would have a point. It's not the default. And I think you're dug into the out-of-date premise that this is a singularly performance-critical component. For most of our workloads, block checksums are less than 1% of RocksDB CPU. Essentially all are less than 5%. Doubling that cost in a portable build is not going to be the only overhead associated with a portable build. Lots of other things passively benefit from the compiler using newer, wider instructions.

And anyway we have never had runtime detection of AVX2 and switching between xxh3 implementations. So this change is not a regression in default behaviors under any build config.

Under this change, everyone (using default checksum) still gets reasonable performance in a portable build. If you are sensitive to RocksDB performance, you should build specifically for your hardware, and that relates to more than just checksum functions.

@mrambacher
Contributor

> > Anyway, I still don't understand why we can't just add runtime detection. Why not:
> > ...
> > Why is this important? For common distributions, like Fedora, Debian, or Ubuntu. Unfortunately, they have to support relatively old hardware.
>
> If crc32c were the default checksum, you would have a point. It's not the default. And I think you're dug into the out-of-date premise that this is a singularly performance-critical component. For most of our workloads, block checksums are less than 1% of RocksDB CPU. Essentially all are less than 5%. Doubling that cost in a portable build is not going to be the only overhead associated with a portable build. Lots of other things passively benefit from the compiler using newer, wider instructions.
>
> And anyway we have never had runtime detection of AVX2 and switching between xxh3 implementations. So this change is not a regression in default behaviors under any build config.
>
> Under this change, everyone (using default checksum) still gets reasonable performance in a portable build. If you are sensitive to RocksDB performance, you should build specifically for your hardware, and that relates to more than just checksum functions.

My concern is what happens in binary distributions (such as Java). Can those binary distributions still take advantage of the architectural efficiencies? Personally, I would prefer we did more "stuff" at run-time (AVX and other support, URING detection, run-time detection of compression libraries, etc) that would allow these features to be used and discovered in binary distributions.

@pdillinger
Contributor Author

I believe a single Java jar can contain various binary builds to be chosen at runtime. With kXXH3 as the default checksum, the most significant x86 CPU feature is probably AVX2. If you want to support pre-AVX2 hardware, you could probably just bundle two binaries: one for "haswell" and newer/greater, and one for the compiler minimum. Though you might need a small native library itself to identify which RocksDB library is compatible (see the sketch below). Runtime checking at that level (selecting which full library to use) should be much more reliable and maintainable than the stuff I'm trying to get rid of, even if there is extra weight required in compilation time and package size.

Although it's tempting to try to do clever stuff like compiling only certain translation units with various supported instruction sets, so that you don't have to include two full RocksDB libraries, I believe it's possible to run into ODR or linker issues when mixing different -march settings. Probably not in the current code (unless you're using Folly F14, in which case you will!), but someday you might, and better safe than sorry.
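
For illustration, a minimal sketch of such a bootstrap selector, assuming GCC/Clang's __builtin_cpu_supports builtin and hypothetical library names (not code from this PR):

// bootstrap.c (hypothetical): choose which full RocksDB library to load.
#include <dlfcn.h>
#include <stdio.h>

static const char* pick_rocksdb_lib(void) {
  __builtin_cpu_init();  // initialize CPU feature info for the checks below
  if (__builtin_cpu_supports("avx2")) {
    return "librocksdbjni-haswell.so";  // hypothetical "haswell" build
  }
  return "librocksdbjni-portable.so";   // hypothetical compiler-minimum build
}

int main(void) {
  const char* lib = pick_rocksdb_lib();
  void* handle = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
  if (!handle) {
    fprintf(stderr, "failed to load %s: %s\n", lib, dlerror());
    return 1;
  }
  printf("loaded %s\n", lib);
  return 0;
}

The same check could live behind a small JNI entry point, so the Java side never needs to know about CPU features.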

@mrambacher
Contributor

While I agree you could package multiple SOs for Java, you are then pushing the run-time determination of which one to use into Java, and I don't know why that would be any easier or better than making the same determination in a single library in C++.

For the machine intrinsics, my understanding is that you can call most (if not all) of those intrinsics even if the architecture does not support them; doing so will result in an Illegal Instruction exception, not a link error. If there are ODR issues, then the methods should be refactored anyway for testing purposes.

I fully support simplifying the rules here, but I feel that compile-time choices (instead of run-time ones) can lead to a bunch of other issues and make it hard to develop and deploy features that may be available on some hardware or installations.

@pdillinger
Contributor Author

> While I agree you could package multiple SOs for Java, you are then pushing the run-time determination of which one to use into Java, and I don't know why that would be any easier or better than making the same determination in a single library in C++.

Like I said, you could use a small bootstrap native library to make the determination.

> For the machine intrinsics, my understanding is that you can call most (if not all) of those intrinsics even if the architecture does not support them; doing so will result in an Illegal Instruction exception, not a link error. If there are ODR issues, then the methods should be refactored anyway for testing purposes.

But then you're limited to the benefits of the hand-coded optimizations implemented with runtime selection, not the opportunities found by your compiler, nor the compile-time-selected hand-coded optimizations in dependencies such as std or folly.

> I fully support simplifying the rules here, but I feel that compile-time choices (instead of run-time ones) can lead to a bunch of other issues and make it hard to develop and deploy features that may be available on some hardware or installations.

Can you be more specific? As evidenced by recurring bug-fix PRs, runtime detection is a testing nightmare, and it involves a level of compile-time detection as well. I don't see your point.

@ajkr ajkr left a comment
Contributor

LGTM!

HISTORY.md Outdated
@@ -7,6 +7,9 @@
### Public API Changes
* Add `MakeSharedCache()` construction functions to various cache Options objects, and deprecated the `NewWhateverCache()` functions with long parameter lists.

### Behavior changes
* For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware. See (PR link here).

Will it also likely cause a regression in MSVC + PORTABLE=1 builds on files written with ChecksumType::kCRC32c? I think that should be mentioned.

@facebook-github-bot
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pdillinger merged this pull request in 459969e.

aleksraiden added a commit to aleksraiden/kvrocks that referenced this pull request Jun 24, 2023
Change PORTABLE options into 0 instead of OFF (see: facebook/rocksdb#11419)
niklasf added a commit to niklasf/rust-rocksdb that referenced this pull request Jul 2, 2023