Simplify detection of x86 CPU features #11419
Conversation
Summary:

**Background** - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and by code using intrinsics for those instructions. Java builds with the native library also try to be broadly compatible yet performant.

What has changed is that CRC32c is no longer the most efficient checksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench:

* PORTABLE=1 empty USE_SSE : xxh3->8 GB/s crc32c->0.8 GB/s
* PORTABLE=1 USE_SSE=1 : xxh3->19 GB/s crc32c->16 GB/s
* PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s crc32c->16 GB/s

And hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is more tenable.

**This change** does two related things:

* Remove runtime CPU detection for optimizing CRC32c on x86. Maintaining this code is non-zero work, and compiling special code that doesn't work on the configured target instruction set for code generation is always dubious. (On the one hand we have to ensure the CRC32c code uses SSE4.2, but on the other hand we have to ensure nothing else does.)
* Detect CPU features in source code, not in build scripts. Although there are some hypothetical advantages to detecting in build scripts (compiler generality), RocksDB supports at least three build systems: make, cmake, and buck. It's not practical to support feature detection in all three, and we have suffered from missed optimization opportunities by relying on missing or incomplete detection in cmake and buck. We also depend on some components, like xxhash, that do source code detection anyway.

**In more detail:**

* `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` are replaced by the standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`.
* MSVC does not provide high-fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue.
* `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer.
* `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type.
* `USE_SSE` is deprecated, now equivalent to `PORTABLE=haswell`, which roughly approximates its old behavior.

Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation.

Test Plan: existing tests, CI

Manual performance tests after the change match the "before" numbers above (no difference expected with the make build). We also verified that the AVX2-optimized Bloom filter code is enabled when expected, by injecting a compiler error. (The performance difference is not big on my current CPU.)
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Anyway, I still don't understand why we can't just use runtime detection. Why not do the following?

Why is this important? For common distributions, like Fedora, Debian or Ubuntu: unfortunately, they have to support relatively old hardware. Everything may look like the code below (I assume the compiler supports all high-end intrinsics, so we don't have to check that at the build stage). "low-end", "middle-end" and "high-end" are just pseudocode placeholders that help explain my vision.
some-end-impl.c (example for one of the algorithm implementations; should be compiled with the appropriate arch flags):

int some_end_impl(int x) {
#if PORTABLE > SOME_END_ARCH  // pseudocode
  assert(0);
#else
  /* implementation for SOME_END_ARCH */
#endif
}

int some_end_supported() {
#if PORTABLE > SOME_END_ARCH
  // useless to include, since a faster implementation is guaranteed
  // to be supported by choosing PORTABLE=XXX
  return 0;
#elif PORTABLE == SOME_END_ARCH
  // no need to check, since PORTABLE=XXX guarantees this is supported
  return 1;
#else
  // this may or may not be supported by the CPU at runtime; check it
  (check at runtime)
#endif
}

general_code.c:

int algo(int x) {
  static int (*impl)(int x) = NULL;
  if (!impl) {
    if (high_end_supported())
      impl = high_end_impl;
    else if (middle_end_supported())
      impl = middle_end_impl;
    else if (low_end_supported())
      impl = low_end_impl;
    else
      assert(0);
  }
  return impl(x);
}
If crc32c were the default checksum, you would have a point. It's not the default. And I think you're dug into the out-of-date premise that this is a singularly performance-critical component. For most of our workloads, block checksums are less than 1% of RocksDB CPU. Essentially all are less than 5%. Doubling that cost in a portable build is not going to be the only overhead associated with a portable build. Lots of other things passively benefit from the compiler using newer, wider instructions. And anyway we have never had runtime detection of AVX2 and switching between xxh3 implementations. So this change is not a regression in default behaviors under any build config. Under this change, everyone (using the default checksum) still gets reasonable performance in a portable build. If you are sensitive to RocksDB performance, you should build specifically for your hardware, and that relates to more than just checksum functions.
My concern is what happens in binary distributions (such as Java). Can those binary distributions still take advantage of the architectural efficiencies? Personally, I would prefer we did more "stuff" at run-time (AVX and other support, URING detection, run-time detection of compression libraries, etc.) that would allow these features to be used and discovered in binary distributions.
I believe a single Java jar can contain various binary builds to be chosen at runtime. With kXXH3 as the default checksum, the most significant x86 CPU feature is probably AVX2. If you want to support pre-AVX2 hardware, you could probably just bundle two binaries, one for "haswell" and newer/greater, and one for the compiler minimum. Though you might need a small native library itself to identify which RocksDB library is compatible. Runtime checking at that level (selecting which full library to load) should be much more reliable and maintainable than the stuff I'm trying to get rid of, even if there is extra weight in compilation time and package size. Although it's tempting to try to do clever stuff like only compiling certain translation units with various supported instruction sets, so you don't have to include two full RocksDB libraries, I believe it's possible to run into ODR or linker issues when mixing different -march settings. Probably not in the current code (unless you're using Folly F14, in which case you will!), but someday you might, and better safe than sorry.
While I agree you could package multiple SOs for Java, you are then pushing the run-time determination of which to use into Java, and I do not know why that would be any easier/better than making that same determination in a single library in C++. For the machine intrinsics, my understanding is that you can call most (if not all) of those intrinsics even if the architecture does not support them. Doing so will result in an Illegal Instruction exception, not a link error. If there are ODR issues, then the methods should be refactored anyway for testing purposes. I fully support simplifying the rules here, but feel that compile-time choices (instead of run-time) can lead to a bunch of other issues and make it hard to develop and deploy features that may be available on some hardware or installations.
Like I said, you could use a small bootstrap native library to make the determination.
But then you're limited to the benefits of the hand-coded optimizations implemented with runtime selection, not the opportunities found by your compiler or the compile-time-selected hand-coded optimizations in dependencies such as std or folly.
Can you be more specific? As evidenced by recurring bug-fix PRs, runtime detection is a testing nightmare and involves a level of compile-time detection also. I don't see your point.
LGTM!
HISTORY.md (outdated)

@@ -7,6 +7,9 @@
### Public API Changes
* Add `MakeSharedCache()` construction functions to various cache Options objects, and deprecated the `NewWhateverCache()` functions with long parameter lists.

### Behavior changes
* For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware. See (PR link here).
Will it also likely cause a regression in MSVC + PORTABLE=1 builds on files written with ChecksumType::kCRC32c? I think that should be mentioned.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger merged this pull request in 459969e.
Change PORTABLE options into 0 instead of OFF (see: facebook/rocksdb#11419)
Summary:
Background - runtime detection of certain x86 CPU features was added for optimizing CRC32c checksums, where performance is dramatically affected by the availability of certain CPU instructions and by code using intrinsics for those instructions. Java builds with the native library also try to be broadly compatible yet performant.
What has changed is that CRC32c is no longer the most efficient checksum on contemporary x86_64 hardware, nor the default checksum. XXH3 is generally faster and not as dramatically impacted by the availability of certain CPU instructions. For example, on my Skylake system using db_bench (similar on an older Skylake system without AVX512):
PORTABLE=1 empty USE_SSE : xxh3->8 GB/s crc32c->0.8 GB/s (no SSE4.2 nor AVX2 instructions)
PORTABLE=1 USE_SSE=1 : xxh3->19 GB/s crc32c->16 GB/s (with SSE4.2 and AVX2)
PORTABLE=0 USE_SSE ignored: xxh3->28 GB/s crc32c->16 GB/s (also some AVX512)
Testing a ~10 year old system, with SSE4.2 but without AVX2, crc32c is a similar speed to the new systems but xxh3 is only about half that speed, also 8 GB/s like the non-AVX2 compile above. Given that xxh3 has a specific optimization for AVX2, I think we can infer that crc32c is only fastest for that ~2008-2013 period when SSE4.2 was included but not AVX2. And given that xxh3 is only about 2x slower on these systems (not >10x slower like unoptimized crc32c), I don't think we need to invest too much in optimally adapting to these old cases.
x86 hardware that doesn't support fast CRC32c is now extremely rare, so requiring a custom build to support such hardware is fine IMHO.
This change does two related things:

* Remove runtime CPU detection for optimizing CRC32c on x86.
* Detect CPU features in source code, not in build scripts.

In more detail:

* `HAVE_SSE42`, `HAVE_AVX2`, and `HAVE_PCLMUL` are replaced by the standard macros `__SSE4_2__`, `__AVX2__`, and `__PCLMUL__`.
* MSVC does not provide high-fidelity defines for SSE, PCLMUL, or POPCNT, but we can infer those from `__AVX__` or `__AVX2__` in a compatibility header. In rare cases of false negative or false positive feature detection, a build engineer should be able to set defines to work around the issue.
* `__POPCNT__` is another standard define, but we happen to only need it on MSVC, where it is set by that compatibility header, or can be set by the build engineer.
* `PORTABLE` can be set to a CPU type, e.g. "haswell", to compile for that CPU type.
* `USE_SSE` is deprecated, now equivalent to `PORTABLE=haswell`, which roughly approximates its old behavior.

Notably, this change should enable more builds to use the AVX2-optimized Bloom filter implementation.
Test Plan: existing tests, CI
Manual performance tests after the change match the "before" numbers above (no difference expected with the make build).
We also verified that the AVX2-optimized Bloom filter code is enabled when expected, by injecting a compiler error. (The performance difference is not big on my current CPU.)