Use CRC32C instead of multiplication hashes #4605
danlark1 wants to merge 10 commits into facebook:dev
Conversation
At Google we found around a 3-5% performance uplift across different kinds of tests. The improvement is mostly uniform across levels and across x86-64 and aarch64 platforms, with no change in compression ratios (variance is around 0.08%). We implemented a slower crc32c version for other platforms. Let us know if you want to keep the multiplication hashes. We wanted to prioritize stability rather than speed, but this patch can be detrimental to 32-bit platforms as well as RISC-V.
We definitely need to produce deterministic output regardless of the platform. This is a strong requirement for Zstd that is relied upon in many places (e.g. build systems). So the big question for whether to merge this is: Do we have HW support on enough platforms that the large slowdown on platforms without it is okay? What is the slowdown when the SW CRC32C has to be used?
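For reference, the slow software path under discussion is typically a bit-at-a-time CRC32C (Castagnoli polynomial, reflected form 0x82F63B78). The sketch below is illustrative only, not zstd's actual fallback, which could instead be table-driven and considerably faster:

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bit-at-a-time CRC32C (CRC-32/ISCSI): init and final XOR of
   0xFFFFFFFF, reflected polynomial 0x82F63B78. Illustrative sketch only. */
uint32_t crc32c_sw(uint32_t crc, const void* buf, size_t len) {
    const uint8_t* p = (const uint8_t*)buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}
```

As a sanity check, `crc32c_sw(0, "123456789", 9)` yields 0xE3069283, the published check value for CRC-32C.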
Looks like the determinism test is failing, because there we check in a hash of the compressed file and validate that it matches. We just need to update the stored value.
It looks like …
Thanks, that was our impression as well. For now crc32c is used on x86-64 (SSE4.2+) and aarch64 (armv8+), which covers 99+% of modern hardware. The problem is that some builds might assume really old defaults like SSE2. I think the default compiler options for GCC/Clang still target SSE2, not SSE4.2. We need to add
20-30%
Thanks, updated; the test for decompression is still a bit of a mystery.
```c
size_t hash;
assert(h <= 32);
hash = (size_t)ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
hash &= ((U64)1 << h) - 1;
```
@danlark1 This is what is causing the problem for dictionary compression.
We (implicitly) rely on the fact that:
ZSTD_hash(p, h, s) >> n == ZSTD_hash(p, h - n, s)
When using dictionaries for the fast and double_fast strategies. Dictionary content is loaded here for the fast strategy, using hbits + ZSTD_SHORT_CACHE_TAG_BITS bits of hash:
Line 38 in 6e1e545
Then, for larger inputs, we end up copying the dictionary into the working hash table here:
zstd/lib/compress/zstd_compress.c, line 2417 in 6e1e545
Those indices are then accessed using just hbits of hash in the match finding loop here:
Line 785 in 6e1e545
The suggestion below should fix it. Could you please also add a comment noting this requirement?
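The shrinking property relied on here can be demonstrated with a toy multiplication hash in the shape zstd uses (the prime constant below is illustrative): because the hash is the top h bits of the product, dropping the low n bits is exactly the (h - n)-bit hash.

```c
#include <stdint.h>

/* Toy multiplication hash: the h-bit hash is the TOP h bits of
   (x * prime), so hash(x, h) >> n == hash(x, h - n) holds by
   construction. Constant chosen for illustration only. */
uint32_t mulHash(uint32_t x, unsigned h) {
    return (uint32_t)(x * 2654435761u) >> (32 - h);
}
```

For example, `mulHash(x, 20) >> 5` equals `mulHash(x, 15)` for any `x`, which is why a table loaded with wide hashes can be reindexed with narrower ones.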
Does using a shift impact speed?
```diff
-    size_t hash;
-    assert(h <= 32);
-    hash = (size_t)ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
-    hash &= ((U64)1 << h) - 1;
+    U32 hash;
+    assert(h <= 32);
+    hash = ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
+    hash >>= (32 - h);
```
Thanks, changed. Will run performance tests and report tomorrow.
So, after the shift, performance dropped back to baseline, because shifting by (32 - h) requires 2 instructions (negate h, then shift), while & ((1 << h) - 1) is just a single bzhi instruction. If possible, we would prefer the lower bits here.
One alternative is to dispatch on hlog as a compile-time constant in these functions, so the shifts become constant shifts.
Another is to store all hlogs as negative values; then we only need >>= (32 + h), which is optimized to a single shift.
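The three extraction options can be sketched side by side (function names are illustrative, not the PR's code). The negated-hlog trick works because x86 variable shifts only consult the low 5 bits of the count, so a compiler can fold the `32 +` away:

```c
#include <stdint.h>

/* Low-bits mask: one BZHI instruction on x86-64 with BMI2. */
uint32_t extractLow(uint32_t crc, unsigned hlog) {
    return crc & (((uint32_t)1 << hlog) - 1);
}

/* High-bits shift: (32 - hlog) costs a negate plus a shift. */
uint32_t extractHigh(uint32_t crc, unsigned hlog) {
    return crc >> (32 - hlog);
}

/* Same high bits, but with hlog stored negated (negHlog == -hlog):
   32 + negHlog == 32 - hlog, and the +32 is dropped by the hardware's
   mod-32 shift count, leaving one shift instruction. */
uint32_t extractHighNeg(uint32_t crc, int negHlog) {
    return crc >> (32 + negHlog);
}
```

All three are equivalent in result for the high-bit pair; only the low-bits variant selects different (bottom) bits of the CRC.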
@danlark1 We could use the low bits here; then we'd need to update everywhere that splits a hash with ZSTD_SHORT_CACHE_TAG_BITS to use the top bits for the tag and the bottom bits for the hash.
That would change the requirement that:
ZSTD_hash(p, h, s) >> n == ZSTD_hash(p, h - n, s)
Into the requirement that
ZSTD_hash(p, h, s) & ((1u << (h - n)) - 1) == ZSTD_hash(p, h - n, s)
Assuming I'm getting my bits right.
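The low-bits version of the shrinking identity can be checked with a toy hash (the multiply below is just a stand-in for CRC32C; only the masking matters): if the h-bit hash keeps the bottom h bits, then masking it down to h - n bits reproduces the (h - n)-bit hash.

```c
#include <stdint.h>

/* Toy LOW-bits hash: keep the bottom h bits of a mixed value.
   The mixing step stands in for CRC32C; the identity
   hash(x, h) & ((1u << (h - n)) - 1) == hash(x, h - n)
   holds for any mixing function. */
uint32_t lowBitsHash(uint32_t x, unsigned h) {
    uint32_t mixed = x * 2654435761u; /* placeholder for CRC32C */
    return mixed & (((uint32_t)1 << h) - 1);
}
```

This is the requirement the short-cache tag split would have to be updated to: tag in the top bits, table index in the bottom bits.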
CC @Cyan4973 for your thoughts about switching to CRC32C & whether we need to support fast compression on platforms without CRC32C.
```c
*/
#if defined(__SSE4_2__) && defined(__x86_64__)
#include <x86intrin.h>
```
Isn't the correct header for this nmmintrin.h?
Is it worth implementing this with clmul-flavor instructions on architectures that support them? I believe all x86 architectures with clmul also have crc32, but PMULL on ARM is lower-spec than CRC32, and RISC-V has clmul with the Zbc extension?
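A guarded hardware/software dispatch might look like the sketch below (the function name and exact guards are illustrative, not the PR's code). Both paths implement the same reflected Castagnoli CRC, so the output stays platform-independent either way:

```c
#include <stdint.h>
#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_crc32_u64 */
#endif

/* One CRC32C accumulation step over a 64-bit word: the SSE4.2 CRC32
   instruction where available at build time, bitwise software otherwise.
   Neither path applies init/final inversion, matching the raw instruction. */
uint32_t crc32c_u64_step(uint32_t crc, uint64_t v) {
#if defined(__SSE4_2__) && defined(__x86_64__)
    return (uint32_t)_mm_crc32_u64(crc, v);
#else
    for (int byte = 0; byte < 8; byte++) {
        crc ^= (uint32_t)(v >> (8 * byte)) & 0xFFu;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return crc;
#endif
}
```

On aarch64 the hardware path would instead use the `__crc32cd` ACLE intrinsic behind an `__ARM_FEATURE_CRC32` guard.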
@danlark1 I chatted with @Cyan4973 yesterday, and we're a bit hesitant to require CRC32C to get performant compression. But I still think there is a path forward for this patch. We could gate the CRC32 hash behind a build macro like … This gets a bit more tricky with the need to use … But then we'd need to actually test this variant. I can help out in setting up our unit tests & determinism tests to cover it.
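The gating idea could be sketched like this; the macro and function names below are hypothetical, since the actual macro name is not shown in the thread. The gate defaults off so existing builds keep byte-identical output:

```c
#include <stdint.h>

/* Hypothetical build-time gate; the real macro name is not given above.
   Defaults to 0 so the multiplication hash (today's behavior) is used. */
#ifndef DEMO_USE_CRC32C_HASH
#define DEMO_USE_CRC32C_HASH 0
#endif

uint32_t demoMulHash(uint32_t x, unsigned h) {
    return (x * 2654435761u) >> (32 - h); /* illustrative constant */
}

uint32_t demoCrcHashSoft(uint32_t x, unsigned h) {
    uint32_t crc = x; /* bitwise CRC32C mix of one 32-bit word */
    for (int i = 0; i < 32; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    return crc >> (32 - h);
}

uint32_t demoHash(uint32_t x, unsigned h) {
#if DEMO_USE_CRC32C_HASH
    return demoCrcHashSoft(x, h);
#else
    return demoMulHash(x, h);
#endif
}
```

The cost of this approach, as noted above, is that both variants must be built and tested (including the determinism tests) rather than just one.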
Thanks a lot, the macro compromise sounds good to us. We are discussing internally and debugging why the shift is not working as we expect; hopefully we'll understand this by tomorrow. I'll prepare a patch for a macro then, and hopefully will find a good compromise, maybe a separate function of …
Sounds good, thanks @danlark1!
At Google we found around a 3-5% performance uplift across different kinds of tests, mostly on nicely compressible data. The improvement is mostly uniform across levels and across x86-64 and aarch64 platforms, with no change in compression ratios (variance is around 0.08%).
We implemented a slower crc32c version for other platforms. Let us know if you want to keep the multiplication hashes, but that would lead to slightly different outputs across platforms. We decided to prioritize stability rather than speed, but we don't have a strong opinion on this. This patch can be detrimental to 32-bit platforms as well as 64-bit RISC-V.
One test suddenly stopped producing an error, please take a look :)