
Use CRC32C instead of multiplication hashes#4605

Open
danlark1 wants to merge 10 commits intofacebook:devfrom
danlark1:dev

Conversation

@danlark1
Contributor

@danlark1 danlark1 commented Mar 4, 2026

At Google we found around a 3-5% performance uplift across different kinds of tests, mostly on nicely compressible data. The improvement is mostly uniform across levels and across x86-64 and aarch64 platforms, with no change in compression ratios (the variance is around 0.08%).

We decided to implement a slower CRC32C version for other platforms. Let us know if you want to keep the multiplication hashes, but that will lead to slightly different outputs across platforms. We decided to prioritize stability over speed, but we don't have a strong opinion on this. This patch can be directly detrimental to 32-bit platforms as well as 64-bit RISC-V.

One test suddenly stopped failing, please take a look :)

@meta-cla meta-cla bot added the CLA Signed label Mar 4, 2026
@terrelln
Contributor

terrelln commented Mar 4, 2026

Let us know if you want to keep the multiplication hashes, but that will lead to slightly different outputs across platforms

We definitely need to produce deterministic output regardless of the platform. This is a strong requirement for Zstd that is relied upon in many places (e.g. build systems). So the big question for whether to merge this is: Do we have HW support on enough platforms that the large slowdown on platforms without it is okay?

What is the slowdown when the SW CRC32C has to be used?

One test suddenly stopped failing, please take a look :)

Looks like the determinism test is failing, because we check in a hash of the compressed file there and validate that the output matches it. We just need to update the stored value with:

cd tests/cli-tests/
./run.py --set-exact-output

@terrelln
Contributor

terrelln commented Mar 4, 2026

It looks like zstd_fast.c and zstd_double_fast.c are broken when using a dictionary. I haven't determined why yet. This is what is causing the dictionary test not to fail.

@danlark1
Contributor Author

danlark1 commented Mar 4, 2026

Let us know if you want to keep the multiplication hashes, but that will lead to slightly different outputs across platforms

We definitely need to produce deterministic output regardless of the platform. This is a strong requirement for Zstd that is relied upon in many places (e.g. build systems). So the big question for whether to merge this is: Do we have HW support on enough platforms that the large slowdown on platforms without it is okay?

Thanks, that was our impression as well. For now CRC32C is enabled on x86-64 (SSE4.2+) and aarch64 (ARMv8+), which covers 99+% of modern hardware. The problem is that some builds might still assume really old defaults like SSE2.

I think the default compiler options for GCC/Clang still target SSE2, not SSE4.2, so we need to add -msse4.2 to all the build options. For ARM I think the situation with compilers is easier.

What is the slowdown when the SW CRC32C has to be used?

20-30%

One test suddenly stopped failing, please take a look :)

Looks like the determinism test is failing, because we check in a hash of the compressed file there and validate that the output matches it. We just need to update the stored value with:

cd tests/cli-tests/
./run.py --set-exact-output

Thanks, updated. The test for decompression is still a bit of a mystery.

Comment on lines +998 to +1001
size_t hash;
assert(h <= 32);
hash = (size_t)ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
hash &= ((U64)1 << h) - 1;
Contributor


@danlark1 This is what is causing the problem for dictionary compression.

We (implicitly) rely on the fact that:

ZSTD_hash(p, h, s) >> n == ZSTD_hash(p, h - n, s)

When using dictionaries for the fast and double_fast strategies. Dictionary content is loaded here for the fast strategy, using hbits + ZSTD_SHORT_CACHE_TAG_BITS bits of hash:

{ size_t const hashAndTag = ZSTD_hashPtr(ip, hBits, mls);

Then, for larger inputs, we end up copying the dictionary into the working hash table here:

U32 const taggedIndex = src[i];

Those indices are then accessed using just hbits of hash in the match finding loop here:

hash0 = ZSTD_hashPtr(ip0, hlog, mls);

This should fix it. Could you please also add a comment noting this requirement?

Does using a shift impact speed?

Suggested change
-    size_t hash;
-    assert(h <= 32);
-    hash = (size_t)ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
-    hash &= ((U64)1 << h) - 1;
+    U32 hash;
+    assert(h <= 32);
+    hash = ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
+    hash >>= (32 - h);

Contributor Author


Thanks, changed.

Will run performance tests and report tomorrow.

Contributor Author

@danlark1 danlark1 Mar 5, 2026


So, after the shift, performance dropped back to the baseline, because the shift by (32 - h) requires 2 instructions (negate h, then shift), whereas & ((1 << h) - 1) is just a single bzhi instruction. If possible, we prefer the lower bits here.

One other resolution is to make a constant dispatch on hlog in these functions, so that the shifts become constant shifts.

Another resolution is to change all hlogs to be negative; then we only need to do >>= (32 + h), which is optimized to a single shift.

Contributor


@danlark1 We could use the low bits here, then we'd need to update everywhere that splits a hash with ZSTD_SHORT_CACHE_TAG_BITS to use the top bits for the tag, and the bottom bits for the hash.

That would change the requirement that:

ZSTD_hash(p, h, s) >> n == ZSTD_hash(p, h - n, s)

Into the requirement that

ZSTD_hash(p, h, s) & ((1u << (h - n)) - 1) == ZSTD_hash(p, h - n, s)

Assuming I'm getting my bits right.

@terrelln
Contributor

terrelln commented Mar 4, 2026

CC @Cyan4973 for your thoughts about switching to CRC32C & if we need to support fast compression on platforms without CRC32C.

*/
#if defined(__SSE4_2__) && defined(__x86_64__)

#include <x86intrin.h>


@elasota
Contributor

elasota commented Mar 5, 2026

This patch can be directly detrimental to 32-bit platforms as well as 64-bit RISC-V.

Is it worth implementing this with clmul-flavor instructions on architectures that support them? I believe all x86 architectures with clmul also have crc32, but PMULL on ARM is lower-spec than CRC32, and RISC-V has clmul with the Zbc extension.

@terrelln
Contributor

terrelln commented Mar 5, 2026

@danlark1 I chatted with @Cyan4973 yesterday, and we're a bit hesitant to require CRC32C to get performant compression. But, I still think there is a path forward for this patch.

We could gate the CRC32 hash behind a build macro like ZSTD_HASH_USE_CRC32C. When it is defined, we use CRC32C, and when it isn't we don't. This wouldn't violate our determinism rules, as builds with different build macros are allowed to produce different outputs.

This gets a bit trickier with the need to use & ((1 << h) - 1) rather than >> (32 - h). The multiplicative hash cannot use & ((1 << h) - 1), because that degrades the hash quality significantly. So we'd have to centralize all the places that split the hash value (there aren't too many) and have them use helper functions, so we can swap between the two implementations. It could also just be easier to change all the hlogs to be negative. It would be easy to add an assert(h < 0) in the hash function to make sure we've caught all callsites; then we wouldn't need to differentiate between multiplicative and CRC32.

But, then we'd need to actually test this variant. I can help out in setting up our unit tests & determinism tests to test this variant.

@danlark1
Contributor Author

danlark1 commented Mar 5, 2026

@danlark1 I chatted with @Cyan4973 yesterday, and we're a bit hesitant to require CRC32C to get performant compression. But, I still think there is a path forward for this patch.

We could gate the CRC32 hash behind a build macro like ZSTD_HASH_USE_CRC32C. When it is defined, we use CRC32C, and when it isn't we don't. This wouldn't violate our determinism rules, as builds with different build macros are allowed to produce different outputs.

This gets a bit trickier with the need to use & ((1 << h) - 1) rather than >> (32 - h). The multiplicative hash cannot use & ((1 << h) - 1), because that degrades the hash quality significantly. So we'd have to centralize all the places that split the hash value (there aren't too many) and have them use helper functions, so we can swap between the two implementations. It could also just be easier to change all the hlogs to be negative. It would be easy to add an assert(h < 0) in the hash function to make sure we've caught all callsites; then we wouldn't need to differentiate between multiplicative and CRC32.

But, then we'd need to actually test this variant. I can help out in setting up our unit tests & determinism tests to test this variant.

Thanks a lot, the macro compromise sounds good to us.

We are discussing internally and debugging why the shift is not working as we expect. Hopefully we'll understand this by tomorrow.

I'll prepare a patch for the macro then and hopefully find a good compromise, maybe a separate function like size_t ZSTD_reduceHash. Let me try to own it for now, and I'll ask for guidance along the way.

@terrelln
Contributor

terrelln commented Mar 5, 2026

Sounds good, thanks @danlark1!

