Use CRC32C instead of multiplication hashes #4605
danlark1 wants to merge 10 commits into facebook:dev
Conversation
At Google we found around a 3-5% performance uplift across different kinds of tests. The improvement is mostly uniform across levels and across x86-64 and aarch64 platforms, with no change in compression ratios (variance is around 0.08%). We implemented a slower crc32c version for other platforms. Let us know if you want to keep the multiplication hashes. We wanted to prioritize stability rather than speed, but this patch can be detrimental to 32-bit platforms as well as RISC-V.
We definitely need to produce deterministic output regardless of the platform. This is a strong requirement for Zstd that is relied upon in many places (e.g. build systems). So the big question for whether to merge this is: Do we have HW support on enough platforms that the large slowdown on platforms without it is okay? What is the slowdown when the SW CRC32C has to be used?
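For reference, the slow software path under discussion is typically a bit-at-a-time CRC32C (Castagnoli polynomial, reflected form 0x82F63B78). The sketch below is illustrative only, not zstd's actual fallback, which could instead be table-driven and considerably faster:

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bit-at-a-time CRC32C (CRC-32/ISCSI): init and final XOR of
   0xFFFFFFFF, reflected polynomial 0x82F63B78. Illustrative sketch only. */
uint32_t crc32c_sw(uint32_t crc, const void* buf, size_t len) {
    const uint8_t* p = (const uint8_t*)buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}
```

As a sanity check, `crc32c_sw(0, "123456789", 9)` yields 0xE3069283, the published check value for CRC-32C.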
Looks like the determinism test is failing, because there we check in a hash of the compressed file and validate that it matches. We just need to update the stored value.
It looks like …
Thanks, that was our impression as well. For now crc32c is used on x86-64 (SSE4.2+) and aarch64 (armv8+), which covers 99+% of modern hardware. The problem is that some builds might assume really old defaults like SSE2. I think the default compiler options for GCC/Clang still target SSE2, not SSE4.2. We need to add
20-30%
Thanks, updated; the test for decompression is still a bit of a mystery.
```c
size_t hash;
assert(h <= 32);
hash = (size_t)ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
hash &= ((U64)1 << h) - 1;
```
@danlark1 This is what is causing the problem for dictionary compression.
We (implicitly) rely on the fact that:
ZSTD_hash(p, h, s) >> n == ZSTD_hash(p, h - n, s)
When using dictionaries for the fast and double_fast strategies. Dictionary content is loaded here for the fast strategy, using hbits + ZSTD_SHORT_CACHE_TAG_BITS bits of hash:
Line 38 in 6e1e545
Then, for larger inputs, we end up copying the dictionary into the working hash table here:
zstd/lib/compress/zstd_compress.c, line 2417 in 6e1e545
Those indices are then accessed using just hbits of hash in the match finding loop here:
Line 785 in 6e1e545
The suggestion below should fix it. Could you please also add a comment noting this requirement?
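The shrinking property relied on here can be demonstrated with a toy multiplication hash in the shape zstd uses (the prime constant below is illustrative): because the hash is the top h bits of the product, dropping the low n bits is exactly the (h - n)-bit hash.

```c
#include <stdint.h>

/* Toy multiplication hash: the h-bit hash is the TOP h bits of
   (x * prime), so hash(x, h) >> n == hash(x, h - n) holds by
   construction. Constant chosen for illustration only. */
uint32_t mulHash(uint32_t x, unsigned h) {
    return (uint32_t)(x * 2654435761u) >> (32 - h);
}
```

For example, `mulHash(x, 20) >> 5` equals `mulHash(x, 15)` for any `x`, which is why a table loaded with wide hashes can be reindexed with narrower ones.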
Does using a shift impact speed?
```diff
-    size_t hash;
-    assert(h <= 32);
-    hash = (size_t)ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
-    hash &= ((U64)1 << h) - 1;
+    U32 hash;
+    assert(h <= 32);
+    hash = ZSTD_COMPRESS_INTERNAL_CRC32_U64((U32)s, u);
+    hash >>= (32 - h);
```
Thanks, changed. Will run performance tests and report tomorrow.
So, after the shift, performance dropped back to baseline, because shifting by (32 - h) requires 2 instructions (negate h, then shift), while & ((1 << h) - 1) is just a single bzhi instruction. If possible, we would prefer the lower bits here.
One alternative is to dispatch on hlog as a compile-time constant in these functions, so the shifts become constant shifts.
Another is to store all hlogs as negative values; then we only need >>= (32 + h), which is optimized to a single shift.
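The three extraction options can be sketched side by side (function names are illustrative, not the PR's code). The negated-hlog trick works because x86 variable shifts only consult the low 5 bits of the count, so a compiler can fold the `32 +` away:

```c
#include <stdint.h>

/* Low-bits mask: one BZHI instruction on x86-64 with BMI2. */
uint32_t extractLow(uint32_t crc, unsigned hlog) {
    return crc & (((uint32_t)1 << hlog) - 1);
}

/* High-bits shift: (32 - hlog) costs a negate plus a shift. */
uint32_t extractHigh(uint32_t crc, unsigned hlog) {
    return crc >> (32 - hlog);
}

/* Same high bits, but with hlog stored negated (negHlog == -hlog):
   32 + negHlog == 32 - hlog, and the +32 is dropped by the hardware's
   mod-32 shift count, leaving one shift instruction. */
uint32_t extractHighNeg(uint32_t crc, int negHlog) {
    return crc >> (32 + negHlog);
}
```

All three are equivalent in result for the high-bit pair; only the low-bits variant selects different (bottom) bits of the CRC.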
@danlark1 We could use the low bits here; then we'd need to update everywhere that splits a hash with ZSTD_SHORT_CACHE_TAG_BITS to use the top bits for the tag and the bottom bits for the hash.
That would change the requirement that:
ZSTD_hash(p, h, s) >> n == ZSTD_hash(p, h - n, s)
Into the requirement that
ZSTD_hash(p, h, s) & ((1u << (h - n)) - 1) == ZSTD_hash(p, h - n, s)
Assuming I'm getting my bits right.
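The low-bits version of the shrinking identity can be checked with a toy hash (the multiply below is just a stand-in for CRC32C; only the masking matters): if the h-bit hash keeps the bottom h bits, then masking it down to h - n bits reproduces the (h - n)-bit hash.

```c
#include <stdint.h>

/* Toy LOW-bits hash: keep the bottom h bits of a mixed value.
   The mixing step stands in for CRC32C; the identity
   hash(x, h) & ((1u << (h - n)) - 1) == hash(x, h - n)
   holds for any mixing function. */
uint32_t lowBitsHash(uint32_t x, unsigned h) {
    uint32_t mixed = x * 2654435761u; /* placeholder for CRC32C */
    return mixed & (((uint32_t)1 << h) - 1);
}
```

This is the requirement the short-cache tag split would have to be updated to: tag in the top bits, table index in the bottom bits.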
CC @Cyan4973 for your thoughts about switching to CRC32C & whether we need to support fast compression on platforms without CRC32C.
```c
*/
#if defined(__SSE4_2__) && defined(__x86_64__)
#include <x86intrin.h>
```
Isn't the correct header for this nmmintrin.h?
Is it worth implementing this with clmul-flavor instructions on architectures that support them? I believe all x86 architectures with clmul also have crc32, but PMULL on ARM is lower-spec than CRC32, and RISC-V has clmul with the Zbc extension?
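A guarded hardware/software dispatch might look like the sketch below (the function name and exact guards are illustrative, not the PR's code). Both paths implement the same reflected Castagnoli CRC, so the output stays platform-independent either way:

```c
#include <stdint.h>
#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_crc32_u64 */
#endif

/* One CRC32C accumulation step over a 64-bit word: the SSE4.2 CRC32
   instruction where available at build time, bitwise software otherwise.
   Neither path applies init/final inversion, matching the raw instruction. */
uint32_t crc32c_u64_step(uint32_t crc, uint64_t v) {
#if defined(__SSE4_2__) && defined(__x86_64__)
    return (uint32_t)_mm_crc32_u64(crc, v);
#else
    for (int byte = 0; byte < 8; byte++) {
        crc ^= (uint32_t)(v >> (8 * byte)) & 0xFFu;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return crc;
#endif
}
```

On aarch64 the hardware path would instead use the `__crc32cd` ACLE intrinsic behind an `__ARM_FEATURE_CRC32` guard.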
@danlark1 I chatted with @Cyan4973 yesterday, and we're a bit hesitant to require CRC32C to get performant compression. But I still think there is a path forward for this patch. We could gate the CRC32 hash behind a build macro like … This gets a bit more tricky with the need to use … But then we'd need to actually test this variant. I can help out in setting up our unit tests & determinism tests to cover it.
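The gating idea could be sketched like this; the macro and function names below are hypothetical, since the actual macro name is not shown in the thread. The gate defaults off so existing builds keep byte-identical output:

```c
#include <stdint.h>

/* Hypothetical build-time gate; the real macro name is not given above.
   Defaults to 0 so the multiplication hash (today's behavior) is used. */
#ifndef DEMO_USE_CRC32C_HASH
#define DEMO_USE_CRC32C_HASH 0
#endif

uint32_t demoMulHash(uint32_t x, unsigned h) {
    return (x * 2654435761u) >> (32 - h); /* illustrative constant */
}

uint32_t demoCrcHashSoft(uint32_t x, unsigned h) {
    uint32_t crc = x; /* bitwise CRC32C mix of one 32-bit word */
    for (int i = 0; i < 32; i++)
        crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    return crc >> (32 - h);
}

uint32_t demoHash(uint32_t x, unsigned h) {
#if DEMO_USE_CRC32C_HASH
    return demoCrcHashSoft(x, h);
#else
    return demoMulHash(x, h);
#endif
}
```

The cost of this approach, as noted above, is that both variants must be built and tested (including the determinism tests) rather than just one.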
Thanks a lot, the macro compromise sounds good to us. We are discussing internally and debugging why the shift is not working as we expect; hopefully we'll understand this by tomorrow. I'll prepare a patch for a macro then, and hopefully will find a good compromise, maybe a separate function of …
Sounds good, thanks @danlark1!
At Google we found around a 3-5% performance uplift across different kinds of tests, mostly on nicely compressible data. The improvement is mostly uniform across levels and across x86-64 and aarch64 platforms, with no change in compression ratios (variance is around 0.08%).
We implemented a slower crc32c version for other platforms. Let us know if you want to keep the multiplication hashes, but that would lead to slightly different outputs across platforms. We decided to prioritize stability rather than speed, but we don't have a strong opinion on this. This patch can be detrimental to 32-bit platforms as well as 64-bit RISC-V.
One test suddenly stopped producing an error, please take a look :)