Checksum performance is slow on Arm64 #2027

Open
kevinzs2048 opened this issue Jun 28, 2023 · 1 comment

kevinzs2048 commented Jun 28, 2023

Folly's checksum code is not optimized for Arm64 with Neon, so it runs quite slowly there.

./folly/hash/detail/ChecksumDetail.h

CacheLib relies heavily on Folly for its checksum implementation.

In perf top, with CacheLib in a hybrid cache configuration, the checksum consumes a lot of CPU time and has become a bottleneck.

Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
$ checksum_benchmark --bm_min_usec=10000
============================================================================
folly/hash/test/ChecksumBenchmark.cpp           relative  time/iter  iters/s
============================================================================
crc32_512                                                   55.73ns   17.94M
crc32_1024                                                  85.15ns   11.74M
crc32_2048                                                 116.29ns    8.60M
crc32_4096                                                 191.03ns    5.23M
crc32_8192                                                 341.44ns    2.93M
crc32_16384                                                627.76ns    1.59M
crc32_32768                                                  1.21us  827.16K
============================================================================
Comparison on Arm64:

============================================================================
[...]folly/hash/test/ChecksumBenchmark.cpp     relative  time/iter   iters/s
============================================================================
crc32_512                                                   1.80us   554.82K
crc32_1024                                                  3.58us   279.35K
crc32_2048                                                  7.14us   140.13K
crc32_4096                                                 14.25us    70.18K
crc32_8192                                                 28.47us    35.12K
crc32_16384                                                56.93us    17.57K
crc32_32768                                               113.83us     8.79K
============================================================================

Orvid commented Jul 10, 2023

If checksum is the bottleneck, the first thing I'd recommend is moving away from crc32, which, even fully optimized on x86_64, is less than a quarter of the speed of hash algorithms designed for speed, such as XXH3. XXH3 in particular should be well optimized for AArch64.

It does appear that ARM has equivalent hardware instructions for CRC32 hashing; we just haven't implemented support for them yet since we haven't needed it.
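
For illustration only, here is a rough sketch of what using those instructions could look like through the ACLE intrinsics in `<arm_acle.h>` (`__crc32d`, `__crc32b`). This is not folly's code: `crc32_sketch` is a hypothetical name, and the sketch assumes the standard CRC-32 (IEEE 802.3 / zlib) polynomial, a little-endian target, and a compiler that defines `__ARM_FEATURE_CRC32` (for example GCC or Clang with `-march=armv8-a+crc`). On other targets it falls back to a slow bitwise loop so the result can be compared.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#endif

// Hypothetical helper, not part of folly. Computes a zlib-compatible CRC-32.
// `seed` is the CRC of any preceding data (0 for a fresh checksum).
inline uint32_t crc32_sketch(const uint8_t* data, size_t len, uint32_t seed = 0) {
  uint32_t crc = ~seed;  // standard CRC-32 pre-inversion
#if defined(__ARM_FEATURE_CRC32)
  // Hardware path: consume 8 bytes per instruction, then the tail bytewise.
  while (len >= 8) {
    uint64_t chunk;
    std::memcpy(&chunk, data, sizeof(chunk));  // safe unaligned load
    crc = __crc32d(crc, chunk);
    data += 8;
    len -= 8;
  }
  while (len--) {
    crc = __crc32b(crc, *data++);
  }
#else
  // Portable fallback: bitwise CRC with the reflected polynomial 0xEDB88320.
  while (len--) {
    crc ^= *data++;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
#endif
  return ~crc;  // standard CRC-32 post-inversion
}
```

A single dependent chain of `__crc32d` calls like this is limited by instruction latency, so a tuned implementation would likely interleave several independent streams or use carry-less multiplication; the sketch only shows the basic shape.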
