Add avx512 implementation #79

Open · Jorropo wants to merge 3 commits into main
Conversation

@Jorropo Jorropo commented May 8, 2024

Here are results on two different kinds of CPU (Zen2 without AVX512 support and Zen4 with AVX512 support):
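For reference, comparisons like those below can be produced with benchstat (golang.org/x/perf); a hypothetical invocation, with illustrative file names:

```
# on the base commit
go test -run '^$' -bench . -count 10 > /tmp/old.results
# on this branch
go test -run '^$' -bench . -count 10 > /tmp/new.results
benchstat /tmp/old.results /tmp/new.results
```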

```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 5 3600 6-Core Processor              
                     │ /tmp/old.local.results │       /tmp/new.local.results        │
                     │         sec/op         │   sec/op     vs base                │
Sum64/4B-12                       3.397n ± 1%   4.702n ± 3%  +38.43% (p=0.000 n=10)
Sum64/16B-12                      4.412n ± 3%   5.876n ± 2%  +33.20% (p=0.000 n=10)
Sum64/100B-12                     14.54n ± 2%   15.87n ± 2%   +9.11% (p=0.000 n=10)
Sum64/4KB-12                      268.1n ± 2%   268.8n ± 1%        ~ (p=0.184 n=10)
Sum64/10MB-12                     647.2µ ± 2%   645.3µ ± 2%        ~ (p=0.529 n=10)
Sum64String/4B-12                 3.643n ± 2%   4.953n ± 2%  +35.96% (p=0.000 n=10)
Sum64String/16B-12                4.529n ± 3%   5.887n ± 2%  +29.98% (p=0.000 n=10)
Sum64String/100B-12               14.25n ± 3%   15.76n ± 2%  +10.56% (p=0.000 n=10)
Sum64String/4KB-12                266.8n ± 3%   266.4n ± 2%        ~ (p=0.592 n=10)
Sum64String/10MB-12               637.4µ ± 3%   646.3µ ± 3%        ~ (p=0.529 n=10)
DigestBytes/4B-12                 8.472n ± 4%   7.976n ± 2%   -5.85% (p=0.000 n=10)
DigestBytes/16B-12                10.91n ± 2%   10.04n ± 2%   -7.97% (p=0.000 n=10)
DigestBytes/100B-12               20.09n ± 2%   20.82n ± 3%   +3.66% (p=0.011 n=10)
DigestBytes/4KB-12                277.9n ± 2%   274.9n ± 2%        ~ (p=0.086 n=10)
DigestBytes/10MB-12               643.4µ ± 3%   651.5µ ± 2%        ~ (p=0.123 n=10)
DigestString/4B-12                8.566n ± 2%   8.701n ± 2%        ~ (p=0.138 n=10)
DigestString/16B-12               10.87n ± 5%   10.79n ± 2%   -0.78% (p=0.015 n=10)
DigestString/100B-12              20.49n ± 3%   21.39n ± 1%   +4.37% (p=0.007 n=10)
DigestString/4KB-12               269.1n ± 3%   272.0n ± 2%        ~ (p=0.436 n=10)
DigestString/10MB-12              632.9µ ± 2%   643.6µ ± 1%        ~ (p=0.143 n=10)
geomean                           162.4n        173.8n        +7.00%

                     │ /tmp/old.local.results │        /tmp/new.local.results        │
                     │          B/s           │     B/s       vs base                │
Sum64/4B-12                     1123.0Mi ± 1%   811.2Mi ± 3%  -27.76% (p=0.000 n=10)
Sum64/16B-12                     3.378Gi ± 3%   2.536Gi ± 2%  -24.92% (p=0.000 n=10)
Sum64/100B-12                    6.403Gi ± 2%   5.867Gi ± 2%   -8.37% (p=0.000 n=10)
Sum64/4KB-12                     13.90Gi ± 2%   13.86Gi ± 1%        ~ (p=0.190 n=10)
Sum64/10MB-12                    14.39Gi ± 2%   14.43Gi ± 2%        ~ (p=0.529 n=10)
Sum64String/4B-12               1047.2Mi ± 3%   770.1Mi ± 2%  -26.46% (p=0.000 n=10)
Sum64String/16B-12               3.290Gi ± 3%   2.531Gi ± 2%  -23.06% (p=0.000 n=10)
Sum64String/100B-12              6.537Gi ± 3%   5.910Gi ± 2%   -9.59% (p=0.000 n=10)
Sum64String/4KB-12               13.97Gi ± 3%   13.98Gi ± 2%        ~ (p=0.579 n=10)
Sum64String/10MB-12              14.61Gi ± 3%   14.41Gi ± 3%        ~ (p=0.529 n=10)
DigestBytes/4B-12                450.3Mi ± 4%   478.2Mi ± 2%   +6.21% (p=0.000 n=10)
DigestBytes/16B-12               1.365Gi ± 1%   1.484Gi ± 2%   +8.69% (p=0.000 n=10)
DigestBytes/100B-12              4.636Gi ± 2%   4.473Gi ± 3%   -3.53% (p=0.011 n=10)
DigestBytes/4KB-12               13.40Gi ± 2%   13.55Gi ± 2%        ~ (p=0.105 n=10)
DigestBytes/10MB-12              14.48Gi ± 3%   14.29Gi ± 2%        ~ (p=0.123 n=10)
DigestString/4B-12               445.4Mi ± 2%   438.4Mi ± 2%        ~ (p=0.143 n=10)
DigestString/16B-12              1.371Gi ± 5%   1.382Gi ± 2%   +0.79% (p=0.023 n=10)
DigestString/100B-12             4.544Gi ± 3%   4.354Gi ± 1%   -4.17% (p=0.007 n=10)
DigestString/4KB-12              13.84Gi ± 3%   13.70Gi ± 2%        ~ (p=0.436 n=10)
DigestString/10MB-12             14.72Gi ± 2%   14.47Gi ± 1%        ~ (p=0.143 n=10)
geomean                          4.366Gi        4.081Gi        -6.54%
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor            
                  │ /tmp/old.results │          /tmp/new.results           │
                  │      sec/op      │   sec/op     vs base                │
Sum64/4B                 2.318n ± 3%   3.064n ± 5%  +32.20% (p=0.000 n=10)
Sum64/16B                3.074n ± 3%   4.243n ± 5%  +38.03% (p=0.000 n=10)
Sum64/100B               9.663n ± 3%   8.431n ± 2%  -12.74% (p=0.000 n=10)
Sum64/4KB                198.0n ± 5%   132.0n ± 3%  -33.38% (p=0.000 n=10)
Sum64/10MB               481.8µ ± 3%   376.5µ ± 3%  -21.86% (p=0.000 n=10)
Sum64String/4B           2.768n ± 4%   3.357n ± 3%  +21.26% (p=0.000 n=10)
Sum64String/16B          3.647n ± 4%   4.537n ± 4%  +24.40% (p=0.000 n=10)
Sum64String/100B        10.915n ± 4%   8.827n ± 3%  -19.13% (p=0.000 n=10)
Sum64String/4KB          203.2n ± 3%   131.1n ± 4%  -35.47% (p=0.000 n=10)
Sum64String/10MB         489.8µ ± 2%   370.2µ ± 3%  -24.42% (p=0.000 n=10)
DigestBytes/4B           9.398n ± 3%   5.836n ± 2%  -37.90% (p=0.000 n=10)
DigestBytes/16B         11.440n ± 3%   7.208n ± 3%  -36.99% (p=0.000 n=10)
DigestBytes/100B         15.81n ± 3%   19.92n ± 3%  +26.00% (p=0.000 n=10)
DigestBytes/4KB          208.7n ± 3%   162.8n ± 2%  -22.00% (p=0.000 n=10)
DigestBytes/10MB         491.6µ ± 3%   374.4µ ± 2%  -23.84% (p=0.000 n=10)
DigestString/4B          8.987n ± 4%   5.913n ± 3%  -34.21% (p=0.000 n=10)
DigestString/16B        11.620n ± 3%   7.115n ± 3%  -38.77% (p=0.000 n=10)
DigestString/100B        15.78n ± 3%   20.14n ± 2%  +27.67% (p=0.000 n=10)
DigestString/4KB         208.7n ± 4%   167.2n ± 3%  -19.89% (p=0.000 n=10)
DigestString/10MB        490.2µ ± 4%   380.4µ ± 3%  -22.39% (p=0.000 n=10)
geomean                  130.7n        112.1n       -14.25%

                  │ /tmp/old.results │           /tmp/new.results            │
                  │       B/s        │      B/s       vs base                │
Sum64/4B                1.607Gi ± 3%    1.216Gi ± 5%  -24.36% (p=0.000 n=10)
Sum64/16B               4.848Gi ± 3%    3.512Gi ± 5%  -27.55% (p=0.000 n=10)
Sum64/100B              9.639Gi ± 3%   11.047Gi ± 2%  +14.60% (p=0.000 n=10)
Sum64/4KB               18.81Gi ± 5%    28.23Gi ± 3%  +50.05% (p=0.000 n=10)
Sum64/10MB              19.33Gi ± 3%    24.74Gi ± 3%  +27.95% (p=0.000 n=10)
Sum64String/4B          1.346Gi ± 4%    1.110Gi ± 3%  -17.53% (p=0.000 n=10)
Sum64String/16B         4.086Gi ± 4%    3.285Gi ± 4%  -19.62% (p=0.000 n=10)
Sum64String/100B        8.532Gi ± 5%   10.551Gi ± 3%  +23.66% (p=0.000 n=10)
Sum64String/4KB         18.34Gi ± 3%    28.42Gi ± 3%  +54.95% (p=0.000 n=10)
Sum64String/10MB        19.02Gi ± 3%    25.16Gi ± 3%  +32.31% (p=0.000 n=10)
DigestBytes/4B          405.9Mi ± 3%    653.7Mi ± 2%  +61.04% (p=0.000 n=10)
DigestBytes/16B         1.302Gi ± 3%    2.068Gi ± 3%  +58.75% (p=0.000 n=10)
DigestBytes/100B        5.892Gi ± 3%    4.675Gi ± 3%  -20.65% (p=0.000 n=10)
DigestBytes/4KB         17.86Gi ± 3%    22.89Gi ± 2%  +28.21% (p=0.000 n=10)
DigestBytes/10MB        18.95Gi ± 4%    24.88Gi ± 2%  +31.31% (p=0.000 n=10)
DigestString/4B         424.5Mi ± 3%    645.3Mi ± 3%  +52.03% (p=0.000 n=10)
DigestString/16B        1.282Gi ± 3%    2.094Gi ± 3%  +63.30% (p=0.000 n=10)
DigestString/100B       5.903Gi ± 3%    4.625Gi ± 2%  -21.65% (p=0.000 n=10)
DigestString/4KB        17.85Gi ± 4%    22.28Gi ± 3%  +24.83% (p=0.000 n=10)
DigestString/10MB       19.00Gi ± 4%    24.48Gi ± 3%  +28.85% (p=0.000 n=10)
geomean                 5.426Gi         6.328Gi       +16.62%
```

For big sizes it's either no change or way faster.
I don't like how smaller sizes are impacted; everything I tried made it worse, because the overhead of meta algorithm selection and function calls is very significant for these small buffers (25~50%).
The main solutions I see:

  • Use dynamic dispatch (see the sketch after this list).
    Makes the input escape to the heap.
  • Gate behind GOAMD64=v4.
    Not as accessible to end users, and prevents testing the scalar impl when GOAMD64=v4 is enabled (because it would work by using a const useAvx512 bool override that can't be turned off in tests).
  • Use ABIInternal.
    Can break in the future; not covered by the compatibility promise.
  • Provide a register-only pure-Go API for fixed-size elements, e.g. Sum64_13(seed uint64, a uint64, b uint32, c uint8).
    Very cumbersome to use.
  • Whatever, it's fine if hashing 16 bytes takes 1ns more?
    ¯\_(ツ)_/¯
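For reference, a minimal sketch of the dynamic-dispatch option; the golang.org/x/sys/cpu feature check and the sum64Scalar/sum64Avx512 names are illustrative, not the code in this PR:

```
package xxhash

import "golang.org/x/sys/cpu"

// sum64Impl is selected once, at init time. Sum64 now calls through a
// function variable; escape analysis can't see through the indirect
// call, so b is assumed to escape - the downside noted above.
var sum64Impl = sum64Scalar

func init() {
	// VPMULLQ requires AVX512DQ on top of AVX512F.
	if cpu.X86.HasAVX512F && cpu.X86.HasAVX512DQ {
		sum64Impl = sum64Avx512
	}
}

func Sum64(b []byte) uint64 { return sum64Impl(b) }
```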

```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
           │ /tmp/old.results │          /tmp/new.results           │
           │      sec/op      │   sec/op     vs base                │
Sum64/4B          2.295n ± 5%   3.018n ± 2%  +31.53% (p=0.000 n=10)
Sum64/16B         3.103n ± 5%   4.168n ± 3%  +34.32% (p=0.000 n=10)
Sum64/100B        9.865n ± 4%   8.515n ± 3%  -13.68% (p=0.000 n=10)
Sum64/4KB         201.4n ± 3%   133.1n ± 3%  -33.91% (p=0.000 n=10)
Sum64/10MB        489.8µ ± 4%   384.4µ ± 3%  -21.52% (p=0.000 n=10)
geomean           92.93n        88.67n        -4.58%

           │ /tmp/old.results │           /tmp/new.results            │
           │       B/s        │      B/s       vs base                │
Sum64/4B         1.623Gi ± 4%    1.234Gi ± 2%  -23.96% (p=0.000 n=10)
Sum64/16B        4.802Gi ± 5%    3.575Gi ± 3%  -25.56% (p=0.000 n=10)
Sum64/100B       9.441Gi ± 4%   10.937Gi ± 3%  +15.85% (p=0.000 n=10)
Sum64/4KB        18.49Gi ± 3%    27.99Gi ± 3%  +51.33% (p=0.000 n=10)
Sum64/10MB       19.01Gi ± 4%    24.23Gi ± 3%  +27.41% (p=0.000 n=10)
geomean          7.631Gi         7.998Gi        +4.81%
```

I've tried to optimize the small sizes but I don't think I can,
since a huge part of that slowdown is checking the `useAvx512` global.
I think that's fine: 4ns is still extremely fast for a single hash operation.
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                 │ /tmp/old.results │          /tmp/new.results           │
                 │      sec/op      │   sec/op     vs base                │
DigestBytes/4B          9.358n ± 2%   9.377n ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B         11.48n ± 3%   11.48n ± 1%        ~ (p=0.469 n=10)
DigestBytes/100B        15.97n ± 2%   20.41n ± 3%  +27.80% (p=0.000 n=10)
DigestBytes/4KB         212.4n ± 2%   167.7n ± 3%  -21.09% (p=0.000 n=10)
DigestBytes/10MB        493.2µ ± 3%   380.3µ ± 2%  -22.90% (p=0.000 n=10)
geomean                 178.2n        169.6n        -4.87%

                 │ /tmp/old.results │           /tmp/new.results           │
                 │       B/s        │     B/s       vs base                │
DigestBytes/4B         407.6Mi ± 2%   406.8Mi ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B        1.298Gi ± 3%   1.298Gi ± 1%        ~ (p=0.529 n=10)
DigestBytes/100B       5.831Gi ± 2%   4.563Gi ± 3%  -21.75% (p=0.000 n=10)
DigestBytes/4KB        17.54Gi ± 2%   22.22Gi ± 3%  +26.72% (p=0.000 n=10)
DigestBytes/10MB       18.88Gi ± 2%   24.49Gi ± 2%  +29.69% (p=0.000 n=10)
geomean                3.979Gi        4.183Gi        +5.12%
```
```
> benchstat /tmp/{old,new,extra}.results
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                 │ no avx512        │ avx512                               │ avx512+extra                        │
                 │      sec/op      │    sec/op     vs base                │   sec/op     vs base                │
DigestBytes/4B          9.358n ± 2%    9.377n ± 2%        ~ (p=0.481 n=10)   5.873n ± 5%  -37.24% (p=0.000 n=10)
DigestBytes/16B        11.485n ± 3%   11.485n ± 1%        ~ (p=0.469 n=10)   7.292n ± 5%  -36.51% (p=0.000 n=10)
DigestBytes/100B        15.97n ± 2%    20.41n ± 3%  +27.80% (p=0.000 n=10)   20.31n ± 3%  +27.18% (p=0.000 n=10)
DigestBytes/4KB         212.4n ± 2%    167.7n ± 3%  -21.09% (p=0.000 n=10)   163.5n ± 3%  -23.02% (p=0.000 n=10)
DigestBytes/10MB        493.2µ ± 3%    380.3µ ± 2%  -22.90% (p=0.000 n=10)   375.1µ ± 2%  -23.94% (p=0.000 n=10)
geomean                 178.2n         169.6n        -4.87%                  139.8n       -21.57%

                 │ no avx512        │ avx512                               │ avx512+extra                         │
                 │       B/s        │     B/s       vs base                │     B/s       vs base                │
DigestBytes/4B         407.6Mi ± 2%   406.8Mi ± 2%        ~ (p=0.481 n=10)   649.6Mi ± 5%  +59.35% (p=0.000 n=10)
DigestBytes/16B        1.298Gi ± 3%   1.298Gi ± 1%        ~ (p=0.529 n=10)   2.044Gi ± 5%  +57.49% (p=0.000 n=10)
DigestBytes/100B       5.831Gi ± 2%   4.563Gi ± 3%  -21.75% (p=0.000 n=10)   4.586Gi ± 3%  -21.35% (p=0.000 n=10)
DigestBytes/4KB        17.54Gi ± 2%   22.22Gi ± 3%  +26.72% (p=0.000 n=10)   22.78Gi ± 3%  +29.90% (p=0.000 n=10)
DigestBytes/10MB       18.88Gi ± 2%   24.49Gi ± 2%  +29.69% (p=0.000 n=10)   24.83Gi ± 2%  +31.48% (p=0.000 n=10)
geomean                3.979Gi        4.183Gi        +5.12%                  5.074Gi       +27.51%
```

klauspost commented May 9, 2024

Overall LGTM. A few notes:

A) Test on Intel as well if possible. Alder Lake has terrible latency on VPMULLQ compared to AMD. Your switch-over point may be different.

B) Regarding small sizes, I would do the size check in Go code and fall back for sizes < 32 (or a similar threshold). That also eliminates all the initial branching: https://github.com/cespare/xxhash/pull/79/files#diff-c74c7d4d55252632fe997e1642f175859870550d550e8a96351e1782a6793bf2R82-R89

Actually, if I fall back to pure Go like this:

```
func Sum64(b []byte) uint64 {
	if len(b) < 32 {
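		// Small inputs: handled inline in pure Go so they never pay for
		// the implementation-selection branch or an assembly call.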
		h := prime5 + uint64(len(b))
		for ; len(b) >= 8; b = b[8:] {
			k1 := round(0, u64(b[:8]))
			h ^= k1
			h = rol27(h)*prime1 + prime4
		}
		if len(b) >= 4 {
			h ^= uint64(u32(b[:4])) * prime1
			h = rol23(h)*prime2 + prime3
			b = b[4:]
		}
		for ; len(b) > 0; b = b[1:] {
			h ^= uint64(b[0]) * prime5
			h = rol11(h) * prime1
		}

		h ^= h >> 33
		h *= prime2
		h ^= h >> 29
		h *= prime3
		h ^= h >> 32

		return h
	}
	if useAvx512 {
		return sum64Avx512(b)
	}
	return sum64Scalar(b)
}
```

.. it is actually faster than using scalar asm. Not as fast as before, but the regression is certainly smaller. I can't test AVX-512.

main / avx-512 / with above:

```
BenchmarkSum64/4B   1391.82 MB/s | 942.91 MB/s | 1281.78 MB/s
BenchmarkSum64/16B  3998.77 MB/s | 2942.05 MB/s | 3094.86 MB/s
```

or...

C) Try doing the branching in the assembly function.

The branch should be fully predictable in the benchmark, so it shouldn't affect it too much.

The function call is probably what makes the difference. Keeping Sum64 as an assembly function will probably minimize the overhead for small sizes. You should be able to access useAvx512 with something like `MOVBQZX ·useAvx512(SB), CX`.
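A rough sketch of that idea; it assumes sum64Scalar and sum64Avx512 are ABI0 assembly functions with the same signature, so a tail JMP can reuse the caller's argument frame (names illustrative):

```
// func Sum64(b []byte) uint64
TEXT ·Sum64(SB), NOSPLIT, $0-32
	MOVBQZX ·useAvx512(SB), CX // load the package-level feature flag
	TESTQ   CX, CX
	JNZ     avx512
	JMP     ·sum64Scalar(SB) // tail call; arguments stay in place
avx512:
	JMP     ·sum64Avx512(SB)
```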

(I don't really like any of your other solutions - maybe except ABIInternal as a last resort)


cespare commented May 9, 2024

I'll try to make time to review this but it might be a while (pretty busy with work stuff).

Just to set expectations, though:

  • I'm more of a fan of GOAMD64=v4 than dynamic feature detection for AVX512.
  • I'd like to avoid adding dependencies, especially an avo dependency which is only needed for code gen. So that would need to be a separate module.
    • But I'm also not really sold on avo in general for this use case; it doesn't seem better than just editing the asm code directly.
  • ABIInternal and API changes are non-starters.
  • I'm not willing to sacrifice performance on small inputs or non-AVX512 chips for moderate gains on large inputs. Many of the use cases I'm aware of for this package only use it for small inputs.

By the way, are you making this change just for its own sake or do you have a use case where this matters a lot? If so, can you describe it? Thanks.


Jorropo commented May 9, 2024

I am trying to improve the performance of @klauspost's zstd decompressor.
I have profiles where xxhash is between 5 and 10% of CPU time spent.

zstd benchmarks are very hit and miss: there are a fair number of 1.2x improvements but just as many 0.8x regressions; it depends a lot on the files being processed. results.txt

I'll try some of @klauspost's suggestions to make it faster on small inputs.
There are known issues for inputs between 32b~128b~256b*, which may be causing most of the slowdown in the zstd benchmarks.

> I'm not willing to sacrifice performance on small inputs or non-AVX512 chips for moderate gains on large inputs. Many of the use cases I'm aware of for this package only use it for small inputs.

I think I can get <32b inputs down to a <=1ns degradation (from 3ns to 4ns); anything above would be on par or better**. Would that be OK?

*One of the issues is moving data from the SIMD-integer domain to the GP-integer domain; the latency of VMOVQ and VPEXTRQ is bad. This only happens once, after the loop, so it doesn't impact big inputs.
**On AMD CPUs, by quick back-of-the-envelope math; it will be way slower on Intel, and I'll have to use cpuid to make it run only on AMD.

> I'm more of a fan of GOAMD64=v4 than dynamic feature detection for AVX512.

I can add that, but it's probable that the GOAMD64=v4 path will still need cpuid checks.
I'm waiting on golang/go#67271 to check performance on Intel (or if anyone knows where I can rent Alder Lake CPU cores for not too much, please thx).


cespare commented May 9, 2024

Gotcha, thanks, that is useful motivation. Speeding up zstd is certainly an important and worthy goal.

@klauspost

> GOAMD64=v4

This will effectively mean it will not be used. But it doesn't really change the fundamental problem - the regression on smaller sizes.

> 32b~128b~256b which is maybe causing most slowdown in zstd benchmarks.

Blocks this small are a rare use case, and usually not a benefit in terms of compression anyway.

I have forked xxhash anyway to remove the use of unsafe. It is a good call to add it here anyway to benefit all. But it does mean we can tweak the usage of AVX-512 eventually.

> avo dependency

Yeah. Thought of that, but forgot to add it. I tend to have a go.mod inside the generation folder, and have the code-generation directory name start with _ so it is ignored by the go tool.
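Something like this hypothetical layout (names illustrative):

```
_gen/        leading underscore: ignored by the go tool
  go.mod     separate module, so avo never enters the main module's deps
  main.go    avo program that writes the checked-in .s file
```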

@Jorropo

If you have extra time, one thing I saw is that you could load 64 bytes and do the VPMULLQ(temp, yprime2, temp) once on 64 bytes - unrolling the blockLoop once.

It seems like VPMULLQ can only execute on one port at a time, so even if the state update is the main bottleneck, this could give an improvement - particularly on Intel.

Since the state updates are highly dependency-bound, there should be plenty of other ports free to handle moving the upper 32 bytes down.

But.... Just a theory.


Jorropo commented May 10, 2024

I tried that; it changes nothing. The critical dependency is on the state: the whole pre-state mixing (load and multiply) doesn't even show up in the perf report, so it's not worth complexifying the loops.

@klauspost

@Jorropo I was mainly thinking of Intel here. On AMD it can "hide" most of the (next loop) mul behind the 2 cycle latency of the add and rotate as well as the branch check.

I assume you also checked processing speed. Generally, be careful about trusting profiles at the instruction level - the CPU often "stalls" in other places than the actual bottleneck. I often see unreasonably high load attributed to a single instruction - where it is actually "collapsing" all of the unresolved OoO instructions rather than just waiting for that one instruction.


Jorropo commented May 10, 2024

I did try that, not just looking at perf. For Intel I don't see how any of this makes sense: the scalar loop computes 4 multiplies in 6 cycles of latency (with just one multiply pipe; 4 cycles with two). It needs to share with the first multiply, but that's still faster than one AVX-512 multiply.
I will bench once I get some hardware access.

@klauspost

Ran the benchmark on Intel. AVX512 is strictly worse :/

```
$ ./xxbench-plain -test.bench=.
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/xxhashbench
cpu: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
BenchmarkHashes/xxhash,direct,bytes,n=5B-128            131890704                9.175 ns/op     544.97 MB/s
BenchmarkHashes/xxhash,direct,string,n=5B-128           103834551               11.38 ns/op      439.21 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=5B-128            34979754                34.61 ns/op      144.46 MB/s
BenchmarkHashes/xxhash,digest,string,n=5B-128           34841151                34.27 ns/op      145.89 MB/s
BenchmarkHashes/xxhash,direct,bytes,n=100B-128          42781322                27.82 ns/op     3594.35 MB/s
BenchmarkHashes/xxhash,direct,string,n=100B-128         38591721                30.67 ns/op     3260.19 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=100B-128          29584527                40.25 ns/op     2484.49 MB/s
BenchmarkHashes/xxhash,digest,string,n=100B-128         29274169                40.74 ns/op     2454.56 MB/s
BenchmarkHashes/xxhash,direct,bytes,n=4KB-128            2238400               526.3 ns/op      7600.02 MB/s
BenchmarkHashes/xxhash,direct,string,n=4KB-128           2272880               500.1 ns/op      7997.87 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=4KB-128            2198166               537.0 ns/op      7448.69 MB/s
BenchmarkHashes/xxhash,digest,string,n=4KB-128           2260663               536.7 ns/op      7452.59 MB/s
BenchmarkHashes/xxhash,direct,bytes,n=10MB-128               951           1270701 ns/op        7869.67 MB/s
BenchmarkHashes/xxhash,direct,string,n=10MB-128              867           1275952 ns/op        7837.28 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=10MB-128               859           1278144 ns/op        7823.85 MB/s
BenchmarkHashes/xxhash,digest,string,n=10MB-128              853           1275316 ns/op        7841.19 MB/s
...
```

```
$ ./xxbench-avx512 -test.bench=.
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/xxhashbench
cpu: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
BenchmarkHashes/xxhash,direct,bytes,n=5B-128            125890801                9.360 ns/op     534.17 MB/s
BenchmarkHashes/xxhash,direct,string,n=5B-128           102338637               11.57 ns/op      432.28 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=5B-128            35074806                34.06 ns/op      146.79 MB/s
BenchmarkHashes/xxhash,digest,string,n=5B-128           35505313                34.20 ns/op      146.19 MB/s
BenchmarkHashes/xxhash,direct,bytes,n=100B-128          32327865                37.12 ns/op     2693.65 MB/s
BenchmarkHashes/xxhash,direct,string,n=100B-128         32085871                37.43 ns/op     2671.34 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=100B-128          14275802                71.46 ns/op     1399.36 MB/s
BenchmarkHashes/xxhash,digest,string,n=100B-128         14468517                69.72 ns/op     1434.22 MB/s
BenchmarkHashes/xxhash,direct,bytes,n=4KB-128            1136388              1061 ns/op        3769.46 MB/s
BenchmarkHashes/xxhash,direct,string,n=4KB-128           1113885              1065 ns/op        3756.11 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=4KB-128            1053745              1123 ns/op        3560.40 MB/s
BenchmarkHashes/xxhash,digest,string,n=4KB-128           1080630              1123 ns/op        3562.25 MB/s
BenchmarkHashes/xxhash,direct,bytes,n=10MB-128               426           2706600 ns/op        3694.67 MB/s
BenchmarkHashes/xxhash,direct,string,n=10MB-128              428           2706010 ns/op        3695.48 MB/s
BenchmarkHashes/xxhash,digest,bytes,n=10MB-128               423           2767249 ns/op        3613.70 MB/s
BenchmarkHashes/xxhash,digest,string,n=10MB-128              399           2704197 ns/op        3697.96 MB/s
```

I ran your branch without any modification, except `var useAvx512 = false` for the "plain" version.


Jorropo commented May 10, 2024

It's actually terrible. I'm using the testing.B benchmarks from github.com/cespare/xxhash, so I don't know exactly how comparable they are to my Zen4 benchmarks, but overall it looks so bad that I don't think it's even worth trying to SIMD the 64 bytes in the pre-state stage of the loop.

@klauspost

Yeah. Zen4+ is a rather narrow target right now. The latest timings I can find for Intel are for Rocket Lake. Still 15 cycles.
