Add avx512 implementation #79
Conversation
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
             │ /tmp/old.results │        /tmp/new.results         │
             │      sec/op      │    sec/op     vs base           │
Sum64/4B           2.295n ± 5%    3.018n ± 2%  +31.53% (p=0.000 n=10)
Sum64/16B          3.103n ± 5%    4.168n ± 3%  +34.32% (p=0.000 n=10)
Sum64/100B         9.865n ± 4%    8.515n ± 3%  -13.68% (p=0.000 n=10)
Sum64/4KB          201.4n ± 3%    133.1n ± 3%  -33.91% (p=0.000 n=10)
Sum64/10MB         489.8µ ± 4%    384.4µ ± 3%  -21.52% (p=0.000 n=10)
geomean            92.93n         88.67n        -4.58%

             │ /tmp/old.results │        /tmp/new.results         │
             │       B/s        │     B/s       vs base           │
Sum64/4B          1.623Gi ± 4%    1.234Gi ± 2%  -23.96% (p=0.000 n=10)
Sum64/16B         4.802Gi ± 5%    3.575Gi ± 3%  -25.56% (p=0.000 n=10)
Sum64/100B        9.441Gi ± 4%   10.937Gi ± 3%  +15.85% (p=0.000 n=10)
Sum64/4KB         18.49Gi ± 3%    27.99Gi ± 3%  +51.33% (p=0.000 n=10)
Sum64/10MB        19.01Gi ± 4%    24.23Gi ± 3%  +27.41% (p=0.000 n=10)
geomean           7.631Gi         7.998Gi        +4.81%
```
I've tried to optimize the small sizes, but I don't think I can: a huge part of that slowdown comes from checking the `useAvx512` global. I think that's fine; 4ns is still extremely fast for a single hash operation.
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                  │ /tmp/old.results │        /tmp/new.results         │
                  │      sec/op      │    sec/op     vs base           │
DigestBytes/4B          9.358n ± 2%    9.377n ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B         11.48n ± 3%    11.48n ± 1%        ~ (p=0.469 n=10)
DigestBytes/100B        15.97n ± 2%    20.41n ± 3%  +27.80% (p=0.000 n=10)
DigestBytes/4KB         212.4n ± 2%    167.7n ± 3%  -21.09% (p=0.000 n=10)
DigestBytes/10MB        493.2µ ± 3%    380.3µ ± 2%  -22.90% (p=0.000 n=10)
geomean                 178.2n         169.6n        -4.87%

                  │ /tmp/old.results │        /tmp/new.results         │
                  │       B/s        │      B/s      vs base           │
DigestBytes/4B         407.6Mi ± 2%   406.8Mi ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B        1.298Gi ± 3%   1.298Gi ± 1%        ~ (p=0.529 n=10)
DigestBytes/100B       5.831Gi ± 2%   4.563Gi ± 3%  -21.75% (p=0.000 n=10)
DigestBytes/4KB        17.54Gi ± 2%   22.22Gi ± 3%  +26.72% (p=0.000 n=10)
DigestBytes/10MB       18.88Gi ± 2%   24.49Gi ± 2%  +29.69% (p=0.000 n=10)
geomean                3.979Gi        4.183Gi        +5.12%
```
```
> benchstat /tmp/{old,new,extra}.results
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                  │  no avx512   │            avx512             │          avx512+extra          │
                  │    sec/op    │    sec/op     vs base         │    sec/op     vs base          │
DigestBytes/4B       9.358n ± 2%    9.377n ± 2%        ~ (p=0.481 n=10)    5.873n ± 5%  -37.24% (p=0.000 n=10)
DigestBytes/16B     11.485n ± 3%   11.485n ± 1%        ~ (p=0.469 n=10)    7.292n ± 5%  -36.51% (p=0.000 n=10)
DigestBytes/100B     15.97n ± 2%    20.41n ± 3%  +27.80% (p=0.000 n=10)    20.31n ± 3%  +27.18% (p=0.000 n=10)
DigestBytes/4KB      212.4n ± 2%    167.7n ± 3%  -21.09% (p=0.000 n=10)    163.5n ± 3%  -23.02% (p=0.000 n=10)
DigestBytes/10MB     493.2µ ± 3%    380.3µ ± 2%  -22.90% (p=0.000 n=10)    375.1µ ± 2%  -23.94% (p=0.000 n=10)
geomean              178.2n         169.6n        -4.87%                   139.8n       -21.57%

                  │  no avx512   │            avx512             │          avx512+extra          │
                  │     B/s      │     B/s       vs base         │     B/s       vs base          │
DigestBytes/4B      407.6Mi ± 2%   406.8Mi ± 2%        ~ (p=0.481 n=10)   649.6Mi ± 5%  +59.35% (p=0.000 n=10)
DigestBytes/16B     1.298Gi ± 3%   1.298Gi ± 1%        ~ (p=0.529 n=10)   2.044Gi ± 5%  +57.49% (p=0.000 n=10)
DigestBytes/100B    5.831Gi ± 2%   4.563Gi ± 3%  -21.75% (p=0.000 n=10)   4.586Gi ± 3%  -21.35% (p=0.000 n=10)
DigestBytes/4KB     17.54Gi ± 2%   22.22Gi ± 3%  +26.72% (p=0.000 n=10)   22.78Gi ± 3%  +29.90% (p=0.000 n=10)
DigestBytes/10MB    18.88Gi ± 2%   24.49Gi ± 2%  +29.69% (p=0.000 n=10)   24.83Gi ± 2%  +31.48% (p=0.000 n=10)
geomean             3.979Gi        4.183Gi        +5.12%                  5.074Gi       +27.51%
```
Overall LGTM. A few notes:

A) Test on Intel as well if possible. Alder Lake has terrible latency on VPMULLQ compared to AMD. Your switch-over point may be different.

B) wrt small sizes, I would do the size check in Go code and fall back for sizes <32 (or a similar threshold). That also eliminates all the initial branching: https://github.com/cespare/xxhash/pull/79/files#diff-c74c7d4d55252632fe997e1642f175859870550d550e8a96351e1782a6793bf2R82-R89

Actually, if I do the fallback to pure Go like this:

```go
func Sum64(b []byte) uint64 {
	if len(b) < 32 {
		h := prime5 + uint64(len(b))
		for ; len(b) >= 8; b = b[8:] {
			k1 := round(0, u64(b[:8]))
			h ^= k1
			h = rol27(h)*prime1 + prime4
		}
		if len(b) >= 4 {
			h ^= uint64(u32(b[:4])) * prime1
			h = rol23(h)*prime2 + prime3
			b = b[4:]
		}
		for ; len(b) > 0; b = b[1:] {
			h ^= uint64(b[0]) * prime5
			h = rol11(h) * prime1
		}
		h ^= h >> 33
		h *= prime2
		h ^= h >> 29
		h *= prime3
		h ^= h >> 32
		return h
	}
	if useAvx512 {
		return sum64Avx512(b)
	}
	return sum64Scalar(b)
}
```

...it is actually faster than using scalar asm. Not as fast as before, but the regression is certainly less. I can't test AVX-512. main / avx-512 / with the above:

```
BenchmarkSum64/4B    1391.82 MB/s | 942.91 MB/s | 1281.78 MB/s
```

Or... C) Try doing the branching in the assembly function. The branch should be fully predictable in the benchmark, so it shouldn't affect it too much. The function call is probably what makes the difference.

(I don't really like any of your other solutions - maybe except ABIInternal as a last resort.)
I'll try to make time to review this but it might be a while (pretty busy with work stuff). Just to set expectations, though:
By the way, are you making this change just for its own sake or do you have a use case where this matters a lot? If so, can you describe it? Thanks.
I am trying to improve the performance of @klauspost's zstd decompressor. zstd benchmarks are very hit and miss: there's a fair number of 1.2x improvements, but also as many 0.8x regressions; it depends a lot on the files being processed. I'll try some of @klauspost's suggestions to make it faster on small inputs.
I think I can get … One of the issues is moving data from the SIMD-integer complex to the GP-integer complex; latency on
I can add that, but it's probable the
Gotcha, thanks, that is useful motivation. Speeding up zstd is certainly an important and worthy goal.
This will effectively mean it will not be used. But it doesn't really change the fundamental problem - the regression on smaller sizes.
Blocks this small are a rare use case, and usually not a benefit in terms of compression anyway. I have forked xxhash anyway to remove the use of unsafe. It is a good call to add it here anyway to benefit all. But it does mean we can tweak the usage of AVX-512 eventually.
Yeah. Thought of that, but forgot to add it. I tend to have a … If you have extra time, one thing I saw is that you could load 64 bytes and do the … It seems like VPMULLQ can only execute on one port at a time, so even if … Since the … But… just a theory.
I tried that; it changes nothing. The critical dependency is on the state; the whole pre-state-mixing load and multiply doesn't even show up in
@Jorropo I was mainly thinking of Intel here. On AMD it can "hide" most of the (next loop) mul behind the 2-cycle latency of the add and rotate, as well as the branch check. I assume you also checked processing speed. Generally, be careful about trusting the profiles at an instruction level - the CPU often "stalls" in other places than the actual bottleneck. I often see unreasonably high load attributed to single instructions, where it is actually "collapsing" all of the unresolved OoO instructions rather than just waiting for that single instruction.
I did try that, not just looking at perf. For Intel I don't see how any of this makes sense: the scalar loop computes 4 multiplies in 6 cycles of latency (with just one multiply pipe; 4 with two). It needs to share with the first multiply, but that is still faster than one AVX-512 multiply.
Ran the benchmark on Intel. AVX512 is strictly worse :/
Ran your branch without any modification, except
It's actually terrible; I'm using the
Yeah. Zen4+ is a rather narrow target right now. The latest timings I can find on Intel are for Rocket Lake: still 15 cycles.
Here are results on two different kinds of CPU (Zen2 without AVX512 support and Zen4 with AVX512 support):
For big sizes it's either no change or way faster.
I don't like how smaller sizes are impacted. All the things I tried made it worse, because the overhead of meta-algorithm selection and function calls is very significant for these small buffers (50~25%).
The main solutions I see:
- Make the input escape to the heap.
- `GOAMD64=v4`. Not as accessible to end users; prevents testing the scalar impl when `GOAMD64=v4` is enabled (because it works by using a `const useAvx512 bool` override that can't be turned off in tests).
- `ABIInternal`. Can break in the future; not covered by the compatibility promise.
- `Sum64_13(seed uint64, a uint64, b uint32, c uint8)`. Very cumbersome to use.
¯\_(ツ)_/¯
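As a sketch of how the `GOAMD64=v4` option could be wired up: building with `GOAMD64=v4` sets the `amd64.v4` build tag, and the x86-64-v4 level guarantees AVX-512 F/BW/CD/DQ/VL, so the flag can become a compile-time constant. The file name below is illustrative, not from the PR:

```go
//go:build amd64.v4

package xxhash

// Hypothetical wiring: under GOAMD64=v4 the toolchain guarantees the
// x86-64-v4 feature set (including AVX-512 F/BW/CD/DQ/VL), so the
// flag can be a constant and the dispatch branch folds away entirely.
//
// A companion file with //go:build !amd64.v4 would instead declare
//   var useAvx512 = ...runtime CPUID detection...
// This const override is exactly why the scalar path can't be
// exercised in tests when building with GOAMD64=v4.
const useAvx512 = true
```

This illustrates the trade-off noted in the list above: the branch disappears, but so does the ability to flip the flag off in tests.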