Add avx512 implementation #79
Conversation
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
             │ /tmp/old.results │        /tmp/new.results         │
             │      sec/op      │    sec/op     vs base           │
Sum64/4B           2.295n ± 5%    3.018n ± 2%  +31.53% (p=0.000 n=10)
Sum64/16B          3.103n ± 5%    4.168n ± 3%  +34.32% (p=0.000 n=10)
Sum64/100B         9.865n ± 4%    8.515n ± 3%  -13.68% (p=0.000 n=10)
Sum64/4KB          201.4n ± 3%    133.1n ± 3%  -33.91% (p=0.000 n=10)
Sum64/10MB         489.8µ ± 4%    384.4µ ± 3%  -21.52% (p=0.000 n=10)
geomean            92.93n         88.67n        -4.58%

             │ /tmp/old.results │        /tmp/new.results         │
             │       B/s        │     B/s       vs base           │
Sum64/4B          1.623Gi ± 4%    1.234Gi ± 2%  -23.96% (p=0.000 n=10)
Sum64/16B         4.802Gi ± 5%    3.575Gi ± 3%  -25.56% (p=0.000 n=10)
Sum64/100B        9.441Gi ± 4%   10.937Gi ± 3%  +15.85% (p=0.000 n=10)
Sum64/4KB         18.49Gi ± 3%    27.99Gi ± 3%  +51.33% (p=0.000 n=10)
Sum64/10MB        19.01Gi ± 4%    24.23Gi ± 3%  +27.41% (p=0.000 n=10)
geomean           7.631Gi         7.998Gi        +4.81%
```
I've tried to optimize the small sizes, but I don't think I can: a huge part of that slowdown comes from checking the `useAvx512` global. I think that's fine; 4ns is still extremely fast for a single hash operation.
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                  │ /tmp/old.results │        /tmp/new.results         │
                  │      sec/op      │    sec/op     vs base           │
DigestBytes/4B          9.358n ± 2%    9.377n ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B         11.48n ± 3%    11.48n ± 1%        ~ (p=0.469 n=10)
DigestBytes/100B        15.97n ± 2%    20.41n ± 3%  +27.80% (p=0.000 n=10)
DigestBytes/4KB         212.4n ± 2%    167.7n ± 3%  -21.09% (p=0.000 n=10)
DigestBytes/10MB        493.2µ ± 3%    380.3µ ± 2%  -22.90% (p=0.000 n=10)
geomean                 178.2n         169.6n        -4.87%

                  │ /tmp/old.results │        /tmp/new.results         │
                  │       B/s        │      B/s      vs base           │
DigestBytes/4B         407.6Mi ± 2%   406.8Mi ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B        1.298Gi ± 3%   1.298Gi ± 1%        ~ (p=0.529 n=10)
DigestBytes/100B       5.831Gi ± 2%   4.563Gi ± 3%  -21.75% (p=0.000 n=10)
DigestBytes/4KB        17.54Gi ± 2%   22.22Gi ± 3%  +26.72% (p=0.000 n=10)
DigestBytes/10MB       18.88Gi ± 2%   24.49Gi ± 2%  +29.69% (p=0.000 n=10)
geomean                3.979Gi        4.183Gi        +5.12%
```
```
> benchstat /tmp/{old,new,extra}.results
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                  │  no avx512   │            avx512             │          avx512+extra          │
                  │    sec/op    │    sec/op     vs base         │    sec/op     vs base          │
DigestBytes/4B       9.358n ± 2%    9.377n ± 2%        ~ (p=0.481 n=10)    5.873n ± 5%  -37.24% (p=0.000 n=10)
DigestBytes/16B     11.485n ± 3%   11.485n ± 1%        ~ (p=0.469 n=10)    7.292n ± 5%  -36.51% (p=0.000 n=10)
DigestBytes/100B     15.97n ± 2%    20.41n ± 3%  +27.80% (p=0.000 n=10)    20.31n ± 3%  +27.18% (p=0.000 n=10)
DigestBytes/4KB      212.4n ± 2%    167.7n ± 3%  -21.09% (p=0.000 n=10)    163.5n ± 3%  -23.02% (p=0.000 n=10)
DigestBytes/10MB     493.2µ ± 3%    380.3µ ± 2%  -22.90% (p=0.000 n=10)    375.1µ ± 2%  -23.94% (p=0.000 n=10)
geomean              178.2n         169.6n        -4.87%                   139.8n       -21.57%

                  │  no avx512   │            avx512             │          avx512+extra          │
                  │     B/s      │     B/s       vs base         │     B/s       vs base          │
DigestBytes/4B      407.6Mi ± 2%   406.8Mi ± 2%        ~ (p=0.481 n=10)   649.6Mi ± 5%  +59.35% (p=0.000 n=10)
DigestBytes/16B     1.298Gi ± 3%   1.298Gi ± 1%        ~ (p=0.529 n=10)   2.044Gi ± 5%  +57.49% (p=0.000 n=10)
DigestBytes/100B    5.831Gi ± 2%   4.563Gi ± 3%  -21.75% (p=0.000 n=10)   4.586Gi ± 3%  -21.35% (p=0.000 n=10)
DigestBytes/4KB     17.54Gi ± 2%   22.22Gi ± 3%  +26.72% (p=0.000 n=10)   22.78Gi ± 3%  +29.90% (p=0.000 n=10)
DigestBytes/10MB    18.88Gi ± 2%   24.49Gi ± 2%  +29.69% (p=0.000 n=10)   24.83Gi ± 2%  +31.48% (p=0.000 n=10)
geomean             3.979Gi        4.183Gi        +5.12%                  5.074Gi       +27.51%
```
Overall LGTM. A few notes:

A) Test on Intel as well if possible. Alder Lake has terrible latency on VPMULLQ compared to AMD. Your switch-over point may be different.

B) wrt small sizes, I would do the size check in Go code and fall back for sizes <32 (or a similar threshold). That also eliminates all the initial branching: https://github.com/cespare/xxhash/pull/79/files#diff-c74c7d4d55252632fe997e1642f175859870550d550e8a96351e1782a6793bf2R82-R89

Actually, if I do the fallback to pure Go like this:

```go
func Sum64(b []byte) uint64 {
	if len(b) < 32 {
		h := prime5 + uint64(len(b))
		for ; len(b) >= 8; b = b[8:] {
			k1 := round(0, u64(b[:8]))
			h ^= k1
			h = rol27(h)*prime1 + prime4
		}
		if len(b) >= 4 {
			h ^= uint64(u32(b[:4])) * prime1
			h = rol23(h)*prime2 + prime3
			b = b[4:]
		}
		for ; len(b) > 0; b = b[1:] {
			h ^= uint64(b[0]) * prime5
			h = rol11(h) * prime1
		}
		h ^= h >> 33
		h *= prime2
		h ^= h >> 29
		h *= prime3
		h ^= h >> 32
		return h
	}
	if useAvx512 {
		return sum64Avx512(b)
	}
	return sum64Scalar(b)
}
```

...it is actually faster than using scalar asm. Not as fast as before, but the regression is certainly less. I can't test AVX-512. main / avx-512 / with the above:

```
BenchmarkSum64/4B    1391.82 MB/s | 942.91 MB/s | 1281.78 MB/s
```

Or... C) Try doing the branching in the assembly function. The branch should be fully predictable in the benchmark, so it shouldn't affect it too much. The function call is probably what makes the difference.

(I don't really like any of your other solutions - maybe except ABIInternal as a last resort.)
I'll try to make time to review this but it might be a while (pretty busy with work stuff). Just to set expectations, though:
By the way, are you making this change just for its own sake or do you have a use case where this matters a lot? If so, can you describe it? Thanks.
I am trying to improve the performance of @klauspost's zstd decompressor. zstd benchmarks are very hit and miss: there's a fair number of 1.2x improvements, but also as many 0.8x regressions; it depends a lot on the files being processed. I'll try some of @klauspost's suggestions to make it faster on small inputs.
I think I can get … One of the issues is moving data from the SIMD-integer complex to the GP-integer complex; latency on
I can add that, but it's probable the
Gotcha, thanks, that is useful motivation. Speeding up zstd is certainly an important and worthy goal.
This will effectively mean it will not be used. But it doesn't really change the fundamental problem - the regression on smaller sizes.
Blocks this small are a rare use case, and usually not a benefit in terms of compression anyway. I have forked xxhash anyway to remove the use of unsafe. It is a good call to add it here anyway to benefit all. But it does mean we can tweak the usage of AVX-512 eventually.
Yeah. Thought of that, but forgot to add it. I tend to have a … If you have extra time, one thing I saw is that you could load 64 bytes and do the … It seems like VPMULLQ can only execute on one port at a time, so even if … Since the … But… just a theory.
I tried that; it changes nothing. The critical dependency is on the state; the whole pre-state-mixing load and multiply doesn't even show up in
@Jorropo I was mainly thinking of Intel here. On AMD it can "hide" most of the (next loop) mul behind the 2-cycle latency of the add and rotate, as well as the branch check. I assume you also checked processing speed. Generally, be careful about trusting the profiles at an instruction level - the CPU often "stalls" in other places than the actual bottleneck. I often see unreasonably high load attributed to single instructions, where it is actually "collapsing" all of the unresolved OoO instructions rather than just waiting for that single instruction.
I did try that, not just looking at perf. For Intel I don't see how any of this makes sense: the scalar loop computes 4 multiplies in 6 cycles of latency (with just one multiply pipe; 4 with two). It needs to share with the first multiply, but that is still faster than one AVX-512 multiply.
Ran the benchmark on Intel. AVX512 is strictly worse :/
Ran your branch without any modification, except
It's actually terrible; I'm using the
Yeah. Zen4+ is a rather narrow target right now. The latest timings I can find on Intel are for Rocket Lake: still 15 cycles.
Here are results on two different kinds of CPU (Zen2 without AVX512 support and Zen4 with AVX512 support):
For big sizes it's either no change or way faster.
I don't like how smaller sizes are impacted. All the things I tried made it worse, because the overhead of meta-algorithm selection and function calls is very significant for these small buffers (50~25%).
The main solutions I see:
- Make the input escape to the heap.
- `GOAMD64=v4`. Not as accessible to end users; prevents testing the scalar impl when `GOAMD64=v4` is enabled (because it works by using a `const useAvx512 bool` override that can't be turned off in tests).
- `ABIInternal`. Can break in the future; not covered by the compatibility promise.
- `Sum64_13(seed uint64, a uint64, b uint32, c uint8)`. Very cumbersome to use.
¯\_(ツ)_/¯
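As a sketch of how the `GOAMD64=v4` option could be wired up: building with `GOAMD64=v4` sets the `amd64.v4` build tag, and the x86-64-v4 level guarantees AVX-512 F/BW/CD/DQ/VL, so the flag can become a compile-time constant. The file name below is illustrative, not from the PR:

```go
//go:build amd64.v4

package xxhash

// Hypothetical wiring: under GOAMD64=v4 the toolchain guarantees the
// x86-64-v4 feature set (including AVX-512 F/BW/CD/DQ/VL), so the
// flag can be a constant and the dispatch branch folds away entirely.
//
// A companion file with //go:build !amd64.v4 would instead declare
//   var useAvx512 = ...runtime CPUID detection...
// This const override is exactly why the scalar path can't be
// exercised in tests when building with GOAMD64=v4.
const useAvx512 = true
```

This illustrates the trade-off noted in the list above: the branch disappears, but so does the ability to flip the flag off in tests.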