-
Notifications
You must be signed in to change notification settings - Fork 36.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unroll the ChaCha20 inner loop for performance #24946
Conversation
Concept ACK, I see |
7f3f84c
to
4f3a189
Compare
Can you give commands to run just those benches for those willing to replicate? |
|
I'm somewhat surprised unrolling a loop that is |
Getting a rough average of 15% speedup as well |
@laanwj It may also have to do with better register scheduling when unrolling (the same variable doesn't need to stay in the same register every iteration), though I haven't investigated what the difference in emitted asm is. This change may be very compiler and platform dependent, so it may be good to know what its impact is with modern clang versions and/or on arm64 systems. |
Debian testing clang 15, normal (non-debug) build, fixed CPU speed, I'm not sure I'm seeing a difference. Trying again after optimizing and tuning further. |
Gcc 11.2.0, x86_64:
|
Restarted and tuned (i7 6500U CPU @ 2.5 GHz) with
Edit: re-ran the bench a dozen times each to verify that these results are representative. |
I'm seeing much smaller improvements (0%-2.5% with gcc 11; 1.3%-7% with clang 13) on an old i7. (And very slightly worse performance compared to master with debug enabled) Did you consider just changing the |
It's a nice speedup, and a simple change, tested ACK 4f3a189
I like this idea, more elegantly than copy/pasting it makes it immediately clear it's the same. I would guess the generated code is exactly the same. |
Not seeing a large difference on an i7. (Maybe a 1%-3% speedup?) gcc-12 Before:
gcc-12 After:
gcc-10 Before:
gcc-10 after:
|
I get the same 1-3% speedup on my i7. In my test adding Side note 1: use e.g. Side note 2: No need to quote the result, it's markdown 🙂 My results on i7-8700, with clang 13.0.1: master
branch
|
+1 for |
So I think the conclusion here is that on i7 there's no (or not much) difference but on other platforms it varies. But it never becomes worse. I think a performance optimization like this is mostly interesting for slower CPUs with less effective branch prediction so that's OK with me. |
@sipa What are your thoughts on using |
@laanwj That won't work on every compiler. I'd be ok with switching to a macro to do the 10x expansion. |
TIL that it is possible to pass multiple lines as an argument to a macro |
You clearly never saw the original serialization code this codebase had ;) |
Done, used @ajtowns's approach suggested above. |
tested ACK 81c09ee with clang++ 13.0.1, test
|
|
ACK 81c09ee 🍟 Show signatureSignature:
|
81c09ee Unroll the ChaCha20 inner loop for performance (Pieter Wuille) Pull request description: Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It's a simple change, this performance helps with RNG generation, and will matter more for BIP324. ACKs for top commit: martinus: tested ACK 81c09ee with clang++ 13.0.1, test `CHACHA20_1MB`: MarcoFalke: ACK 81c09ee 🍟 Tree-SHA512: 108bd0ba573bb08de92d611e7be7c09a2c2700f9655f44129b87f9b71f7e101dfc6bd345783e7b4b9b40f0b003913cf59187f422da8cdb5b20887f7855b2611a
Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It's a simple change, this performance helps with RNG generation, and will matter more for BIP324.