speedup falcon signature verification #15
Open
Rexicon226 wants to merge 3 commits into algorand:main
During verification (when we use a variable-time H2P), Falcon samples the SHAKE state in a loop until it has enough elements to fill the polynomial. The criterion for a valid element is that it is smaller than 12289. The loop samples 2 bytes (16 bits), interprets them as a big-endian integer, and then checks the criterion. The naive method used by the reference implementation extracts 2 bytes at a time and checks each sample as it goes. Instead, we can trivially optimize this by extracting a number of bytes close to the SHAKE256 rate (136 bytes) in one call and reusing that buffer until it runs out. This gives us an approximate 10% speedup for verification (saving 1.2us on Zen 5).
See the docstrings placed around this commit for a better understanding of the strategy used.
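A minimal sketch of the buffered sampling idea, not the code in this commit. The `squeeze_fn` callback, the `toy_xof` byte stream (a stand-in, not SHAKE256), and the function names are all hypothetical. The acceptance rule follows the Falcon specification as understood here: a 16-bit sample is kept when it is below 5 * 12289 = 61445 and is then reduced modulo 12289; swap the constant if the target code uses a different bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q            12289u
#define ACCEPT_BOUND 61445u /* 5 * Q; assumption taken from the Falcon spec */
#define CHUNK        136    /* SHAKE256 rate in bytes; must be even */

/* Hypothetical XOF interface: writes the next `len` bytes of the stream. */
typedef void (*squeeze_fn)(void *ctx, uint8_t *out, size_t len);

/* Stand-in byte stream for the demo only. This is NOT SHAKE256. */
typedef struct { uint32_t state; } toy_xof;

static void toy_squeeze(void *ctx, uint8_t *out, size_t len)
{
    toy_xof *x = ctx;
    for (size_t i = 0; i < len; i++) {
        x->state = x->state * 1664525u + 1013904223u;
        out[i] = (uint8_t)(x->state >> 24);
    }
}

/* Naive variant: one squeeze call per 16-bit candidate. */
static void h2p_naive(squeeze_fn squeeze, void *ctx, uint16_t *x, size_t n)
{
    while (n > 0) {
        uint8_t b[2];
        squeeze(ctx, b, 2);
        uint32_t w = ((uint32_t)b[0] << 8) | b[1]; /* big-endian */
        if (w < ACCEPT_BOUND) {
            *x++ = (uint16_t)(w % Q);
            n--;
        }
    }
}

/* Buffered variant: one squeeze call per CHUNK bytes, then reuse the buffer. */
static void h2p_buffered(squeeze_fn squeeze, void *ctx, uint16_t *x, size_t n)
{
    uint8_t buf[CHUNK];
    size_t pos = CHUNK; /* start "empty" so the first pass refills */

    assert(CHUNK % 2 == 0);
    while (n > 0) {
        if (pos == CHUNK) {
            squeeze(ctx, buf, CHUNK);
            pos = 0;
        }
        uint32_t w = ((uint32_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (w < ACCEPT_BOUND) {
            *x++ = (uint16_t)(w % Q);
            n--;
        }
    }
}
```

Both variants consume the same byte stream in the same order, so they produce identical coefficients. The buffered one may squeeze bytes it never uses, which is harmless only if the XOF state is discarded afterwards.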
The original goal of the DIT-DIF transformation, which this implementation also uses, was to allow the core butterflies to be parallelized with SIMD. Unfortunately, this idea seems to have been mostly lost to time. By having the omega (or s as it's called in this codebase) one loop higher, we're able to perform the entire inner loop as a single chain of SIMD instructions, specifically for the cases when t = 8, t = 4, and even t = 16 for AVX512. This gives us a very large speedup, bringing the total verification time down to around 7.6us (on Zen 5), from 11.6us. This is when combined with the other optimizations introduced in this branch.
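A rough portable-C sketch of the loop shape described above, not the code in this commit. The twiddle `s` is loaded once per block, so the inner loop is the same operation on contiguous lanes and maps onto SIMD. The names `ntt_layer`, `mq_mul` and the `tw` layout are hypothetical. Plain `%` stands in for the Montgomery multiplication a real implementation would use, and `t` here means the block size, which may differ from how the codebase defines it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q 12289u

/* Branch-free x + y mod Q, for inputs already in [0, Q). */
static inline uint32_t mq_add(uint32_t x, uint32_t y)
{
    uint32_t d = x + y - Q;  /* wraps "negative" when x + y < Q */
    d += Q & -(d >> 31);     /* add Q back if the top bit is set */
    return d;
}

/* Branch-free x - y mod Q, for inputs already in [0, Q). */
static inline uint32_t mq_sub(uint32_t x, uint32_t y)
{
    uint32_t d = x - y;
    d += Q & -(d >> 31);
    return d;
}

/* Simplification: plain modular multiply instead of Montgomery form. */
static inline uint32_t mq_mul(uint32_t x, uint32_t y)
{
    return (x * y) % Q;      /* x, y < Q, so the product fits in 32 bits */
}

/*
 * One butterfly layer over n coefficients, processed in blocks of t.
 * tw[i] is the twiddle for block i (hypothetical table layout).
 */
static void ntt_layer(uint16_t *a, size_t n, size_t t, const uint16_t *tw)
{
    size_t ht = t >> 1;

    assert(t >= 2 && n % t == 0);
    for (size_t i = 0, j1 = 0; j1 < n; i++, j1 += t) {
        uint32_t s = tw[i];  /* hoisted: one load, broadcast to every lane */

        /* ht independent butterflies on contiguous data: SIMD friendly. */
        for (size_t j = 0; j < ht; j++) {
            uint32_t u = a[j1 + j];
            uint32_t v = mq_mul(a[j1 + j + ht], s);
            a[j1 + j]      = (uint16_t)mq_add(u, v);
            a[j1 + j + ht] = (uint16_t)mq_sub(u, v);
        }
    }
}
```

With the twiddle fixed for the whole inner loop, a vectorized version can replace the `j` loop with one broadcast, one multiply, one add and one subtract per vector, which is the "single chain of SIMD instructions" the commit message refers to.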
Dramatically improves performance through vectorization.
The benchmarks have been taken on a Ryzen 5 9600X:
New:
Old:
Ported from my Zig Falcon implementation: https://github.com/Rexicon226/zk/blob/f6a47de040e99d1150853522378653ae90805215/src/signatures/falcon.zig
Details in commits.
Please note that these changes only affect routines that were already variable-time; everywhere else they make no difference to timing behavior.