speedup falcon signature verification #15

Open
Rexicon226 wants to merge 3 commits into algorand:main from Rexicon226:speedup

Rexicon226 commented Feb 17, 2026

Dramatically improves performance through vectorization.

The benchmarks have been taken on a Ryzen 5 9600X:

New:

degree  kg(ms)   ek(us)   sd(us)  sdc(us)   st(us)  stc(us)   vv(us)  vvc(us)
 256:     1.38    15.63    51.67    57.95    31.43    37.75     3.91    10.65
 512:     3.13    31.62   102.79   113.49    61.91    72.76     7.81    19.40
1024:     9.41    64.41   206.02   226.89   123.04   143.26    15.74    37.73

Old:

degree  kg(ms)   ek(us)   sd(us)  sdc(us)   st(us)  stc(us)   vv(us)  vvc(us)
 256:     1.36    16.98    53.68    59.63    31.89    38.07     5.73    11.94
 512:     3.11    34.73   107.03   116.79    62.98    72.96    11.59    21.89
1024:     9.67    71.37   215.84   234.02   124.95   143.60    23.78    43.35

Ported from my Zig Falcon implementation: https://github.com/Rexicon226/zk/blob/f6a47de040e99d1150853522378653ae90805215/src/signatures/falcon.zig

Details in commits.

Please note that these changes only affect routines that were already variable-time; everywhere else they make no difference.

During verification (when we use a variable-time H2P), Falcon samples the SHAKE
state in a loop until it has enough elements to fill the polynomial. Each
iteration samples 2 bytes (16 bits), interprets them as a big-endian integer,
and accepts the value only if it passes a bound check tied to the modulus
q (12289); the reference implementation accepts values below 61445 (5q) and
reduces them modulo q.

The naive method used by the reference implementation extracts 2 bytes
at a time and checks each pair. Instead, we can trivially optimize this by
extracting a block of bytes close to the SHAKE256 rate (136 bytes) in one
call and consuming pairs from that buffer until it runs out.

This gives us an approximately 10% speedup for verification (saving 1.2us on Zen 5).
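
The buffering idea in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: `toy_xof` is a hypothetical, non-cryptographic byte stream standing in for the SHAKE256 context so the sketch runs without a Keccak library, `BLOCK` is assumed to be exactly 136, and the acceptance bound 5q = 61445 is the one used by the reference implementation. The helper `h2p_variants_agree` is also hypothetical; it only checks that both loops yield the same coefficients from the same stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define Q 12289u
#define BLOCK 136 /* assumed buffer size: the SHAKE256 rate in bytes (even) */

/* Hypothetical stand-in for a SHAKE256 context: a small LCG byte stream.
   NOT cryptographic; it only makes the sketch self-contained. */
typedef struct { uint32_t state; } toy_xof;

static void toy_xof_extract(toy_xof *x, uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i++) {
        x->state = x->state * 1664525u + 1013904223u;
        out[i] = (uint8_t)(x->state >> 24);
    }
}

/* Reference-style loop: one 2-byte extraction per candidate value. */
static void h2p_naive(toy_xof *x, uint16_t *out, size_t n) {
    while (n > 0) {
        uint8_t buf[2];
        toy_xof_extract(x, buf, sizeof buf);
        uint32_t w = ((uint32_t)buf[0] << 8) | buf[1]; /* big-endian */
        if (w < 5u * Q) {           /* rejection bound 61445 */
            *out++ = (uint16_t)(w % Q);
            n--;
        }
    }
}

/* Buffered loop: extract BLOCK bytes per call, then consume byte pairs. */
static void h2p_buffered(toy_xof *x, uint16_t *out, size_t n) {
    uint8_t buf[BLOCK];
    size_t pos = BLOCK;             /* forces a refill on the first pass */
    while (n > 0) {
        if (pos == BLOCK) {
            toy_xof_extract(x, buf, BLOCK);
            pos = 0;
        }
        uint32_t w = ((uint32_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (w < 5u * Q) {
            *out++ = (uint16_t)(w % Q);
            n--;
        }
    }
}

/* Returns 1 when both loops produce identical, fully reduced output. */
static int h2p_variants_agree(unsigned logn, uint32_t seed) {
    uint16_t a[1024], b[1024];
    assert(logn >= 1 && logn <= 10);
    size_t n = (size_t)1 << logn;
    toy_xof xa = { seed }, xb = { seed };
    h2p_naive(&xa, a, n);
    h2p_buffered(&xb, b, n);
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= Q) return 0;
    }
    return memcmp(a, b, n * sizeof a[0]) == 0;
}
```

Both loops read the same byte stream in the same order, so the output is unchanged; the buffered variant simply makes far fewer extraction calls. It may pull more bytes from the stream than it ends up using, which only matters if the context is read again afterwards.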

CLAassistant commented Feb 17, 2026

CLA assistant check
All committers have signed the CLA.

See the docstrings placed around this commit for a better
understanding of the strategy used.
The original goal of the DIT-DIF transformation, which
this implementation also uses, was to allow the core
butterflies to be parallelized with SIMD. Unfortunately, this
idea seems to have been mostly lost to time.

By hoisting the omega (or s, as it's called in this codebase)
one loop level higher, we're able to perform the entire inner loop
as a single chain of SIMD instructions, specifically for the
cases where t = 8, t = 4, and even t = 16 with AVX-512.

This gives us a very large speedup, bringing the total verification
time down to around 7.6us (on Zen 5), from 11.6us. That figure
includes the other optimizations introduced in this branch.
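
To make the loop structure concrete, here is a scalar sketch of one butterfly block in which the twiddle is fixed for the whole inner loop. It is an illustration under stated assumptions, not the code from this branch: it uses plain `%` reduction rather than the Montgomery arithmetic a real implementation would likely use, and the names `ct_block`, `gs_block`, and `butterfly_roundtrip_ok` are hypothetical. Because `s` does not change inside the loop, each iteration is independent, which is the property that lets a block of 4, 8, or 16 butterflies map onto one vector register with the twiddle broadcast across lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q 12289u

/* Modular helpers; all inputs are assumed to be already reduced (< Q). */
static uint32_t mq_add(uint32_t a, uint32_t b) {
    uint32_t d = a + b;
    return d >= Q ? d - Q : d;
}
static uint32_t mq_sub(uint32_t a, uint32_t b) {
    return a >= b ? a - b : a + Q - b;
}
static uint32_t mq_mul(uint32_t a, uint32_t b) {
    return (a * b) % Q; /* product < 2^28, so no 32-bit overflow */
}

/* Forward (Cooley-Tukey style) block: ht butterflies sharing twiddle s.
   The loop body has no cross-iteration dependency. */
static void ct_block(uint16_t *lo, uint16_t *hi, size_t ht, uint32_t s) {
    for (size_t j = 0; j < ht; j++) {
        uint32_t u = lo[j];
        uint32_t v = mq_mul(hi[j], s);
        lo[j] = (uint16_t)mq_add(u, v);
        hi[j] = (uint16_t)mq_sub(u, v);
    }
}

/* Inverse (Gentleman-Sande style) block using the inverse twiddle. */
static void gs_block(uint16_t *lo, uint16_t *hi, size_t ht, uint32_t s_inv) {
    for (size_t j = 0; j < ht; j++) {
        uint32_t u = lo[j];
        uint32_t v = hi[j];
        lo[j] = (uint16_t)mq_add(u, v);
        hi[j] = (uint16_t)mq_mul(mq_sub(u, v), s_inv);
    }
}

/* Forward then inverse block must return 2*x for every input x. */
static int butterfly_roundtrip_ok(size_t ht, uint32_t s, uint32_t s_inv) {
    uint16_t a[32], orig[32];
    assert(ht >= 1 && ht <= 16);
    assert(s < Q && s_inv < Q && mq_mul(s, s_inv) == 1);
    for (size_t i = 0; i < 2 * ht; i++) {
        a[i] = orig[i] = (uint16_t)((i * 1000u + 7u) % Q);
    }
    ct_block(a, a + ht, ht, s);
    gs_block(a, a + ht, ht, s_inv);
    for (size_t i = 0; i < 2 * ht; i++) {
        if (a[i] != mq_add(orig[i], orig[i])) return 0;
    }
    return 1;
}
```

For example, `butterfly_roundtrip_ok(8, 2u, 6145u)` exercises a block of eight butterflies. The twiddle 2 (with inverse 6145 modulo 12289) is an arbitrary test value, not an entry from a real root-of-unity table.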