speedup falcon signature verification by Rexicon226 · Pull Request #15 · algorand/falcon

Rexicon226 · 2026-02-17T07:33:02Z

Dramatically improves verification performance through vectorization: the vv column in the benchmarks below drops by roughly a third at every degree.

The benchmarks have been taken on a Ryzen 5 9600X:

New:

degree  kg(ms)   ek(us)   sd(us)  sdc(us)   st(us)  stc(us)   vv(us)  vvc(us)
 256:     1.38    15.63    51.67    57.95    31.43    37.75     3.91    10.65
 512:     3.13    31.62   102.79   113.49    61.91    72.76     7.81    19.40
1024:     9.41    64.41   206.02   226.89   123.04   143.26    15.74    37.73

Old:

degree  kg(ms)   ek(us)   sd(us)  sdc(us)   st(us)  stc(us)   vv(us)  vvc(us)
 256:     1.36    16.98    53.68    59.63    31.89    38.07     5.73    11.94
 512:     3.11    34.73   107.03   116.79    62.98    72.96    11.59    21.89
1024:     9.67    71.37   215.84   234.02   124.95   143.60    23.78    43.35

Ported from my Zig Falcon implementation: https://github.com/Rexicon226/zk/blob/f6a47de040e99d1150853522378653ae90805215/src/signatures/falcon.zig

Details in commits.

Please note that these changes only affect routines which were already variable-time, or otherwise make no difference to timing behavior.

During verification (when we use the variable-time H2P), Falcon reads from the SHAKE state in a loop until it has enough elements to fill the polynomial. Each iteration takes 2 bytes (16 bits), interprets them as a big-endian integer, and accepts the value if it is smaller than 61445 (5 × 12289), reducing it modulo 12289. The naive method used by the reference implementation extracts 2 bytes at a time and checks each sample in turn. Instead, we can trivially optimize this by extracting a number of bytes close to the SHAKE256 rate (136) in a single call, and re-using that buffer until it runs out. This gives us an approximate 10% speedup for verification (saving 1.2us on Zen 5).
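A minimal sketch of the batching idea, not the code from this PR. `toy_xof` is a hypothetical stand-in for the SHAKE256 context (a plain LCG byte stream) so that the example stays self-contained, and the names `h2p_naive` and `h2p_batched` are made up for illustration. The acceptance bound 61445 = 5 × 12289 and the reduction modulo 12289 follow the hash-to-point routine in the Falcon specification. Both functions consume the same byte stream in the same order, so they should produce identical polynomials; the batched one over-extracts, which assumes the XOF state is not reused afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q     12289u   /* Falcon modulus */
#define BOUND 61445u   /* 5 * Q: largest multiple of Q below 2^16 */
#define CHUNK 136      /* bytes per bulk extraction (SHAKE256 rate) */

/* Hypothetical stand-in for a SHAKE256 context: a deterministic byte stream. */
typedef struct { uint32_t state; } toy_xof;

static void toy_xof_extract(toy_xof *x, uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i++) {
        x->state = x->state * 1664525u + 1013904223u; /* LCG step */
        out[i] = (uint8_t)(x->state >> 24);
    }
}

/* Reference-style loop: one 2-byte extraction per candidate sample. */
static void h2p_naive(toy_xof *x, uint16_t *out, size_t n) {
    while (n > 0) {
        uint8_t buf[2];
        toy_xof_extract(x, buf, sizeof buf);
        uint32_t w = ((uint32_t)buf[0] << 8) | buf[1]; /* big-endian */
        if (w < BOUND) {
            *out++ = (uint16_t)(w % Q);
            n--;
        }
    }
}

/* Batched loop: extract CHUNK bytes at once, then walk the buffer. */
static void h2p_batched(toy_xof *x, uint16_t *out, size_t n) {
    uint8_t buf[CHUNK];
    size_t pos = CHUNK; /* forces a refill on the first iteration */
    assert(CHUNK % 2 == 0); /* samples must never straddle two chunks */
    while (n > 0) {
        if (pos == CHUNK) {
            toy_xof_extract(x, buf, CHUNK);
            pos = 0;
        }
        uint32_t w = ((uint32_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (w < BOUND) {
            *out++ = (uint16_t)(w % Q);
            n--;
        }
    }
}
```

With a real SHAKE256 context the saving would come from amortizing the per-call overhead of the extract function over 68 samples instead of one.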

CLAassistant · 2026-02-17T07:33:11Z

All committers have signed the CLA.

See the docstrings placed around this commit for a better understanding of the strategy used.

The original goal of the DIT-DIF transformation, which this implementation also uses, was to allow the core butterflies to be parallelized with SIMD. Unfortunately, this idea seems to have been mostly lost to time. By hoisting the omega (or s, as it's called in this codebase) one loop higher, we're able to perform the entire inner loop as a single chain of SIMD instructions, specifically for the cases when t = 8, t = 4, and even t = 16 for AVX512. This gives us a very large speedup, bringing the total verification time down to around 7.6us (on Zen 5) from 11.6us when combined with the other optimizations introduced in this branch.
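To illustrate the loop shape described above, here is a scalar sketch, not the code from this commit. It uses plain `%` reduction instead of the Montgomery arithmetic in the codebase, finds the root of unity by brute-force search, and all names (`butterfly_block`, `make_twiddles`, `find_psi`, and so on) are hypothetical. The point is that each call to `butterfly_block` runs over `ht` contiguous coefficients with a single twiddle `s`, which is the pattern that maps onto one broadcast plus a chain of vector multiply, add, and subtract instructions when `ht` matches the vector width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q 12289u /* Falcon modulus; Q - 1 = 2^12 * 3 */

/* a, b < Q, so the product stays below 2^28 and fits in 32 bits. */
static uint32_t mulq(uint32_t a, uint32_t b) { return (a * b) % Q; }

static uint32_t powq(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    while (e) {
        if (e & 1) r = mulq(r, b);
        b = mulq(b, b);
        e >>= 1;
    }
    return r;
}

/* Search for psi with psi^n == -1 (mod Q), i.e. a primitive 2n-th root of 1. */
static uint32_t find_psi(size_t n) {
    for (uint32_t g = 2; g < Q; g++) {
        uint32_t psi = powq(g, (Q - 1) / (uint32_t)(2 * n));
        if (powq(psi, (uint32_t)n) == Q - 1) return psi;
    }
    return 0; /* not reached for n <= 1024 */
}

static size_t bitrev(size_t x, unsigned bits) {
    size_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* tw[k] = psi^bitrev(k), the bit-reversed twiddle table. */
static void make_twiddles(uint16_t *tw, unsigned logn) {
    assert(logn >= 1 && logn <= 10);
    size_t n = (size_t)1 << logn;
    uint32_t psi = find_psi(n);
    for (size_t k = 0; k < n; k++)
        tw[k] = (uint16_t)powq(psi, (uint32_t)bitrev(k, logn));
}

/* ht contiguous butterflies sharing one twiddle s: the SIMD-friendly unit. */
static void butterfly_block(uint16_t *lo, uint16_t *hi, size_t ht, uint32_t s) {
    for (size_t j = 0; j < ht; j++) {
        uint32_t u = lo[j];
        uint32_t v = mulq(hi[j], s);
        lo[j] = (uint16_t)((u + v) % Q);
        hi[j] = (uint16_t)((u + Q - v) % Q);
    }
}

/* Forward negacyclic NTT, in place, output in bit-reversed order. */
static void ntt(uint16_t *a, const uint16_t *tw, unsigned logn) {
    assert(logn >= 1 && logn <= 10);
    size_t n = (size_t)1 << logn;
    size_t t = n;
    for (size_t m = 1; m < n; m <<= 1) {
        size_t ht = t >> 1;
        for (size_t i = 0; i < m; i++)
            butterfly_block(a + i * t, a + i * t + ht, ht, tw[m + i]);
        t = ht;
    }
}
```

In a vectorized version, the stages where `ht` is 16, 8, or 4 would replace the body of `butterfly_block` with fixed-width vector code, while the remaining stages would need their own handling.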

Rexicon226 force-pushed the speedup branch from 7e90425 to 8db8954 · February 17, 2026 08:27

Rexicon226 added 2 commits February 17, 2026 03:39

speedup pubkey decoding with SIMD

62307fb

See the docstrings placed around this commit for a better understanding of the strategy used.

Rexicon226 force-pushed the speedup branch from 8db8954 to 4f4ae0e · February 17, 2026 11:40

Rexicon226 wants to merge 3 commits into algorand:main from Rexicon226:speedup
