Reimplement bitrev() using a compiler intrinsic #1010
Merged
I noticed that `bitrev()` was taking up a few percent of a profile when preparing shares from a large Prio3 instance. This PR reimplements that function using the `reverse_bits()` compiler intrinsic. On x86_64, `reverse_bits()` compiles down to a `bswap`, then a sequence of ANDs with masks, shifts, and ORs, which swap nibbles, then pairs of bits, then individual bits. On ARM architectures, it compiles to a single `rbit` instruction.
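For illustration, here is a rough Rust rendering of the mask-and-shift sequence described above; this is the classic word-level bit reversal that the compiler emits, not code from this PR:

```rust
// Rough equivalent of the x86_64 lowering of u32::reverse_bits(): a byte
// swap followed by three mask/shift/OR rounds that reverse progressively
// smaller bit groups. Illustrative only.
fn reverse_bits_by_hand(mut x: u32) -> u32 {
    x = x.swap_bytes(); // bswap: reverse the order of the four bytes
    x = ((x & 0x0F0F_0F0F) << 4) | ((x >> 4) & 0x0F0F_0F0F); // swap nibbles within each byte
    x = ((x & 0x3333_3333) << 2) | ((x >> 2) & 0x3333_3333); // swap bit pairs within each nibble
    x = ((x & 0x5555_5555) << 1) | ((x >> 1) & 0x5555_5555); // swap adjacent bits within each pair
    x
}
```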
Note that I had to add a special case for when the NTT size is 1 and `d` is 0, to avoid an overflow from a too-large shift; the sketch below shows the shape of the fix. Benchmark results from an x86_64 processor are below: I see up to a 10% improvement in polynomial multiplication itself, and 5%-8% improvements in Prio3 with medium to large circuit sizes (in both Criterion and Cachegrind).
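A minimal sketch of the approach, assuming `bitrev(d, x)` reverses the low `d` bits of `x`; this is my reconstruction, not necessarily the PR's exact code:

```rust
/// Reverse the low `d` bits of `x`. Sketch of the intrinsic-based approach,
/// assuming this signature; not necessarily the exact code in the PR.
fn bitrev(d: usize, x: usize) -> usize {
    if d == 0 {
        // Special case: shifting by usize::BITS would overflow, and an
        // NTT of size 1 needs no reordering anyway.
        0
    } else {
        // reverse_bits() reverses all bits of the word; shift the reversed
        // value down so only the low `d` bits remain significant.
        x.reverse_bits() >> (usize::BITS as usize - d)
    }
}
```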
Benchmark results
Based on the benchmark results, I also lowered the `FFT_THRESHOLD` constant, which controls when we cut over between the different polynomial multiplication implementations. Note that even before this change, previous optimizations had moved the break-even point to around 30.
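For context, here is a minimal sketch of the kind of dispatch such a threshold controls; the names, signatures, and enum here are hypothetical illustrations rather than the crate's actual API, and only the break-even value of roughly 30 comes from the discussion above:

```rust
// Hypothetical illustration of a threshold-based strategy choice; only the
// value 30 is taken from the benchmarks discussed above.
const FFT_THRESHOLD: usize = 30;

#[derive(Debug, PartialEq)]
enum MulStrategy {
    /// O(n^2) schoolbook multiplication: cheaper for short polynomials.
    Direct,
    /// O(n log n) FFT/NTT-based multiplication: cheaper past the break-even point.
    Fft,
}

fn choose_strategy(degree: usize) -> MulStrategy {
    if degree < FFT_THRESHOLD {
        MulStrategy::Direct
    } else {
        MulStrategy::Fft
    }
}
```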