Reimplement bitrev() using a compiler intrinsic #1010
Merged
I noticed that `bitrev()` was taking up a few percent of a profile when preparing shares from a large Prio3 instance. This PR reimplements that function using the `reverse_bits()` compiler intrinsic. On x86_64, `reverse_bits()` compiles down to a `bswap`, then a sequence of ANDs with masks, shifts, and ORs, which swap nibbles, then pairs of bits, then individual bits. On ARM architectures, it compiles to a single `rbit` instruction.
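For illustration, here is a rough Rust rendering of the mask-and-shift sequence described above; this is the classic word-level bit reversal that the compiler emits, not code from this PR:

```rust
// Rough equivalent of the x86_64 lowering of u32::reverse_bits(): a byte
// swap followed by three mask/shift/OR rounds that reverse progressively
// smaller bit groups. Illustrative only.
fn reverse_bits_by_hand(mut x: u32) -> u32 {
    x = x.swap_bytes(); // bswap: reverse the order of the four bytes
    x = ((x & 0x0F0F_0F0F) << 4) | ((x >> 4) & 0x0F0F_0F0F); // swap nibbles within each byte
    x = ((x & 0x3333_3333) << 2) | ((x >> 2) & 0x3333_3333); // swap bit pairs within each nibble
    x = ((x & 0x5555_5555) << 1) | ((x >> 1) & 0x5555_5555); // swap adjacent bits within each pair
    x
}
```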
Note that I had to add a special case for when the NTT size is 1 and `d` is 0, to avoid an overflow from a too-large shift; the sketch below shows the shape of the fix. Benchmark results from an x86_64 processor are below: I see up to a 10% improvement in polynomial multiplication itself, and 5%-8% improvements in Prio3 with medium to large circuit sizes (in both Criterion and Cachegrind).
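A minimal sketch of the approach, assuming `bitrev(d, x)` reverses the low `d` bits of `x`; this is my reconstruction, not necessarily the PR's exact code:

```rust
/// Reverse the low `d` bits of `x`. Sketch of the intrinsic-based approach,
/// assuming this signature; not necessarily the exact code in the PR.
fn bitrev(d: usize, x: usize) -> usize {
    if d == 0 {
        // Special case: shifting by usize::BITS would overflow, and an
        // NTT of size 1 needs no reordering anyway.
        0
    } else {
        // reverse_bits() reverses all bits of the word; shift the reversed
        // value down so only the low `d` bits remain significant.
        x.reverse_bits() >> (usize::BITS as usize - d)
    }
}
```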
Benchmark results
Based on the benchmark results, I also lowered the `FFT_THRESHOLD` constant, which controls when we cut over between the different polynomial multiplication implementations. Note that even before this change, previous optimizations had moved the break-even point to around 30.
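For context, here is a minimal sketch of the kind of dispatch such a threshold controls; the names, signatures, and enum here are hypothetical illustrations rather than the crate's actual API, and only the break-even value of roughly 30 comes from the discussion above:

```rust
// Hypothetical illustration of a threshold-based strategy choice; only the
// value 30 is taken from the benchmarks discussed above.
const FFT_THRESHOLD: usize = 30;

#[derive(Debug, PartialEq)]
enum MulStrategy {
    /// O(n^2) schoolbook multiplication: cheaper for short polynomials.
    Direct,
    /// O(n log n) FFT/NTT-based multiplication: cheaper past the break-even point.
    Fft,
}

fn choose_strategy(degree: usize) -> MulStrategy {
    if degree < FFT_THRESHOLD {
        MulStrategy::Direct
    } else {
        MulStrategy::Fft
    }
}
```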