Reimplement bitrev() using a compiler intrinsic #1010

Merged
divergentdave merged 3 commits into main from david/bitrev-intrinsic
Apr 23, 2024

Conversation

divergentdave
Contributor

I noticed that bitrev() was taking up a few percent of a profile when preparing shares from a large Prio3 instance. This PR reimplements that function using the reverse_bits() compiler intrinsic. On x86_64, reverse_bits() compiles down to a bswap followed by a sequence of ANDs with masks, shifts, and ORs that swap nibbles, then pairs of bits, then individual bits. On ARM architectures, it compiles to an rbit. Note that I had to add a special case for when the NTT size is 1 and d is 0, to avoid an overflow from a too-large shift.
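
To make the idea concrete, here is a minimal sketch of a bit-reversal helper built on Rust's `reverse_bits()`. It is not the exact code in this PR: the signature `bitrev(d, x)`, the use of `u32` for `d`, and the helper names `bitrev_loop` and `reverse_bits_by_swaps` are assumptions for illustration, and the sketch puts the `d == 0` guard inside the function, which may differ from where the PR handles it. The last function spells out the byte-swap-then-mask-and-shift sequence described above for a 64-bit value.

```rust
/// Reverse the low `d` bits of `x` (sketch; assumes `d <= usize::BITS`).
fn bitrev(d: u32, x: usize) -> usize {
    // With d == 0 the shift amount would equal usize::BITS, which
    // overflows (and panics in debug builds), so handle it separately.
    if d == 0 {
        return 0;
    }
    // Reverse all bits of the word, then shift the reversed low `d` bits
    // back down to the bottom of the word.
    x.reverse_bits() >> (usize::BITS - d)
}

/// Bit-at-a-time reference version, used here only for comparison.
fn bitrev_loop(d: u32, x: usize) -> usize {
    let mut y = 0;
    for i in 0..d {
        // Move bit i of x to position d - 1 - i.
        y |= ((x >> i) & 1) << (d - 1 - i);
    }
    y
}

/// Hand-written version of the swap sequence: reverse the bytes, then
/// swap nibbles, then pairs of bits, then adjacent bits within each byte.
fn reverse_bits_by_swaps(x: u64) -> u64 {
    let x = x.swap_bytes();
    let x = ((x & 0x0F0F_0F0F_0F0F_0F0F) << 4) | ((x >> 4) & 0x0F0F_0F0F_0F0F_0F0F);
    let x = ((x & 0x3333_3333_3333_3333) << 2) | ((x >> 2) & 0x3333_3333_3333_3333);
    ((x & 0x5555_5555_5555_5555) << 1) | ((x >> 1) & 0x5555_5555_5555_5555)
}
```

For example, `bitrev(3, 0b001)` evaluates to `0b100`, and `bitrev(0, 0)` returns 0 without attempting a full-width shift.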

Benchmark results from an x86_64 processor are below. I see up to a 10% improvement in FFT-based polynomial multiplication itself, and 5%-8% improvements in Prio3 with medium to large circuit sizes (in both Criterion and Cachegrind).

Benchmark results
   Compiling prio v0.16.2 (/home/david/Code/ppm/libprio-rs)
    Finished bench [optimized] target(s) in 19.51s
     Running benches/cycle_counts.rs (target/release/deps/cycle_counts-5133e5cd3f206d1f)
prng_16
  Instructions:               21517 (No change)
  L1 Accesses:                29649 (-0.013489%)
  L2 Accesses:                   45 (+15.38462%)
  RAM Accesses:                 125 (-1.574803%)
  Estimated Cycles:           34249 (-0.128306%)

prng_256
  Instructions:              137292 (No change)
  L1 Accesses:               188741 (+0.000530%)
  L2 Accesses:                   51 (+2.000000%)
  RAM Accesses:                 196 (-1.010101%)
  Estimated Cycles:          195856 (-0.032666%)

prng_1024
  Instructions:              524154 (No change)
  L1 Accesses:               720583 (+0.000278%)
  L2 Accesses:                   81 (-1.219512%)
  RAM Accesses:                 389 (-0.256410%)
  Estimated Cycles:          734603 (-0.005173%)

prng_4096
  Instructions:             2073325 (No change)
  L1 Accesses:              2850432 (No change)
  L2 Accesses:                  144 (+1.408451%)
  RAM Accesses:                1157 (-0.172563%)
  Estimated Cycles:         2891647 (-0.002075%)

prio3_client_count
  Instructions:               64313 (+0.024885%)
  L1 Accesses:                81632 (+0.009801%)
  L2 Accesses:                   79 (+9.722222%)
  RAM Accesses:                 374 (-1.319261%)
  Estimated Cycles:           95117 (-0.138584%)

prio3_client_histogram_10
  Instructions:              241092 (-0.238344%)
  L1 Accesses:               299063 (-0.287405%)
  L2 Accesses:                  119 (+2.586207%)
  RAM Accesses:                 544 (+0.184162%)
  Estimated Cycles:          318698 (-0.254139%)

prio3_client_sum_32
  Instructions:              305448 (-0.562544%)
  L1 Accesses:               385120 (-0.501728%)
  L2 Accesses:                   98 (-3.921569%)
  RAM Accesses:                 560 (-0.884956%)
  Estimated Cycles:          405210 (-0.524614%)

prio3_client_count_vec_1000
  Instructions:            10196188 (-9.572059%)
  L1 Accesses:             12644489 (-8.968731%)
  L2 Accesses:                 4182 (-0.095557%)
  RAM Accesses:                3036 (-0.393701%)
  Estimated Cycles:        12771659 (-8.890224%)

     Running benches/speed_tests.rs (target/release/deps/speed_tests-bed856efc3eab923)
prio3count_shard        time:   [19.276 µs 19.287 µs 19.301 µs]
                        change: [-0.2032% +0.3423% +1.0869%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

prio3count_prepare_init time:   [12.983 µs 12.984 µs 12.986 µs]
                        change: [-0.0255% +0.0508% +0.1105%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

prio3sum_shard/8        time:   [54.570 µs 54.576 µs 54.584 µs]
                        change: [-0.8920% -0.4923% -0.0951%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  6 (6.00%) high severe
prio3sum_shard/32       time:   [114.73 µs 114.75 µs 114.76 µs]
                        change: [-3.7436% -3.4195% -3.1721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

prio3sum_prepare_init/8 time:   [36.645 µs 36.652 µs 36.659 µs]
                        change: [-1.4062% -1.3453% -1.2875%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  4 (4.00%) high severe
prio3sum_prepare_init/32
                        time:   [63.930 µs 63.946 µs 63.962 µs]
                        change: [-3.7192% -3.5769% -3.4687%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

prio3sumvec_shard/serial/10
                        time:   [64.299 µs 64.307 µs 64.316 µs]
                        change: [-0.1328% +0.1674% +0.4105%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
prio3sumvec_shard/serial/100
                        time:   [190.84 µs 191.07 µs 191.58 µs]
                        change: [-1.0428% -0.9358% -0.7647%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
prio3sumvec_shard/serial/1000
                        time:   [2.5150 ms 2.5154 ms 2.5158 ms]
                        change: [-6.7462% -6.7235% -6.7008%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

prio3sumvec_prepare_init/serial/10
                        time:   [47.060 µs 47.076 µs 47.098 µs]
                        change: [-0.5761% -0.4134% -0.2983%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
prio3sumvec_prepare_init/serial/100
                        time:   [97.600 µs 97.623 µs 97.649 µs]
                        change: [-2.5127% -2.4355% -2.3505%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  3 (3.00%) high severe
prio3sumvec_prepare_init/serial/1000
                        time:   [819.02 µs 819.21 µs 819.41 µs]
                        change: [-5.5935% -5.5392% -5.4859%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

prio3histogram_shard/serial/10
                        time:   [72.620 µs 72.632 µs 72.646 µs]
                        change: [-0.2647% -0.1418% -0.0611%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
prio3histogram_shard/serial/100
                        time:   [196.55 µs 196.58 µs 196.61 µs]
                        change: [-2.2466% -1.8212% -1.5246%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
prio3histogram_shard/serial/1000
                        time:   [2.5097 ms 2.5152 ms 2.5238 ms]
                        change: [-6.4880% -6.2815% -5.9588%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
prio3histogram_shard/serial/10000
                        time:   [18.602 ms 18.606 ms 18.611 ms]
                        change: [-8.1314% -7.9196% -7.7908%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
Benchmarking prio3histogram_shard/serial/100000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 15.0s. You may wish to increase target time to 27.8s, or reduce sample count to 50.
prio3histogram_shard/serial/100000
                        time:   [278.10 ms 278.21 ms 278.33 ms]
                        change: [-8.5075% -8.4507% -8.3957%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

prio3histogram_prepare_init/serial/10
                        time:   [56.264 µs 56.277 µs 56.290 µs]
                        change: [-0.3958% -0.2657% -0.1852%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
prio3histogram_prepare_init/serial/100
                        time:   [107.71 µs 107.82 µs 107.93 µs]
                        change: [-2.0976% -1.8083% -1.4433%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
prio3histogram_prepare_init/serial/1000
                        time:   [836.84 µs 837.61 µs 839.04 µs]
                        change: [-5.2016% -5.1105% -4.9614%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
prio3histogram_prepare_init/serial/10000
                        time:   [5.9908 ms 5.9930 ms 5.9952 ms]
                        change: [-5.3161% -5.2600% -5.2020%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
prio3histogram_prepare_init/serial/100000
                        time:   [82.276 ms 82.345 ms 82.432 ms]
                        change: [-7.1225% -7.0078% -6.8815%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

rand/16                 time:   [5.5245 µs 5.5253 µs 5.5267 µs]
                        change: [-1.4011% -1.2693% -1.1788%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
rand/256                time:   [29.580 µs 29.584 µs 29.587 µs]
                        change: [-0.4399% -0.0590% +0.1605%] (p = 0.81 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe
rand/1024               time:   [106.01 µs 106.04 µs 106.07 µs]
                        change: [-1.2403% -0.9535% -0.4947%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
rand/4096               time:   [412.71 µs 412.80 µs 412.90 µs]
                        change: [-1.6381% -1.3841% -1.1943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

poly_mul/fft/1          time:   [649.81 ns 650.02 ns 650.24 ns]
                        change: [+0.7407% +0.8303% +0.9173%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
poly_mul/direct/1       time:   [165.30 ns 165.35 ns 165.41 ns]
                        change: [+0.7426% +0.8280% +0.9389%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high severe
poly_mul/fft/30         time:   [21.775 µs 21.782 µs 21.789 µs]
                        change: [-8.9807% -8.8200% -8.7044%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
poly_mul/direct/30      time:   [25.078 µs 25.104 µs 25.128 µs]
                        change: [+0.5286% +0.8276% +1.2733%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
poly_mul/fft/60         time:   [49.830 µs 49.845 µs 49.861 µs]
                        change: [-8.3666% -7.9938% -7.5635%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
poly_mul/direct/60      time:   [100.17 µs 100.27 µs 100.45 µs]
                        change: [+0.4895% +0.5770% +0.6764%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
poly_mul/fft/90         time:   [111.28 µs 111.30 µs 111.33 µs]
                        change: [-10.544% -10.388% -10.279%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  2 (2.00%) high severe
poly_mul/direct/90      time:   [400.21 µs 400.26 µs 400.31 µs]
                        change: [+0.3027% +0.3377% +0.3648%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
poly_mul/fft/120        time:   [111.28 µs 111.30 µs 111.32 µs]
                        change: [-10.633% -10.225% -9.7366%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe
poly_mul/direct/120     time:   [399.55 µs 399.59 µs 399.63 µs]
                        change: [-1.4389% -0.7539% -0.1834%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
poly_mul/fft/150        time:   [247.55 µs 247.59 µs 247.63 µs]
                        change: [-10.495% -10.454% -10.403%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe
Benchmarking poly_mul/direct/150: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
poly_mul/direct/150     time:   [1.5928 ms 1.5929 ms 1.5931 ms]
                        change: [-0.4093% -0.0877% +0.0911%] (p = 0.73 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe
poly_mul/fft/255        time:   [247.58 µs 247.63 µs 247.69 µs]
                        change: [-11.234% -10.678% -10.386%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
Benchmarking poly_mul/direct/255: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
poly_mul/direct/255     time:   [1.5928 ms 1.5929 ms 1.5931 ms]
                        change: [-0.0085% +0.0561% +0.1053%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Based on the benchmark results, I also lowered the FFT_THRESHOLD constant, which sets the point at which we cut over between the direct and FFT-based polynomial multiplication implementations. Note that even before this change, previous optimizations had moved the break-even point to around 30.
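
As a rough illustration of what such a cutover looks like, the sketch below picks a multiplication strategy from an input size. Only the constant name `FFT_THRESHOLD` and the value 30 come from this PR; the `MulStrategy` enum, the `choose_strategy` function, the choice of size parameter, and the direction of the comparison at exactly 30 are assumptions and may not match the library's code.

```rust
/// Cutover point between the two implementations (30 after this PR).
const FFT_THRESHOLD: usize = 30;

/// Hypothetical label for the two multiplication implementations.
#[derive(Debug, PartialEq, Eq)]
enum MulStrategy {
    /// Schoolbook multiplication, cheaper for small inputs.
    Direct,
    /// FFT-based multiplication, cheaper for larger inputs.
    Fft,
}

/// Pick a strategy from the input size (assumed comparison: `<` is direct).
fn choose_strategy(size: usize) -> MulStrategy {
    if size < FFT_THRESHOLD {
        MulStrategy::Direct
    } else {
        MulStrategy::Fft
    }
}
```

This matches the shape of the numbers above: at size 1 the direct path takes about 165 ns against about 650 ns for the FFT path, while at size 30 the FFT path takes about 21.8 µs against about 25.1 µs for the direct path.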

@divergentdave divergentdave requested a review from a team as a code owner April 23, 2024 15:15
Contributor

@inahga inahga left a comment

Nice!

@divergentdave divergentdave merged commit a6d1c0a into main Apr 23, 2024
6 checks passed
@divergentdave divergentdave deleted the david/bitrev-intrinsic branch April 23, 2024 17:54
hannahdaviscrypto pushed a commit to hannahdaviscrypto/mastic that referenced this pull request May 2, 2024
* Add cases to polynomial multiplication benchmarks

* Implement bitrev with a compiler intrinsic

* Lower FFT_THRESHOLD to 30