
Pippenger multiscalar multiplication algorithm for very large inputs #129

Closed · wants to merge 21 commits

Conversation
Conversation

@oleganza commented Apr 9, 2018

⚠️ REPLACED BY #249 ⚠️

The currently supported Straus algorithm stops improving as input size grows, saturating at ≈4x over naïve multiplication. Pippenger’s algorithm takes advantage of a very large number of points (>190) by avoiding premultiplication of points and instead placing points in buckets indexed by the multiplication factor. The buckets are then added up in a clever order so that their multipliers are applied automatically. The process is repeated for each "digit" (6 to 15 bits wide, depending on the number of input points). As a result, the cost of multiplication grows more slowly than linearly with the input size. For 1024 points Pippenger is >60% faster than Straus.
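
For intuition, here is a minimal sketch of the bucket-filling step for a single window (Rust; `fill_buckets`, its signature, and the bucket count are illustrative, not this patch's actual internals):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::traits::Identity;

/// Fill the buckets for one window of digit width `w`. `digits[i]` is
/// scalar i's signed radix-2^w digit for this window, in [-2^(w-1), 2^(w-1)].
fn fill_buckets(points: &[EdwardsPoint], digits: &[i16], w: usize) -> Vec<EdwardsPoint> {
    let mut buckets = vec![EdwardsPoint::identity(); 1 << (w - 1)];
    // Place each point into the bucket indexed by its digit's magnitude,
    // negating on the fly for negative digits. Digit 0 contributes nothing,
    // so no bucket is reserved for it.
    for (point, &digit) in points.iter().zip(digits.iter()) {
        if digit > 0 {
            buckets[(digit as usize) - 1] += point;
        } else if digit < 0 {
            buckets[((-digit) as usize) - 1] -= point;
        }
    }
    // Buckets are then combined so that bucket j is implicitly scaled by
    // j+1 -- the running-sum trick, sketched later in this thread.
    buckets
}
```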

This patch adds:

  1. An implementation of VartimeMultiscalarMul using the Pippenger algorithm.
  2. A dynamic switch from Straus to Pippenger (at 190 points), based on the input's size_hint() (sketched below).
  3. A new DigitsBatch API for enumerating the digits of a batch of scalars in an arbitrary radix. Digits are produced on the fly via an iterator API, without pre-allocation: this allows us to use wider digits (i16) for more efficient computation over 10K points.
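
A rough sketch of the dispatch in item 2 (simplified bounds; the real trait is generic over `Borrow<Scalar>`/`Borrow<EdwardsPoint>`, and `Straus`/`Pippenger` are the two internal strategies):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;

fn vartime_multiscalar_mul<I, J>(scalars: I, points: J) -> EdwardsPoint
where
    I: IntoIterator<Item = Scalar>,
    J: IntoIterator<Item = EdwardsPoint>,
{
    let scalars = scalars.into_iter();
    // For slice and Vec iterators the lower bound of size_hint() is exact.
    let (size, _) = scalars.size_hint();
    if size < 190 {
        Straus::vartime_multiscalar_mul(scalars, points)
    } else {
        Pippenger::vartime_multiscalar_mul(scalars, points)
    }
}
```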

TODO

  • There is no constant-time version of Pippenger. It should be fairly straightforward to add (there is only one conditional point negation that depends on the value of a scalar), but I need some help figuring this out.
  • There is no ClearOnDrop support yet.

This addresses #130.

@oleganza commented Apr 24, 2018

Some notes on parallelizing the current implementation:

  1. Each scalar is represented in a fixed radix. This means we can have 256/w parallel threads, one per window, adding points into buckets. Once all buckets are summed up, we join the threads and serially add up the per-window sums from the high window to the low one (w doublings plus one addition per window). For a low number of processors (e.g. a CPU) this could be enough parallelization, and the points being (re)added are all read-only and shared among threads in the cache. For w=5..15 this gives 52..18 threads, which is enough to saturate the CPU. (See the sketch after this list.)
  2. In each window, points are (re)added into 2^w/2-1 buckets, based on the digit value. We can have that many threads, one per bucket, performing these readditions in parallel using very little data. Together with the parallelization in (1), this gives approximately 256*(2^w/2-1)/w threads. For w=8..10 (1K-10K points) this is 4000-13000 threads total.
  3. Adding up the buckets can also be parallelized, using a formula that is slightly less optimal for serial execution but amenable to parallelization. E.g. we can divide the triangle A+B+B+C+C+C+D+D+D+D+... into sub-triangles and large squares, trading some extra doublings for fewer additions. However, since (1) and (2) already provide enough parallelism, this may not be worthwhile.
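
As a concrete, hypothetical illustration of item 1, with rayon standing in for whatever threading layer would be used (this PR contains no threading code; `window_sum`, `windows_count`, `w`, `points`, and `digits` are assumed names):

```rust
use rayon::prelude::*;

// One task per window. `window_sum(i, ...)` stands for the serial
// fill-buckets-then-combine work for window i; it only reads the shared
// points and digits.
let window_sums: Vec<EdwardsPoint> = (0..windows_count)
    .into_par_iter()
    .map(|i| window_sum(i, &points, &digits))
    .collect();

// Join, then fold serially from the high window down, with w doublings
// between windows.
let mut result = EdwardsPoint::identity();
for s in window_sums.iter().rev() {
    for _ in 0..w {
        result = &result + &result; // a doubling, via plain point addition
    }
    result = &result + s;
}
```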

@oleganza (author):

Similarly for Straus:

  1. Premultiplication of the dynamic bases can be parallelized (1 thread per point).
  2. If we use a fixed-window scalar representation, we can have 1 thread per window and add up the premultiplied points in parallel. This is similar to Pippenger, but without buckets, since the points are already premultiplied. The resulting sums are added serially, with w doublings applied in between.

Straus has enough parallelism for a CPU, but not for a GPU. That's not an issue, though, since Straus is less effective than Pippenger at large input sizes anyway.

@oleganza (author):

Replaced the branch with a rebase done by @hdevalence.

// Step 4. Iterate over pairs of (point, scalar)
// and add/sub the point to the corresponding bucket.
// Note: when we add support for precomputed lookup tables,
// we'll be adding/subtracting the point premultiplied by `digits[i]` to buckets[0].
@oleganza (author):

Note 2: adding precomputed points to buckets[0] won't really help us with performance if we have at least one non-precomputed point. This is because we'd have to add up all buckets anyway.

1218, 1448, 1722, 2048,
2435, 2896, 3444, 4096,
4871, 5793, 6889, 8192
];
Contributor:

Where do these sizes come from? Is there a reason not to use powers-of-two?

@oleganza (author):

These are powers of two, with 3 values in between each pair of powers to give more fine-grained benchmarks. I used them at one point to help find the thresholds, but maybe we can simplify this now.

.collect();
b.iter(|| edwards::vartime::multiscalar_mul_tuning(&scalars, &points, threshold));
},
sizes,
Contributor:

I guess this code should probably be removed once we settle on the final size tradeoff?

impl VartimeMultiscalarMul for Pippenger {
type Point = EdwardsPoint;

fn vartime_multiscalar_mul<I, J>(scalars: I, points: J) -> EdwardsPoint
Contributor:

(Skipping review of this for now since it's similar to the affine case)

src/lib.rs Outdated
@@ -20,7 +20,7 @@
//
// This means that missing docs will still fail CI, but means we can use
// README.md as the crate documentation.
#![cfg_attr(feature = "nightly", deny(missing_docs))]
#![cfg_attr(all(feature = "nightly"), deny(missing_docs))]
Contributor:

This all seems unnecessary?

@oleganza (author):

Yeah, it's a leftover from some temporary flags that I used.

src/scalar.rs Outdated
use std::vec::Vec;

#[cfg(feature = "alloc")]
use alloc::Vec;
Contributor:

I'm not sure it's necessary to import these explicitly.

src/scalar.rs Outdated
/// with \\(-2\^r/2 \leq a_i < 2\^r/2\\) for \\(0 \leq i < (n-1)\\) and \\(-2\^r/2 \leq a_{n-1} \leq 2\^r/2\\).
/// Radix power \\(r\\) can be between 1 and 15.
#[cfg(any(feature = "alloc", feature = "std"))]
pub(crate) fn to_arbitrary_radix(&self, mut radix_power: usize) -> Vec<i16> {
Contributor:

Returning a Vec<i16> means we're doing a separate allocation for every single scalar.

Since the digits can be computed in least-significant to most-significant order without backtracking, I wonder whether it makes sense for this function to return an iterator of the digits. This way the caller code can decide how to allocate space for the results.

Ideally this would allow getting rid of the to_radix_16 method entirely, but I think that this isn't easy to do, since it's not possible to .collect into an array. (We need to store those digits without heap allocations, which means using an array. I guess we could do it manually...)
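
To illustrate the suggestion, a lazy generator could look roughly like this (a hedged sketch under the assumptions stated in the comments; `signed_digits` is not this PR's API):

```rust
/// Lazily yield the signed radix-2^w digits of a little-endian 256-bit
/// scalar, lo -> hi, tracking the carry internally so the caller decides
/// how (and whether) to allocate. Assumes 4 <= w <= 8 so all intermediate
/// values fit in i16.
fn signed_digits(bytes: &[u8; 32], w: usize) -> impl Iterator<Item = i16> + '_ {
    debug_assert!((4..=8).contains(&w));
    let half = 1i16 << (w - 1);
    let digits_count = (256 + w - 1) / w;
    let mut carry = 0i16;
    (0..digits_count).map(move |i| {
        // Gather the i-th w-bit window; it may straddle byte boundaries.
        let mut window = 0i16;
        for bit in 0..w {
            let pos = i * w + bit;
            if pos < 256 {
                window |= (((bytes[pos / 8] >> (pos % 8)) & 1) as i16) << bit;
            }
        }
        let coef = carry + window;
        if i == digits_count - 1 {
            // The top digit absorbs the final carry; it alone may equal
            // +2^(w-1) (cf. the doc comment reviewed further down).
            coef
        } else {
            // Recenter into [-2^(w-1), 2^(w-1)), pushing any excess into
            // the next digit through the carry.
            carry = (coef + half) >> w;
            coef - (carry << w)
        }
    })
}
```

A caller that wants heap allocation can still .collect() into a Vec; one that doesn't can fill a fixed-size array instead, as discussed later in the thread.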

src/scalar.rs Outdated
debug_assert!(self[31] <= 127);

// Ensure that radix is in [2^1, 2^15]
radix_power = if radix_power > 15 { 15 } else { radix_power };
Contributor:

These should be debug_asserts rather than silently changing bad parameters, IMO.

@oleganza (author):

Yeah, it makes even more sense if we do not expose these through the API.

@oleganza (author):

Done.

J::Item: Borrow<EdwardsPoint>,
{
let mut scalars = scalars.into_iter();
let size = scalars.by_ref().size_hint().0;
Contributor:

It would be great to have a comment explaining where the window size boundaries come from (I guess from your cost estimation script?)

@oleganza (author):

Nope, the estimation script only gives a rough idea. I was manually tweaking the test to fine-tune where the threshold should be. The script doesn't account for real code overhead, so sometimes it's off by 10-20%.

7
} else if size < 3_000 {
8
} else if size < 6_000 {
Contributor:

How much of a performance degradation is there by capping the window size at 8? It means that the scalar digits can fit in i8 rather than i16, so they will take half the space in cache. For instance, for 2048 scalars, we have 256/8 = 32 digits each for 64K digits total, so using 1 byte per digit rather than 2 would save 64KB of cache.

On the other hand, if we arrange the scalar digits properly, we may be able to achieve linear access, in which case this might not matter.

@oleganza (author):

According to an estimate from the provided Ruby script, the difference is around 5-7% when doubling the points from 3K to 6K. It grows to a 25% and 35% difference at 30K and 300K points respectively.

@oleganza (author):

Also, interestingly, if we cap the supported window at max(w)=8, then we only need w ∈ {6,7,8}, because w < 6 is optimal only for n < 190, where Straus is faster anyway. I wonder if having only three possible window sizes is beneficial in some sense?

// Iterate the digits from last to first so we can shift the result by w bits
// at the end of each iteration (e.g. w doublings).
let mut result = EdwardsPoint::identity();
for i in (0..windows_count).rev() {
Contributor:

Is windows_count supposed to be the number of digits in each scalar, so that (0..windows_count).rev() iterates over the digits in reverse order?

@oleganza (author):

Yeah. Window <=> digit.

// C B
// C B A Sum = C + (C+B) + (C+B+A)
let mut buckets_intermediate_sum = buckets[buckets_count - 1].clone();
let mut buckets_sum = buckets[buckets_count - 1].clone();
Contributor:

It would be good to allocate the scratch space for this up-front, rather than doing two allocations per loop iteration.

edit: I misread this code and thought you were cloning the vectors. Since points are Copy, I think it would be better just to use normal assignment for copyable stack data.

@oleganza (author):

Good point, fixed in 67a6a97.
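
For reference, the trick in the fragment above written out as a sketch (variable names follow the fragment; plain assignment suffices because EdwardsPoint is Copy, per the comment above):

```rust
// For buckets A, B, C holding multiplier weights 1, 2, 3:
//   intermediate: C, C+B, C+B+A
//   sum:          C + (C+B) + (C+B+A) = A + 2B + 3C
let mut intermediate = buckets[buckets_count - 1];
let mut sum = buckets[buckets_count - 1];
for j in (0..buckets_count - 1).rev() {
    intermediate += &buckets[j];
    sum += &intermediate;
}
```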

hdevalence and others added 9 commits July 13, 2018 14:51
These traits have the same interface, but with different names, so that it's
not possible to use them interchangeably.  (Constant-time and variable-time
routines should not be used interchangeably).

This commit changes the external API to use these traits, replacing
```
edwards::multiscalar_mul
edwards::vartime::multiscalar_mul
```
with
```
EdwardsPoint::multiscalar_mul (as an impl)
EdwardsPoint::vartime_multiscalar_mul (as an impl)
```
and similarly for Ristretto.

Refactoring the backend is for a later commit.

Multiscalar multiplication with precomputation is for a later commit.

The `edwards::vartime` module is retained since it's used for
`vartime_double_base_scalar_mul`.

It should be subsumed into the precomputation API in a later commit.
@oleganza commented Aug 2, 2018

Re: #129 (comment)

> Returning a Vec<i16> means we're doing a separate allocation for every single scalar.
>
> Since the digits can be computed in least-significant to most-significant order without backtracking, I wonder whether it makes sense for this function to return an iterator of the digits. This way the caller code can decide how to allocate space for the results.
>
> Ideally this would allow getting rid of the to_radix_16 method entirely, but I think that this isn't easy to do, since it's not possible to .collect into an array. (We need to store those digits without heap allocations, which means using an array. I guess we could do it manually...)

Unfortunately, we need to iterate over the digits in reverse order (from most significant to least significant), but we currently create them in the natural order (lo->hi) to keep track of the carry.

EDIT: we can scan digits lo->hi and generate them on the fly if we delay the final summation by remembering accumulated points "per column" (per digit position). Then we efficiently add these accumulated points hi->lo as we do today, at the expense of additional memory for 30-40 points. See #129 (comment)

An alternative is a static array for the digits, big enough to fit scalars in all supported representations. The smallest digit size is 6 (widths below 6 are optimal only for input sizes < 190, where Straus is more efficient, so we can safely ignore widths 1..5), which requires 43 slots for digits. Larger digits require less space, so we can leave the rest of the array zeroed.

If we allow digits up to 2^15-1, then we need [i16; 43], which makes the array take 86 bytes. If we allow digits up to 2^8-1, then we need [i8; 43], which is 43 bytes, only 34% more than the raw 32-byte scalar. Using i8s will force us to cap the signed digit width, making it suboptimal for inputs larger than 3K points. My estimate is that on 10K points the overhead would be 5%, on 30K — 25%, and on 300K — 30-35%.
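
A sketch of the fixed-array variant, reusing the hypothetical `signed_digits` generator from the review thread above:

```rust
// 43 slots covers the smallest supported width (w = 6 gives
// ceil(256/6) = 43 digits); wider digits fill fewer slots and the
// zeroed tail is harmless.
let mut digits = [0i16; 43];
for (slot, digit) in digits.iter_mut().zip(signed_digits(scalar.as_bytes(), w)) {
    *slot = digit;
}
```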

@oleganza commented Aug 2, 2018

How reasonable are inputs of over 10K points (where availability of digits wider than 8 bits makes a difference)?

Case 1: Bulletproof range proofs use 17 dynamic points and ≈130 static points each. We would need to batch-verify 588 range proofs to use Straus for the 130 shared static points and Pippenger for the remaining ≈10K dynamic points.

Case 2: An aggregated range proof with 74 participants also requires a ≈10K-point multiscalar multiplication.

Case 3: 10K Schnorr signatures verified in a batch.

Finally, batch verification of 100 transactions, each with a 64-party aggregated range proof, would require a multiplication of over 800K points, where the speedup from wider digits could be as large as 2x.

@oleganza oleganza closed this Aug 2, 2018
@oleganza oleganza reopened this Aug 2, 2018
@oleganza commented Aug 3, 2018

Correction: we don’t need to scan digits hi->lo if we delay adding and shifting the intermediate results. We’d only need additional memory for 30-40 points (one per digit position), which are processed hi->lo after the main Pippenger logic finishes. (See the sketch below.)
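
A sketch of this correction (illustrative names; `signed_digits` and `combine_buckets` — the running-sum trick — are the hypothetical helpers from earlier sketches):

```rust
// One lazy digit iterator per scalar, as in the DigitsBatch idea.
let mut digit_iters: Vec<_> = scalars
    .iter()
    .map(|s| signed_digits(s.as_bytes(), w))
    .collect();

let mut column_sums = Vec::with_capacity(windows_count); // the extra 30-40 points
for _ in 0..windows_count {
    let mut buckets = vec![EdwardsPoint::identity(); 1 << (w - 1)];
    // Pull exactly one digit per scalar: this consumes digits lo->hi,
    // matching the order in which the carry-based generator produces them.
    for (point, digits) in points.iter().zip(digit_iters.iter_mut()) {
        let digit = digits.next().unwrap();
        if digit > 0 {
            buckets[(digit as usize) - 1] += point;
        } else if digit < 0 {
            buckets[((-digit) as usize) - 1] -= point;
        }
    }
    column_sums.push(combine_buckets(&buckets));
}
// column_sums is then folded hi->lo with w doublings per column,
// exactly as in the main loop today.
```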

@oleganza oleganza changed the title [WIP] Add Pippenger multiscalar multiplication algorithm for larger input vectors Add Pippenger multiscalar multiplication algorithm for larger input vectors Aug 3, 2018
@oleganza oleganza changed the title Add Pippenger multiscalar multiplication algorithm for larger input vectors Pippenger multiscalar multiplication algorithm for larger input vectors Aug 3, 2018
@oleganza oleganza changed the title Pippenger multiscalar multiplication algorithm for larger input vectors Pippenger multiscalar multiplication algorithm for very large inputs Aug 3, 2018
/// 1. Prepare `2^(w-1) - 1` buckets with indices `[1..2^(w-1))` initialized with identity points.
/// Bucket 0 is not needed as it would contain points multiplied by 0.
/// 2. Convert scalars to a radix-`2^w` representation with signed digits in `[-2^w/2, 2^w/2]`.
/// Note: all but last digit will never be equal `2^w/2`, but that's irrelevant to us here.

Reviewer:

I think the "be" is a typo, or it should be "equal to". (Sorry for the minor nitpick)

@oleganza (author):

Yeah, that's my favorite mistake (I'm not a native English speaker), thanks for pointing it out!

@oleganza (author):

Fixed in 5051777

@hdevalence hdevalence added this to the 1.1 milestone Dec 29, 2018
@hdevalence hdevalence modified the milestones: 1.1, 1.2 Jan 20, 2019
@oleganza (author):

Holy shit, this PR is one year old.

@oleganza commented Apr 15, 2019

Going to try converting scalars to i8 digits, per Henry's suggestion above. #129 (comment)

@hdevalence (Contributor):

Cool! I can help with rebasing if you'd like. Since then I've restructured the source tree so that there are serial and vector backends, each with a scalar_mul module (which is where this code should go).

@hdevalence (Contributor):

@oleganza Can I help revitalize this branch?

@oleganza (author):

⚠️ Replacing this PR by #249

@hdevalence (Contributor):

Closing in favor of #249.
