
Pippenger multiscalar multiplication algorithm for very large inputs #129

Closed · wants to merge 21 commits

Conversation
Conversation

@oleganza commented Apr 9, 2018

⚠️ REPLACED BY #249 ⚠️

The currently supported Straus algorithm stops improving as input size grows, saturating at ≈4x over naïve multiplication. Pippenger’s algorithm takes advantage of a very large number of points (>190) by avoiding premultiplication of points and instead placing points in buckets indexed by the multiplication factor. The buckets are then added up in a clever order so that their multipliers are applied automatically. The process is repeated for each "digit" (6 to 15 bits wide, depending on the number of input points). As a result, the cost of multiplication grows more slowly than linearly with the input size. For 1024 points Pippenger is >60% faster than Straus.
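
For intuition, here is a minimal sketch of the bucket-filling step for a single window (Rust; `fill_buckets`, its signature, and the bucket count are illustrative, not this patch's actual internals):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::traits::Identity;

/// Fill the buckets for one window of digit width `w`. `digits[i]` is
/// scalar i's signed radix-2^w digit for this window, in [-2^(w-1), 2^(w-1)].
fn fill_buckets(points: &[EdwardsPoint], digits: &[i16], w: usize) -> Vec<EdwardsPoint> {
    let mut buckets = vec![EdwardsPoint::identity(); 1 << (w - 1)];
    // Place each point into the bucket indexed by its digit's magnitude,
    // negating on the fly for negative digits. Digit 0 contributes nothing,
    // so no bucket is reserved for it.
    for (point, &digit) in points.iter().zip(digits.iter()) {
        if digit > 0 {
            buckets[(digit as usize) - 1] += point;
        } else if digit < 0 {
            buckets[((-digit) as usize) - 1] -= point;
        }
    }
    // Buckets are then combined so that bucket j is implicitly scaled by
    // j+1 -- the running-sum trick, sketched later in this thread.
    buckets
}
```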

This patch adds:

  1. An implementation of VartimeMultiscalarMul using the Pippenger algorithm.
  2. A dynamic switch from Straus to Pippenger (at 190 points), based on the input's size_hint() (sketched below).
  3. A new DigitsBatch API for enumerating the digits of a batch of scalars in an arbitrary radix. Digits are produced on the fly via an iterator API, without pre-allocation: this allows us to use wider digits (i16) for more efficient computation over 10K points.
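
A rough sketch of the dispatch in item 2 (simplified bounds; the real trait is generic over `Borrow<Scalar>`/`Borrow<EdwardsPoint>`, and `Straus`/`Pippenger` are the two internal strategies):

```rust
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;

fn vartime_multiscalar_mul<I, J>(scalars: I, points: J) -> EdwardsPoint
where
    I: IntoIterator<Item = Scalar>,
    J: IntoIterator<Item = EdwardsPoint>,
{
    let scalars = scalars.into_iter();
    // For slice and Vec iterators the lower bound of size_hint() is exact.
    let (size, _) = scalars.size_hint();
    if size < 190 {
        Straus::vartime_multiscalar_mul(scalars, points)
    } else {
        Pippenger::vartime_multiscalar_mul(scalars, points)
    }
}
```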

TODO

  • There is no constant-time version of Pippenger. It should be fairly straightforward to add (there is only one conditional point negation that depends on the value of a scalar), but I need some help figuring this out.
  • There is no ClearOnDrop support yet.

This addresses #130.

@oleganza commented Apr 24, 2018

Some notes on parallelizing the current implementation:

  1. Each scalar is represented in a fixed radix. This means we can have 256/w parallel threads, one per window, adding points into buckets. Once all buckets are summed up, we join the threads and serially add up the per-window sums from the high window to the low one (w doublings plus one addition per window). For a low number of processors (e.g. a CPU) this could be enough parallelization, and the points being (re)added are all read-only and shared among threads in the cache. For w=5..15 this gives 52..18 threads, which is enough to saturate the CPU. (See the sketch after this list.)
  2. In each window, points are (re)added into 2^w/2-1 buckets, based on the digit value. We can have that many threads, one per bucket, performing these readditions in parallel using very little data. Together with the parallelization in (1), this gives approximately 256*(2^w/2-1)/w threads. For w=8..10 (1K-10K points) this is 4000-13000 threads total.
  3. Adding up the buckets can also be parallelized, using a formula that is slightly less optimal for serial execution but amenable to parallelization. E.g. we can divide the triangle A+B+B+C+C+C+D+D+D+D+... into sub-triangles and large squares, trading some extra doublings for fewer additions. However, since (1) and (2) already provide enough parallelism, this may not be worthwhile.
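
As a concrete, hypothetical illustration of item 1, with rayon standing in for whatever threading layer would be used (this PR contains no threading code; `window_sum`, `windows_count`, `w`, `points`, and `digits` are assumed names):

```rust
use rayon::prelude::*;

// One task per window. `window_sum(i, ...)` stands for the serial
// fill-buckets-then-combine work for window i; it only reads the shared
// points and digits.
let window_sums: Vec<EdwardsPoint> = (0..windows_count)
    .into_par_iter()
    .map(|i| window_sum(i, &points, &digits))
    .collect();

// Join, then fold serially from the high window down, with w doublings
// between windows.
let mut result = EdwardsPoint::identity();
for s in window_sums.iter().rev() {
    for _ in 0..w {
        result = &result + &result; // a doubling, via plain point addition
    }
    result = &result + s;
}
```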

@oleganza (author):

Similarly for Straus:

  1. Premultiplication of the dynamic bases can be parallelized (1 thread per point).
  2. If we use a fixed-window scalar representation, we can have 1 thread per window and add up the premultiplied points in parallel. This is similar to Pippenger, but without buckets, since the points are already premultiplied. The resulting sums are added serially, with w doublings applied in between.

Straus has enough parallelism for a CPU, but not for a GPU. That's not an issue, though, since Straus is less effective than Pippenger at large input sizes anyway.

@oleganza (author):

Replaced the branch with a rebase done by @hdevalence.

// Step 4. Iterate over pairs of (point, scalar)
// and add/sub the point to the corresponding bucket.
// Note: when we add support for precomputed lookup tables,
// we'll be adding/subtracting the point premultiplied by `digits[i]` to buckets[0].
@oleganza (author):

Note 2: adding precomputed points to buckets[0] won't really help us with performance if we have at least one non-precomputed point. This is because we'd have to add up all buckets anyway.

1218, 1448, 1722, 2048,
2435, 2896, 3444, 4096,
4871, 5793, 6889, 8192
];
Contributor:

Where do these sizes come from? Is there a reason not to use powers-of-two?

@oleganza (author):

These are powers of two, with 3 values in between each pair of powers to give more fine-grained benchmarks. I used them at one point to help find the thresholds, but maybe we can simplify this now.

.collect();
b.iter(|| edwards::vartime::multiscalar_mul_tuning(&scalars, &points, threshold));
},
sizes,
Contributor:

I guess this code should probably be removed once we settle on the final size tradeoff?

impl VartimeMultiscalarMul for Pippenger {
type Point = EdwardsPoint;

fn vartime_multiscalar_mul<I, J>(scalars: I, points: J) -> EdwardsPoint
Contributor:

(Skipping review of this for now since it's similar to the affine case)

src/lib.rs Outdated
@@ -20,7 +20,7 @@
//
// This means that missing docs will still fail CI, but means we can use
// README.md as the crate documentation.
#![cfg_attr(feature = "nightly", deny(missing_docs))]
#![cfg_attr(all(feature = "nightly"), deny(missing_docs))]
Contributor:

This all seems unnecessary?

@oleganza (author):

Yeah, it's a leftover from some temporary flags that I used.

src/scalar.rs Outdated
use std::vec::Vec;

#[cfg(feature = "alloc")]
use alloc::Vec;
Contributor:

I'm not sure it's necessary to import these explicitly.

src/scalar.rs Outdated
/// with \\(-2\^r/2 \leq a_i < 2\^r/2\\) for \\(0 \leq i < (n-1)\\) and \\(-2\^r/2 \leq a_{n-1} \leq 2\^r/2\\).
/// Radix power \\(r\\) can be between 1 and 15.
#[cfg(any(feature = "alloc", feature = "std"))]
pub(crate) fn to_arbitrary_radix(&self, mut radix_power: usize) -> Vec<i16> {
Contributor:

Returning a Vec<i16> means we're doing a separate allocation for every single scalar.

Since the digits can be computed in least-significant to most-significant order without backtracking, I wonder whether it makes sense for this function to return an iterator of the digits. This way the caller code can decide how to allocate space for the results.

Ideally this would allow getting rid of the to_radix_16 method entirely, but I think that this isn't easy to do, since it's not possible to .collect into an array. (We need to store those digits without heap allocations, which means using an array. I guess we could do it manually...)
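
To illustrate the suggestion, a lazy generator could look roughly like this (a hedged sketch under the assumptions stated in the comments; `signed_digits` is not this PR's API):

```rust
/// Lazily yield the signed radix-2^w digits of a little-endian 256-bit
/// scalar, lo -> hi, tracking the carry internally so the caller decides
/// how (and whether) to allocate. Assumes 4 <= w <= 8 so all intermediate
/// values fit in i16.
fn signed_digits(bytes: &[u8; 32], w: usize) -> impl Iterator<Item = i16> + '_ {
    debug_assert!((4..=8).contains(&w));
    let half = 1i16 << (w - 1);
    let digits_count = (256 + w - 1) / w;
    let mut carry = 0i16;
    (0..digits_count).map(move |i| {
        // Gather the i-th w-bit window; it may straddle byte boundaries.
        let mut window = 0i16;
        for bit in 0..w {
            let pos = i * w + bit;
            if pos < 256 {
                window |= (((bytes[pos / 8] >> (pos % 8)) & 1) as i16) << bit;
            }
        }
        let coef = carry + window;
        if i == digits_count - 1 {
            // The top digit absorbs the final carry; it alone may equal
            // +2^(w-1) (cf. the doc comment reviewed further down).
            coef
        } else {
            // Recenter into [-2^(w-1), 2^(w-1)), pushing any excess into
            // the next digit through the carry.
            carry = (coef + half) >> w;
            coef - (carry << w)
        }
    })
}
```

A caller that wants heap allocation can still .collect() into a Vec; one that doesn't can fill a fixed-size array instead, as discussed later in the thread.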

src/scalar.rs Outdated
debug_assert!(self[31] <= 127);

// Ensure that radix is in [2^1, 2^15]
radix_power = if radix_power > 15 { 15 } else { radix_power };
Contributor:

These should be debug_asserts rather than silently changing bad parameters, IMO.

@oleganza (author):

Yeah, it makes even more sense if we do not expose these through the API.

@oleganza (author):

Done.

J::Item: Borrow<EdwardsPoint>,
{
let mut scalars = scalars.into_iter();
let size = scalars.by_ref().size_hint().0;
Contributor:

It would be great to have a comment explaining where the window size boundaries come from (I guess from your cost estimation script?)

@oleganza (author):

Nope, the estimation script only gives a rough idea. I was manually tweaking the test to fine-tune where the threshold should be. The script doesn't account for real code overhead, so sometimes it's off by 10-20%.

7
} else if size < 3_000 {
8
} else if size < 6_000 {
Contributor:

How much of a performance degradation is there by capping the window size at 8? It means that the scalar digits can fit in i8 rather than i16, so they will take half the space in cache. For instance, for 2048 scalars, we have 256/8 = 32 digits each for 64K digits total, so using 1 byte per digit rather than 2 would save 64KB of cache.

On the other hand, if we arrange the scalar digits properly, we may be able to achieve linear access, in which case this might not matter.

@oleganza (author):

According to an estimate from the provided Ruby script, the difference is around 5-7% when doubling the points from 3K to 6K. It grows to a 25% and 35% difference at 30K and 300K points respectively.

@oleganza (author):

Also, interestingly, if we cap the supported window at max(w)=8, then we only need w ∈ {6,7,8}, because w < 6 is optimal only for n < 190, where Straus is faster anyway. I wonder if having only three possible window sizes is beneficial in some sense?

// Iterate the digits from last to first so we can shift the result by w bits
// at the end of each iteration (e.g. w doublings).
let mut result = EdwardsPoint::identity();
for i in (0..windows_count).rev() {
Contributor:

Is windows_count supposed to be the number of digits in each scalar, so that (0..windows_count).rev() iterates over the digits in reverse order?

@oleganza (author):

Yeah. Window <=> digit.

// C B
// C B A Sum = C + (C+B) + (C+B+A)
let mut buckets_intermediate_sum = buckets[buckets_count - 1].clone();
let mut buckets_sum = buckets[buckets_count - 1].clone();
Contributor:

It would be good to allocate the scratch space for this up-front, rather than doing two allocations per loop iteration.

edit: I misread this code and thought you were cloning the vectors. Since points are Copy, I think it would be better just to use normal assignment for copyable stack data.

@oleganza (author):

Good point, fixed in 67a6a97.
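
For reference, the trick in the fragment above written out as a sketch (variable names follow the fragment; plain assignment suffices because EdwardsPoint is Copy, per the comment above):

```rust
// For buckets A, B, C holding multiplier weights 1, 2, 3:
//   intermediate: C, C+B, C+B+A
//   sum:          C + (C+B) + (C+B+A) = A + 2B + 3C
let mut intermediate = buckets[buckets_count - 1];
let mut sum = buckets[buckets_count - 1];
for j in (0..buckets_count - 1).rev() {
    intermediate += &buckets[j];
    sum += &intermediate;
}
```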

hdevalence and others added 9 commits July 13, 2018 14:51
These traits have the same interface, but with different names, so that it's
not possible to use them interchangeably.  (Constant-time and variable-time
routines should not be used interchangeably).

This commit changes the external API to use these traits, replacing
```
edwards::multiscalar_mul
edwards::vartime::multiscalar_mul
```
with
```
EdwardsPoint::multiscalar_mul (as an impl)
EdwardsPoint::vartime_multiscalar_mul (as an impl)
```
and similarly for Ristretto.

Refactoring the backend is for a later commit.

Multiscalar multiplication with precomputation is for a later commit.

The `edwards::vartime` module is retained since it's used for
`vartime_double_base_scalar_mul`.

It should be subsumed into the precomputation API in a later commit.
@oleganza commented Aug 2, 2018

Re: #129 (comment)

> Returning a Vec<i16> means we're doing a separate allocation for every single scalar.
>
> Since the digits can be computed in least-significant to most-significant order without backtracking, I wonder whether it makes sense for this function to return an iterator of the digits. This way the caller code can decide how to allocate space for the results.
>
> Ideally this would allow getting rid of the to_radix_16 method entirely, but I think that this isn't easy to do, since it's not possible to .collect into an array. (We need to store those digits without heap allocations, which means using an array. I guess we could do it manually...)

Unfortunately, we need to iterate over the digits in reverse order (from most significant to least significant), but we currently create them in the natural order (lo->hi) to keep track of the carry.

EDIT: we can scan digits lo->hi and generate them on the fly if we delay the final summation by remembering accumulated points "per column" (per digit position). Then we efficiently add these accumulated points hi->lo as we do today, at the expense of additional memory for 30-40 points. See #129 (comment)

An alternative is a static array for the digits, big enough to fit scalars in all supported representations. The smallest digit size is 6 (widths below 6 are optimal only for input sizes < 190, where Straus is more efficient, so we can safely ignore widths 1..5), which requires 43 slots for digits. Larger digits require less space, so we can leave the rest of the array zeroed.

If we allow digits up to 2^15-1, then we need [i16; 43], which makes the array take 86 bytes. If we allow digits up to 2^8-1, then we need [i8; 43], which is 43 bytes, only 34% more than the raw 32-byte scalar. Using i8s will force us to cap the signed digit width, making it suboptimal for inputs larger than 3K points. My estimate is that on 10K points the overhead would be 5%, on 30K — 25%, and on 300K — 30-35%.
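
A sketch of the fixed-array variant, reusing the hypothetical `signed_digits` generator from the review thread above:

```rust
// 43 slots covers the smallest supported width (w = 6 gives
// ceil(256/6) = 43 digits); wider digits fill fewer slots and the
// zeroed tail is harmless.
let mut digits = [0i16; 43];
for (slot, digit) in digits.iter_mut().zip(signed_digits(scalar.as_bytes(), w)) {
    *slot = digit;
}
```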

@oleganza commented Aug 2, 2018

How reasonable are inputs of over 10K points (where availability of digits wider than 8 bits makes a difference)?

Case 1: Bulletproof range proofs use 17 dynamic points and ≈130 static points each. We would need to batch-verify 588 range proofs to use Straus for the 130 shared static points and Pippenger for the remaining ≈10K dynamic points.

Case 2: An aggregated range proof with 74 participants also requires a ≈10K-point multiscalar multiplication.

Case 3: 10K Schnorr signatures verified in a batch.

Finally, batch verification of 100 transactions, each with a 64-party aggregated range proof, would require a multiplication of over 800K points, where the speedup from wider digits could be as large as 2x.

@oleganza oleganza closed this Aug 2, 2018
@oleganza oleganza reopened this Aug 2, 2018
@oleganza commented Aug 3, 2018

Correction: we don’t need to scan digits hi->lo if we delay adding and shifting the intermediate results. We’d only need additional memory for 30-40 points (one per digit position), which are processed hi->lo after the main Pippenger logic finishes. (See the sketch below.)
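
A sketch of this correction (illustrative names; `signed_digits` and `combine_buckets` — the running-sum trick — are the hypothetical helpers from earlier sketches):

```rust
// One lazy digit iterator per scalar, as in the DigitsBatch idea.
let mut digit_iters: Vec<_> = scalars
    .iter()
    .map(|s| signed_digits(s.as_bytes(), w))
    .collect();

let mut column_sums = Vec::with_capacity(windows_count); // the extra 30-40 points
for _ in 0..windows_count {
    let mut buckets = vec![EdwardsPoint::identity(); 1 << (w - 1)];
    // Pull exactly one digit per scalar: this consumes digits lo->hi,
    // matching the order in which the carry-based generator produces them.
    for (point, digits) in points.iter().zip(digit_iters.iter_mut()) {
        let digit = digits.next().unwrap();
        if digit > 0 {
            buckets[(digit as usize) - 1] += point;
        } else if digit < 0 {
            buckets[((-digit) as usize) - 1] -= point;
        }
    }
    column_sums.push(combine_buckets(&buckets));
}
// column_sums is then folded hi->lo with w doublings per column,
// exactly as in the main loop today.
```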

@oleganza oleganza changed the title [WIP] Add Pippenger multiscalar multiplication algorithm for larger input vectors Add Pippenger multiscalar multiplication algorithm for larger input vectors Aug 3, 2018
@oleganza oleganza changed the title Add Pippenger multiscalar multiplication algorithm for larger input vectors Pippenger multiscalar multiplication algorithm for larger input vectors Aug 3, 2018
@oleganza oleganza changed the title Pippenger multiscalar multiplication algorithm for larger input vectors Pippenger multiscalar multiplication algorithm for very large inputs Aug 3, 2018
/// 1. Prepare `2^(w-1) - 1` buckets with indices `[1..2^(w-1))` initialized with identity points.
/// Bucket 0 is not needed as it would contain points multiplied by 0.
/// 2. Convert scalars to a radix-`2^w` representation with signed digits in `[-2^w/2, 2^w/2]`.
/// Note: all but last digit will never be equal `2^w/2`, but that's irrelevant to us here.

Reviewer:

I think the "be" is a typo, or it should be "equal to". (Sorry for the minor nitpick)

@oleganza (author):

Yeah, that's my favorite mistake (I'm not a native English speaker), thanks for pointing it out!

@oleganza (author):

Fixed in 5051777

@hdevalence hdevalence added this to the 1.1 milestone Dec 29, 2018
@hdevalence hdevalence modified the milestones: 1.1, 1.2 Jan 20, 2019
@oleganza (author):

Holy shit, this PR is one year old.

@oleganza commented Apr 15, 2019

Going to try converting scalars to i8 digits, per Henry's suggestion above. #129 (comment)

@hdevalence (Contributor):

Cool! I can help with rebasing if you'd like. Since then I've restructured the source tree so that there are serial and vector backends, each with a scalar_mul module (which is where this code should go).

@hdevalence (Contributor):

@oleganza Can I help revitalize this branch?

@oleganza (author):

⚠️ Replacing this PR by #249

@hdevalence (Contributor):

Closing in favor of #249.
