Pippenger multiscalar multiplication algorithm for very large inputs #129
Conversation
Some notes on parallelizing the current implementation:
Similarly for Straus:
Straus has enough parallelism for CPU, but not for GPU. But that's not an issue, as Straus is less effective for large input sizes than Pippenger anyway.
Replaced the branch with a rebase done by @hdevalence.
// Step 4. Iterate over pairs of (point, scalar)
// and add/sub the point to the corresponding bucket.
// Note: when we add support for precomputed lookup tables,
// we'll be adding/subtracting the point premultiplied by `digits[i]` to buckets[0].
Note 2: adding precomputed points to buckets[0] won't really help us with performance if we have at least one non-precomputed point. This is because we'd have to add up all buckets anyway.
benches/dalek_benchmarks.rs
Outdated
    1218, 1448, 1722, 2048,
    2435, 2896, 3444, 4096,
    4871, 5793, 6889, 8192
];
Where do these sizes come from? Is there a reason not to use powers-of-two?
This is powers-of-two with 3 values in between each pair of powers, to give more fine-grained benchmarks. I used it at one point to help me find the thresholds, but maybe we can simplify it now.
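For the record, the listed sizes appear to match `round(2^(i/4))`, i.e. four geometrically spaced sizes per octave. A quick sketch (hypothetical helper, not part of the benchmark code):

```rust
// Generate benchmark sizes as round(2^(i/4)) for exponents in
// [lo_exp, hi_exp], i.e. each power of two plus 3 in-between values.
fn bench_sizes(lo_exp: u32, hi_exp: u32) -> Vec<usize> {
    (4 * lo_exp..=4 * hi_exp)
        .map(|i| (2f64).powf(i as f64 / 4.0).round() as usize)
        .collect()
}
```

This reproduces the excerpt above (1218 = round(2^10.25), 1448 = round(2^10.5), and so on up to 8192 = 2^13).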
benches/dalek_benchmarks.rs
Outdated
    .collect();
b.iter(|| edwards::vartime::multiscalar_mul_tuning(&scalars, &points, threshold));
},
sizes,
I guess this code should probably be removed once we settle on the final size tradeoff?
impl VartimeMultiscalarMul for Pippenger {
    type Point = EdwardsPoint;

    fn vartime_multiscalar_mul<I, J>(scalars: I, points: J) -> EdwardsPoint
(Skipping review of this for now since it's similar to the affine case)
src/lib.rs
Outdated
@@ -20,7 +20,7 @@
 //
 // This means that missing docs will still fail CI, but means we can use
 // README.md as the crate documentation.
-#![cfg_attr(feature = "nightly", deny(missing_docs))]
+#![cfg_attr(all(feature = "nightly"), deny(missing_docs))]
This `all` seems unnecessary?
Yeah, it's a leftover from some temporary flags that I used.
src/scalar.rs
Outdated
use std::vec::Vec;

#[cfg(feature = "alloc")]
use alloc::Vec;
I'm not sure it's necessary to import these explicitly.
src/scalar.rs
Outdated
/// with \\(-2\^r/2 \leq a_i < 2\^r/2\\) for \\(0 \leq i < (n-1)\\) and \\(-2\^r/2 \leq a_{n-1} \leq 2\^r/2\\).
/// Radix power \\(r\\) can be between 1 and 15.
#[cfg(any(feature = "alloc", feature = "std"))]
pub(crate) fn to_arbitrary_radix(&self, mut radix_power: usize) -> Vec<i16> {
Returning a `Vec<i16>` means we're doing a separate allocation for every single scalar.

Since the digits can be computed in least-significant to most-significant order without backtracking, I wonder whether it makes sense for this function to return an iterator of the digits. This way the caller code can decide how to allocate space for the results.

Ideally this would allow getting rid of the `to_radix_16` method entirely, but I think that this isn't easy to do, since it's not possible to `.collect` into an array. (We need to store those digits without heap allocations, which means using an array. I guess we could do it manually...)
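A toy sketch of the iterator idea (a `u64` stands in for a 256-bit scalar; this is not the dalek API): digits come out lo->hi lazily, with the borrow folded back into the remaining value, so no backtracking or per-scalar allocation is needed. Assumes `w >= 2`; the function name is made up for illustration.

```rust
// Signed radix-2^w digits of n, produced lazily lo->hi. Each digit lies
// in [-2^(w-1), 2^(w-1)); reconstructing sum(d_i * 2^(w*i)) yields n.
fn signed_digits(mut n: u64, w: u32) -> impl Iterator<Item = i64> {
    let radix = 1u64 << w;
    std::iter::from_fn(move || {
        if n == 0 {
            return None;
        }
        let mut d = (n % radix) as i64;
        n >>= w;
        if d >= (radix / 2) as i64 {
            d -= radix as i64; // recenter the digit into the signed range...
            n += 1;            // ...and carry the borrow into the next digit
        }
        Some(d)
    })
}
```

The caller can then `collect` into whatever storage it wants, or consume the digits on the fly without materializing them at all.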
src/scalar.rs
Outdated
debug_assert!(self[31] <= 127);

// Ensure that radix is in [2^1, 2^15]
radix_power = if radix_power > 15 { 15 } else { radix_power };
These should be `debug_assert`s rather than silently changing bad parameters, IMO.
Yeah, it makes even more sense if we do not expose these through the API.
Done.
src/scalar_mul/pippenger.rs
Outdated
    J::Item: Borrow<EdwardsPoint>,
{
    let mut scalars = scalars.into_iter();
    let size = scalars.by_ref().size_hint().0;
It would be great to have a comment explaining where the window size boundaries come from (I guess from your cost estimation script?)
Nope, the estimation script only gives a rough idea. I was manually tweaking the test to fine-tune where the threshold should be. The script does not account for the real code overhead, so sometimes it's off by 10-20%.
    7
} else if size < 3_000 {
    8
} else if size < 6_000 {
How much of a performance degradation is there by capping the window size at 8? It means that the scalar digits can fit in `i8` rather than `i16`, so they will take half the space in cache. For instance, for 2048 scalars, we have 256/8 = 32 digits each for 64K digits total, so using 1 byte per digit rather than 2 would save 64KB of cache.

On the other hand, if we arrange the scalar digits properly, we may be able to achieve linear access, in which case this might not matter.
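The cache arithmetic above can be double-checked with a throwaway helper (hypothetical, just to pin down the numbers):

```rust
// Total bytes needed to store the digits of `scalars` scalars of
// `scalar_bits` bits each, in radix 2^w, at `bytes_per_digit` per digit.
fn digit_bytes(scalars: usize, scalar_bits: usize, w: usize, bytes_per_digit: usize) -> usize {
    let digits_per_scalar = (scalar_bits + w - 1) / w; // ceil(bits / w)
    scalars * digits_per_scalar * bytes_per_digit
}
```

For 2048 scalars of 256 bits with w = 8, this gives 64KB with `i8` digits versus 128KB with `i16`, matching the estimate above.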
According to an estimation using the provided Ruby script, the difference is about 5-7% when you double the points from 3K to 6K, and reaches 25-35% when you go to 30K and 300K points respectively.
Also, interestingly, if we reduce the support to `max(w) = 8`, then we only need `w = {6,7,8}`, because w < 6 is only optimal for n < 190, where Straus is faster anyway. I wonder if having only three possible window sizes is beneficial in some sense?
src/scalar_mul/pippenger.rs
Outdated
// Iterate the digits from last to first so we can shift the result by w bits
// at the end of each iteration (i.e. w doublings).
let mut result = EdwardsPoint::identity();
for i in (0..windows_count).rev() {
Is `windows_count` supposed to be the number of digits in each scalar, so that `(0..windows_count).rev()` iterates over the digits in reverse order?
Yeah. Window <=> digit.
src/scalar_mul/pippenger.rs
Outdated
//     C B
//     C B A   Sum = C + (C+B) + (C+B+A)
let mut buckets_intermediate_sum = buckets[buckets_count - 1].clone();
let mut buckets_sum = buckets[buckets_count - 1].clone();
It would be good to allocate the scratch space for this up-front, rather than doing two allocations per loop iteration.

edit: I misread this code, and thought you were cloning the vectors. Since points are `Copy`, I think it would be better just to use normal assignment for copyable stack data.
Good point, fixed in 67a6a97.
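For reference, the running-sum trick in the comment above can be sketched with plain integers standing in for curve points (`weighted_bucket_sum` is a hypothetical name, not from the codebase):

```rust
// buckets[j] holds the sum of points whose digit is j+1. The expression
// Sum = C + (C+B) + (C+B+A) evaluates sum_j (j+1) * buckets[j] using
// only additions, by accumulating a running suffix sum.
fn weighted_bucket_sum(buckets: &[i64]) -> i64 {
    let mut intermediate = 0; // running suffix sum: buckets[j..]
    let mut total = 0;        // sum of all the suffix sums so far
    for &b in buckets.iter().rev() {
        intermediate += b;
        total += intermediate;
    }
    total
}
```

Each bucket's multiplier is applied "automatically": bucket j is included in exactly j+1 of the suffix sums, so no scalar multiplications are needed.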
These traits have the same interface, but with different names, so that it's not possible to use them interchangeably. (Constant-time and variable-time routines should not be used interchangeably.)

This commit changes the external API to use these traits, replacing
```
edwards::multiscalar_mul
edwards::vartime::multiscalar_mul
```
with
```
EdwardsPoint::multiscalar_mul (as an impl)
EdwardsPoint::vartime_multiscalar_mul (as an impl)
```
and similarly for Ristretto. Refactoring the backend is for a later commit. Multiscalar multiplication with precomputation is for a later commit.

The `edwards::vartime` module is retained since it's used for `vartime_double_base_scalar_mul`. It should be subsumed into the precomputation API in a later commit.
Re:

EDIT: we can scan digits lo->hi and generate them on the fly if we delay adding everything up, by remembering accumulated points "per column" (per digit position). Then we efficiently add these accumulated points hi->lo as we do today, at the expense of additional memory for 30-40 points.

An alternative is to have a static array for the digits, big enough to fit scalars in all supported representations. The smallest digit size is 6 (widths less than 6 are only optimal for input sizes < 190, where Straus is more efficient, so we can safely ignore widths 1..5), which requires 43 slots for digits. Larger digits require less space, so we can leave the rest of the array zeroed. If we allow digits up to 2^15-1, then we need
How reasonable are inputs of over 10K points (where the availability of digits wider than 8 bits makes a difference)?

Case 1: Bulletproof range proofs use 17 dynamic points and ≈130 static points. We need to batch-verify 588 range proofs in order to use Straus for the 130 shared static points and Pippenger for the remaining 10K dynamic points.

Case 2: An aggregated range proof with 74 participants also requires ≈10K-point multiscalar multiplications.

Case 3: 10K Schnorr signatures verified in a batch.

Finally, batch verification of 100 transactions with 64-party aggregated range proofs in each would require multiplication of over 800K points, where the speedup from wider digits could be as big as 2x.
Correction: we don't need to scan digits hi->lo if we delay adding and shifting intermediate results. We'd only have to add memory for 30-40 points (one per digit position), which will be processed hi->lo after we finish with the main Pippenger logic.
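A sketch of that correction, with integers standing in for points so that a doubling is just a shift (`fold_columns` is a hypothetical helper, not from the codebase):

```rust
// After scanning digits lo->hi into one accumulator per digit column,
// fold the 30-40 column accumulators hi->lo, applying w "doublings"
// (here: a left shift by w) between columns, Horner-style.
fn fold_columns(cols: &[i128], w: u32) -> i128 {
    let mut acc = 0i128;
    for &c in cols.iter().rev() {
        acc = (acc << w) + c; // shift = w doublings, then add column c
    }
    acc
}
```

This computes `sum(cols[i] * 2^(w*i))`, which is exactly what the existing hi->lo pass produces, just decoupled from digit generation.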
src/scalar_mul/pippenger.rs
Outdated
/// 1. Prepare `2^(w-1) - 1` buckets with indices `[1..2^(w-1))` initialized with identity points.
///    Bucket 0 is not needed as it would contain points multiplied by 0.
/// 2. Convert scalars to a radix-`2^w` representation with signed digits in `[-2^w/2, 2^w/2]`.
///    Note: all but last digit will never be equal `2^w/2`, but that's irrelevant to us here.
I think the "be" is a typo, or it should be "equal to". (Sorry for the minor nitpick)
Yeah, that's my favorite mistake (I'm not a native English speaker), thanks for pointing it out!
Fixed in 5051777
Holy shit, this PR is one year old.
Going to try converting scalars to
Cool! I can help with rebasing if you'd like. Since then I restructured the source tree so that there are
@oleganza Can I help revitalize this branch?
Closing in favor of #249.
The currently supported Straus algorithm does not improve performance as input size grows, and saturates at ≈4x improvement over naïve multiplication. Pippenger's algorithm takes advantage of a very large number of points (>190) by avoiding premultiplication of points and instead placing points into buckets indexed by the multiplication factor. The buckets are then cleverly added up so that their multipliers are applied automatically. The process is repeated for each "digit" (digits are 6 to 15 bits wide, depending on the number of input points). As a result, the cost of multiplication grows slower than linearly with respect to the input size. For 1024 points, Pippenger is >60% faster than Straus.
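To make the bucket idea concrete, here is a toy model over integers: addition stands in for point addition, a left shift for w doublings, and unsigned digits keep the sketch minimal (the real implementation uses signed digits). `pippenger_toy` is a made-up name for illustration only.

```rust
// Toy Pippenger over the additive group of i128. Computes
// sum(scalars[i] * points[i]) without any multiplications by scalars:
// points land in buckets keyed by their current digit, and the
// running-sum trick applies each bucket's weight via additions only.
fn pippenger_toy(scalars: &[u64], points: &[i128], w: u32) -> i128 {
    let radix = 1u64 << w;
    let windows = (64 + w - 1) / w; // number of radix-2^w digits in a u64
    let mut result: i128 = 0;
    for window in (0..windows).rev() {
        // "Double" the accumulated result w times (a shift, here).
        result <<= w;
        // One bucket per nonzero digit value in 1..2^w-1.
        let mut buckets = vec![0i128; (radix - 1) as usize];
        for (s, p) in scalars.iter().zip(points.iter()) {
            let digit = (*s >> (window * w)) % radix;
            if digit != 0 {
                buckets[(digit - 1) as usize] += *p;
            }
        }
        // Running-sum trick: adds buckets[j] weighted by (j+1).
        let mut intermediate = 0i128;
        let mut sum = 0i128;
        for &b in buckets.iter().rev() {
            intermediate += b;
            sum += intermediate;
        }
        result += sum;
    }
    result
}
```

Each point costs roughly one addition per digit regardless of the digit's value, which is why the per-point cost shrinks as larger inputs justify wider windows.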
This patch adds:

- `VartimeMultiscalarMul` using the Pippenger algorithm.
- A switch based on `size_hint()` from Straus to Pippenger (at 190 points).
- A `DigitsBatch` API to enumerate digits of scalars of arbitrary radix in a batch. Digits are produced via an iterator API on the fly, without pre-allocation: this allows us to have wider digits (`i16`) for more efficient computation of over 10K points.

TODO
This addresses #130.