variable-sizes of precomputed tables for basepoint scalar multiplication #311

Conversation

@isislovecruft (Member) commented Dec 31, 2019:

Hi!

I've implemented more table sizes for fixed-base scalar multiplication. The motivation is primarily to allow memory/time trade-offs for signature verification (cf. https://github.com/isislovecruft/ed25519-dalek/blob/feature/public-key-precomputation_r1/src/traits.rs and https://github.com/isislovecruft/ed25519-dalek/blob/feature/public-key-precomputation_r1/src/precomputed.rs): a user who needs to verify many signatures with the same public key can dynamically choose to recompute larger or smaller tables. It is also likely useful for systems which use Chaum-Pedersen commitments, Okamoto commitments, or any other NIZK proof/sigma protocol which uses two basepoints.

I've benchmarked this (see my separate feature/dynamic-tables-sizes-benches branch), and the results so far on a few different chipsets indicate speedups as long as the table fits in the L1d cache. I've implemented tables up to the limit (self-imposed by Scalar.to_radix_2w()) of radix-256, which gives ~480KB tables, since commonly used server chips these days (such as the Xeon E5-2686 v4s found on AWS, and those internal to other large companies) often have 512KB L1{I,D} caches.

This is currently implemented via a new BasepointTable trait, which makes the tables not semver-compatible with 2.0.0; however, I can remove the trait for now and add it back when we're ready for a 3.0.0, if that's preferable. I'd like to be able to start using this now (or at least, I want it before releasing ed25519-dalek-1.0.0), so if we want the trait removed, I'd like to instead add a marker trait for the different sizes of tables, so that the ed25519-dalek code can generalise over them as "public keys".
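
For concreteness, a minimal sketch of the shape such a trait could take (my guess at the API from the description above, not necessarily the PR's exact definitions):

```rust
use curve25519_dalek::scalar::Scalar;

/// Sketch of a trait abstracting over differently-sized precomputed
/// basepoint tables; the method names here are illustrative.
pub trait BasepointTable {
    /// The point type this table holds precomputed multiples of.
    type Point;

    /// Precompute a table of multiples of `basepoint`.
    fn create(basepoint: &Self::Point) -> Self;

    /// Recover the original basepoint from the table.
    fn basepoint(&self) -> Self::Point;

    /// Fixed-base scalar multiplication using the precomputed table.
    fn mul_base(&self, scalar: &Scalar) -> Self::Point;
}
```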

Another implementation note: I opted to implement these via macros, as that required the fewest changes to existing code.

Also, naming question: Should we consider renaming EdwardsBasepointTable to something more generic, like EdwardsPrecomputedPointTable, etc.?

Please review!

This implements a macro for implementing the BasepointTable trait, and
uses the macro to create basepoint table types. The default table
still uses radix-16 representation and is ~30KB in size.  The new
table types, and their memory usage and additions required per
basepoint multiplication are:

 * `EdwardsBasepointTableRadix32`: ~60KB, 52 additions
 * `EdwardsBasepointTableRadix64`: ~120KB, 43 additions
 * `EdwardsBasepointTableRadix128`: ~240KB, 37 additions
 * `EdwardsBasepointTableRadix256`: ~480KB, 32 additions

This is useful for programs/protocols which can apply heuristics or a
learning-based approach to optimising the table size, based on the number of
uses of e.g. a public key, the second basepoint in a Pedersen commitment,
etc. For example: the first time a public key is used to verify a signature,
the usual variable-time basepoint multiscalar multiplication is used; after
1000 verifications, the table size is upgraded; after 10000 verifications it
is upgraded again; and so on.
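
A rough sketch of such an upgrade policy (the thresholds and radix choices here are illustrative, not part of this PR):

```rust
/// Hypothetical policy mapping how often a key has been used to a preferred
/// precomputed-table radix. `None` means "no table yet".
fn preferred_radix(uses: u64) -> Option<u32> {
    match uses {
        // Too few uses to amortise a precomputation: fall back to the
        // usual variable-time multiscalar multiplication.
        0..=999 => None,
        // ~30KB radix-16 table once the key has seen some traffic.
        1_000..=9_999 => Some(16),
        // ~120KB radix-64 table for hot keys.
        10_000..=99_999 => Some(64),
        // ~480KB radix-256 table for the hottest keys.
        _ => Some(256),
    }
}
```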
```rust
}} // End macro_rules! impl_lookup_table

// The first one has to be named "LookupTable" because it's used as a constructor for consts.
impl_lookup_table! {Name = LookupTable, Size = 8, SizeNeg = -8, SizeRange = 1 .. 9, ConversionRange = 0 .. 7} // radix-16
```
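
The larger tables would then be further invocations of the same macro, along these lines (reconstructed from the parameter pattern above; the exact invocations in the diff may differ):

```rust
impl_lookup_table! {Name = LookupTableRadix32,  Size = 16,  SizeNeg = -16,  SizeRange = 1 .. 17,  ConversionRange = 0 .. 15}  // radix-32
impl_lookup_table! {Name = LookupTableRadix64,  Size = 32,  SizeNeg = -32,  SizeRange = 1 .. 33,  ConversionRange = 0 .. 31}  // radix-64
impl_lookup_table! {Name = LookupTableRadix128, Size = 64,  SizeNeg = -64,  SizeRange = 1 .. 65,  ConversionRange = 0 .. 63}  // radix-128
impl_lookup_table! {Name = LookupTableRadix256, Size = 128, SizeNeg = -128, SizeRange = 1 .. 129, ConversionRange = 0 .. 127} // radix-256
```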
Member:

For my understanding/curiosity - is SizeNeg always the negative of Size?
(It holds true for these examples, but will that be true in general? What kinds of lookup tables would have Size != -SizeNeg?) (if it's true in general, would it make sense to get rid of the SizeNeg input?)

@isislovecruft (Member Author):

Yes, SizeNeg is always the negative of Size, because the coefficients of the radix-2^n representation are always 0-centered. I remember having some difficulty getting it to compile without SizeNeg, due to the compiler not knowing for certain that Size would be a type that implements Neg, but I could try doing away with it?

Member:

Thanks for explaining! I see how it's hard to guarantee to the compiler that Size implements Neg... in that case it makes sense to keep SizeNeg. It's not a big deal, I was mostly curious why it was there.

```diff
-debug_assert!(x >= -8);
-debug_assert!(x <= 8);
+debug_assert!(x >= $neg);
+debug_assert!(x as i16 <= $size as i16); // XXX We have to convert to i16s here for the radix-256 case.. this is wrong.
```
Member:

I can see why this is required ($size == 128 in radix-256, whereas i8::MAX == 127)... but why is the conversion to i16 "wrong"? Will it result in incorrect calculations (it doesn't seem so), or is it just bad practice?

Would it be better to make x an i16 to start with, or would that incur too much of a performance cost?

@isislovecruft (Member Author):

It's wrong because $size is 128 in radix-256, but x is an i8, which has a maximum of 127, so x is less than or equal to $size by virtue of the type system alone, and the assertion can never fire. We can't make x an i16 without also changing the return type of Scalar.to_radix_2w(), because the bytes returned from the latter are what are used to call select() here.
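
A standalone illustration of the overflow hazard in question (not the crate's select() itself): `i8::abs` overflows for i8::MIN, panicking in debug builds and wrapping in release builds, so the coefficient has to be widened before its absolute value is taken.

```rust
/// Widen an i8 coefficient to i16 before taking its absolute value,
/// so that |-128| = 128 is representable.
fn coefficient_abs(x: i8) -> u8 {
    let xabs = (x as i16).abs(); // (-128i8).abs() would overflow
    debug_assert!(xabs <= 128);
    xabs as u8
}

fn main() {
    assert_eq!(coefficient_abs(-128), 128);
    assert_eq!(coefficient_abs(127), 127);
}
```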

@isislovecruft (Member Author):

Ah, it's also even more wrong because, in debug mode, this should panic when we try to select(128), but the test basepoints and scalars I wrote are static and don't seem to trigger this. I should probably write a fuzzer and see if I can get it to panic.

Member:

Okay, I see how this is tricky/wrong for radix-256.

Would you like to address those issues (making sure it panics when you call select(128), writing fuzzers, etc.) in this PR? Or I can LGTM this PR with your placeholder comment, so you can address that in a later PR and it doesn't block anything else.

@isislovecruft (Member Author):

I think I'd like fuzzers and better tests for all the sizes, even though I'm pretty thoroughly convinced that everything less than radix-256 is correct. I can try to get to it this evening.

@cathieyun (Member) left a review comment.

@hdevalence (Contributor):

In terms of the API, I think it would be much better if decisions about the implementation of the basepoint tables were encapsulated inside the existing EdwardsBasepointTable type, rather than forcing users to make all of their code generic over the implementation details of the basepoint table (and thus also baking the implementation details into a public API that can't be changed later).

In terms of the implementation, the size of the LookupTable trades the cost of memory access (all table entries must be accessed to achieve uniform memory access patterns) against the cost of point additions (exponentially larger table sizes mean fewer additions). But both of these costs are known, so there should be an optimum tradeoff. For this reason, I'm not sure that it makes sense to have multiple sizes, rather than just picking the best one. It could be that the current size is not optimal, but then we should just change the size to the optimal one.
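
One way to read the "both costs are known, so there should be an optimum" argument is as a simple cost model in the window width w; minimising it picks a single best radix. This is a hypothetical sketch with placeholder constants, not measured numbers:

```rust
/// Rough per-multiplication cost model: constant-time selection scans every
/// table entry once per digit, and each digit costs about one point addition.
fn mul_cost(w: u32, entry_scan_cost: f64, addition_cost: f64) -> f64 {
    let digits = (256 + w - 1) / w; // ceil(256 / w) radix-2^w coefficients
    let entries = 1u64 << (w - 1);  // a lookup table stores 2^(w-1) multiples
    digits as f64 * (entries as f64 * entry_scan_cost + addition_cost)
}
```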

The size of the individual LookupTables is distinct from the overall size of the precomputed data (as in the current EdwardsBasepointTable struct), which uses multiple LookupTables to save on doublings. It would be convenient to be able to scale down the precomputation, but I'm not sure it makes sense to scale it up, as it is already at or beyond the L1d$ size. (In fact the E5-2686 v4 has a 32KiB L1d$ cache per-core; the point computations only use one core each, so the total L1d$ is not relevant). However, scaling down the size of the precomputed data is orthogonal to the LookupTable internals.

@isislovecruft (Member Author):

> In terms of the API, I think it would be much better if decisions about the implementation of the basepoint tables were encapsulated inside the existing EdwardsBasepointTable type, rather than forcing users to make all of their code generic over the implementation details of the basepoint table (and thus also baking the implementation details into a public API that can't be changed later).

I'm not sure how we would accomplish this, as some users are running on small embedded devices, others in SGX enclaves, and still others on laptops or in server farms. In an SGX enclave, I'm currently using radix-16 (and can't afford more memory usage), on my laptop (an i7-8550U) the radix-32 tables are the best time/memory tradeoff, and on an E5-2686 v4 I'm seeing that radix-64 is the best variant (see attached benchmarks):

curve25519-dalek-benchmarks-i7-8550U.txt
curve25519-dalek-benchmarks-i9-7900X.txt
curve25519-dalek-benchmarks-E5-2686v4.txt

This might be an argument for leaving radix-128 and radix-256 out? However, I also wouldn't put it past someone running a large consensus network to eventually throw money at a custom chipset. In any case, I find it confusing to say that we could somehow know ahead of time whether the E5-2686v4 users are running the code in a TEE or outside one, let alone what other memory constraints they might have. (Consider also the case of a consensus verifier node which sees consensus signers drop on- and off-line: how are we to say when it's advantageous to use 30KB of memory versus 60/120/240/480KB?)

isislovecruft added a commit to isislovecruft/curve25519-dalek that referenced this pull request Jan 8, 2020
@isislovecruft (Member Author):

I've added a fuzzer for the radix-256 tables and run it; it quickly hit an attempted overflow in the computation of the absolute value of the coefficient in the LookupTable.select() method, when the coefficient is -127 or 127. This is fixed with a couple of simple casts to an i16.

However, I had suspected that, since the coefficient range in the radix-256 case is [-128, 128], -128 and 128 were over-/underflowing and computing the wrong results, but my tests and the fuzzers check for this and I haven't yet been able to hit a case where this happens.

@hdevalence (Contributor):

In terms of how the implementation decisions could be encapsulated, one option would be for the EdwardsBasepointTable to store an enum and use a match statement to call an appropriate function. Different sizes could be selected when the table is constructed, preferably in a way that hints at the maximum size rather than leaking implementation details through the API.
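
A rough sketch of that suggestion (all names here are illustrative stand-ins, not a concrete API proposal):

```rust
/// Private enum over the available table sizes; never exposed in the API.
enum TableStorage {
    Radix16,  // would hold the ~30KB table
    Radix64,  // ~120KB
    Radix256, // ~480KB
}

pub struct BasepointTableSketch {
    storage: TableStorage,
}

impl BasepointTableSketch {
    /// Construction takes a memory hint instead of leaking the radix choice.
    pub fn with_size_hint(max_bytes: usize) -> Self {
        let storage = if max_bytes >= 480 * 1024 {
            TableStorage::Radix256
        } else if max_bytes >= 120 * 1024 {
            TableStorage::Radix64
        } else {
            TableStorage::Radix16
        };
        BasepointTableSketch { storage }
    }

    /// Operations dispatch internally on the stored variant; a real
    /// implementation would match here to run the appropriate mul routine.
    pub fn describe(&self) -> &'static str {
        match self.storage {
            TableStorage::Radix16 => "radix-16 path",
            TableStorage::Radix64 => "radix-64 path",
            TableStorage::Radix256 => "radix-256 path",
        }
    }
}
```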

@isislovecruft (Member Author):

That sounds much messier, and with less type safety? If we did it that way, as a user I'd have to "hint" at what size I wanted without being sure what size I got back, and then I'd need to call some sort of size_hint() function, potentially many times, whenever my heuristics need to check whether we should upgrade/downgrade the cache size.

@isislovecruft (Member Author):

Hi!

Is it okay to merge this so that I can go forward with the configurable pre-computed public key sizes in ed25519-dalek? I have multiple clients now, several using different HSMs, who want to be able to trade off size constraints for speed at runtime. There's also a likelihood that other parties conducting consensus protocols would like to use this feature, and it'd be good to advertise it.

@cathieyun (Member) commented Jul 9, 2020:

It seems like there isn't a clean / type safe / non-implementation-detail-leaking way to implement this, but with fuzz testing and proper documentation, I am okay with merging this approach. I think I already approved the changes previously. @hdevalence wdyt?

@hdevalence (Contributor):

No, I don't think this is good to merge, for the same reasons I mentioned above. The crate should not leak implementation details into the public API, because this causes API stability guarantees to constrain the implementation details. As I mentioned, it's possible to implement variable-size basepoint tables using an enum internally.

External API considerations aside, I also don't think the implementation strategy of maintaining multiple sizes of lookup table is a good idea. It provides relatively little performance benefit on a microbenchmark, at the cost of increased code complexity and of evicting the rest of the application's data from the cache, presumably degrading the performance of the rest of the application. In the opposite direction from this PR, it would be useful to be able to provide smaller basepoint tables, but this doesn't require changing the lookup tables, only using fewer of them.

@isislovecruft (Member Author):

I played around with @hdevalence's idea to hide the implementation details in an enum, and this does result in a nicer API, however — unless I misunderstood what was meant — it also:

  1. It ignores the fact that you can't have a public enum which leaks private types, and, worse,
  2. Table sizes cannot realistically be changed this way at runtime (again, not without leaking the enum variants and allowing the user to specify exactly which size they wish to cast to) without smashing the stack. (See for example this doctest, which, despite playing around with using Box and calls to drop(), still overflows the default stack size.)

As an aside, I now have four clients and a few other parties specifically interested in the ability to upgrade/downgrade the memory/efficiency tradeoffs of dynamic basepoint table sizes at runtime, according to their own algorithms and hardware.
