variable-sizes of precomputed tables for basepoint scalar multiplication #311

Conversation

@isislovecruft (Member) commented Dec 31, 2019:

Hi!

I've implemented more table sizes for fixed-base scalar multiplication. The motivation is primarily to allow memory/time trade-offs for signature verification (cf. https://github.com/isislovecruft/ed25519-dalek/blob/feature/public-key-precomputation_r1/src/traits.rs and https://github.com/isislovecruft/ed25519-dalek/blob/feature/public-key-precomputation_r1/src/precomputed.rs): a user who needs to verify many signatures with the same public key can dynamically choose to recompute larger or smaller tables. It is also likely useful for systems which use Chaum-Pedersen commitments, Okamoto commitments, or any other NIZK proof/sigma protocol which uses two basepoints.

I've benchmarked this (see my separate feature/dynamic-tables-sizes-benches branch), and the results so far on a few different chipsets indicate speedups as long as the table fits in the L1d cache. I've implemented tables up to the limit (self-imposed by Scalar.to_radix_2w()) of radix-256, which gives ~480KB tables, since commonly used server chips these days (such as the Xeon E5-2686 v4s found on AWS, and those internal to other large companies) often have 512KB L1{I,D} caches.

This is currently implemented via a new BasepointTable trait, which makes the tables not semver-compatible with 2.0.0; however, I can remove the trait for now and add it back when we're ready for a 3.0.0, if that's preferable. I'd like to be able to start using this now (or at least, I want it before releasing ed25519-dalek-1.0.0), so if we want the trait removed, I'd like to instead add a marker trait for the different sizes of tables, so that the ed25519-dalek code can generalise over them as "public keys".
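
For concreteness, a minimal sketch of the shape such a trait could take (my guess at the API from the description above, not necessarily the PR's exact definitions):

```rust
use curve25519_dalek::scalar::Scalar;

/// Sketch of a trait abstracting over differently-sized precomputed
/// basepoint tables; the method names here are illustrative.
pub trait BasepointTable {
    /// The point type this table holds precomputed multiples of.
    type Point;

    /// Precompute a table of multiples of `basepoint`.
    fn create(basepoint: &Self::Point) -> Self;

    /// Recover the original basepoint from the table.
    fn basepoint(&self) -> Self::Point;

    /// Fixed-base scalar multiplication using the precomputed table.
    fn mul_base(&self, scalar: &Scalar) -> Self::Point;
}
```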

Another implementation note: I opted to implement these via macros, as that required the fewest changes to existing code.

Also, naming question: Should we consider renaming EdwardsBasepointTable to something more generic, like EdwardsPrecomputedPointTable, etc.?

Please review!

This implements a macro for implementing the BasepointTable trait, and
uses the macro to create basepoint table types. The default table
still uses radix-16 representation and is ~30KB in size.  The new
table types, and their memory usage and additions required per
basepoint multiplication are:

 * `EdwardsBasepointTableRadix32`: ~60KB, 52 additions
 * `EdwardsBasepointTableRadix64`: ~120KB, 43 additions
 * `EdwardsBasepointTableRadix128`: ~240KB, 37 additions
 * `EdwardsBasepointTableRadix256`: ~480KB, 32 additions

This is useful for programs/protocols which can apply heuristics or a
learning-based approach to optimising the table size, based on the number of
uses of e.g. a public key, the second basepoint in a Pedersen commitment,
etc. For example: the first time a public key is used to verify a signature,
the usual variable-time basepoint multiscalar multiplication is used; after
1000 verifications, the table size is upgraded; after 10000 verifications it
is upgraded again; and so on.
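
A rough sketch of such an upgrade policy (the thresholds and radix choices here are illustrative, not part of this PR):

```rust
/// Hypothetical policy mapping how often a key has been used to a preferred
/// precomputed-table radix. `None` means "no table yet".
fn preferred_radix(uses: u64) -> Option<u32> {
    match uses {
        // Too few uses to amortise a precomputation: fall back to the
        // usual variable-time multiscalar multiplication.
        0..=999 => None,
        // ~30KB radix-16 table once the key has seen some traffic.
        1_000..=9_999 => Some(16),
        // ~120KB radix-64 table for hot keys.
        10_000..=99_999 => Some(64),
        // ~480KB radix-256 table for the hottest keys.
        _ => Some(256),
    }
}
```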
```rust
}} // End macro_rules! impl_lookup_table

// The first one has to be named "LookupTable" because it's used as a constructor for consts.
impl_lookup_table! {Name = LookupTable, Size = 8, SizeNeg = -8, SizeRange = 1 .. 9, ConversionRange = 0 .. 7} // radix-16
```
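
The larger tables would then be further invocations of the same macro, along these lines (reconstructed from the parameter pattern above; the exact invocations in the diff may differ):

```rust
impl_lookup_table! {Name = LookupTableRadix32,  Size = 16,  SizeNeg = -16,  SizeRange = 1 .. 17,  ConversionRange = 0 .. 15}  // radix-32
impl_lookup_table! {Name = LookupTableRadix64,  Size = 32,  SizeNeg = -32,  SizeRange = 1 .. 33,  ConversionRange = 0 .. 31}  // radix-64
impl_lookup_table! {Name = LookupTableRadix128, Size = 64,  SizeNeg = -64,  SizeRange = 1 .. 65,  ConversionRange = 0 .. 63}  // radix-128
impl_lookup_table! {Name = LookupTableRadix256, Size = 128, SizeNeg = -128, SizeRange = 1 .. 129, ConversionRange = 0 .. 127} // radix-256
```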
Member:

For my understanding/curiosity - is SizeNeg always the negative of Size?
(It holds true for these examples, but will that be true in general? What kinds of lookup tables would have Size != -SizeNeg?) (if it's true in general, would it make sense to get rid of the SizeNeg input?)

@isislovecruft (Member Author):

Yes, SizeNeg is always the negative of Size, because the coefficients of the radix-2^n representation are always 0-centered. I remember having some difficulty getting it to compile without SizeNeg, due to the compiler not knowing for certain that Size would be a type that implements Neg, but I could try doing away with it?

Member:

Thanks for explaining! I see how it's hard to guarantee to the compiler that Size implements Neg... in that case it makes sense to keep SizeNeg. It's not a big deal, I was mostly curious why it was there.

```diff
-debug_assert!(x >= -8);
-debug_assert!(x <= 8);
+debug_assert!(x >= $neg);
+debug_assert!(x as i16 <= $size as i16); // XXX We have to convert to i16s here for the radix-256 case.. this is wrong.
```
Member:

I can see why this is required ($size == 128 in radix-256, whereas i8::MAX == 127)... but why is the conversion to i16 "wrong"? Will it result in incorrect calculations (it doesn't seem so), or is it just bad practice?

Would it be better to make x an i16 to start with, or would that incur too much of a performance cost?

@isislovecruft (Member Author):

It's wrong because $size is 128 in radix-256, but x is an i8, which has a maximum of 127, so x is less than or equal to $size by virtue of the type system alone, and the assertion can never fire. We can't make x an i16 without also changing the return type of Scalar.to_radix_2w(), because the bytes returned from the latter are what are used to call select() here.
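
A standalone illustration of the overflow hazard in question (not the crate's select() itself): `i8::abs` overflows for i8::MIN, panicking in debug builds and wrapping in release builds, so the coefficient has to be widened before its absolute value is taken.

```rust
/// Widen an i8 coefficient to i16 before taking its absolute value,
/// so that |-128| = 128 is representable.
fn coefficient_abs(x: i8) -> u8 {
    let xabs = (x as i16).abs(); // (-128i8).abs() would overflow
    debug_assert!(xabs <= 128);
    xabs as u8
}

fn main() {
    assert_eq!(coefficient_abs(-128), 128);
    assert_eq!(coefficient_abs(127), 127);
}
```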

@isislovecruft (Member Author):

Ah, it's also even more wrong because, in debug mode, this should panic when we try to select(128), but the test basepoints and scalars I wrote are static and don't seem to trigger this. I should probably write a fuzzer and see if I can get it to panic.

Member:

Okay, I see how this is tricky/wrong for radix-256.

Would you like to address those issues (making sure it panics when you call select(128), writing fuzzers, etc.) in this PR? Or I can LGTM this PR with your placeholder comment, so you can address that in a later PR and it doesn't block anything else.

@isislovecruft (Member Author):

I think I'd like fuzzers and better tests for all the sizes, even though I'm pretty thoroughly convinced that everything less than radix-256 is correct. I can try to get to it this evening.

@cathieyun (Member) left a review comment.

@hdevalence (Contributor):

In terms of the API, I think it would be much better if decisions about the implementation of the basepoint tables were encapsulated inside the existing EdwardsBasepointTable type, rather than forcing users to make all of their code generic over the implementation details of the basepoint table (and thus also baking the implementation details into a public API that can't be changed later).

In terms of the implementation, the size of the LookupTable trades the cost of memory access (all table entries must be accessed to achieve uniform memory access patterns) against the cost of point additions (exponentially larger table sizes mean fewer additions). But both of these costs are known, so there should be an optimum tradeoff. For this reason, I'm not sure that it makes sense to have multiple sizes, rather than just picking the best one. It could be that the current size is not optimal, but then we should just change the size to the optimal one.
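
One way to read the "both costs are known, so there should be an optimum" argument is as a simple cost model in the window width w; minimising it picks a single best radix. This is a hypothetical sketch with placeholder constants, not measured numbers:

```rust
/// Rough per-multiplication cost model: constant-time selection scans every
/// table entry once per digit, and each digit costs about one point addition.
fn mul_cost(w: u32, entry_scan_cost: f64, addition_cost: f64) -> f64 {
    let digits = (256 + w - 1) / w; // ceil(256 / w) radix-2^w coefficients
    let entries = 1u64 << (w - 1);  // a lookup table stores 2^(w-1) multiples
    digits as f64 * (entries as f64 * entry_scan_cost + addition_cost)
}
```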

The size of the individual LookupTables is distinct from the overall size of the precomputed data (as in the current EdwardsBasepointTable struct), which uses multiple LookupTables to save on doublings. It would be convenient to be able to scale down the precomputation, but I'm not sure it makes sense to scale it up, as it is already at or beyond the L1d$ size. (In fact the E5-2686 v4 has a 32KiB L1d$ cache per-core; the point computations only use one core each, so the total L1d$ is not relevant). However, scaling down the size of the precomputed data is orthogonal to the LookupTable internals.

@isislovecruft (Member Author):

> In terms of the API, I think it would be much better if decisions about the implementation of the basepoint tables were encapsulated inside the existing EdwardsBasepointTable type, rather than forcing users to make all of their code generic over the implementation details of the basepoint table (and thus also baking the implementation details into a public API that can't be changed later).

I'm not sure how we would accomplish this, as some users are running on small embedded devices, others in SGX enclaves, and still others on laptops or in server farms. In an SGX enclave, I'm currently using radix-16 (and can't afford more memory usage), on my laptop (an i7-8550U) the radix-32 tables are the best time/memory tradeoff, and on an E5-2686 v4 I'm seeing that radix-64 is the best variant (see attached benchmarks):

curve25519-dalek-benchmarks-i7-8550U.txt
curve25519-dalek-benchmarks-i9-7900X.txt
curve25519-dalek-benchmarks-E5-2686v4.txt

This might be an argument for leaving radix-128 and radix-256 out? However, I also wouldn't put it past someone running a large consensus network to eventually throw money at a custom chipset. In any case, I find it confusing to say that we could somehow know ahead of time whether the E5-2686v4 users are running the code in a TEE or outside one, let alone what other memory constraints they might have. (Consider also the case of a consensus verifier node which sees consensus signers drop on- and off-line: how are we to say when it's advantageous to use 30KB of memory versus 60/120/240/480KB?)

isislovecruft added a commit to isislovecruft/curve25519-dalek that referenced this pull request Jan 8, 2020
@isislovecruft (Member Author):

I've added a fuzzer for the radix-256 tables and run it; it quickly hit an attempted overflow in the computation of the absolute value of the coefficient in the LookupTable.select() method, when the coefficient is -127 or 127. This is fixed with a couple of simple casts to an i16.

However, I had suspected that, since the coefficient range in the radix-256 case is [-128, 128], -128 and 128 were over-/underflowing and computing the wrong results, but my tests and the fuzzers check for this and I haven't yet been able to hit a case where this happens.

@hdevalence (Contributor):

In terms of how the implementation decisions could be encapsulated, one option would be for the EdwardsBasepointTable to store an enum and use a match statement to call an appropriate function. Different sizes could be selected when the table is constructed, preferably in a way that hints at the maximum size rather than leaking implementation details through the API.
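
A rough sketch of that suggestion (all names here are illustrative stand-ins, not a concrete API proposal):

```rust
/// Private enum over the available table sizes; never exposed in the API.
enum TableStorage {
    Radix16,  // would hold the ~30KB table
    Radix64,  // ~120KB
    Radix256, // ~480KB
}

pub struct BasepointTableSketch {
    storage: TableStorage,
}

impl BasepointTableSketch {
    /// Construction takes a memory hint instead of leaking the radix choice.
    pub fn with_size_hint(max_bytes: usize) -> Self {
        let storage = if max_bytes >= 480 * 1024 {
            TableStorage::Radix256
        } else if max_bytes >= 120 * 1024 {
            TableStorage::Radix64
        } else {
            TableStorage::Radix16
        };
        BasepointTableSketch { storage }
    }

    /// Operations dispatch internally on the stored variant; a real
    /// implementation would match here to run the appropriate mul routine.
    pub fn describe(&self) -> &'static str {
        match self.storage {
            TableStorage::Radix16 => "radix-16 path",
            TableStorage::Radix64 => "radix-64 path",
            TableStorage::Radix256 => "radix-256 path",
        }
    }
}
```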

@isislovecruft (Member Author):

That sounds much messier, and with less type safety? If we did it that way, as a user I'd have to "hint" at what size I wanted without being sure what size I got back, and then I'd need to call some sort of size_hint() function, potentially many times, whenever my heuristics need to check whether we should upgrade/downgrade the cache size.

@isislovecruft (Member Author):

Hi!

Is it okay to merge this so that I can go forward with the configurable pre-computed public key sizes in ed25519-dalek? I have multiple clients now, several using different HSMs, who want to be able to trade off size constraints for speed at runtime. There's also a likelihood that other parties conducting consensus protocols would like to use this feature, and it'd be good to advertise it.

@cathieyun (Member) commented Jul 9, 2020:

It seems like there isn't a clean / type safe / non-implementation-detail-leaking way to implement this, but with fuzz testing and proper documentation, I am okay with merging this approach. I think I already approved the changes previously. @hdevalence wdyt?

@hdevalence (Contributor):

No, I don't think this is good to merge, for the same reasons I mentioned above. The crate should not leak implementation details into the public API, because this causes API stability guarantees to constrain the implementation details. As I mentioned, it's possible to implement variable-size basepoint tables using an enum internally.

External API considerations aside, I also don't think the implementation strategy of maintaining multiple sizes of lookup table is a good idea. It provides relatively little performance benefit on a microbenchmark, at the cost of increased code complexity and of evicting the rest of the application's data from the cache, presumably degrading the performance of the rest of the application. In the opposite direction from this PR, it would be useful to be able to provide smaller basepoint tables, but this doesn't require changing the lookup tables, only using fewer of them.

@isislovecruft (Member Author):

I played around with @hdevalence's idea to hide the implementation details in an enum, and this does result in a nicer API, however — unless I misunderstood what was meant — it also:

  1. It ignores the fact that you can't have a public enum which leaks private types, and, worse,
  2. Table sizes cannot realistically be changed this way at runtime (again, not without leaking the enum variants and allowing the user to specify exactly which size they wish to cast to) without smashing the stack. (See for example this doctest, which, despite playing around with using Box and calls to drop(), still overflows the default stack size.)

As an aside, I now have four clients and a few other parties specifically interested in the ability to upgrade/downgrade the memory/efficiency tradeoffs of dynamic basepoint table sizes at runtime, according to their own algorithms and hardware.
