
Signed-digit multi-comb for ecmult_gen (by peterdettman) #693

Open · wants to merge 9 commits into base: master
Conversation

@sipa

sipa commented Nov 11, 2019

This is a rebase of #546. See the original PR for a full description; in short, this introduces a new constant-time multiplication algorithm with precomputation, with better speed/size tradeoffs. It is more flexible, allowing either better speed with the same table size or a smaller table at the same speed, and it permits extremely small tables with still-reasonable speeds.

In addition to the original PR, this also:

  • Removes the old ecmult algorithm entirely
  • Makes the tunables configurable through configure, and tests a few combinations in Travis.
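For intuition, here is a toy model of the signed-digit multi-comb idea (my sketch, not the PR's code): integer arithmetic mod the secp256k1 group order stands in for point operations, so "the point k*G" is just the integer k*g mod n, and the parameter choice (blocks=4, teeth=4, spacing=16) is purely illustrative.

```python
# Toy signed-digit multi-comb. "Points" are integers mod the group order N,
# "add" is modular addition and "double" is multiplication by 2, so the comb
# structure can be checked against plain scalar multiplication.
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
g = 7  # stand-in "generator"

BLOCKS, TEETH, SPACING = 4, 4, 16          # BLOCKS*TEETH*SPACING == 256

def tooth_pos(j, m):
    # Bit position covered by tooth m of block j at comb offset 0.
    return j * TEETH * SPACING + m * SPACING

# Precompute: per block, one entry per sign pattern of the low TEETH-1 teeth;
# the top tooth is fixed at +1 and its sign is applied at lookup time.
TABLE = [[(pow(2, tooth_pos(j, TEETH - 1), N)
           + sum((1 if (idx >> m) & 1 else -1) * pow(2, tooth_pos(j, m), N)
                 for m in range(TEETH - 1))) * g % N
          for idx in range(1 << (TEETH - 1))]
         for j in range(BLOCKS)]

def ecmult_gen_toy(k):
    assert k & 1, "toy version: scalar must be odd"
    # Recode k into 256 signed digits d_i in {-1,+1}: with
    # e = (k + 2^256 - 1) / 2, digit d_i = 2*bit_i(e) - 1 and sum d_i*2^i = k.
    e = (k + (1 << 256) - 1) >> 1
    r = 0
    for i in range(SPACING - 1, -1, -1):
        r = r * 2 % N                       # one "doubling" per comb position
        for j in range(BLOCKS):             # one "add" per block per position
            top = (e >> (tooth_pos(j, TEETH - 1) + i)) & 1
            idx = 0
            for m in range(TEETH - 1):      # fold the top tooth's sign out
                bit = (e >> (tooth_pos(j, m) + i)) & 1
                idx |= (bit if top else 1 - bit) << m
            r = (r + (TABLE[j][idx] if top else -TABLE[j][idx])) % N
    return r
```

The sign-folding step is why each block's table needs only 2^(teeth-1) entries rather than 2^teeth: negating an entry and complementing the remaining teeth covers the other half of the patterns.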
@sipa


sipa commented Nov 11, 2019

Note that I have not reviewed the multiplication code in detail yet.

@gmaxwell


gmaxwell commented Nov 13, 2019

@sipa do you think it would make more sense to benchmark all the blocks and teeth settings, and then find the efficient frontier for best performance as a function of memory, and then have the setting just offer a couple options along that frontier?

As this stands, by my count this creates 306 build configurations, all of which should at least get minimal testing occasionally. Probably just a couple of sizes are sufficient for exposed options. Usage probably comes in a few flavours: "embedded device, as small as possible please", "embedded device, I care about performance but small please", "desktop, meh, whatever, but don't blow out my cache too badly", and "signing network service, GIVE ME ALL YOUR MEMORY" -- and not too many others.

@sipa


sipa commented Nov 13, 2019

@gmaxwell Yeah, I was thinking of doing something like that. I hoped to reduce the number of combinations by looking at the number of adds/doubles, and removing anything that doesn't improve on either but has a larger table size. The problem is that the number of cmovs grows exponentially with the teeth setting, and at teeth=8 every point add is preceded by iterating over 8 kB of data to select the right entry, at which point the cmov cost is certainly nontrivial as well.
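The lookup cost being discussed comes from the constant-time selection pattern: every table entry is read on every add, with a mask deciding which one survives. A sketch of the access pattern (Python obviously can't guarantee constant time; in the C implementation this is a cmov per word):

```python
def ct_select(table, index):
    # Branch-free selection: scan all entries; the mask is all-ones only for
    # the wanted index, so exactly one entry survives the OR-accumulate.
    out = [0] * len(table[0])
    for i, entry in enumerate(table):
        mask = ((i ^ index) - 1) >> 63     # -1 (all ones) iff i == index
        for w in range(len(entry)):
            out[w] |= entry[w] & mask
    return out

# With teeth=8, a block's table has 2^7 = 128 entries of 64 bytes, so every
# point add first streams through 128 * 64 = 8192 bytes, as noted above.
```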

@sipa


sipa commented Nov 13, 2019

Anyway, ignoring the costs of scalar operations and cmovs, and counting a point add as 260 ns and a doubling as 159 ns, these are the optimal configurations:

blocks=1 teeth=2 tblsize=192 adds=127 doubles=127
blocks=1 teeth=3 tblsize=256 adds=85 doubles=85
blocks=2 teeth=3 tblsize=512 adds=85 doubles=42
blocks=1 teeth=4 tblsize=576 adds=63 doubles=63
blocks=1 teeth=5 tblsize=1024 adds=51 doubles=51
blocks=2 teeth=4 tblsize=1088 adds=63 doubles=31
blocks=3 teeth=4 tblsize=1536 adds=65 doubles=21
blocks=1 teeth=6 tblsize=2048 adds=42 doubles=42
blocks=2 teeth=5 tblsize=2048 adds=51 doubles=25
blocks=3 teeth=5 tblsize=3072 adds=53 doubles=17
blocks=1 teeth=7 tblsize=4096 adds=36 doubles=36
blocks=2 teeth=6 tblsize=4096 adds=43 doubles=21
blocks=3 teeth=6 tblsize=6144 adds=44 doubles=14
blocks=2 teeth=7 tblsize=8192 adds=37 doubles=18
blocks=3 teeth=7 tblsize=12288 adds=38 doubles=12
blocks=4 teeth=7 tblsize=16384 adds=39 doubles=9
blocks=2 teeth=8 tblsize=16448 adds=31 doubles=15
blocks=3 teeth=8 tblsize=24576 adds=32 doubles=10
blocks=4 teeth=8 tblsize=32832 adds=31 doubles=7
blocks=8 teeth=8 tblsize=65600 adds=31 doubles=3
blocks=16 teeth=8 tblsize=131136 adds=31 doubles=1
blocks=32 teeth=8 tblsize=262208 adds=31 doubles=0
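One way to generate a list like this (my reconstruction, not the actual script): enumerate (blocks, teeth), derive the operation counts, and keep a configuration only if no smaller-or-equal table achieves a lower estimated time. The 64-byte entry size, the extra entry when blocks*teeth*spacing lands exactly on 256, and the count formulas are inferred from the numbers in this thread; the 260/159 ns weights are the ones quoted above.

```python
import math

def config(blocks, teeth):
    # spacing: doublings between comb positions; blocks*teeth*spacing >= 256.
    spacing = math.ceil(256 / (blocks * teeth))
    table = blocks * (1 << (teeth - 1)) * 64
    if blocks * teeth * spacing == 256:   # no spare top digit: one extra entry
        table += 64
    adds, doubles = blocks * spacing - 1, spacing - 1
    return table, adds * 260 + doubles * 159   # estimated ns, ignoring cmovs

frontier, best = [], float("inf")
for table, ns, blocks, teeth in sorted(
        (*config(b, t), b, t) for t in range(1, 9) for b in range(1, 129)):
    if ns < best:   # keep only configs that improve on every smaller table
        best = ns
        frontier.append((blocks, teeth, table))
```

Under these weights the filter keeps, for example, (1,5) at 1024 bytes and (2,5) at 2048 bytes, consistent with the list above.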

@sipa


sipa commented Nov 13, 2019

Here is a more extensive list of all potentially-useful combinations, taking number of cmovs into account (and not making any assumptions about the relative costs of adds vs doubles vs cmovs):

teeth=1 blocks=1 tblsize=128 adds=255 doubles=255 cmovs=256
teeth=1 blocks=3 tblsize=192 adds=257 doubles=85 cmovs=258
teeth=1 blocks=5 tblsize=320 adds=259 doubles=51 cmovs=260
teeth=1 blocks=7 tblsize=448 adds=258 doubles=36 cmovs=259
teeth=1 blocks=9 tblsize=576 adds=260 doubles=28 cmovs=261
teeth=1 blocks=11 tblsize=704 adds=263 doubles=23 cmovs=264
teeth=1 blocks=13 tblsize=832 adds=259 doubles=19 cmovs=260
teeth=1 blocks=15 tblsize=960 adds=269 doubles=17 cmovs=270
teeth=1 blocks=19 tblsize=1216 adds=265 doubles=13 cmovs=266
teeth=1 blocks=29 tblsize=1856 adds=260 doubles=8 cmovs=261
teeth=1 blocks=37 tblsize=2368 adds=258 doubles=6 cmovs=259
teeth=1 blocks=43 tblsize=2752 adds=257 doubles=5 cmovs=258

teeth=2 blocks=1 tblsize=192 adds=127 doubles=127 cmovs=256
teeth=2 blocks=2 tblsize=320 adds=127 doubles=63 cmovs=256
teeth=2 blocks=3 tblsize=384 adds=128 doubles=42 cmovs=258
teeth=2 blocks=4 tblsize=576 adds=127 doubles=31 cmovs=256
teeth=2 blocks=5 tblsize=640 adds=129 doubles=25 cmovs=260
teeth=2 blocks=6 tblsize=768 adds=131 doubles=21 cmovs=264
teeth=2 blocks=7 tblsize=896 adds=132 doubles=18 cmovs=266
teeth=2 blocks=8 tblsize=1088 adds=127 doubles=15 cmovs=256
teeth=2 blocks=9 tblsize=1152 adds=134 doubles=14 cmovs=270
teeth=2 blocks=10 tblsize=1280 adds=129 doubles=12 cmovs=260
teeth=2 blocks=11 tblsize=1408 adds=131 doubles=11 cmovs=264
teeth=2 blocks=12 tblsize=1536 adds=131 doubles=10 cmovs=264
teeth=2 blocks=13 tblsize=1664 adds=129 doubles=9 cmovs=260
teeth=2 blocks=15 tblsize=1920 adds=134 doubles=8 cmovs=270
teeth=2 blocks=16 tblsize=2112 adds=127 doubles=7 cmovs=256
teeth=2 blocks=19 tblsize=2432 adds=132 doubles=6 cmovs=266
teeth=2 blocks=22 tblsize=2816 adds=131 doubles=5 cmovs=264
teeth=2 blocks=26 tblsize=3328 adds=129 doubles=4 cmovs=260
teeth=2 blocks=32 tblsize=4160 adds=127 doubles=3 cmovs=256
teeth=2 blocks=43 tblsize=5504 adds=128 doubles=2 cmovs=258
teeth=2 blocks=64 tblsize=8256 adds=127 doubles=1 cmovs=256
teeth=2 blocks=128 tblsize=16448 adds=127 doubles=0 cmovs=256

teeth=3 blocks=1 tblsize=256 adds=85 doubles=85 cmovs=344
teeth=3 blocks=2 tblsize=512 adds=85 doubles=42 cmovs=344
teeth=3 blocks=3 tblsize=768 adds=86 doubles=28 cmovs=348
teeth=3 blocks=4 tblsize=1024 adds=87 doubles=21 cmovs=352
teeth=3 blocks=5 tblsize=1280 adds=89 doubles=17 cmovs=360
teeth=3 blocks=6 tblsize=1536 adds=89 doubles=14 cmovs=360
teeth=3 blocks=7 tblsize=1792 adds=90 doubles=12 cmovs=364
teeth=3 blocks=8 tblsize=2048 adds=87 doubles=10 cmovs=352
teeth=3 blocks=9 tblsize=2304 adds=89 doubles=9 cmovs=360
teeth=3 blocks=10 tblsize=2560 adds=89 doubles=8 cmovs=360
teeth=3 blocks=11 tblsize=2816 adds=87 doubles=7 cmovs=352
teeth=3 blocks=13 tblsize=3328 adds=90 doubles=6 cmovs=364
teeth=3 blocks=15 tblsize=3840 adds=89 doubles=5 cmovs=360
teeth=3 blocks=18 tblsize=4608 adds=89 doubles=4 cmovs=360
teeth=3 blocks=22 tblsize=5632 adds=87 doubles=3 cmovs=352
teeth=3 blocks=29 tblsize=7424 adds=86 doubles=2 cmovs=348
teeth=3 blocks=43 tblsize=11008 adds=85 doubles=1 cmovs=344
teeth=3 blocks=86 tblsize=22016 adds=85 doubles=0 cmovs=344

teeth=4 blocks=1 tblsize=576 adds=63 doubles=63 cmovs=512
teeth=4 blocks=2 tblsize=1088 adds=63 doubles=31 cmovs=512
teeth=4 blocks=3 tblsize=1536 adds=65 doubles=21 cmovs=528
teeth=4 blocks=4 tblsize=2112 adds=63 doubles=15 cmovs=512
teeth=4 blocks=5 tblsize=2560 adds=64 doubles=12 cmovs=520
teeth=4 blocks=6 tblsize=3072 adds=65 doubles=10 cmovs=528
teeth=4 blocks=7 tblsize=3584 adds=69 doubles=9 cmovs=560
teeth=4 blocks=8 tblsize=4160 adds=63 doubles=7 cmovs=512
teeth=4 blocks=10 tblsize=5120 adds=69 doubles=6 cmovs=560
teeth=4 blocks=11 tblsize=5632 adds=65 doubles=5 cmovs=528
teeth=4 blocks=13 tblsize=6656 adds=64 doubles=4 cmovs=520
teeth=4 blocks=16 tblsize=8256 adds=63 doubles=3 cmovs=512
teeth=4 blocks=22 tblsize=11264 adds=65 doubles=2 cmovs=528
teeth=4 blocks=32 tblsize=16448 adds=63 doubles=1 cmovs=512
teeth=4 blocks=64 tblsize=32832 adds=63 doubles=0 cmovs=512

teeth=5 blocks=1 tblsize=1024 adds=51 doubles=51 cmovs=832
teeth=5 blocks=2 tblsize=2048 adds=51 doubles=25 cmovs=832
teeth=5 blocks=3 tblsize=3072 adds=53 doubles=17 cmovs=864
teeth=5 blocks=4 tblsize=4096 adds=51 doubles=12 cmovs=832
teeth=5 blocks=5 tblsize=5120 adds=54 doubles=10 cmovs=880
teeth=5 blocks=6 tblsize=6144 adds=53 doubles=8 cmovs=864
teeth=5 blocks=7 tblsize=7168 adds=55 doubles=7 cmovs=896
teeth=5 blocks=8 tblsize=8192 adds=55 doubles=6 cmovs=896
teeth=5 blocks=9 tblsize=9216 adds=53 doubles=5 cmovs=864
teeth=5 blocks=11 tblsize=11264 adds=54 doubles=4 cmovs=880
teeth=5 blocks=13 tblsize=13312 adds=51 doubles=3 cmovs=832
teeth=5 blocks=18 tblsize=18432 adds=53 doubles=2 cmovs=864
teeth=5 blocks=26 tblsize=26624 adds=51 doubles=1 cmovs=832
teeth=5 blocks=52 tblsize=53248 adds=51 doubles=0 cmovs=832

teeth=6 blocks=1 tblsize=2048 adds=42 doubles=42 cmovs=1376
teeth=6 blocks=2 tblsize=4096 adds=43 doubles=21 cmovs=1408
teeth=6 blocks=3 tblsize=6144 adds=44 doubles=14 cmovs=1440
teeth=6 blocks=4 tblsize=8192 adds=43 doubles=10 cmovs=1408
teeth=6 blocks=5 tblsize=10240 adds=44 doubles=8 cmovs=1440
teeth=6 blocks=6 tblsize=12288 adds=47 doubles=7 cmovs=1536
teeth=6 blocks=8 tblsize=16384 adds=47 doubles=5 cmovs=1536
teeth=6 blocks=9 tblsize=18432 adds=44 doubles=4 cmovs=1440
teeth=6 blocks=11 tblsize=22528 adds=43 doubles=3 cmovs=1408
teeth=6 blocks=15 tblsize=30720 adds=44 doubles=2 cmovs=1440
teeth=6 blocks=22 tblsize=45056 adds=43 doubles=1 cmovs=1408
teeth=6 blocks=43 tblsize=88064 adds=42 doubles=0 cmovs=1376

teeth=7 blocks=1 tblsize=4096 adds=36 doubles=36 cmovs=2368
teeth=7 blocks=2 tblsize=8192 adds=37 doubles=18 cmovs=2432
teeth=7 blocks=3 tblsize=12288 adds=38 doubles=12 cmovs=2496
teeth=7 blocks=4 tblsize=16384 adds=39 doubles=9 cmovs=2560
teeth=7 blocks=5 tblsize=20480 adds=39 doubles=7 cmovs=2560
teeth=7 blocks=8 tblsize=32768 adds=39 doubles=4 cmovs=2560
teeth=7 blocks=10 tblsize=40960 adds=39 doubles=3 cmovs=2560
teeth=7 blocks=13 tblsize=53248 adds=38 doubles=2 cmovs=2496
teeth=7 blocks=19 tblsize=77824 adds=37 doubles=1 cmovs=2432
teeth=7 blocks=37 tblsize=151552 adds=36 doubles=0 cmovs=2368

teeth=8 blocks=1 tblsize=8256 adds=31 doubles=31 cmovs=4096
teeth=8 blocks=2 tblsize=16448 adds=31 doubles=15 cmovs=4096
teeth=8 blocks=3 tblsize=24576 adds=32 doubles=10 cmovs=4224
teeth=8 blocks=4 tblsize=32832 adds=31 doubles=7 cmovs=4096
teeth=8 blocks=5 tblsize=40960 adds=34 doubles=6 cmovs=4480
teeth=8 blocks=6 tblsize=49152 adds=35 doubles=5 cmovs=4608
teeth=8 blocks=7 tblsize=57344 adds=34 doubles=4 cmovs=4480
teeth=8 blocks=8 tblsize=65600 adds=31 doubles=3 cmovs=4096
teeth=8 blocks=11 tblsize=90112 adds=32 doubles=2 cmovs=4224
teeth=8 blocks=16 tblsize=131136 adds=31 doubles=1 cmovs=4096
teeth=8 blocks=32 tblsize=262208 adds=31 doubles=0 cmovs=4096
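The counts in these tables follow a simple closed form. Here is a reconstruction (mine, inferred from the rows; the 64-byte affine entry size and the extra entry when blocks*teeth*spacing is exactly 256 are assumptions that match every row I checked):

```python
import math

def comb_counts(blocks, teeth, point_bytes=64):
    # spacing: doublings between comb positions; each position covers
    # blocks*teeth scalar bits, so blocks*teeth*spacing must reach 256.
    spacing = math.ceil(256 / (blocks * teeth))
    table = blocks * (1 << (teeth - 1)) * point_bytes
    if blocks * teeth * spacing == 256:  # no spare top digit: one extra entry
        table += point_bytes
    adds = blocks * spacing - 1          # one add per block per position
    doubles = spacing - 1                # one doubling per position after the first
    cmovs = blocks * spacing * (1 << (teeth - 1))  # full-table scan per lookup
    return table, adds, doubles, cmovs
```

This reproduces, e.g., teeth=5 blocks=1 -> (1024, 51, 51, 832) and teeth=8 blocks=2 -> (16448, 31, 15, 4096) from the lists above.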

@gmaxwell


gmaxwell commented Nov 14, 2019

Here is the efficient frontier I found running bench_sign on an AMD 7742 (using sha256+sha-ni as the nonce function).

Blocks  Teeth   Bytes      µs   Sigs/core/sec
     1      1      64     157            6369
     1      2     128    87.7           11403
     1      3     256    65.3           15314
     1      4     512    54.0           18519
     1      5    1024    48.6           20576
     3      4    1536    47.5           21053
     2      5    2048    44.1           22676
     3      5    3072    43.6           22936
     4      5    4096    41.7           23981
     3      6    6144    41.1           24331
     4      6    8192    39.9           25063
     9      6   18432    39.3           25445
    11      6   22528    38.7           25840
    22      6   45056    38.4           26042
    43      6   88064    38.1           26247
    37      7  151552    37.9           26385

Here are all the configurations visualized:

[image: plot of all configurations, signing time vs. table size]

Most points off the frontier aren't even close. The current 64 KiB table comes in at 43.7 µs on my test setup.

I think the logical options are {11,6} (22528 bytes), {2,5} (2048 bytes), and then one of the smallest three options. I would say {1,3} (256 bytes), but I assume that {1,1} (64 bytes) has the benefit that a considerable amount of code could also be omitted, which might understate its space advantage. Presumably only someone who was extremely memory-starved would consider using something smaller than {2,5}.

One possibility would be to just offer the 22kb choice and 2kb choice and leave it to the user to pick their best trade-off if they really want to go smaller, with a copy of these benchmark numbers in some integration documentation.

You also might want to consider how this changes with endomorphism, since after next year the library can probably drop the non-endomorphism configuration. Under the old algorithm endomorphism would halve the table size for only slightly less performance.

I might also want to run the figures for Schnorr signing, since the constant-time modular inversion of K is probably a big fraction of the total time in these figures and it doesn't exist there. The extra constant-time chunk diminishes the magnitude of the speedup. Edit: LOL, yes, it's a 1.4x speedup to avoid that inversion.

Maybe someone who wants to use this library to sign on a very memory-constrained device might want to comment? The code itself is some tens of kilobytes, so I don't know how much anyone cares about saving a hundred bytes in exchange for running at a tiny fraction of the speed. I wouldn't be surprised if the slower configurations were also somewhat easier to attack with differential power analysis.

@gmaxwell


gmaxwell commented Nov 14, 2019

How small does a table need to be before it's reasonable to include in the source explicitly? Not having to run gen_context at build time would be nice.

In #614 I considered 1024 bytes obviously OK.

@gmaxwell


gmaxwell commented Nov 14, 2019

Just for fun, I ran the same test using a more optimized signer that doesn't use any slow inversions.

The efficient frontier is slightly different, but I think my suggested options would be the same.

Blocks  Teeth   Bytes      µs   Sigs/core/sec
     1      1      64     141            7092
     1      2     128    73.3           13643
     1      3     256    50.7           19724
     1      4     512    39.3           25445
     1      5    1024    33.5           29851
     3      4    1536    32.3           30960
     2      5    2048    28.9           34602
     3      5    3072    28.4           35211
     4      5    4096    26.6           37594
     3      6    6144    26.4           37879
     4      6    8192    25.0           40000
     5      6   10240    24.9           40161
     9      6   18432    24.3           41152
    11      6   22528    23.7           42194
    22      6   45056    23.4           42735
    43      6   88064    22.5           44444

[image: plot of all configurations with the optimized signer, signing time vs. table size]

@gmaxwell


gmaxwell commented Nov 15, 2019

@douglasbakkum @jonasschnelli @ddustin @benma you've commented before about memory usage in signing.

@sipa


sipa commented Nov 18, 2019

@gmaxwell I'm not sure what the best strategy is:

  • Just leaving comb/teeth individually configurable by the build system
  • Providing a number of presets (small/medium/large)
  • Letting the configuration control the actual size (e.g. --ecmult-gen-table-size=N, which guarantees a table size of at most (1<<N) bytes).

It'd be very useful to hear from people on embedded systems how much they care about small table sizes.
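The third option could look roughly like this (a sketch only: the operation counts are reconstructed from the tables earlier in the thread, the 260/159 ns weights are the figures quoted above, the 1 ns cmov weight is my placeholder guess, and `--ecmult-gen-table-size` is just the example name from the list):

```python
import math

def best_for_budget(max_bytes, add_ns=260, dbl_ns=159, cmov_ns=1):
    """Pick the (blocks, teeth) with the lowest estimated cost whose
    precomputed table fits in max_bytes."""
    best = None
    for teeth in range(1, 9):
        for blocks in range(1, 129):
            spacing = math.ceil(256 / (blocks * teeth))
            table = blocks * (1 << (teeth - 1)) * 64
            if blocks * teeth * spacing == 256:   # extra entry, no spare digit
                table += 64
            if table > max_bytes:
                continue
            ns = ((blocks * spacing - 1) * add_ns + (spacing - 1) * dbl_ns
                  + blocks * spacing * (1 << (teeth - 1)) * cmov_ns)
            if best is None or ns < best[0]:
                best = (ns, blocks, teeth, table)
    return best  # (estimated ns, blocks, teeth, table bytes)
```

Under these weights a 2048-byte budget selects blocks=2, teeth=5, matching the {2,5} option discussed above.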

@gmaxwell


gmaxwell commented Nov 19, 2019

I dislike leaving comb/teeth configurable in the build system. It would increase the number of build configurations that the project should be testing by hundreds of times for no clear gain...

Plus if the project can't figure out how to set them correctly, it's doubtful that someone just using the build system is going to know what to do.

I wouldn't see any problem with having hidden options to override-- no promises made that those configurations were in any way tested.

Obviously I'm a fan of a small number of presets, but there needs to be feedback on which options people want. I'm hoping to hear something like "the size of the library itself means that no one cares too much about 2048 bytes vs 256 bytes vs 64 bytes" -- if so, then choosing a small option will be much easier.

Taking a size and solving for the best value that fits would be my second choice... but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn't be monotone with the size... so I don't think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one.

@ddustin


ddustin commented Nov 20, 2019

Excited that this is getting worked on!

Maybe someone who wants to use this library to sign on a very memory-constrained device might want to comment? The code itself is some tens of kilobytes, so I don't know how much anyone cares about saving a hundred bytes in exchange for running at a tiny fraction of the speed. I wouldn't be surprised if the slower configurations were also somewhat easier to attack with differential power analysis.

If the code itself is tens of kilobytes, then making the tables small may be a moot issue. What's making the code so big -- is there any low-hanging fruit for shrinking it down?

I've been using the mbed tls library: https://tls.mbed.org/api/group__encdec__module.html

The main perk there is that it uses probably under 1 kilobyte of RAM in total -- though it is super slow. It also doesn't seem to get the attention that secp256k1 gets, so perhaps some vulnerabilities could slip through the cracks.

I had a hell of a time figuring out that I needed to flip negative values to their positive variants in some places -- a task that libsecp256k1 mostly handles for you automatically. So perhaps there's an argument for making libsecp256k1 more usable for embedded work from an ease-of-use standpoint. I don't really know what's up with the project other than I guess ARM bought it? It actually looks fairly active: https://github.com/ARMmbed/mbedtls

Surprisingly in the last few years some chips have been adding hardware optimized curves. Though support, documentation, etc can be sparse to non-existent.

@ddustin


ddustin commented Nov 20, 2019

Taking a size and solving for the best value that fits would be my second choice... but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn't be monotone with the size... so I don't think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one.

While adjusting the size for speed with granularity would be nice, I don't think it's ever really going to be needed. Steps would be great though -- xlarge, large, mediumlarge, medium, mediumsmall, small, xtrasmall, tiny would be plenty.

Often you end up in this situation where you're just out of memory and you're hunting for things to make smaller. Having an option for that would be great -- but I can't imagine a scenario where it needs to be granular.

@gmaxwell


gmaxwell commented Nov 25, 2019

@ddustin The library isn't optimized for code size: Correctness, testability, performance, safe interfaces..., sure.

To some extent there are size/performance trade-offs:

Things like the constant-time inverse and the sqrt could be implemented using a small state machine instead of an entirely unrolled function. Instead of 5 different field element normalize functions for different combinations of strength and constant-timeness, we could have just one that was the strongest and constant time. Instead of implementing things that give big speedups like WNAF but take a bit of code ... we could ... just not do that. Rather than having an overcomplete field with carefully hand-optimized multiplies and squares, we could have just a plain bignum with schoolbook multiplies, and squaring accomplished using the multiply instruction...

On a desktop computer, all these decisions favouring speed at the expense of size are almost certainly good ones-- in fact it would probably be well worth it to increase the code size another 50% if it made it a few percent faster. Even on a mobile phone or similar, the same applies. On some tiny embedded device? Probably not-- it depends on the application. Of course, these devices are also slow-- so a tradeoff that made it 50% slower just to save a few hundred bytes of code probably wouldn't be a win.

Some of these tradeoffs, like the multi-fold normalizes, could easily be made ifdef-switchable between existing code paths and are also not that huge a performance cost. Other tradeoffs would be big performance hits or would require extensive rewrites. But it's hard to justify expending effort doing that, even where it's easy, without a lot of help or at least feedback from someone who really cares.

Otherwise, it's probably more interesting to spend time on additional validation, or on other optimizations that are useful on big desktop/server CPUs -- things like ADX or AVX2-FMA or GMP-less fast jacobi/inverse -- because there the objective is clearer: is it enough faster to justify the effort of testing and reviewing it? :) Or on implementing additional cryptographic functions, since those at least enable new and interesting applications.

Some things could probably be made a lot smaller without being any slower too (I think the aforementioned constant time sqrt and inverses are likely candidates), but just no one has tried.

I think, to the extent that this project is active at all, the contributors are interested in working with people to make size optimizations -- at least ones that don't come at a substantial cost to other applications. But without people who care about them helping to drive the tradeoffs, I wouldn't expect any to happen quickly.

@ddustin


ddustin commented Dec 2, 2019

@gmaxwell That all makes sense. Targeting super small chips may be foolish.

Medium-size chips might be interesting. I've been throwing around the idea of an embedded full node that is as cheap and low-power as possible. In that situation power usage is probably the main concern, though turning on flash cells also takes power.

As kind of an aside to all of this -- I wonder if having an easy way to plug in hardware-optimized curve backends would be worthwhile. Keep the safe interfaces while leveraging hardware optimization where it's available. Also, hardware-optimized curves are just, well, cool!
