Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signed-digit multi-comb for ecmult_gen (by peterdettman) #693

Closed
wants to merge 8 commits into from

Conversation

sipa
Copy link
Contributor

@sipa sipa commented Nov 11, 2019

This is a rebase of #546. See the original PR for a full description, but in short, this introduces a new constant-time multiplication algorithm with precomputation, with better speed/size tradeoffs. It is more flexible, allowing both better speeds with the same table size, or smaller table sizes for the same speed. It permits extrmely small tables with still reasonable speeds.

In addition to the original PR, this also:

  • Removes the old ecmult algorithm entirely
  • Makes the tunables configurable through configure, and tests a few combinations in Travis.

@sipa
Copy link
Contributor Author

@sipa sipa commented Nov 11, 2019

Note that I have not reviewed the multiplication code in detail yet.

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 13, 2019

@sipa do you think it would make more sense to benchmark all the blocks and teeth settings, and then find the efficient frontier for best performance as a function of memory, and then have the setting just offer a couple options along that frontier?

As this stands, by my count this creates 306 build configurations which should at least get minimal testing occasionally. Probably just a couple sizes are sufficient for exposed options. Usage probably comes in "embedded device, as small as possible please" "embedded device, I care about performance but small please" "desktop, meh whatever but don't blow out my cache too bad" "signing network service GIVE ME ALL YOUR MEMORY", flavour and not too many others.

@sipa
Copy link
Contributor Author

@sipa sipa commented Nov 13, 2019

@gmaxwell Yeah, I was thinking of doing something like that. I hoped to reduce the number of combinations by looking at the number of adds/doubles, and removing things that don't improve on either but have a larger table size. The problem is that the number of cmovs grows exponentially with the teeth settings, and at teeth=8 every point add is preceded by iterating over 8 kB of data to select the right entry, at which point the cmov cost is certainly nontrivial as well.

@sipa
Copy link
Contributor Author

@sipa sipa commented Nov 13, 2019

Anyway, ignoring the costs of scalar operations and cmovs, and counting a point add as 260 ns and a doubling as 159 ns, these are the optimal configurations:

blocks=1 teeth=2 tblsize=192 adds=127 doubles=127
blocks=1 teeth=3 tblsize=256 adds=85 doubles=85
blocks=2 teeth=3 tblsize=512 adds=85 doubles=42
blocks=1 teeth=4 tblsize=576 adds=63 doubles=63
blocks=1 teeth=5 tblsize=1024 adds=51 doubles=51
blocks=2 teeth=4 tblsize=1088 adds=63 doubles=31
blocks=3 teeth=4 tblsize=1536 adds=65 doubles=21
blocks=1 teeth=6 tblsize=2048 adds=42 doubles=42
blocks=2 teeth=5 tblsize=2048 adds=51 doubles=25
blocks=3 teeth=5 tblsize=3072 adds=53 doubles=17
blocks=1 teeth=7 tblsize=4096 adds=36 doubles=36
blocks=2 teeth=6 tblsize=4096 adds=43 doubles=21
blocks=3 teeth=6 tblsize=6144 adds=44 doubles=14
blocks=2 teeth=7 tblsize=8192 adds=37 doubles=18
blocks=3 teeth=7 tblsize=12288 adds=38 doubles=12
blocks=4 teeth=7 tblsize=16384 adds=39 doubles=9
blocks=2 teeth=8 tblsize=16448 adds=31 doubles=15
blocks=3 teeth=8 tblsize=24576 adds=32 doubles=10
blocks=4 teeth=8 tblsize=32832 adds=31 doubles=7
blocks=8 teeth=8 tblsize=65600 adds=31 doubles=3
blocks=16 teeth=8 tblsize=131136 adds=31 doubles=1
blocks=32 teeth=8 tblsize=262208 adds=31 doubles=0

@sipa
Copy link
Contributor Author

@sipa sipa commented Nov 13, 2019

Here is a more extensive list of all potentially-useful combinations, taking number of cmovs into account (and not making any assumptions about the relative costs of adds vs doubles vs cmovs):

teeth=1 blocks=1 tblsize=128 adds=255 doubles=255 cmovs=256
teeth=1 blocks=3 tblsize=192 adds=257 doubles=85 cmovs=258
teeth=1 blocks=5 tblsize=320 adds=259 doubles=51 cmovs=260
teeth=1 blocks=7 tblsize=448 adds=258 doubles=36 cmovs=259
teeth=1 blocks=9 tblsize=576 adds=260 doubles=28 cmovs=261
teeth=1 blocks=11 tblsize=704 adds=263 doubles=23 cmovs=264
teeth=1 blocks=13 tblsize=832 adds=259 doubles=19 cmovs=260
teeth=1 blocks=15 tblsize=960 adds=269 doubles=17 cmovs=270
teeth=1 blocks=19 tblsize=1216 adds=265 doubles=13 cmovs=266
teeth=1 blocks=29 tblsize=1856 adds=260 doubles=8 cmovs=261
teeth=1 blocks=37 tblsize=2368 adds=258 doubles=6 cmovs=259
teeth=1 blocks=43 tblsize=2752 adds=257 doubles=5 cmovs=258

teeth=2 blocks=1 tblsize=192 adds=127 doubles=127 cmovs=256
teeth=2 blocks=2 tblsize=320 adds=127 doubles=63 cmovs=256
teeth=2 blocks=3 tblsize=384 adds=128 doubles=42 cmovs=258
teeth=2 blocks=4 tblsize=576 adds=127 doubles=31 cmovs=256
teeth=2 blocks=5 tblsize=640 adds=129 doubles=25 cmovs=260
teeth=2 blocks=6 tblsize=768 adds=131 doubles=21 cmovs=264
teeth=2 blocks=7 tblsize=896 adds=132 doubles=18 cmovs=266
teeth=2 blocks=8 tblsize=1088 adds=127 doubles=15 cmovs=256
teeth=2 blocks=9 tblsize=1152 adds=134 doubles=14 cmovs=270
teeth=2 blocks=10 tblsize=1280 adds=129 doubles=12 cmovs=260
teeth=2 blocks=11 tblsize=1408 adds=131 doubles=11 cmovs=264
teeth=2 blocks=12 tblsize=1536 adds=131 doubles=10 cmovs=264
teeth=2 blocks=13 tblsize=1664 adds=129 doubles=9 cmovs=260
teeth=2 blocks=15 tblsize=1920 adds=134 doubles=8 cmovs=270
teeth=2 blocks=16 tblsize=2112 adds=127 doubles=7 cmovs=256
teeth=2 blocks=19 tblsize=2432 adds=132 doubles=6 cmovs=266
teeth=2 blocks=22 tblsize=2816 adds=131 doubles=5 cmovs=264
teeth=2 blocks=26 tblsize=3328 adds=129 doubles=4 cmovs=260
teeth=2 blocks=32 tblsize=4160 adds=127 doubles=3 cmovs=256
teeth=2 blocks=43 tblsize=5504 adds=128 doubles=2 cmovs=258
teeth=2 blocks=64 tblsize=8256 adds=127 doubles=1 cmovs=256
teeth=2 blocks=128 tblsize=16448 adds=127 doubles=0 cmovs=256

teeth=3 blocks=1 tblsize=256 adds=85 doubles=85 cmovs=344
teeth=3 blocks=2 tblsize=512 adds=85 doubles=42 cmovs=344
teeth=3 blocks=3 tblsize=768 adds=86 doubles=28 cmovs=348
teeth=3 blocks=4 tblsize=1024 adds=87 doubles=21 cmovs=352
teeth=3 blocks=5 tblsize=1280 adds=89 doubles=17 cmovs=360
teeth=3 blocks=6 tblsize=1536 adds=89 doubles=14 cmovs=360
teeth=3 blocks=7 tblsize=1792 adds=90 doubles=12 cmovs=364
teeth=3 blocks=8 tblsize=2048 adds=87 doubles=10 cmovs=352
teeth=3 blocks=9 tblsize=2304 adds=89 doubles=9 cmovs=360
teeth=3 blocks=10 tblsize=2560 adds=89 doubles=8 cmovs=360
teeth=3 blocks=11 tblsize=2816 adds=87 doubles=7 cmovs=352
teeth=3 blocks=13 tblsize=3328 adds=90 doubles=6 cmovs=364
teeth=3 blocks=15 tblsize=3840 adds=89 doubles=5 cmovs=360
teeth=3 blocks=18 tblsize=4608 adds=89 doubles=4 cmovs=360
teeth=3 blocks=22 tblsize=5632 adds=87 doubles=3 cmovs=352
teeth=3 blocks=29 tblsize=7424 adds=86 doubles=2 cmovs=348
teeth=3 blocks=43 tblsize=11008 adds=85 doubles=1 cmovs=344
teeth=3 blocks=86 tblsize=22016 adds=85 doubles=0 cmovs=344

teeth=4 blocks=1 tblsize=576 adds=63 doubles=63 cmovs=512
teeth=4 blocks=2 tblsize=1088 adds=63 doubles=31 cmovs=512
teeth=4 blocks=3 tblsize=1536 adds=65 doubles=21 cmovs=528
teeth=4 blocks=4 tblsize=2112 adds=63 doubles=15 cmovs=512
teeth=4 blocks=5 tblsize=2560 adds=64 doubles=12 cmovs=520
teeth=4 blocks=6 tblsize=3072 adds=65 doubles=10 cmovs=528
teeth=4 blocks=7 tblsize=3584 adds=69 doubles=9 cmovs=560
teeth=4 blocks=8 tblsize=4160 adds=63 doubles=7 cmovs=512
teeth=4 blocks=10 tblsize=5120 adds=69 doubles=6 cmovs=560
teeth=4 blocks=11 tblsize=5632 adds=65 doubles=5 cmovs=528
teeth=4 blocks=13 tblsize=6656 adds=64 doubles=4 cmovs=520
teeth=4 blocks=16 tblsize=8256 adds=63 doubles=3 cmovs=512
teeth=4 blocks=22 tblsize=11264 adds=65 doubles=2 cmovs=528
teeth=4 blocks=32 tblsize=16448 adds=63 doubles=1 cmovs=512
teeth=4 blocks=64 tblsize=32832 adds=63 doubles=0 cmovs=512

teeth=5 blocks=1 tblsize=1024 adds=51 doubles=51 cmovs=832
teeth=5 blocks=2 tblsize=2048 adds=51 doubles=25 cmovs=832
teeth=5 blocks=3 tblsize=3072 adds=53 doubles=17 cmovs=864
teeth=5 blocks=4 tblsize=4096 adds=51 doubles=12 cmovs=832
teeth=5 blocks=5 tblsize=5120 adds=54 doubles=10 cmovs=880
teeth=5 blocks=6 tblsize=6144 adds=53 doubles=8 cmovs=864
teeth=5 blocks=7 tblsize=7168 adds=55 doubles=7 cmovs=896
teeth=5 blocks=8 tblsize=8192 adds=55 doubles=6 cmovs=896
teeth=5 blocks=9 tblsize=9216 adds=53 doubles=5 cmovs=864
teeth=5 blocks=11 tblsize=11264 adds=54 doubles=4 cmovs=880
teeth=5 blocks=13 tblsize=13312 adds=51 doubles=3 cmovs=832
teeth=5 blocks=18 tblsize=18432 adds=53 doubles=2 cmovs=864
teeth=5 blocks=26 tblsize=26624 adds=51 doubles=1 cmovs=832
teeth=5 blocks=52 tblsize=53248 adds=51 doubles=0 cmovs=832

teeth=6 blocks=1 tblsize=2048 adds=42 doubles=42 cmovs=1376
teeth=6 blocks=2 tblsize=4096 adds=43 doubles=21 cmovs=1408
teeth=6 blocks=3 tblsize=6144 adds=44 doubles=14 cmovs=1440
teeth=6 blocks=4 tblsize=8192 adds=43 doubles=10 cmovs=1408
teeth=6 blocks=5 tblsize=10240 adds=44 doubles=8 cmovs=1440
teeth=6 blocks=6 tblsize=12288 adds=47 doubles=7 cmovs=1536
teeth=6 blocks=8 tblsize=16384 adds=47 doubles=5 cmovs=1536
teeth=6 blocks=9 tblsize=18432 adds=44 doubles=4 cmovs=1440
teeth=6 blocks=11 tblsize=22528 adds=43 doubles=3 cmovs=1408
teeth=6 blocks=15 tblsize=30720 adds=44 doubles=2 cmovs=1440
teeth=6 blocks=22 tblsize=45056 adds=43 doubles=1 cmovs=1408
teeth=6 blocks=43 tblsize=88064 adds=42 doubles=0 cmovs=1376

teeth=7 blocks=1 tblsize=4096 adds=36 doubles=36 cmovs=2368
teeth=7 blocks=2 tblsize=8192 adds=37 doubles=18 cmovs=2432
teeth=7 blocks=3 tblsize=12288 adds=38 doubles=12 cmovs=2496
teeth=7 blocks=4 tblsize=16384 adds=39 doubles=9 cmovs=2560
teeth=7 blocks=5 tblsize=20480 adds=39 doubles=7 cmovs=2560
teeth=7 blocks=8 tblsize=32768 adds=39 doubles=4 cmovs=2560
teeth=7 blocks=10 tblsize=40960 adds=39 doubles=3 cmovs=2560
teeth=7 blocks=13 tblsize=53248 adds=38 doubles=2 cmovs=2496
teeth=7 blocks=19 tblsize=77824 adds=37 doubles=1 cmovs=2432
teeth=7 blocks=37 tblsize=151552 adds=36 doubles=0 cmovs=2368

teeth=8 blocks=1 tblsize=8256 adds=31 doubles=31 cmovs=4096
teeth=8 blocks=2 tblsize=16448 adds=31 doubles=15 cmovs=4096
teeth=8 blocks=3 tblsize=24576 adds=32 doubles=10 cmovs=4224
teeth=8 blocks=4 tblsize=32832 adds=31 doubles=7 cmovs=4096
teeth=8 blocks=5 tblsize=40960 adds=34 doubles=6 cmovs=4480
teeth=8 blocks=6 tblsize=49152 adds=35 doubles=5 cmovs=4608
teeth=8 blocks=7 tblsize=57344 adds=34 doubles=4 cmovs=4480
teeth=8 blocks=8 tblsize=65600 adds=31 doubles=3 cmovs=4096
teeth=8 blocks=11 tblsize=90112 adds=32 doubles=2 cmovs=4224
teeth=8 blocks=16 tblsize=131136 adds=31 doubles=1 cmovs=4096
teeth=8 blocks=32 tblsize=262208 adds=31 doubles=0 cmovs=4096

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 14, 2019

Here is the efficient frontier I found running bench_sign on an AMD 7742 (using sha256+sha-ni as the nonce function).

Blocks Teeth bytes µs Sigs/core/sec
1 1 64 157 6369
1 2 128 87.7 11403
1 3 256 65.3 15314
1 4 512 54.0 18519
1 5 1024 48.6 20576
3 4 1536 47.5 21053
2 5 2048 44.1 22676
3 5 3072 43.6 22936
4 5 4096 41.7 23981
3 6 6144 41.1 24331
4 6 8192 39.9 25063
9 6 18432 39.3 25445
11 6 22528 38.7 25840
22 6 45056 38.4 26042
43 6 88064 38.1 26247
37 7 151552 37.9 26385

Here are all the configurations visualized:

image

Most points off the frontier aren't even close. The current 64kb table is 43.7µs on my test setup.

I think the logic options are {11,6} (22528b), {2,5} (2048b) and then one of the smallest three options. I would say {1,3} (256b) but I assume that {1,1} (64b) has the benefit that a considerable amount of code could also be omitted which might understate its space advantage. Presumably only someone who was extremely memory starved would consider using something smaller than {2,5}.

One possibility would be to just offer the 22kb choice and 2kb choice and leave it to the user to pick their best trade-off if they really want to go smaller, with a copy of these benchmark numbers in some integration documentation.

You also might want to consider how this changes with endomorphism, since after next year the library can probably drop the non-endomorphism configuration. Under the old algorithm endomorphism would halve the table size for only slightly less performance.

I might also want to run the figures for schnorr signing, since the constant time modular inversion of K is probably a big fraction of the total time in these figures and it doesn't exist there. The extra constant time chunk diminishes the magnitude of the speedup. Edit: LOL, yes, it's a 1.4x speedup to avoid that inversion.

Maybe someone who wants to use this library to sign on a very memory constrained device might want to comment? The code itself is some tens of kilobytes, so I don't know how much anyone cares about saving a hundred bytes in exchange for running a tiny fraction of the speed. I wouldn't be surprised that if the slower configurations were also somewhat easier to attack with differential power analysis too.

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 14, 2019

How small does a table need to be before it's reasonable to include in the source explicitly? Not having to run gen_context at build time would be nice.

In #614 I considered 1024 bytes obviously OK.

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 14, 2019

Just for fun, I ran the same test using a more optimized signer that doesn't use any slow inversions.

The efficient frontier is slightly different, but I think my suggested options would be the same.

Blocks Teeth bytes µs Sigs/core/sec
1 1 64 141 7092
1 2 128 73.3 13643
1 3 256 50.7 19724
1 4 512 39.3 25445
1 5 1024 33.5 29851
3 4 1536 32.3 30960
2 5 2048 28.9 34602
3 5 3072 28.4 35211
4 5 4096 26.6 37594
3 6 6144 26.4 37879
4 6 8192 25.0 40000
5 6 10240 24.9 40161
9 6 18432 24.3 41152
11 6 22528 23.7 42194
22 6 45056 23.4 42735
43 6 88064 22.5 44444

image

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 15, 2019

@douglasbakkum @jonasschnelli @ddustin @benma you've commented before about memory usage in signing.

@sipa
Copy link
Contributor Author

@sipa sipa commented Nov 18, 2019

@gmaxwell I'm not sure what the best strategy is:

  • Just leaving comb/teeth individually configurable by the build system
  • Providing a number of presets (small/medium/large)
  • Letting the configuration control the actual size (e.g. --ecmult-gen-table-size=N, which guarantees a table size of at most (1<<N) bytes).

It'd be very useful to hear from people on embedded systems how much they care about small table sizes.

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 19, 2019

I dislike leaving comb/teeth configurable in the build system. It would increase the number of build configurations that the project should be testing by hundreds of times for no clear gain...

Plus if the project can't figure out how to set them correctly, it's doubtful that someone just using the build system is going to know what to do.

I wouldn't see any problem with having hidden options to override-- no promises made that those configurations were in any way tested.

Obviously I'm a fan of small number of presets but there needs to be feedback on which options people want. I'm hoping to hear something like the size of the library itself means that no one cares too much about 2048 bytes vs 256 bytes vs 64 bytes-- if so then choosing a small option will be much easier.

Taking a size and solving for the best value that fits would be my second choice... but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn't be monotone with the size... so I don't think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one.

@ddustin
Copy link

@ddustin ddustin commented Nov 20, 2019

Excited that this is getting worked on!

Maybe someone who wants to use this library to sign on a very memory constrained device might want to comment? The code itself is some tens of kilobytes, so I don't know how much anyone cares about saving a hundred bytes in exchange for running a tiny fraction of the speed. I wouldn't be surprised that if the slower configurations were also somewhat easier to attack with differential power analysis too.

If the code itself is tens of kilobytes than making it small may be a moot issue. What's making the code so big -- is there any low hanging fruit for shrinking it down?

I've been using the mbed tls library: https://tls.mbed.org/api/group__encdec__module.html

The main perk there being it uses probably under 1 kilobyte of ram in total -- though it is super slow. It also doesn't seem to get the attention that secp256k1 gets, so perhaps some vulnerabilities could slip through the cracks.

I had a hell of a time figuring out that I needed to flip negative values to their positive variants in some places -- a task that libsecp256k1 mostly handles for you automatically. So perhaps there's an argument for libsecp256k1 getting more embed usable from an ease of use standpoint. I don't really know what's up with the project other than I guess ARM bought it? It actually looks fairly active: https://github.com/ARMmbed/mbedtls

Surprisingly in the last few years some chips have been adding hardware optimized curves. Though support, documentation, etc can be sparse to non-existent.

@ddustin
Copy link

@ddustin ddustin commented Nov 20, 2019

Taking a size and solving for the best value that fits would be my second choice... but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn't be monotone with the size... so I don't think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one.

While adjusting the size for speed with granularity would be nice I don't think it's every really going to be needed. Steps would be great though -- xlarge, large, mediumlarge, medium, mediumsmall, small, xtrasmall, tiny would be plenty.

Often you end up in this situation where you're just out of memory and you're hunting for things to make smaller. Having an option for that would be great -- but I can't imagine a scenario where it needs to be granular.

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Nov 25, 2019

@ddustin The library isn't optimized for code size: Correctness, testability, performance, safe interfaces..., sure.

To some extent there are size/performance trade-offs:

Things like the constant time inverse and the sqrt could be implemented using a small state machine, instead of an entirely unrolled function. Instead of 5 different field element normalize functions for different combinations of strength and constant-timeness we could have just one that was the strongest and constant time. Instead of implementing things that give big speedups like WNAF but take a bit of code ... we could ... just not do that. Rather than having an overcomplete field with carefully hand optimize multiplies and squares we could have just a plain bignum with schoolbook multiplies and squaring accomplished using the multiply instruction...

On a desktop computer, all these decisions favouring speed at the expense of size are almost certainly good ones-- in fact it would probably be well worth it to increase the code size another 50% if it made it a few percent faster. Even on a mobile phone or similar, the same applies. On some tiny embedded device? Probably not-- it depends on the application. Of course, these devices are also slow-- so a tradeoff that made it 50% slower just to save a few hundred bytes of code probably wouldn't be a win.

Some of these tradeoffs like the multi-fold normalizes could easily be made ifdef switchable between existing code and are also not that huge a performance cost. Other tradeoffs would be big performance hits or would require extensive rewrites. But it's hard to justify expending effort doing that, even where its easy, without a lot of help or at least feedback from someone that really cares.

Otherwise, it's probably more interesting spending time on additional validation, other optimizations that are useful on big desktop/server cpus-- things like ADX or AVX2-FMA or GMP-less fast jacobi/inverse, because there the objectives are more clear: Is it faster enough to justify the effort testing and reviewing it. :) or on implementing additional cryptographic functions, since those at least enable new and interesting applications.

Some things could probably be made a lot smaller without being any slower too (I think the aforementioned constant time sqrt and inverses are likely candidates), but just no one has tried.

I think to the extent that this project is active at all the contributors to it are interested in working with people to make size optimizations at least ones that don't come at a substantial cost to other applications. But without at least people who care about them helping to drive the tradeoffs, I wouldn't expect any to happen quickly.

@ddustin
Copy link

@ddustin ddustin commented Dec 2, 2019

@gmaxwell That all makes sense. Targeting super small chips may be foolish.

Medium size chips might be interesting. I've been throwing around the idea of a cheap and low power as possible embedded full node. In that situation power usage is probably the main concern, though turning on flash cells also takes power.

As kind of an aside to all of this -- I wonder if having an easy way to throw in backend hardware optimized curves would be worthwhile. Keep the safe interfaces while leveraging hardware optimization where it's available. Also hardware optimized curves are just, well, cool!

@real-or-random
Copy link
Contributor

@real-or-random real-or-random commented Feb 25, 2020

Needs rebase. I think we should push this forward, also because it is the cleanest way to solve #692, and any other work to solve #692 will be wasted if want to merge this here anyway.

@sipa sipa force-pushed the 201911_peterdettman_sdmc branch from 413713c to 1eb38a5 Compare Jul 29, 2020
@sipa
Copy link
Contributor Author

@sipa sipa commented Jul 29, 2020

Rebased. I've left the configurability for now - we can decide on a set of supported configurations later if all tests still pass.

I also noticed that COMB_NEGATION wasn't configurable before; added that.

@sipa
Copy link
Contributor Author

@sipa sipa commented Jul 29, 2020

@gmaxwell If you feel inclined, there are now 2x more configurations to choose from (with COMB_NEGATION=0, the table size is doubled but negations are avoided... I'm not sure that's ever a good tradeoff, but with benchmarks we could tell for sure).

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Jul 29, 2020

Sounds fun. I can't imagine disabling negation will ever help, but lets see. :)

@real-or-random
Copy link
Contributor

@real-or-random real-or-random commented Jul 29, 2020

Mentioning @peterdettman here, because I'm not sure he's aware that people are working on this.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Jul 29, 2020

Yep, was aware, but it's in capable hands. Regarding the optional COMB_NEGATION, I recall I was thinking more about power analysis than performance.

@sipa sipa force-pushed the 201911_peterdettman_sdmc branch from 1eb38a5 to 4a165e8 Compare Jul 30, 2020
@sipa
Copy link
Contributor Author

@sipa sipa commented Jul 30, 2020

Rebased.

@sipa sipa force-pushed the 201911_peterdettman_sdmc branch from 4a165e8 to 026cbfb Compare Jul 31, 2020
@sipa
Copy link
Contributor Author

@sipa sipa commented Jul 31, 2020

@peterdettman In what way is avoiding negations useful for power analysis? If there isn't a good reason to support it, I'd rather remove it from the codebase so there are fewer combinations to test and maintain.

@gmaxwell
Copy link
Contributor

@gmaxwell gmaxwell commented Jul 31, 2020

image

Neg Blocks Teeth bytes Sigs/core/sec
1 1 1 128 1192
1 1 2 192 2278
1 1 3 256 3257
1 2 3 512 3759
1 1 4 576 4149
1 1 5 1024 4762
1 3 4 1536 4878
1 2 5 2048 5464
1 3 5 3072 5587
1 2 6 4096 5988
1 3 6 6144 6135
1 4 6 8192 6410
1 9 6 18432 6494
1 11 6 22528 6667
1 22 6 45056 6711
1 19 7 77824 6803
1 43 6 88064 6944
1 37 7 151552 6993
0 1 1 192 1183
0 1 2 320 2268
0 1 3 512 3215
0 2 3 1024 3690
0 1 4 1088 4000
0 1 5 2048 4587
0 2 4 2112 4608
0 3 4 3072 4739
0 2 5 4096 5236
0 3 5 6144 5319
0 4 5 8192 5618
0 4 6 16384 5848
0 11 6 45056 5917
0 22 6 90112 6098
0 43 6 176128 6250

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Jul 31, 2020

@peterdettman In what way is avoiding negations useful for power analysis? If there isn't a good reason to support it, I'd rather remove it from the codebase so there are fewer combinations to test and maintain.

See e.g. One&Done: A Single-Decryption EM-Based Attack on OpenSSL’s Constant-Time Blinded RSA, by Alam et. al.

The code for calculating windows in this PR was written to somewhat obscure the bits during window assembly by keeping more bits involved at each step (but it's not perfect). OTOH, the sign bit is isolated and turned into an all 0s or all 1s value. The paper is about attacking RSA, so negation of lookup values doesn't arise, but I think it's reasonable to extrapolate that isolating the sign bit and using it as a mask should be visible in the same way.

It's possible to conceive of mitigations. e.g. the precomputed values could be possibly-pre-negated according to a random mask (as part of blinding), then the sign bit could be XORed with that mask before being isolated.

EDIT: Removed section mistakenly based on scalar blinding cost being related to precomp size; that's only true with #570).

@sipa sipa force-pushed the 201911_peterdettman_sdmc branch from 4935a46 to 937a8c2 Compare Mar 18, 2021
@sipa
Copy link
Contributor Author

@sipa sipa commented Mar 18, 2021

Rebased, but I still intend to write an explanation of the algorithm.

@real-or-random
Copy link
Contributor

@real-or-random real-or-random commented Aug 25, 2021

What's the status here? Concretely, I'm wondering whether it makes sense to work on a PR that sticks with the existing gen_context code but prebuilds the table and adds a to the repo, in the same way as in #956. Then we would get rid of the common cross-compilation issues entirely, we'd be in the position to reconsider some things about contexts (flags, etc.)

Pro: I could get this done pretty quickly. It's mostly changes in the build system and we have the boilerplate in #956 already.

Con: This would conflict a lot of with the changes in this PR.

@sipa
Copy link
Contributor Author

@sipa sipa commented Aug 25, 2021

I'm ok with moving forward with making the ecmult_gen context compile-time/precomputed only, and then rebasing this.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 26, 2021

I'm reworking this, and am considering a few changes. In particular:

  • The recoding operation needs to find (input+2^bits-1)/2. If we store a precomputed scalar offset = 2^bits-1-blind in the context (instead of blind directly), the computation becomes just recode = (input+offset)/2. This avoids the need for COMB_OFFSET (and the associated pre-offset initial point). It also avoids the need for the 288 limit on blocks*teeth*spacing.
  • With the recoding being simply an add+halving, it's perhaps easier to just implement it directly on a scalar with a new secp256k1_scalar_half, avoiding the need for a specialized 8*32bit representation for the recoded number.
  • We can now just initialize r to INFINITY, and instead add blind*G (also precomputed in the context) at the end. The first table point addition can become a fetch instead, saving a point addition - so in terms of point additions the result is identical.
  • With the point unblinding at the end, we leave the door open for #1051-like approaches, as it lets us reason about potential doublings/cancellations in intermediary additions.
  • With the first table lookup not being an addition, we can use an arbitrary corresponding jacobian coordinate for its fetching, giving a place for performing projective blinding.

@peterdettman Thoughts?

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 26, 2021

@sipa The recoding is into 9*32bit, not 8. This is also why it can't just be a scalar. It's also the reason for the 288 limit. It's true that only 1 bit is ever set in the highest limb of the recoding, but it simplified the core loop to just be able to read into that limb instead of treating it specially.

COMB_OFFSET is a specific optimization to let the comb cover exactly 256 bits since that has nice factors; there's still an implied extra bit for 257 total which is why we need the initial value in that case. If it can be merged with the blind somehow, that's all good.

I'm a bit unsure of the rest since it's been a while. Concepts around initial point and where adjustments are done sound fine.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 26, 2021

@peterdettman If there is a benefit to the 9*32 representation that could easily be kept, but I think it's only useful for COMB_GROUPED configurations really, and none of those appear to be among the best candidates.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 27, 2021

@sipa The comb is always larger than 256 bits (although only 1 higher bit is ever set), e.g. the current default config covers 260 bits. The inner loop of _ecmul_gen collects the "teeth" at each position:
bit = recoded[bit_pos >> 5] >> (bit_pos & 0x1F);

e.g. bit_pos is in the range [0, 260) for the default config. So having the 9*32 representation avoids dealing specially with the bit positions >= 256.

It may be possible to always use an offset point (secp256k1_ecmult_gen_context.offset) to account for the high bit (currently only done in the case of a comb config that covers exactly 256 bits) and just a scalar for the remaining bits. It would require dealing with those high comb positions somehow - probably just some const-time bit-twiddling can ensure we always get 0 when bit_pos >= 256:

    valid_mask = (bit_pos - 256) >> 31;
    valid_bit_pos = bit_pos & valid_mask;
    bit = secp256k1_scalar_get_bit(&recoded, valid_bit_pos) & valid_mask;

Depending on the comb configuration that is probably only really needed for the highest block and maybe only the highest tooth of that block.

Just to be clear: unless the offset point is merged with other initialization/finalization there is effectively an extra addition incurred which the comb usually does "for free". Also if the comb reads the high-bit as '0' it's subtracting 2^(bits-1) instead of adding, so the offset would then actually be 2^bits instead.

I think that COMB_GROUPED should be removed. It only makes possible sense for 4 or 8 teeth and the fast options are actually 5,6,7. Even at 4 or 8 (even 2), using spacing of 1 would be extravagant use of memory, since you could halve that usage at the cost of a single doubling per sig. Removing it will simplify dealing with the above "high-tooth" issue.

Once that is all resolved, then the ideas in #693 (comment) seem reasonable things to investigate.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 27, 2021

@peterdettman See https://github.com/sipa/secp256k1/commits/202112_sdmc

I think the design decisions fall into three questions:

  • Additive blinding of the scalar needs to be compensated by something.
    • The current PR does that using an initial value, ctx->initial.
    • Using an initial value has the downside of not being able to reason about the ranges of DLs of intermediary results. By changing to a ctx->final instead (of the would-be ctx->initial value, but divided by 2^(COMB_SPACING-1)) this becomes possible. This adds a point addition, but since it makes the initial value infinity, the very first table lookup can avoid an addition too, so this change doesn't affect the number of group operations.
  • We need to find the bits of a value equivalent to (input + 2^COMB_BITS - 1) / 2 (mod order).
    • The current PR achieves that by computing the bits of (input - 1)/2 + 2^(COMB_BITS-1), and if COMB_BITS ends up in a 32-bit limb by itself (the COMB_OFFSET optimization), instead moving its effect to ctx->initial, and using (input - 1)/2 instead.
    • Alternatively, it is possible to operate on the bits of (input + 2^COMB - 1) / 2 (mod order) directly. I find this slightly easier to explain as it avoids special-casing the COMB_BITS==256 situation, but it has also a few very weak advantages:
      • It enables an optimization where the final addition in a block, if it spans the 256th bit, can use a smaller range for the index cmov lookup (because the actual bits used never exceed 256).
      • It permits a blinding-free version of the code with one fewer point addition (not implemented, and I don't plan to), even when COMB_BITS==256, because it would result in initial/final both infinity.
  • There is the question of how to organize the computed bits:
    • The current PR uses the uint32_t recoded[9] approach, as it needs more than 256 bits to store the bits. Even when there are only 256 bits this approach could be used, as it avoids the need for dealing with reads beyond 256 bits, but if the reduced-cmov optimization above is used anyway, there is little benefit I think.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

  • It enables an optimization where the final addition in a block, if it spans the 256th bit, can use a smaller range for the index cmov lookup (because the actual bits used never exceed 256).

There's also a more formal way of restricting the comb, which is to not have all the blocks be uniform. (This options is mentioned in the original paper already, BTW: https://eprint.iacr.org/2012/309).

One general pattern for this might be distinguishing "lower blocks" and "upper blocks" (even though it may be a common case to have only one upper block).

If our current default scheme is 11B * 6T * 4S (==264), therefore because of the extra 8 bits, we're doing in a vague sense 1 useless addition (6 of those bits), and 1 excessive lookup (the other 2); there may be table entries that are unused in the top block (or let's say, at least, less-frequently used). So we can somehow save some combination of an addition, or memory, or lookup cmovs.

e.g.

  • Upper(1B * 4T * 1S), Lower(14B * 6T * 3S) covers 256 bits. 43 additions (1 saved), 3 doubles (1 saved), but uses (3 * 2^5 + 1 * 2^3 == 104) extra table entries vs. original (11 * 2^5 == 352).
  • Upper(1B * 4T * 4S), Lower(10B * 6T * 4S) covers 256 bits. 44 additions, 4 doubles, but saves (1 * 2^5 - 1 * 2^3 == 24) table entries.
  • Upper(1B * 4T * 4S), Lower(8B * 6T * 5S) covers 256 bits. Still 44 additions, 1 extra double, but saves (3 * 2^5 - 1 * 2^3 == 88) table entries.
  • Upper(1B * 4T * 1S), Lower(7B * 6T * 6S) covers 256 bits. 43 additions (1 saved), 6 doubles (2 extra), but saves (4 * 2^5 - 1 * 2^3 = 120) of 352 table entries.

There are also implied cmov lookup savings in the upper blocks of those options. The single upper block of 4 bits ones are maybe especially attractive since they are easily added with a single addition at the end of the comb and are likely to simplify things for non-uniform addition.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

Is there any hope that dedicated asm could speed up the cmov table lookups? After all we have a very nice data shape and just want to blit it all with some masking. Can explicit prefetch add anything beyond what the compiler and/or hardware are doing automatically (given that we are just scanning forward over the whole precomputed data in each pass). Is there any possibility that consolidating the lookups (one-per-block) before performing the additions can be faster?

@real-or-random
Copy link
Contributor

@real-or-random real-or-random commented Dec 28, 2021

No idea if it will help here but I sometimes considered an "arbitray-data" cmov that works on char arrays.

Advantages will include not only the possibility to cmov data of arbitrary size. Also, we'll need to get only one cmov function right and consolidate some improvements/issues (#777, #457). So we could spend our time more efficiently on getting that single function constant-time on all platforms (the compiler could still specialize it for some sizes but I still think it's a good idea). We could then write an x86 ASM version, which could be beneficial not only for performance but using native CMOVs will guarantee that the compiler can't make it variable time.

(In spirit it's similar to my unfinished #636 on data cleaning. IIRC, this implements all the _clear functions for different types via single one that works on arbitrary memory.)

@real-or-random
Copy link
Contributor

@real-or-random real-or-random commented Dec 28, 2021

  • Using an initial value has the downside of not being able to reason about the ranges of DLs of intermediary results. By changing to a ctx->final instead (of the would-be ctx->initial value, but divided by 2^(COMB_SPACING-1)) this becomes possible. This adds a point addition, but since it makes the initial value infinity, the very first table lookup can avoid an addition too, so this change doesn't affect the number of group operations.

Am I right that we could still use the rescaling blinding that we currently use (e.g., by applying rescaling the new first point instead of initial)?

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

Am I right that we could still use the rescaling blinding that we currently use (e.g., by applying rescaling the new first point instead of initial)?

Yes, it's already in the new PR: https://github.com/sipa/secp256k1/blob/c914fcc130a04f7df174aa14b9ba134fcf79efa3/src/ecmult_gen_impl.h#L187 .

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 28, 2021

@peterdettman Good to point out that non-uniform approaches are possible too, and if we're going in the direction of actually optimizing the fetch/add loop, it's probably better to just do nonuniform blocks (as they can avoid post-bit-256 issues entirely, and are probably much easier to emit efficient code for, as the loops are more predictable).

I think I'll try that. My preference would be to keep the spacing uniform for simplicity reasons (the outer loop can stay), and then have lower blocks with T teeth en upper blocks with T-1 teeth. The upper blocks can use the same table and fetch loop, but forego conditional negation (or halve the fetch loop but keep conditional negation). This doesn't save additions, but isn't too big of a change either, and my intuition is that the optimal performance will be had with solutions that don't have too much difference between upper and lower blocks.

EDIT: What I said here isn't very accurate; the upper blocks of course have separate tables to begin with. So in that case, I think the most reasonable approach to try is halving the upper block tables in size, and keep conditional negation for them.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

@sipa I think you may come to appreciate the virtues of just a single upper block of 4 (in the default case) bits as I mentioned previously.

I actually find Upper(1B * 4T * 1S), Lower(7B * 6T * 6S) quite attractive. The extra 2 doubles are (with latest formulae) cheaper than the saved addition, so it will actually be faster than the current one, using 2/3 the table memory (which will also make it faster), and it saves an entire lookup PLUS replaces another 6T lookup (32 entries) with a 4T lookup (8 entries), so let's say 1.75 lookups saved (faster again). All that and the awkward upper block can just be added at the end outside the loop (basically it joins the final iteration), so it doesn't complicate that. The lower blocks only cover 252 bits (i.e. an odd scalar in (-2^252, 2^252) and it seems likely that could all be non-uniform additions. It's even possible to add the final blind/offset value to all 8 of the upper block table entries in precomp and save another (uniform) addition (need context space for that though). We should check for a similar option in a small-memory config too I guess.

Since maximizing the number of teeth (in the face of the cmov costs) is where most of the speed comes from I realise it's intuitive to say lower@T then upper@(T-1), but I played around with the available configurations, and (trying to target exactly 256 bits) it's not all that important in the end and mostly the optimum is a single block at the top. If the spacing is less in the upper blocks, it just means join the loop after a few doubles, but yes that might end up being a separate loop. OTOH if it's a single block with a spacing of one it can just be a single addition after the loop.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 28, 2021

@peterdettman I'm starting to see the appeal of having a single upper block only, and giving it spacing 1 would be extra nice.

Thinking about how one would specify the parametrization for that?

  • Configuration input is B blocks and max T teeth.
  • If BT divides 256, just have (B,T,256/(BT)) as configuration.
  • Otherwise:
    • (B-1,T,floor(256/((B-1)T))) as lower configuration.
    • (1,256 mod ((B-1)T),1) as upper configuration.

We wouldn't want the number of teeth in the upper configuration to exceed T (because going beyond the user's wish for T has a large impact on table size and cmov cost). So that would mean a requirement that (256 mod ((B-1)T)) <= T. Do you think that could be overly restrictive?

Another possibility may be specifying configurations as (max adds, max table size) pairs, which probably more accurately corresponds to real-world costs (the entire table is read in cmovs, and most of the cost beyond that is adds (and the bit fetching for it)). It's not obvious to me how to translate these to actual table configurations though.

To be clear, I do expect we'll have just 2 or 3 precomputed configurations to choose from for production, which could easily be curated, but it'd be nice to be able to test the code by hand for a wide variety of configurations easily.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 28, 2021

Here are a list of configurations (blocks=number of "full" blocks excluding the upper one if any, fteeth=teeth for final block which has spacing one, tblsize=in bytes, groupops=adds+doubles, bitops=number of get_bits or equivalent operations to construct the table entries). Excluded any configuration that is not better in at least one of the metrics:

blocks=27 teeth=9: spacing=1 fteeth=13 tblsize=704512 groupops=28 bitops=28
blocks=28 teeth=9: spacing=1 fteeth=4 tblsize=459264 groupops=29 bitops=29
blocks=14 teeth=9: spacing=2 fteeth=4 tblsize=229888 groupops=30 bitops=57
blocks=7 teeth=9: spacing=4 fteeth=4 tblsize=115200 groupops=32 bitops=113
blocks=31 teeth=8: spacing=1 fteeth=8 tblsize=262144 groupops=32 bitops=32
blocks=16 teeth=8: spacing=2 fteeth=0 tblsize=131072 groupops=33 bitops=64
blocks=8 teeth=8: spacing=4 fteeth=0 tblsize=65536 groupops=35 bitops=128
blocks=35 teeth=7: spacing=1 fteeth=11 tblsize=208896 groupops=36 bitops=36
blocks=36 teeth=7: spacing=1 fteeth=4 tblsize=147968 groupops=37 bitops=37
blocks=18 teeth=7: spacing=2 fteeth=4 tblsize=74240 groupops=38 bitops=73
blocks=4 teeth=8: spacing=8 fteeth=0 tblsize=32768 groupops=39 bitops=256
blocks=12 teeth=7: spacing=3 fteeth=4 tblsize=49664 groupops=39 bitops=109
blocks=9 teeth=7: spacing=4 fteeth=4 tblsize=37376 groupops=40 bitops=145
blocks=6 teeth=7: spacing=6 fteeth=4 tblsize=25088 groupops=42 bitops=217
blocks=41 teeth=6: spacing=1 fteeth=10 tblsize=116736 groupops=42 bitops=42
blocks=42 teeth=6: spacing=1 fteeth=4 tblsize=86528 groupops=43 bitops=43
blocks=21 teeth=6: spacing=2 fteeth=4 tblsize=43520 groupops=44 bitops=85
blocks=4 teeth=7: spacing=9 fteeth=4 tblsize=16896 groupops=45 bitops=325
blocks=14 teeth=6: spacing=3 fteeth=4 tblsize=29184 groupops=45 bitops=127
blocks=2 teeth=8: spacing=16 fteeth=0 tblsize=16384 groupops=47 bitops=512
blocks=3 teeth=7: spacing=12 fteeth=4 tblsize=12800 groupops=48 bitops=433
blocks=7 teeth=6: spacing=6 fteeth=4 tblsize=14848 groupops=48 bitops=253
blocks=6 teeth=6: spacing=7 fteeth=4 tblsize=12800 groupops=49 bitops=295
blocks=50 teeth=5: spacing=1 fteeth=6 tblsize=53248 groupops=51 bitops=51
blocks=25 teeth=5: spacing=2 fteeth=6 tblsize=27648 groupops=52 bitops=101
blocks=51 teeth=5: spacing=1 fteeth=1 tblsize=52288 groupops=52 bitops=52
blocks=2 teeth=7: spacing=18 fteeth=4 tblsize=8704 groupops=54 bitops=649
blocks=17 teeth=5: spacing=3 fteeth=1 tblsize=17472 groupops=54 bitops=154
blocks=10 teeth=5: spacing=5 fteeth=6 tblsize=12288 groupops=55 bitops=251
blocks=3 teeth=6: spacing=14 fteeth=4 tblsize=6656 groupops=56 bitops=589
blocks=5 teeth=5: spacing=10 fteeth=6 tblsize=7168 groupops=60 bitops=501
blocks=2 teeth=6: spacing=21 fteeth=4 tblsize=4608 groupops=63 bitops=883
blocks=62 teeth=4: spacing=1 fteeth=8 tblsize=39936 groupops=63 bitops=63
blocks=31 teeth=4: spacing=2 fteeth=8 tblsize=24064 groupops=64 bitops=125
blocks=63 teeth=4: spacing=1 fteeth=4 tblsize=32768 groupops=64 bitops=64
blocks=32 teeth=4: spacing=2 fteeth=0 tblsize=16384 groupops=65 bitops=128
blocks=21 teeth=4: spacing=3 fteeth=4 tblsize=11264 groupops=66 bitops=190
blocks=16 teeth=4: spacing=4 fteeth=0 tblsize=8192 groupops=67 bitops=256
blocks=3 teeth=5: spacing=17 fteeth=1 tblsize=3136 groupops=68 bitops=868
blocks=9 teeth=4: spacing=7 fteeth=4 tblsize=5120 groupops=70 bitops=442
blocks=8 teeth=4: spacing=8 fteeth=0 tblsize=4096 groupops=71 bitops=512
blocks=4 teeth=4: spacing=16 fteeth=0 tblsize=2048 groupops=79 bitops=1024
blocks=83 teeth=3: spacing=1 fteeth=7 tblsize=25344 groupops=84 bitops=84
blocks=84 teeth=3: spacing=1 fteeth=4 tblsize=22016 groupops=85 bitops=85
blocks=42 teeth=3: spacing=2 fteeth=4 tblsize=11264 groupops=86 bitops=169
blocks=85 teeth=3: spacing=1 fteeth=1 tblsize=21824 groupops=86 bitops=86
blocks=28 teeth=3: spacing=3 fteeth=4 tblsize=7680 groupops=87 bitops=253
blocks=21 teeth=3: spacing=4 fteeth=4 tblsize=5888 groupops=88 bitops=337
blocks=14 teeth=3: spacing=6 fteeth=4 tblsize=4096 groupops=90 bitops=505
blocks=17 teeth=3: spacing=5 fteeth=1 tblsize=4416 groupops=90 bitops=426
blocks=12 teeth=3: spacing=7 fteeth=4 tblsize=3584 groupops=91 bitops=589
blocks=2 teeth=4: spacing=32 fteeth=0 tblsize=1024 groupops=95 bitops=2048
blocks=7 teeth=3: spacing=12 fteeth=4 tblsize=2304 groupops=96 bitops=1009
blocks=5 teeth=3: spacing=17 fteeth=1 tblsize=1344 groupops=102 bitops=1446
blocks=125 teeth=2: spacing=1 fteeth=6 tblsize=18048 groupops=126 bitops=126
blocks=1 teeth=4: spacing=64 fteeth=0 tblsize=512 groupops=127 bitops=4096
blocks=126 teeth=2: spacing=1 fteeth=4 tblsize=16640 groupops=127 bitops=127
blocks=21 teeth=2: spacing=6 fteeth=4 tblsize=3200 groupops=132 bitops=757
blocks=18 teeth=2: spacing=7 fteeth=4 tblsize=2816 groupops=133 bitops=883
blocks=1 teeth=3: spacing=85 fteeth=1 tblsize=320 groupops=170 bitops=7226
blocks=2 teeth=2: spacing=64 fteeth=0 tblsize=256 groupops=191 bitops=8192
blocks=1 teeth=2: spacing=128 fteeth=0 tblsize=128 groupops=255 bitops=16384
blocks=1 teeth=1: spacing=256 fteeth=0 tblsize=64 groupops=511 bitops=65536

If I also include conditional negations as a metric, 72 configurations remain. If I additionally split doubles/adds into separate metrics, 95.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

This may just be a rephrasing, but for some choice of B,T, use the highest S such that B*T*S <= 256, then require that the remainder (U) be <= T itself or the config is invalid. Then there are S doubles, B*S + (U > 0) additions, and (2^(T-1)*B + 2^(U-1)) table entries (1 entry is 64 bytes).

For 7T we need to hit 252 == 7*(2*2*3*3) (but is 7T ever reasonable?)
9B,7T => 4S,4U => 4 doubles, 37 additions, 584 entries
6B,7T => 6S,4U => 6 doubles, 37 additions, 392 entries

For 6T we need to hit 252 == 6*(2*3*7)
14B,6T => 3S,4U => 3 doubles, 43 additions, 456 entries
7B,6T => 6S,4U => 6 doubles, 43 additions, 232 entries => GOOD

For 5T we need to hit 255 == 5*(3*17) (restrictive!)
17B,5T => 3S,1U => 3 doubles, 52 additions, 273 entries (probably absurd)
3B,5T => 17S,1U => 17 doubles, 52 additions, 49 entries => NOT TERRIBLE
2B,5T => 25S,6U => INVALID, which is annoying as (I think) this was on the perf. frontier as (2B,5T,26S) => 26D, 52A, 32E

For 4T we can hit 256 == 4*(2^6) or 252 == 4*(3*3*7)
7B,4T => 9S,4U => 9 doubles, 64 additions, 56 entries (probably absurd)
4B,4T => 16S,0U => 16 doubles, 64 additions, 32 entries => OK, but probably inferior to the (invalid) 2B,5T
3B,4T => 21S,4U => 21 doubles, 64 additions, 24 entries
2B,4T => 32S,0U => 32 doubles, 64 additions, 16 entries
1B,4T => 64S,0U => 64 doubles, 64 additions, 8 entries

We're already down at 512 byte, so I doubt fewer teeth are relevant, but in case 4T is somehow too many:
5B,3T => 17S,1U => 17 doubles, 85 additions, 20 entries (the only choice at 3T due to factorization of 255)

So, yeah, overall it feels a little restrictive to me. This is basically why the original PR just let you aim for 256 plus a bit and we just accepted that the price of simplicity was probably going to be an extra addition. I'm not saying that's better or worse, it's just that the comb configurations aren't as flexible as we might like.

Edit: sorry, of course it's the number of doubles is 1 less than the spacing, so all the above are 1 too high.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

If I also include conditional negations as a metric, 72 configurations remain. If I additionally split doubles/adds into separate metrics, 95.

Just give the additions a relative weight of 2 vs the doubles (even 2.1).

Edit: Then again, what's the weight for a non-uniform addition?
Edit 2: Then again again, each addition comes with a relatively expensive cmov lookup too.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

Also there seems to be no restriction on fteeth, that will cut your numbers down.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 28, 2021

I realized that there is no need for restricting fteeth: its "badness" is captured perfectly by the increase in tablesize (both memory usage and cmovs are affected by it, but those scale 1:1 with table size).

I'll run more analysis to compare with the configurations supported by the current code, but my thinking now is that this design space is getting too big, and we should focus on getting some form of SDMC in first. I'll probably remove some optimizations from my new PR and perhaps reintroduce the uint32[9] array to simplify some edge cases.

@peterdettman
Copy link
Contributor

@peterdettman peterdettman commented Dec 28, 2021

Yes I think that it would be wise to rein it in and get a basic version in place.

@sipa
Copy link
Contributor Author

@sipa sipa commented Dec 29, 2021

See #1058 for the newest iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants