
WIP: "safegcd" field and scalar inversion #767

Closed
wants to merge 34 commits into from

Conversation

peterdettman
Contributor

@peterdettman peterdettman commented Jul 15, 2020

  • see "Fast constant-time gcd computation and modular inversion" by Daniel J. Bernstein and Bo-Yin Yang https://gcd.cr.yp.to

Implements constant-time field and scalar inversion using a Euclidean-style algorithm operating on the least-significant bits. It follows the curve25519 case study outlined in section 12 of the above paper, with some small tweaks. I do not have the actual curve25519 code to compare against, but this PR initially appeared to be 15-20% faster than their reported 10050 Haswell cycles for field inversion. (That initial measurement was in error; we ultimately reached 5-10% faster, without any asm.)

Performance comparison (gcc 10.1.0, -O3, haswell, endo=yes, asm=no, field=64bit scalar=64bit):

master:
scalar_inverse: min 12.4us
field_inverse: min 5.04us
ecdh: min 59.7us
ecdsa_sign: min 47.9us

safegcd_inv:
scalar_inverse: min 3.45us
field_inverse: min 3.23us
ecdh: min 56.2us
ecdsa_sign: min 37.7us

I've not done a 32bit version yet, although it is mostly analogous and should have an even greater relative advantage over Fermat. The field and scalar implementations have substantial common code that is duplicated here for simplicity.

@real-or-random
Contributor

I don't have time to look at this right now but this is very neat!

related to #730

@peterdettman
Contributor Author

I should add that for variable-time inversion we can just make a simple rewrite of _divsteps_62, and it's already faster than the existing one:

field_inverse: min 2.10us <-- (actually a variable-time implementation)
field_inverse_var: min 2.22us <-- (unmodified secp256k1_fe_inv_var)

_divsteps_62 is also an ideal target for asm with a bunch of conditional swaps and negates (although I'm not sure how smart gcc is already being about them).

@real-or-random
Contributor

I see a few TODOs in the code but nothing substantial. Can you comment on how much this is "WIP" (apart from the fact that 32-bit is missing)?

_divsteps_62 is also an ideal target for asm with a bunch of conditional swaps and negates (although I'm not sure how smart gcc is already being about them).

An interesting data point is #726 (comment) which shows that current versions of gcc -O3 outperform our ASM (but we switched to a default of -O2 a few months ago). So ASM is in general getting less and less interesting for us.

@peterdettman
Contributor Author

It's conceptually complete at the top level. So besides general review, the main TODO is a thorough bounds analysis (they are quite tight for performance) and in a couple of places I expect that could require an extra carry step or two. The code also uses arithmetic right shift as things stand (although it's not used in the main inner loop), which I guess will provoke discussion.

Of course there's the 32bit version, and some organisational work to share common code in a sensible way.

Apart from the code itself, we would probably want to get some outside opinions from mathematicians on the paper's proofs (meaning no offence to the authors), as it is less obvious than Fermat inversion. Indeed, in Appendix F.2 of the paper the authors show that an earlier analysis of a similar algorithm ("A binary recursive gcd algorithm", Stehlé and Zimmermann) was incorrect.

@mratsim

mratsim commented Jul 15, 2020

I wonder if you measured against Niels Moller's constant-time inversion in Nettle/Botan/GMP?

The Yang-Bernstein GCD seems to require significantly more operations compared to algorithm 5 of "Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine":


I've outlined my performance concerns in Botan issue randombit/botan#1479 (comment)

@peterdettman
Contributor Author

@mratsim I have not, but frankly I don't think it could improve on these results because it is performing full-length operations on the operands in each loop iteration. This has been the fundamental performance barrier for all of the binary GCD style algorithms that several contributors have attempted here and elsewhere over the years. So I am focused now on the 2-adic division, "bottom bits" algorithms.

This PR performs 62 iterations at a time on the "bottom bits" only (see the ..._divsteps_62 method(s)), and only updates the full-sized operands 12 times. It's true safegcd has more total iterations (744 vs 512), but they are much, much cheaper. safegcd also defers the full u/v calculation to the post-processing phase (about 25% of total time in this PR), where the matrices can be combined very efficiently.
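The batching can be illustrated with a toy, branching sketch (the names and the tiny step count here are illustrative, not the PR's branchless 62-step code): run n divsteps looking only at delta, f and g, while recording the combined 2x2 transition matrix [u v; q r], which afterwards satisfies u*f0 + v*g0 == f*2^n and q*f0 + r*g0 == g*2^n and can be applied to the full-size operands in one pass.

```c
#include <stdint.h>

/* Toy batched divsteps: n must be small enough that matrix entries
 * (bounded in magnitude by 2^n) fit in int64_t; the PR uses n = 62.
 * Returns the final delta and writes the matrix to t = {u, v, q, r}. */
static int divsteps_n(int n, int delta, int64_t f, int64_t g, int64_t t[4]) {
    int64_t u = 1, v = 0, q = 0, r = 1, tmp;
    int i;
    for (i = 0; i < n; i++) {
        if (delta > 0 && (g & 1)) {
            /* Swap case: (delta, f, g) -> (1 - delta, g, (g - f)/2). */
            delta = 1 - delta;
            tmp = f; f = g; g = (g - tmp) / 2;   /* g - f even: both odd */
            tmp = u; u = 2 * q; q = q - tmp;
            tmp = v; v = 2 * r; r = r - tmp;
        } else if (g & 1) {
            delta = 1 + delta;
            g = (g + f) / 2;                     /* g + f even: both odd */
            q = q + u; r = r + v;
            u = 2 * u; v = 2 * v;
        } else {
            delta = 1 + delta;
            g = g / 2;                           /* g even: just halve */
            u = 2 * u; v = 2 * v;
        }
    }
    t[0] = u; t[1] = v; t[2] = q; t[3] = r;
    return delta;
}
```

The scaling by 2^n per batch is what the final correction multiply (undoing the 744 powers of 2) compensates for.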

At 256 bits, I suppose it's possible that the NEON vector instructions dramatically change the accounting, and perhaps something similar is possible with x86_64 SIMD extensions; I would be very interested to see the results of either. Of course there are opportunities for SIMD to improve this PR also.

@gmaxwell
Contributor

gmaxwell commented Jul 20, 2020

VERY COOL. One comment on correctness is that inversions are easy and fairly cheap to test at runtime, so at least flaws can be turned into clean crashes instead of incorrect results-- for whatever consolation that is. :)

@gmaxwell
Contributor

gmaxwell commented Jul 20, 2020

One could even implement code that multiplies to check the inverse, and if it's wrong, redoes the operation using the ladder.

This would be a fringe timing channel if it were triggerable (but it's easy to test to be confident that it's rare, and to randomize it if there are untrusted inputs to a constant-time version), so I think that wouldn't be a problem.

This would essentially guarantee algorithmic correctness at little cost, assuming the software just didn't have an ordinary programming flaw.

So I think fear of correctness and a shortage of computational number theorists need not stand in the way of this work.
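The multiply-and-check idea can be sketched with toy modular arithmetic (the prime and all names here are illustrative, not the library's API):

```c
#include <stdint.h>

#define P 1000003ULL  /* small illustrative prime, not the secp256k1 field */

static uint64_t mulmod(uint64_t a, uint64_t b) { return (a * b) % P; }

/* Known-correct fallback: Fermat inversion x^(P-2) by square-and-multiply. */
static uint64_t fermat_inv(uint64_t x) {
    uint64_t r = 1, base = x % P, e = P - 2;
    while (e) {
        if (e & 1) r = mulmod(r, base);
        base = mulmod(base, base);
        e >>= 1;
    }
    return r;
}

/* Run the fast inversion, verify x * inv(x) == 1, and redo the work
 * with the known-correct ladder on mismatch, as suggested above. */
static uint64_t checked_inv(uint64_t x, uint64_t (*fast_inv)(uint64_t)) {
    uint64_t r = fast_inv(x);
    if (mulmod(r, x) != 1) r = fermat_inv(x);
    return r;
}
```

The check costs one multiplication per inversion; the fallback path only runs if the fast algorithm ever produced a wrong result.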

@peterdettman
Contributor Author

There are now 32-bit versions for field and scalar inversion. They need bounds analysis, carry chains checked, etc., but these are the current timing comparisons with 32bit scalar and field configured (bearing in mind the test machine is x86_64):

master:
scalar_inverse: min 38.9us
field_inverse: min 7.36us

safegcd_inv:
scalar_inverse: min 4.35us
field_inverse: min 3.70us

r[4] = a4 >> 40;
}

static int secp256k1_fe_divsteps_62(uint16_t eta, uint64_t f0, uint64_t g0, int64_t *t) {
Contributor

Just one thing I noticed when skimming through it: The types of eta are pretty inconsistent. You use uint16_t inside the function, return an int, and then store it in an int16_t.

Contributor Author

Sure, it should be cleaned up. It's logically a signed value (~ in [-744,744]) in the sense that the algorithm as-written switches on the sign of it, but the actual constant-time code for _divsteps is in "bit-twiddling" style, and I wanted to make it clear that there's no reliance (in this method) on arithmetic right-shift. Conceptually maybe it's (u)int_fast16_t, but I was a bit lazy about dealing with the variable sizes before things worked. Possibly uint64_t (resp. uint32_t for 32bit) makes more sense.

z = (v ^ r) & c1;
v ^= z; r ^= z; r ^= c1; r -= c1;

eta = (eta ^ (uint16_t)c1) - (uint16_t)c1 - 1;
Contributor

On a related note, seeing integer types smaller than int makes me somewhat nervous. Most people looking at this line for example won't notice that all the arithmetic here is performed on (signed) int due to integer promotion rules. (I believe that's not an issue in this line, it's just a general remark.)

Contributor Author

Am I just missing a U i.e. 1U or else could you explain?

Contributor

The way integer operations (and this includes bitwise as well as unary operators, shifts and comparisons) is done in C can be pretty unexpected for types smaller than int: Before any other rules apply, "Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int" [1]. On our targets, int is the same as int32_t, so an uint16_t for sure fits in an int. The possibility that an unsigned integer is converted to a signed integer is fucked up because signed overflow is undefined behavior. And compilers have been reported to rely on this [2, 3].

For this line here, this means eta ^ (uint16_t)c1 is evaluated by first converting eta and (uint16_t)c1 to int and then performing the ^. Then for the -, the right operand (uint16_t)c1 is converted to int. In those cases, it's easy to see that it doesn't make a difference, at least for our targets.

In general, if you ignore theoretical cases like int being 17 bits wide, it seems (without proof) that the only interesting operations are multiplication, left shifts, right shifts, and comparisons: the former two because of possible overflows, comparisons because of issues with comparing signed and unsigned values, and right shifts because it matters whether the left operand is signed or unsigned.

The real fuckup here is that the stdint.h types were supposed to be used for portable arithmetic but portable arithmetic is not possible in C because all integer arithmetic, no matter what type, implicitly depends on the size of int by the above rule. Our code would probably blow up on a target where int is 64 bits. (Yes, this exists.)

The typical trick to work around this is to start the arithmetic with an unsigned int value, e.g., (1U * eta) ^ (uint16_t)c1. No need to point out that this is ugly.

[1] https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules
[2] https://stackoverflow.com/questions/24795651/whats-the-best-c-way-to-multiply-unsigned-integers-modularly-safely the link is for C++ but the same rules apply in this case
[3] https://www.cryptologie.net/article/418/integer-promotion-in-c/
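To make the workaround concrete, here is a minimal standalone sketch (a hypothetical helper, not PR code) of the eta update under discussion with the 1U trick applied; c1 is a mask that is either 0 or 0xFFFF:

```c
#include <stdint.h>

/* Multiplying by 1U first forces the arithmetic into unsigned int, so
 * no intermediate signed overflow is possible regardless of promotion.
 * With c1 all-ones this computes -eta - 1 (mod 2^16); with c1 zero it
 * computes eta - 1. */
static uint16_t eta_update(uint16_t eta, uint16_t c1) {
    return (uint16_t)(((1U * eta) ^ c1) - c1 - 1U);
}
```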

@gmaxwell
Contributor

gmaxwell commented Jul 22, 2020

iMX.6 (arm7)

Before:
scalar_inverse: min 338us / avg 338us / max 342us
scalar_inverse_var: min 13.7us / avg 13.8us / max 13.9us
field_inverse: min 209us / avg 210us / max 210us
field_inverse_var: min 13.6us / avg 13.6us / max 13.7us

After:
scalar_inverse: min 33.7us / avg 33.8us / max 34.3us
scalar_inverse_var: min 13.6us / avg 13.6us / max 13.7us
field_inverse: min 31.5us / avg 31.5us / max 31.6us
field_inverse_var: min 13.5us / avg 13.5us / max 13.6us

I suppose I shouldn't let the fact that it only matched GMP for the variable time here distract me from the fact that it's 10x faster for the constant time scalar inverse. This will be a big speedup for GMP-less ECDSA validation and general signing on arm.

Edit: ah, I guess it's not actually using a variable-time version of this algo for the variable-time ones? Even using the constant-time one would still be a big speedup for the non-gmp builds. It would sure be nice to get rid of optional linking against gmp.

@gmaxwell
Contributor

Gmp-less comparison.

iMX.6 (arm7) no-asm
Before: ecdsa_verify: min 2810us / avg 2815us / max 2832us
After: ecdsa_verify: min 2504us / avg 2506us / max 2510us

iMX.6 (arm7) with-asm
Before: ecdsa_verify: min 1383us / avg 1387us / max 1396us
After: ecdsa_verify: min 1074us / avg 1075us / max 1076us

Obviously this speedup would be greater w/ a variable time version of the new code in use.

@peterdettman
Contributor Author

@gmaxwell Agreed that random blinding can be useful in the ways you describe. Still it starts to lean towards your point elsewhere that one could just use the variable-time alg. with blinding anyway.

Yes, there's no variable-time in the PR yet. I gave an example timing above from minor changes to _divsteps_62. I did experiment a bit further with the 64bit version and got var-time field inverse down to around ~1.75us, vs. ~3.25us for const-time and ~2.2us for the gmp one. That's already enough to dispense with gmp for 64bit, but I'm sure it can be improved further.

For the arm7 numbers (thanks - I should really try and get a hardware setup for this), I am actually a bit surprised at how slow the "Before" field_inverse is (especially relative to gmp for inv_var). I guess this has to be down to fe_mul, fe_sqr being slow (slow mul instruction?), which partially affects the timings for this PR. As things stand, beating 32bit gmp on inv_var would appear to require about 2.5x speedup over this PR.

Could you please satisfy my curiosity and give the full output of bench_internal on that arm7? I also have some possible variations of _fe_mul that I could PR which it would be very helpful to get measured on real devices.

@JeanLucPons

Hello,
VanitySearch uses a very similar implementation of modular inversion (called DRS62, delayed right shift 62 bits). It operates on the 62 least-significant bits of the integer and maintains a 2x2 matrix in the inner XCD loop. It is a variable-time implementation.

https://github.com/JeanLucPons/VanitySearch/blob/3ba22e9072db540a445d37471e7969bff92fa504/IntMod.cpp#L380
It is between 2 and 3 times faster than Fermat/Euler method on Secp256K1 prime.

@gmaxwell
Contributor

iMX.6 with ASM:
scalar_add: min 0.157us / avg 0.158us / max 0.160us
scalar_negate: min 0.0697us / avg 0.0699us / max 0.0705us
scalar_sqr: min 1.10us / avg 1.13us / max 1.31us
scalar_mul: min 1.07us / avg 1.10us / max 1.30us
scalar_inverse: min 33.9us / avg 34.1us / max 34.3us
scalar_inverse_var: min 33.9us / avg 33.9us / max 34.0us
field_normalize: min 0.0764us / avg 0.0771us / max 0.0777us
field_normalize_weak: min 0.0327us / avg 0.0329us / max 0.0332us
field_sqr: min 0.291us / avg 0.291us / max 0.292us
field_mul: min 0.361us / avg 0.366us / max 0.368us
field_inverse: min 27.6us / avg 27.7us / max 27.9us
field_inverse_var: min 27.6us / avg 27.8us / max 28.4us
field_sqrt: min 78.9us / avg 79.2us / max 80.2us
group_double_var: min 2.44us / avg 2.45us / max 2.49us
group_add_var: min 5.70us / avg 5.73us / max 5.82us
group_add_affine: min 4.90us / avg 4.91us / max 4.93us
group_add_affine_var: min 4.08us / avg 4.08us / max 4.10us
group_jacobi_var: min 79.3us / avg 79.9us / max 83.8us
wnaf_const: min 2.44us / avg 2.45us / max 2.48us
ecmult_wnaf: min 2.60us / avg 2.62us / max 2.63us
hash_sha256: min 1.56us / avg 1.56us / max 1.57us
hash_hmac_sha256: min 6.35us / avg 6.36us / max 6.40us
hash_rfc6979_hmac_sha256: min 34.9us / avg 35.2us / max 35.5us
context_verify: min 54297us / avg 54598us / max 55353us
context_sign: min 501us / avg 503us / max 510us

iMX.6 with no-asm:
scalar_add: min 0.157us / avg 0.157us / max 0.158us
scalar_negate: min 0.0697us / avg 0.0703us / max 0.0720us
scalar_sqr: min 1.10us / avg 1.10us / max 1.10us
scalar_mul: min 1.05us / avg 1.05us / max 1.05us
scalar_inverse: min 33.7us / avg 33.7us / max 33.7us
scalar_inverse_var: min 33.7us / avg 34.1us / max 34.4us
field_normalize: min 0.0771us / avg 0.0774us / max 0.0778us
field_normalize_weak: min 0.0326us / avg 0.0327us / max 0.0329us
field_sqr: min 0.817us / avg 0.818us / max 0.819us
field_mul: min 0.939us / avg 0.942us / max 0.962us
field_inverse: min 31.5us / avg 31.6us / max 32.2us
field_inverse_var: min 31.5us / avg 31.7us / max 32.5us
field_sqrt: min 207us / avg 208us / max 211us
group_double_var: min 5.97us / avg 5.98us / max 6.02us
group_add_var: min 14.5us / avg 14.5us / max 14.6us
group_add_affine: min 11.1us / avg 11.2us / max 11.4us
group_add_affine_var: min 10.0us / avg 10.0us / max 10.1us
group_jacobi_var: min 208us / avg 209us / max 210us
wnaf_const: min 2.44us / avg 2.44us / max 2.44us
ecmult_wnaf: min 2.60us / avg 2.61us / max 2.65us
hash_sha256: min 1.56us / avg 1.59us / max 1.60us
hash_hmac_sha256: min 6.38us / avg 6.40us / max 6.44us
hash_rfc6979_hmac_sha256: min 35.0us / avg 35.4us / max 37.2us
context_verify: min 124901us / avg 125307us / max 125731us
context_sign: min 904us / avg 914us / max 958us

@peterdettman
Contributor Author

Hello,
VanitySearch uses a very similar implementation of modular inversion (called DRS62, delayed right shift 62 bits). It operates on the 62 least-significant bits of the integer and maintains a 2x2 matrix in the inner XCD loop. It is a variable-time implementation.

https://github.com/JeanLucPons/VanitySearch/blob/3ba22e9072db540a445d37471e7969bff92fa504/IntMod.cpp#L380
It is between 2 and 3 times faster than Fermat/Euler method on Secp256K1 prime.

Yes it's quite similar, but the "divstep" used here should ultimately be faster than the approach there. Section 8 of the safegcd paper discusses some other variants and why they're slower (at least for constant-time case, but probably in general). I'd suggest you check particularly 8.4 which shows that you have to be very careful about the "divstep" you use, since merely clearing 0s and combining to get more 0s is not sufficient to guarantee termination.

The comments suggest this is binary xGCD, but I would usually take that to include a size comparison between u and v on each iteration, which that code isn't doing. So I guess it's actually a variant of Stehlé and Zimmermann's "Binary Recursive Gcd"?

@JeanLucPons

JeanLucPons commented Jul 23, 2020

The "divstep62" used in VanitySearch is a slightly optimized signed bxcd loop to get zeros a bit faster.
The transition matrix starts as the identity and simply records the steps done, using native 64-bit signed integers.
It ends (on average) in 62x9.84 = 610 steps (against 62x12 = 744 steps for the constant-time implementation described in the paper). A standard bxcd "divstep62" gives (on average) 62x10.08 = 624.96 steps.

@JeanLucPons

Yes, there's no variable-time in the PR yet. I gave an example timing above from minor changes to _divsteps_62. I did experiment a bit further with the 64bit version and got var-time field inverse down to around ~1.75us, vs. ~3.25us for const-time and ~2.2us for the gmp one. That's already enough to dispense with gmp for 64bit, but I'm sure it can be improved further.

Hello,
Could you tell me on which hardware you get these perf ?
On my i5-8500, my implementation of field inverse gives 2.73us (on average, for 256-bit numbers uniformly distributed). I think I can win a few cycles by optimizing the vector/matrix product (something similar to what you did in secp256k1_fe_update_fg), but I'm not sure I can win a lot.
Thanks ;)

@gmaxwell
Contributor

gmaxwell commented Jul 24, 2020

@JeanLucPons aside, you can benchmark this code easily by applying the patch and running bench_internal. :) This doesn't totally answer your question since he hasn't posted those variable time optimizations but you could scale his figures based on comparing the constant time numbers.

@peterdettman
Contributor Author

@JeanLucPons It's an i7 (Haswell) @2.6GHz. The constant-time algorithm does 744 iterations because that's (more than) enough to guarantee completion, but the average number of iterations to send g to 0 is actually ~532.

In your code I would first try making the inner loop branchless. If you are able to use __builtin_ctzll or equivalent, then the while(even) loop can be rewritten very simply.
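For instance, the while-even inner loop can be collapsed into a single step (a sketch assuming GCC/Clang's __builtin_ctzll and g != 0):

```c
#include <stdint.h>

/* Shift out all trailing zero bits of g at once, returning how many
 * halving steps that accounts for. Undefined if *g == 0, since
 * __builtin_ctzll(0) is undefined. */
static int shift_out_zeros(uint64_t *g) {
    int zeros = __builtin_ctzll(*g);
    *g >>= zeros;
    return zeros;
}
```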

However I would suggest you turn your attention first to whether the algorithm is correct i.e. is it guaranteed to terminate for all inputs? A typical Euclidean algorithm always reduces the larger of two variables, and so will terminate. This is the case e.g. for Binary GCD also. A weirder way of saying that is that each iteration produces 0s in the high bits of whichever variable currently has fewer.

The 2-adic ("bottom bits") algorithms - including this PR, "binary recursive GCD", Algorithm PM ("Plus-or-Minus") - instead produce 0s in the low bits of whichever variable currently has fewer. This is what the "eta" variable is tracking in this PR (in the paper it's "delta"; eta is just -delta). Even with this property, one has to be careful as section 8.4 of the safegcd paper explains.

AFAICT your algorithm doesn't have a property like this; it alternates between u and v without regard for which of the two variables has made more progress. Are you aware of a proof that it always terminates? The same question applies to the "BXCD" algorithm there.

@sipa
Contributor

sipa commented Jul 25, 2020

This is really nice!

I've skimmed the paper and the code, and this is my understanding (which may help other reviewers):

  • We have a "micro iteration" function divstep(eta,f,g) that returns updated eta,f,g. When calling with arguments (1,f=oddinteger,g=integer), it will after a bounded number of iterations converge to (something,+-gcd(f,g),0).
  • Every iteration of divstep actually corresponds with a matrix multiplication with the input, and that matrix has small magnitude coefficients, and only depends on eta, and the bottom bit of f and g.
  • N iterations of divstep(eta,f,g) is thus a matrix multiplication consisting of the product of all those individual matrices, whose coefficients' bitlengths scale with N. It turns out that the combined matrix only depends on eta and the lower N bits of f and g.
  • This gives rise to macroiterations consisting of a fast way of applying N microiterations. Each first computes the combined matrix for N microiterations (as many as result in matrix coefficients that fit in a 64-bit or 32-bit integer; N=62 or 31), and then applies that matrix to the entire f/g values.
  • The overall algorithm is encoding the input in correctly-sized limbs (62 bits or 31 bits), applying 12 or 24 macroiterations on those limbs, remembering the 2x2 64/32-bit matrices from each, and then combining those matrices in order to extract the inverse from it, after a scalar multiplication to correct for the 744 powers of 2 the result was divided by.
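The micro iteration in the first bullet can be written out directly (a branching sketch for readability; the actual code computes the same transition branchlessly):

```c
#include <stdint.h>

/* One divstep on (delta, f, g), following the paper: f stays odd and
 * g loses at least one low bit per call. Exact division by 2 is used,
 * so no shift-of-negative-value issues arise in this sketch. */
static void divstep(int *delta, int64_t *f, int64_t *g) {
    if (*delta > 0 && (*g & 1)) {
        /* Swap case: (delta, f, g) -> (1 - delta, g, (g - f)/2). */
        int64_t tmp = *f;
        *delta = 1 - *delta;
        *f = *g;
        *g = (*g - tmp) / 2;  /* g - f is even: both are odd */
    } else if (*g & 1) {
        *delta = 1 + *delta;
        *g = (*g + *f) / 2;   /* g + f is even: both are odd */
    } else {
        *delta = 1 + *delta;
        *g = *g / 2;          /* g even: just halve it */
    }
}
```

Iterating from (1, f, g) with f odd drives g to 0, leaving ±gcd(f, g) in f.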

Some overall comments:

  • Would it be possible to introduce simple 2x2 matrix types (with 32/64/128/256 bit coefficients, i assume?) and operations on it? The many coefficients are a bit hard to follow. Or is this hard because they start off being 2x2 32/64-bit matrices and are then reused to serve as coefficients for 128/256 bit matrices, and it would be excessive to use separate memory for those?
  • Alternatively, can you explain the combine_Ns, and decode_matrix functions more?
  • Would it make sense to merge the inner core (the macroiterations applying on the limbs) as they seem to be shared by fe/scalar code?

Follow-up questions for understanding:

  • The algorithm could be made variable time fairly easily by checking if g=0 after every macro iteration? I suspect it will also make the correction term to undo the power of 2 multiplications dependent on the number of iterations (but there are at most 12/24, so precomputing them all would not be a problem).
  • Could jacobi be implemented effectively using the same code, except using a different extraction from the matrix coefficients?

@JeanLucPons

@peterdettman
Yes, my divstep62 always ends. It is true that the way this bxcd is written might be a bit confusing.
Roughly speaking, if u does not make progress, u and v may be swapped 2 times.
My divstep62 ends in 62 loops + 31 extra swaps in the worst case.
I tried a constant time implementation but it is ~20% slower.

Anyway, in 99.9999999...% of cases my implementation ends in fewer than 11 divstep62 calls, and I know there exists a small class of numbers where this limit is exceeded (even beyond 12). I don't know if it is the same in your implementation; I must admit that theorem 11.2 is unclear to me.
I do not have time this weekend to compile the stuff and test your program. I will do this next Monday ;)
Keep up the good work ;)

@real-or-random
Contributor

The code also uses arithmetic right shift as things stand (although it's not used in the main inner loop), which I guess will provoke discussion.

Technically that's implementation-defined, but I can't find a reference to a platform where a right shift of a negative value is not implemented as an arithmetic shift. We already rely on it here since a recent PR:

int mask = (n) >> (sizeof(n) * CHAR_BIT - 1); \

(Okay, there it could easily be avoided by casting to unsigned first.)

So I don't think that's a big issue. If this can be avoided for free, yes, we should do it. But otherwise I'd simply accept this. My understanding of the algorithm is not good enough but I suspect it cannot be avoided at zero cost?
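For reference, a standalone sketch of both styles of sign-mask extraction: the shift-of-signed version quoted above, and the cast-to-unsigned form that avoids the implementation-defined behavior:

```c
#include <stdint.h>

/* -1 if n is negative, 0 otherwise. Relies on arithmetic right shift
 * of a signed value, which is implementation-defined (but universal
 * in practice). */
static int32_t sign_mask_arith(int32_t n) {
    return n >> 31;
}

/* Same result, but the shift happens on an unsigned value, so the
 * behavior is fully defined by the standard. */
static int32_t sign_mask_portable(int32_t n) {
    return -(int32_t)((uint32_t)n >> 31);
}
```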

We have multiple other undocumented assumptions in the code base, e.g., about sizes of some types. I guess we should just document those using a CHECK somewhere that asserts this, e.g., like Bitcoin Core does:
https://github.com/bitcoin/bitcoin/blob/master/src/compat/assumptions.h

@real-or-random
Contributor

We discussed random blinding here. I just found this talk which demonstrates a sidechannel attack on the subsequent gcd algorithm if the blinding multiplication is done over the integers (without mod): https://youtu.be/ORGCHkSInjs?list=PLeeS-3Ml-rpqyNMiXWuheOmKAgCkUaass&t=1450 It's not directly related to the discussion here but I found it interesting.

@sipa
Contributor

sipa commented Oct 12, 2020

See rebased/squashed version in #831.

Contributor

@sipa sipa left a comment

A few comments to test my understanding.

src/field_10x26_impl.h
fn = f[len - 1];
gn = g[len - 1];

cond = ((int32_t)len - 2) >> 31;
Contributor

Am I right in thinking that (cond == 0) corresponds to (len >= 2 && (fn == 0 || fn == -1) && (gn == 0 || gn == -1))?

Contributor Author

Yes. i.e. if the length is more than 1, and the top words of both f and g contain only a sign bit, then we can shorten the length and keep the sign(s) in the top bits of the new top word(s).
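In standalone form, the shortening test reads as follows (a hypothetical helper mirroring the quoted lines; the len test relies on arithmetic right shift, as discussed elsewhere in the thread):

```c
#include <stdint.h>

/* len can drop by one when len >= 2 and the top limbs of f and g are
 * both 0 or -1, i.e. pure sign extensions of the limb below. */
static int can_shorten(int len, int32_t fn, int32_t gn) {
    int32_t cond = ((int32_t)len - 2) >> 31;  /* -1 if len < 2 */
    cond |= fn ^ (fn >> 31);                  /* nonzero unless fn is 0 or -1 */
    cond |= gn ^ (gn >> 31);                  /* nonzero unless gn is 0 or -1 */
    return cond == 0;
}
```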

Contributor Author

I should clarify that whilst d, e, f, g are arrays of signed values, only the top word is ever negative.

Contributor

Is there an easy argument why that's always the case, even with a modulus that's encoded using negative limbs?

Contributor Author

I overstated it; f can be initialised with negative limbs from the modulus. The update methods only leave a negative value in the top limb. _update_fg is always called at least once.

Contributor

Got it.

cond |= fn ^ (fn >> 31);
cond |= gn ^ (gn >> 31);

if (cond == 0) {
Contributor

This clause is there so that when the top nonzero element is 0 or -1, we stuff it into bit 29 of the element below?

How is it guaranteed that at the end this 29th bit is gone (in scalar_encode)?

Contributor Author

The sign bit is stuffed into bits 30, 31 - I'm not sure where 29 comes from. Are you asking about secp256k1_fe_decode_30 (since we are in field code here)? f, g are never decoded.

Contributor

Sorry for the confusion, let me restate:

  • All the inner arithmetic supports 31 bits values (+ sign bit)
  • Generally both d/e and g/f only store 30 significant bits per limb
    • Except when the top limb of f/g are both 0 or -1, in which case it's stuffed into the previous limb, possibly pushing that one to 31 bits.
    • d/e strictly only have 30 bits per limb, and only the top limb can be negative (otherwise the decode functions would fail).

Analogously for the 64-bit version, with s/30/62/.

Contributor Author

My bounds analysis for the update methods was based on s31 values in the matrices produced by _divsteps_30, and s30 values in d, e, f, g. The update methods currently only leave negative values in the top limb of d, e, f, g, but the modulus (and therefore initial f) might have negatives elsewhere.

Your questions have made me realize that the dynamic shortening of f, g can result in a top limb of f or g (or both) that is s31. Probably not an issue, but it might be best to redo bounds analysis for the update methods.

@real-or-random
Contributor

A few comments to test my understanding.

It would be nice to add code comments for all these clarifications.

@sipa
Contributor

sipa commented Oct 15, 2020

@real-or-random Definitely.

* divsteps are needed. */
eta = -(uint64_t)1;

for (i = 0; i < 12; ++i) {
Contributor

@sipa sipa Oct 18, 2020

Is there no point in taking advantage of the fact that 741 iterations is enough? (this loop looks like it supports 12*62 = 744 iterations).

@gmaxwell and I have been trying to find example inputs that need a large number of iterations (using a beam search that tries to find low bits that increase the average iteration count over many high bits that follow), but the best so far is 0x4212888e98e92e1c5384b37d0bf0655ec3bbad64381e713d6d6b0490d4858c83 (modulo the field order p), which needs 637 iterations.

I've also worked on some code that uses partial evaluation and interval reasoning to split up the range of inputs that have a large bound (using the log2(sqrt(f^2 + 4*g^2)) formula from the paper), until better bounds for the entire range are found. If my code is correct, both the field and scalar logic only need 737 iterations at most - but proving anything better will take a long time.

Some scratch code is here: https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea

Contributor Author

For a limit of 741 I doubt there's much to be gained by saving 3 divsteps, and I prefer the simplicity. Maybe for the 32-bit version saving 9 divsteps could be more viable. Two example approaches: 1) more code (alternate divsteps method plus extra calling code) -> careful about benchmark results neglecting e.g. the instruction cache impact, or branch prediction slot. 2) pass the iteration count to divsteps -> careful not to break compiler's ability to unroll divsteps loop (or handcoded asm).

I do hope that mathematical developments could reduce the max iterations. Checking the limit for the specific prime could gain a few bits as you say. It's unclear to me whether the worst-case input(s) can be generated, but it would be wonderful if "bad" inputs could be cheaply characterised (maybe based on something like the Hamming weight in NAF), then cheaply "fixed" in constant-time (something simple like multiplication by a small constant), so that a new limit would apply.

Contributor

@sipa sipa Oct 20, 2020

If we call b(x,y) = ⌊49/17 · log2(√(x² + 4y²))⌋, with appropriate corrections for small inputs, then I believe these bounds are correct:

  • δ = 1: iter ≤ b(|f|, |g|) [from the paper]
  • δ < 1: iter ≤ 1 - δ + b(|f|, ((2^(1-δ) - 1)|f| + |g|) / 2^(1-δ)) [δ will only go up for the next 1-δ steps, and in each, the g argument is either halved or averaged with f; after that many steps, this formula is an upper bound (corresponding to always averaging)]
  • δ > 1: iter ≤ 1 + δ + b(|g|, |g| + (|f| + |g|) / 2^(1+δ)) [by assuming the next step will have an odd g and using the previous rule, but correcting for the fact that more δ-incrementing even-g steps can happen in between]

With those rules, it seems I've proven that no input (for either the field or scalar order modulus) ever needs more than 736 iterations, using the code above (which finds a partition of the input range based on fixing low bits, and uses partial evaluation + the bounds above to find a bound ≤ 736 for each part).
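The δ = 1 case, b(x, y) = ⌊49/17 · log2(√(x² + 4y²))⌋, is easy to evaluate directly. A small sketch (double-precision floats are fine at these sizes, since the value is nowhere near an integer boundary; the paper's corrections for small inputs are omitted):

```python
import math

def b(x, y):
    """Upper bound floor(49/17 * log2(sqrt(x^2 + 4*y^2))) on the number
    of divsteps for |f| = x, |g| = y when delta = 1.
    Small-input corrections from the paper are omitted."""
    return math.floor(49 / 17 * 0.5 * math.log2(x * x + 4 * y * y))
```

For 256-bit inputs this gives b(2^256, 2^256) = 741, matching the 741-iteration figure discussed above (and covered by the loop's 12 × 62 = 744 steps).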

Contributor

Update: we've proven that no inputs for either field or scalar order need more than 735 iterations. We could do 734, but I don't think it's worth the effort: by extrapolation it would take around 16 years of CPU time (doable, but it needs some effort to distribute the work across network nodes etc.), which means that even if proven, the computation would be hard to repeat. The ≤736 bound takes a couple of minutes to prove with the (somewhat involved) C++ code, and a few hours in Python.

| Bound | Proof steps for field | Proof steps for scalar |
| ----- | --------------------- | ---------------------- |
| ≤741  | 1                     | 1                      |
| ≤740  | 21                    | 21                     |
| ≤739  | 85                    | 85                     |
| ≤738  | 17835                 | 17835                  |
| ≤737  | 6753276               | 6753276                |
| ≤736  | 2879322241            | 2879322241             |
| ≤735  | 1216906236685         | 1216976994782          |

Contributor

The ≤735 proof takes 80 minutes on a 128-core host.

I wouldn't mind doing the ≤734 proof, even without networking code, if it would actually be even slightly useful, but it'll take me about a month of real time that way (for each of field and scalar; I could do it in 4 or 5 days with a networked version). ≤733 would, by extrapolation, be about 7000 CPU-years: reasonably beyond my ability, but it could be done (e.g. with the help of an actual supercomputer).

I don't think ≤732 is currently provable using all the king's cycles and all the king's mems, at least not using these techniques.

Contributor

@sipa sipa Oct 25, 2020

> I don't think ≤732 is currently provable using all the king's cycles and all the king's mems, at least not using these techniques.

Surely all the king's men can muster 1M c6g.16xlarge instances for a week (~350M USD).

Contributor

@sipa curious to see the code used to prove this

Contributor

@elichai https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea

  • fgcd.cpp is a heuristic search for large inputs
  • fgcd_recurse.{py,cpp} are provers for a maximum bound


secp256k1_scalar_reduce(r, secp256k1_scalar_check_overflow(r));

secp256k1_scalar_add(&u, r, &SECP256K1_SCALAR_NEG_TWO_POW_256);
Contributor

@sipa sipa Oct 29, 2020

I'm wondering if there is a good reason to do this in the scalar/field representation rather than the signed30/62 representation? There could be a single addition chain that negates (based on the sign of f[0]) and (conditionally) adds the modulus if the computed inverse is negative.
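The kind of single final chain being suggested might look like this in outline. This is a hedged sketch with Python branches where the real code would use constant-time masks; it assumes the candidate inverse d lies in (-2P, 2P), with the sign of the final f deciding whether to negate (the name `finalize` is mine):

```python
def finalize(d, f_sign, P):
    """Map d in (-2P, 2P) to the canonical representative of
    f_sign * d in [0, P). f_sign is +1 or -1, the sign of the
    final f (= +/- gcd)."""
    if f_sign < 0:     # conditional negation (mask-based in constant time)
        d = -d
    if d < 0:          # at most two conditional additions of the modulus...
        d += P
    if d < 0:
        d += P
    if d >= P:         # ...and one conditional subtraction
        d -= P
    return d
```

Each branch corresponds to one conditional add/subtract in a constant-time chain; two additions suffice precisely because the input range is (-2P, 2P).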

Contributor Author

No good reason: I do it as part of the inverse in the BouncyCastle implementation. As written, that requires D to be in [-P, P), which I believe is correct, but so far in secp256k1 I have only assumed [-2^256, 2^256).

Contributor

I had a hard time verifying those bounds, so I wrote a (hopefully faithful) reimplementation of the algorithm here in Python, only replacing the limb decomposition with native integers (which are arbitrary length in Python): https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-fgcd-py

Based on the choice of the limb size and moduli, better bounds may hold, but if I make no assumptions (allowing any modulus, and any limb size from 1 bit up to however many bits the modulus is), I can only verify that the "d" value at the end of the divsteps iterations is in [-2P, P). Care to share your reasoning for why it'd be tighter than that?

Contributor

@sipa sipa Oct 30, 2020

Hmm, when I restrict it to even limb sizes of at least 6 bits, [-P,P) seems to hold.

Scratch that, inverse of 19934 modulo 21163, with 6 bit limbs, gives d=-23436.

Contributor Author

It was a while ago now... I recall doing some exhaustive testing for small values and finding an asymptote for the worst-case at around +/- 5*P/6 (actually just a bit larger), but I'm foggy on the details. We could certainly use more analysis here and it's probably best to go with [-2P, P) for the moment if you can prove that.

Contributor

I'll think about that... but is that easy/cheap to do? You may need to scan for the first non-zero limb to find the sign; doing that (especially in constant time) may not be that much cheaper than an addition chain to add/subtract a modulus.

Contributor Author

It's cheap: the sign is always correct in the top limb for D, E. EDIT: they are basically in 2's complement, on a 2^30 radix, to avoid overflow issues in the updates and so that the implicit shift is free.
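As I understand the representation being described, it is a signed radix-2^30 encoding in which the low limbs sit in [0, 2^30) and the top limb alone carries the sign, so a negativity check is a single comparison on the top limb. A sketch (limb count and helper names assumed for illustration):

```python
M30 = (1 << 30) - 1

def to_signed30(x, limbs=9):
    """Decompose x into signed radix-2^30: low limbs in [0, 2^30),
    sign carried entirely by the top limb."""
    v = []
    for _ in range(limbs - 1):
        v.append(x & M30)
        x >>= 30          # arithmetic shift keeps the sign in the remainder
    v.append(x)           # top limb; negative iff x was negative
    return v

def from_signed30(v):
    return sum(limb << (30 * i) for i, limb in enumerate(v))

def is_negative(v):
    return v[-1] < 0      # the cheap sign check discussed above
```

Python's right shift on negative integers is arithmetic, which is what makes the decomposition exact for negative values.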

Contributor Author

BTW, I've kind of accepted [-2P, 2P) for the moment. Just looking for a sign-based fixup for update_de; the ones I've tried so far actually give larger worst-case values in random trials, but perhaps they allow a locally-reasoned proof.

Contributor

@sipa sipa Nov 3, 2020

It seems this convex hull approximation technique also works for bounding the number of divsteps needed! https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-hull-bound-py

If the code is correct (and it looks like it is for sufficiently small inputs), it proves that for any 256-bit modulus M and input in [0..M), no more than 723 divsteps are needed. Still not enough to reduce our number of big step iterations in either 32 or 64 bit, but perhaps there are other gains?

EDIT: if the input is restricted to 0..MODULUS/2, 721 is enough. So close to only needing 24 iterations in 32 bit mode.

Contributor Author

I've pushed some changes that should address this issue of the size constraints on D, E. See comments in _update_de method(s). I've only done 64bit, but the pattern should be clear. This latest commit also does the final normalization within the 62-bit format.

@peterdettman
Contributor Author

@sipa Just so we don't duplicate efforts: I am working on porting this into your new PR.

@sipa
Contributor

sipa commented Nov 12, 2020

@peterdettman Cool, thanks! I was planning on doing so myself soon, but hadn't started.

@sipa
Contributor

sipa commented Jan 30, 2021

Closing in favor of #831.

@sipa sipa closed this Jan 30, 2021
sipa added a commit that referenced this pull request Mar 18, 2021
24ad04f Make scalar_inverse{,_var} benchmark scale with SECP256K1_BENCH_ITERS (Pieter Wuille)
ebc1af7 Optimization: track f,g limb count and pass to new variable-time update_fg_var (Peter Dettman)
b306935 Optimization: use formulas instead of lookup tables for cancelling g bits (Peter Dettman)
9164a1b Optimization: special-case zero modulus limbs in modinv64 (Pieter Wuille)
1f233b3 Remove num/gmp support (Pieter Wuille)
20448b8 Remove unused Jacobi symbol support (Pieter Wuille)
5437e7b Remove unused scalar_sqr (Pieter Wuille)
aa9cc52 Improve field/scalar inverse tests (Pieter Wuille)
1e0e885 Make field/scalar code use the new modinv modules for inverses (Pieter Wuille)
436281a Move secp256k1_fe_inverse{_var} to per-impl files (Pieter Wuille)
aa404d5 Move secp256k1_scalar_{inverse{_var},is_even} to per-impl files (Pieter Wuille)
08d5496 Improve bounds checks in modinv modules (Pieter Wuille)
151aac0 Add tests for modinv modules (Pieter Wuille)
d8a92fc Add extensive comments on the safegcd algorithm and implementation (Pieter Wuille)
8e415ac Add safegcd based modular inverse modules (Peter Dettman)
de0a643 Add secp256k1_ctz{32,64}_var functions (Pieter Wuille)

Pull request description:

  This is a rebased and squashed version of #767, adding safegcd-based implementations of constant-time and variable-time modular inverses for scalars and field elements, by Peter Dettman. The PR is organized as follows:
  * **Add secp256k1_ctz{32,64}_var functions** Introduction of ctz functions to util.h (which use `__builtin_ctz` on recent GCC and Clang, but fall back to using a software emulation using de Bruijn on other platforms). This isn't used anywhere in this commit, but does include tests.
  * **Add safegcd based modular inverse modules** Add Peter Dettman's safegcd code from #767 (without some of his optimizations, which are moved to later commits), turned into separate modules by me.
  * **Add extensive comments on the safegcd algorithm and implementation** Add a long description of the algorithm and optimizations to `doc/safegcd_implementation.md`, as well as additional comments to the code itself. It is probably best to review this together with the previous commit (they're separated to keep authorship).
  * **Add tests for modinv modules** Adds tests on the modinv interface directly, for arbitrary moduli.
  * **Improve bounds checks in modinv modules** Adds a lot of sanity checking to the modinv modules.
  * **Move secp256k1_scalar_{inverse{_var},is_even} to per-impl files** A pure refactor to prepare for switching the field and scalar code to modinv.
  * **Make field/scalar code use the new modinv modules for inverses** Actually switch over.
  * **Add extra modular inverse tests** This adds modular inverse tests through the field/scalar interface, now that those use modinv.
  * **Remove unused Jacobi symbol support** No longer needed.
  * **Remove num/gmp support** Bye-bye.
  * 3 commits with further optimizations.

ACKs for top commit:
  gmaxwell:
    ACK 24ad04f
  sanket1729:
    ACK 24ad04f
  real-or-random:
    ACK 24ad04f careful code review, some testing

Tree-SHA512: 732fe29315965e43ec9a10ee8c71eceeb983c43fe443da9dc5380a5a11b5e40b06e98d6abf67b773b1de74571fd2014973c6376f3a0caeac85e0cf163ba2144b
laanwj added a commit to bitcoin-core/gui that referenced this pull request Jun 7, 2021
…ster

5c7ee1b libsecp256k1 no longer has --with-bignum= configure option (Pieter Wuille)
bdca9bc Squashed 'src/secp256k1/' changes from 3967d96bf1..efad3506a8 (Pieter Wuille)
cabb566 Disable certain false positive warnings for libsecp256k1 msvc build (Pieter Wuille)

Pull request description:

  This updates our src/secp256k1 subtree to the latest upstream master. The changes include:

  * The introduction of safegcd-based modular inverses, reducing ECDSA signing time by 25%-30% and ECDSA verification time by 15%-17%.
    * [Original paper](https://gcd.cr.yp.to/papers.html) by Daniel J. Bernstein and Bo-Yin Yang
    * [Implementation](bitcoin-core/secp256k1#767) by Peter Dettman; [final](bitcoin-core/secp256k1#831) version
    * [Explanation](https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md) of the algorithm using Python snippets
    * [Analysis](https://github.com/sipa/safegcd-bounds) of the maximum number of iterations the algorithm needs
    * [Formal proof in Coq](https://medium.com/blockstream/a-formal-proof-of-safegcd-bounds-695e1735a348) by Russell O'Connor, for a high-level equivalent algorithm
  * Removal of libgmp as an (optional) dependency (which wasn't used in the Bitcoin Core build)
  * CI changes (Travis -> Cirrus)
  * Build system improvements

ACKs for top commit:
  laanwj:
    Tested ACK 5c7ee1b

Tree-SHA512: ad8ac3746264d279556a4aa7efdde3733e114fdba8856dd53218588521f04d83950366f5c1ea8fd56329b4c7fe08eedf8e206f8f26dbe3f0f81852e138655431
UdjinM6 pushed a commit to UdjinM6/dash that referenced this pull request Aug 10, 2021
5tefan pushed a commit to 5tefan/dash that referenced this pull request Aug 12, 2021
gades pushed a commit to cosanta/cosanta-core that referenced this pull request Nov 5, 2023
gades pushed a commit to piratecash/pirate that referenced this pull request Dec 9, 2023