secp256k1: Optimize k splitting with mod n scalar.#2888
davecgh merged 4 commits into decred:master
Conversation
Force-pushed fd0368c to b7d310b
matheusd
left a comment
Woop Woop!
It's invaluable having all the rationale for the calculations inline with the code. Makes reviewing the correctness of the code much easier.
I've marked some places inline where I've manually checked the assumptions. I also had GECC opened here to follow the explanations, but the inline, step-by-step derivations are much easier for me to follow.
All in all, excellent work as usual!
// Note the carry on the final add is safe to discard because the maximum
// possible value is:
//   (2^64 - 1)(2^64 - 1) + (2^64 - 1) = 2^128 - 2^64
// and:
//   2^128 - 2^64 < 2^128.
// Note the carries on the high order adds are safe to discard because the
// maximum possible value is:
//   (2^64 - 1)(2^64 - 1) + 2*(2^64 - 1) = 2^128 - 1
// and:
//   2^128 - 1 < 2^128.
roundBit := r4 >> 63
r2, r1, r0 = r7, r6, r5
320/64 = 5, so dropping r0..r4 checks out
//
// No modular reduction is needed because the result is guaranteed to be
// less than the group order given the group order is > 2^255 and the
// maximum possible value of the result is 2^192.
2^512 / 2^320 = 2^192, so checks out.
//
// Note that, per above, the components of vector v1 are a1 and b1 while the
// components of vector v2 are a2 and b2. Given the vectors v1 and v2 were
// generated such that a1*b2 - a2*b1 = n, solving the equation for g1 and g2
a1*b2 - a2*b1 = n (mod n) = 0 (mod n) (due to requirement 1 in the generation of the vectors)
Noting this here because the "= n" didn't seem like a given while reading, but the independence requirement implies this is 0 (mod n) which is n. I do understand you're using n here (instead of zero) because otherwise the next division would be meaningless.
Recall that we're working in the integers (v1 = <a1,b1>, and v2 = <a2,b2> in ℤ⨯ℤ) and rationals here ("where g1, g2 ∈ ℚ") so the (mod n) bits don't apply (yet) and thus division by n is well defined. It's n because the vectors are generated via the Extended Euclidean Algorithm (EEA) which generates a sequence of equations s[i]*n + t[i]*λ = r[i] (because n and λ are the two inputs).
The vectors are chosen from three consecutive iterations, as soon as the remainder drops below sqrt(n), as a1 = r[i+1], b1 = -t[i+1], and either a2 = r[i], b2 = -t[i] or a2 = r[i+2], b2 = -t[i+2], depending on which has the smaller Euclidean norm.
Since the EEA works by successively dividing the previous remainders to obtain an integer quotient and a new remainder, it follows that a1*b2 - a2*b1 alternates between n and -n, and, as noted above, the iteration on whichever side of the i+1 iteration has the smaller Euclidean norm is chosen, which forces the value to be n.
// g1 = b2*k / n
// g2 = -b1*k / n
//
// Observe:
//   = <[a1*b2*k - a2*b1*k]/n, 0>   | simplify
//   = <k*[a1*b2 - a2*b1]/n, 0>     | factor out k
//   = <k*n/n, 0>                   | substitute
//   = <k, 0>                       | simplify
// z1 = floor(b2<<320 / n)           | precomputed
// z2 = floor(((-b1)%n)<<320 / n)    | precomputed
// c1 = ((k * z1) >> 320) + (((k * z1) >> 319) & 1)
// c2 = ((k * z2) >> 320) + (((k * z2) >> 319) & 1)
the >> 319 & 1 bit being the round function... ✔️
c2 := mul512Rsh320Round(k, endoZ2)
k2.Add2(c1.Mul(endoNegB1), c2.Mul(endoNegB2))
k1.Mul2(&k2, endoLambda).Negate().Add(k)
return k1, k2
I updated this to shave an extra field negation off by storing the negative version of lambda.
See the diff here.
endoZ1 = hexToModNScalar("3086d221a7d46bcde86c90e49284eb153daa8a1471e8ca7f")
endoZ2 = hexToModNScalar("e4437ed6010e88286f547fa90abfe4c4221208ac9df506c6")
Double checked the hard coded values correspond to the generated ones.
Thanks for the thorough review @matheusd.
Force-pushed b411d19 to cb7e99a
rstaudt2
left a comment
This looks great. I was not familiar with some of the math/optimizations beforehand but was able to follow with the excellent comments and references. I also tested by running a full sync on mainnet with --nocheckpoints and --assumevalid=0.
c2 := mul512Rsh320Round(k, endoZ2)
k2.Add2(c1.Mul(endoNegB1), c2.Mul(endoNegB2))
k1.Mul2(&k2, endoNegLambda).Add(k)
return k1, k2
Epic comment to lines of code ratio. Super helpful explanation of the math and optimizations!
Force-pushed cf62f3a to c4ccaeb
rstaudt2
left a comment
The latest updates look good to me as well! 🚀
Force-pushed 0755280 to 8399238
JoeGruffins
left a comment
Code looks correct to me. Have synced from zero on mainnet and testnet with the same flags as @rstaudt2. Have run this compiled with other projects and no problems. The explanations of the "magic" are extremely helpful.
This reworks the k (scalar) splitting tests to make them more consistent with modern practices in the code, to make the testing methodology less reliant on specific implementation details, and to test edge cases. It also updates the random k splitting tests to ensure new random values are tested each run, as opposed to the existing tests, which always test the same values since they use the same random seed.

Specifically, the following properties are asserted:
- The decomposed scalars must result in the original scalar via the equation k = k1 + k2*lambda (mod n)
- Both decomposed scalars must have small magnitude, as determined by having a maximum bit length of one more than half of the bit size of the underlying field
This modifies the code related to precomputing the linearly independent vectors used by the endomorphism so that it computes all of the related params, including lambda and beta, as well as the alternative params. It also ensures the derived values satisfy the required equations. This helps make it clear how they are derived, as opposed to being magic constants, and also makes it easier to include new derived values that are planned for an upcoming optimization.
This optimizes the scalar decomposition code by rewriting it to make use
of the highly-efficient zero-allocation ModNScalar type along with
math/bits (which provides hardware acceleration on most platforms)
instead of big.Ints.
The net effect is that the decomposition is significantly faster,
allocation free, and constant time.
It also adds a bunch of detailed comments to better describe the
endomorphism and how it is used in scalar multiplication in addition to
fully describing the scalar decomposition process and derivation.
Finally, it adds logic to derive and print the new constants the
reworked code makes use of to the precomputation code that runs with
'go generate'.
The following is a high level overview of the changes:
- Rewrites splitK to use the ModNScalar type instead of big.Ints:
- Allocation free
- Constant time
- Provides hardware acceleration on most platforms
- Includes detailed comments describing the scalar decomposition
process and derivation
- Updates endomorphism parameter constant definitions to be ModNScalars
instead of big.Int
- Moves the convenience func hexToModNScalar to the main package
- Adds new endoZ1 and endoZ2 constants for the rounded multiplication
used by the new decomposition code
- Adds logic to derive and print the new endomorphism constants with
'go generate'
- Updates the scalar multiplication to account for the new semantics
- Adds detailed comments to scalar multiplication
- Tightens negation magnitudes in addZ1EqualsZ2 and removes no longer
  needed normalization
- Ensures the calculation when recovering compact signatures uses
normalized points
- Updates associated tests
- Updates associated benchmark
- Removes splitKModN test helper since conversion is no longer needed
The following benchmarks show a before and after comparison of scalar
decomposition as well as how that translates to scalar multiplication
and signature verification:
name old time/op new time/op delta
---------------------------------------------------------------------
SplitK 1.61µs ±32% 0.89µs ± 2% -44.69% (p=0.000 n=50+47)
ScalarMult 125µs ± 1% 115µs ± 1% -7.82% (p=0.000 n=43+46)
SigVerify 161µs ±25% 160µs ±19% -0.53% (p=0.001 n=50+50)
name old allocs/op new allocs/op delta
-----------------------------------------------------------------------
SplitK 10.0 ± 0% 0.0 -100.00% (p=0.000 n=50+50)
ScalarMult 11.0 ± 0% 0.0 -100.00% (p=0.000 n=50+50)
SigVerify 28.0 ± 0% 16.0 ± 0% -42.86% (p=0.000 n=50+50)
While it only saves about 1 µs per signature verification in the
benchmarking scenario, the primary win is the reduction in the number of
allocations per signature verification which has a much more significant
impact when verifying large numbers of signatures back to back, such as
when processing new blocks, and especially during the initial chain sync
process.
Force-pushed 8399238 to 7833beb
This requires #2887.

This optimizes the scalar decomposition code by rewriting it to make use of the highly-efficient zero-allocation ModNScalar type along with math/bits (which provides hardware acceleration on most platforms) instead of big.Ints. The net effect is that the decomposition is significantly faster, allocation free, and constant time.

Since this is touching code that is critical to consensus, and the math involved might not be immediately familiar to reviewers, it also adds a bunch of detailed comments to better describe the endomorphism and how it is used in scalar multiplication, in addition to fully describing the scalar decomposition process and derivation.

Finally, it adds logic to derive and print all of the constants the reworked code makes use of to the precomputation code that runs with go generate.

The following is a high level overview of the changes:
- Rewrites splitK to use the ModNScalar type instead of big.Ints:
  - Allocation free
  - Constant time
  - Provides hardware acceleration on most platforms
  - Includes detailed comments describing the scalar decomposition process and derivation (k = k1 + k2*λ (mod n))
- Updates endomorphism parameter constant definitions to be ModNScalars instead of big.Int
- Moves the convenience func hexToModNScalar to the main package
- Adds new endoZ1 and endoZ2 constants for the rounded multiplication used by the new decomposition code
- Adds logic to derive and print the new endomorphism constants with go generate
- Tightens negation magnitudes in addZ1EqualsZ2 and removes no longer needed normalization

The following benchmarks show a before and after comparison of scalar decomposition as well as how that translates to scalar multiplication and signature verification:
While it only saves about 1 µs per signature verification in the benchmarking scenario, the primary win is the reduction in the number of allocations per signature verification which has a much more significant impact when verifying large numbers of signatures back to back, such as when processing new blocks, and especially during the initial chain sync process.