precomp: use normalized extended points #59

Merged: 6 commits into master, Oct 21, 2023
Conversation

@jsign (Collaborator) commented Sep 7, 2023

Thanks to Gottfried Herold for suggesting this change in a chat we had some time ago.

This PR improves the performance of our fixed-base MSM algorithm, which is used for most heavy cryptographic operations (e.g., tree key hashing, vector commitments, etc.).

The idea is simple: use extended-coordinate additions instead of projective mixed additions when we aggregate the precomputed points. That saves finite-field multiplications, which improves performance. It comes at a 50% memory cost but, considering some improvements we made a while ago (and other ideas we can still apply), that feels justified for this speedup.
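As a rough illustration of the shape of the change (not the actual go-ipa or gnark-crypto API; the type and method names below are placeholders), the aggregation loop keeps an extended-coordinates accumulator and adds the precomputed Z == 1 table entries into it, instead of using a projective accumulator with mixed additions:

// Structural sketch only: types and methods are illustrative stand-ins for
// the curve library's real API (in go-ipa, gnark-crypto's bandersnatch types).
package sketch

// fieldElement stands in for a base-field element.
type fieldElement [4]uint64

// affinePoint is a precomputed table entry: (X, Y), with Z implicitly 1.
type affinePoint struct{ X, Y fieldElement }

// extendedPoint carries (X, Y, T, Z) with T = X*Y/Z; additions in these
// coordinates need fewer field multiplications than projective additions.
type extendedPoint struct{ X, Y, T, Z fieldElement }

// setIdentity and addMixed are placeholders for the curve arithmetic; a
// concrete formula sketch for addMixed appears under "Finer details" below.
func (p *extendedPoint) setIdentity()            { /* ... */ }
func (p *extendedPoint) addMixed(q *affinePoint) { /* ... */ }

// aggregate is the fixed-base MSM aggregation loop: each non-zero scalar
// window selects one precomputed point, which is added into the accumulator.
func aggregate(table []affinePoint, windows []uint16) extendedPoint {
	var acc extendedPoint
	acc.setIdentity()
	for _, w := range windows {
		if w == 0 {
			continue
		}
		acc.addMixed(&table[w-1])
	}
	return acc
}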

The path to get here was longer than expected, since some groundwork was needed first:

  • Stop relying on generated code and depend directly on gnark-crypto, which already supported extended points (PR 1, PR 2). [This was a good move regardless, since we had planned it for a while: less code to maintain and audit.]
  • While working on the change itself, I detected a regression in gnark-crypto, which I helped solve upstream.
  • Then I realized gnark-crypto was doing unintended inversions in extended-point additions, so I helped fix that in the library too.
  • After that, this PR was possible!

The bottom line is that we save two multiplications per group operation, rather than the one we expected in the original chat with Gottfried: one from switching to extended-coordinate additions, and one more from specializing for Z == 1 (see "Finer details" below).

Benchmarks

AMD Ryzen 7 3800XT 8-Core Processor:

name                                  old time/op    new time/op    delta
PrecompMSM/msm_length=1/precomp-16      3.39µs ± 2%    2.87µs ± 2%  -15.57%  (p=0.000 n=10+10)
PrecompMSM/msm_length=2/precomp-16      6.63µs ± 1%    5.70µs ± 1%  -14.05%  (p=0.000 n=9+9)
PrecompMSM/msm_length=4/precomp-16      13.1µs ± 1%    11.2µs ± 3%  -14.53%  (p=0.000 n=10+10)
PrecompMSM/msm_length=8/precomp-16      35.3µs ± 2%    30.7µs ± 1%  -13.11%  (p=0.000 n=10+8)
PrecompMSM/msm_length=16/precomp-16     88.8µs ± 2%    76.9µs ± 1%  -13.42%  (p=0.000 n=10+10)
PrecompMSM/msm_length=32/precomp-16      208µs ± 1%     178µs ± 1%  -14.49%  (p=0.000 n=8+8)
PrecompMSM/msm_length=64/precomp-16      451µs ± 1%     393µs ± 1%  -12.90%  (p=0.000 n=8+9)
PrecompMSM/msm_length=128/precomp-16     959µs ± 2%     831µs ± 1%  -13.40%  (p=0.000 n=10+10)
PrecompMSM/msm_length=256/precomp-16    1.98ms ± 1%    1.70ms ± 2%  -13.89%  (p=0.000 n=8+10)

This is an amd64 processor, so it uses gnark-crypto's assembly for bigint ops.

Rock5B (this machine has a different version of the benchmark comparison tool, so the magnitude symbols look slightly different):

PrecompMSM/msm_length=1/precomp-8      19.40µ ± 3%   16.87µ ± 0%  -13.06% (p=0.000 n=10)
PrecompMSM/msm_length=2/precomp-8      39.21µ ± 3%   32.30µ ± 2%  -17.63% (p=0.000 n=10)
PrecompMSM/msm_length=4/precomp-8      77.64µ ± 1%   65.43µ ± 1%  -15.73% (p=0.000 n=10)
PrecompMSM/msm_length=8/precomp-8      212.7µ ± 2%   182.4µ ± 2%  -14.25% (p=0.000 n=10)
PrecompMSM/msm_length=16/precomp-8     522.4µ ± 2%   456.2µ ± 2%  -12.67% (p=0.000 n=10)
PrecompMSM/msm_length=32/precomp-8     1.170m ± 2%   1.017m ± 2%  -13.12% (p=0.000 n=10)
PrecompMSM/msm_length=64/precomp-8     2.508m ± 3%   2.114m ± 2%  -15.71% (p=0.000 n=10)
PrecompMSM/msm_length=128/precomp-8    5.042m ± 2%   4.337m ± 2%  -13.99% (p=0.000 n=10)
PrecompMSM/msm_length=256/precomp-8    10.543m ± 2%   9.347m ± 3%  -11.35% (p=0.000 n=10)

This is an ARM machine, so it doesn't use assembly (and it has a slower clock).

The speedups are roughly the same in both cases, which makes sense: this optimization avoids work altogether, so it is independent of the underlying bigint implementation.

Finer details

It turns out I also had to include an extended-point addition formula specialized for the case where the second point has Z == 1, which is always true for our precomputed points. That saves one extra multiplication compared to the formula gnark-crypto uses. (More about this in the PR comments.)
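For reference, here is a hedged sketch of that specialization: the standard extended twisted Edwards addition (the A..H letters of add-2008-hwcd) with the D = Z1*Z2 step collapsed to D = Z1 because Z2 == 1. It is written over math/big for clarity rather than with gnark-crypto field elements, and the curve constants a, d and the field modulus are passed in instead of hard-coding Bandersnatch's parameters (Bandersnatch has a = -5, which is why the review diff below computes H with Neg and MulBy5):

// Illustrative only: not the go-ipa implementation. Point addition in
// extended twisted Edwards coordinates (X, Y, T, Z with T = X*Y/Z) for the
// curve a*x^2 + y^2 = 1 + d*x^2*y^2, specialized for a second operand with
// Z == 1 (a "normalized" point).
package sketch

import "math/big"

// extendedPoint is a full (X, Y, T, Z) point.
type extendedPoint struct{ X, Y, T, Z *big.Int }

// normalizedPoint has Z implicitly equal to 1, so only X, Y, T are stored.
type normalizedPoint struct{ X, Y, T *big.Int }

// addNormalized sets p = p + q. Because Z2 == 1, D = Z1*Z2 becomes D = Z1,
// which is the extra field multiplication saved per group operation.
func (p *extendedPoint) addNormalized(q *normalizedPoint, a, d, mod *big.Int) {
	A := new(big.Int).Mul(p.X, q.X) // A = X1*X2
	B := new(big.Int).Mul(p.Y, q.Y) // B = Y1*Y2
	C := new(big.Int).Mul(p.T, q.T) // C = d*T1*T2
	C.Mul(C, d)
	D := new(big.Int).Set(p.Z)      // D = Z1 (instead of Z1*Z2)
	E := new(big.Int).Add(p.X, p.Y) // E = (X1+Y1)*(X2+Y2) - A - B
	E.Mul(E, new(big.Int).Add(q.X, q.Y))
	E.Sub(E, A)
	E.Sub(E, B)
	F := new(big.Int).Sub(D, C) // F = D - C
	G := new(big.Int).Add(D, C) // G = D + C
	H := new(big.Int).Mul(a, A) // H = B - a*A (for Bandersnatch, a = -5)
	H.Sub(B, H)
	p.X = new(big.Int).Mul(E, F) // X3 = E*F
	p.Y = new(big.Int).Mul(G, H) // Y3 = G*H
	p.T = new(big.Int).Mul(E, H) // T3 = E*H
	p.Z = new(big.Int).Mul(F, G) // Z3 = F*G
	for _, v := range []*big.Int{p.X, p.Y, p.T, p.Z} {
		v.Mod(v, mod)
	}
}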

Also, I had to create a further "extended point" flavor, which I call "normalized", for points whose Z is known to be 1. Our precomputed points always have Z == 1, so it doesn't make sense to use the gnark-crypto struct, which would store a redundant Z coordinate for every precomputed point. With the normalized representation the precomputed table grows by 50% (~110MiB) rather than 100% (~220MiB).
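A back-of-the-envelope check of those percentages, assuming a 32-byte base-field element (4 × 64-bit limbs); the struct names here are illustrative, not the actual go-ipa types:

package main

import (
	"fmt"
	"unsafe"
)

type fieldElement [4]uint64 // 32 bytes: 4 x 64-bit limbs

type affinePoint struct{ X, Y fieldElement }                // table entry before this PR
type extendedNormalizedPoint struct{ X, Y, T fieldElement } // after: Z implicitly 1
type extendedPoint struct{ X, Y, T, Z fieldElement }        // cost of storing full extended points

func main() {
	base := unsafe.Sizeof(affinePoint{})
	norm := unsafe.Sizeof(extendedNormalizedPoint{})
	full := unsafe.Sizeof(extendedPoint{})
	fmt.Printf("affine: %d B, normalized extended: %d B (+%d%%), full extended: %d B (+%d%%)\n",
		base, norm, 100*(norm-base)/base, full, 100*(full-base)/base)
	// Output: affine: 64 B, normalized extended: 96 B (+50%), full extended: 128 B (+100%)
}

The per-point growth matches the 50% vs 100% table-size figures above.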

At some point we can discuss with the gnark team whether it makes sense to add a new method with this specialized formula for Z == 1, or to add an extra if to the current procedure to skip the unnecessary multiplication inside the addition. (In the latter case we'd have to double-check whether a potential branch misprediction matters, but I doubt it; we can measure.)

TODO:

  • Run in Kaustinen to triple-check correctness.

Resolved review threads: bandersnatch/bandersnatch.go (5), banderwagon/precomp.go (5).
Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com> (on each of the 6 commits)
jsign marked this pull request as ready for review on October 18, 2023 17:09
jsign requested a review from kevaundray on October 18, 2023 17:09
@jsign (Collaborator, Author) commented Oct 18, 2023

@kevaundray, this should be ready to review. I've run it on the new Kaustinen testnet, which has >30k blocks, and a node with this version synced correctly, so it looks good.

G.Add(&D, &C)
H.Set(&A)

// mulBy5(&H)
Contributor:

nit: I'm guessing this comment can be removed or put one line down

H.Neg(&H)
gnarkfr.MulBy5(&H)

H.Sub(&B, &H)
Contributor:

Thanks for making this match the referenced link including the letters used :)

@@ -5,20 +5,24 @@ import (
"io"

gnarkbandersnatch "github.com/consensys/gnark-crypto/ecc/bls12-381/bandersnatch"
gnarkfr "github.com/consensys/gnark-crypto/ecc/bls12-381/fr"
Contributor:

nit: note that although this is gnarkfr (bls12-381), it's fp for Bandersnatch, so maybe gnarkfp or something along those lines would be less confusing, since we already use fp in this file
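In other words, the suggestion is just an import-alias rename along these lines (illustrative; the final name is up to the author):

package bandersnatch

// The bls12-381 scalar field fr is the base field (fp) of Bandersnatch,
// so aliasing the import as gnarkfp reads better next to the fp naming
// already used in this file. Sketch of the suggested rename only.
import gnarkfp "github.com/consensys/gnark-crypto/ecc/bls12-381/fr"

var _ gnarkfp.Element // reference the package so the example compiles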

kevaundray merged commit ff2c8f7 into master on Oct 21, 2023
2 checks passed