precomp: use normalized extended points #59

Merged: 6 commits into master on Oct 21, 2023
Conversation

@jsign (Collaborator) commented Sep 7, 2023

Thanks to Gottfried Herold for suggesting this change in a chat we had some time ago.

This PR improves the performance of our fixed-base MSM algorithm, which is used by most of the heavy cryptography operations (e.g., tree-key hashing, vector commitments, etc.).

The idea is simple: use extended-coordinates point additions instead of projective mixed additions when aggregating the precomputed points. That saves finite-field muls and thus improves performance. It has a 50% memory cost but, considering some improvements we made a while ago (and other ideas we can still apply), it feels justified for this speedup.
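For intuition, here is a minimal sketch of the fixed-base MSM aggregation loop this change affects. All type and function names (pointExtended, pointExtendedNormalized, addMixed, accumulate) are illustrative placeholders rather than the actual go-ipa/gnark-crypto identifiers, and the window indexing is simplified; the real code lives in banderwagon/precomp.go.

package sketch

import fr "github.com/consensys/gnark-crypto/ecc/bls12-381/fr"

// Accumulator in full extended twisted Edwards coordinates (X:Y:T:Z).
type pointExtended struct{ X, Y, T, Z fr.Element }

// Precomputed table entry whose Z is known to be 1, so Z is not stored.
type pointExtendedNormalized struct{ X, Y, T fr.Element }

// addMixed adds a Z==1 table entry into the accumulator; the formula is
// sketched under "Finer details" below.
func addMixed(acc *pointExtended, q *pointExtendedNormalized) { /* ... */ }

// accumulate shows the shape of the per-window aggregation: one group
// addition per non-zero scalar window. Before this PR each of these was a
// projective mixed addition against an affine table entry; now it is an
// extended-coordinates mixed addition, which spends fewer base-field
// multiplications.
func accumulate(acc *pointExtended, table []pointExtendedNormalized, windows []int) {
	for _, w := range windows {
		if w == 0 {
			continue // a zero window contributes nothing
		}
		addMixed(acc, &table[w-1])
	}
}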

The path to doing this was longer than expected, since some groundwork was needed before the change itself:

  • Stop relying on generated code and depend directly on gnark-crypto, which already supported extended points (PR 1, PR 2). [This was good anyway, since we had planned it for a while: less code to maintain and audit.]
  • After that, once I started on this change, I detected a regression in gnark-crypto, which I helped solve in that repo.
  • Then I realized gnark-crypto was doing unintended inversions in extended-point additions, so I helped fix that in the library as well.
  • Then this PR was possible!

The bottom line is that we're saving two multiplications per group operation (rather than the one we expected in the chat with Gottfried).

Benchmarks

AMD Ryzen 7 3800XT 8-Core Processor:

name                                  old time/op    new time/op    delta
PrecompMSM/msm_length=1/precomp-16      3.39µs ± 2%    2.87µs ± 2%  -15.57%  (p=0.000 n=10+10)
PrecompMSM/msm_length=2/precomp-16      6.63µs ± 1%    5.70µs ± 1%  -14.05%  (p=0.000 n=9+9)
PrecompMSM/msm_length=4/precomp-16      13.1µs ± 1%    11.2µs ± 3%  -14.53%  (p=0.000 n=10+10)
PrecompMSM/msm_length=8/precomp-16      35.3µs ± 2%    30.7µs ± 1%  -13.11%  (p=0.000 n=10+8)
PrecompMSM/msm_length=16/precomp-16     88.8µs ± 2%    76.9µs ± 1%  -13.42%  (p=0.000 n=10+10)
PrecompMSM/msm_length=32/precomp-16      208µs ± 1%     178µs ± 1%  -14.49%  (p=0.000 n=8+8)
PrecompMSM/msm_length=64/precomp-16      451µs ± 1%     393µs ± 1%  -12.90%  (p=0.000 n=8+9)
PrecompMSM/msm_length=128/precomp-16     959µs ± 2%     831µs ± 1%  -13.40%  (p=0.000 n=10+10)
PrecompMSM/msm_length=256/precomp-16    1.98ms ± 1%    1.70ms ± 2%  -13.89%  (p=0.000 n=8+10)

This is an amd64 processor, so it uses gnark-crypto's assembly for bigint ops.

Rock5B (this machine has a different version of the benchmark-comparison tool, so you might see some differences in how magnitudes/units are displayed):

PrecompMSM/msm_length=1/precomp-8      19.40µ ± 3%   16.87µ ± 0%  -13.06% (p=0.000 n=10)
PrecompMSM/msm_length=2/precomp-8      39.21µ ± 3%   32.30µ ± 2%  -17.63% (p=0.000 n=10)
PrecompMSM/msm_length=4/precomp-8      77.64µ ± 1%   65.43µ ± 1%  -15.73% (p=0.000 n=10)
PrecompMSM/msm_length=8/precomp-8      212.7µ ± 2%   182.4µ ± 2%  -14.25% (p=0.000 n=10)
PrecompMSM/msm_length=16/precomp-8     522.4µ ± 2%   456.2µ ± 2%  -12.67% (p=0.000 n=10)
PrecompMSM/msm_length=32/precomp-8     1.170m ± 2%   1.017m ± 2%  -13.12% (p=0.000 n=10)
PrecompMSM/msm_length=64/precomp-8     2.508m ± 3%   2.114m ± 2%  -15.71% (p=0.000 n=10)
PrecompMSM/msm_length=128/precomp-8    5.042m ± 2%   4.337m ± 2%  -13.99% (p=0.000 n=10)
PrecompMSM/msm_length=256/precomp-8    10.543m ± 2%   9.347m ± 3%  -11.35% (p=0.000 n=10)

This is an arm machine, so it doesn't use the assembly backend (and it has a lower clock speed).

The speedups in both cases are roughly the same, which makes sense, since this optimization mostly avoids work and is independent of whichever bigint implementation is used.

Finer details

It turns out I also had to include an extended-point addition formula further optimized for the case where the second point has Z==1, which is always true for our precomputed points. That saves one extra mul compared to the formula gnark-crypto uses. (More about this in the PR comments.)
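To make the saving concrete, here is a sketch of that mixed-addition formula, continuing the placeholder types from the sketch above. It follows the standard unified extended twisted Edwards addition (same A..H letters as in the review excerpts below), with Bandersnatch's a = -5; the curve constant d is left as an unset placeholder, so this is illustrative rather than a drop-in implementation.

// curveD stands in for Bandersnatch's twisted Edwards d constant (value
// omitted in this sketch).
var curveD fr.Element

// addMixed sets acc = acc + q for a precomputed q whose Z == 1. Because
// Z2 == 1, D = Z1*Z2 collapses to D = Z1, which is the extra saved
// multiplication mentioned above.
func addMixed(acc *pointExtended, q *pointExtendedNormalized) {
	var A, B, C, D, E, F, G, H, t fr.Element

	A.Mul(&acc.X, &q.X) // A = X1*X2
	B.Mul(&acc.Y, &q.Y) // B = Y1*Y2
	C.Mul(&acc.T, &q.T) // C = d*T1*T2
	C.Mul(&C, &curveD)
	D.Set(&acc.Z) // D = Z1*Z2 = Z1, since Z2 == 1

	E.Add(&acc.X, &acc.Y) // E = (X1+Y1)*(X2+Y2) - A - B
	t.Add(&q.X, &q.Y)
	E.Mul(&E, &t)
	E.Sub(&E, &A)
	E.Sub(&E, &B)

	F.Sub(&D, &C) // F = D - C
	G.Add(&D, &C) // G = D + C
	H.Neg(&A)     // H = B - a*A = B + 5*A, since a = -5
	fr.MulBy5(&H)
	H.Sub(&B, &H)

	acc.X.Mul(&E, &F) // X3 = E*F
	acc.Y.Mul(&G, &H) // Y3 = G*H
	acc.T.Mul(&E, &H) // T3 = E*H
	acc.Z.Mul(&F, &G) // Z3 = F*G
}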

Also, I had to create another "extended point" flavor, which I named "normalized": since our precomputed points always have Z==1, it doesn't make sense to use the gnark-crypto struct, which would waste storage for an always-1 Z coordinate in every precomputed point. This makes the precomputed table grow by 50% (~110MiB) rather than 100% (~220MiB).
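As a quick back-of-the-envelope, assuming a 32-byte base-field element (fr.Element is 4 × 8 bytes), dropping the always-1 Z keeps the per-point growth to a single extra coordinate, which matches the ~110MiB vs ~220MiB figures above. The constant names below are illustrative:

const (
	feBytes         = 32          // one base-field element
	affineBytes     = 2 * feBytes // X, Y       -> previous table layout (64 bytes/point)
	normalizedBytes = 3 * feBytes // X, Y, T    -> +50% over affine (96 bytes/point)
	extendedBytes   = 4 * feBytes // X, Y, T, Z -> +100% over affine (128 bytes/point)
)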

At some point we can chat with the gnark team about whether it makes sense to add a new method with this optimized formula for Z==1, or to add an extra if to the current procedure to skip the unneeded multiplication. (In the latter case we'd have to double-check whether the potential branch misprediction matters, but I doubt it's relevant; we can measure.)
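Purely as an illustration of the second option, a hypothetical sketch of what that branch could look like (reusing the placeholder types from above; this is not actual gnark-crypto code):

// addGeneric sketches a generic extended-points addition that special-cases
// Z2 == 1 instead of exposing a dedicated method.
func addGeneric(acc, q *pointExtended) {
	var D fr.Element
	if q.Z.IsOne() {
		// Z2 == 1: D = Z1*Z2 collapses to Z1, skipping one multiplication
		// (any branch-misprediction cost would need measuring, as noted above).
		D.Set(&acc.Z)
	} else {
		D.Mul(&acc.Z, &q.Z)
	}
	// ...the remaining steps proceed exactly as in addMixed above...
}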

TODO:

  • Run in Kaustinen to triple-check correctness.

Review threads (all resolved): bandersnatch/bandersnatch.go, banderwagon/precomp.go
Signed-off-by (all 6 commits): Ignacio Hagopian <jsign.uy@gmail.com>
@jsign jsign marked this pull request as ready for review October 18, 2023 17:09
@jsign jsign requested a review from kevaundray October 18, 2023 17:09
@jsign (Collaborator, Author) commented Oct 18, 2023

@kevaundray, this should be ready for review. I've run it on the new Kaustinen testnet, which has >30k blocks, and a node with this version synced correctly, so it looks good.

G.Add(&D, &C)
H.Set(&A)

// mulBy5(&H)
Contributor:
nit: I'm guessing this comment can be removed or put one line down

H.Neg(&H)
gnarkfr.MulBy5(&H)

H.Sub(&B, &H)
Contributor:
Thanks for making this match the referenced link including the letters used :)

@@ -5,20 +5,24 @@ import (
"io"

gnarkbandersnatch "github.com/consensys/gnark-crypto/ecc/bls12-381/bandersnatch"
gnarkfr "github.com/consensys/gnark-crypto/ecc/bls12-381/fr"
Contributor:
nit: note that although this is gnarkfr (bls12-381), it's the fp for bandersnatch, so maybe gnarkfp or something along those lines would be less confusing, since we already use fp in this file.

@kevaundray merged commit ff2c8f7 into master on Oct 21, 2023
2 checks passed