Implement ASM for field ops/Arithmetic optimisations #78

Open
6 tasks
jon-chuang opened this issue Mar 29, 2020 · 10 comments
Labels
T-feature Type: new features T-performance Type: performance improvements

Comments

@jon-chuang
Contributor

It seems unacceptable that a primitive that can be accelerated (leading to ~1.5x speedup) is not.

Hence I suggest looking into how one should add assembly for field ops to make use of vectorised instructions etc. I also suggest investigating modes of multiplication beyond the textbook method, such as Karatsuba multiplication; this may not lead to huge speedups on x86_64, but might on other platforms. Further, one should look into optimisations for faster extension field arithmetic.

Finally, I propose looking into how much of this code can be generated generically for fields of arbitrary size, using either macros or function calls to wrapped assembly, with target-dependent compilation.

The current pairing timings are about 2-2.5ms for the BLS12 curves on a 3+GHz machine. I believe this can be improved closer to the theoretical limit of roughly 1.1-1.3ms. As a target, I would set 1.5ms.

  • x86_64
  • ARM
  • AMD64 (lower priority)
  • Karatsuba Multiplication
  • Explore faster extension field arithmetic
  • Macro-based code gen/ASM gen
@Pratyush
Member

If you'd like to implement this, please go ahead, but I'd prefer to merge this behind a feature if we're going to be using unsafe code.

@jon-chuang
Contributor Author

jon-chuang commented Mar 29, 2020

Sure, I can abide by that. By the way, there seems to be a discrepancy between the full pairing timings for BLS12-381 (~2ms) and BLS12-377 (~2.5ms). I have replicated it 3 times. This does not quite stand to reason, as the underlying fields are the same size. Do you know of possible reasons?

@Pratyush
Member

The curves are of different types (M-type vs D-type), and so have different final exponentiation algorithms.

@Pratyush
Member

btw, you don't necessarily need to drop down to assembly to utilize SIMD instructions; these are available as intrinsics, for example here: https://doc.rust-lang.org/1.29.1/core/arch/x86_64/index.html
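
To illustrate, a minimal sketch (not from this codebase) of calling the `core::arch::x86_64` intrinsics from safe-looking Rust, with runtime feature detection and a scalar fallback — here just a lane-wise 64-bit add over one SSE2 vector:

```rust
#[cfg(target_arch = "x86_64")]
fn simd_add_pairs(a: [i64; 2], b: [i64; 2]) -> [i64; 2] {
    // SSE2 is baseline on x86_64, but the detection pattern generalises to AVX2 etc.
    if is_x86_feature_detected!("sse2") {
        use std::arch::x86_64::*;
        unsafe {
            // _mm_set_epi64x takes (high, low); storeu writes the low lane first.
            let va = _mm_set_epi64x(a[1], a[0]);
            let vb = _mm_set_epi64x(b[1], b[0]);
            let sum = _mm_add_epi64(va, vb);
            let mut out = [0i64; 2];
            _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, sum);
            return out;
        }
    }
    [a[0].wrapping_add(b[0]), a[1].wrapping_add(b[1])]
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        assert_eq!(simd_add_pairs([1, 2], [10, 20]), [11, 22]);
        println!("ok");
    }
}
```

The intrinsic calls are still `unsafe`, as noted below, but the unsafety is confined to individual, easily audited expressions rather than whole assembly routines.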

@jon-chuang
Contributor Author

jon-chuang commented Mar 30, 2020

I see. I will try to study this. I noticed the Miller loop for 377 is slower too. Is there a reason why Zexe uses the 377 rather than the 381 curve? Is this to have a power-of-2 subgroup in both p and r? Perhaps I should read the Zexe paper more carefully.

I will test out using the SIMD instructions provided first. From what I gather, it's still unsafe rust.

@jon-chuang
Contributor Author

jon-chuang commented Mar 30, 2020

I've looked into things. SIMD isn't efficient for small numbers unless one uses an unsaturated representation, which does not work for general moduli. Nonetheless, one could try to rewrite the point addition formulas to take advantage of SIMD by parallelising at the data level rather than at the representation level (adding 4 field elements at the same time etc.), rearranging ops to avoid data dependencies that stall the pipeline; but this is a lot of manual work. There is some work on 32-bit SIMD which seems performant that I will try to review more carefully. In particular, I will try to see whether strategies used to target NEON can be applied to WASM SIMD.

The Rust `std::arch` intrinsics do not support Intel's ADX multiprecision add-with-carry (adcx/adox) instructions for now, so we will have to resort to writing assembly. Since it is difficult to maintain this assembly for all the given field sizes, I will try to do code gen with macros and focus on schoolbook Montgomery with goff's optimisation. I will try to match or beat goff's performance.
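
For reference, a single-limb sketch of the Montgomery multiplication being discussed (the real target is multi-limb CIOS driven by mulx/adcx/adox; the 64-bit modulus here is chosen purely for illustration and is not one of the curve fields):

```rust
/// Inverse of odd `n` modulo 2^64 via Newton iteration.
/// For odd n, n*n ≡ 1 (mod 8), so x = n is correct to 3 bits;
/// each step doubles the number of correct bits, so 5 steps cover 64.
fn inv_mod_2_64(n: u64) -> u64 {
    let mut x = n;
    for _ in 0..5 {
        x = x.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(x)));
    }
    x
}

/// Single-limb Montgomery multiplication: returns a * b * 2^-64 mod n,
/// assuming a, b < n, n odd, and n_prime = -n^-1 mod 2^64.
fn mont_mul(a: u64, b: u64, n: u64, n_prime: u64) -> u64 {
    let t = a as u128 * b as u128;
    let m = (t as u64).wrapping_mul(n_prime); // chosen so t + m*n ≡ 0 mod 2^64
    let mn = m as u128 * n as u128;
    // The low halves of t and m*n sum to exactly 0 or 2^64, so the carry out
    // of the low word is 1 iff the low half of t is nonzero.
    let carry = (t as u64 != 0) as u128;
    let u = (t >> 64) + (mn >> 64) + carry; // (t + m*n) / 2^64, which is < 2n
    if u >= n as u128 { (u - n as u128) as u64 } else { u as u64 }
}

fn main() {
    // Illustrative odd 64-bit prime (the "Goldilocks" prime).
    const N: u64 = 0xFFFF_FFFF_0000_0001;
    let n_prime = inv_mod_2_64(N).wrapping_neg();
    // mont_mul(a*R mod N, b) = a*b mod N, since 2^-64 cancels one factor of R = 2^64.
    let r = ((1u128 << 64) % N as u128) as u64;
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let a_r = ((a as u128 * r as u128) % N as u128) as u64;
    assert_eq!(
        mont_mul(a_r, b, N, n_prime) as u128,
        (a as u128 * b as u128) % N as u128
    );
    println!("ok");
}
```

The multi-limb version interleaves this reduction with the schoolbook multiply limb by limb, which is exactly where the adcx/adox carry chains pay off.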

Karatsuba might be reserved for curves over larger base fields, like the SW and BW curves, and will be prioritised later.
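
To make the Karatsuba remark concrete, a toy sketch at the smallest possible scale — one 64×64 → 128-bit product from three 32×32 multiplies instead of four (the real wins only appear at larger operand sizes, where the recursion amortises the extra additions):

```rust
/// Karatsuba 64x64 -> 128-bit multiply using three 32x32 products.
fn karatsuba_mul(a: u64, b: u64) -> u128 {
    let (a0, a1) = (a & 0xFFFF_FFFF, a >> 32);
    let (b0, b1) = (b & 0xFFFF_FFFF, b >> 32);
    let z0 = a0 as u128 * b0 as u128; // low halves
    let z2 = a1 as u128 * b1 as u128; // high halves
    // Middle term from one multiply: (a0+a1)(b0+b1) - z0 - z2 = a0*b1 + a1*b0.
    // a0 + a1 fits in 33 bits, so the sums cannot overflow u64.
    let z1 = (a0 + a1) as u128 * (b0 + b1) as u128 - z0 - z2;
    (z2 << 64) + (z1 << 32) + z0
}

fn main() {
    assert_eq!(
        karatsuba_mul(u64::MAX, u64::MAX),
        u64::MAX as u128 * u64::MAX as u128
    );
    println!("ok");
}
```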

@Pratyush
Member

Thanks, this sounds like a good plan!

@jon-chuang
Contributor Author

As an update: I managed to integrate Fp256 assembly, leading to about a 25% speedup in Fp256 multiplications. But writing generic assembly code is proving harder than expected, and the inline assembly module doesn't play nicely with catching errors and doing what it's supposed to.

I am going to try to simplify my code by writing smaller chunks of inline assembly that are easier to verify as correct piecemeal, and that hopefully still allow the Rust compiler to reason about memory fetches better than I can. Hopefully, I will be able to tweak both the Rust and assembly sides of the code with careful benchmarking.
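
As an example of the "small verifiable chunks" idea (a sketch only — limb counts, register constraints, and operand layout in the real code will differ), a two-limb add-with-carry chain in Rust's `asm!`, checked against a portable fallback:

```rust
/// Two-limb (little-endian) addition returning the carry-out,
/// via an add/adc chain on x86_64.
#[cfg(target_arch = "x86_64")]
fn add2(a: [u64; 2], b: [u64; 2]) -> ([u64; 2], bool) {
    let (mut lo, mut hi) = (a[0], a[1]);
    let carry: u8;
    unsafe {
        core::arch::asm!(
            "add {lo}, {b0}", // lo += b0, sets CF
            "adc {hi}, {b1}", // hi += b1 + CF
            "setc {c}",       // capture the final carry flag
            lo = inout(reg) lo,
            hi = inout(reg) hi,
            b0 = in(reg) b[0],
            b1 = in(reg) b[1],
            c = out(reg_byte) carry,
        );
    }
    ([lo, hi], carry != 0)
}

/// Portable fallback using overflowing_add, for other targets.
#[cfg(not(target_arch = "x86_64"))]
fn add2(a: [u64; 2], b: [u64; 2]) -> ([u64; 2], bool) {
    let (lo, c0) = a[0].overflowing_add(b[0]);
    let (hi, c1) = a[1].overflowing_add(b[1]);
    let (hi, c2) = hi.overflowing_add(c0 as u64);
    ([lo, hi], c1 || c2)
}

fn main() {
    // u64::MAX + 1 carries into the high limb.
    assert_eq!(add2([u64::MAX, 0], [1, 0]), ([0, 1], false));
    assert_eq!(add2([u64::MAX, u64::MAX], [1, 0]), ([0, 0], true));
    println!("ok");
}
```

Units this small can be cross-checked exhaustively against the pure-Rust version, which is the point of the piecemeal approach.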

Also, I should use the following tool for benchmarking clock cycles: https://git.zx2c4.com/kbench9000/about/. But it may only work with C? In any case, it seems better practice to benchmark clock cycles rather than wall-clock time, since there is a lot of clock speed volatility over time, which is undesirable. I will look into what Rust tooling exists.
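
A rough Rust equivalent of the cycle-counting idea (a sketch: `_rdtsc` reads the time-stamp counter, which on modern CPUs ticks at a constant reference rate rather than the true core clock, and serialising fences would be needed for precise measurement):

```rust
/// Count TSC reference cycles spent in `f` on x86_64.
#[cfg(target_arch = "x86_64")]
fn cycles<F: FnOnce()>(f: F) -> u64 {
    use std::arch::x86_64::_rdtsc;
    let start = unsafe { _rdtsc() };
    f();
    let end = unsafe { _rdtsc() };
    end.wrapping_sub(start)
}

/// Fallback on other targets: nanoseconds, not cycles.
#[cfg(not(target_arch = "x86_64"))]
fn cycles<F: FnOnce()>(f: F) -> u64 {
    let t = std::time::Instant::now();
    f();
    t.elapsed().as_nanos() as u64
}

fn main() {
    // Workload under test: a dependent chain of wrapping multiplies.
    let mut x = 1u64;
    let c = cycles(|| {
        for _ in 0..1_000 {
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        }
    });
    println!("~{} cycles for 1000 dependent multiplies (x = {})", c, x);
}
```

A single raw reading like this is still noisy; kbench9000-style tooling takes many trials and reports the minimum, which is the number worth reproducing.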

@Pratyush
Member

Pratyush commented Apr 6, 2020

Re: benchmarking in terms of cycles, one way could be to look into how perf.rust-lang.org counts number of instructions instead of wall-clock time.

@Pratyush
Member

@jon-chuang what's the status of this? Should this be closed?

@Pratyush Pratyush transferred this issue from arkworks-rs/snark Nov 20, 2020
@Pratyush Pratyush added T-feature Type: new features T-performance Type: performance improvements labels Nov 20, 2020