Implement ASM for field ops/Arithmetic optimisations #78

Open
6 tasks
jon-chuang opened this issue Mar 29, 2020 · 10 comments
Labels
T-feature Type: new features T-performance Type: performance improvements

Comments

@jon-chuang
Contributor

It seems unacceptable that a primitive that can be accelerated (leading to ~1.5x speedup) is not.

Hence I suggest looking into how one should add assembly for field ops to make use of vectorised instructions etc. I also suggest investigating modes of multiplication beyond the textbook method, such as Karatsuba multiplication; this may not lead to huge speedups on x86_64, but might on other platforms. Further, one should look into optimisations for faster extension field arithmetic.

Finally, I propose looking into how much of this code can be generated generically for fields of arbitrary size, using either macros or function calls to wrapped assembly, with target-dependent compilation.

The current pairing timings are about 2-2.5ms for the BLS12 curves on a 3+GHz machine. I believe this can be improved closer to the theoretical limit of roughly 1.1-1.3ms. As a target, I would set 1.5ms.

  • x86_64
  • ARM
  • AMD64 (lower priority)
  • Karatsuba Multiplication
  • Explore faster extension field arithmetic
  • Macro-based code gen/ASM gen
@Pratyush
Member

If you'd like to implement this, please go ahead, but I'd prefer to merge this behind a feature if we're going to be using unsafe code.

@jon-chuang
Contributor Author

jon-chuang commented Mar 29, 2020

Sure, I can abide by that. By the way, there seems to be a discrepancy between the full pairing timings for BLS12-381 (~2ms) and BLS12-377 (~2.5ms). I have replicated it 3 times. This does not quite stand to reason, as the underlying fields are the same size. Do you know of possible reasons?

@Pratyush
Member

The curves are of different types (M-type vs D-type), and so have different final exponentiation algorithms.

@Pratyush
Member

btw, you don't necessarily need to drop down to assembly to utilize SIMD instructions; these are available as intrinsics, for example here: https://doc.rust-lang.org/1.29.1/core/arch/x86_64/index.html
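
To illustrate, a minimal sketch (not from this codebase) of calling the `core::arch::x86_64` intrinsics from safe-looking Rust, with runtime feature detection and a scalar fallback — here just a lane-wise 64-bit add over one SSE2 vector:

```rust
#[cfg(target_arch = "x86_64")]
fn simd_add_pairs(a: [i64; 2], b: [i64; 2]) -> [i64; 2] {
    // SSE2 is baseline on x86_64, but the detection pattern generalises to AVX2 etc.
    if is_x86_feature_detected!("sse2") {
        use std::arch::x86_64::*;
        unsafe {
            // _mm_set_epi64x takes (high, low); storeu writes the low lane first.
            let va = _mm_set_epi64x(a[1], a[0]);
            let vb = _mm_set_epi64x(b[1], b[0]);
            let sum = _mm_add_epi64(va, vb);
            let mut out = [0i64; 2];
            _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, sum);
            return out;
        }
    }
    [a[0].wrapping_add(b[0]), a[1].wrapping_add(b[1])]
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        assert_eq!(simd_add_pairs([1, 2], [10, 20]), [11, 22]);
        println!("ok");
    }
}
```

The intrinsic calls are still `unsafe`, as noted below, but the unsafety is confined to individual, easily audited expressions rather than whole assembly routines.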

@jon-chuang
Contributor Author

jon-chuang commented Mar 30, 2020

I see. I will try to study this. I noticed the Miller loop for 377 is slower too. Is there a reason why Zexe uses the 377 rather than the 381 curve? Is this to have a power-of-2 subgroup in both p and r? Perhaps I should read the Zexe paper more carefully.

I will test out using the SIMD instructions provided first. From what I gather, it's still unsafe rust.

@jon-chuang
Contributor Author

jon-chuang commented Mar 30, 2020

I've looked into things. SIMD isn't efficient for small numbers unless one uses an unsaturated representation, which does not work for general moduli. Nonetheless, one could try to rewrite the point addition formulas to take advantage of SIMD by parallelising at the data level rather than at the representation level (adding 4 field elements at the same time etc.), rearranging ops to avoid data dependencies that stall the pipeline; but this is a lot of manual work. There is some work on 32-bit SIMD which seems performant that I will try to review more carefully. In particular, I will try to see whether strategies used to target NEON can be applied to WASM SIMD.

The Rust `std::arch` intrinsics do not support Intel's ADX multiprecision add-with-carry (adcx/adox) instructions for now, so we will have to resort to writing assembly. Since it is difficult to maintain this assembly for all the given field sizes, I will try to do code gen with macros and focus on schoolbook Montgomery with goff's optimisation. I will try to match or beat goff's performance.
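
For reference, a single-limb sketch of the Montgomery multiplication being discussed (the real target is multi-limb CIOS driven by mulx/adcx/adox; the 64-bit modulus here is chosen purely for illustration and is not one of the curve fields):

```rust
/// Inverse of odd `n` modulo 2^64 via Newton iteration.
/// For odd n, n*n ≡ 1 (mod 8), so x = n is correct to 3 bits;
/// each step doubles the number of correct bits, so 5 steps cover 64.
fn inv_mod_2_64(n: u64) -> u64 {
    let mut x = n;
    for _ in 0..5 {
        x = x.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(x)));
    }
    x
}

/// Single-limb Montgomery multiplication: returns a * b * 2^-64 mod n,
/// assuming a, b < n, n odd, and n_prime = -n^-1 mod 2^64.
fn mont_mul(a: u64, b: u64, n: u64, n_prime: u64) -> u64 {
    let t = a as u128 * b as u128;
    let m = (t as u64).wrapping_mul(n_prime); // chosen so t + m*n ≡ 0 mod 2^64
    let mn = m as u128 * n as u128;
    // The low halves of t and m*n sum to exactly 0 or 2^64, so the carry out
    // of the low word is 1 iff the low half of t is nonzero.
    let carry = (t as u64 != 0) as u128;
    let u = (t >> 64) + (mn >> 64) + carry; // (t + m*n) / 2^64, which is < 2n
    if u >= n as u128 { (u - n as u128) as u64 } else { u as u64 }
}

fn main() {
    // Illustrative odd 64-bit prime (the "Goldilocks" prime).
    const N: u64 = 0xFFFF_FFFF_0000_0001;
    let n_prime = inv_mod_2_64(N).wrapping_neg();
    // mont_mul(a*R mod N, b) = a*b mod N, since 2^-64 cancels one factor of R = 2^64.
    let r = ((1u128 << 64) % N as u128) as u64;
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let a_r = ((a as u128 * r as u128) % N as u128) as u64;
    assert_eq!(
        mont_mul(a_r, b, N, n_prime) as u128,
        (a as u128 * b as u128) % N as u128
    );
    println!("ok");
}
```

The multi-limb version interleaves this reduction with the schoolbook multiply limb by limb, which is exactly where the adcx/adox carry chains pay off.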

Karatsuba might be reserved for curves over larger base fields, like the SW and BW curves, and will be prioritised later.
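
To make the Karatsuba remark concrete, a toy sketch at the smallest possible scale — one 64×64 → 128-bit product from three 32×32 multiplies instead of four (the real wins only appear at larger operand sizes, where the recursion amortises the extra additions):

```rust
/// Karatsuba 64x64 -> 128-bit multiply using three 32x32 products.
fn karatsuba_mul(a: u64, b: u64) -> u128 {
    let (a0, a1) = (a & 0xFFFF_FFFF, a >> 32);
    let (b0, b1) = (b & 0xFFFF_FFFF, b >> 32);
    let z0 = a0 as u128 * b0 as u128; // low halves
    let z2 = a1 as u128 * b1 as u128; // high halves
    // Middle term from one multiply: (a0+a1)(b0+b1) - z0 - z2 = a0*b1 + a1*b0.
    // a0 + a1 fits in 33 bits, so the sums cannot overflow u64.
    let z1 = (a0 + a1) as u128 * (b0 + b1) as u128 - z0 - z2;
    (z2 << 64) + (z1 << 32) + z0
}

fn main() {
    assert_eq!(
        karatsuba_mul(u64::MAX, u64::MAX),
        u64::MAX as u128 * u64::MAX as u128
    );
    println!("ok");
}
```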

@Pratyush
Member

Thanks, this sounds like a good plan!

@jon-chuang
Contributor Author

As an update: I managed to integrate Fp256 assembly, leading to about a 25% speedup in Fp256 multiplications. But writing generic assembly code is proving harder than expected, and the inline assembly module doesn't play nicely with catching errors and doing what it's supposed to.

I am going to try to simplify my code by writing smaller chunks of inline assembly that are easier to verify as correct piecemeal, and that hopefully still allow the Rust compiler to reason about memory fetches better than I can. Hopefully, I will be able to tweak both the Rust and assembly sides of the code with careful benchmarking.
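
As an example of the "small verifiable chunks" idea (a sketch only — limb counts, register constraints, and operand layout in the real code will differ), a two-limb add-with-carry chain in Rust's `asm!`, checked against a portable fallback:

```rust
/// Two-limb (little-endian) addition returning the carry-out,
/// via an add/adc chain on x86_64.
#[cfg(target_arch = "x86_64")]
fn add2(a: [u64; 2], b: [u64; 2]) -> ([u64; 2], bool) {
    let (mut lo, mut hi) = (a[0], a[1]);
    let carry: u8;
    unsafe {
        core::arch::asm!(
            "add {lo}, {b0}", // lo += b0, sets CF
            "adc {hi}, {b1}", // hi += b1 + CF
            "setc {c}",       // capture the final carry flag
            lo = inout(reg) lo,
            hi = inout(reg) hi,
            b0 = in(reg) b[0],
            b1 = in(reg) b[1],
            c = out(reg_byte) carry,
        );
    }
    ([lo, hi], carry != 0)
}

/// Portable fallback using overflowing_add, for other targets.
#[cfg(not(target_arch = "x86_64"))]
fn add2(a: [u64; 2], b: [u64; 2]) -> ([u64; 2], bool) {
    let (lo, c0) = a[0].overflowing_add(b[0]);
    let (hi, c1) = a[1].overflowing_add(b[1]);
    let (hi, c2) = hi.overflowing_add(c0 as u64);
    ([lo, hi], c1 || c2)
}

fn main() {
    // u64::MAX + 1 carries into the high limb.
    assert_eq!(add2([u64::MAX, 0], [1, 0]), ([0, 1], false));
    assert_eq!(add2([u64::MAX, u64::MAX], [1, 0]), ([0, 0], true));
    println!("ok");
}
```

Units this small can be cross-checked exhaustively against the pure-Rust version, which is the point of the piecemeal approach.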

Also, I should use the following tool for benchmarking clock cycles: https://git.zx2c4.com/kbench9000/about/. But it may only work with C? In any case, it seems better practice to benchmark clock cycles rather than wall-clock time, since there is a lot of clock speed volatility over time, which is undesirable. I will look into what Rust tooling exists.
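
A rough Rust equivalent of the cycle-counting idea (a sketch: `_rdtsc` reads the time-stamp counter, which on modern CPUs ticks at a constant reference rate rather than the true core clock, and serialising fences would be needed for precise measurement):

```rust
/// Count TSC reference cycles spent in `f` on x86_64.
#[cfg(target_arch = "x86_64")]
fn cycles<F: FnOnce()>(f: F) -> u64 {
    use std::arch::x86_64::_rdtsc;
    let start = unsafe { _rdtsc() };
    f();
    let end = unsafe { _rdtsc() };
    end.wrapping_sub(start)
}

/// Fallback on other targets: nanoseconds, not cycles.
#[cfg(not(target_arch = "x86_64"))]
fn cycles<F: FnOnce()>(f: F) -> u64 {
    let t = std::time::Instant::now();
    f();
    t.elapsed().as_nanos() as u64
}

fn main() {
    // Workload under test: a dependent chain of wrapping multiplies.
    let mut x = 1u64;
    let c = cycles(|| {
        for _ in 0..1_000 {
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        }
    });
    println!("~{} cycles for 1000 dependent multiplies (x = {})", c, x);
}
```

A single raw reading like this is still noisy; kbench9000-style tooling takes many trials and reports the minimum, which is the number worth reproducing.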

@Pratyush
Member

Pratyush commented Apr 6, 2020

Re: benchmarking in terms of cycles, one way could be to look into how perf.rust-lang.org counts number of instructions instead of wall-clock time.

@Pratyush
Member

@jon-chuang what's the status of this? Should this be closed?

@Pratyush Pratyush transferred this issue from arkworks-rs/snark Nov 20, 2020
@Pratyush Pratyush added T-feature Type: new features T-performance Type: performance improvements labels Nov 20, 2020