Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEON backend for aarch64 #457

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

Conversation

rubdos
Copy link
Contributor

@rubdos rubdos commented Dec 8, 2022

As promised in #449! This is joint work with @Tarinn. In fact, @Tarinn did most of the work :-)

Fixes #147 for Aarch64, and you'll see an ARMv7 PR coming in after this.

I'm working on benchmarks across several devices now (including an unpublished very hacky armv7 version), will update here when they become available. We're seeing speedups of 20-30% in relevant benchmarks. This is the same code as in zkcrypto#19.

I'll be cleaning up the commits in the next few minutes (trailing white space fixes, mostly), and I'll run a comparison benchmark for 343be3a, because that was only introduced for future ARMv7 support, and has not been tested under load.

TODO:

  • Cleanup whitespace and commits
  • Test shuffle! call vs vqtbx1q_u8 performance
  • Add Aarch64 cross build to CI
  • Figure out how-the-💥 we get rid of the ARMv8 TBL/TBX issue
  • Add NEON documentation to README
    Explain that ARMv7 is not yet supported

@rubdos
Copy link
Contributor Author

rubdos commented Dec 8, 2022

The shuffle! macro use in 26c494f seems to win 1.5% to 3% over the use of the vqtbx1q_u8 intrinsic on a Raspberry Pi 4. Seems like the LLVM compiler people know what they're doing.

I'm testing another point where we can use the shuffle! macro instead of some manual lane swapping, will report back.

@rubdos
Copy link
Contributor Author

rubdos commented Dec 8, 2022

2d343e9 does not have an influence on performance, but the code is quite a bit cleaner, so let's leave it in.

Comment on lines 37 to 45
test-simd:
name: Test simd backend (nightly, aarch64)
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: dtolnay/rust-toolchain@nightly
with:
toolchain: aarch64-unknown-linux-gnu
- run: cargo test --no-default-features --target aarch64-unknown-linux-gnu --features "std simd_backend"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work without cross?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust has a built-in cross-compiler. I've never really used cross to cross-compile myself; always had to roll my own due to circumstances.

Probably, the underlaying Ubuntu machine needs to install a gcc toolchain to actually work, but I hoped that the CI would give me some feedback already.

Copy link
Contributor

@tarcieri tarcieri Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cross-compiler works for builds, but how does it run the tests on a non-native architecture?

cross uses QEMU

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay, I did not spot that. The new version is only with cargo build, not with test, based on the new build-simd CI snippet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using cross for this as it should be fine for testing NEON in CI.

Here are some examples (and it's probably fine to reuse the @RustCrypto cross-install workflow too)

https://github.com/RustCrypto/block-ciphers/blob/master/.github/workflows/aes.yml#L201-L249

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like that's done in release/4.0 now, so I'll just drop my own commit.

I'd still like to point out that when I submit the armv7hl NEON work, it will work on nightly only for a while, so you might want to add armv7-nightly to the test matrix.

@tarcieri
Copy link
Contributor

tarcieri commented Dec 8, 2022

@rubdos we're working off the releases/4.0 branch for development, so you'll definitely want to rebase this

@rubdos rubdos changed the base branch from main to release/4.0 December 9, 2022 10:24
@rubdos rubdos marked this pull request as ready for review December 12, 2022 16:04
@jrose-signal
Copy link
Contributor

Don't forget to make sure it's still constant-time! I don't actually know what that entails; it is not something I've looked into before. But a good amount of effort was put into the initial version of this crate using constant-time operations, and new backends should adhere to that too.

@rubdos
Copy link
Contributor Author

rubdos commented Dec 13, 2022

@jrose-signal Indeed, that's pretty important. As far as @Tarinn and I know, this code should be constant time. It is a direct port of the AVX2 work, and there should be no time-dependent calls in there. The most sketchy call that I could see is the shuffle!, but that does not take secret-dependent indexes.

I'm not sure how this is generally done. Should we ask some independent reviewer to have a look at the code?

@pinkforest
Copy link
Contributor

pinkforest commented Dec 13, 2022

fyi - we're continuing all development via in main -

Could this be re-based to main which is basically from release/4.0 - Thanks!

Also would we have any benchmarks between u64, neon and fiat ?
e.g. a table like this would be great: https://github.com/pinkforest/ed25519-dalek/tree/docs-docsrs

For constant time, e.g. subtle CsOption etc.

@rubdos
Copy link
Contributor Author

rubdos commented Dec 13, 2022

fyi - we're continuing all development via in main -

Could this be re-based to main which is basically from release/4.0 - Thanks!

Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.

Do we have any benchmarks between u64, neon and fiat ?
e.g. a table like this would be great: https://github.com/pinkforest/ed25519-dalek/tree/docs-docsrs

I'm working on this across a bunch of devices (including armv7). I'd like these results to appear in a small publication, but I'll certainly dump a summary in the README when I have them. I should have something by the end of the week.

@pinkforest
Copy link
Contributor

Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.

Yeah this is going to stay in main

I'm working on this across a bunch of devices

Awesome! 🥳 would love to see a link to write-up in This-Week-in-Rust TWiR if you can share there too:
https://github.com/rust-lang/this-week-in-rust/

@rubdos
Copy link
Contributor Author

rubdos commented Dec 13, 2022

Awesome! partying_face would love to see a link to write-up in This-Week-in-Rust TWiR if you can share there too: https://github.com/rust-lang/this-week-in-rust/

After publication, I'll make sure to make a huge amount of noise about it. That should be pretty short notice, I have some very strict deadlines about that now. Just to clarify: the deadline is there because this work is complete; there was no pressure on completion of this work that would have affected the quality.

impl FieldElement2625x4 {

pub fn split(&self) -> [FieldElement51; 4] {
let mut out = [FieldElement51::zero(); 4];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut out = [FieldElement51::zero(); 4];
let mut out = [FieldElement51::ZERO; 4];

This was changed to ZERO


macro_rules! print_var {
($x:ident) => {
println!("{} = {:?}", stringify!($x), $x.to_bytes());
Copy link
Contributor

@pinkforest pinkforest Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
println!("{} = {:?}", stringify!($x), $x.to_bytes());
println!("{} = {:?}", stringify!($x), $x.as_bytes());

These were renamed to as_bytes()

use super::*;

fn serial_add(P: edwards::EdwardsPoint, Q: edwards::EdwardsPoint) -> edwards::EdwardsPoint {
use backend::serial::u64::field::FieldElement51;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use backend::serial::u64::field::FieldElement51;
use crate::backend::serial::u64::field::FieldElement51;

@pinkforest
Copy link
Contributor

Instead of flooding some changes here is some .patches:

src/backend/vector/neon/field.rs

diff --git a/src/backend/vector/neon/field.rs b/src/backend/vector/neon/field.rs
index 5cf57ee..85630cd 100644
--- a/src/backend/vector/neon/field.rs
+++ b/src/backend/vector/neon/field.rs
@@ -21,7 +21,7 @@
 //! arm instructions.
 
 use core::ops::{Add, Mul, Neg};
-use packed_simd::{u32x4, u32x2, i32x4, u8x16, u64x4, u64x2, IntoBits};
+use packed_simd::{u32x4, u32x2, i32x4, u64x4, u64x2, IntoBits};
 
 use crate::backend::vector::neon::constants::{P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO};
 use crate::backend::serial::u64::field::FieldElement51;
@@ -141,7 +141,7 @@ impl ConditionallySelectable for FieldElement2625x4 {
 impl FieldElement2625x4 {
 
     pub fn split(&self) -> [FieldElement51; 4] {
-        let mut out = [FieldElement51::zero(); 4];
+        let mut out = [FieldElement51::ZERO; 4];
         for i in 0..5 {
             let a_2i   = self.0[i].0.extract(0) as u64;
             let b_2i   = self.0[i].0.extract(1) as u64;
@@ -367,7 +367,9 @@ impl FieldElement2625x4 {
     #[inline]
     fn reduce64(mut z: [(u64x2, u64x2); 10]) -> FieldElement2625x4 {
 
+        #[allow(non_snake_case)]
         let LOW_25_BITS: u64x2 = u64x2::splat((1 << 25) - 1);
+        #[allow(non_snake_case)]
         let LOW_26_BITS: u64x2 = u64x2::splat((1 << 26) - 1);
 
         let carry = |z: &mut [(u64x2, u64x2); 10], i: usize| {
@@ -419,6 +421,7 @@ impl FieldElement2625x4 {
         ])
     }
 
+    #[allow(non_snake_case)]
     pub fn square_and_negate_D(&self) -> FieldElement2625x4 {
         #[inline(always)]
         fn m(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> u64x4 {
@@ -463,16 +466,16 @@ impl FieldElement2625x4 {
         let x8_19  = m_lo(v19, x8);
         let x9_19  = m_lo(v19, x9);
 
-        let mut z0 = m(x0,  x0) + m(x2_2,x8_19) + m(x4_2,x6_19) + ((m(x1_2,x9_19) +  m(x3_2,x7_19) +    m(x5,x5_19)) << 1);
-        let mut z1 = m(x0_2,x1) + m(x3_2,x8_19) + m(x5_2,x6_19) +                  ((m(x2,x9_19)   +    m(x4,x7_19)) << 1);
-        let mut z2 = m(x0_2,x2) + m(x1_2,x1)    + m(x4_2,x8_19) + m(x6,x6_19)    + ((m(x3_2,x9_19) +  m(x5_2,x7_19)) << 1);
-        let mut z3 = m(x0_2,x3) + m(x1_2,x2)    + m(x5_2,x8_19) +                  ((m(x4,x9_19)   +    m(x6,x7_19)) << 1);
-        let mut z4 = m(x0_2,x4) + m(x1_2,x3_2)  + m(x2,  x2)    + m(x6_2,x8_19)  + ((m(x5_2,x9_19) +    m(x7,x7_19)) << 1);
-        let mut z5 = m(x0_2,x5) + m(x1_2,x4)    + m(x2_2,x3)    + m(x7_2,x8_19)                    +  ((m(x6,x9_19)) << 1);
-        let mut z6 = m(x0_2,x6) + m(x1_2,x5_2)  + m(x2_2,x4)    + m(x3_2,x3) + m(x8,x8_19)        + ((m(x7_2,x9_19)) << 1);
-        let mut z7 = m(x0_2,x7) + m(x1_2,x6)    + m(x2_2,x5)    + m(x3_2,x4)                      +   ((m(x8,x9_19)) << 1);
-        let mut z8 = m(x0_2,x8) + m(x1_2,x7_2)  + m(x2_2,x6)    + m(x3_2,x5_2) + m(x4,x4)         +   ((m(x9,x9_19)) << 1);
-        let mut z9 = m(x0_2,x9) + m(x1_2,x8)    + m(x2_2,x7)    + m(x3_2,x6) + m(x4_2,x5);
+        let z0 = m(x0,  x0) + m(x2_2,x8_19) + m(x4_2,x6_19) + ((m(x1_2,x9_19) +  m(x3_2,x7_19) +    m(x5,x5_19)) << 1);
+        let z1 = m(x0_2,x1) + m(x3_2,x8_19) + m(x5_2,x6_19) +                  ((m(x2,x9_19)   +    m(x4,x7_19)) << 1);
+        let z2 = m(x0_2,x2) + m(x1_2,x1)    + m(x4_2,x8_19) + m(x6,x6_19)    + ((m(x3_2,x9_19) +  m(x5_2,x7_19)) << 1);
+        let z3 = m(x0_2,x3) + m(x1_2,x2)    + m(x5_2,x8_19) +                  ((m(x4,x9_19)   +    m(x6,x7_19)) << 1);
+        let z4 = m(x0_2,x4) + m(x1_2,x3_2)  + m(x2,  x2)    + m(x6_2,x8_19)  + ((m(x5_2,x9_19) +    m(x7,x7_19)) << 1);
+        let z5 = m(x0_2,x5) + m(x1_2,x4)    + m(x2_2,x3)    + m(x7_2,x8_19)                    +  ((m(x6,x9_19)) << 1);
+        let z6 = m(x0_2,x6) + m(x1_2,x5_2)  + m(x2_2,x4)    + m(x3_2,x3) + m(x8,x8_19)        + ((m(x7_2,x9_19)) << 1);
+        let z7 = m(x0_2,x7) + m(x1_2,x6)    + m(x2_2,x5)    + m(x3_2,x4)                      +   ((m(x8,x9_19)) << 1);
+        let z8 = m(x0_2,x8) + m(x1_2,x7_2)  + m(x2_2,x6)    + m(x3_2,x5_2) + m(x4,x4)         +   ((m(x9,x9_19)) << 1);
+        let z9 = m(x0_2,x9) + m(x1_2,x8)    + m(x2_2,x7)    + m(x3_2,x6) + m(x4_2,x5);
 
 
         let low__p37 = u64x4::splat(0x3ffffed << 37);
@@ -675,7 +678,7 @@ mod test {
 
     #[test]
     fn scale_by_curve_constants() {
-        let mut x = FieldElement2625x4::splat(&FieldElement51::one());
+        let mut x = FieldElement2625x4::splat(&FieldElement51::ONE);
 
         x = x * (121666, 121666, 2*121666, 2*121665);
 

src/backend/vector/neon/edwards.rs

diff --git a/src/backend/vector/neon/edwards.rs b/src/backend/vector/neon/edwards.rs
index 8073c3a..357d4e3 100644
--- a/src/backend/vector/neon/edwards.rs
+++ b/src/backend/vector/neon/edwards.rs
@@ -321,14 +321,14 @@ mod test {
     use super::*;
 
     fn serial_add(P: edwards::EdwardsPoint, Q: edwards::EdwardsPoint) -> edwards::EdwardsPoint {
-        use backend::serial::u64::field::FieldElement51;
+        use crate::backend::serial::u64::field::FieldElement51;
 
         let (X1, Y1, Z1, T1) = (P.X, P.Y, P.Z, P.T);
         let (X2, Y2, Z2, T2) = (Q.X, Q.Y, Q.Z, Q.T);
 
         macro_rules! print_var {
             ($x:ident) => {
-                println!("{} = {:?}", stringify!($x), $x.to_bytes());
+                println!("{} = {:?}", stringify!($x), $x.as_bytes());
             };
         }
 
@@ -411,8 +411,8 @@ mod test {
 
     #[test]
     fn vector_addition_vs_serial_addition_vs_edwards_extendedpoint() {
-        use constants;
-        use scalar::Scalar;
+        use crate::constants;
+        use crate::scalar::Scalar;
 
         println!("Testing id +- id");
         let P = edwards::EdwardsPoint::identity();
@@ -440,7 +440,7 @@ mod test {
 
         macro_rules! print_var {
             ($x:ident) => {
-                println!("{} = {:?}", stringify!($x), $x.to_bytes());
+                println!("{} = {:?}", stringify!($x), $x.as_bytes());
             };
         }
 
@@ -498,8 +498,8 @@ mod test {
 
     #[test]
     fn vector_doubling_vs_serial_doubling_vs_edwards_extendedpoint() {
-        use constants;
-        use scalar::Scalar;
+        use crate::constants;
+        use crate::scalar::Scalar;
 
         println!("Testing [2]id");
         let P = edwards::EdwardsPoint::identity();
@@ -516,8 +516,8 @@ mod test {
 
     #[test]
     fn basepoint_odd_lookup_table_verify() {
-        use constants;
-        use backend::vector::neon::constants::{BASEPOINT_ODD_LOOKUP_TABLE};
+        use crate::constants;
+        use crate::backend::vector::neon::constants::{BASEPOINT_ODD_LOOKUP_TABLE};
 
         let basepoint_odd_table = NafLookupTable8::<CachedPoint>::from(&constants::ED25519_BASEPOINT_POINT);
         println!("Testing basepoint table");

clippy is still whining but I'll send a .patch later - we are using the 2021 edition

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly build --features rand_core

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly clippy --features rand_core

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly bench --features rand_core

@rubdos
Copy link
Contributor Author

rubdos commented Dec 14, 2022

I've done the rebase together with your patches, and squashed the NEON work into the first commit. I had not tested the rebase on 4.0, sorry about that. I'll do that now.

@rubdos
Copy link
Contributor Author

rubdos commented Dec 14, 2022

Working on the rustfmt attributes now.

@pinkforest
Copy link
Contributor

pinkforest commented Dec 14, 2022

started a benchmark repo here -

I ran some benchmarks earlier (will re-run now) on Mac M1 Max Aarch64 comparing between backends FWIW

I'm going to re-organise it to cover wider range benchmarks between various impl's not just dalek's

@rubdos
Copy link
Contributor Author

rubdos commented Dec 14, 2022

I'll see whether I'll have all the data in that format. I currently only retain the raw outputs from Criterion for further automatic processing for our paper.

@tarcieri tarcieri mentioned this pull request Feb 6, 2024
@rubdos
Copy link
Contributor Author

rubdos commented Feb 26, 2024

This took a bit longer than anticipated. We got a publication out, so academia-wise, we're all happy now. In between writing that paper and getting it out somewhere, there was quite some time, but I'm trying to sum up the current remaining issues with this branch in this post.

Results are especially spectacular on ARMv7/32-bit ARM CPUs (~50% speedups). Results are still great on ARMv8/aarch64 (~20%) until ARMv8.2-ish: when run in A64 mode, we get slow-downs relative to base. This is because ARM decided to nerf the TBL/TBX instructions relative to previous versions. The throughput remained the same, but the total latency of the instruction is now a function of the input size.

image

(shamelessly stolen from https://www.stonybrook.edu/commcms/ookami/support/_docs/A64FX_Microarchitecture_Manual_en_1.3.pdf)

We'll require a bit of in-depth looking into the assembly to figure out a way around this.
The TBL/TBX instruction comes from various vqtbx1q_u8 calls in FieldElement2625x4::blend. @Tarinn made a note that we can probably use a combination of vset/vget to get blend faster, but we don't currently have the hardware to reproduce this.

We sent some executables to @direc85 over in Finland to get the benchmark run on an Xperia 10 III, which is how we noted the issue. That's not really a tight debuggable loop. I should soon™ have an Xperia 10 IV to close the loop a bit.

Robrecht Blancquaert and others added 5 commits February 28, 2024 13:18
Co-authored-by: pinkforest <36498018+pinkforest@users.noreply.github.com>
Co-authored-by: Robrecht Blancquaert <Robrecht.Simon.Blancquaert@vub.be>
…5x4::shuffle

Co-authored-by: Robrecht Blacquaert <Robrecht.Simon.Blancquaert@vub.be>
@rubdos
Copy link
Contributor Author

rubdos commented Feb 28, 2024

(that's a disfunctional rebase, we still have to reintegrate in the new backend-selection)

@pinkforest
Copy link
Contributor

pinkforest commented Feb 28, 2024

Awesome -

For CI there is ARM already for cross - .github/workflows/cross.yml#L19

We also have simd/x86_64-AVX related tests in .github/worksflows/curve25519-dalek.yml#L118

But I think we could split backend tests into a separate file - qemu should support neon at this level - backends.yml for the sake of clarity and discoverability.

I'm re-working the build script so I could include a placeholder separately there so it doesn't cause conflict.

Also nightly CI is currently failing - I've fixed it here:

@jrose-signal
Copy link
Contributor

Today I idly tested this on my M1 Mac and the current state of the branch is a regression from the u64 backend. Changing the implementation of blend to use core::simd::Mask::select (with constant masks) got it back to about par with the u64 backend. I wasn't being super scientific with this (I was doing other things while the benchmark ran, for example, though nothing CPU-intensive), but it was a big enough gap to be indicative. There's probably something better than Mask, but just confirming that the TBL slowdown is serious. 😭

@rubdos
Copy link
Contributor Author

rubdos commented Mar 23, 2024

There's probably something better than Mask, but just confirming that the TBL slowdown is serious. 😭

Yep, it's quite devastating indeed... If we've got news on that front, we'll post our findings as soon as we have them. Thanks for testing it out! ❤️

@rubdos
Copy link
Contributor Author

rubdos commented May 22, 2024

@Tarinn made some progress by now on the aarch64 problem, but we're not there yet.

We thought we would have some speedup by working around llvm/llvm-project/issues/58323 (some vget instrinsics don't get correctly lowered to DUP/MOV instructions), but manually throwing in the assembly didn't give a measurable improvement on our Odroid N2+.

The manual injection of assembly for `vget_high_u32`/`vget_low_u32`
diff --git a/curve25519-dalek/src/backend/vector/neon/field.rs b/curve25519-dalek/src/backend/vector/neon/field.rs
index 2f8d42c..316963d 100644
--- a/curve25519-dalek/src/backend/vector/neon/field.rs
+++ b/curve25519-dalek/src/backend/vector/neon/field.rs
@@ -23,20 +23,57 @@
 use core::arch::aarch64::{self, vuzp1_u32};
 use core::ops::{Add, Mul, Neg};
 
-use super::packed_simd::{u32x2, u32x4, i32x4, u64x2, u64x4};
+use super::packed_simd::{i32x4, u32x2, u32x4, u64x2, u64x4};
 use crate::backend::serial::u64::field::FieldElement51;
 use crate::backend::vector::neon::constants::{
     P_TIMES_16_HI, P_TIMES_16_LO, P_TIMpatchdiff --git a/curve25519-dalek/src/backend/vector/neon/field.rs b/curve25519-dalek/src/backend/vector/neon/field.rs
index 2f8d42c..cccde8c 100644
--- a/curve25519-dalek/src/backend/vector/neon/field.rs
+++ b/curve25519-dalek/src/backend/vector/neon/field.rs
@@ -23,20 +23,57 @@
 use core::arch::aarch64::{self, vuzp1_u32};
 use core::ops::{Add, Mul, Neg};
 
-use super::packed_simd::{u32x2, u32x4, i32x4, u64x2, u64x4};
+use super::packed_simd::{i32x4, u32x2, u32x4, u64x2, u64x4};
 use crate::backend::serial::u64::field::FieldElement51;
 use crate::backend::vector::neon::constants::{
     P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO,
 };
 
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_high_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+    use core::arch::asm;
+    let o;
+    unsafe {
+        asm! (
+            "DUP {o:d}, {v}.D[1]",
+            v = in(vreg) v,
+            o = out(vreg) o,
+        )
+    }
+    o
+}
+
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_low_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+    use core::arch::asm;
+    let o;
+    unsafe {
+        asm! (
+            "DUP {o:d}, {v}.D[0]",
+            v = in(vreg) v,
+            o = out(vreg) o,
+        )
+    }
+    o
+}
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_high_u32;
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_low_u32;
+
 macro_rules! shuffle {
     ($vec0:expr, $vec1:expr, $index:expr) => {
         unsafe {
             core::mem::transmute::<[u32; 4], u32x4>(
                 *core::simd::simd_swizzle!(
-                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)), 
-                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)), 
-                    $index).as_array())
+                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)),
+                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)),
+                    $index
+                )
+                .as_array(),
+            )
         }
     };
 }
@@ -53,8 +90,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
     let b0: u32x2;
     let b1: u32x2;
     unsafe {
-        use core::arch::aarch64::vget_high_u32;
-        use core::arch::aarch64::vget_low_u32;
         a0 = vget_low_u32(src.0.into()).into();
         a1 = vget_low_u32(src.1.into()).into();
         b0 = vget_high_u32(src.0.into()).into();
@@ -72,7 +107,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
 fn repack_pair(x: (u32x4, u32x4), y: (u32x4, u32x4)) -> (u32x4, u32x4) {
     unsafe {
         use core::arch::aarch64::vcombine_u32;
-        use core::arch::aarch64::vget_low_u32;
         use core::arch::aarch64::vgetq_lane_u32;
         use core::arch::aarch64::vset_lane_u32;
 
@@ -223,7 +257,6 @@ impl FieldElement2625x4 {
         self.shuffle(Shuffle::BACD)
     }
 
-
     // Can probably be sped up using multiple vset/vget instead of table
     #[inline]
     pub fn blend(&self, other: FieldElement2625x4, control: Lanes) -> FieldElement2625x4 {
@@ -318,8 +351,6 @@ impl FieldElement2625x4 {
         let rotated_carryout = |v: (u32x4, u32x4)| -> (u32x4, u32x4) {
             unsafe {
                 use core::arch::aarch64::vcombine_u32;
-                use core::arch::aarch64::vget_high_u32;
-                use core::arch::aarch64::vget_low_u32;
                 use core::arch::aarch64::vqshlq_u32;
 
                 let c: (u32x4, u32x4) = (
@@ -327,16 +358,8 @@ impl FieldElement2625x4 {
                     vqshlq_u32(v.1.into(), shifts.1.into()).into(),
                 );
                 (
-                    vcombine_u32(
-                        vget_high_u32(c.0.into()),
-                        vget_low_u32(c.0.into()),
-                    )
-                    .into(),
-                    vcombine_u32(
-                        vget_high_u32(c.1.into()),
-                        vget_low_u32(c.1.into()),
-                    )
-                    .into(),
+                    vcombine_u32(vget_high_u32(c.0.into()), vget_low_u32(c.0.into())).into(),
+                    vcombine_u32(vget_high_u32(c.1.into()), vget_low_u32(c.1.into())).into(),
                 )
             }
         };
@@ -344,19 +367,9 @@ impl FieldElement2625x4 {
         let combine = |v_lo: (u32x4, u32x4), v_hi: (u32x4, u32x4)| -> (u32x4, u32x4) {
             unsafe {
                 use core::arch::aarch64::vcombine_u32;
-                use core::arch::aarch64::vget_high_u32;
-                use core::arch::aarch64::vget_low_u32;
                 (
-                    vcombine_u32(
-                        vget_low_u32(v_lo.0.into()),
-                        vget_high_u32(v_hi.0.into()),
-                    )
-                    .into(),
-                    vcombine_u32(
-                        vget_low_u32(v_lo.1.into()),
-                        vget_high_u32(v_hi.1.into()),
-                    )
-                    .into(),
+                    vcombine_u32(vget_low_u32(v_lo.0.into()), vget_high_u32(v_hi.0.into())).into(),
+                    vcombine_u32(vget_low_u32(v_lo.1.into()), vget_high_u32(v_hi.1.into())).into(),
                 )
             }
         };
@@ -386,7 +399,6 @@ impl FieldElement2625x4 {
         #[rustfmt::skip] // Retain formatting of return tuple
         let c9_19: (u32x4, u32x4) = unsafe {
             use core::arch::aarch64::vcombine_u32;
-            use core::arch::aarch64::vget_low_u32;
             use core::arch::aarch64::vmulq_n_u32;
 
             let c9_19_spread: (u32x4, u32x4) = (
@@ -475,8 +487,6 @@ impl FieldElement2625x4 {
         fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
             use core::arch::aarch64::vmull_u32;
             use core::arch::aarch64::vuzp1_u32;
-            use core::arch::aarch64::vget_low_u32;
-            use core::arch::aarch64::vget_high_u32;
             unsafe {
                 let x: (u32x4, u32x4) = (
                     vmull_u32(x.0.into(), y.0.into()).into(),
@@ -530,8 +540,6 @@ impl FieldElement2625x4 {
 
         let negate_D = |x_01: u64x4, p_01: u64x4| -> (u64x2, u64x2) {
             unsafe {
-                use core::arch::aarch64::vget_low_u32;
-                use core::arch::aarch64::vget_high_u32;
                 use core::arch::aarch64::vcombine_u32;
 
                 let x = x_01.0;
@@ -640,8 +648,6 @@ impl<'a, 'b> Mul<&'b FieldElement2625x4> for &'a FieldElement2625x4 {
         fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
             use core::arch::aarch64::vmull_u32;
             use core::arch::aarch64::vuzp1_u32;
-            use core::arch::aarch64::vget_low_u32;
-            use core::arch::aarch64::vget_high_u32;
             unsafe {
                 let x: (u32x4, u32x4) = (
                     vmull_u32(x.0.into(), y.0.into()).into(),
@@ -836,5 +842,3 @@ mod test {
         assert_eq!(x3, splits[3]);
     }
 }
-
-

@Tarinn
Copy link

Tarinn commented Aug 21, 2024

Current state of this pull request gives speed-up on the previously problematic CPUs. It is less than the speed-up on older architectures, but still a minimum of ~5% and more often at least 10 to 12% on benchmarks that use the SIMD NEON back-end, in comparison to the serial back-end.
The latest commits to this branch have mainly focused on refactoring to use internal arm types (such as uint64x2x2_t) and trying to avoid the problematic llvm instructions.
I believe this makes the current branch usable, but more testing on different devices would be welcome. Maybe a patch of the vget_high intrinsic in llvm will result in even more speed-up, so that is something I will look into next.

Results of the benchmarks: target_a55.tar.gz

(merge conflicts should be resolved shortly)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement NEON backend
5 participants