aarch64: Add specialized `shuffle` lowerings #5977

alexcrichton · 2023-03-10T03:28:20Z

This is the equivalent of #5930 but for AArch64. I went through various AArch64 instructions and added corresponding shuffle lowerings where appropriate. These lowerings cover all the lowerings I found in the meshoptimizer repository, plus a few more based on instructions I found while perusing ARM's documentation. As with x86_64, I've tried to make sure there's a runtest and a precise-output test for each lowering, even if some of them probably overlap with the x86_64 runtests.

I'll note that many of these lowerings probably won't end up being used by "portable" wasm binaries, since some of the shuffles here are fairly specific to AArch64 and don't have efficient one- or two-instruction lowerings on x86_64. That said, they are useful to any hypothetical use of Cranelift as an AArch64 backend compiler, such as rustc_codegen_cranelift, since they broaden the range of instructions supported by Cranelift's AArch64 backend.

This commit uses the same style of patterns as the x64 backend to start adding specific lowerings of the Cranelift `shuffle` instruction to particular AArch64 instructions, beginning with `uzp{1,2}`.

The `zip{1,2}` instructions match the `punpck*` family of instructions on x64 and should provide more efficient lowerings than the current generic `shuffle` fallback.
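To make the `zip`/`punpck*` correspondence concrete, here is a small Rust sketch (not Cranelift's actual code; `zip_mask` is a hypothetical name) that builds the 16-byte `shuffle` mask a `zip1` or `zip2` corresponds to for a given lane size. It assumes Cranelift's `shuffle` convention that mask bytes 0..16 select bytes of `a` and 16..32 select bytes of `b`.

```rust
/// Hypothetical helper: the byte-level `shuffle` mask that AArch64 `zip1`
/// (`high == false`) or `zip2` (`high == true`) implements for lanes of
/// `lane_bytes` bytes. The result interleaves lanes of `a` and `b`.
fn zip_mask(lane_bytes: usize, high: bool) -> [u8; 16] {
    assert!(matches!(lane_bytes, 1 | 2 | 4 | 8));
    let lanes = 16 / lane_bytes;
    // zip1 reads the low halves of `a` and `b`; zip2 reads the high halves.
    let base = if high { lanes / 2 } else { 0 };
    let mut mask = [0u8; 16];
    for j in 0..lanes {
        // Even result lanes come from `a` (bytes 0..16), odd ones from `b` (16..32).
        let src = (j % 2) * 16;
        let elem = base + j / 2;
        for k in 0..lane_bytes {
            mask[j * lane_bytes + k] = (src + elem * lane_bytes + k) as u8;
        }
    }
    mask
}
```

For byte lanes, `zip_mask(1, false)` is `[0, 16, 1, 17, ..., 7, 23]`, the same interleaving of low bytes that x64's `punpcklbw` performs.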

Along the lines of prior commits, this adds specific lowering patterns for individual AArch64 instructions, here `trn{1,2}`.
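As a sketch of what a `trn{1,2}` pattern matches, the hypothetical Rust helper below (not Cranelift's code) builds the byte mask for `trn1` (`odd == false`), which pairs up even-numbered lanes of `a` and `b`, and `trn2` (`odd == true`), which pairs up odd-numbered lanes. It assumes the same convention that mask bytes 16..32 refer to `b`.

```rust
/// Hypothetical helper: the byte-level `shuffle` mask of AArch64 `trn1`
/// (`odd == false`) or `trn2` (`odd == true`) for `lane_bytes`-sized lanes.
fn trn_mask(lane_bytes: usize, odd: bool) -> [u8; 16] {
    assert!(matches!(lane_bytes, 1 | 2 | 4 | 8));
    let lanes = 16 / lane_bytes;
    let mut mask = [0u8; 16];
    for j in 0..lanes {
        // Even result lanes come from `a`, odd result lanes from `b`.
        let src = (j % 2) * 16;
        // Both lanes of pair `j / 2` use source lane `2 * (j / 2)` (+1 for trn2).
        let elem = (j / 2) * 2 + odd as usize;
        for k in 0..lane_bytes {
            mask[j * lane_bytes + k] = (src + elem * lane_bytes + k) as u8;
        }
    }
    mask
}
```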

The `ext` instruction more-or-less concatenates two 128-bit vector registers into a 256-bit value, shifts it right, and takes the lower 128 bits as the destination. This can be modeled as a `shuffle` of consecutive bytes, so this adds a lowering rule that generates `ext`.
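The "shuffle of consecutive bytes" condition can be sketched as a matcher in Rust. This is a hypothetical illustration (`ext_imm` is not Cranelift's extractor): it returns the byte offset `n` when the mask reads bytes `n, n+1, ..., n+15` of the concatenation `a:b`, with `a` in the low bytes.

```rust
/// Hypothetical matcher: if `mask` selects 16 consecutive bytes starting at
/// byte `n` of the 32-byte concatenation `a:b`, return `n` (the byte count
/// an `ext` would shift by).
fn ext_imm(mask: [u8; 16]) -> Option<u8> {
    let n = mask[0];
    // n == 0 is just `a` and n == 16 is just `b`; neither needs an `ext`,
    // and n > 16 would read past the end of `b`.
    if n == 0 || n >= 16 {
        return None;
    }
    // Compare in `usize` so that large mask bytes cannot overflow `n + i`.
    mask.iter()
        .enumerate()
        .all(|(i, &m)| m as usize == n as usize + i)
        .then_some(n)
}
```

Doing the `n + i` arithmetic in `usize` rather than `u8` is a deliberate choice in this sketch, so arbitrary mask bytes (for example `255`) are rejected instead of overflowing.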

This commit adds special cases for Cranelift's `shuffle` on AArch64 when the lowering can be represented with a `dup` instruction, which broadcasts one lane of a vector into all lanes of the destination.
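A `dup`-style mask repeats one whole, aligned lane in every lane of the result. The Rust sketch below is a hypothetical matcher (`dup_lane` is not Cranelift's API); as an assumption for simplicity, it only recognizes lanes taken from the first operand `a` (mask bytes 0..16).

```rust
/// Hypothetical matcher: if every `lane_bytes`-sized lane of `mask` copies
/// the same aligned lane of `a`, return that lane's index.
fn dup_lane(mask: [u8; 16], lane_bytes: usize) -> Option<u8> {
    assert!(matches!(lane_bytes, 1 | 2 | 4 | 8));
    let first = &mask[..lane_bytes];
    // The first lane must be a whole, aligned lane of `a`.
    let start = first[0] as usize;
    if start % lane_bytes != 0 || start + lane_bytes > 16 {
        return None;
    }
    if !first.iter().enumerate().all(|(k, &b)| b as usize == start + k) {
        return None;
    }
    // Every lane of the result must repeat that same source lane.
    if mask.chunks(lane_bytes).all(|lane| lane == first) {
        Some((start / lane_bytes) as u8)
    } else {
        None
    }
}
```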

This commit adds shuffle mask specializations for the `rev{16,32,64}` family of instructions on AArch64 which can be used to reverse bytes, 16-bit values, or 32-bit values within larger values.
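Each `rev` variant is a fixed byte permutation, so its `shuffle` mask can be computed directly. This hypothetical Rust helper (not Cranelift's code) reverses `elem_bytes`-sized elements within each `container_bytes`-sized chunk; for example, `rev32` on bytes is `(1, 4)` and `rev64` on 32-bit values is `(4, 8)`.

```rust
/// Hypothetical helper: the byte-level `shuffle` mask equivalent to an
/// AArch64 `rev{16,32,64}`, reversing `elem_bytes`-sized elements within
/// each `container_bytes`-sized chunk of a single vector.
fn rev_mask(elem_bytes: usize, container_bytes: usize) -> [u8; 16] {
    assert!(elem_bytes < container_bytes && container_bytes <= 8);
    assert!(container_bytes % elem_bytes == 0 && 16 % container_bytes == 0);
    let elems = container_bytes / elem_bytes;
    let mut mask = [0u8; 16];
    for i in 0..16 {
        // Start of the chunk containing byte `i`.
        let chunk = i / container_bytes * container_bytes;
        // Which element of the chunk byte `i` lives in, and its offset there.
        let elem = (i % container_bytes) / elem_bytes;
        let within = i % elem_bytes;
        // Mirror the element's position inside the chunk; keep byte order within it.
        mask[i] = (chunk + (elems - 1 - elem) * elem_bytes + within) as u8;
    }
    mask
}
```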

cfallin

This looks OK to me, I think; a few suggestions for clarity below.

I'm leaning on the runtest having validated that all the mappings are correct -- I didn't go to the AArch64 manual to check the semantics of the zip/... instructions.

Relatedly, it appears that the runtest (simd-shuffle.clif) does not execute on x86-64 or the interpreter. The latter appears blocked on #5915 but is there any reason we can't enable the former? That would give a little more confidence via cross-check, as well.

Thanks!

cfallin · 2023-03-10T19:06:48Z

cranelift/codegen/src/isa/aarch64/lower.isle

+(rule 3 (lower (shuffle a b (shuffle_dup64_from_imm n)))
+        (vec_dup_from_fpu a (VectorSize.Size64x2) n))
+
+(decl shuffle_dup8_from_imm (u8) Immediate)


Can we add doc comments here to describe what pattern in the Immediate each of these etors matches on? (Likewise below)

cfallin · 2023-03-10T19:10:05Z

cranelift/codegen/src/isa/aarch64/lower.isle

+
+;; Rules for the `uzp1` and `uzp2` instructions which gather even-numbered lanes
+;; or odd-numbered lanes
+(rule 1 (lower (shuffle a b (u128_from_immediate 0x1e1c_1a18_1614_1210_0e0c_0a08_0604_0200)))


I wonder if it would make these patterns clearer to have an extractor something like `(shuffle_immediate 30 28 26 ...)` (with an external Rust impl that is `Fn(&mut self, imm: Immediate) -> Option<(u8, u8, u8, u8, ...)>`)?

alexcrichton

I originally did this in #5905 but @jameysharp preferred the hex masks instead. I don't mind myself, but I do think it's worth being consistent across the backends, so I'd want to update all the x64 rules as well if these aarch64 rules change.

jameysharp

Funny, I suggested exactly the opposite in a previous PR 😆

cfallin

Interesting!

I see the points in #5905 now about exposing more opportunity to islec by making the full mask visible as one value; that's a reasonable argument I think. My rationale was that I was having some friction converting hex values in my head to understand the permutation (but maybe the right answer to that is just to think in hex directly). I don't feel too strongly about it, so this is fine as-is.
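For reading the hex masks directly, it may help to see how a `u128` immediate maps to lane selectors. This Rust sketch assumes, as the `uzp1` mask above suggests, that mask byte 0 is the least-significant byte of the `u128`; `mask_bytes` and `uzp8_mask` are hypothetical names used only for illustration.

```rust
/// Decode a `u128` shuffle immediate (as written in the ISLE rules) into its
/// 16 lane selectors, assuming byte 0 is the least-significant byte.
fn mask_bytes(imm: u128) -> [u8; 16] {
    imm.to_le_bytes()
}

/// Hypothetical helper: the byte mask of `uzp1` (`odd == false`) or `uzp2`
/// on 8-bit lanes, i.e. the even- or odd-numbered bytes of `a:b`.
fn uzp8_mask(odd: bool) -> [u8; 16] {
    let mut mask = [0u8; 16];
    for (j, m) in mask.iter_mut().enumerate() {
        *m = (2 * j + odd as usize) as u8;
    }
    mask
}
```

Under that assumption, `mask_bytes(0x1e1c_1a18_1614_1210_0e0c_0a08_0604_0200)` reads as `[0, 2, 4, ..., 30]`: reading the hex right to left, two digits at a time, gives the lane order.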

alexcrichton · 2023-03-10T20:04:20Z

Oh I thought these lines were enough to run in x86_64?

cfallin · 2023-03-10T20:21:56Z

> Oh I thought these lines were enough to run in x86_64?

Ah, yes, the simplest explanation ("@cfallin misses obvious details") is sometimes the correct one here. Not sure why I didn't see those; perhaps I was thrown off by the verbose flags or... who knows. Anyway, yes, never mind this point, thanks!

alexcrichton added 6 commits March 9, 2023 19:19

aarch64: Add shuffle lowerings for the uzp{1,2} instructions

8d20e19


aarch64: Add shuffle lowerings to the zip{1,2} instructions

be249f5


aarch64: Add shuffle lowerings for trn{1,2}

7efa9e3


aarch64: Add shuffle special case for dup

af890ca


aarch64: Add shuffle specializations for rev instructions

ab8b0a1


github-actions bot added the cranelift, cranelift:area:machinst, and cranelift:area:aarch64 labels Mar 10, 2023

Fix tests

aca56e7

cfallin approved these changes Mar 10, 2023


Add doc-comments in ISLE

a2b376d

alexcrichton added this pull request to the merge queue Mar 10, 2023

Merged via the queue into bytecodealliance:main with commit 52896e0 Mar 10, 2023

alexcrichton deleted the aarch64-shuffles branch March 10, 2023 22:19

afonso360 mentioned this pull request Mar 11, 2023

Cranelift: AArch64 attempt to add with overflow panic on shuffle.i8x16 #5989

Closed

jameysharp mentioned this pull request Apr 5, 2023

Add release notes for 8.0.0 #6145

Merged
