Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aarch64: Add specialized shuffle lowerings #5977

Merged
merged 8 commits into from
Mar 10, 2023

Conversation

alexcrichton
Copy link
Member

This is the equivalent of #5930 but for AArch64. I went through various instructions I saw for AArch64 and added corresponding shuffle lowerings where appropriate. These lowerings cover all the lowerings I found in the meshoptimizer repository plus a few more based on various instructions I found while perusing ARM's documentation. Like with x86_64 I've tried to make sure there's a runtest and a precise-output test for each lowering, even if some of them probably overlap with the x86_64 runtests.

I'll note that many of these lowerings probably won't end up getting used by "portable" wasm binaries since some of the shifts here are pretty specific to AArch64 and don't have efficient 1/2 instruction lowerings on x86_64. That being said these are useful to any sort of hypothetical Cranelift-as-an-AArch64-backend-compiler such as rustc_cranelift_codegen since this broadens the spectrum of instructions supported by Cranelift's AArch64 backend.

This commit uses the same style of patterns in the x64 backend to start
adding specific lowerings of the Cranelift `shuffle` instruction to
particular AArch64 instructions.
These instructions match the `punpck*` family of instructions on x64 and
should help provide more efficient lowerings than the current `shuffle`
fallback.
Along the lines of prior commits adds specific patterns to lowering for
individual AArch64 instructions available.
This instruction will more-or-less concatenate two 128-bit vector
registers to create a 256-bit value, shift it right, and then take the
lower 128-bits into the destination. This can be modeled with a
`shuffle` of consecutive bytes so this adds a lowering rule to generate
this instruction.
This commit adds special cases for Cranelift's `shuffle` on AArch64 when
the lowering can be represented with a `dup` instruction which
broadcasts one vector's lane into all lanes of the destination.
This commit adds shuffle mask specializations for the `rev{16,32,64}`
family of instructions on AArch64 which can be used to reverse bytes,
16-bit values, or 32-bit values within larger values.
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. labels Mar 10, 2023
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks OK to me, I think; a few suggestions for clarity below.

I'm leaning on the runtest having validated that all the mappings are correct -- I didn't go to the AArch64 manual to check the semantics of the zip/... instructions.

Relatedly, it appears that the runtest (simd-shuffle.clif) does not execute on x86-64 or the interpreter. The latter appears blocked on #5915 but is there any reason we can't enable the former? That would give a little more confidence via cross-check, as well.

Thanks!

(rule 3 (lower (shuffle a b (shuffle_dup64_from_imm n)))
(vec_dup_from_fpu a (VectorSize.Size64x2) n))

(decl shuffle_dup8_from_imm (u8) Immediate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add doc comments here to describe what pattern in the Immediate each of these etors matches on? (Likewise below)


;; Rules for the `uzp1` and `uzp2` instructions which gather even-numbered lanes
;; or odd-numbered lanes
(rule 1 (lower (shuffle a b (u128_from_immediate 0x1e1c_1a18_1614_1210_0e0c_0a08_0604_0200)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would make these patterns clearer to have an extractor something like (shuffle_immediate 30 28 26 ...) (with external Rust impl that is Fn(&mut self, imm: Immediate) -> Option<(u8, u8, u8, u8, ...)>)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally did this in #5905 but @jameysharp preferred the hex masks instead. I don't mind myself, but I do think it's worth being consistent across the backends so I'd want to update all the x64 things if these aarch64 rules change as wlel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, I suggested exactly the opposite in a previous PR 😆

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting!

I see the points in #5905 now about exposing more opportunity to islec by making the full mask visible as one value; that's a reasonable argument I think. My rationale was that I was having some friction converting hex values in my head to understand the permutation (but maybe the right answer to that is just to think in hex directly). I don't feel too strongly about it, so this is fine as-is.

@alexcrichton
Copy link
Member Author

Oh I thought these lines were enough to run in x86_64?

@cfallin
Copy link
Member

cfallin commented Mar 10, 2023

Oh I thought these lines were enough to run in x86_64?

Ah, yes, the simplest explanation ("@cfallin misses obvious details") is sometimes the correct one here. Not sure why I didn't see those; perhaps thrown off by the verbose flags or... who knows. Anyway, yes, nevermind this point, thanks!

@alexcrichton alexcrichton added this pull request to the merge queue Mar 10, 2023
Merged via the queue into bytecodealliance:main with commit 52896e0 Mar 10, 2023
@alexcrichton alexcrichton deleted the aarch64-shuffles branch March 10, 2023 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants