Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Dynamic shuffle #68

Open
penzn opened this Issue Mar 7, 2019 · 13 comments

Comments

Projects
None yet
7 participants
@penzn
Copy link

penzn commented Mar 7, 2019

Background in #30 by @lemaitre, also part of #24.

I would like to narrow this issue down to just i8x16.permute_dyn instruction proposed in #30, which would shuffle contents of one vector using lanes of the second vector as indices.

@zeux and @gnzlbg point out that dynamic permute is needed for bytewise processing on structures of dynamic width. Major hardware vendors support some form of dynamic shuffles.

From proposed changes in #30:

  • v8x16.permute_dyn(a: v128, s: v128) -> v128

Returns a new vector with lanes selected from the lanes of the first input vector
a and specified in the second input vector s. The indices from s are first
fit into the range [0, 15] via a modulo.

def S.permute_dyn(a, s):
    result = S.New()
    for i in range(S.Lanes):
      result[i] = a[s[i] % S.Lanes]
    return result
@gnzlbg

This comment has been minimized.

Copy link
Contributor

gnzlbg commented Mar 7, 2019

To the best of my knowledge, the following hardware natively support this operation as a single instruction: $arch: $ISA($instr,*),*

  • x86_64: SSSE3 (pshufb), AVX2 (vpshufb), AVX-512 (vpshufb)
  • arm: NEON ({v}tb{l,x})
  • ppc64: altivec (vperm), vsx (xxperm)
  • mips: msa (vshf)
  • riscv: V(vrgather)
@tlively

This comment has been minimized.

Copy link
Contributor

tlively commented Mar 7, 2019

From previous conversation it seems to me that this would be easy to implement both in engines and in tooling and can lead to clear performance wins for some applications. I personally see no problem with including this in MVP.

@zeux

This comment has been minimized.

Copy link

zeux commented Mar 8, 2019

Just in case I want to address the possible options for handling out-of-range indices.

Here's how each architecture handles out-of-range indices in the selector:

x86_64: If bit 7 is 1, set to 0, otherwise use lower 4 bits for lookup
ARM: if index is out of range (0-15), set to 0
PPC64: use lower 4 bits for lookup (no out of range handling)
MIPS: if bit 6 or bit 7 is 1, set to 0; otherwise use lower 4 bits for lookup (or rather use lower 5 bits for lookup into a table that has 32 elements constructed from 2 input vectors, but if both vectors are the same then it effectively means bits 4,5 are ignored)
RISCV: if index is out of range (0-15), set to 0.

The current proposed specification requires all platforms except for PPC64 to mask higher 4 bits using an and instruction, resulting in two instructions (if and with an immediate/memory argument is available).

Alternatively, if the specification says that all out-of-range indices return 0, then ARM and RISCV will use the native instruction, x86_64/PPC64/MIPS will have to generate additional checks using a 3-instruction sequence like "compare with 15; shuffle; andnot".

Finally - and I'm not sure if this is ok to do in WebAssembly, that's what C would do ;) - we could say that if index is 0x80 we return 0, if index is 0-15 we return the index result, otherwise the resulting byte is unspecified. This seems like the cleanest way to unify x86_64 behavior with the rest, resulting in zero-cost implementation for x86_64, ARM, MIPS, RISCV, and a 3-instruction sequence (compare to 0x80, shuffle, andnot) on PPC64.

I would be happy having the instruction with any behavior - it's not free to work around out-of-range behavior in some cases, and with the proposed spec we'd be paying a cost of an extra and (1 cycle assuming the load is free?) for each shuffle, but it's substantially better than status quo.

However I want to make sure we carefully consider all possibilities - if unspecified behavior is acceptable then the third solution is cleanest - it allows us to save 1 cycle in almost all cases, and allows for a cheap way to set elements to 0 which in some cases saves an extra instruction or two in the code that's using shuffles, at the expense of unspecified behavior for some values, and PPC64 that doesn't seem more valuable than other archs.

If unspecified behavior is unacceptable then the behavior as specced seems best, since option 2 is zero-cost for ARM but more expensive for x86_64, whereas option 1 is more balanced.

@Maratyszcza

This comment has been minimized.

Copy link

Maratyszcza commented Mar 8, 2019

@zeux Actually, if the specification says that all out-of-range indices return 0, then x86_64 can implement it with just two instructions: unsigned saturated add (PADDUBS) of byte 112 (=128-16) and PSHUFB. Similarly for MIPS MSA, albeit with a different constant.

The "out-of-range indices return 0" schema has an important advantage that it permits emulating larger than 16-byte lookup table. E.g. a 32 byte lookup can be simulated as permute_dyn(low16, idx) + permute_dyn(high16, idx - 16)

@zeux

This comment has been minimized.

Copy link

zeux commented Mar 8, 2019

Good point! Then it seems like the best option is to specify that out-of-range indices should return 0. This is zero cost on ARM and RISCV, and same cost as the currently proposed option on x86_64 and MIPS. PPC64 is the only platform where the currently proposed option is zero-cost but I don't think we should be optimizing for PPC64 over ARM...

@lemaitre

This comment has been minimized.

Copy link

lemaitre commented Mar 8, 2019

When I wrote the specification of v8x16.permute_dyn in my PR, I knew about ARM out-of-range policy, but I thought pshufbon x86 did not look at upper bits (like what I proposed).
In fact, it is the case for all other "permutevar" from x86, hence the confusion.

I'm happy to change the specification to something more useful and better supported.

So I think the best behavior is the following:

def S.permute_dyn(a, s):
    result = S.New()
    for i in range(S.Lanes):
        idx = s[i]
        if idx & (1 << 7) != 0:
            result[i] = 0
        else if idx < S.Lanes:
            result[i] = a[s[i]]
        else:
            result[i] = unspecified_value()
    return result

Therefore, we get a perfect translation to all major platforms but PPC (where a couple extra instructions woud be required).

If we cannot afford to specify unspecified_value() (or undefined_bahvior()), then we cannot have perfect translation to both ARM and x86.
I tend to prefer unspecified_value() over undefined_behavior() as it is more constrained (eg: it cannot format your filesystem), but still allows perfect codegen whenever possible.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

gnzlbg commented Mar 8, 2019

I thought pshufbon x86 did not look at upper bits (like what I proposed).

You are not alone, I also forgot about this detail. Thank you @zeux for bringing out this detail to our attention.

In fact, it is the case for all other "permutevar" from x86, hence the confusion.

Note that permutevar is weird in its own way because it uses the second bit of each element to perform the selection, that is, i64x2.permute_dyn(mask: i64x2) with mask = [1, 0] implemented on top of permutevarpd requires a lane-wise mask << 1 to map [1, 0] to [0b10, 0b00] before calling the instruction.

@penzn

This comment has been minimized.

Copy link
Author

penzn commented Mar 8, 2019

Out of range indices have been brought up before, but did not get a whole lot of attention: #11

@zeux

This comment has been minimized.

Copy link

zeux commented Mar 8, 2019

@penzn Right now there are no instructions that take variable shuffle masks. This issue proposes one and some behavior needs to be standardized for out-of-range indices - the discussion is about what behavior we need I think, it's not clear that "modulo 16" is the optimum balance.

@penzn

This comment has been minimized.

Copy link
Author

penzn commented Mar 8, 2019

I understand that, there seems to be consensus in favor of dynamic shuffle. It may make sense to enable both static and dynamic shuffle to support out of range indices if supporting out of range indices is the route.

@tlively

This comment has been minimized.

Copy link
Contributor

tlively commented Mar 8, 2019

It may make sense to enable both static and dynamic shuffle to support out of range indices if supporting out of range indices is the route.

The compiler can easily ensure that all indices are in range in the static case, so allowing out of range indices for static shuffles gets us extra complexity with no additional utility. In contrast, the tools cannot reason about dynamic shuffles, so there is no way to statically prevent out of bounds indices for dynamic shuffles and they must be handled somehow.

@binji

This comment has been minimized.

Copy link
Member

binji commented Mar 8, 2019

If unspecified behavior is unacceptable then the behavior as specced seems best, since option 2 is zero-cost for ARM but more expensive for x86_64, whereas option 1 is more balanced.

Completely unspecified behavior is probably unacceptable. Non-deterministic behavior is perhaps OK. For example, we already have non-determinism w.r.t. NaN behavior. That said, we've already decided that it is OK to sacrifice some speed for determinism, e.g. requiring masks on scalar shifts.

@zeux

This comment has been minimized.

Copy link

zeux commented Mar 8, 2019

Right - with the comment from @Maratyszcza it seems to me that the best option is to specify that out-of-range indexing returns 0.

It's the same wrt performance as the currently proposed option on x64 (and presumably MIPS), faster - zero-cost - on ARM and RISCV and slower - non-zero-cost - on PPC64.

This seems like a better tradeoff since it also unlocks savings in the calling code (for example, on x64, in the benchmark I posted earlier on a different thread I'd have to work around the lack of out-of-bounds handling by spending one more instruction, whereas if wasm behavior guarantees this I won't need to do that, so the resulting code is faster).

Additionally, this is also cleaner and easier to explain compared to options that try to special-case current x64 behavior wrt high bit.

zeux added a commit to zeux/simd that referenced this issue Mar 12, 2019

Add v8x16.permute_dyn instruction
This change adds a dynamic permute instruction to SIMD proposal.

The name "permute", as opposed to shuffle, is used to distinguish it from
instructions that operate on two input vectors.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or any of the two high bits are
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, saturating add (saturate(x + (128 - 16)) for
x86_64) and permute, to emulate the proposed behavior.

Fixes WebAssembly#68.
Fixes WebAssembly#24.
Contributes to WebAssembly#11.

zeux added a commit to zeux/simd that referenced this issue Mar 12, 2019

Add v8x16.permute instruction
This change adds a variable permute instruction to SIMD proposal.

The name "permute", as opposed to shuffle, is used to distinguish it from
instructions that operate on two input vectors.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or any of the two high bits are
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, saturating add (saturate(x + (128 - 16)) for
x86_64) and permute, to emulate the proposed behavior.

Fixes WebAssembly#68.
Fixes WebAssembly#24.
Contributes to WebAssembly#11.

@gnzlbg gnzlbg referenced this issue Mar 12, 2019

Open

Update the shuffle1_dyn API #228

0 of 3 tasks complete

zeux added a commit to zeux/simd that referenced this issue Mar 12, 2019

Add v8x16.permute_dyn instruction
This change adds a variable permute instruction to SIMD proposal.

The name "permute", as opposed to shuffle, is used to distinguish it from
instructions that operate on two input vectors.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or any of the two high bits are
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, saturating add (saturate(x + (128 - 16)) for
x86_64) and permute, to emulate the proposed behavior.

Fixes WebAssembly#68.
Fixes WebAssembly#24.
Contributes to WebAssembly#11.

zeux added a commit to zeux/simd that referenced this issue Mar 14, 2019

Add v8x16.shuffle1 instruction
This change adds a variable shuffle instruction to SIMD proposal.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or any of the two high bits are
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, saturating add (saturate(x + (128 - 16)) for
x86_64) and permute, to emulate the proposed behavior.

To distinguish variable shuffles with immediate shuffles, existing
v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be
explicit about the fact that it shuffles two vectors with an immediate
argument.

This naming scheme allows for adding variants like v8x16.shuffle2 and
v8x16.shuffle1_imm in the future.

Fixes WebAssembly#68.
Fixes WebAssembly#24.
Contributes to WebAssembly#11.

zeux added a commit to zeux/simd that referenced this issue Mar 14, 2019

Add v8x16.shuffle1 instruction
This change adds a variable shuffle instruction to SIMD proposal.

When indices are out of range, the result is specified as 0 for each
lane. This matches hardware behavior on ARM and RISCV architectures.

On x86_64 and MIPS, the hardware provides instructions that can select 0
when the high bit is set to 1 (x86_64) or any of the two high bits are
set to 1 (MIPS). On these architectures, the backend is expected to emit
a pair of instructions, saturating add (saturate(x + (128 - 16)) for
x86_64) and permute, to emulate the proposed behavior.

To distinguish variable shuffles with immediate shuffles, existing
v8x16.shuffle instruction is renamed to v8x16.shuffle2_imm to be
explicit about the fact that it shuffles two vectors with an immediate
argument.

This naming scheme allows for adding variants like v8x16.shuffle2 and
v8x16.shuffle1_imm in the future.

Fixes WebAssembly#68.
Contributes to WebAssembly#24.
Fixes WebAssembly#11.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.