This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

#372 Integer Sign/Zero Extension for {8,16}->{32,64} #395

Closed

Conversation

omnisip

@omnisip omnisip commented Nov 8, 2020

Introduction

This proposal mirrors #290 by adding new variants of the existing widen instructions, extending the 32- and 64-bit widen instructions to accept 16- and 8-bit integer sources. The practical use case is signal processing -- specifically audio and image processing -- but the applications are broad. Outside image processing, these instructions are helpful any time an 8-bit value needs to be converted to a floating-point number: today that requires multiple integer conversion steps before converting to float, even though modern architectures provide operations to convert from just about any integer size to another. Because widening from 8 to 64 bits spans more than one doubling, these instructions replace the high/low terminology with a constant lane-group immediate that selects which portion of the source is widened. This PR supersedes #372 and provides the implementation guidelines for this proposal.
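As a point of reference, here is a minimal scalar C sketch of the intended semantics; the helper names are illustrative only and not part of the proposal.

```c
#include <stdint.h>

/* Scalar sketch (not normative): c selects which group of four consecutive
   i8 source lanes is widened into the four i32 result lanes. */
void i32x4_widen_i8x16_s(int32_t out[4], const int8_t a[16], int c /* 0..3 */) {
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[4 * c + i];   /* sign extension */
}

void i32x4_widen_i8x16_u(uint32_t out[4], const uint8_t a[16], int c /* 0..3 */) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)a[4 * c + i];  /* zero extension */
}
```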

Use Cases

  • Audio Digital Signal Processing (8/16-bit data) -> 32/64 bits
  • Image/Video Digital Signal Processing (8-bit data) -> 32/64 bits
  • Prefix Sums / Scan Algorithms

Notable Applications and Libraries

Proposed Instructions

  • i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
  • i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128
Withdrawn instructions
  • i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
  • i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
  • i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
  • i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

Performance and Portability Considerations

The principal implementation is a shuffle/swizzle plus a shift for signed data, and merely a shuffle/swizzle for unsigned data.
Analysis of the efficacy of this proposal is described here, and is demonstrated here for 8-to-32-bit and here for 8-to-64-bit widening. There is a lot of room for compiler optimization depending on how the subsequent code operates. For instance, the primary advantage of the tbl approach (on ARM64) appears when a mask already exists in a register and doesn't require a load from memory; in other cases, it may make more sense to go the ushll or sshll route. Whether a benefit is achieved depends on the port utilization of the subsequent code and how much out-of-order and instruction-level parallelism can be obtained. This does not appear to be an issue on x64 chips, which gain a benefit so long as the number of shuffles is reduced. There, if a compiler detects a load followed by a convert, it can immediately optimize it upstream with movzx**** or movsx**** directly to the target register, which should provide the maximum instruction-level parallelism and minimal port usage. In any case where performance with this method does not exceed that of incremental conversions, the incremental conversion method may be used in its place. Similarly, any system or architecture that benefits from this conversion method over incremental conversion can use any of the masks described herein as if they were constants provided to the existing v128.swizzle operation.

Mapping To Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience. Compliant WebAssembly implementations do not have to follow the same code generation patterns.

Masks or Tables relevant to x64 and ARM Implementations



    mask_i32x4_i8x16_u0 = [0,255,255,255,
                           1,255,255,255,
                           2,255,255,255,
                           3,255,255,255]
    mask_i32x4_i8x16_u1 = [4,255,255,255,
                           5,255,255,255,
                           6,255,255,255,
                           7,255,255,255]
    mask_i32x4_i8x16_u2 = [8,255,255,255,
                           9,255,255,255,
                           10,255,255,255,
                           11,255,255,255]
    mask_i32x4_i8x16_u3 = [12,255,255,255,
                           13,255,255,255,
                           14,255,255,255,
                           15,255,255,255]
    mask_i32x4_i8x16_s0 = [255,255,255,0,
                           255,255,255,1,
                           255,255,255,2,
                           255,255,255,3]
    mask_i32x4_i8x16_s1 = [255,255,255,4,
                           255,255,255,5,
                           255,255,255,6,
                           255,255,255,7]
    mask_i32x4_i8x16_s2 = [255,255,255,8,
                           255,255,255,9,
                           255,255,255,10,
                           255,255,255,11]
    mask_i32x4_i8x16_s3 = [255,255,255,12,
                           255,255,255,13,
                           255,255,255,14,
                           255,255,255,15]
Withdrawn lowerings
    mask_i64x2_i8x16_u0 = [0,255,255,255,255,255,255,255,
                           1,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u1 = [2,255,255,255,255,255,255,255,
                           3,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u2 = [4,255,255,255,255,255,255,255,
                           5,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u3 = [6,255,255,255,255,255,255,255,
                           7,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u4 = [8,255,255,255,255,255,255,255,
                           9,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u5 = [10,255,255,255,255,255,255,255,
                           11,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u6 = [12,255,255,255,255,255,255,255,
                           13,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u7 = [14,255,255,255,255,255,255,255,
                           15,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_s0 = [255,255,255,255,255,255,255,0,
                           255,255,255,255,255,255,255,1]
    mask_i64x2_i8x16_s1 = [255,255,255,255,255,255,255,2,
                           255,255,255,255,255,255,255,3]
    mask_i64x2_i8x16_s2 = [255,255,255,255,255,255,255,4,
                           255,255,255,255,255,255,255,5]
    mask_i64x2_i8x16_s3 = [255,255,255,255,255,255,255,6,
                           255,255,255,255,255,255,255,7]
    mask_i64x2_i8x16_s4 = [255,255,255,255,255,255,255,8,
                           255,255,255,255,255,255,255,9]
    mask_i64x2_i8x16_s5 = [255,255,255,255,255,255,255,10,
                           255,255,255,255,255,255,255,11]
    mask_i64x2_i8x16_s6 = [255,255,255,255,255,255,255,12,
                           255,255,255,255,255,255,255,13]
    mask_i64x2_i8x16_s7 = [255,255,255,255,255,255,255,14,
                           255,255,255,255,255,255,255,15]
    mask_i64x2_i16x8_u0 =  [0,1,255,255,255,255,255,255,
                           2,3,255,255,255,255,255,255]
    mask_i64x2_i16x8_u1 =  [4,5,255,255,255,255,255,255,
                           6,7,255,255,255,255,255,255]
    mask_i64x2_i16x8_u2 =  [8,9,255,255,255,255,255,255,
                           10,11,255,255,255,255,255,255]
    mask_i64x2_i16x8_u3 =  [12,13,255,255,255,255,255,255,
                           14,15,255,255,255,255,255,255]
    mask_i64x2_i16x8_s0 =  [255,255,255,255,255,255,0,1,
                           255,255,255,255,255,255,2,3]
    mask_i64x2_i16x8_s1 =  [255,255,255,255,255,255,4,5,
                           255,255,255,255,255,255,6,7]
    mask_i64x2_i16x8_s2 =  [255,255,255,255,255,255,8,9,
                           255,255,255,255,255,255,10,11]
    mask_i64x2_i16x8_s3 =  [255,255,255,255,255,255,12,13,
                           255,255,255,255,255,255,14,15]

    mask_i64x2_i8x16_condensed_s0 =  [255,255,255,0,255,255,255,1,255,255,255,2,255,255,255,3]
    mask_i64x2_i8x16_condensed_s1 = [255,255,255,4,255,255,255,5,255,255,255,6,255,255,255,7]
    mask_i64x2_i8x16_condensed_s2  = [255,255,255,8,255,255,255,9,255,255,255,10,255,255,255,11]
    mask_i64x2_i8x16_condensed_s3 =  [255,255,255,12,255,255,255,13,255,255,255,14,255,255,255,15]

    mask_i64x2_i16x8_condensed_s0 =  [255,255,0,1,255,255,2,3,255,255,4,5,255,255,6,7]
    mask_i64x2_i16x8_condensed_s1 = [255,255,8,9,255,255,10,11,255,255,12,13,255,255,14,15]

   *255 can be replaced with 128 where necessary or reasonable.*
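For illustration, here is a hedged C sketch of how the i32x4 mask tables above can be generated for a given lane-group immediate c instead of being spelled out by hand; the function name is hypothetical.

```c
#include <stdint.h>

/* Sketch: build the i32x4 <- i8x16 byte-select masks shown above for a given
   lane-group immediate c (0..3). 255 marks bytes that pshufb/tbl should zero. */
void make_mask_i32x4_i8x16(uint8_t mask[16], int c, int is_signed) {
    for (int lane = 0; lane < 4; lane++) {
        for (int byte = 0; byte < 4; byte++)
            mask[4 * lane + byte] = 255;
        /* unsigned: the source byte lands in the low byte of the dword;
           signed: in the high byte, so a later >>24 sign-extends it. */
        mask[4 * lane + (is_signed ? 3 : 0)] = (uint8_t)(4 * c + lane);
    }
}
```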

x86/x86-64 processors with AVX instruction set

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovzxbd xmm_out, xmm_a # for mask c=0
# When c=1
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_u1 
# When c=2
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_u2 
# When c=3
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_u3 

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovsxbd xmm_out, xmm_a # for mask c=0
# When c=1
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_s1
        vpsrad   xmm_out, xmm_out, 24 
# When c=2
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_s2
        vpsrad   xmm_out, xmm_out, 24 
# When c=3
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_s3
        vpsrad   xmm_out, xmm_out, 24 
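For reference, the c=1..3 signed sequence above can be expressed as a hedged C/SSSE3-intrinsics sketch; the mask argument is assumed to be the matching mask_i32x4_i8x16_s$c constant from the tables.

```c
#include <emmintrin.h>  /* SSE2:  _mm_srai_epi32 */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Shuffle the selected bytes into the top byte of each 32-bit lane, then
   arithmetic-shift right by 24 to sign-extend (vpshufb + vpsrad above). */
static inline __m128i i32x4_widen_i8x16_s_ssse3(__m128i a, __m128i mask_s_c) {
    __m128i hi = _mm_shuffle_epi8(a, mask_s_c);
    return _mm_srai_epi32(hi, 24);
}
```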
Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

# When c=0
        vpmovzxbq xmm_out, xmm_a 
# When c=1..7
        vpshufb  xmm_out, xmm_a, mask_i64x2_i8x16_u$c 

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# When c=0
        vpmovsxbq xmm_out, xmm_a 
# When c=1
        vpsrad xmm_out, xmm_a, 16
        vpmovsxbq xmm_out, xmm_out
# When c=[1,3,5,7]
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i8x16_condensed_s$((c-1)/2)
        vpsrad  xmm_out, xmm_tmp, 24
        vpsrad  xmm_tmp, xmm_tmp, 31
        vpunpckhdq      xmm_out, xmm_out, xmm_tmp 
# When c=[2,4,6]      
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i8x16_condensed_s$(c/2)
        vpsrad  xmm_out, xmm_tmp, 24
        vpsrad  xmm_tmp, xmm_tmp, 31
        vpunpckldq      xmm_out, xmm_out, xmm_tmp 

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovzxwq xmm_out, xmm_a 
# When c=1..3
        vpshufb  xmm_out, xmm_a, mask_i64x2_i16x8_u$c 

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovsxwq xmm_out, xmm_a 
# When c=[1,3]
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i16x8_condensed_s$((c-1)/2)
        vpsrad  xmm_out, xmm_tmp, 16
        vpsrad  xmm_tmp, xmm_tmp, 31
        vpunpckhdq      xmm_out, xmm_out, xmm_tmp 
# When c=[2]      
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i16x8_condensed_s$(c/2)
        vpsrad  xmm_out, xmm_tmp, 16
        vpsrad  xmm_tmp, xmm_tmp, 31
        vpunpckldq      xmm_out, xmm_out, xmm_tmp 

x86/x86-64 processors with SSE4 instruction set

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

        movdqa   xmm_out, xmm_a
# when c=0
       pmovzxbd  xmm_out, xmm_out
# when c=1..3
        pshufb  xmm_out, mask_i32x4_i8x16_u$c

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

        movdqa   xmm_out, xmm_a
# when c=0
       pmovsxbd  xmm_out, xmm_out
# when c=1..3
        pshufb  xmm_out, mask_i32x4_i8x16_s$c
        psrad   xmm_out, 24
Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

        movdqa  xmm_out, xmm_a
# when c=0
        pmovzxbq  xmm_out, xmm_out
# when c=1..7
        pshufb  xmm_out, mask_i64x2_i8x16_u$c 

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# Option 1 (best performance in most cases)
       pmovsxbq xmm_out, mem_argument(base + 2*c)
# Option 2 
# when c=0
       pmovsxbq xmm_out, xmm_a
# when c=1
       movdqa xmm_out, xmm_a
       psrld xmm_out, 16
       pmovsxbq xmm_out, xmm_out
# When c=[1,3,5,7]
        movdqa  xmm_tmp, xmm_a
        pshufb  xmm_tmp, mask_i64x2_i8x16_condensed_s$((c-1)/2)
        movdqa  xmm_out, xmm_tmp
        psrad   xmm_out, 24
        psrad   xmm_tmp, 31
        punpckhdq       xmm_out, xmm_tmp
# When c=[2,4,6]
        movdqa  xmm_tmp, xmm_a
        pshufb  xmm_tmp, mask_i64x2_i8x16_condensed_s$(c/2)
        movdqa  xmm_out, xmm_tmp
        psrad   xmm_out, 24
        psrad   xmm_tmp, 31
        punpckldq       xmm_out, xmm_tmp
# Option 3 Spill and Load
# This may provide better performance than Option 2 if you're iterating through the whole register
# and you can't optimize for reuse of the original shuffle -- punpck{l,h}dq
      movdqa xmmword ptr[rsp+XXXX], xmm_a
      pmovsxbq xmm_out, word ptr [rsp+XXXX+2*c]

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

        movdqa  xmm_out, xmm_a
        pshufb  xmm_out, mask_i64x2_i16x8_u$c 

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# Option 1 (best performance in most cases)
       pmovsxwq xmm_out, mem_argument(base + 4*c)
# Option 2 
# When c=0
       pmovsxwq xmm_out, xmm_a
# When c=[1,3]
        movdqa  xmm_tmp, xmm_a
        pshufb  xmm_tmp, mask_i64x2_i16x8_condensed_s$((c-1)/2)
        movdqa  xmm_out, xmm_tmp
        psrad   xmm_out, 16
        psrad   xmm_tmp, 31
        punpckhdq       xmm_out, xmm_tmp
# When c=[2]
        movdqa  xmm_tmp, xmm_a
        pshufb  xmm_tmp, mask_i64x2_i16x8_condensed_s$(c/2)
        movdqa  xmm_out, xmm_tmp
        psrad   xmm_out, 16
        psrad   xmm_tmp, 31
        punpckldq       xmm_out, xmm_tmp
# when c=1..3 (with just pure random access and no other conversions needed)
       movdqa xmm_out, xmm_a
       psrldq xmm_out, c*4
       pmovsxwq xmm_out, xmm_out
# Option 3 Spill and Load
# This may provide better performance than Option 2 if you're iterating through the whole register
# and you can't optimize for reuse of the original shuffle -- punpck{l,h}dq
      movdqa xmmword ptr[rsp+XXXX], xmm_a
      pmovsxwq xmm_out, dword ptr [rsp+XXXX+4*c]

x86/x86-64 processors with SSE2 instruction set

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

# all cases
       pxor         xmm_tmp, xmm_tmp
       movdqa   xmm_out, xmm_a
# case c=0
        punpcklbw       xmm_out, xmm_tmp        
        punpcklwd       xmm_out, xmm_tmp              
 # case c=1
        punpcklbw       xmm_out, xmm_tmp        
        punpckhwd       xmm_out, xmm_tmp 
# case c=2
        punpckhbw       xmm_out, xmm_tmp        
        punpcklwd       xmm_out, xmm_tmp        
# case c=3
        punpckhbw       xmm_out, xmm_tmp        
        punpckhwd       xmm_out, xmm_tmp       

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

# all cases
# xmm_out can be uninitialized since we're discarding the values anyway
# case c=0
        punpcklbw       xmm_out, xmm_a         
        punpcklwd       xmm_out, xmm_out
        psrad               xmm_out, 24
 # case c=1
        punpcklbw       xmm_out, xmm_a         
        punpckhwd       xmm_out, xmm_out
        psrad               xmm_out, 24
# case c=2
        punpckhbw       xmm_out, xmm_a         
        punpcklwd       xmm_out, xmm_out
        psrad               xmm_out, 24
# case c=3
        punpckhbw       xmm_out, xmm_a         
        punpckhwd       xmm_out, xmm_out
        psrad                xmm_out, 24
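For reference, the c=0 signed case above expressed as a hedged C/SSE2-intrinsics sketch; it uses the source register for both unpack operands rather than an uninitialized register, which gives the same result after the shift.

```c
#include <emmintrin.h>  /* SSE2 */

/* c=0: interleave so each source byte ends up in the top byte of a 32-bit
   lane (the lower bytes are don't-care), then shift right arithmetically. */
static inline __m128i i32x4_widen_i8x16_s_sse2_c0(__m128i a) {
    __m128i t = _mm_unpacklo_epi8(a, a);   /* a0..a7 duplicated into each word    */
    t = _mm_unpacklo_epi16(t, t);          /* a0..a3 now occupy byte 3 of a dword */
    return _mm_srai_epi32(t, 24);          /* drop the don't-care bytes, keep sign */
}
```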
Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

# all cases
        movdqa xmm_out, xmm_a
        pxor    xmm_tmp, xmm_tmp
# case c=0
        punpcklbw       xmm_out, xmm_tmp
        punpcklwd       xmm_out, xmm_tmp             
        punpckldq       xmm_out, xmm_tmp           
# case c=1
        punpcklbw       xmm_out, xmm_tmp
        punpcklwd       xmm_out, xmm_tmp             
        punpckhdq       xmm_out, xmm_tmp
# case c=2
        punpcklbw       xmm_out, xmm_tmp
        punpckhwd       xmm_out, xmm_tmp             
        punpckldq       xmm_out, xmm_tmp
# case c=3
        punpcklbw       xmm_out, xmm_tmp
        punpckhwd       xmm_out, xmm_tmp             
        punpckhdq       xmm_out, xmm_tmp
# case c=4..7
# repeat c=0..3 with punpckhbw instead of punpcklbw

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# LLVM-MCA seems to suggest a spill for these cases if the chip really only supports SSE2
        movaps  xmmword ptr [safe_memory_location], xmm_a
        movsx   r8, byte ptr [safe_memory_location+c*2] # use whichever 64-bit register makes sense
        movsx   rcx, byte ptr [safe_memory_location+c*2+1] # use whichever 64-bit register makes sense
        movq    xmm_tmp, rcx
        movq    xmm_out, r8
        punpcklqdq      xmm_out, xmm_tmp             

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

# all cases
        movdqa xmm_out, xmm_a
        pxor    xmm_tmp, xmm_tmp
# case c=0
        punpcklwd       xmm_out, xmm_tmp             
        punpckldq       xmm_out, xmm_tmp           
# case c=1
        punpcklwd       xmm_out, xmm_tmp             
        punpckhdq       xmm_out, xmm_tmp
# case c=2
        punpckhwd       xmm_out, xmm_tmp             
        punpckldq       xmm_out, xmm_tmp
# case c=3
        punpckhwd       xmm_out, xmm_tmp             
        punpckhdq       xmm_out, xmm_tmp

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# case c=0
        punpcklwd       xmm_out, xmm_a
        movdqa           xmm_high, xmm_out
        psrad               xmm_out, 16
        psrad               xmm_high, 31            
        punpckldq       xmm_out, xmm_high           
# case c=1
        punpcklwd       xmm_out, xmm_a
        movdqa           xmm_high, xmm_out
        psrad               xmm_out, 16
        psrad               xmm_high, 31            
        punpckhdq       xmm_out, xmm_high        
# case c=2
        punpckhwd       xmm_out, xmm_a
        movdqa           xmm_high, xmm_out
        psrad               xmm_out, 16
        psrad               xmm_high, 31            
        punpckldq       xmm_out, xmm_high       
# case c=3
        punpckhwd       xmm_out, xmm_a
        movdqa           xmm_high, xmm_out
        psrad               xmm_out, 16
        psrad               xmm_high, 31            
        punpckhdq       xmm_out, xmm_high       

on ARM64

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i32x4_i8x16_u$c.16B
### Option 2
        ushll{2}    vOut.8H,  vA.{8B,16B}, #0
        ushll{2}    vOut.4S,  vOut.{4H,8H}, #0

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i32x4_i8x16_s$c.16B
        sshr    vOut.4S, vOut.4S, #24
### Option 2
        sshll{2}    vOut.8H,  vA.{8B,16B}, #0
        sshll{2}    vOut.4S,  vOut.{4H,8H}, #0
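For reference, here is a hedged C/NEON-intrinsics sketch of the Option 2 path for the low lane group; sshll by #0 corresponds to the vmovl intrinsics.

```c
#include <arm_neon.h>

/* 8 -> 16 -> 32 signed widening of the low four lanes (the c=0 case). */
static inline int32x4_t i32x4_widen_i8x16_s_low(int8x16_t a) {
    int16x8_t w16 = vmovl_s8(vget_low_s8(a));   /* sshll vOut.8H, vA.8B, #0 */
    return vmovl_s16(vget_low_s16(w16));        /* sshll vOut.4S, vOut.4H, #0 */
}
```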
Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i8x16_u$c.16B
### Option 2
        ushll{2}    vOut.4H, vA.{8B,16B}, #0
        ushll{2}    vOut.4S,  vOut.{4H,8H}, #0
        ushll{2}    vOut.2D,  vOut.{2S,4S}, #0

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i8x16_s$c.16B
        sshr    vOut.2D, vOut.2D, #56
### Option 2
        sshll{2}    vOut.4H, vA.{8B,16B}, #0
        sshll{2}    vOut.4S,  vOut.{4H,8H}, #0
        sshll{2}    vOut.2D,  vOut.{2S,4S}, #0

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i16x8_u$c.16B
### Option 2
        ushll{2}    vOut.4S,  vA.{4H,8H}, #0
        ushll{2}    vOut.2D,  vOut.{2S,4S}, #0

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i16x8_s$c.16B
        sshr    vOut.2D, vOut.2D, #48
### Option 2
        sshll{2}    vOut.4S,  vA.{4H,8H}, #0
        sshll{2}    vOut.2D,  vOut.{2S,4S}, #0

on ARMv7 with NEON

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

# first lower 64 of mask and input vector
# assuming dLow/DHigh correspond to a Q
        tbl dOutLow, { dALow } (mask_i32x4_i8x16_u$c & 0xffffffffffffffff)
# second upper 64 of mask and input vector
        tbl dOutHigh, { dAHigh } (mask_i32x4_i8x16_u$c >> 64)

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i32x4_i8x16_s$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i32x4_i8x16_s$c >> 64)
        vshr.s32        qOut, qOut, #24
Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i8x16_u$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i8x16_u$c >> 64)

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i8x16_s$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i8x16_s$c >> 64)
        vshr.s64        qOut, qOut, #56

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i16x8_u$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i16x8_u$c >> 64)

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i16x8_s$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i16x8_s$c >> 64)
        vshr.s64 qOut, qOut, #48

@omnisip
Author

omnisip commented Nov 10, 2020

Extra notes about performance and implementation (left for posterity)

ARM64 with int64

On ARM:

According to llvm-mca, tbl/sshr should have identical performance to two sshlls, since they use the exact same ports with the same latency and the same number of instructions. This suggests there's a potential benefit for the 8-to-64-bit case with signed integers.

Signed Data on x64 without SSE4

On architectures that don't support SSE4, it can make sense to spill the vector to memory, load the values into individual registers, move them back into vectors, and unpack. Since machines lacking SSE4 seem to be such an edge case, this should provide reasonably good fallback behavior. Example:

```assembly
movaps xmmword ptr [rsp - 128], xmm0
movsx r8, byte ptr [rsp - 128]
movsx rcx, byte ptr [rsp - 127]
movq xmm0, rcx
movq xmm1, r8
punpcklqdq xmm1, xmm0 # xmm1 = xmm1[0],xmm0[0]
```

@omnisip
Author

omnisip commented Nov 11, 2020

Updated the assembly above to provide comments on spill and load options as well for using pmovsxbq since x64 lacks psraq without AVX512.

@omnisip
Author

omnisip commented Nov 11, 2020

And another option which has the potential to double the signed 64bit output depending on how it's used:

        vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI9_7] # xmm0 = zero,zero,zero,xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,zero
        vpsrad  xmm1, xmm0, 24
        vpsrad  xmm0, xmm0, 31
        vpunpckldq      xmm0, xmm0, xmm1 

@omnisip
Author

omnisip commented Nov 11, 2020

The instruction set check for ARMv7 with NEON is done. The method seems to port nicely.

Example from Godbolt:

        vtbl.8  d3, {d1}, d16
        vtbl.8  d2, {d0}, d17
        vshr.s64        q0, q1, #56

Will update the rest of the documentation for this PR later today. Should cover ARM64, ARMv7+Neon, x64/SSE4 (including SSSE3), AVX, and SSE2.

Updated: This is done.

@omnisip
Author

omnisip commented Nov 13, 2020

hey @ngzhian,

On our call @Maratyszcza had a question about v8 preserving constants with respect to this proposal. This behavior appears to exist on x64 for any constant parameter, and v8 will pregenerate them at the beginning of the code block. Will this also apply to ARMv7 and ARM64? The biggest benefit with respect to this proposal comes from making sure that the masks that are used are only loaded once.

@ngzhian
Member

ngzhian commented Nov 17, 2020

These constant masks will remain in registers and be reused as long as they are not spilled. Same for ARMv7 and ARM64.

@omnisip
Author

omnisip commented Nov 18, 2020

These constant masks will remain in registers and be reused as long as they are not spilled. Same for ARMv7 and ARM64.

That's awesome and should make this really efficient.

@omnisip
Author

omnisip commented Dec 6, 2020

@ngzhian @Maratyszcza

It turns out that the TBL approach with SSHR can be more efficient than SSHLL when the algorithm is adjusted so that SSHR is only called once. According to the ARM Cortex-A76 software optimization guide, shift operations can only execute one per cycle, but TBL operations with two table vectors can execute two per cycle. Whether Cortex-A76 is an accurate testbed remains to be decided -- however, it gave me an idea for a new implementation that uses fewer instructions for signed conversion and leverages the performance of TBL for these integer conversions. The biggest difference in the implementation is the mask that's required for signed conversion.

Here's a Godbolt example.
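For illustration, a hedged AArch64 C/NEON-intrinsics sketch of the adjusted approach; the mask value is an assumption and is expected to place each selected byte in the top byte of a 32-bit lane (out-of-range indices read as zero under tbl).

```c
#include <arm_neon.h>

/* tbl places the four selected bytes in the top byte of each 32-bit lane,
   then a single sshr #24 sign-extends all four lanes at once. */
static inline int32x4_t i32x4_widen_i8x16_s_tbl(int8x16_t a, uint8x16_t mask_s_c) {
    uint8x16_t picked = vqtbl1q_u8(vreinterpretq_u8_s8(a), mask_s_c); /* tbl  */
    return vshrq_n_s32(vreinterpretq_s32_u8(picked), 24);             /* sshr */
}
```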

@ngzhian
Member

ngzhian commented Dec 14, 2020

Thanks for your suggestion and the detailed implementation guide. Couple of notes:

  • A lot of the masks will not be easy to generate; it will likely end up as eor x, y + replace lanes, since we don't have loads of constants from memory.
  • These instructions (especially the 8->32 unsigned ones) look a lot like swizzles/shuffles (the signed ones need an additional arithmetic shift right). V8 has support for shuffles and pattern matching shuffle immediates.
  • Which instruction do you think will see the greatest speed up from having a dedicated instruction, rather than composing existing ones?
  • Can you add more specific links to projects that will benefit from this instruction? Linking to specific snippet of code will make it more clear.

@omnisip
Author

omnisip commented Dec 14, 2020

@ngzhian

First and foremost, thanks for looking at this. I know this is a doozy of a proposal. There are 24 variants masquerading as 6 instructions even if most of them are masks. I'm going to take your questions a bit out of order, so you can understand how this came to be, and what the benefits will be.

  • Which instruction do you think will see the greatest speed up from having a dedicated instruction, rather than composing existing ones?

As it stands today, every integer conversion requires stepwise conversion in the WASM SIMD instruction set. The initial premise thus reduces the minimum number of instructions needed to go from 8 to 64 bits for 8 results from 14 WASM SIMD instructions to 8; for 8 to 32, it's 4 instead of 6. This proposal can do that neatly for unsigned values with PSHUFB/TBL equivalents, assuming the masks are present. For signed data types, the underlying implementation is equally efficient on x86/x64, even though there are more instructions, by virtue of completely different port usage. And if there's even a remote possibility that ARM support can be implemented like this for signed data types, ARM will receive all of the same benefits as well. While all of those cases see clear direct benefits when the data is already in vectors, the largest benefits come from operations that read directly from memory. V8 leverages this functionality for x64 and can do an in-flight LoadTransform (see here) for single-step integer type conversion, but can't do it for multi-step. With these new instructions, load transformation could apply universally for x86/x64 without any interaction from the programmer and without the need for masks, while still giving a very performant solution for ARM.
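To make the instruction-count argument concrete, here is a hedged C/SSE-intrinsics sketch contrasting the two paths for one group of four bytes (c=2); the mask constant is assumed to come from the tables in the opening post.

```c
#include <smmintrin.h>  /* SSE4.1 */
#include <tmmintrin.h>  /* SSSE3  */

/* Stepwise path, mirroring i8x16 -> i16x8 -> i32x4: the "high" half first has
   to be moved down to where the extension instructions can reach it. */
static inline __m128i stepwise_u8_to_u32_c2(__m128i a) {
    __m128i hi  = _mm_srli_si128(a, 8);       /* bytes 8..15 to the bottom     */
    __m128i w16 = _mm_cvtepu8_epi16(hi);      /* -> u16                        */
    return _mm_cvtepu16_epi32(w16);           /* -> u32 (original bytes 8..11) */
}

/* Proposed path: one byte shuffle with a constant mask. */
static inline __m128i proposed_u8_to_u32_c2(__m128i a, __m128i mask_u2) {
    return _mm_shuffle_epi8(a, mask_u2);
}
```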

  • These instructions (especially the 8->32 unsigned ones), look a lot like swizzle/shuffles (the signed ones need another arithmetic shift right). V8 has support for shuffles and pattern matching shuffle immediates.

This is correct (mostly), with a couple of caveats. It doesn't take advantage of any of the underlying LoadTransform machinery listed above, and it has to deal with the less-than-efficient swizzle implementation that doesn't recognize that the input parameters themselves are constant. If we can come up with an optimization like the one proposed in #403, swizzle wouldn't be a bad option, but it'll never be as good as a load-and-shuffle or the LoadTransform above.

  • A lot of the masks will not be easy to generate, it will likely end up like eor x, y + replace lanes as we don't have load constants from memory.

I have some ideas on how to make loading memory constants work nicely inside the current architecture of v8 with minimal changes to the code. I just need some time to flesh them out a bit. For runtime generation, there are a bunch of ways to do it that are better than individual inserts. If you need some samples, please let me know. Even the insert strategy isn't so bad as long as the masks are only generated once and reused by subsequent calls.

  • Can you add more specific links to projects that will benefit from this instruction? Linking to specific snippet of code will make it more clear.

Yes. I'll update this thread with some examples when I have a minute.

@penzn
Contributor

penzn commented Dec 16, 2020

Are any of those compiling to wasm or on the way to compiling?

@omnisip
Author

omnisip commented Dec 17, 2020

Are any of those compiling to wasm or on the way to compiling?

Yes sir. Simdpp (header-only) is up first. @tlively is there a preprocessor macro to detect the Emscripten / wasm implementation?

@tlively
Member

tlively commented Dec 17, 2020

Yep, you can check for the __wasm_simd128__ macro.
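For reference, a minimal sketch of such a check; __wasm_simd128__ is the macro Clang/Emscripten define when building with -msimd128.

```c
/* Guard SIMD code paths on the presence of WASM SIMD support. */
#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#define HAVE_WASM_SIMD 1
#else
#define HAVE_WASM_SIMD 0
#endif
```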

@ngzhian
Member

ngzhian commented Dec 22, 2020

Yup external references (the link you sent) are arch-independent.

@ngzhian
Member

ngzhian commented Dec 23, 2020

@omnisip you mentioned you would get some numbers if this were prototyped. Which instructions are you planning to make use of, and for which architecture? This is a lot to prototype.

@ngzhian
Member

ngzhian commented Dec 23, 2020

Also, simdpp is a SIMD header library; I wouldn't consider it a use case according to our inclusion criteria (since, as a library, it necessarily includes more instructions).
AOM and Xiph use 8->32 and 16->32 AFAICT; I did not find any X->64 usages there.

@omnisip
Author

omnisip commented Dec 23, 2020

@omnisip you mentioned you will get some numbers if this is prototype. Which instruction are you planning to make use of? And for which architecture. This is a lot to prototype.

The most interesting instructions to me are the 8-to-32 ones. I added the 64-bit variants for orthogonality. The unsigned 8-to-32 case stands out most, since on x64 I have to use swizzle four times, yielding at least 8 shuffles, 4 movs, and 4 adds. With the shuffle method, assuming I'm using a second vector of zeros, it doesn't look much better.

I have a prefix sum / scan calculation that leverages quite a bit of this with simdpp, even if it's not posted yet. This will be a WASM-first library for SSIM calculation.

@ngzhian
Member

ngzhian commented Jan 6, 2021

Before any further action, I would like to see more support for this set of instructions, e.g. community members saying that this is useful for them. It would also be better if existing use cases could immediately benefit from this set of instructions, rather than new developments.
For the reasons above, I suggest we mark this set of instructions as post-MVP and focus on locking down our instruction set.

@omnisip
Author

omnisip commented Jan 6, 2021

@ngzhian -- Please see the meeting notes from 11/13/2020, where this was discussed in detail. Specifically, this proposal is necessary because the conversions on x64 from 8 to 32 bits are difficult and expensive to perform with our existing instruction set. No option exists without at least two shuffle ops for any conversion, and all of the widen-high variants require at least 2 (alignr/psrlq + pmovsx...). For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4, yielding 9 instructions and 9 shuffle ops.

The other options -- swizzle and shuffle -- are worse since no pattern matches will occur for these. If that weren't problematic enough, it gets really messy with the signed conversion cases, where you end up with a shuffle like this: shuffle(0,16,16,16,1,17,17,17,2,18,18,18,3,19,19,19)
(the second vector would be the result of determining whether the first vector was less than 0). This turns out to be okay on ARM, where TBL can span 2 vectors and perform that in 1 op -- but it's lousy on x64.

All of that said -- ARM's performance improvement should be as good as the performance improvement for x64 on all of the unsigned cases today. It'll be even better once the proposal for lifting reused constant intermediates is finished. This turns the signed cases into a net 5-instruction solution instead of 8.

Here are some extra use cases that show how these are used elsewhere:
https://github.com/dkfrankandersen/ITU_ResearcProject_Scann/blob/eaba125ccbaa78a6a21bcb7400c9a10321d5a6cf/scann/scann/distance_measures/one_to_one/dot_product_sse4.cc#L180

https://github.com/raspbian-packages/volk/blob/e3a8994b2fd0bb238ae6b460e1f3428a7f1f8f3a/kernels/volk/volk_8i_s32f_convert_32f.h#L81

@ngzhian
Member

ngzhian commented Jan 7, 2021

I looked at the meetings notes, main takeaways:

  • X->64 ones are not useful
  • one comment from JW indicating that these are useful, without details on what for
  • XNNPACK has 1 use case, "loading from memory signed 8->32", which is different from this
  • MD is skeptical of ARM lowering

It'll be even better once the proposal for lifting reused constant intermediates is finished.

this is going to take a while, until then we have to live with the performance cliffs

For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4, yielding 9 instructions and 9 shuffle ops.

I don't understand this part, as above:

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
        movdqa   xmm_out, xmm_a
# when c=0
       pmovzxbd  xmm_out, xmm_out
# when c=1..3
        pshufb  xmm_out, mask_i32x4_i8x16_u$c

Is a single shuffle. How are you getting 9 instructions?

If we were to ignore X->64 for a second, all the 8->32 instructions look like convenience wrappers or groupings around instructions we already have.

My main point is that our existing use cases don't benefit from this set of instructions (especially ->64). Pushing this to post-MVP will help reduce the surface area we need to work on to get to Phase 4, which makes SIMD more useful because it gets into the hands of all users sooner.

@omnisip
Author

omnisip commented Jan 7, 2021

For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4 yielding 9 instructions and 9 shuffle ops.

The 9 instructions and 9 shuffles is what it takes without these proposed instructions.

@omnisip
Author

omnisip commented Jan 7, 2021

If you prototype these for me, you can ditch the 64 bit ones. This was drafted to be fully complete by the submission deadline.

That leaves only two instructions. With the external reference support in v8 making it possible to do aligned loads, we can (and probably should) implement these with memory arguments. The performance should be excellent, and it'll provide good support for 8-bit-to-float conversions, which are often a subsequent step for these.
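As an illustration of the 8-bit-to-float path mentioned here, a hedged C/SSE4.1-intrinsics sketch; the helper name is hypothetical, and a compiler is free to fold the 32-bit load into the pmovzxbd itself.

```c
#include <stdint.h>
#include <string.h>
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi32 */

/* Load four unsigned bytes, zero-extend them to i32, then convert to f32. */
static inline __m128 load4_u8_to_f32(const uint8_t *p) {
    int32_t bits;
    memcpy(&bits, p, sizeof bits);                 /* 4 source bytes */
    __m128i v = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(bits));
    return _mm_cvtepi32_ps(v);
}
```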

@Maratyszcza are there any outstanding proposals that would justify keeping the 16 to 64 variants? What would stand out would be something that allowed conversion of i64s to doubles.

@Maratyszcza
Contributor

i64x2->f64x2 conversion is not supported on x86 until AVX512, so it is not in WAsm SIMD.

@ngzhian
Member

ngzhian commented Jan 8, 2021

The 9 instructions and 9 shuffles is what it takes without these proposed instructions.

Instead of the stepwise conversion, you can emit the single pshufb you need to get from 8x16 -> 32x4. Does that not work?

@tlively
Member

tlively commented Jan 29, 2021

Also, emcc: warning: LLVM version appears incorrect (seeing "13.0", expected "12.0") doesn't look good. Are you using the latest version of Emscripten from emsdk?

@omnisip
Author

omnisip commented Jan 29, 2021

Yep. I just did emsdk install tot, an hour ago.

I'm assuming I had to compile llvm from source, so I pointed emscripten to point at my new llvm build. Is that wrong?

@penzn
Contributor

penzn commented Jan 29, 2021

I don't know if that is what's going on, but LLVM just rolled the version from 12 to 13 very recently (maybe even yesterday), maybe emscripten's version detection hasn't gotten the memo.

@tlively
Member

tlively commented Jan 30, 2021

Weird, it looks like that expectation was updated yesterday. @omnisip, did you do emsdk update-tags before installing tot?

@omnisip
Author

omnisip commented Jan 30, 2021

Weird, it looks like that expectation was updated yesterday. @omnisip, did you do emsdk update-tags before installing tot?

Don't recall doing emsdk update-tags, but I think I did emsdk update. I'll redo it again if that helps.

@omnisip
Author

omnisip commented Jan 30, 2021

Same issue, but the warning flags are gone now -- so that's a plus.

dan@dl360:~/applications/wrapper/xnnpack$ EMCC_DEBUG=1 /opt/emsdk/upstream/emscripten/emcc -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s 'ASSERTIONS=1' -s 'ERROR_ON_UNDEFINED_SYMBOLS=1' -s 'EXIT_RUNTIME=1' -s 'ALLOW_MEMORY_GROWTH=1' -s 'TOTAL_MEMORY=268435456' --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK -msimd128 -s 'USE_PTHREADS=0' -s 'ERROR_ON_UNDEFINED_SYMBOLS=0' '-Wl,--export=__heap_base' '-Wl,--export=__data_end'
tools.filelock:DEBUG: Attempting to acquire lock 139907621271968 on /tmp/emscripten_temp/emscripten.lock
tools.filelock:DEBUG: Lock 139907621271968 acquired on /tmp/emscripten_temp/emscripten.lock
emcc:WARNING: invocation: /opt/emsdk/upstream/emscripten/emcc.py -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s ASSERTIONS=1 -s ERROR_ON_UNDEFINED_SYMBOLS=1 -s EXIT_RUNTIME=1 -s ALLOW_MEMORY_GROWTH=1 -s TOTAL_MEMORY=268435456 --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK -msimd128 -s USE_PTHREADS=0 -s ERROR_ON_UNDEFINED_SYMBOLS=0 -Wl,--export=__heap_base -Wl,--export=__data_end  (in /home/dan/applications/wrapper/xnnpack)
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/clang --version
cache:DEBUG: PID 208108 acquiring multiprocess file lock to Emscripten cache at /opt/emsdk/upstream/emscripten/cache
tools.filelock:DEBUG: Attempting to acquire lock 139907621272112 on /opt/emsdk/upstream/emscripten/cache/cache.lock
tools.filelock:DEBUG: Lock 139907621272112 acquired on /opt/emsdk/upstream/emscripten/cache/cache.lock
cache:DEBUG: done
shared:DEBUG: sanity file up-to-date but check forced: /opt/emsdk/upstream/emscripten/cache/sanity.txt
shared:DEBUG: successfully executed /opt/emsdk/node/12.18.1_64bit/bin/node --version
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/llc --version
shared:INFO: (Emscripten: Running sanity checks)
shared:DEBUG: successfully executed /opt/emsdk/node/12.18.1_64bit/bin/node -e console.log("hello")
tools.filelock:DEBUG: Attempting to release lock 139907621272112 on /opt/emsdk/upstream/emscripten/cache/cache.lock
tools.filelock:DEBUG: Lock 139907621272112 released on /opt/emsdk/upstream/emscripten/cache/cache.lock
cache:DEBUG: PID 208108 released multiprocess file lock to Emscripten cache at /opt/emsdk/upstream/emscripten/cache
diagnostics:DEBUG: disabled warning: use of legacy setting: TOTAL_MEMORY (setting renamed to INITIAL_MEMORY) [-Wlegacy-settings]
emcc:DEBUG: compiling to bitcode
emcc:DEBUG: emcc step "parse arguments and setup" took 0.12 seconds
emcc:DEBUG: using object file: bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libXNNPACK.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libmemory_planner.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liboperator_run.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liboperators.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libindirection.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liblogging_utils.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libpacking.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libscalar_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libwasm_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libtables.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libasm_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libbench_utils.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/clog/libclog.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a
emcc:DEBUG: emcc step "compile inputs" took 0.00 seconds
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libXNNPACK.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libmemory_planner.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liboperator_run.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liboperators.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libindirection.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liblogging_utils.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libpacking.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libscalar_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libwasm_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libtables.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libasm_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libbench_utils.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/clog/libclog.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a
system_libs:DEBUG: adding dependency on malloc due to deps-info on realloc
system_libs:DEBUG: adding dependency on free due to deps-info on realloc
system_libs:DEBUG: adding dependency on malloc due to deps-info on getenv
system_libs:DEBUG: adding dependency on free due to deps-info on getenv
system_libs:DEBUG: adding dependency on malloc due to deps-info on gmtime_r
system_libs:DEBUG: adding dependency on free due to deps-info on gmtime_r
system_libs:DEBUG: adding dependency on _get_tzname due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on _get_daylight due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on _get_timezone due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on malloc due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on free due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on malloc due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on free due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on emscripten_main_thread_process_queued_calls due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on malloc due to deps-info on calloc
system_libs:DEBUG: adding dependency on free due to deps-info on calloc
system_libs:DEBUG: including libgl (libgl.a)
system_libs:DEBUG: including libal (libal.a)
system_libs:DEBUG: including libhtml5 (libhtml5.a)
system_libs:DEBUG: including libc (libc.a)
system_libs:DEBUG: including libcompiler_rt (libcompiler_rt.a)
system_libs:DEBUG: including libc++ (libc++-noexcept.a)
system_libs:DEBUG: including libc++abi (libc++abi-noexcept.a)
system_libs:DEBUG: including libmalloc (libdlmalloc.a)
system_libs:DEBUG: including libc_rt_wasm (libc_rt_wasm.a)
system_libs:DEBUG: including libsockets (libsockets.a)
emcc:DEBUG: emcc step "calculate system libraries" took 0.46 seconds
emcc:DEBUG: linking: ['bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o', 'bazel-out/wasm-dbg/bin/libXNNPACK.a', 'bazel-out/wasm-dbg/bin/libmemory_planner.a', 'bazel-out/wasm-dbg/bin/liboperator_run.a', 'bazel-out/wasm-dbg/bin/liboperators.a', 'bazel-out/wasm-dbg/bin/libindirection.a', 'bazel-out/wasm-dbg/bin/liblogging_utils.a', 'bazel-out/wasm-dbg/bin/libpacking.a', 'bazel-out/wasm-dbg/bin/libscalar_ukernels.a', 'bazel-out/wasm-dbg/bin/libwasm_ukernels.a', 'bazel-out/wasm-dbg/bin/libtables.a', 'bazel-out/wasm-dbg/bin/libasm_ukernels.a', 'bazel-out/wasm-dbg/bin/libbench_utils.a', 'bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a', 'bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a', 'bazel-out/wasm-dbg/bin/external/clog/libclog.a', 'bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a', '--export=__heap_base', '--export=__data_end', '-L/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-noexcept.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-noexcept.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets.a']
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/wasm-ld -o bazel-out/wasm-dbg/bin/elu_bench.wasm bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a --export=__heap_base --export=__data_end -L/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-noexcept.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-noexcept.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets.a -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --allow-undefined --export main --export emscripten_stack_get_end --export emscripten_stack_get_free --export emscripten_stack_init --export stackSave --export stackRestore --export stackAlloc --export __wasm_call_ctors --export fflush --export __errno_location --export malloc --export free --export _get_tzname --export _get_daylight --export _get_timezone --export emscripten_main_thread_process_queued_calls --export-table -z stack-size=5242880 --initial-memory=268435456 --no-entry --max-memory=2147483648 --global-base=1024
emcc:DEBUG: emcc step "link" took 0.12 seconds
emcc:DEBUG: emscript
building:DEBUG: saving debug copy /tmp/emscripten_temp/emcc-0-base.wasm
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/wasm-opt --version
[parse exception: invalid code after SIMD prefix: 103 (at 0:222205)]
Fatal: error in parsing input
emcc: error: '/opt/emsdk/upstream/bin/wasm-emscripten-finalize --detect-features --minimize-wasm-changes -g --bigint --no-dyncalls --no-legalize-javascript-ffi --dwarf bazel-out/wasm-dbg/bin/elu_bench.wasm' failed (1)
tools.filelock:DEBUG: Attempting to release lock 139907621271968 on /tmp/emscripten_temp/emscripten.lock
tools.filelock:DEBUG: Lock 139907621271968 released on /tmp/emscripten_temp/emscripten.lock

@tlively
Member

tlively commented Jan 30, 2021

Looks like I'll have to do a Binaryen implementation after all. Sorry about that. I should have a PR up soon.

tlively added a commit to tlively/binaryen that referenced this pull request Feb 1, 2021
As proposed in WebAssembly/simd#395. Note that the other
instructions in the proposal have not been implemented in LLVM or in V8, so
there is no need to implement them in Binaryen right now either. This PR
introduces a new expression class for the new instructions because they uniquely
take an immediate argument identifying which portion of the input vector to
widen.
tlively added a commit to tlively/binaryen that referenced this pull request Feb 1, 2021
As proposed in WebAssembly/simd#395. Note that the other
instructions in the proposal have not been implemented in LLVM or in V8, so
there is no need to implement them in Binaryen right now either. This PR
introduces a new expression class for the new instructions because they uniquely
take an immediate argument identifying which portion of the input vector to
widen.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Feb 2, 2021
As proposed in WebAssembly/simd#395. Note that the other
instructions in the proposal have not been implemented in LLVM or in V8, so
there is no need to implement them in Binaryen right now either. This PR
introduces a new expression class for the new instructions because they uniquely
take an immediate argument identifying which portion of the input vector to
widen.
@dtig dtig added the needs discussion Proposal with an unclear resolution label Feb 2, 2021
@dtig
Member

dtig commented Feb 2, 2021

I wasn't clear with my comments on the meetings issue, so I'm following up here so we can discuss in more detail. The number of operations introduced here and the number of constant masks required, combined with the fact that I can't seem to narrow down whether there are production use cases for these operations, make me lean towards these not being a good candidate for the MVP. A couple of things that would make a more compelling case for the inclusion of these operations -

  • Narrowing down to only the i32x4 operations (there's some discussion earlier in the issue about this, but it's not clear whether the intent is to include the i64x2 variants in the future)
  • Pointers to production use cases, or the benchmarks in which this would be tested.

@omnisip
Author

omnisip commented Feb 2, 2021

I have little objection to dropping the 64-bit variants; I don't use them. I originally proposed them for completeness, to be discussed before the November cut-off, but didn't know if they served any value.

With respect to the i8->i32 use cases, many are used for conversion from i8->i32->f32. How this is done depends on the algorithm and the level of precision. For instance, my Structural Similarity (SSIM) calculation depends on 2D prefix sums (scan) being calculated efficiently. In that case, I convert from i8->i32 before going to f32 for the final calculations. However, there are plenty of libraries that go from i8->i32->f32 in one go.

For benchmarking purposes, I plan on presenting the performance improvement or regression based on the modifications in prefix sum calculations and select XNNPACK sections that use load_8x8 today.

@dtig
Member

dtig commented Feb 4, 2021

@omnisip Sounds good, then let's limit this issue and the subsequent vote only to the non-64x2 variants. I'm not adding a vote here as it would make sense to vote after looking at the data. Adding this to this week's agenda.

@omnisip
Author

omnisip commented Feb 4, 2021

Test Case

2D Prefix Sum with 10K images each at 1920x1080 resolution. Each test was repeated 5 times to mitigate any errors or jitter in the results set.

Preliminary Results Table

UPDATED: 2021-02-04 23:34Z

| CPU | Test | Best of 5 | Avg Performance Time |
| --- | --- | --- | --- |
| Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (Skylake) | Prefix Sum - Swizzle Widening | 32239 ms | 32881 ms |
| Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (Skylake) | Prefix Sum - Stepwise | 28507 ms | 29177 ms |
| Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (Skylake) | Prefix Sum - widen_i8x16 | 27942 ms | 29220 ms |
| Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (Skylake) | Prefix Sum - Single-Op Swizzle | 27429 ms | 27777 ms |
| Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (Sandybridge) | Prefix Sum - Swizzle Widening | 24342 ms | 25662 ms |
| Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (Sandybridge) | Prefix Sum - Stepwise | 23168 ms | 23640 ms |
| Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (Sandybridge) | Prefix Sum - widen_i8x16 | 23148 ms | 23293 ms |
| Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (Sandybridge) | Prefix Sum - Single-Op Swizzle | 23275 ms | 23545 ms |

Analysis

While widen_i8x16 is significantly better than swizzle, it's not significantly better than stepwise expansion (i8x16->i16x8->i32x4). This is likely because the memory constants have to be reloaded each time. When compilers implement reuse of intermediate constants, performance is likely to improve on the swizzle variants such that it will be better than both the stepwise and widen implementations. A simple test to verify this can be performed by testing swizzle with the proposed masks on ARM64 against the stepwise instructions, or by patching i8x16.swizzle for x64 (the method used to produce the results above).

@ngzhian
Member

ngzhian commented Feb 4, 2021

With https://crrev.com/c/2664994 this is prototyped on arm64 as well, using the double shifts.

Just saw the latest comment sharing the benchmarks, would you prefer the arm64 code sequence to use tbl instead?

@omnisip
Author

omnisip commented Feb 4, 2021

@ngzhian -- I appreciate the offer, but I don't think that's necessary.

After going through this and benchmarking it a few different ways, there's a better way to address this specific issue and provide a solution/optimization that makes swizzle on x64 competitive with the ARM64 implementation. In essence, not only would it make for a faster swizzle for i8x16 -> i32x4, it would also make a whole host of other applications that use swizzle more performant.

To do that, we should consider detecting v128 consts as arguments and performing optimizations in the InstructionSelector. If we can detect an S128Constant through the graph using a NodeMatcher, we can determine before the code-generation step whether the input to swizzle is constant, and if so, whether the parameter needs modification. If we can determine that the top bit is already set for any values that are out of range, we can emit a pshufb without any additional movdqu, pshufd, and paddusb instructions -- effectively making swizzle a single-op instruction and eliminating the need for this instruction set.
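For illustration, a hedged C/SSE-intrinsics sketch of the two swizzle lowerings being compared; the saturating-add constant mirrors the paddusb sequence mentioned above, and the helper names are hypothetical.

```c
#include <emmintrin.h>
#include <tmmintrin.h>

/* General swizzle: indices >= 16 must select zero, but pshufb only zeroes a
   lane when the index's top bit is set, so saturating-add 112 first pushes
   16..255 into the top-bit range. */
static inline __m128i swizzle_general(__m128i a, __m128i idx) {
    __m128i fixed = _mm_adds_epu8(idx, _mm_set1_epi8((char)112)); /* paddusb */
    return _mm_shuffle_epi8(a, fixed);                            /* pshufb  */
}

/* Constant-mask swizzle: if every out-of-range index already has its top bit
   set (e.g. the 255s in the tables above), the fix-up can be skipped. */
static inline __m128i swizzle_constant_mask(__m128i a, __m128i const_mask) {
    return _mm_shuffle_epi8(a, const_mask);
}
```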

@ngzhian
Member

ngzhian commented Feb 4, 2021

To do that, we should consider detecting v128 Consts as arguments and perform optimizations

Good suggestion, we have a tracking bug for such optimizations at https://crbug.com/v8/10992. This is a slow case that has shown up multiple times and an optimization we would like to get to.

@omnisip
Author

omnisip commented Feb 4, 2021

I updated the table above to show what the performance would be if the optimization was in-place.

@lars-t-hansen
Contributor

Indeed, SpiderMonkey translates swizzle-with-constant-mask into shuffle-with-zero, which is then subject to the usual pattern matching optimizations: https://searchfox.org/mozilla-central/source/js/src/jit/MIR.cpp#4313.

@omnisip
Author

omnisip commented Feb 5, 2021

This question may sound a bit silly, but what indices should I be using with the zero vector for this test or does it not matter?

@omnisip
Author

omnisip commented Feb 10, 2021

@abrown, @dtig, @ngzhian -- Here are the benchmarks for shuffle:

| CPU | Test | Best of 5 | Avg Performance Time | % Worse Than Unadjusted Swizzle |
| --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (Skylake) | Prefix Sum - Shuffle Zero Widening | 44861 ms | 46128 ms | 40.2% |
| Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (Sandybridge) | Prefix Sum - Shuffle Zero Widening | 29613 ms | 29924 ms | 16.6% |

This week (hopefully today or tomorrow), I'm going to put together a self-contained sample for everyone to test and file the ticket.

@dtig
Member

dtig commented Mar 5, 2021

Closing as per #436.

@dtig dtig closed this Mar 5, 2021
nikic pushed a commit to rust-lang/llvm-project that referenced this pull request Mar 18, 2021
cuviper pushed a commit to rust-lang/llvm-project that referenced this pull request Apr 15, 2021
nikic pushed a commit to rust-lang/llvm-project that referenced this pull request Jul 10, 2021