Description
Hi,
I have stumbled upon conversions between Bool vectors and SIMD masks that Julia/LLVM generates in the amd64 assembly, and I am wondering if there is any technique or idiom to reduce the resulting runtime conversions.
This is related to SIMD programming in Julia and might not be specific to the SIMD.jl package, although it could be that some representation for masks is missing in SIMD.jl, so I hope this is the right place to discuss it.
My concerns regarding the topic are:
- Would it make sense to expose "quadword masks" (and similar) as returned by vector-comparisons in amd64 in Julia?
- ... and how could that be done?
- ... or does exposing them belong in the LLVM domain?
- If not in Julia, how are vector masks meant to be used directly in LLVM then?
- LLVM calls these masks "condition vectors", but condition vectors and boolean SIMD-vectors seem to differ conceptually
- And generally, how architecture-dependent is SIMD-mask programming across Julia's supported x86 and ARM targets in 64 and 32 bit?
- i.e. "how does this work under ARM?"
Background
The conversions occur when using `vifelse` from the SIMD.jl package, which produces a `select` statement in LLVM assembly and a `vblendvpd` instruction in amd64 assembly. Now I found out that
- Julia's `Vec{N, Bool}` seems to be represented as `<N x i8>`,
- the generated LLVM statement `select` seems to take `<N x i1>` masks,
- for the amd64 instruction `vblendvpd`, "The mask bits are the most significant bit in each quadword element of the mask register",
- the LLVM statement for comparison `fcmp` seems to produce `<N x i1>` results,
- and the amd64 instruction for comparison `vcmpltpd` produces "a quadword mask of all 1s (comparison true) or all 0s (comparison false)".
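As a concrete illustration of the two representations contrasted above (my understanding, not authoritative): a `Vec{4,Bool}` lane stores its value in the low bit of one byte, whereas an amd64 compare such as `vcmpltpd` writes an all-ones quadword per true lane, and the blend instructions only look at the most significant bit. On the Julia side this stays invisible:

```julia
using SIMD

a = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))
b = Vec{4,Float64}((4.0, 3.0, 2.0, 1.0))

m = a < b      # Vec{4,Bool}; per the IR below, stored as <4 x i8> lanes of 0x00/0x01
println(m)     # prints lane-wise Bools, regardless of how the hardware mask looks
```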
I understand that Julia's and LLVM's representation conversions are "opaque"/architecture-independent and meant to be optimized away in most cases, but for tiny-tiny SIMD "kernels" the architecture and the mask representation affect the way one would write a program.
E.g. I found a technique where they "load from the mask once per 16 elements, and reusing the vector by bit shifting", which has an amd64-specific smell to it. Such techniques only work when we can treat the SIMD mask as a "quadword mask", which may or may not be in the scope of Julia because of its architecture dependence (?)
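For readers who have not seen that trick, here is a purely scalar sketch of the "load the mask once per 16 elements and reuse it by shifting" idea; the names (`masked_sum`, `maskbits`) are made up and not from SIMD.jl, and the real SIMD version additionally relies on the blend instructions reading only one bit per lane:

```julia
# Scalar sketch only: one UInt16 holds the mask bits for a block of 16 elements,
# is loaded once, and is consumed by shifting instead of reloading per element.
# Assumes length(x) == 16 * length(maskbits).
function masked_sum(x::Vector{Float64}, maskbits::Vector{UInt16})
    s = 0.0
    for (blk, m) in enumerate(maskbits)
        base = (blk - 1) * 16
        for i in 1:16
            if (m & 0x0001) != 0   # test the current lowest mask bit
                s += x[base + i]
            end
            m >>= 1                # shift so the next element's bit moves into place
        end
    end
    return s
end
```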
Example
select_min(a,b) = vifelse( (a < b) , a, b )
select_min(a,b,mask_a,mask_b) = vifelse( vifelse((mask_a & mask_b),(a < b),mask_a), a, b )
select_min(a,b,mask_a,mask_b) = vifelse( mask_a & ((~mask_b) | (a < b)) , a, b )

To illustrate the issue, think of a merge of two vectors by a comparison where each input comes with a mask.
I.e. from vectors a and b we want to component-wise select the smaller element, but only when mask_a or mask_b is true. This could be a vector-reduction step in a masked minimum operation:
| a < b | mask_a | mask_b | masked_cond | masked minimum |
|---|---|---|---|---|
| 0 | 0 | 0 | * | * |
| 0 | 0 | 1 | 0 | b |
| 0 | 1 | 0 | 1 | a |
| 0 | 1 | 1 | 0 | b |
| 1 | 0 | 0 | * | * |
| 1 | 0 | 1 | 0 | b |
| 1 | 1 | 0 | 1 | a |
| 1 | 1 | 1 | 1 | a |
A boolean expression to realize this behaviour would be
ma & mb .? a < b .: ma
= (ma & mb) & (a < b) | ~(ma & mb) & ma
= ma & (~mb | (a < b))
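Just to double-check that simplification, a throwaway scalar verification over all Bool combinations (including the two "don't care" rows of the table):

```julia
# Exhaustively verify:  (ma & mb) ? (a < b) : ma   ==   ma & (~mb | (a < b))
for ma in (false, true), mb in (false, true), lt in (false, true)
    ternary = (ma & mb) ? lt : ma
    derived = ma & (!mb | lt)
    @assert ternary == derived
end
```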
When extending this masked minimum into a masked findmin, one strategy would be to re-use the mask on a separate index vector. I stick with the first variant, although this specific function could be implemented differently:
using SIMD
function select_min(
a :: Vec{4,Float64} # 4×64 float
, b :: Vec{4,Float64} # 4×64 float
, mask_a :: Vec{4,Bool} # 4×8 least-significant-bit bool
, mask_b :: Vec{4,Bool} # 4×8 least-significant-bit bool
)
cond_A = a < b #✓# `cond_A` is a 4×64 most-significant-bit mask
#↯# `cond_A` gets converted to a 4×8 least-significant-bit bool
cond_B = mask_a & mask_b #✓# `cond_B` is a 4×8 least-significant-bit bool
#↯# `cond_B` gets converted to a 4×8 most-significant-bit mask
masked_cond =
vifelse(cond_B, cond_A, mask_a) #✓# `masked_cond` is a 4×8 least-significant-bit bool
#↯# `masked_cond` gets converted to a 4×64 most-significant-bit mask
return vifelse(masked_cond, a, b)
# think here of multiple other "companion-components" that should fit into registers such as
# return ( vifelse(masked_cond, a , b )
# , vifelse(masked_cond, a_index, b_index)
# , vifelse(masked_cond, a_data , b_data )
# )
end

Here, I've marked the implicit conversions with a small ↯ symbol. The conversions become "visible" when looking at the generated assembly with
N = 4
T = Float64
code_native(select_min,(Vec{N,T},Vec{N,T},Vec{N,Bool},Vec{N,Bool}); debuginfo=:none)

Then we can see that the mask conversions account for a lot of instructions and some additional registers.
I have annotated the assembly as far as I understood it:
movq %rdi, %rax
vmovupd (%rsi), %ymm0 ;✓; load a (to ymm0)
vmovupd (%rdx), %ymm1 ;✓; load b (to ymm1)
vcmpltpd %ymm1, %ymm0, %ymm2 ;✓; ymm2 = a < b
; ; ⇒ ymm2 == cond_A
; ; packing of cond_A:
vextractf128 $1, %ymm2, %xmm3 ;↯; xmm3 = upper part of ymm2
vpackssdw %xmm3, %xmm2, %xmm2 ;↯; xmm2 = ... packing ... of xmm3
vpackssdw %xmm2, %xmm2, %xmm2 ;↯; xmm2 = ... packing ... of xmm2
vpacksswb %xmm2, %xmm2, %xmm2 ;↯; xmm2 = ... packing ... of xmm2
movabsq $.rodata.cst16, %rdx ;↯; ...
vpand (%rdx), %xmm2, %xmm2 ;↯; xmm2 &= some bitmask
; ; ⇒ xmm2 == packed cond_A
movl (%rcx), %ecx ;✓; ecx = mask_a
vmovd %ecx, %xmm3 ;✓; xmm3 = mask_a
andl (%r8), %ecx ;✓; ecx = mask_a & mask_b
vmovd %ecx, %xmm4 ;✓; xmm4 = mask_a & mask_b
; ; ⇒ xmm4 = cond_B
; ; unpacking of cond_B:
movabsq $140542381220720, %rcx ;↯; ...
vpand (%rcx), %xmm4, %xmm4 ;↯; xmm4 &= some bitmask (?)
vpmovzxbd %xmm4, %xmm4 ;↯; xmm4 = xmm4[0...1...2...3...]
movabsq $140542381220736, %rcx ;↯; ...
vpshufb (%rcx), %xmm4, %xmm4 ;↯; xmm4 = "Packed Shuffle Bytes"
vpsllw $7, %xmm4, %xmm4 ;↯; xmm4 <<= 7
; ; ⇒ xmm4 == unpacked cond_B
vpblendvb %xmm4, %xmm2, %xmm3, %xmm2 ;✓; xmm2 = xmm4 .? xmm2 .: xmm3
; ; ⇒ xmm2 = masked_cond
; ; unpacking of masked_cond:
vpmovzxbd %xmm2, %xmm2 ;↯; xmm2 = xmm2[0...1...2...3...]
vpslld $31, %xmm2, %xmm2 ;↯; xmm2 <<= 31
vpmovsxdq %xmm2, %ymm2 ;↯; ymm2 = "Packed Move with Sign Extend" xmm2
; ; ⇒ ymm2 == unpacked masked_cond
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0 ;✓; ymm0 = ymm2 .? a .: b
vmovapd %ymm0, (%rdi) ;✓; store result
vzeroupper
retq

The reason why these conversions occur might already be present in the LLVM code:
define void @julia_mselect_65501(
[1 x <4 x double>]* noalias nocapture sret([1 x <4 x double>]) %0
, [1 x <4 x double>]* nocapture nonnull readonly align 16 dereferenceable(32) %1
, [1 x <4 x double>]* nocapture nonnull readonly align 16 dereferenceable(32) %2
, [1 x <4 x i8>]* nocapture nonnull readonly align 4 dereferenceable( 4) %3
, [1 x <4 x i8>]* nocapture nonnull readonly align 4 dereferenceable( 4) %4
) #0 {
top:
%5 = getelementptr inbounds [1 x <4 x double>] .... ; ...
%6 = getelementptr inbounds [1 x <4 x double>] .... ; ...
%7 = load <4 x double>, <4 x double>* %5, align 16 ; %7 = a
%8 = load <4 x double>, <4 x double>* %6, align 16 ; %8 = b
%res.i = fcmp olt <4 x double> %7, %8 ; %res.i = a < b (4 x i1)
%resb.i = zext <4 x i1> %res.i to <4 x i8> ; %resb.i = a < b (4 x i8)
; ⇒ %resb.i = cond_A
%9 = getelementptr inbounds [1 x <4 x i8>] .... ; ...
%10 = getelementptr inbounds [1 x <4 x i8>] .... ; ...
%11 = load <4 x i8>, <4 x i8>* %9, align 4 ; %11 = mask_a (4 x i8)
%12 = load <4 x i8>, <4 x i8>* %10, align 4 ; %12 = mask_b (4 x i8)
%13 = and <4 x i8> %12, %11 ; %13 = mask_a & mask_b (4 x i8)
; ⇒ %13 == cond_B (4 x i8)
%cond.i2 = trunc <4 x i8> %13 to <4 x i1> ; %cond.i2 = cond_B (4 x i1)
%res.i3 = select <4 x i1> %cond.i2, <4 x i8> %resb.i, <4 x i8> %11 ; ⇒ %res.i3 == masked_cond (4 x i8)
%cond.i = trunc <4 x i8> %res.i3 to <4 x i1> ; %cond.i = masked_cond (4 x i1)
%res.i1 = select <4 x i1> %cond.i, <4 x double> %7, <4 x double> %8 ; %res.i1 = masked_cond .? a .: b
%14 = getelementptr inbounds [1 x <4 x double>] .... ;
store <4 x double> %res.i1, <4 x double>* %14, align 32 ; store result
ret void
}

Although I am not fluent in amd64, I hope that something like this snippet would be possible:
vmovupd (%rsi), %ymm0 ;✓; load a (to ymm0)
vmovupd (%rdx), %ymm1 ;✓; load b (to ymm1)
vmovupd (%r??), %ymm3 ;✓; load mask_a (to ymm3)
vmovupd (%r??), %ymm4 ;✓; load mask_b (to ymm4)
; ; ⇒ these registers might already be loaded anyways
vcmpltpd %ymm1, %ymm0, %ymm2 ;✓; ymm2 = a < b (cond_A)
vandps %ymm3, %ymm4, %ymm4 ;✓; ymm4 = mask_a & mask_b (cond_B)
vpblendvb %ymm4, %ymm2, %ymm3, %ymm2 ;✓; ymm2 = ymm4 .? ymm2 .: ymm3 (masked_cond)
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0 ;✓; ymm0 = ymm2 .? a .: b

This makes use of 5 ymm registers and could therefore be unrolled three times without register spilling and whatnot.
Would it be possible to achieve this in Julia somehow?
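One experiment I could imagine (only a sketch; I have not checked whether LLVM actually emits fewer conversions for it, and the final `vifelse` still goes through a `Vec{4,Bool}`): carry the masks as full-width 0/-1 lanes in `Vec{4,Int64}`, so that the comparison result and all mask logic stay 4×64 wide:

```julia
using SIMD

# Speculative sketch, not verified to remove the conversions: the masks are
# Vec{4,Int64} lanes holding 0 (false) or -1 (all 64 bits set, true).
function select_min_wide(a::Vec{4,Float64}, b::Vec{4,Float64},
                         mask_a::Vec{4,Int64}, mask_b::Vec{4,Int64})
    allset = Vec{4,Int64}(-1)
    clear  = Vec{4,Int64}(0)
    lt     = vifelse(a < b, allset, clear)             # widen the comparison result once
    cond   = (mask_a & mask_b & lt) | (~(mask_a & mask_b) & mask_a)
    return vifelse(cond < clear, a, b)                 # true lanes are -1, i.e. negative
end
```

Comparing `code_native` of this variant against the original `select_min` should show whether the round-trips actually shrink; I have not done that comparison.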
edit/PS: It seems that the whole issue generalizes to SIMD vectors with element sizes greater than a single byte.
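For example (untested sketch; `select_min_generic` is a re-definition of the kernel above with generic types, since `select_min` is typed for `Vec{4,Float64}`), the same inspection for 8×Float32 would be:

```julia
using SIMD, InteractiveUtils

# Generic version of the kernel, so other lane counts / element types can be
# inspected for the same Bool↔mask round-trips.
function select_min_generic(a::Vec{N,T}, b::Vec{N,T},
                            mask_a::Vec{N,Bool}, mask_b::Vec{N,Bool}) where {N,T}
    return vifelse(vifelse(mask_a & mask_b, a < b, mask_a), a, b)
end

code_native(select_min_generic,
            (Vec{8,Float32}, Vec{8,Float32}, Vec{8,Bool}, Vec{8,Bool});
            debuginfo=:none)
```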