Possible suboptimal code generation for SIMD any function #72413

Open
karwa opened this issue Mar 19, 2024 · 1 comment
Labels: bug, SILOptimizer, simd


karwa commented Mar 19, 2024

Description

Test code:

func test_stdlib_8(_ input: SIMD8<UInt8>) -> Bool {
    any(input .== SIMD8(repeating: 0x42))
}

Building this with -O produces:

.LCPI1_0:
        .byte   66
        .byte   66
        .byte   66
        .byte   66
        .byte   66
        .byte   66
        .byte   66
        .byte   66
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
output.test_stdlib_8(Swift.SIMD8<Swift.UInt8>) -> Swift.Bool:
        movq    xmm0, rdi
        pcmpeqb xmm0, xmmword ptr [rip + .LCPI1_0]
        pmovmskb        eax, xmm0
        test    al, al
        setne   al
        ret

Which is nice 👍

Unfortunately, when I widen the vector to 16+ elements, the any function becomes a massive, outlined blob of code:

func test_stdlib_16(_ input: SIMD16<UInt8>) -> Bool {
    any(input .== SIMD16(repeating: 0x42))
}

.LCPI3_0:
        .zero   16,66
output.test_stdlib_16(Swift.SIMD16<Swift.UInt8>) -> Swift.Bool:
        push    rax
        pcmpeqb xmm0, xmmword ptr [rip + .LCPI3_0]
        call    (generic specialization <Swift.SIMD16<Swift.Int8>> of (extension in Swift):Swift.SIMD< where A.Scalar: Swift.Comparable>.min() -> A.Scalar)
        shr     al, 7
        pop     rcx
        ret

generic specialization <Swift.SIMD16<Swift.Int8>> of (extension in Swift):Swift.SIMD< where A.Scalar: Swift.Comparable>.min() -> A.Scalar:
        pshufd  xmm1, xmm0, 238
        movdqa  xmm2, xmm1
        pcmpgtb xmm2, xmm0
        pand    xmm0, xmm2
        pandn   xmm2, xmm1
        por     xmm2, xmm0
        pshufd  xmm0, xmm2, 85
        movdqa  xmm1, xmm0
        pcmpgtb xmm1, xmm2
        pand    xmm2, xmm1
        pandn   xmm1, xmm0
        por     xmm1, xmm2
        movdqa  xmm0, xmm1
        psrld   xmm0, 16
        movdqa  xmm2, xmm0
        pcmpgtb xmm2, xmm1
        pand    xmm1, xmm2
        pandn   xmm2, xmm0
        por     xmm2, xmm1
        movdqa  xmm0, xmm2
        psrlw   xmm0, 8
        movdqa  xmm1, xmm0
        pcmpgtb xmm1, xmm2
        pand    xmm2, xmm1
        pandn   xmm1, xmm0
        por     xmm1, xmm2
        movd    eax, xmm1
        ret

The SIMD mask is 16 bytes, and the any function basically amounts to mask != 0, so... even though I'm not an expert at SIMD instruction sets, it feels like this is probably not optimal.
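Judging from the assembly above (a call to the min() specialization followed by `shr al, 7`), any appears to reduce the mask with a signed minimum and then test the sign bit. A minimal sketch of that shape, using only public API. The name `anyViaMin` is hypothetical, and materializing the lanes with `replacing(with:where:)` is an assumption made for illustration, not the standard library's actual source:

```swift
// Hypothetical helper: mirrors the min()-then-sign-test shape seen in the assembly.
func anyViaMin(_ mask: SIMDMask<SIMD16<Int8>>) -> Bool {
    // Materialize the mask as lanes: -1 (all bits set) where true, 0 where false.
    let lanes = SIMD16<Int8>(repeating: 0).replacing(with: -1, where: mask)
    // If any lane is true, the signed minimum is -1, so the result is negative.
    return lanes.min() < 0
}
```

The horizontal min() reduction is what expands into the long shuffle/compare/blend sequence on plain SSE2.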

Even if I enable all the advanced modern instruction sets I can think of (-O -Xcc -msse -Xcc -msse2 -Xcc -mavx -Xcc -mavx2), the code generated for the any function still feels suboptimal:

.LCPI5_0:
        .zero   16,128
generic specialization <Swift.SIMD16<Swift.Int8>> of (extension in Swift):Swift.SIMD< where A.Scalar: Swift.Comparable>.min() -> A.Scalar:
        vpxor   xmm0, xmm0, xmmword ptr [rip + .LCPI5_0]
        vpsrlw  xmm1, xmm0, 8
        vpminub xmm0, xmm0, xmm1
        vphminposuw     xmm0, xmm0
        vmovd   eax, xmm0
        add     al, -128
        ret

Reproduction

See above.

Also Godbolt

Expected behavior

Intuitively, I would expect any(SIMDMask<SIMD16<Int8>>) to compile down to far fewer instructions than it does. At the very least, it seems it could be implemented using two 64-bit comparisons to zero, which I have to believe is more efficient than the code we're generating today.
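As a rough sketch of the "two 64-bit comparisons" idea, the 16-byte mask can be reinterpreted as two 64-bit words and checked for any set bit. The name `anyViaTwoWords` is hypothetical, and both the `replacing(with:where:)` step and the `unsafeBitCast` are assumptions made for illustration. Whether this actually produces better code than the current output would need to be checked on Godbolt:

```swift
// Hypothetical helper: treats the 16-byte mask as two UInt64 words.
func anyViaTwoWords(_ mask: SIMDMask<SIMD16<Int8>>) -> Bool {
    // Materialize the mask as lanes: -1 (all bits set) where true, 0 where false.
    let lanes = SIMD16<Int8>(repeating: 0).replacing(with: -1, where: mask)
    // SIMD16<Int8> and SIMD2<UInt64> are both 16 bytes, so the bit cast is size-compatible.
    let words = unsafeBitCast(lanes, to: SIMD2<UInt64>.self)
    // Any true lane leaves at least one bit set in one of the two words.
    return (words[0] | words[1]) != 0
}
```

Usage mirrors the original test function, e.g. `anyViaTwoWords(input .== SIMD16(repeating: 0x42))` for an `input` of type `SIMD16<UInt8>`.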

Environment

Swift version 6.0-dev (LLVM d1625da873daa4c, Swift bae6450)
Target: x86_64-unknown-linux-gnu

Additional information

No response

karwa added the bug and triage needed labels Mar 19, 2024

stephentyrone commented

These optimizations all happen at the LLVM level; there's some work that we can maybe do in Swift so that they're not needed, however.

hborla added the SILOptimizer and simd labels and removed the triage needed label Apr 27, 2024