Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jit64: Avoid unnecessary MOVAPS instructions #8992

Merged
merged 4 commits into from Aug 8, 2020

Conversation

Sintendo
Copy link
Member

@Sintendo Sintendo commented Jul 29, 2020

We can avoid MOVAPS in a few places by taking advantage of AVX or by realizing when we can write the result directly into the target register.

I've only been able to find one game that uses ps_sel at all (Mario Kart Wii). If anyone knows of any others, please let me know.


fsel with AVX

Before:

66 0F 57 C0          xorpd       xmm0,xmm0
F2 41 0F C2 C6 06    cmpnlesd    xmm0,xmm14
41 0F 28 CE          movaps      xmm1,xmm14
66 41 0F 38 15 CA    blendvpd    xmm1,xmm10,xmm0
F2 44 0F 10 F1       movsd       xmm14,xmm1

After:

66 0F 57 C0          xorpd       xmm0,xmm0
F2 41 0F C2 C6 06    cmpnlesd    xmm0,xmm14
C4 C3 09 4B CA 00    vblendvpd   xmm1,xmm14,xmm10,xmm0
F2 44 0F 10 F1       movsd       xmm14,xmm1

ps_sel with AVX
Before:

66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
C4 C3 09 4B CC 00    vblendvpd   xmm1,xmm14,xmm12,xmm0
44 0F 28 F1          movaps      xmm14,xmm1

After:

66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
C4 43 09 4B F4 00    vblendvpd   xmm14,xmm14,xmm12,xmm0

ps_sel with SSE4.1
Before:

66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
41 0F 28 CE          movaps      xmm1,xmm14
66 41 0F 38 15 CC    blendvpd    xmm1,xmm12,xmm0
44 0F 28 F1          movaps      xmm14,xmm1

After:

66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
66 45 0F 38 15 F4    blendvpd    xmm14,xmm12,xmm0

ConvertDoubleToSingle with AVX
Before:

0F 28 C8                movaps      xmm1,xmm0
66 0F DB 0D CF 2C 00 00 pand        xmm1,xmmword ptr [1F8D283B220h]

After:

C5 F9 DB 0D D2 2C 00 00 vpand       xmm1,xmm0,xmmword ptr [271835FB220h]

AVX has a four-operand VBLENDVPD instruction, which allows for the first
input and the destination to be different. By taking advantage of this,
we no longer need to copy one of the inputs around and we can just
reference it directly, provided it's already in a register (I have yet
to see this not be the case).

Before:
66 0F 57 C0          xorpd       xmm0,xmm0
F2 41 0F C2 C6 06    cmpnlesd    xmm0,xmm14
41 0F 28 CE          movaps      xmm1,xmm14
66 41 0F 38 15 CA    blendvpd    xmm1,xmm10,xmm0
F2 44 0F 10 F1       movsd       xmm14,xmm1

After:
66 0F 57 C0          xorpd       xmm0,xmm0
F2 41 0F C2 C6 06    cmpnlesd    xmm0,xmm14
C4 C3 09 4B CA 00    vblendvpd   xmm1,xmm14,xmm10,xmm0
F2 44 0F 10 F1       movsd       xmm14,xmm1
For the packed variant, we can skip the final MOVAPS and write the
result directly into the destination register.

Before:
66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
C4 C3 09 4B CC 00    vblendvpd   xmm1,xmm14,xmm12,xmm0
44 0F 28 F1          movaps      xmm14,xmm1

After:
66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
C4 43 09 4B F4 00    vblendvpd   xmm14,xmm14,xmm12,xmm0
Pretty much the same optimization we did for AVX, although slightly more
constrained because we're stuck with the two-operand instruction where
destination and source have to match.

We could also specialize the case where registers b, c, and d are all
distinct, but I decided against it since I couldn't find any game that
does this.

Before:
66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
41 0F 28 CE          movaps      xmm1,xmm14
66 41 0F 38 15 CC    blendvpd    xmm1,xmm12,xmm0
44 0F 28 F1          movaps      xmm14,xmm1

After:
66 0F 57 C0          xorpd       xmm0,xmm0
66 41 0F C2 C1 06    cmpnlepd    xmm0,xmm9
66 45 0F 38 15 F4    blendvpd    xmm14,xmm12,xmm0
@Tilka
Copy link
Member

Tilka commented Aug 1, 2020

Have you tried some GameCube games?

https://github.com/devkitPro/libogc/blob/8d525d412737b20e70ca931120d7cb4d0b15b529/libogc/gu.c#L951

Edit: Oh, I guess GEKKO is also defined on Wii.

Using AVX we can eliminate another MOVAPS instruction here.

Before:
0F 28 C8                movaps      xmm1,xmm0
66 0F DB 0D CF 2C 00 00 pand        xmm1,xmmword ptr [1F8D283B220h]

After:
C5 F9 DB 0D D2 2C 00 00 vpand       xmm1,xmm0,xmmword ptr [271835FB220h]
@Sintendo Sintendo changed the title Jit64: fselx optimizations Jit64: Avoid unnecessary MOVAPS instructions Aug 2, 2020
@Tilka Tilka merged commit 6d0bc03 into dolphin-emu:master Aug 8, 2020
10 checks passed
@Sintendo Sintendo deleted the fselx-avx branch September 28, 2020 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants