Jit64: More addx and subfx optimizations #8755

Merged (7 commits) on Apr 24, 2020

Sintendo

Similar to #8551. We start by deduplicating some of the existing logic so the optimizations don't have to be repeated. We then add special handling for constant 0 in several places, carefully swap operands around if it can get us shorter instructions, and use LEA for subfx too.


addx - Emit nothing when possible

Before:

83 C7 00             add         edi,0

After:

(nothing emitted)

addx - Emit MOV when possible

Before:

41 8D 7D 00          lea         edi,[r13]

After:

41 8B FD             mov         edi,r13d

addx - Prefer smaller MOV+ADD sequence

Before:

41 BE 40 00 00 00    mov         r14d,40h
44 03 75 A8          add         r14d,dword ptr [rbp-58h]

After:

44 8B 75 A8          mov         r14d,dword ptr [rbp-58h]
41 83 C6 40          add         r14d,40h

Before:

44 8B 7D F8          mov         r15d,dword ptr [rbp-8]
41 81 C7 00 68 00 CC add         r15d,0CC006800h

After:

41 BF 00 68 00 CC    mov         r15d,0CC006800h
44 03 7D F8          add         r15d,dword ptr [rbp-8]

addx - Skip ADD after MOV when possible

Before:

8B 7D F8             mov         edi,dword ptr [rbp-8]
83 C7 00             add         edi,0

After:

8B 7D F8             mov         edi,dword ptr [rbp-8]

subfx - Use LEA when possible

Before:

45 8B EE             mov         r13d,r14d
41 83 ED 08          sub         r13d,8

After:

45 8D 6E F8          lea         r13d,[r14-8]

No functional change, just simplify some repeated logic for the cases
where the destination register matches one of the sources.

No functional change, just simplify some repeated logic in the case
where we're dealing with exactly one immediate and one simple register
when overflow isn't needed.

When the destination register matches a source register, the other
source register contains zero, and overflow isn't needed, the
instruction becomes a nop and we don't need to emit anything.

We could add specialized handling for the case where overflow is needed,
but none of the titles I tried would hit this path.

Before:
83 C7 00             add         edi,0

After:
(nothing emitted)

When the source registers are a simple register and a constant zero and
overflow isn't needed, emitting LEA is kinda silly.

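This will occasionally save a single byte for certain registers: x86 has
no displacement-free addressing form for base registers such as rbp and
r13, so LEA has to encode a zero 8-bit displacement. More importantly,
LEA takes up execution resources, while a register-to-register MOV can
often be eliminated at register rename on recent microarchitectures.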

Before:
41 8D 7D 00          lea         edi,[r13]

After:
41 8B FD             mov         edi,r13d
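
The byte counts in the listing above can be reproduced with a small length calculator. This is only a sketch under stated assumptions: 32-bit operands, standard x86-64 register numbering (0 = eax through 7 = edi, 8 to 15 = r8d to r15d), and hypothetical function names that are not part of Dolphin's emitter.

```cpp
#include <cassert>

// A REX prefix byte is needed as soon as either register is r8d..r15d.
int RexBytes(int reg_a, int reg_b)
{
  return (reg_a >= 8 || reg_b >= 8) ? 1 : 0;
}

// mov r32, r32 -> [REX] 8B /r
int MovRegRegLength(int dst, int src)
{
  assert(dst >= 0 && dst < 16 && src >= 0 && src < 16);
  return RexBytes(dst, src) + 2;
}

// lea r32, [base] -> [REX] 8D /r, plus one mandatory extra byte for some bases.
int LeaRegBaseLength(int dst, int base)
{
  assert(dst >= 0 && dst < 16 && base >= 0 && base < 16);
  int length = RexBytes(dst, base) + 2;
  // rsp/r12 always need a SIB byte; rbp/r13 have no displacement-free form,
  // so a zero disp8 is encoded (the trailing 00 in "41 8D 7D 00").
  if ((base & 7) == 4 || (base & 7) == 5)
    length += 1;
  return length;
}
```

With edi (7) as destination and r13 (13) as source, `LeaRegBaseLength(7, 13)` gives 4 and `MovRegRegLength(7, 13)` gives 3, matching the two encodings shown. For most other register pairs the two lengths are equal, which is why the saving is only occasional.
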

ADD has a smaller encoding for immediates that can be expressed as an
8-bit signed integer (in other words, between -128 and 127). MOV lacks
this compact representation and always encodes the full 32-bit
immediate.

Since addition is commutative, we can swap the source operands and
always get the shortest sequence here by checking whether we're dealing
with a small immediate first. If we are, move the other source into the
destination and add the small immediate onto that. For large immediates
the reverse is preferable.

Before:
41 BE 40 00 00 00    mov         r14d,40h
44 03 75 A8          add         r14d,dword ptr [rbp-58h]

After:
44 8B 75 A8          mov         r14d,dword ptr [rbp-58h]
41 83 C6 40          add         r14d,40h

Before:
44 8B 7D F8          mov         r15d,dword ptr [rbp-8]
41 81 C7 00 68 00 CC add         r15d,0CC006800h

After:
41 BF 00 68 00 CC    mov         r15d,0CC006800h
44 03 7D F8          add         r15d,dword ptr [rbp-8]
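
The ordering rule can be sketched as follows. This is an illustration only, not Dolphin's code: the names are hypothetical, and the byte counts assume the operand shapes from the listings above (a destination that needs a REX prefix and a `[rbp+disp8]` memory source).

```cpp
#include <cassert>
#include <cstdint>

// True when imm fits the sign-extended 8-bit immediate form of ADD (83 /0 ib).
bool FitsInS8(std::int32_t imm)
{
  return imm >= -128 && imm <= 127;
}

enum class AddOrder
{
  MovOtherThenAddImm,  // mov d, other ; add d, imm
  MovImmThenAddOther,  // mov d, imm   ; add d, other
};

// Small immediates go into the ADD, large ones into the MOV.
AddOrder PickAddOrder(std::int32_t imm)
{
  return FitsInS8(imm) ? AddOrder::MovOtherThenAddImm : AddOrder::MovImmThenAddOther;
}

// Total encoded length of a sequence under the assumptions stated above.
int SequenceLength(AddOrder order, std::int32_t imm)
{
  const int mem_op = 4;                       // REX + opcode + ModRM + disp8
  const int mov_imm32 = 6;                    // REX + (B8+r) + imm32
  const int add_imm = FitsInS8(imm) ? 4 : 7;  // REX + 83 /0 ib, or REX + 81 /0 id
  return order == AddOrder::MovOtherThenAddImm ? mem_op + add_imm : mov_imm32 + mem_op;
}

// Usage: the two cases from the listings.
void CheckListings()
{
  const std::int32_t small = 0x40;
  const std::int32_t large = static_cast<std::int32_t>(0xCC006800u);
  assert(SequenceLength(PickAddOrder(small), small) == 8);   // was 10 bytes
  assert(SequenceLength(PickAddOrder(large), large) == 10);  // was 11 bytes
}
```

In both cases the chosen order is never longer than the alternative, because the only variable-length part is the immediate attached to ADD.
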

We can get away with skipping the addition when we know we're dealing
with a constant zero. Just a MOV will suffice in this case.

Once again, we don't bother to add separate handling for when overflow
is needed, because no titles would ever hit that path during my testing.

Before:
8B 7D F8             mov         edi,dword ptr [rbp-8]
83 C7 00             add         edi,0

After:
8B 7D F8             mov         edi,dword ptr [rbp-8]
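
Taken together, the constant-zero cases described in these commits amount to a small decision table. The sketch below is a hypothetical model, not the JIT's actual control flow: the enum and function names are invented for illustration, "overflow needed" is assumed to mean the instruction's OE bit is set, and record-form (Rc) handling is ignored.

```cpp
#include <cassert>

enum class ZeroAddPlan
{
  EmitNothing,  // d already holds the result
  MovOnly,      // copy the non-zero source into d
  GenericAdd,   // no special case; use the regular ADD path
};

// Plan for d = src + 0, where the zero is a known constant.
// dest_is_src: the destination register is the same as the non-zero source.
// overflow_needed: the overflow flag must be computed (assumed: OE is set).
ZeroAddPlan PlanAddWithZero(bool dest_is_src, bool overflow_needed)
{
  if (overflow_needed)
    return ZeroAddPlan::GenericAdd;  // deliberately not specialized
  return dest_is_src ? ZeroAddPlan::EmitNothing : ZeroAddPlan::MovOnly;
}

// Usage: the cases covered by the commit messages.
void CheckZeroCases()
{
  assert(PlanAddWithZero(true, false) == ZeroAddPlan::EmitNothing);
  assert(PlanAddWithZero(false, false) == ZeroAddPlan::MovOnly);
  assert(PlanAddWithZero(true, true) == ZeroAddPlan::GenericAdd);
}
```
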

Similar to what we do for addx. Since we're calculating b - a and
because subtraction is not commutative, we can only apply this when
source register a holds the constant.

Before:
45 8B EE             mov         r13d,r14d
41 83 ED 08          sub         r13d,8

After:
45 8D 6E F8          lea         r13d,[r14-8]
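
The restriction to a constant `a` follows from what LEA can express: `b - a` with a constant `a` is `b + (-a)`, a base register plus a displacement, whereas a constant `b` would need a negated register, which no addressing mode provides. A minimal sketch of that arithmetic, with hypothetical helper names and 32-bit wrapping semantics assumed:

```cpp
#include <cassert>
#include <cstdint>

// Displacement to use in "lea d, [b + disp]" so that it computes b - a_const.
// The unsigned-to-signed conversion wraps modulo 2^32 on all mainstream
// compilers (and is guaranteed to since C++20).
std::int32_t LeaDisplacementForSubf(std::uint32_t a_const)
{
  return static_cast<std::int32_t>(0u - a_const);
}

// Reference result of subfx: d = b - a in 32-bit wrapping arithmetic.
std::uint32_t Subf(std::uint32_t a, std::uint32_t b)
{
  return b - a;
}

// What "lea r32, [base + disp]" produces for a 32-bit destination.
std::uint32_t Lea32(std::uint32_t base, std::int32_t disp)
{
  return base + static_cast<std::uint32_t>(disp);
}
```

For the listing above, `LeaDisplacementForSubf(8)` is -8, which is the `F8` displacement byte in `lea r13d,[r14-8]`.
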

@degasus degasus left a comment

Looks very good. However, I wonder how often there is an immediate value of zero. Why would a compiler emit this kind of code?

@BhaaLseN

Const-propagation perhaps? The code seems fine, but I can't really think of any reasons why they would use an immediate value of zero either.

@degasus degasus merged commit 703f7d4 into dolphin-emu:master Apr 24, 2020
@Sintendo Sintendo deleted the jit64intopts branch July 18, 2020 05:59
Sintendo added a commit to Sintendo/dolphin that referenced this pull request Oct 25, 2020
On x86 many common instructions, including OR, have a smaller encoding
for immediates that can be expressed as an 8-bit signed integer. MOV
lacks this compact representation and always encodes the full 32-bit
immediate.

Because bitwise OR is commutative, we can always get the shortest
sequence here by carefully checking if we're dealing with a small
immediate first and reordering the operands if necessary.

This optimization was previously applied to addx in dolphin-emu#8755.

What follows is an example from Doshin the Giant.

Before:
41 BF 02 00 00 00    mov         r15d,2
45 0B FD             or          r15d,r13d

After:
45 8B FD             mov         r15d,r13d
41 83 CF 02          or          r15d,2
Sintendo added a commit to Sintendo/dolphin that referenced this pull request Jan 22, 2021
This doesn't really add any new optimizations, but fixes an issue that
prevented the optimizations introduced in dolphin-emu#8551 and dolphin-emu#8755 from being
applied in specific cases. A similar issue was solved for subfx as part
of dolphin-emu#9425.

Consider the case where the destination register is also an input
register and happens to hold an immediate value. This results in a set
of constraints that forces the RegCache to allocate a register and move
the immediate value into it for us. By the time we check for immediate
values in the JIT, we're too late.

We solve this by refactoring the code in such a way that we can check
for immediates before involving the RegCache.

- Example 1
Before:
41 BF 00 68 00 CC    mov         r15d,0CC006800h
44 03 FF             add         r15d,edi

After:
44 8D BF 00 68 00 CC lea         r15d,[rdi-33FF9800h]

- Example 2
Before:
41 BE 00 00 00 00    mov         r14d,0
44 03 F7             add         r14d,edi

After:
44 8B F7             mov         r14d,edi

- Example 3
Before:
41 BD 03 00 00 00    mov         r13d,3
44 03 6D 8C          add         r13d,dword ptr [rbp-74h]

After:
44 8B 6D 8C          mov         r13d,dword ptr [rbp-74h]
41 83 C5 03          add         r13d,3
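
In Example 1 the disassembler prints the immediate 0CC006800h as the negative displacement -33FF9800h, because LEA's 32-bit displacement is signed. Both spellings name the same bit pattern, and with a 32-bit destination the result wraps identically. A small sketch of that equivalence, using hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Reinterpret a 32-bit immediate as the signed disp32 that LEA encodes.
// The conversion wraps modulo 2^32 on all mainstream compilers.
std::int32_t AsDisplacement(std::uint32_t imm)
{
  return static_cast<std::int32_t>(imm);
}

// Result of "lea r32, [base + disp32]" for a 32-bit destination.
std::uint32_t Lea32(std::uint32_t base, std::int32_t disp)
{
  return base + static_cast<std::uint32_t>(disp);
}

// Result of "mov d, imm ; add d, base" for comparison.
std::uint32_t MovThenAdd(std::uint32_t imm, std::uint32_t base)
{
  return imm + base;
}
```
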
@Sintendo Sintendo mentioned this pull request Jan 22, 2021
Sintendo added a commit to Sintendo/dolphin that referenced this pull request Jan 26, 2021