Jit64: More addx and subfx optimizations #8755
Merged
+40
−10
No functional change, just simplify some repeated logic for the cases where the destination register matches one of the sources.
No functional change, just simplify some repeated logic in the case where we're dealing with exactly one immediate and one simple register when overflow isn't needed.
When the destination register matches a source register, the other source register contains zero, and overflow isn't needed, the instruction becomes a nop and we don't need to emit anything. We could add specialized handling for the case where overflow is needed, but none of the titles I tried would hit this path.
Before:
    83 C7 00             add         edi,0
After:
    (nothing is emitted)
When the source registers are a simple register and a constant zero and overflow isn't needed, emitting LEA is kinda silly. Using MOV instead will occasionally save a single byte for certain registers due to how x86 encoding works. More importantly, LEA takes up execution resources, while a register-to-register MOV often does not (many recent CPUs eliminate it at register rename).
Before:
    41 8D 7D 00          lea         edi,[r13]
After:
    41 8B FD             mov         edi,r13d
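The one-byte saving comes from x86 ModRM rules: a base register of rbp or r13 cannot be encoded without a displacement, and rsp or r12 needs a SIB byte. The following is a rough size model, not Dolphin's emitter; the function names and the register numbering constants are made up for illustration, and it assumes 32-bit operand size with a bare `[base]` address.

```python
def needs_rex(*regs):
    """A REX prefix is required when any register is r8..r15 (number >= 8)."""
    return any(r >= 8 for r in regs)


def mov_reg_reg_size(dest, src):
    """Size in bytes of `mov r32, r32`: opcode + ModRM, plus optional REX."""
    return 2 + (1 if needs_rex(dest, src) else 0)


def lea_reg_base_size(dest, base):
    """Size in bytes of `lea r32, [base]` with no index and zero offset."""
    size = 2 + (1 if needs_rex(dest, base) else 0)  # opcode + ModRM (+ REX)
    low = base & 7
    if low == 5:
        # rbp / r13: mod=00 with r/m=101 means "disp32, no base",
        # so the assembler has to emit an explicit disp8 of 0.
        size += 1
    elif low == 4:
        # rsp / r12: r/m=100 signals that a SIB byte follows.
        size += 1
    return size


# Hypothetical register numbers following the x86-64 encoding order.
EDI, R13, R14 = 7, 13, 14

print(lea_reg_base_size(EDI, R13), mov_reg_reg_size(EDI, R13))
print(lea_reg_base_size(EDI, R14), mov_reg_reg_size(EDI, R14))
```

Under this model, `lea edi,[r13]` is 4 bytes and `mov edi,r13d` is 3 bytes, matching the disassembly in the commit message, while for r14 both forms come out at 3 bytes.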
ADD has a smaller encoding for immediates that can be expressed as an 8-bit signed integer (in other words, between -128 and 127). MOV lacks this compact representation. Since addition allows us to swap the source registers, we can always get the shortest sequence here by carefully checking if we're dealing with a small immediate first. If we are, move the other source into the destination and add the small immediate onto that. For large immediates the reverse is preferable.
Before:
    41 BE 40 00 00 00    mov         r14d,40h
    44 03 75 A8          add         r14d,dword ptr [rbp-58h]
After:
    44 8B 75 A8          mov         r14d,dword ptr [rbp-58h]
    41 83 C6 40          add         r14d,40h
Before:
    44 8B 7D F8          mov         r15d,dword ptr [rbp-8]
    41 81 C7 00 68 00 CC add         r15d,0CC006800h
After:
    41 BF 00 68 00 CC    mov         r15d,0CC006800h
    44 03 7D F8          add         r15d,dword ptr [rbp-8]
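The ordering decision can be sketched as a comparison of total byte counts. This is only an illustration, not the JIT's code: the byte counts are taken from the disassembly above and assume an extended destination register (REX prefix) and an `[rbp+disp8]` memory source, and the names `fits_imm8` and `plan_add` are hypothetical.

```python
def as_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def fits_imm8(value):
    """True when the immediate can use the sign-extended 8-bit form."""
    return -128 <= as_signed32(value) <= 127


# Instruction sizes in bytes, read off the examples in the commit message.
MOV_REG_IMM32 = 6  # 41 BF id
MOV_REG_MEM = 4    # 44 8B /r disp8
ADD_REG_MEM = 4    # 44 03 /r disp8
ADD_REG_IMM8 = 4   # 41 83 /0 ib
ADD_REG_IMM32 = 7  # 41 81 /0 id


def plan_add(imm):
    """Pick the shorter of the two equivalent MOV+ADD orderings."""
    add_imm = ADD_REG_IMM8 if fits_imm8(imm) else ADD_REG_IMM32
    load_then_add_imm = MOV_REG_MEM + add_imm
    mov_imm_then_add = MOV_REG_IMM32 + ADD_REG_MEM
    if load_then_add_imm <= mov_imm_then_add:
        return ("load_then_add_imm", load_then_add_imm)
    return ("mov_imm_then_add", mov_imm_then_add)


print(plan_add(0x40))
print(plan_add(0xCC006800))
```

For `0x40` the model prefers loading the other source first (8 bytes instead of 10); for `0xCC006800` it prefers materializing the immediate first (10 bytes instead of 11), in line with the two Before/After pairs.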
We can get away with skipping the addition when we know we're dealing with a constant zero. Just a MOV will suffice in this case. Once again, we don't bother to add separate handling for when overflow is needed, because no titles would ever hit that path during my testing.
Before:
    8B 7D F8             mov         edi,dword ptr [rbp-8]
    83 C7 00             add         edi,0
After:
    8B 7D F8             mov         edi,dword ptr [rbp-8]
Use LEA for subfx, similar to what we do for addx. Since we're calculating b - a and because subtraction is not commutative, we can only apply this when source register a holds the constant.
Before:
    45 8B EE             mov         r13d,r14d
    41 83 ED 08          sub         r13d,8
After:
    45 8D 6E F8          lea         r13d,[r14-8]
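The reason only a constant in a works: b - a can be rewritten as b + (-a), and a negated constant fits in LEA's signed displacement, whereas a constant b would require negating a register. Here is a small sketch of that arithmetic with 32-bit wraparound; the helper names are hypothetical and this is not the JIT's code.

```python
MASK32 = 0xFFFFFFFF


def as_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def subf_lea_displacement(a_imm):
    """Displacement for `lea d,[b+disp]` that computes b - a_imm."""
    return as_signed32(-a_imm)


def emulate_lea32(base_value, disp):
    """Result of a 32-bit LEA: base plus displacement, truncated."""
    return (base_value + disp) & MASK32


def emulate_subf(a_value, b_value):
    """PowerPC subf semantics without carry or overflow: b - a."""
    return (b_value - a_value) & MASK32


disp = subf_lea_displacement(8)
print(disp)
print(hex(emulate_lea32(0x10, disp)), hex(emulate_subf(8, 0x10)))
```

With a equal to 8, the displacement is -8, which corresponds to the `F8` byte in `lea r13d,[r14-8]`.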
degasus approved these changes Apr 22, 2020
Looks very good. However, I wonder how often there is an immediate value of zero. Why would a compiler emit this kind of code?
Const-propagation perhaps? The code seems fine, but I can't really think of any reasons why they would use an immediate value of zero either.
Sintendo added a commit to Sintendo/dolphin that referenced this pull request Oct 25, 2020
On x86 many common instructions, including OR, have a smaller encoding for immediates that can be expressed as an 8-bit signed integer. MOV lacks this compact representation and always encodes the full 32-bit immediate. Because bitwise OR is commutative, we can always get the shortest sequence here by carefully checking if we're dealing with a small immediate first and reordering the operands if necessary. This optimization was previously applied to addx in dolphin-emu#8755. What follows is an example from Doshin the Giant.
Before:
    41 BF 02 00 00 00    mov         r15d,2
    45 0B FD             or          r15d,r13d
After:
    45 8B FD             mov         r15d,r13d
    41 83 CF 02          or          r15d,2
Sintendo added a commit to Sintendo/dolphin that referenced this pull request Jan 22, 2021
This doesn't really add any new optimizations, but fixes an issue that prevented the optimizations introduced in dolphin-emu#8551 and dolphin-emu#8755 from being applied in specific cases. A similar issue was solved for subfx as part of dolphin-emu#9425.
Consider the case where the destination register is also an input register and happens to hold an immediate value. This results in a set of constraints that forces the RegCache to allocate a register and move the immediate value into it for us. By the time we check for immediate values in the JIT, we're too late. We solve this by refactoring the code in such a way that we can check for immediates before involving the RegCache.
- Example 1
Before:
    41 BF 00 68 00 CC    mov         r15d,0CC006800h
    44 03 FF             add         r15d,edi
After:
    44 8D BF 00 68 00 CC lea         r15d,[rdi-33FF9800h]
- Example 2
Before:
    41 BE 00 00 00 00    mov         r14d,0
    44 03 F7             add         r14d,edi
After:
    44 8B F7             mov         r14d,edi
- Example 3
Before:
    41 BD 03 00 00 00    mov         r13d,3
    44 03 6D 8C          add         r13d,dword ptr [rbp-74h]
After:
    44 8B 6D 8C          mov         r13d,dword ptr [rbp-74h]
    41 83 C5 03          add         r13d,3
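The idea of inspecting known constants before any host register is bound can be illustrated with a toy instruction selector. Everything here is an assumption made for illustration: guest registers are plain integers, `known` is a hypothetical map from guest register to constant value, the output is pseudo-assembly over guest register names, and the distinction between register and memory operands is ignored. It is not the RegCache API or the actual Jit64 code.

```python
def as_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def select_addx(d, a, b, known):
    """Choose instructions for d = a + b (no overflow flag needed).

    Constants are looked up first, so the case where d aliases the
    constant source is handled before anything is "allocated".
    """
    a_imm, b_imm = known.get(a), known.get(b)
    if a_imm is not None and b_imm is not None:
        # Both inputs known: fold the addition into a single constant.
        return [f"mov r{d},{(a_imm + b_imm) & 0xFFFFFFFF:#x}"]
    if a_imm is not None or b_imm is not None:
        imm, reg = (a_imm, b) if a_imm is not None else (b_imm, a)
        if imm & 0xFFFFFFFF == 0:
            # Adding zero: nothing to emit, or a plain move.
            return [] if reg == d else [f"mov r{d},r{reg}"]
        if reg == d:
            return [f"add r{d},{as_signed32(imm):#x}"]
        return [f"lea r{d},[r{reg}{as_signed32(imm):+#x}]"]
    if d == a:
        return [f"add r{d},r{b}"]
    if d == b:
        return [f"add r{d},r{a}"]
    return [f"lea r{d},[r{a}+r{b}]"]


# Destination is also the input that holds the constant (like Example 1).
print(select_addx(3, 3, 4, {3: 0xCC006800}))
# Same aliasing with a constant zero (like Example 2).
print(select_addx(3, 3, 4, {3: 0}))
```

Because the constant lookup happens before any register is chosen, the aliased destination no longer forces a `mov` of the immediate followed by an `add`.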
Sintendo added a commit to Sintendo/dolphin that referenced this pull request Jan 26, 2021
Similar to #8551. We start by deduplicating some of the existing logic so the optimizations don't have to be repeated. We then add special handling for constant 0 in several places, carefully swap operands around if it can get us shorter instructions, and use LEA for subfx too.
addx - Emit nothing when possible
addx - Emit MOV when possible
addx - Prefer smaller MOV+ADD sequence
addx - Skip ADD after MOV when possible
subfx - Use LEA when possible