Jit64: More constant propagation optimizations #9262
Added two more optimizations, for slwx and srawx. Similar optimizations are possible for srwx and rlwnmx, but they didn't occur during my tests.
Well spotted! The logic for these specific optimizations was actually taken from JitArm64, which has the same problem. I raised a separate PR (#9346) to fix it there as well.
LGTM. If you want to get rid of the FifoCI diff, you can do so by rebasing on a newer version of master, but this shouldn't affect the mergeability of your PR.
Occurs surprisingly often. Prevents generating silly code like this:

BE 03 00 00 00  mov esi,3
83 EE 08        sub esi,8
0F 93 45 58     setae byte ptr [rbp+58h]
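To make the folding concrete, here is a minimal reference sketch in Python. It assumes the listing above is a subfic whose immediate is 3 and whose source register is known to hold 8; the function name `subfic` and its parameters are illustrative and are not taken from the Dolphin source.

```python
MASK32 = 0xFFFFFFFF

def subfic(ra, simm):
    """Model of PowerPC subfic: returns (rD, CA) for rD = SIMM - rA."""
    # The architecture defines the subtraction as ~rA + SIMM + 1 on 32-bit
    # values; the carry flag is the carry out of that addition.
    total = (~ra & MASK32) + (simm & MASK32) + 1
    return total & MASK32, total >> 32

# With both inputs known, the result and the carry are compile-time constants.
result, carry = subfic(ra=8, simm=3)
print(hex(result), carry)
```

With both values known up front, a JIT could emit the constant result and a single store of the carry byte instead of the three instructions above.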
More efficient code can be generated if the shift amount is known at compile time. Similar optimizations were present in JitArm64 already, but were missing in Jit64.

- By using an 8-bit immediate we can eliminate the need for ECX as a scratch register, thereby reducing register pressure and occasionally eliminating a spill.

  Before:
  B9 18 00 00 00  mov ecx,18h
  45 8B C1        mov r8d,r9d
  49 D3 E8        shr r8,cl

  After:
  45 8B C1        mov r8d,r9d
  41 C1 E8 18     shr r8d,18h

- PowerPC has strange shift amount masking behavior which is emulated using 64-bit shifts, even though we only care about a 32-bit result. If the shift amount is known, we can handle this special case separately, and use 32-bit shift instructions otherwise.

  Before:
  B9 F8 FF FF FF  mov ecx,0FFFFFFF8h
  45 8B C1        mov r8d,r9d
  49 D3 E8        shr r8,cl

  After:
  Nothing, register is set to constant zero.

- A shift by zero becomes a simple MOV.

  Before:
  B9 00 00 00 00  mov ecx,0
  45 8B C1        mov r8d,r9d
  49 D3 E8        shr r8,cl

  After:
  45 8B C1        mov r8d,r9d
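The three cases above can be sketched as a small Python model. This is only an illustration of the case split, not the Jit64 code: `srw_emulated`, `srw_const_plan` and `srw_const` are hypothetical names, and the plan strings are made up for readability.

```python
MASK32 = 0xFFFFFFFF

def srw_emulated(rs, rb):
    """Generic path: logical right shift by the low 6 bits of rb, as a 64-bit shift would do."""
    return ((rs & MASK32) >> (rb & 0x3F)) & MASK32

def srw_const_plan(shift):
    """Pick a code shape for a shift amount that is known at compile time."""
    amount = shift & 0x3F
    if amount >= 32:
        return "const_zero"          # every bit is shifted out
    if amount == 0:
        return "mov"                 # plain register copy
    return f"shr_imm8 {amount}"      # 32-bit shift with an 8-bit immediate

def srw_const(rs, shift):
    """Value produced by the specialised code shapes."""
    amount = shift & 0x3F
    if amount >= 32:
        return 0
    return (rs & MASK32) >> amount

# The shift amounts from the three listings above.
for shift in (0x18, 0xFFFFFFF8, 0):
    print(hex(shift), srw_const_plan(shift))
```

The point of the split is that only the middle case needs the 6-bit masking behavior; for amounts below 32 a 32-bit shift gives the same answer as the 64-bit emulation.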
More efficient code can be generated if the shift amount is known at compile time. Similar optimizations were present in JitArm64 already, but were missing in Jit64.

- By using an 8-bit immediate we can eliminate the need for ECX as a scratch register, thereby reducing register pressure and occasionally eliminating a spill.

  Before:
  B9 18 00 00 00  mov ecx,18h
  41 8B F7        mov esi,r15d
  48 D3 E6        shl rsi,cl
  8B F6           mov esi,esi

  After:
  41 8B CF        mov ecx,r15d
  C1 E1 18        shl ecx,18h

- PowerPC has strange shift amount masking behavior which is emulated using 64-bit shifts, even though we only care about a 32-bit result. If the shift amount is known, we can handle this special case separately, and use 32-bit shift instructions otherwise. We also no longer need to clear the upper 32 bits of the register.

  Before:
  BE F8 FF FF FF  mov esi,0FFFFFFF8h
  8B CE           mov ecx,esi
  41 8B F4        mov esi,r12d
  48 D3 E6        shl rsi,cl
  8B F6           mov esi,esi

  After:
  Nothing, register is set to constant zero.

- A shift by zero becomes a simple MOV.

  Before:
  BE 00 00 00 00  mov esi,0
  8B CE           mov ecx,esi
  41 8B F3        mov esi,r11d
  48 D3 E6        shl rsi,cl
  8B F6           mov esi,esi

  After:
  41 8B FB        mov edi,r11d
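Why the trailing `mov esi,esi` disappears can be checked with a short Python model. It is a sketch under the assumption that the 64-bit path is "shift wide, then truncate" and the constant path is a plain 32-bit shift; both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def slw_64bit_path(rs, rb):
    """Generic path: 64-bit shift left, then clear the upper 32 bits."""
    wide = ((rs & MASK32) << (rb & 0x3F)) & MASK64   # shl r64,cl
    return wide & MASK32                              # mov r32,r32

def slw_32bit_path(rs, shift):
    """Constant path: amounts of 32..63 fold to zero, the rest use a 32-bit shift."""
    amount = shift & 0x3F
    if amount >= 32:
        return 0
    # A 32-bit shift already discards everything above bit 31, so no
    # separate truncation step is needed afterwards.
    return ((rs & MASK32) << amount) & MASK32

print(hex(slw_64bit_path(0x12345678, 0x18)), hex(slw_32bit_path(0x12345678, 0x18)))
```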
More efficient code can be generated if the shift amount is known at compile time. We can once again take advantage of shifts with the shift amount in an 8-bit immediate to eliminate ECX as a scratch register, reducing register pressure and removing the occasional spill. We can also do 32-bit shifts instead of 64-bit operations.

We recognize four distinct cases:

- The special case where we're dealing with the PowerPC's quirky shift amount masking. If the shift amount is a number from 32 to 63, all bits are shifted out and the result is either all zeroes or all ones.

  Before:
  B9 F0 FF FF FF  mov ecx,0FFFFFFF0h
  8B F7           mov esi,edi
  48 C1 E6 20     shl rsi,20h
  48 D3 FE        sar rsi,cl
  8B C6           mov eax,esi
  48 C1 EE 20     shr rsi,20h
  85 F0           test eax,esi
  0F 95 45 58     setne byte ptr [rbp+58h]

  After:
  8B F7           mov esi,edi
  C1 FE 1F        sar esi,1Fh
  0F 95 45 58     setne byte ptr [rbp+58h]

- The shift amount is zero. No calculation needs to be done, just clear the carry flag.

  Before:
  B9 00 00 00 00  mov ecx,0
  49 C1 E5 20     shl r13,20h
  49 D3 FD        sar r13,cl
  41 8B C5        mov eax,r13d
  49 C1 ED 20     shr r13,20h
  44 85 E8        test eax,r13d
  0F 95 45 58     setne byte ptr [rbp+58h]

  After:
  C6 45 58 00     mov byte ptr [rbp+58h],0

- The carry flag doesn't need to be computed. Just do the arithmetic shift.

  Before:
  B9 02 00 00 00  mov ecx,2
  48 C1 E7 20     shl rdi,20h
  48 D3 FF        sar rdi,cl
  48 C1 EF 20     shr rdi,20h

  After:
  C1 FF 02        sar edi,2

- The carry flag must be computed. In addition to the arithmetic shift, we do a shift to the left and AND the two results together to know if any ones were shifted out. It's still better than before, because we can do 32-bit shifts.

  Before:
  B9 02 00 00 00  mov ecx,2
  49 C1 E5 20     shl r13,20h
  49 D3 FD        sar r13,cl
  41 8B C5        mov eax,r13d
  49 C1 ED 20     shr r13,20h
  44 85 E8        test eax,r13d
  0F 95 45 58     setne byte ptr [rbp+58h]

  After:
  41 8B C5        mov eax,r13d
  41 C1 FD 02     sar r13d,2
  C1 E0 1E        shl eax,1Eh
  44 85 E8        test eax,r13d
  0F 95 45 58     setne byte ptr [rbp+58h]
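The four cases, and in particular the shift-left-and-AND carry test, can be checked against a straightforward model of srawx. The following Python is a sketch only: `sraw_reference` encodes the usual definition (carry is set when the source is negative and at least one 1 bit is shifted out), `sraw_const` mirrors the instruction sequences shown above, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_signed(value):
    """Interpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def sraw_reference(rs, rb):
    """Plain model: returns (result, carry)."""
    rs &= MASK32
    amount = rb & 0x3F
    signed = to_signed(rs)
    if amount >= 32:
        return (MASK32 if signed < 0 else 0), int(signed < 0)
    shifted_out = rs & ((1 << amount) - 1)
    return (signed >> amount) & MASK32, int(signed < 0 and shifted_out != 0)

def sraw_const(rs, shift, need_carry=True):
    """Constant-shift code shapes; carry is None when it is not requested."""
    rs &= MASK32
    amount = shift & 0x3F
    signed = to_signed(rs)
    if amount >= 32:
        result = (signed >> 31) & MASK32          # sar reg,1Fh
        return result, int(result != 0)           # setne
    if amount == 0:
        return rs, 0                              # no shift, carry cleared
    result = (signed >> amount) & MASK32          # sar reg,imm8
    if not need_carry:
        return result, None
    spilled = (rs << (32 - amount)) & MASK32      # shl copy,32-amount
    # The top `amount` bits of result are copies of the sign bit, and
    # `spilled` holds exactly the shifted-out bits in those positions.
    return result, int((spilled & result) != 0)   # test + setne

print(sraw_const(0xFFFFFFFD, 2), sraw_reference(0xFFFFFFFD, 2))
```

The zero-shift case has to be separate because the carry test would otherwise need a left shift by 32, which a 32-bit x86 shift cannot express.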
If both input registers hold known values at compile time, we can just calculate the result on the spot. Code has mostly been copied from JitArm64 where it had already been implemented.

Before:
BF FF FF FF FF  mov edi,0FFFFFFFFh
8B C7           mov eax,edi
C1 FF 10        sar edi,10h
C1 E0 10        shl eax,10h
85 F8           test eax,edi
0F 95 45 58     setne byte ptr [rbp+58h]

After:
C6 45 58 01     mov byte ptr [rbp+58h],1
Much like we did for srawx. This was already implemented on JitArm64.

Before:
B8 00 00 00 00  mov eax,0
8B F0           mov esi,eax
C1 E8 1F        shr eax,1Fh
23 C6           and eax,esi
D1 FE           sar esi,1
88 45 58        mov byte ptr [rbp+58h],al

After:
C6 45 58 00     mov byte ptr [rbp+58h],0
Only removes the scratch register and a MOV, but hey.

Before:
B9 02 00 00 00  mov ecx,2
41 8B F5        mov esi,r13d
D3 C6           rol esi,cl
83 E6 01        and esi,1

After:
41 8B F5        mov esi,r13d
C1 C6 02        rol esi,2
83 E6 01        and esi,1
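For readers less familiar with rlwnmx, a rough Python model of rotate-then-mask is given below. It assumes the listing above corresponds to a rotate by 2 with mask bounds MB = ME = 31 (which would yield the `and esi,1`); the helper names are hypothetical and the bit numbering follows the PowerPC convention where bit 0 is the most significant bit.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, amount):
    """Rotate a 32-bit value left by amount (taken modulo 32)."""
    amount &= 31
    value &= MASK32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def ppc_mask(mb, me):
    """Mask with ones from bit mb through bit me, wrapping around if mb > me."""
    begin = MASK32 >> mb
    end = (MASK32 << (31 - me)) & MASK32
    return (begin & end) if mb <= me else (begin | end)

def rlwnm(rs, rb, mb, me):
    """Rotate rs left by the low 5 bits of rb, then AND with the mask."""
    return rotl32(rs, rb & 0x1F) & ppc_mask(mb, me)

print(hex(ppc_mask(31, 31)), rlwnm(0x40000000, 2, 31, 31))
```

Because only the low 5 bits of the rotate amount matter, a known amount always fits in an 8-bit immediate, which is what lets the constant form drop ECX.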
Shifting zero by any amount always gives zero.

Before:
41 BF 00 00 00 00  mov r15d,0
8B CF              mov ecx,edi
49 D3 E7           shl r15,cl
45 8B FF           mov r15d,r15d

After:
Nothing, register is set to constant zero.

All games I've tried hit this optimization on launch. In Soul Calibur II it occurs very frequently during gameplay.
Shifting zero by any amount always gives zero.

Before:
41 B9 00 00 00 00  mov r9d,0
41 8B CF           mov ecx,r15d
49 C1 E1 20        shl r9,20h
49 D3 F9           sar r9,cl
49 C1 E9 20        shr r9,20h

After:
Nothing, register is set to constant zero.

Before:
41 B8 00 00 00 00  mov r8d,0
41 8B CF           mov ecx,r15d
49 C1 E0 20        shl r8,20h
49 D3 F8           sar r8,cl
41 8B C0           mov eax,r8d
49 C1 E8 20        shr r8,20h
44 85 C0           test eax,r8d
0F 95 45 58        setne byte ptr [rbp+58h]

After:
C6 45 58 00        mov byte ptr [rbp+58h],0

Occurs a bunch of times in Super Mario Sunshine. Since this is an arithmetic shift, a similar optimization can be done for constant -1 (0xFFFFFFFF), but I couldn't find any game where this happens.
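A quick Python check of both constants is sketched below, using a plain model of srawx (carry set when the source is negative and a 1 bit is shifted out). The model and the name `sraw` are assumptions made for illustration, not code from this PR. Under that model the -1 case has a constant result, but its carry still depends on whether the masked shift amount is zero.

```python
MASK32 = 0xFFFFFFFF

def sraw(rs, rb):
    """Plain model of srawx: returns (result, carry)."""
    rs &= MASK32
    amount = rb & 0x3F
    signed = rs - (1 << 32) if rs & 0x80000000 else rs
    if amount >= 32:
        return (MASK32 if signed < 0 else 0), int(signed < 0)
    shifted_out = rs & ((1 << amount) - 1)
    return (signed >> amount) & MASK32, int(signed < 0 and shifted_out != 0)

# Source register known to be zero: result and carry are both always zero.
zero_outcomes = {sraw(0, rb) for rb in range(64)}

# Source register known to be -1: the result is always -1, but two carry
# values are possible depending on the shift amount.
minus_one_outcomes = {sraw(MASK32, rb) for rb in range(64)}

print(zero_outcomes)
print(sorted(minus_one_outcomes))
```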
It is possible to generate more optimal instruction sequences for various instructions if some or all of the input registers are known to hold a specific value.
subfic
Example
Before:
After:
srwx
Example 1
Before:
After:
Example 2
Before:
After:
Example 3
Before:
After:
Nothing, register is set to constant zero.
slwx
Example 1
Before:
After:
Example 2
Before:
After:
Example 3
Before:
After:
Nothing, register is set to constant zero.
srawx
Example 1
Before:
After:
Example 2
Before:
After:
Example 3
Before:
After:
Example 4
Before:
After:
srawix
Example
Before:
After:
rlwnmx
Example
Before:
After: