Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

Closed
tannergooding opened this issue Mar 22, 2017 · 30 comments
Closed

Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

tannergooding opened this issue Mar 22, 2017 · 30 comments
Labels
arch-x64 arch-x86 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@tannergooding
Copy link
Member

We are currently emitting the INC and DEC instructions for x86 and x64 CodeGen. For example, the following loop:

for (var index = 0; index < 5; index++)
{
    Console.WriteLine(index);
}

generates the following assembly:

00007FFCC8110B20  push        rbp  
00007FFCC8110B21  sub         rsp,30h  
00007FFCC8110B25  lea         rbp,[rsp+30h]  
00007FFCC8110B2A  xor         eax,eax  
00007FFCC8110B2C  mov         dword ptr [rbp-4],eax  
00007FFCC8110B2F  mov         qword ptr [rbp+10h],rcx  
00007FFCC8110B33  cmp         dword ptr [7FFCC7FB4CF0h],0  
00007FFCC8110B3A  je          00007FFCC8110B41  
00007FFCC8110B3C  call        00007FFD27D1B930  
00007FFCC8110B41  xor         ecx,ecx  
00007FFCC8110B43  mov         dword ptr [rbp-4],ecx  
00007FFCC8110B46  nop  
00007FFCC8110B47  jmp         00007FFCC8110B59  
00007FFCC8110B49  mov         ecx,dword ptr [rbp-4]  
00007FFCC8110B4C  call        00007FFCC81106E0  
00007FFCC8110B51  mov         eax,dword ptr [rbp-4]  
00007FFCC8110B54  inc         eax  
00007FFCC8110B56  mov         dword ptr [rbp-4],eax  
00007FFCC8110B59  mov         ecx,dword ptr [rbp-4]  
00007FFCC8110B5C  cmp         ecx,5  
00007FFCC8110B5F  jl          00007FFCC8110B49  
00007FFCC8110B61  nop  
00007FFCC8110B62  lea         rsp,[rbp]  
00007FFCC8110B66  pop         rbp  
00007FFCC8110B67  ret

However, the Intel 64 and IA-32 Architectures Optimization Reference Manual recommends the following:
image
category:cq
theme:basic-cq
skill-level:expert
cost:small

@mikedn
Copy link
Contributor

mikedn commented Mar 22, 2017

However, the Intel 64 and IA-32 Architectures Optimization Reference Manual recommends the following:

So? Last time I checked Intel's own compiler didn't follow this particular rule.

In general, INC/DEC are more likely to cause problems if the flags are used after such an instruction. This isn't usually the case in JIT generated code.

@tannergooding
Copy link
Member Author

@mikedn, regardless of whether their own compiler does it. This is overall a fairly trivial fix that is listed as having a medium impact (essentially it is removing a check in the code gen that says use inc if the addend is a constant of 1).

@tannergooding
Copy link
Member Author

Given that the fix is trivial to make, it should be fairly simple to make it and see if it has an visible perf impact.

@mikedn
Copy link
Contributor

mikedn commented Mar 22, 2017

Given that the fix is trivial to make, it should be fairly simple to make it and see if it has an visible perf impact.

Yes, it is trivial, I looked at this many months ago. The problem is that ADD reg, 1 is one byte larger than INC reg so it's not so obvious that this change is beneficial.

@tannergooding
Copy link
Member Author

Well, we have a plethora of perf tests now that would make this much easier to validate 😄

@mikedn
Copy link
Contributor

mikedn commented Mar 22, 2017

Well, we have a plethora of perf tests now that would make this much easier to validate

Well, have fun 😄. Note that this might be more effective after/together with #6794

@RussKeldorph
Copy link
Contributor

@russellhadley @fiigii @litian2025 Do you want to weigh in on this? My impression is that "avoid inc/dec" is outdated advice, but I'm not sure.

@dotnet/jit-contrib

@tannergooding
Copy link
Member Author

@RussKeldorph, this is coming from the June 2016 copy of the SOG.

It might be important to note that this case was improved with the Sandy Bridge architecture and newer, but they still recommend avoiding the instructions where possible.

image

@BruceForstall BruceForstall changed the title Do not use the INC or DEC intrusctions in x86 or x64 CodeGen Do not use the INC or DEC instructions in x86 or x64 CodeGen Mar 29, 2017
@jarodrig
Copy link

jarodrig commented Apr 2, 2017

We should not avoid emitting inc/dec anymore: check out VC++ default codegen and you will see we use it liberally there.

-Juan (Intel)

@jarodrig
Copy link

jarodrig commented Apr 2, 2017

..and you should see that in the case where VC++ can detect the consumer of a flag generated by a partial flag writer, it should be emitting add/sub instructions -- the point is that you should not see blanket substitution of inc/dec by add/sub everywhere.

-Juan (Intel)

@litian2025
Copy link

Yes, I agree with what Juan's points above, beside VC++, in Chakra, here is what we do it, first check if there's partial flag dependency, then make the changes only when necessary: https://github.com/Microsoft/ChakraCore/blob/master/lib/Backend/Peeps.cpp#L160_#L172

@tannergooding
Copy link
Member Author

@RussKeldorph, do you know if CoreCLR has the appropriate logic to detect the partial flag dependency (and uses add/sub in those cases)? If so, I'm fine with closing the bug; otherwise, we should probably keep it open to track adding such logic.

@litian2025
Copy link

@tannergooding @RussKeldorph @fiigii @jarodrig we will take a look at it and report back.

@mikedn
Copy link
Contributor

mikedn commented May 2, 2017

@litian2025 It's not clear what Chakra does and why but there's nothing like this in JIT.

@fiigii
Copy link
Contributor

fiigii commented May 2, 2017

ChakraCore does this in peephole optimization with the rules:

// Check for any of the following patterns which can cause partial flag dependency
//
//                                                        Jcc or SHL or SHR or SAR or SHLD(in case of x64)
// Jcc or SHL or SHR or SAR or SHLD(in case of x64)       Any Instruction
// INC or DEC                                             INC or DEC
// -------------------------------------------------- OR -----------------------
// INC or DEC                                             INC or DEC
// Jcc or SHL or SHR or SAR or SHLD(in case of x64)       Any Instruction
//                                                        Jcc or SHL or SHR or SAR or SHLD(in case of x64)
//
// With this optimization if any of the above pattern found, substitute INC/DEC with ADD/SUB respectively.

When the peephole optimizer encounters an inc or dec, it checks one or two of the previous/later instructions to determine if substitute inc/dec with add/sub.

RyuJIT does not have peephole optimization, but I think that we can implement this in code generator by checking previous/later instructions through gtPrev/gtNext pointers.

@mikedn
Copy link
Contributor

mikedn commented May 2, 2017

ChakraCore does this in peephole optimization with the rules

Yes, a saw that comment but it's not clear where did those rules come from. For example, one variant is JCC followed by INC/DEC, what's that for? And the Chakra peep seems to be targeted solely at Atoms...

RyuJIT does not have peephole optimization, but I think that we can implement this in code generator by checking previous/later instructions through gtPrev/gtNext pointers.

Yes, that should work for INC/DEC followed by JCC. It won't work the other way around because JCC followed by INC/DEC implies different basic blocks.

@fiigii
Copy link
Contributor

fiigii commented May 2, 2017

@mikedn Yes, this ChakraCore peep only works for Atom. According to Intel® 64 and IA-32 Architectures
Optimization Reference Manual 14.2.2.9

INC and DEC instructions require an additional uop to merge the flags as they are partial flag writers. As a result, a branch instruction depending on an INC or a DEC instruction incurs a 1 cycle penalty. Note that this penalty only applies to branches that are directly dependent on the INC or DEC instruction.
Assembly/Compiler Coding Rule 3. (M impact, M generality) Use CMP/ADD/SUB instructions to
compute branch conditions instead of INC/DEC instructions whenever possible.

The partial flags stall only happens on older CPUs (P4) and Atom, for new Intel CPUs are able to rename each flag bit separately.

@briansull
Copy link
Contributor

briansull commented May 2, 2017

So we can close this issue as optimizing for the ancient P4 architectural is not currently necessary

Pentium 4 was a line of single-core central processing units (CPUs) for desktops, laptops and entry-level servers introduced by Intel on November 20, 2000

And the Atom versions were introduced in 2008-2011 for budget PC buyers.

@fiigii
Copy link
Contributor

fiigii commented May 2, 2017

[Correction]
@briansull Sorry, I checked the manual again. Nehalem also has the partial register access stall on flag registers. It was totally resolved by Sandy Bridge.

@briansull
Copy link
Contributor

In any case the impacted CPU's are all fairly old processors:

Nehalem was replaced with the Sandy Bridge microarchitecture, released in January 2011.

@tannergooding
Copy link
Member Author

@briansull, that is barely older (less than a month) than Windows 7 SP1 which still has some 48% of the market share (and which still has extended supported until Jan 14, 2020).

@briansull
Copy link
Contributor

And Yes, it is totally supported, with an occasional 2 cycle stall when an inc or dec instruction is encountered.

@litian2025
Copy link

@mikedn for chakra code peep optimization JCC followed by INC/DEC is in case of backward branch operation like a loop

@tannergooding
Copy link
Member Author

@briansull, I would think we would want to optimize this scenario as the actual optimization seems fairly trivial and it hits a fair bit of code (most for loops are an inc or a dec followed by a conditional jump)

@mikedn
Copy link
Contributor

mikedn commented May 3, 2017

@litian2025 I'm not familiar with Chakra's IR but it looks very much like linear IR so it's not clear how being next in linear order has anything to do with loops. It looks more like it is detecting an INC that is outside of the loop.

@litian2025
Copy link

litian2025 commented May 4, 2017

@mikedn maybe an example can help here:

Some Compare Instruction
LabelA:
JCC
Any instructions
INC/DEC
JMP LabelA

the purpose of the code is to catch pattern like this between JCC and INC/DEC

@mikedn
Copy link
Contributor

mikedn commented May 4, 2017

maybe an example can help here

Thanks for the example. That doesn't look like typical codegen for a loop, most of the time compilers turn while loops into do while loops so the JCC is at the bottom. In any case, the peep doesn't actually detect that a backward JMP follows so it will transform more INC/DEC instructions than necessary.

Well, it's a peep, it does what it can.

@mikedn
Copy link
Contributor

mikedn commented May 4, 2017

I would think we would want to optimize this scenario as the actual optimization seems fairly trivial and it hits a fair bit of code

When running on old processors.

most for loops are an inc or a dec followed by a conditional jump

Not really. Most for loops count up and the JIT can't transform them into count down loops that may produce a dec followed by a jcc. Your very own example is a count up loop. Let's see it again, this time with optimizations enabled:

G_M55886_IG03:
       8BCE                 mov      ecx, esi
       E81AFCFFFF           call     System.Console:WriteLine(int)
       FFC6                 inc      esi
       83FE05               cmp      esi, 5
       7CF2                 jl       SHORT G_M55886_IG03

The flags used by jl are produced by cmp, not by inc. All native compilers produce similar code (unless they decide to unroll but that can be disabled).

@JustDre
Copy link

JustDre commented Jun 3, 2019

Not really. Most for loops count up and the JIT can't transform them into count down loops that may produce a dec followed by a jcc. Your very own example is a count up loop. Let's see it again, this time with optimizations enabled:

@mikedn Just out of curiosity, why can't the JIT transform count-up loops into count-down loops?

FWIW, I tried manually transforming a count-up loop to a count-down loop and got
for (var x = count; x != 0; x--)
00007FFA9C3B74CF dec eax
00007FFA9C3B74D1 test eax,eax
00007FFA9C3B74D3 jne 00007FFA9C3B74AF

Here I'd say the 'test eax,eax' was unnecessary.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall
Copy link
Member

It doesn't seem likely we will address this, and it's even more out-of-date now, so closing.

@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 5, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Jan 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 arch-x86 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

10 participants