Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

tannergooding · 2017-03-22T15:23:06Z

We are currently emitting the INC and DEC instructions for x86 and x64 CodeGen. For example, the following loop:

for (var index = 0; index < 5; index++)
{
    Console.WriteLine(index);
}

generates the following assembly:

00007FFCC8110B20  push        rbp  
00007FFCC8110B21  sub         rsp,30h  
00007FFCC8110B25  lea         rbp,[rsp+30h]  
00007FFCC8110B2A  xor         eax,eax  
00007FFCC8110B2C  mov         dword ptr [rbp-4],eax  
00007FFCC8110B2F  mov         qword ptr [rbp+10h],rcx  
00007FFCC8110B33  cmp         dword ptr [7FFCC7FB4CF0h],0  
00007FFCC8110B3A  je          00007FFCC8110B41  
00007FFCC8110B3C  call        00007FFD27D1B930  
00007FFCC8110B41  xor         ecx,ecx  
00007FFCC8110B43  mov         dword ptr [rbp-4],ecx  
00007FFCC8110B46  nop  
00007FFCC8110B47  jmp         00007FFCC8110B59  
00007FFCC8110B49  mov         ecx,dword ptr [rbp-4]  
00007FFCC8110B4C  call        00007FFCC81106E0  
00007FFCC8110B51  mov         eax,dword ptr [rbp-4]  
00007FFCC8110B54  inc         eax  
00007FFCC8110B56  mov         dword ptr [rbp-4],eax  
00007FFCC8110B59  mov         ecx,dword ptr [rbp-4]  
00007FFCC8110B5C  cmp         ecx,5  
00007FFCC8110B5F  jl          00007FFCC8110B49  
00007FFCC8110B61  nop  
00007FFCC8110B62  lea         rsp,[rbp]  
00007FFCC8110B66  pop         rbp  
00007FFCC8110B67  ret

However, the Intel 64 and IA-32 Architectures Optimization Reference Manual recommends the following:

category:cq
theme:basic-cq
skill-level:expert
cost:small

The text was updated successfully, but these errors were encountered:

mikedn · 2017-03-22T15:58:52Z

However, the Intel 64 and IA-32 Architectures Optimization Reference Manual recommends the following:

So? Last time I checked Intel's own compiler didn't follow this particular rule.

In general, INC/DEC are more likely to cause problems if the flags are used after such an instruction. This isn't usually the case in JIT generated code.

tannergooding · 2017-03-22T16:08:14Z

@mikedn, regardless of whether their own compiler does it. This is overall a fairly trivial fix that is listed as having a medium impact (essentially it is removing a check in the code gen that says use inc if the addend is a constant of 1).

tannergooding · 2017-03-22T16:08:55Z

Given that the fix is trivial to make, it should be fairly simple to make it and see if it has an visible perf impact.

mikedn · 2017-03-22T16:12:41Z

Given that the fix is trivial to make, it should be fairly simple to make it and see if it has an visible perf impact.

Yes, it is trivial, I looked at this many months ago. The problem is that ADD reg, 1 is one byte larger than INC reg so it's not so obvious that this change is beneficial.

tannergooding · 2017-03-22T16:36:40Z

Well, we have a plethora of perf tests now that would make this much easier to validate 😄

mikedn · 2017-03-22T17:04:03Z

Well, we have a plethora of perf tests now that would make this much easier to validate

Well, have fun 😄. Note that this might be more effective after/together with #6794

RussKeldorph · 2017-03-22T21:24:26Z

@russellhadley @fiigii @litian2025 Do you want to weigh in on this? My impression is that "avoid inc/dec" is outdated advice, but I'm not sure.

@dotnet/jit-contrib

tannergooding · 2017-03-22T21:32:39Z

@RussKeldorph, this is coming from the June 2016 copy of the SOG.

It might be important to note that this case was improved with the Sandy Bridge architecture and newer, but they still recommend avoiding the instructions where possible.

jarodrig · 2017-04-02T22:59:55Z

We should not avoid emitting inc/dec anymore: check out VC++ default codegen and you will see we use it liberally there.

-Juan (Intel)

jarodrig · 2017-04-02T23:28:47Z

..and you should see that in the case where VC++ can detect the consumer of a flag generated by a partial flag writer, it should be emitting add/sub instructions -- the point is that you should not see blanket substitution of inc/dec by add/sub everywhere.

-Juan (Intel)

litian2025 · 2017-04-04T23:20:00Z

Yes, I agree with what Juan's points above, beside VC++, in Chakra, here is what we do it, first check if there's partial flag dependency, then make the changes only when necessary: https://github.com/Microsoft/ChakraCore/blob/master/lib/Backend/Peeps.cpp#L160_#L172

tannergooding · 2017-04-07T19:25:52Z

@RussKeldorph, do you know if CoreCLR has the appropriate logic to detect the partial flag dependency (and uses add/sub in those cases)? If so, I'm fine with closing the bug; otherwise, we should probably keep it open to track adding such logic.

litian2025 · 2017-05-02T15:55:11Z

@tannergooding @RussKeldorph @fiigii @jarodrig we will take a look at it and report back.

mikedn · 2017-05-02T16:38:59Z

@litian2025 It's not clear what Chakra does and why but there's nothing like this in JIT.

fiigii · 2017-05-02T18:10:48Z

ChakraCore does this in peephole optimization with the rules:

// Check for any of the following patterns which can cause partial flag dependency
//
//                                                        Jcc or SHL or SHR or SAR or SHLD(in case of x64)
// Jcc or SHL or SHR or SAR or SHLD(in case of x64)       Any Instruction
// INC or DEC                                             INC or DEC
// -------------------------------------------------- OR -----------------------
// INC or DEC                                             INC or DEC
// Jcc or SHL or SHR or SAR or SHLD(in case of x64)       Any Instruction
//                                                        Jcc or SHL or SHR or SAR or SHLD(in case of x64)
//
// With this optimization if any of the above pattern found, substitute INC/DEC with ADD/SUB respectively.

When the peephole optimizer encounters an inc or dec, it checks one or two of the previous/later instructions to determine if substitute inc/dec with add/sub.

RyuJIT does not have peephole optimization, but I think that we can implement this in code generator by checking previous/later instructions through gtPrev/gtNext pointers.

mikedn · 2017-05-02T18:28:38Z

ChakraCore does this in peephole optimization with the rules

Yes, a saw that comment but it's not clear where did those rules come from. For example, one variant is JCC followed by INC/DEC, what's that for? And the Chakra peep seems to be targeted solely at Atoms...

RyuJIT does not have peephole optimization, but I think that we can implement this in code generator by checking previous/later instructions through gtPrev/gtNext pointers.

Yes, that should work for INC/DEC followed by JCC. It won't work the other way around because JCC followed by INC/DEC implies different basic blocks.

fiigii · 2017-05-02T22:52:49Z

@mikedn Yes, this ChakraCore peep only works for Atom. According to Intel® 64 and IA-32 Architectures
Optimization Reference Manual 14.2.2.9

INC and DEC instructions require an additional uop to merge the flags as they are partial flag writers. As a result, a branch instruction depending on an INC or a DEC instruction incurs a 1 cycle penalty. Note that this penalty only applies to branches that are directly dependent on the INC or DEC instruction.
Assembly/Compiler Coding Rule 3. (M impact, M generality) Use CMP/ADD/SUB instructions to
compute branch conditions instead of INC/DEC instructions whenever possible.

The partial flags stall only happens on older CPUs (P4) and Atom, for new Intel CPUs are able to rename each flag bit separately.

briansull · 2017-05-02T23:19:19Z

So we can close this issue as optimizing for the ancient P4 architectural is not currently necessary

Pentium 4 was a line of single-core central processing units (CPUs) for desktops, laptops and entry-level servers introduced by Intel on November 20, 2000

And the Atom versions were introduced in 2008-2011 for budget PC buyers.

fiigii · 2017-05-02T23:38:51Z

[Correction]
@briansull Sorry, I checked the manual again. Nehalem also has the partial register access stall on flag registers. It was totally resolved by Sandy Bridge.

briansull · 2017-05-02T23:51:31Z

In any case the impacted CPU's are all fairly old processors:

Nehalem was replaced with the Sandy Bridge microarchitecture, released in January 2011.

tannergooding · 2017-05-03T00:27:45Z

@briansull, that is barely older (less than a month) than Windows 7 SP1 which still has some 48% of the market share (and which still has extended supported until Jan 14, 2020).

briansull · 2017-05-03T01:11:05Z

And Yes, it is totally supported, with an occasional 2 cycle stall when an inc or dec instruction is encountered.

litian2025 · 2017-05-03T21:29:40Z

@mikedn for chakra code peep optimization JCC followed by INC/DEC is in case of backward branch operation like a loop

tannergooding · 2017-05-03T21:50:39Z

@briansull, I would think we would want to optimize this scenario as the actual optimization seems fairly trivial and it hits a fair bit of code (most for loops are an inc or a dec followed by a conditional jump)

mikedn · 2017-05-03T22:11:26Z

@litian2025 I'm not familiar with Chakra's IR but it looks very much like linear IR so it's not clear how being next in linear order has anything to do with loops. It looks more like it is detecting an INC that is outside of the loop.

litian2025 · 2017-05-04T00:07:48Z

@mikedn maybe an example can help here:

Some Compare Instruction
LabelA:
JCC
Any instructions
INC/DEC
JMP LabelA

the purpose of the code is to catch pattern like this between JCC and INC/DEC

mikedn · 2017-05-04T05:30:37Z

maybe an example can help here

Thanks for the example. That doesn't look like typical codegen for a loop, most of the time compilers turn while loops into do while loops so the JCC is at the bottom. In any case, the peep doesn't actually detect that a backward JMP follows so it will transform more INC/DEC instructions than necessary.

Well, it's a peep, it does what it can.

mikedn · 2017-05-04T05:48:06Z

I would think we would want to optimize this scenario as the actual optimization seems fairly trivial and it hits a fair bit of code

When running on old processors.

most for loops are an inc or a dec followed by a conditional jump

Not really. Most for loops count up and the JIT can't transform them into count down loops that may produce a dec followed by a jcc. Your very own example is a count up loop. Let's see it again, this time with optimizations enabled:

G_M55886_IG03:
       8BCE                 mov      ecx, esi
       E81AFCFFFF           call     System.Console:WriteLine(int)
       FFC6                 inc      esi
       83FE05               cmp      esi, 5
       7CF2                 jl       SHORT G_M55886_IG03

The flags used by jl are produced by cmp, not by inc. All native compilers produce similar code (unless they decide to unroll but that can be disabled).

JustDre · 2019-06-03T21:41:43Z

Not really. Most for loops count up and the JIT can't transform them into count down loops that may produce a dec followed by a jcc. Your very own example is a count up loop. Let's see it again, this time with optimizations enabled:

@mikedn Just out of curiosity, why can't the JIT transform count-up loops into count-down loops?

FWIW, I tried manually transforming a count-up loop to a count-down loop and got
for (var x = count; x != 0; x--)
00007FFA9C3B74CF dec eax
00007FFA9C3B74D1 test eax,eax
00007FFA9C3B74D3 jne 00007FFA9C3B74AF

Here I'd say the 'test eax,eax' was unnecessary.

BruceForstall · 2020-12-05T03:24:51Z

It doesn't seem likely we will address this, and it's even more out-of-date now, so closing.

BruceForstall changed the title ~~Do not use the INC or DEC intrusctions in x86 or x64 CodeGen~~ Do not use the INC or DEC instructions in x86 or x64 CodeGen Mar 29, 2017

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall closed this as completed Dec 5, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 5, 2020

dotnet locked as resolved and limited conversation to collaborators Jan 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

tannergooding commented Mar 22, 2017

mikedn commented Mar 22, 2017

tannergooding commented Mar 22, 2017

tannergooding commented Mar 22, 2017

mikedn commented Mar 22, 2017

tannergooding commented Mar 22, 2017

mikedn commented Mar 22, 2017

RussKeldorph commented Mar 22, 2017

tannergooding commented Mar 22, 2017

jarodrig commented Apr 2, 2017

jarodrig commented Apr 2, 2017

litian2025 commented Apr 4, 2017

tannergooding commented Apr 7, 2017

litian2025 commented May 2, 2017

mikedn commented May 2, 2017

fiigii commented May 2, 2017

mikedn commented May 2, 2017

fiigii commented May 2, 2017 •

edited

briansull commented May 2, 2017 •

edited

fiigii commented May 2, 2017 •

edited

briansull commented May 2, 2017

tannergooding commented May 3, 2017

briansull commented May 3, 2017

litian2025 commented May 3, 2017

tannergooding commented May 3, 2017

mikedn commented May 3, 2017

litian2025 commented May 4, 2017 •

edited

mikedn commented May 4, 2017

mikedn commented May 4, 2017 •

edited

JustDre commented Jun 3, 2019 •

edited

BruceForstall commented Dec 5, 2020

Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

Do not use the INC or DEC instructions in x86 or x64 CodeGen #7697

Comments

tannergooding commented Mar 22, 2017

mikedn commented Mar 22, 2017

tannergooding commented Mar 22, 2017

tannergooding commented Mar 22, 2017

mikedn commented Mar 22, 2017

tannergooding commented Mar 22, 2017

mikedn commented Mar 22, 2017

RussKeldorph commented Mar 22, 2017

tannergooding commented Mar 22, 2017

jarodrig commented Apr 2, 2017

jarodrig commented Apr 2, 2017

litian2025 commented Apr 4, 2017

tannergooding commented Apr 7, 2017

litian2025 commented May 2, 2017

mikedn commented May 2, 2017

fiigii commented May 2, 2017

mikedn commented May 2, 2017

fiigii commented May 2, 2017 • edited

briansull commented May 2, 2017 • edited

fiigii commented May 2, 2017 • edited

briansull commented May 2, 2017

tannergooding commented May 3, 2017

briansull commented May 3, 2017

litian2025 commented May 3, 2017

tannergooding commented May 3, 2017

mikedn commented May 3, 2017

litian2025 commented May 4, 2017 • edited

mikedn commented May 4, 2017

mikedn commented May 4, 2017 • edited

JustDre commented Jun 3, 2019 • edited

BruceForstall commented Dec 5, 2020

fiigii commented May 2, 2017 •

edited

briansull commented May 2, 2017 •

edited

fiigii commented May 2, 2017 •

edited

litian2025 commented May 4, 2017 •

edited

mikedn commented May 4, 2017 •

edited

JustDre commented Jun 3, 2019 •

edited