New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lzcnt, tzcnt, popcnt false dependency not handled in CoreCLR #19555

Closed
damageboy opened this Issue Aug 19, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@damageboy
Copy link

damageboy commented Aug 19, 2018

Hi,
Following my successful foray into the brave world of CoreCLR architecture specific intrinsics I think I stumbled upon a relatively known bug/deficiency in most intel CPUs (I have no knowledge on how this affects AMD CPUs if at all) that is now affecting perf on CoreCLR.

It appears that when I write a tight / unrolled loop with any of the intrinsics mentioned in the issue name:

  • lzcnt (LeadingZeroCount())
  • tzcnt (TrailingZeroCount())
  • last but certainly not least popcnt (popcnt (PopCount())

CoreCLR generates seemingly very good code when inspecting the JIT output, however that code hits a rather known Intel false dependency bug, that has been covered quite extensively:

The gist of the matter is that most in-production Intel CPUs, from Sandy-Bridge and onward up to, and including Skylake (for popcnt) and not-including Skylake (for lzcnt, tzcnt) introduce a dependency on the DESTINATION register for those instructions.

So a loop, written in c# as:

for (var i = 0; i < n; i+=4, bits += 4)
    index += PopCount(bits[0]) + PopCount(bits[1]) + PopCount(bits[2]) + PopCount(bits[3]);}

Get's compiled/jitted down to (copy-pasted from CoreCLR JITDump):

G_M2736_IG05:        ; func=00, offs=000069H, size=0030H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
IN001e: 000069 F34C0FB816           popcnt   r10, qword ptr [rsi]
IN001f: 00006E 4903D2               add      rdx, r10
IN0020: 000071 F34C0FB85608         popcnt   r10, qword ptr [rsi+8]
IN0021: 000077 4903D2               add      rdx, r10
IN0022: 00007A F34C0FB85610         popcnt   r10, qword ptr [rsi+16]
IN0023: 000080 4903D2               add      rdx, r10
IN0024: 000083 F34C0FB85618         popcnt   r10, qword ptr [rsi+24]
IN0025: 000089 4903D2               add      rdx, r10
IN0026: 00008C 4183C104             add      r9d, 4
IN0027: 000090 4883C620             add      rsi, 32
IN0028: 000094 453BC8               cmp      r9d, r8d
IN0029: 000097 7CD0                 jl       SHORT G_M2736_IG05

As you can easily read/see, each popcnt instruction in this unrolled loop (and also between loop iterations) is writing its output into r10, thus, according to the described errata/cpu bug, is causing the next popcnt to stall before it can proceed with actual execution.

Given that the CPU can, in theory, execute two such popcnt instruction in parallel, this false dependency seems to have been messing around with various people's code perf for a long time.

I think I even found an inadvertent "complaint" about this inside this CoreCLR issue where @Tornhoof and @fiigii are trying to make a popcnt loop / unrolled loop perform faster but aren't achieving substantial speedup.

It seems like modern compilers, such as GCC, can and do alleviate the impact of this issue by clearing the bottom 32 bits of each destination preemptively before emitting the popcntinstruction, copy pasted to here for completeness...:

.L3:
        xor     edx, edx
        xor     ecx, ecx
        popcnt  rcx, QWORD PTR [rdi]
        popcnt  rdx, QWORD PTR [rdi+8]
        add     rdx, rcx
        xor     ecx, ecx
        popcnt  rcx, QWORD PTR [rdi+16]
        add     rdx, rcx
        xor     ecx, ecx
        popcnt  rcx, QWORD PTR [rdi+24]
        add     rdi, 32
        add     rdx, rcx
        add     eax, edx
        cmp     rdi, rsi
        jne     .L3

Would this be something that CoreCLR JIT could dynamically detect (according to CPU model/family) and insert into the instruction stream?

@damageboy damageboy changed the title lzcnt, tzcnt, popcnt false depndency not handled in CoreCLR lzcnt, tzcnt, popcnt false dependency not handled in CoreCLR Aug 19, 2018

@damageboy damageboy referenced this issue Aug 19, 2018

Closed

Implement BMI1 and BMI2 intrinsics #18712

7 of 10 tasks complete

@AndyAyersMS AndyAyersMS added this to the 3.0 milestone Aug 20, 2018

@AndyAyersMS

This comment has been minimized.

Copy link
Member

AndyAyersMS commented Aug 20, 2018

@fiigii -- is this something you'd recommend fixing?
cc @dotnet/jit-contrib

@damageboy

This comment has been minimized.

Copy link
Author

damageboy commented Aug 20, 2018

For what it's worth, from my C++ benchmarking, I can testify that unrolling a loop of 4 POPCNTs without this xor dest,test hack (I used inline assembly to force the compiler generate the suboptimal code) led to a 80% regression for 4096 bit arrays, for example...

In other words the gcc generate POPCNT ran at X ns for 4096 bits, while the unoptimized POPCNT ran at 1.8 X...

I'm writing a blog post about what I'm doing, and I can incorporate those tests if anyone is interested...?

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Aug 20, 2018

@damageboy Thanks for opening this issue. This issue is also detected by Intel and logged in errata (e.g., SKL029).

However, the fix will be non-trivial that may need to be done in the instruction scheduling.
For you code example, there is anti-dependency from ADD and the following POPCNT.

IN001e: 000069 F34C0FB816           popcnt   r10, qword ptr [rsi]
IN001f: 00006E 4903D2               add      rdx, r10    ;;; read r10
IN0020: 000071 F34C0FB85608         popcnt   r10, qword ptr [rsi+8] ;;; write r10 after read, can be broken by renaming
IN0021: 000077 4903D2               add      rdx, r10

We can see GCC inserting xor dst, dst before popcnt dst, src to break the anti-dependency because xor dst, dst can trigger the register renaming.

is this something you'd recommend fixing?

@AndyAyersMS I think we need a non-trivial fix like GCC does rather than unconditionally insert XOR before every POPCNT, but RyuJIT seems to not have a phase to do this (right?). cc @CarolEidt

@GSPP

This comment has been minimized.

Copy link

GSPP commented Aug 20, 2018

Isn't xor dst, dst done as part of register renaming on modern CPUs? It might cost exactly zero latency. Maybe it is the best strategy to just always insert it.

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Aug 20, 2018

Isn't xor dst, dst done as part of register renaming on modern CPUs?

Yes, but CPU front-end resource is also limited. Too much XOR will impact the front-end performance by renaming and code size.

Maybe it is the best strategy to just always insert it.

We may need more scenarios and data to decide the implementation approach.

@damageboy

This comment has been minimized.

Copy link
Author

damageboy commented Aug 20, 2018

@fiigii I think that what @GSPP meant was that it might simply be a good strategy to always insert it for those 3 affected instructions on the relevant CPU models..

A simple experiment with gcc seems to indicate that @GSPP is right:
gcc seems to just blindly insert a xor dest,dest whenever it sees a POPCNT on skylake (in that example), even if there isn't a loop involved (see example).

I completely understand what you are trying to say about the CPU front-end being a limited resource in the general sense, and that over-utililzing it also needs to be taken into consideration, it's just that for these specific instructions the benefit out-weighs the cost...

I don't think I've ever seen a POPCNT done outside of a loop in my life :), and the gcc behavior around a single POPCNT seems to imply that they haven't either :)

In addition, given that many devs deliberately take advantage of Intel CPUs ability to handle 2 independent POPCNT instructions in parallel on two different execution ports (If memory serves me well), missing the opportunity to free the CPU from stalling carries a heavy penalty given the context.

As such, blindly clearing the destination in order to trigger the register renaming is probably the sane thing to do for this bunch of orders, while definitely NOT the right thing to do in other cases...

Do you have any suggestions on how else we can collect data / scenarios to be able to come to a decision...?

@fiigii

This comment has been minimized.

Copy link
Collaborator

fiigii commented Aug 20, 2018

@damageboy Thank you so much for the investigation and clarification. Looks like we can give a try with unconditionally inserting XOR.

However, before fixing the issue in codegen, we need more infrastructure work in the runtime for the micro-architecture specific optimization. Currently, CoreCLR is only aware of the underlying hardware difference at ISA levels, so we need to make some change to let RyuJIT know the micro-architecture. After that, we can only insert XOR for lzcnt/tzcnt on SNB/IVB and HSW, and remove XOR for popcnt in the future.

@AndyAyersMS @CarolEidt @jkotas Do you agree to add the micro-architecture checks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment