This repository has been archived by the owner on Mar 15, 2022. It is now read-only.

Reader: implement tail calls #191

Closed
erozenfeld opened this issue Feb 24, 2015 · 25 comments

@erozenfeld
Member

No description provided.

@AndyAyersMS
Member

@richardlford would be nice to at least get the first layer of this in place and drop back to normal calls in most cases. This should unblock a lot of methods in our testing.

@AndyAyersMS
Member

Started looking into this. First step is to just pass on tail call opportunities without the 'tail' prefix, since the jit is not obligated to make these into tail calls. Internally this means trying to handle cases that are isTailCall() && isUnmarkedTailCall() as normal calls. This covers (as far as I can tell) all of the cases from C#, which does not seem to use the tail prefix.
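
A minimal sketch of that screening, using the reader predicates named above; emitNormalCall and failExplicitTailCall are hypothetical helper names:

// Sketch only: isTailCall / isUnmarkedTailCall are the reader predicates
// mentioned above; the two emit helpers are hypothetical names.
if (isTailCall(callSite) && isUnmarkedTailCall(callSite)) {
  // No 'tail.' prefix in the IL: a tail call is optional here, so
  // lowering this as an ordinary call is legal.
  emitNormalCall(callSite);
} else if (isTailCall(callSite)) {
  // Explicit 'tail.' prefix: a genuine tail call is required, which
  // we can't produce yet.
  failExplicitTailCall(callSite);
}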

This unblocks a lot of methods, but it appears we generate incorrect code for some of them. Every test fails with a runtime error, and at the point of failure there are no managed frames on the stack. So it seems like we are corrupting memory or something similar.

I've isolated System.Buffer.Memmove as the first method that causes issues, so I'll dig into that next.

@AndyAyersMS
Member

Memmove does a lot of case analysis internally. It also gets called repeatedly. I verified in the debugger that the first 20 calls or so seem to work as expected, but there are many more. A lot of these calls come from code that manipulates file paths during CLR initialization, so the failure mode may be sensitive to the length of the path to CoreCLR.exe and similar things.

It seems plausible that for some particular length things go bad and Memmove ends up corrupting memory somewhere. I'll set the debugger to capture the different length values that appear; perhaps by focusing on "unusual" values we can pin down which particular calls are worth looking at more closely. I'll also try verifying heap integrity after each call to see if that can spot the corruption.
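
To capture the lengths, a breakpoint command along these lines should do (illustrative: the entry address is a placeholder, and it assumes the length argument arrives in r8 under the Windows x64 convention):

bp <Memmove-entry-address> ".printf \"Memmove len = %d\\n\", @r8; gc"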

@AndyAyersMS
Member

There are 739 calls to Memmove, with 73 distinct length values for the copies. I've verified that the heap is still in good shape before each call and at the end, when the process is failing. So if the problem is memory corruption, the corruption is not happening anywhere on the GC heap.

The Memmove code special-cases short moves (16 bytes or fewer) and very long ones (512 bytes or more). The short cases we see are 2, 10, and 14 bytes, so I'll look at those and verify they work ok. The long cases fall back to pinvokes and those seem to be ok, but I'll double-check them too.

@AndyAyersMS
Member

Because of the way Memmove is structured, you can force it (via a debugger break-in altering RIP) to always follow the pinvoke path. That works. So I made this conditional and have been binary searching for which call to Memmove causes problems. For my repro case I have it narrowed down to the range 701-712: if the first 700 calls just run the normal code, everything is ok; if the first 712 use the normal code, the app fails.

The windbg breakpoint command I'm using is something like this:

"r$t0 = @$t0 + 1; .if (@$t0 > 0n706) {r @rip=0x7ffaabac1aad} .else {}; g"

Here the $t0 pseudo-register counts the number of hits; once it gets past some value, the debugger modifies RIP to force the code down the pinvoke path. I do this at the first conditional branch in Memmove, where it is checking for overlap between src and dst.

@AndyAyersMS
Member

In my repro, call #708 goes bad. This call is copying 0xc2 (194) bytes to an address that ends in 4. So the code first stores 4 bytes to get the destination aligned, then does 11 rounds of storing two qwords (176 bytes in all, 180 including the initial 4-byte write). This leaves 14 bytes, which it stores as 8, 4, and 2. That last write of 2, however, actually writes 4 bytes, corrupting the memory just beyond the buffer. The LLVM IR shows this clearly:

; <label>:365                                     ; preds = %361
  %366 = load i8** %arg0
  %367 = load i8** %arg1
  %368 = bitcast i8* %367 to i16*
  %369 = load i16* %368, align 8        ; <-- i16 load here
  %370 = sext i16 %369 to i32
  %371 = bitcast i8* %366 to i32*
  store i32 %370, i32* %371, align 8    ; <-- but an i32 store here

@kyulee1
Contributor

kyulee1 commented Mar 6, 2015

So we expect something like this?

%370 = sext i16 %369 to i16         ; <-- not needed
%371 = bitcast i8* %366 to i16*
store i16 %370, i16* %371, align 8

Is it a bug in the LLVM IR expansion that inadvertently sign-extends, or our bug in the reader?

@AndyAyersMS
Member

The bug was that storePrimitiveType was using the stack type, not the CorInfoType. I have a fix.
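
A sketch of the shape of the fix, assuming the reader's getType helper and LLVMBuilder member (the trunc placement is illustrative, not the exact patch):

// Illustrative: derive the store width from the IL element type
// (CorInfoType), not from the widened evaluation-stack value.
llvm::Type *StoreTy = this->getType(CorType, NULL);          // e.g. i16
llvm::Value *Narrow = LLVMBuilder->CreateTrunc(ValueToStore, StoreTy);
LLVMBuilder->CreateStore(Narrow, Address);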

@kyulee1
Contributor

kyulee1 commented Mar 6, 2015

So we need this->getType(CorInfoType, NULL) instead of Value->getType()? This is probably my bad.

@AndyAyersMS
Member

Basically, yeah. When values are pushed on the stack they widen to i32 or i64, so you can't use the stack type to figure out how many bytes to store.
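
In IR terms (using the same pre-LLVM-3.7 syntax as the dump above), the shape is roughly:

%v = load i16* %src          ; 2-byte element
%w = sext i16 %v to i32      ; widened while on the evaluation stack
%t = trunc i32 %w to i16     ; narrow back to the element width
store i16 %t, i16* %dst      ; 2-byte store, as the IL intends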

@AndyAyersMS
Member

Commit e3efa38 is in -- we should now only fail when there are explicit tail calls. Down the road we can implement the optimization to opportunistically handle the unmarked tail calls more efficiently.

@AndyAyersMS
Member

Note that LLVM has tail call annotations on call instructions, so once we get more of the legality checking implemented in the reader, we can use them to try and convince downstream phases to actually do the transformation. There are both "may" and "must" forms.
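
A minimal sketch of the two forms (hypothetical callee @f):

declare i32 @f(i32)

define i32 @caller(i32 %x) {
  %r1 = tail call i32 @f(i32 %x)      ; "may" form: just a hint
  %r2 = musttail call i32 @f(i32 %x)  ; "must" form: guaranteed tail call
  ret i32 %r2                         ; musttail requires this ret to follow
}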

@erozenfeld erozenfeld assigned erozenfeld and unassigned AndyAyersMS Sep 11, 2015
@erozenfeld
Member Author

We now hit these in lots of CoreCLR tests, most of them in JIT\Methodical\tailcall.

@erozenfeld
Member Author

Looks like the existing LLVM tail annotations aren't enough for implementing tail.call:
musttail has the constraint that the caller and callee signatures must match. That worked for implementing jmp, but tail.call only requires the return types to be compatible.
tail guarantees the optimization only if the caller and the callee have the fastcc calling convention (and some platform-specific constraints are met).

@pgavlin
Contributor

pgavlin commented Sep 16, 2015

@erozenfeld it might be worth looking into what GHC does here.

@erozenfeld
Member Author

It turns out that tail call optimization is guaranteed for their calling convention as well:
http://llvm.org/docs/CodeGenerator.html#tail-call-optimization
Tail call optimization, callee reusing the stack of the caller, is currently supported on x86/x86-64 and PowerPC. It is performed if:

  • Caller and callee have the calling convention fastcc, cc 10 (GHC calling convention) or cc 11 (HiPE calling convention).
  • The call is a tail call - in tail position (ret immediately follows call and ret uses value of call or is void).
  • Option -tailcallopt is enabled.
  • Platform-specific constraints are met.
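
So a shape like the following, compiled with -tailcallopt, is guaranteed a tail call (minimal sketch):

define fastcc i32 @callee(i32 %n) {
  ret i32 %n
}

define fastcc i32 @caller(i32 %n) {
  ; tail position: ret immediately follows and returns the call's value
  %r = tail call fastcc i32 @callee(i32 %n)
  ret i32 %r
}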

@erozenfeld
Member Author

So-called sibling call optimization may be good enough for us.
http://llvm.org/docs/CodeGenerator.html#sibling-call-optimization

Sibling call optimization is a restricted form of tail call optimization.
Unlike tail call optimization described in the previous section, it can be
performed automatically on any tail calls when -tailcallopt option is not
specified.

Sibling call optimization is currently performed on x86/x86-64 when the
following constraints are met:

  • Caller and callee have the same calling convention. It can be either c or
    fastcc.
  • The call is a tail call - in tail position (ret immediately follows call and
    ret uses value of call or is void).
  • Caller and callee have matching return type or the callee result is not used.
  • If any of the callee arguments are being passed in stack, they must be
    available in caller's own incoming argument stack and the frame offsets must
    be the same.
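
For example, this minimal sketch meets all four constraints without -tailcallopt: default calling convention on both sides, matching i32 return types, tail position, and no stack-passed arguments.

define i32 @callee(i32 %a, i32 %b) {
  %s = add i32 %a, %b
  ret i32 %s
}

define i32 @caller(i32 %a, i32 %b) {
  %r = tail call i32 @callee(i32 %a, i32 %b)
  ret i32 %r
}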

@erozenfeld
Member Author

@russellhadley please reassign as appropriate

@russellhadley
Contributor

@AndyAyersMS Assigning since this overlaps with the tailcall work you're doing now.

@AndyAyersMS
Member

Have returned to this and have it mostly coded up -- debugging now.

We classify calls as tail/notail in the reader and let LLVM do the sibling call opt where it can. I can see it kicking in, which is cool. I must still be missing a legality check, though.
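
The reader-side marking amounts to something like this (a sketch; isLegalTailCall stands in for the legality checks still being debugged):

// Sketch: mark the call 'tail' and let LLVM's sibling call opt take over.
llvm::CallInst *Call = LLVMBuilder->CreateCall(Target, Args);
if (isLegalTailCall(CallSiteInfo))   // hypothetical legality predicate
  Call->setTailCall();               // CallInst::setTailCall marks it 'tail'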

@AndyAyersMS
Member

Found a bug in LLVM -- when it goes to readjust the stack in the "epilog" it calls findDeadCallerSavedReg, which does not have the right register sets for our calling convention. So it makes a bad choice...

push        rax  
mov         qword ptr [rsp],0  
mov         qword ptr [rsp],rcx  
mov         rcx,qword ptr [rsp]  
mov         rax,7FD32BFF638h  
xor         edx,edx  
xor         r8d,r8d  
pop         rsi                   // oops: rsi is callee-saved in our convention
jmp         rax  

Worked around this, but there are still more issues to sort through...

@AndyAyersMS
Member

Issues fixed, things looking good. Will have to push the LLVM change out first though.

@AndyAyersMS
Member

LLVM fix is out for review.

@AndyAyersMS
Member

With that fix and my changes, I can pass the local Windows tests. Still need to see how this interacts with EH and GC, and try it on Linux...

@AndyAyersMS
Member

LLVM fix is now in the main LLVM tree. Now it needs to show up in our MS branch...

AndyAyersMS added a commit to AndyAyersMS/llilc that referenced this issue Nov 24, 2015
Closes dotnet#191.

Add the `tail` modifier to calls that satisfy correctness checks for tail calls (either explicit or implicit). This will enable LLVM to perform "sibling" call optimizations during lowering.

Do a very simplistic tracking of address-taken locals and args to screen out implicit tail call candidates, and do likewise with localloc. Both need to be detected during the reader's first pass, so add suitable asserts.

Also start tracking unsafe locals, since they (and localloc) will inspire stack security protection checking (aka /GS) and will inhibit tail calls. We still don't actually do the checks (see dotnet#353).

Add logic to the ABI classifier so we can detect when we're introducing on-stack references for call lowerings, and avoid tail calling in those cases too. This can also be made smarter (e.g. we might be able to copy-prop and use the values passed by the caller).

Have the jit options determine if the jit is optimizing and use the option setting in the code rather than checking the flags.

Remove existing NYI for explicit tail calls.

Verified by hand that the previously excluded tail call tests all now pass and seem to exercise most of the tail call cases the tests intended, so un-excluded them. Most "normal" bring-up tests create around 150 tail call sites, so I think the new codegen gets pretty well tested.

Also verified all this works with EH enabled and GC enabled.
AndyAyersMS added a commit to AndyAyersMS/llilc that referenced this issue Nov 25, 2015