This repository has been archived by the owner on Mar 15, 2022. It is now read-only.

Reader: implement tail calls #191

Closed
erozenfeld opened this issue Feb 24, 2015 · 25 comments

@erozenfeld
Member

No description provided.

@AndyAyersMS
Member

@richardlford would be nice to at least get the first layer of this in place and drop back to normal calls in most cases. This should unblock a lot of methods in our testing.

@AndyAyersMS
Member

Started looking into this. First step is to just pass on tail call opportunities without the 'tail' prefix, since the jit is not obligated to make these into tail calls. Internally this means trying to handle cases that are isTailCall() && isUnmarkedTailCall() as normal calls. This covers (as far as I can tell) all of the cases from C#, which does not seem to use the tail prefix.
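
A minimal sketch of that screening, using the reader predicates named above; emitNormalCall and failExplicitTailCall are hypothetical helper names:

// Sketch only: isTailCall / isUnmarkedTailCall are the reader predicates
// mentioned above; the two emit helpers are hypothetical names.
if (isTailCall(callSite) && isUnmarkedTailCall(callSite)) {
  // No 'tail.' prefix in the IL: a tail call is optional here, so
  // lowering this as an ordinary call is legal.
  emitNormalCall(callSite);
} else if (isTailCall(callSite)) {
  // Explicit 'tail.' prefix: a genuine tail call is required, which
  // we can't produce yet.
  failExplicitTailCall(callSite);
}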

This unblocks a lot of methods, but it appears we generate incorrect code for some of them. Every test fails with a runtime error, and at the point of failure there are no managed frames on the stack. So it seems like we are corrupting memory or something similar.

I've isolated System.Buffer.Memmove as the first method that causes issues, so I'll dig into that next.

@AndyAyersMS
Member

Memmove does a lot of case analysis internally. It also gets called repeatedly. I verified in the debugger that the first 20 calls or so seem to work as expected, but there are many more. A lot of these calls come from code that manipulates file paths during CLR initialization, so the failure mode may be sensitive to the length of the path to CoreCLR.exe and similar things.

It seems plausible that for some particular length things go bad and Memmove ends up corrupting memory somewhere. I'll set the debugger to capture the different length values that appear; perhaps by focusing on "unusual" values we can pin down which particular calls are worth looking at more closely. I'll also try verifying heap integrity after each call to see if that can spot the corruption.
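
To capture the lengths, a breakpoint command along these lines should do (illustrative: the entry address is a placeholder, and it assumes the length argument arrives in r8 under the Windows x64 convention):

bp <Memmove-entry-address> ".printf \"Memmove len = %d\\n\", @r8; gc"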

@AndyAyersMS
Member

There are 739 calls to Memmove, with 73 distinct length values for the copies. I've verified that the heap is still in good shape before each call and at the end, when the process is failing. So if the problem is memory corruption, the corruption is not happening anywhere on the GC heap.

The Memmove code special-cases short moves (16 bytes or fewer) and very long ones (512 bytes or more). The short cases we see are 2, 10, and 14 bytes, so I'll look at those and verify they work ok. The long cases fall back to pinvokes and those seem to be ok, but I'll double-check them too.

@AndyAyersMS
Member

Because of the way Memmove is structured, you can force it (via a debugger break-in altering RIP) to always follow the pinvoke path. That works. So I made this conditional and have been binary searching for which call to Memmove causes problems. For my repro case I have it narrowed down to the range 701-712: if the first 700 calls just run the normal code, everything is ok; if the first 712 use the normal code, the app fails.

The windbg breakpoint command I'm using is something like this:

"r$t0 = @$t0 + 1; .if (@$t0 > 0n706) {r @rip=0x7ffaabac1aad} .else {}; g"

Here the $t0 pseudo-register counts the number of hits; once it gets past some value, the debugger modifies RIP to force the code down the pinvoke path. I do this at the first conditional branch in Memmove, where it is checking for overlap between src and dst.

@AndyAyersMS
Member

In my repro, call #708 goes bad. This call is copying 0xc2 (194) bytes to an address that ends in 4. So the code first stores 4 bytes to get the destination aligned, then does 11 rounds of storing two qwords (176 bytes in all, 180 including the initial 4-byte write). This leaves 14 bytes, which it stores as 8, 4, and 2. That last write of 2, however, actually writes 4 bytes, corrupting the memory just beyond the buffer. The LLVM IR shows this clearly:

; <label>:365                                     ; preds = %361
  %366 = load i8** %arg0
  %367 = load i8** %arg1
  %368 = bitcast i8* %367 to i16*
  %369 = load i16* %368, align 8        ; <-- i16 load here
  %370 = sext i16 %369 to i32
  %371 = bitcast i8* %366 to i32*
  store i32 %370, i32* %371, align 8    ; <-- but an i32 store here

@kyulee1
Contributor

kyulee1 commented Mar 6, 2015

So we expect something like this?

%370 = sext i16 %369 to i16         ; <-- not needed
%371 = bitcast i8* %366 to i16*
store i16 %370, i16* %371, align 8

Is it a bug in the LLVM IR expansion that inadvertently sign-extends, or our bug in the reader?

@AndyAyersMS
Member

The bug was that storePrimitiveType was using the stack type, not the CorInfoType. I have a fix.
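
A sketch of the shape of the fix, assuming the reader's getType helper and LLVMBuilder member (the trunc placement is illustrative, not the exact patch):

// Illustrative: derive the store width from the IL element type
// (CorInfoType), not from the widened evaluation-stack value.
llvm::Type *StoreTy = this->getType(CorType, NULL);          // e.g. i16
llvm::Value *Narrow = LLVMBuilder->CreateTrunc(ValueToStore, StoreTy);
LLVMBuilder->CreateStore(Narrow, Address);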

@kyulee1
Contributor

kyulee1 commented Mar 6, 2015

So we need this->getType(CorInfoType, NULL) instead of Value->getType()? This is probably my bad.

@AndyAyersMS
Member

Basically, yeah. When values are pushed on the stack they widen to i32 or i64, so you can't use the stack type to figure out how many bytes to store.
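
In IR terms (using the same pre-LLVM-3.7 syntax as the dump above), the shape is roughly:

%v = load i16* %src          ; 2-byte element
%w = sext i16 %v to i32      ; widened while on the evaluation stack
%t = trunc i32 %w to i16     ; narrow back to the element width
store i16 %t, i16* %dst      ; 2-byte store, as the IL intends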

@AndyAyersMS
Member

Commit e3efa38 is in -- we should now only fail when there are explicit tail calls. Down the road we can implement the optimization to opportunistically handle the unmarked tail calls more efficiently.

@AndyAyersMS
Member

Note that LLVM has tail call annotations on call instructions, so once we get more of the legality checking implemented in the reader, we can use them to try and convince downstream phases to actually do the transformation. There are both "may" and "must" forms.
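
A minimal sketch of the two forms (hypothetical callee @f):

declare i32 @f(i32)

define i32 @caller(i32 %x) {
  %r1 = tail call i32 @f(i32 %x)      ; "may" form: just a hint
  %r2 = musttail call i32 @f(i32 %x)  ; "must" form: guaranteed tail call
  ret i32 %r2                         ; musttail requires this ret to follow
}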

@erozenfeld erozenfeld assigned erozenfeld and unassigned AndyAyersMS Sep 11, 2015
@erozenfeld
Member Author

We now hit these in lots of CoreCLR tests, most of them in JIT\Methodical\tailcall.

@erozenfeld
Member Author

Looks like the existing LLVM tail annotations aren't enough for implementing tail.call:
musttail has the constraint that the caller and callee signatures must match. That worked for implementing jmp, but tail.call only requires the return types to be compatible.
tail guarantees the optimization only if the caller and the callee have the fastcc calling convention (and some platform-specific constraints are met).

@pgavlin
Contributor

pgavlin commented Sep 16, 2015

@erozenfeld it might be worth looking into what GHC does here.

@erozenfeld
Member Author

It turns out that tail call optimization is guaranteed for their calling convention as well:
http://llvm.org/docs/CodeGenerator.html#tail-call-optimization
Tail call optimization, callee reusing the stack of the caller, is currently supported on x86/x86-64 and PowerPC. It is performed if:

  • Caller and callee have the calling convention fastcc, cc 10 (GHC calling convention) or cc 11 (HiPE calling convention).
  • The call is a tail call - in tail position (ret immediately follows call and ret uses value of call or is void).
  • Option -tailcallopt is enabled.
  • Platform-specific constraints are met.
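
So a shape like the following, compiled with -tailcallopt, is guaranteed a tail call (minimal sketch):

define fastcc i32 @callee(i32 %n) {
  ret i32 %n
}

define fastcc i32 @caller(i32 %n) {
  ; tail position: ret immediately follows and returns the call's value
  %r = tail call fastcc i32 @callee(i32 %n)
  ret i32 %r
}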

@erozenfeld
Member Author

So-called sibling call optimization may be good enough for us.
http://llvm.org/docs/CodeGenerator.html#sibling-call-optimization

Sibling call optimization is a restricted form of tail call optimization.
Unlike tail call optimization described in the previous section, it can be
performed automatically on any tail calls when -tailcallopt option is not
specified.

Sibling call optimization is currently performed on x86/x86-64 when the
following constraints are met:

  • Caller and callee have the same calling convention. It can be either c or
    fastcc.
  • The call is a tail call - in tail position (ret immediately follows call and
    ret uses value of call or is void).
  • Caller and callee have matching return type or the callee result is not used.
  • If any of the callee arguments are being passed in stack, they must be
    available in caller's own incoming argument stack and the frame offsets must
    be the same.
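
For example, this minimal sketch meets all four constraints without -tailcallopt: default calling convention on both sides, matching i32 return types, tail position, and no stack-passed arguments.

define i32 @callee(i32 %a, i32 %b) {
  %s = add i32 %a, %b
  ret i32 %s
}

define i32 @caller(i32 %a, i32 %b) {
  %r = tail call i32 @callee(i32 %a, i32 %b)
  ret i32 %r
}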

@erozenfeld
Member Author

@russellhadley please reassign as appropriate

@russellhadley
Contributor

@AndyAyersMS Assigning since this overlaps with the tailcall work you're doing now.

@AndyAyersMS
Member

Have returned to this and have it mostly coded up -- debugging now.

We classify calls as tail/notail in the reader and let LLVM do the sibling call opt where it can. I can see it kicking in, which is cool. I must still be missing a legality check, though.
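
The reader-side marking amounts to something like this (a sketch; isLegalTailCall stands in for the legality checks still being debugged):

// Sketch: mark the call 'tail' and let LLVM's sibling call opt take over.
llvm::CallInst *Call = LLVMBuilder->CreateCall(Target, Args);
if (isLegalTailCall(CallSiteInfo))   // hypothetical legality predicate
  Call->setTailCall();               // CallInst::setTailCall marks it 'tail'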

@AndyAyersMS
Member

Found a bug in LLVM -- when it goes to readjust the stack in the "epilog" it calls findDeadCallerSavedReg, which does not have the right register sets for our calling convention. So it makes a bad choice...

push        rax  
mov         qword ptr [rsp],0  
mov         qword ptr [rsp],rcx  
mov         rcx,qword ptr [rsp]  
mov         rax,7FD32BFF638h  
xor         edx,edx  
xor         r8d,r8d  
pop         rsi                   // oops: rsi is callee-saved in our convention
jmp         rax  

Worked around this, but there are still more issues to sort through...

@AndyAyersMS
Member

Issues fixed, things looking good. Will have to push the LLVM change out first though.

@AndyAyersMS
Member

LLVM fix is out for review.

@AndyAyersMS
Member

With that fix and my changes, I can pass the local Windows tests. Still need to see how this interacts with EH and GC, and try it on Linux...

@AndyAyersMS
Member

LLVM fix is now in the main LLVM tree. Now it needs to show up in our MS branch...

AndyAyersMS added a commit to AndyAyersMS/llilc that referenced this issue Nov 24, 2015
Closes dotnet#191.

Add the `tail` modifier to calls that satisfy correctness checks for tail calls (either explicit or implicit). This will enable LLVM to perform "sibling" call optimizations during lowering.

Do a very simplistic tracking of address-taken locals and args to screen out implicit tail call candidates, and do likewise with localloc. Both need to be detected during the reader's first pass, so add suitable asserts.

Also start tracking unsafe locals, since they (and localloc) will inspire stack security protection checking (aka /GS) and will inhibit tail calls. We still don't actually do the checks (see dotnet#353).

Add logic to the ABI classifier so we can detect when we're introducing on-stack references for call lowerings, and avoid tail calling in those cases too. This can also be made smarter (e.g. we might be able to copy-prop and use the values passed by the caller).

Have the jit options determine if the jit is optimizing and use the option setting in the code rather than checking the flags.

Remove existing NYI for explicit tail calls.

Verified by hand that the previously excluded tail call tests all now pass and seem to exercise most of the tail call cases the tests intended, so un-excluded them. Most "normal" bring-up tests create around 150 tail call sites, so I think the new codegen gets pretty well tested.

Also verified all this works with EH enabled and GC enabled.
AndyAyersMS added a commit to AndyAyersMS/llilc that referenced this issue Nov 25, 2015