Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General inlining of tail calls #23370

Open
AndyAyersMS opened this Issue Mar 20, 2019 · 3 comments

Comments

Projects
None yet
1 participant
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 20, 2019

This note explores expanding jit inlining to handle cases of explicit tail calls.

cc @dotnet/jit-contrib @dsyme

Related issues: #18406, Microsoft/visualfsharp#6329

Current behavior of the jit

  • freely allows inlining of implicit tail calls
  • propagates implicit tail to qualified inlinee call sites
  • demotes implicit tail call to normal call, if it requires a helper
  • blocks inlining of methods that make explicit tail calls
  • blocks inlining methods at explicit tail call sites
  • will use the slow helper for explicit tail calls if necessary

We will use the following notation to describe a chain of calls and call sites:

A -(cs1:?t)-> B -(cs2:et)-> C

Here A, B, and C are methods, and cs1 and cs2 are call sites in A and
B respectively. We will also use A, B, and C in formulas where they
indicate the stack space used by the respective methods, and S for the total
stack needed by some logical chain of calls.

A call site can be:

  • ?t a tail call or a normal call
  • xt a normal call
  • it an implicit tail call
  • et an explicit tail call (can be fast or slow)

Case 0: Inlining at explicit tail call sites

A -(cs1:et)-> B

Before stack size is

S = max(A, B)

After stack size is

S' = A + B

worst case, when jit can't optimize some away any stack usage after inlining,
the stack delta is:

S' - S = A + B - max(A, B) = min(A, B)

For performance, the before picture may have required slow tail call, so we can
expect good perf boost from inlining, perhaps better than a typical inline. If
we could anticipate this we could even give inlining an extra profitability boost.

[Case 0 Heuristic]: allow if A or B is small stack, boost if cs1 was slow tail call

Heuristics that depend on estimating the stack size of A or the nature of the
tail call to B may not be workable in the current jit; see commentary below.
So we may decide to just rely on size of B:

[Simple Case 0 Heuristic]: allow if B is small stack

Inlining methods that make explicit tail calls

Case 1: method invoked from non-tail call site

If B is inlined then none of its internal sites can be tail call sites. So
they will all become normal calls:

A -(cs1:xt)-> B -(cs2:xt)-> C   ;; cs2 goes from et to xt

Before stack usage is:

S = A + max(B, C)

Stack consumption depends on inlining. Generally the jit won't be able to
estimate the stack size of C at the time it has to decide on whether or not
to inline B.

  • if B is inlined and C is not, S' = (A + B) + C, possibly less
  • if both B and C are inlined, S' = ((A + B) + C), possibly less

Assuming worst case where inlining does not allow the jit to optimize away any
stack usage, stack increase is

S' - S = A + B + C - (A + max(B, C)) = min(B, C)

As noted before the jit will likely be unable to estimate C.

For perf, if the call to C was a slow tail call, perf will improve beyond just
the normal inline boost. But jit may not be able to anticipate this easily.

[Case 1 Heuristic]: allow if B is small stack

Case2: method invoked from implicit tail call site

If cs1 is an implicit tail call, the jit can inline B and retain cs2 as
an explicit tail call:

   A -(cs1:it)-> B -(cs2:et)-> C

Before stack usage depends on whether B was tail called or called normally:

   S' = max(A, B, C) if B was tail called 
        max(A + B, C) if B was called

We will use the former as it is smaller.

After stack usage (assuming C not inlined)

S' = max(A + B, C)

So worst-case stack impact is

S' - S = max(A + B, C) - max(A, B, C) = min(A, B)

So for stack usage, inlining seems ok if either A or B is small stack.

For perf, jit should only do this if call to C does not change from a fast
tail call to a slow call. Generally slow tail calls happen because caller stack
is insufficiently large. A rough heuristic for this is the number of args for
A and B respectively (some ABIs will need more complex logic).

[Case2 Heuristic]: allow if B is small stack, and #args(A) >= #args(B)

Case3: method invoked from explicit call site

This dovetails with the Case0 and Case2 above and the reasoning is similar.

[Case3 Heuristic]: allow if B is small stack, and #args(A) >= #args(B)

Use of the tail call helper

In cases where the jit determines a tail call from A will need a helper,
it might be feasible to modify Aso that its frame has sufficient arg space to
make a fast tail call by giving A a special prolog. This would be viable
provided the tail call sites(s) were executed frequently enough that we'd be
willing to pay a bit more on every call, and non-tail call paths through A
were rare and perhaps did not involve other calls.

Notes

  • The jit lacks a way to obtain a reliable stack size estimate for an inline
    candidate, but presumably we could develop something. Using number
    and type of args and locals might be sufficient, though there are bad
    cases where jit temps predominate (we could presumably also screen based
    on IL size to help rule these out).
  • The jit also is unable to make early determination for when an inline
    candidate's tail calls will be slow tail calls. Here the information must be
    based on an IL prescan. The jit does not resolve inline candidate tokens in so
    may not know how many args are consumed at a call site. But again we might be
    able to guess well enough.
  • In cases above, if none of A, B, C are part of recursive cycles we
    are somewhat less worried about increase in stack size, as it can't be
    magnified. The jit probably can't reason about this, but some up front tools
    perhaps could.
  • Stack size estimates for A must take already accepted inlines into account,
    so "optimal" results given a number of possible candidates depend on order in
    which inline candidates are considered. Currently the jit is not smart about
    this, which is why the simplified heuristic for Case 0 is suggested.
  • If the jit was able to do "stack packing" it would be able to be more aggressive
    since tail call method locals are lifetime disjoint from root method locals. So
    stack size impact of inlining a tail call would be more like max(A,B) than A + B.
    This would remove most restrictions and might end up being a better option long
    term (it could also potentially help address other problems like excessive prolog
    zeroing).

Proposal

  • develop code to estimate stack size increase from an inline
  • develop code to estimate when an explicit tail call will end up using a helper
  • update inliner to unmark explicit tall call sites in inlinees called from
    non tail-call sites
  • update inliner with new heuristic based on above estimates and unblock
    inlines of methods at tail call sites and methods with explicit tail calls
  • validate new behavior on suitable test suite
  • explore possibility of using modified prologs to avoid or limit the use of tail call helpers

@AndyAyersMS AndyAyersMS added this to the Future milestone Mar 20, 2019

@AndyAyersMS

This comment has been minimized.

Copy link
Member Author

AndyAyersMS commented Mar 21, 2019

I should probably point out that the math above is a bit handwavy, as the size of a frame is dependent on a lot of factors. The post inline size S(A+B) should be close to A+B but may be more or less. One generally hopes it will be less, as frame size is often a good indicator of prolog cost.

Because the jit inliner is scriptable, the idea I have in mind for modelling the stack impact of inlining B into A is simply to conduct a huge number of actual measurements of the impact, similar to what I did a while back when I was building a model of the code size impact of inlines. We can then see how easy it is to estimate the stack frame impact given the facts available to the inliner (more likely, some biased estimator, so we're more likely to overestimate than underestimate).

Using actual measurements also has the benefit that it captures various jit quirks and adapt to different target ISAs and ABIs. Ideally the heuristic is largely independent of these so we get similar behavior everywhere, but we'll have to see what the numbers say.

@AndyAyersMS

This comment has been minimized.

Copy link
Member Author

AndyAyersMS commented Mar 21, 2019

A couple more notes

  • we don't want to make recursive explicit tail calls into inline candidates, as these will usually turn into loops.
  • if B has other call sites that are inline candidates, the stack size of B we are interested in is really the size we end up with after all those candidates (and any transitive candidates) are inlined. But there's no way for us to know this at the time we have to decide about inlining B (the jit inlines "top-down"). So one thought is to restrict the inline policy for ancestors of B to ensure that the initial size estimate for B remains plausible. That being said, if B is not marked as aggressive inline it is unlikely to make a lot of calls, and if it is, the user is implicitly telling us stack frame size increases are ok, so perhaps we don't need to worry about this in practice.
@AndyAyersMS

This comment has been minimized.

Copy link
Member Author

AndyAyersMS commented Mar 22, 2019

Preview of the jit changes needed, without any kind of heuristic: InlineExplicitTailCalls.

Sample diff on one of the methods from Microsoft/visualfsharp#6329:

;;; before
; Assembly listing for method Logger:Log(ubyte,ref):this

G_M33878_IG01:
       nop      

G_M33878_IG02:
       mov      rcx, gword ptr [rcx+8]
       mov      rdx, r8
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rcx], ecx

G_M33878_IG03:
       rex.jmp  rax     // tail call List.Add

;;; after
; Assembly listing for method Logger:Log(ubyte,ref):this

G_M33879_IG01:
       sub      rsp, 40
       nop      

G_M33879_IG02:
       mov      rcx, gword ptr [rcx+8]
       inc      dword ptr [rcx+20]        // inline List.Add
       mov      rdx, gword ptr [rcx+8]
       mov      eax, dword ptr [rcx+16]
       cmp      dword ptr [rdx+8], eax
       jbe      SHORT G_M33879_IG04       // capacity check
       lea      r9d, [rax+1]              // "fast path" add
       mov      dword ptr [rcx+16], r9d
       mov      rcx, rdx
       mov      edx, eax
       call     CORINFO_HELP_ARRADDR_ST
       nop      

G_M33879_IG03:
       add      rsp, 40
       ret      

G_M33879_IG04:
       mov      rdx, r8
       mov      rax, 0xD1FFAB1E

G_M33879_IG05:
       add      rsp, 40
       rex.jmp  rax                       // tail call List.AddWithResize
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.