General inlining of tail calls #12304

AndyAyersMS · 2019-03-20T21:23:27Z

This note explores expanding jit inlining to handle cases of explicit tail calls.

cc @dotnet/jit-contrib @dsyme

Related issues: #10487, dotnet/fsharp#6329

Current behavior of the jit

freely allows inlining of implicit tail calls
propagates implicit tail to qualified inlinee call sites
demotes implicit tail call to normal call, if it requires a helper
blocks inlining of methods that make explicit tail calls
blocks inlining methods at explicit tail call sites
will use the slow helper for explicit tail calls if necessary

We will use the following notation to describe a chain of calls and call sites:

A -(cs1:?t)-> B -(cs2:et)-> C

Here A, B, and C are methods, and cs1 and cs2 are call sites in A and
B respectively. We will also use A, B, and C in formulas where they
indicate the stack space used by the respective methods, and S for the total
stack needed by some logical chain of calls.

A call site can be:

?t a tail call or a normal call
xt a normal call
it an implicit tail call
et an explicit tail call (can be fast or slow)

Inlining methods that do not have explicit tail call sites

Case 0: Inlining when invoked at an explicit tail call site

A -(cs1:et)-> B

Before stack size is

S = max(A, B)

After stack size is

S' = A + B

worst case, when jit can't optimize some away any stack usage after inlining,
the stack delta is:

S' - S = A + B - max(A, B) = min(A, B)

For performance, the before picture may have required slow tail call, so we can
expect good perf boost from inlining, perhaps better than a typical inline. If
we could anticipate this we could even give inlining an extra profitability boost.

[Case 0 Heuristic]: allow inlining if A or B is small stack, boost if cs1 was slow tail call

Heuristics that depend on estimating the stack size of A or the nature of the
tail call to B may not be workable in the current jit; see commentary below.
So we may decide to just rely on size of B:

[Simple Case 0 Heuristic]: allow inlining if B is small stack

Inlining methods that have explicit tail call sites

Case 1: Inlining when invoked from non-tail call site

If B is inlined at a non-tail call site, then none of B's internal sites can be tail call sites. So
they will all become normal calls:

A -(cs1:xt)-> B -(cs2:xt)-> C   ;; cs2 goes from et to xt

Before stack usage is:

S = A + max(B, C)

Stack consumption depends on inlining. Generally the jit won't be able to
estimate the stack size of C at the time it has to decide on whether or not
to inline B.

if B is inlined and C is not, S' = (A + B) + C, possibly less
if both B and C are inlined, S' = ((A + B) + C), possibly less

Assuming worst case where inlining does not allow the jit to optimize away any
stack usage, stack increase is

S' - S = A + B + C - (A + max(B, C)) = min(B, C)

As noted before the jit will likely be unable to estimate C.

For perf, if the call to C was a slow tail call, perf will improve beyond just
the normal inline boost. But jit may not be able to anticipate this easily.

[Case 1 Heuristic]: allow inlining if B is small stack

Case2: Inlining when invoked from implicit tail call site

If cs1 is an implicit tail call, the jit can inline B and retain cs2 as
an explicit tail call:

   A -(cs1:it)-> B -(cs2:et)-> C

Before stack usage depends on whether B was tail called or called normally:

   S' = max(A, B, C) if B was tail called 
        max(A + B, C) if B was called

We will use the former as it is smaller.

After stack usage (assuming C not inlined)

S' = max(A + B, C)

So worst-case stack impact is

S' - S = max(A + B, C) - max(A, B, C) = min(A, B)

So for stack usage, inlining seems ok if either A or B is small stack.

For perf, jit should only do this if call to C does not change from a fast
tail call to a slow call. Generally slow tail calls happen because caller stack
is insufficiently large. A rough heuristic for this is the number of args for
A and B respectively (some ABIs will need more complex logic).

[Case2 Heuristic]: allow inlining if B is small stack, and #args(A) >= #args(B)

Case3: Inlining when invoked from explicit call site

This dovetails with the Case0 and Case2 above and the reasoning is similar.

[Case3 Heuristic]: allow inlining if B is small stack, and #args(A) >= #args(B)

Use of the tail call helper

In cases where the jit determines a tail call from A will need a helper,
it might be feasible to modify Aso that its frame has sufficient arg space to
make a fast tail call by giving A a special prolog. This would be viable
provided the tail call sites(s) were executed frequently enough that we'd be
willing to pay a bit more on every call, and non-tail call paths through A
were rare and perhaps did not involve other calls.

Notes

The jit lacks a way to obtain a reliable stack size estimate for an inline
candidate, but presumably we could develop something. Using number
and type of args and locals might be sufficient, though there are bad
cases where jit temps predominate (we could presumably also screen based
on IL size to help rule these out).
The jit also is unable to make early determination for when an inline
candidate's tail calls will be slow tail calls. Here the information must be
based on an IL prescan. The jit does not resolve inline candidate tokens in so
may not know how many args are consumed at a call site. But again we might be
able to guess well enough.
In cases above, if none of A, B, C are part of recursive cycles we
are somewhat less worried about increase in stack size, as it can't be
magnified. The jit probably can't reason about this, but some up front tools
perhaps could.
Stack size estimates for A must take already accepted inlines into account,
so "optimal" results given a number of possible candidates depend on order in
which inline candidates are considered. Currently the jit is not smart about
this, which is why the simplified heuristic for Case 0 is suggested.
If the jit was able to do "stack packing" it would be able to be more aggressive
since tail call method locals are lifetime disjoint from root method locals. So
stack size impact of inlining a tail call would be more like max(A,B) than A + B.
This would remove most restrictions and might end up being a better option long
term (it could also potentially help address other problems like excessive prolog
zeroing).

Proposal

develop code to estimate stack size increase from an inline
develop code to estimate when an explicit tail call will end up using a helper
update inliner to unmark explicit tall call sites in inlinees called from
non tail-call sites
update inliner with new heuristic based on above estimates and unblock
inlines of methods at tail call sites and methods with explicit tail calls
validate new behavior on suitable test suite
explore possibility of using modified prologs to avoid or limit the use of tail call helpers

category:cq
theme:tail-call
skill-level:expert
cost:medium
impact:medium

The text was updated successfully, but these errors were encountered:

AndyAyersMS · 2019-03-21T00:18:12Z

I should probably point out that the math above is a bit handwavy, as the size of a frame is dependent on a lot of factors. The post inline size S(A+B) should be close to A+B but may be more or less. One generally hopes it will be less, as frame size is often a good indicator of prolog cost.

Because the jit inliner is scriptable, the idea I have in mind for modelling the stack impact of inlining B into A is simply to conduct a huge number of actual measurements of the impact, similar to what I did a while back when I was building a model of the code size impact of inlines. We can then see how easy it is to estimate the stack frame impact given the facts available to the inliner (more likely, some biased estimator, so we're more likely to overestimate than underestimate).

Using actual measurements also has the benefit that it captures various jit quirks and adapt to different target ISAs and ABIs. Ideally the heuristic is largely independent of these so we get similar behavior everywhere, but we'll have to see what the numbers say.

AndyAyersMS · 2019-03-21T18:56:15Z

A couple more notes

we don't want to make recursive explicit tail calls into inline candidates, as these will usually turn into loops.
if B has other call sites that are inline candidates, the stack size of B we are interested in is really the size we end up with after all those candidates (and any transitive candidates) are inlined. But there's no way for us to know this at the time we have to decide about inlining B (the jit inlines "top-down"). So one thought is to restrict the inline policy for ancestors of B to ensure that the initial size estimate for B remains plausible. That being said, if B is not marked as aggressive inline it is unlikely to make a lot of calls, and if it is, the user is implicitly telling us stack frame size increases are ok, so perhaps we don't need to worry about this in practice.

AndyAyersMS · 2019-03-22T23:05:08Z

Preview of the jit changes needed, without any kind of heuristic: InlineExplicitTailCalls.

Sample diff on one of the methods from dotnet/fsharp#6329:

;;; before
; Assembly listing for method Logger:Log(ubyte,ref):this

G_M33878_IG01:
       nop      

G_M33878_IG02:
       mov      rcx, gword ptr [rcx+8]
       mov      rdx, r8
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rcx], ecx

G_M33878_IG03:
       rex.jmp  rax     // tail call List.Add

;;; after
; Assembly listing for method Logger:Log(ubyte,ref):this

G_M33879_IG01:
       sub      rsp, 40
       nop      

G_M33879_IG02:
       mov      rcx, gword ptr [rcx+8]
       inc      dword ptr [rcx+20]        // inline List.Add
       mov      rdx, gword ptr [rcx+8]
       mov      eax, dword ptr [rcx+16]
       cmp      dword ptr [rdx+8], eax
       jbe      SHORT G_M33879_IG04       // capacity check
       lea      r9d, [rax+1]              // "fast path" add
       mov      dword ptr [rcx+16], r9d
       mov      rcx, rdx
       mov      edx, eax
       call     CORINFO_HELP_ARRADDR_ST
       nop      

G_M33879_IG03:
       add      rsp, 40
       ret      

G_M33879_IG04:
       mov      rdx, r8
       mov      rax, 0xD1FFAB1E

G_M33879_IG05:
       add      rsp, 40
       rex.jmp  rax                       // tail call List.AddWithResize

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

AndyAyersMS mentioned this issue Feb 18, 2020

The frustrating state of tails calls in .NET #2191

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

AndyAyersMS mentioned this issue Jul 9, 2021

Attribute to prevent fsc inlining but allow JIT inlining fsharp/fslang-suggestions#838

Closed

5 tasks

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

General inlining of tail calls #12304

General inlining of tail calls #12304

AndyAyersMS commented Mar 20, 2019 •

edited by BruceForstall

Loading

AndyAyersMS commented Mar 21, 2019

AndyAyersMS commented Mar 21, 2019

AndyAyersMS commented Mar 22, 2019

General inlining of tail calls #12304

General inlining of tail calls #12304

Comments

AndyAyersMS commented Mar 20, 2019 • edited by BruceForstall Loading

Current behavior of the jit

Inlining methods that do not have explicit tail call sites

Case 0: Inlining when invoked at an explicit tail call site

Inlining methods that have explicit tail call sites

Case 1: Inlining when invoked from non-tail call site

Case2: Inlining when invoked from implicit tail call site

Case3: Inlining when invoked from explicit call site

Use of the tail call helper

Notes

Proposal

AndyAyersMS commented Mar 21, 2019

AndyAyersMS commented Mar 21, 2019

AndyAyersMS commented Mar 22, 2019

AndyAyersMS commented Mar 20, 2019 •

edited by BruceForstall

Loading