General inlining of tail calls #23370
This note explores expanding jit inlining to handle cases of explicit tail calls.
Current behavior of the jit
We will use the following notation to describe a chain of calls and call sites:
A call site can be:
Case 0: Inlining at explicit tail call sites
Before stack size is
After stack size is
worst case, when the jit can't optimize away any stack usage after inlining,
For performance, the before picture may have required slow tail call, so we can
Heuristics that depend on estimating the stack size of
Inlining methods that make explicit tail calls
Case 1: method invoked from non-tail call site
Before stack usage is:
Stack consumption depends on inlining. Generally the jit won't be able to
Assuming worst case where inlining does not allow the jit to optimize away any
As noted before the jit will likely be unable to estimate
For perf, if the call to
Case 2: method invoked from implicit tail call site
Before stack usage depends on whether
We will use the former as it is smaller.
After stack usage (assuming
So worst-case stack impact is
So for stack usage, inlining seems ok if either
For perf, jit should only do this if call to
Case 3: method invoked from explicit call site
This dovetails with Case 0 and Case 2 above, and the reasoning is similar.
Use of the tail call helper
In cases where the jit determines a tail call from
I should probably point out that the math above is a bit hand-wavy, as the size of a frame depends on many factors. The post-inline frame size S(A+B) should be close to A+B but may be more or less. One generally hopes it will be less, as frame size is often a good indicator of prolog cost.
Because the jit inliner is scriptable, the idea I have in mind for modelling the stack impact of inlining B into A is simply to conduct a huge number of actual measurements of the impact, similar to what I did a while back when I was building a model of the code size impact of inlines. We can then see how easy it is to estimate the stack frame impact given the facts available to the inliner (more likely we'd end up with a biased estimator, so that we tend to overestimate rather than underestimate).
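To make the "biased estimator" idea concrete, here is a minimal Python sketch. It is not the jit's heuristic and it uses no real measurements: the sample data is synthetic and invented purely for illustration, and the function names (`make_synthetic_samples`, `fit_biased_estimator`) are hypothetical. It assumes the only fact available to the inliner is the sum of the two standalone frame sizes, fits an ordinary least-squares line to that, and then shifts the line upward until it overestimates roughly the requested fraction of samples.

```python
import random
import statistics


def make_synthetic_samples(count=500, seed=0):
    """Invented data, NOT real jit measurements.

    Each sample is (frame_a, frame_b, frame_after_inline) in bytes.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        frame_a = 8 * rng.randint(1, 32)
        frame_b = 8 * rng.randint(1, 32)
        # Assumption: the merged frame is near A+B, with noise in both directions.
        after = max(0, round(0.9 * (frame_a + frame_b) + rng.gauss(0, 16)))
        samples.append((frame_a, frame_b, after))
    return samples


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_biased_estimator(samples, coverage=0.95):
    """Return predict(frame_a, frame_b) that overestimates ~coverage of samples."""
    xs = [a + b for a, b, _ in samples]
    ys = [after for _, _, after in samples]
    slope, intercept = fit_line(xs, ys)
    residuals = sorted(y - (slope * x + intercept) for x, y in zip(xs, ys))
    # Shift the fitted line up by the residual at the coverage quantile,
    # so errors land on the "too big" side for most samples.
    index = min(len(residuals) - 1, int(coverage * len(residuals)))
    margin = max(0.0, residuals[index])

    def predict(frame_a, frame_b):
        return slope * (frame_a + frame_b) + intercept + margin

    return predict


if __name__ == "__main__":
    data = make_synthetic_samples()
    predict = fit_biased_estimator(data)
    covered = sum(predict(a, b) >= s for a, b, s in data) / len(data)
    print(f"fraction of samples overestimated: {covered:.3f}")
```

A real model would presumably use more inputs than the two frame sizes, and would be fitted per target if the measurements turn out to differ across ISAs and ABIs.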
Using actual measurements also has the benefit that it captures various jit quirks and adapts to different target ISAs and ABIs. Ideally the heuristic is largely independent of these so we get similar behavior everywhere, but we'll have to see what the numbers say.
A couple more notes
Preview of the jit changes needed, without any kind of heuristic: InlineExplicitTailCalls.
Sample diff on one of the methods from Microsoft/visualfsharp#6329:
;;; before

; Assembly listing for method Logger:Log(ubyte,ref):this

G_M33878_IG01:
       nop

G_M33878_IG02:
       mov      rcx, gword ptr [rcx+8]
       mov      rdx, r8
       mov      rax, 0xD1FFAB1E
       cmp      dword ptr [rcx], ecx

G_M33878_IG03:
       rex.jmp  rax                          // tail call List.Add

;;; after

; Assembly listing for method Logger:Log(ubyte,ref):this

G_M33879_IG01:
       sub      rsp, 40
       nop

G_M33879_IG02:
       mov      rcx, gword ptr [rcx+8]
       inc      dword ptr [rcx+20]           // inline List.Add
       mov      rdx, gword ptr [rcx+8]
       mov      eax, dword ptr [rcx+16]
       cmp      dword ptr [rdx+8], eax
       jbe      SHORT G_M33879_IG04          // capacity check
       lea      r9d, [rax+1]                 // "fast path" add
       mov      dword ptr [rcx+16], r9d
       mov      rcx, rdx
       mov      edx, eax
       call     CORINFO_HELP_ARRADDR_ST
       nop

G_M33879_IG03:
       add      rsp, 40
       ret

G_M33879_IG04:
       mov      rdx, r8
       mov      rax, 0xD1FFAB1E

G_M33879_IG05:
       add      rsp, 40
       rex.jmp  rax                          // tail call List.AddWithResize
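For readers who don't have the shape of `List.Add` in their head, the "after" listing may be easier to follow next to a rough model of the control flow. The sketch below is Python, not the actual C# source. `ListModel` and `Logger` are hypothetical stand-ins, the mapping of field offsets to fields in the comments is my reading of the listing, and the growth policy in `_add_with_resize` is only illustrative. The point is that the version bump, capacity check and fast-path store are what get inlined, while the resize path remains a call in tail position.

```python
class ListModel:
    """Toy stand-in for List<T>: backing array, element count, version stamp."""

    def __init__(self, capacity=4):
        self._items = [None] * capacity
        self._size = 0
        self._version = 0

    def add(self, item):
        self._version += 1                  # inc dword ptr [rcx+20]
        items = self._items                 # mov rdx, gword ptr [rcx+8]
        size = self._size                   # mov eax, dword ptr [rcx+16]
        if size < len(items):               # cmp / jbe: capacity check
            self._size = size + 1           # "fast path" add
            items[size] = item              # the CORINFO_HELP_ARRADDR_ST store
        else:
            self._add_with_resize(item)     # remains a tail call after inlining

    def _add_with_resize(self, item):
        # Illustrative growth policy (assumption): at least 4, else double.
        size = self._size
        new_capacity = max(4, 2 * len(self._items))
        self._items = self._items + [None] * (new_capacity - len(self._items))
        self._size = size + 1
        self._items[size] = item


class Logger:
    """Hypothetical caller shaped like Logger:Log(ubyte,ref):this."""

    def __init__(self):
        self._entries = ListModel()

    def log(self, level, message):
        # In the "before" listing this is nothing but a tail call to List.Add.
        self._entries.add(message)


if __name__ == "__main__":
    logger = Logger()
    for i in range(6):
        logger.log(0, f"message {i}")
    print(logger._entries._size, len(logger._entries._items))
```

Seen this way, the cost of the inline is visible in the listing: the caller goes from frameless (a bare jump) to needing a 40-byte frame for the helper call on the fast path.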