
Optimize jump stubs on arm64 #62302

Open
Tracked by #77010
EgorBo opened this issue Dec 2, 2021 · 19 comments

EgorBo (Member) commented Dec 2, 2021

On x64 we emit the following code for jump stubs:

mov rax, 123456789abcdef0h
jmp rax

as I understand from this comment in the runtime source:

// mov rax, 123456789abcdef0h 48 b8 xx xx xx xx xx xx xx xx
// jmp rax ff e0

while on arm64 we do a PC-relative memory load from a data slot placed right after the stub:

ldr x16, [pc, #8]
br  x16
[target address]

// +0: ldr x16, [pc, #8]
// +4: br x16
// +8: [target address]

I'm just wondering whether it wouldn't be faster to do what x64 does and emit the constant directly, even if it takes 4 instructions to populate it...

mov     x8, #9044
movk    x8, #9268, lsl #16
movk    x8, #61203, lsl #32
movk    x8, #43981, lsl #48
br      x8

I'm asking because I have a feeling it could be a bottleneck, if I'm reading the TE traces (Plaintext benchmark) correctly:
[screenshot: TE Plaintext trace]

cc @dotnet/jit-contrib @jkotas

EgorBo added the tenet-performance label Dec 2, 2021
dotnet-issue-labeler bot added the area-VM-coreclr and untriaged labels Dec 2, 2021
EgorBo (Member, Author) commented Dec 2, 2021

Also, doesn't arm64 allow us to do just br [pc, #8]? (It doesn't — br only accepts a register operand.)

echesakov (Contributor) commented:

If the distance between pCode and the target is smaller than 128MB, I guess we could do

b <pcRelDistance>
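
For illustration, here's a minimal C++ sketch (not CoreCLR's actual allocator logic) of the reachability check and encoding for such an unconditional b, whose signed 26-bit immediate is scaled by 4 and therefore covers ±128MB:

#include <cstdint>

// Can a single ARM64 "b <imm26>" reach `target` from `pc`?
// The immediate is a signed 26-bit word offset, i.e. a byte range of
// [-0x8000000, +0x7FFFFFC] with 4-byte alignment.
bool FitsInDirectBranch(uintptr_t pc, uintptr_t target)
{
    intptr_t offset = (intptr_t)(target - pc);
    return (offset % 4 == 0) &&
           offset >= -(intptr_t)0x8000000 &&
           offset <= (intptr_t)0x7FFFFFC;
}

// Encode the branch when it fits: opcode 0b000101 in the top six bits,
// imm26 = byte offset / 4 in the low 26 bits.
uint32_t EncodeDirectBranch(intptr_t offset)
{
    return 0x14000000u | (((uint32_t)offset >> 2) & 0x03FFFFFFu);
}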

BruceForstall (Member) commented:

Wouldn't it take 4 instructions to populate an address constant in the worst case?

kunalspathak (Member) commented:

We are not good at hoisting address-constant materialization, so if this is part of a loop, we might regress.

jkotas (Member) commented Dec 2, 2021

Note that we require the target to be atomically patchable without suspending execution. That makes it close to impossible to split this into multiple instructions.

I agree that the whole scheme for how we deal with the precodes and back patching is likely very suboptimal on non-x86 architectures (and maybe even on current x86). I think the more optimal path may look like this:

  • Loading of the indirection is inlined into the JITed code. The direct call to an indirect jump that we have there today is replaced with an indirect call, which should be an improvement. The JITed code will look like this:
...
    ldr    x16, [pc + 0x1230] // Indirection cell that lives in the local data section
    blr    x16
...
  • Keep track of the indirection cells that live in the local data sections. If tiering comes up with a new copy of a method, we need to patch them all to point to the new method (see the sketch after this list).

  • We can also consider emitting a direct call to the actual method for cases where we know the target method is not going to change because it has reached the final tier. Such direct calls would have to be treated as if the method were inlined by profiler ReJIT.
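
As an illustration of both points, here's a minimal C++ sketch (the type and function names are hypothetical, not CoreCLR's): the current stub can be retargeted with one aligned 64-bit store to its data slot, which arm64 performs atomically, whereas a mov/movk/movk/movk sequence would have to be rewritten instruction by instruction and could be observed half-patched; the second part shows the per-method bookkeeping for repointing tracked indirection cells when tiering produces a new body.

#include <atomic>
#include <cstdint>
#include <vector>

// Layout of the current arm64 jump stub. The two instructions never change;
// only the naturally aligned 8-byte slot at +8 does.
struct JumpStub
{
    uint32_t ldr_x16;                 // +0: ldr x16, [pc, #8]
    uint32_t br_x16;                  // +4: br  x16
    std::atomic<uintptr_t> target;    // +8: [target address]
};

// Retargeting is a single atomic store: other threads observe either the old
// or the new target, never a mix.
void RetargetStub(JumpStub* stub, uintptr_t newTarget)
{
    stub->target.store(newTarget, std::memory_order_release);
}

// Bookkeeping for the proposed scheme: remember every indirection cell emitted
// into JITed code for a method and repoint all of them when tiering produces
// a new body.
struct MethodIndirectionCells
{
    std::vector<std::atomic<uintptr_t>*> cells;

    void OnNewTierBody(uintptr_t newEntryPoint)
    {
        for (std::atomic<uintptr_t>* cell : cells)
            cell->store(newEntryPoint, std::memory_order_release);
    }
};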

jkotas (Member) commented Dec 2, 2021

I'm just wondering if it's not faster to do what x64 does and emit the const directly even if it takes 3 instructions

My bet would be that the bottleneck is caused more by the call + indirect jump combination than by the memory load. Patterns like that used to cause pipeline stalls on x86 in the past, and I think it is likely that they are a problem for arm64 too.

EgorBo (Member, Author) commented Dec 3, 2021

@jkotas thanks for the detailed explanation! 👍

Wouldn't it take 4 instructions to populate an address constant in the worst case?

@BruceForstall Right, I was wondering whether it's still faster despite that, because otherwise I'd expect native compilers to always prefer a memory load from the data section over 4 movs (e.g. https://godbolt.org/z/cWYsTq6P6). I played locally with the llvm-mca tool targeting -mcpu=apple-a13:

[screenshot: llvm-mca output for the two sequences]

From what I read, the 4-mov sequence takes about 3x fewer cycles.

EgorBo closed this as completed Dec 3, 2021
jkotas (Member) commented Dec 3, 2021

I think we should look into optimizing the jump stubs and friends for arm64. I agree with your initial observation that there is likely a bottleneck.

jkotas reopened this Dec 3, 2021
EgorBo (Member, Author) commented Dec 6, 2021

I guess we are also more likely to hit a jump stub on arm64; quoting jump-stubs.md:

The need for jump stubs only arises when jumps of greater than 2GB range (on x64; 128MB on arm64) are required

so even pretty simple TE benchmarks hit that.

EgorBo (Member, Author) commented Jan 16, 2022

Just noticed that a completely empty void Main() {} program (in TieredCompilation=0 mode) emits just one jump-stub on x64 (for ProcessCLRException) and 35 on arm64.

EgorBo (Member, Author) commented Jan 16, 2022

The following methods request a jump-stub on arm64 during compilation of a completely empty program in TC=0:

getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::StelemRef, sig=void *(class System.Array,int32,object)
getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::StelemRef, sig=void *(class System.Array,int32,object)
getNextJumpStub from System.AppContext::Setup, sig=void *(char**,char**,int32)
getNextJumpStub from System.Collections.Generic.Dictionary`2[__Canon,__Canon]::.ctor, sig=instance void *(int32,class System.Collections.Generic.IEqualityComparer`1<!0>)
getNextJumpStub from System.Collections.Generic.Dictionary`2[__Canon,__Canon]::Initialize, sig=instance int32 *(int32)
getNextJumpStub from System.Collections.Generic.Dictionary`2[__Canon,__Canon]::Initialize, sig=instance int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::.cctor, sig=void *()
getNextJumpStub from System.Collections.HashHelpers::.cctor, sig=void *()
getNextJumpStub from System.Collections.Generic.EqualityComparer`1[__Canon]::get_Default, sig=class System.Collections.Generic.EqualityComparer`1<!0> *()
getNextJumpStub from System.Collections.Generic.EqualityComparer`1[__Canon]::.cctor, sig=void *()
getNextJumpStub from System.Collections.Generic.EqualityComparer`1[__Canon]::.cctor, sig=void *()
getNextJumpStub from System.Collections.Generic.ComparerHelpers::CreateDefaultEqualityComparer, sig=object *(class System.Type)
getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::ChkCastAny, sig=object *(void*,object)
getNextJumpStub from System.Environment::.cctor, sig=void *()
getNextJumpStub from System.Environment::.cctor, sig=void *()
getNextJumpStub from System.Threading.AutoreleasePool::CreateAutoreleasePool, sig=void *()
getNextJumpStub from System.StartupHookProvider::ProcessStartupHooks, sig=void *()
getNextJumpStub from System.StartupHookProvider::ProcessStartupHooks, sig=void *()
getNextJumpStub from System.AppContext::TryGetSwitch, sig=bool *(string,bool&)
getNextJumpStub from System.AppContext::TryGetSwitch, sig=bool *(string,bool&)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::get_EventListenersLock, sig=object *()
getNextJumpStub from System.Runtime.InteropServices.Marshal::GetFunctionPointerForDelegate, sig=native int *(class System.Delegate)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::AddEventSource, sig=void *(class System.Diagnostics.Tracing.EventSource)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::AddEventSource, sig=void *(class System.Diagnostics.Tracing.EventSource)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::AddEventSource, sig=void *(class System.Diagnostics.Tracing.EventSource)
getNextJumpStub from Program::Main, sig=int32 *(string[])
getNextJumpStub from System.Runtime.Loader.AssemblyLoadContext::OnProcessExit, sig=void *()
getNextJumpStub from System.Diagnostics.Tracing.EventListener::DisposeOnShutdown, sig=void *()
getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::StelemRef_Helper_NoCacheLookup, sig=void *(object&,void*,object)
getNextJumpStub from System.Threading.Monitor::IsEntered, sig=bool *(object)
getNextJumpStub from System.GC::SuppressFinalize, sig=void *(object)

None of them do that on x64.

EgorBo (Member, Author) commented Jan 16, 2022

Apparently all FCalls use jump-stubs, e.g.:

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System;

public class Program
{
    static void Main() => CallCos(3.14);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static double CallCos(double d) => Math.Cos(d);
}

        00000000          stp     fp, lr, [sp,#-16]!
        00000000          mov     fp, sp
        00000000          bl      System.Math:Cos(double):double  ;; <--- jump stub
        00000000          ldp     fp, lr, [sp],#16
        00000000          ret     lr

This explains why some microbenchmarks are slow: almost all Math.* functions basically go through a double call (the call plus the jump stub).

EgorBo (Member, Author) commented Jan 16, 2022

@jakobbotsch suggested changing these constants

static const int32_t CoreClrLibrarySize = 100 * 1024 * 1024;
// This constant represent the max size of the virtual memory that this allocator
// will try to reserve during initialization. We want all JIT-ed code and the
// entire libcoreclr to be located in a 2GB range.
static const int32_t MaxExecutableMemorySize = 0x7FFF0000;
to 10MB and 128MB, and it helped managed code reach FCalls via relocs 😮
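
For reference, the local experiment presumably amounted to something like this (values taken from the comment above; this is not the committed fix):

// Local experiment only: shrink both reservations so that JITed code and
// libcoreclr end up within the arm64 ±128MB pc-relative branch range,
// letting FCalls be reached via a reloc instead of a jump stub.
static const int32_t CoreClrLibrarySize = 10 * 1024 * 1024;        // was 100 * 1024 * 1024
static const int32_t MaxExecutableMemorySize = 128 * 1024 * 1024;  // was 0x7FFF0000 (~2GB)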

EgorBo (Member, Author) commented Jan 16, 2022

using BenchmarkDotNet.Attributes; 
using BenchmarkDotNet.Running; 
 
public class Program 
{ 
    static void Main(string[] args) => 
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args); 
 
    [Benchmark] 
    [Arguments(3.14)] 
    public double Test(double d) => Math.Cos(d) * Math.Sin(d) * Math.Tan(d);  // 3 InternalCalls
}
|  Method |        Job |               Toolchain |    d |      Mean |     Error |    StdDev | Ratio |
|-------- |----------- |------------------------ |----- |----------:|----------:|----------:|------:|
|    Test | Job-UWEEFQ |      /Core_Root/corerun | 3.14 |  9.884 ns | 0.0076 ns | 0.0071 ns |  1.00 |
|    Test | Job-HATVTO | /Core_Root_base/corerun | 3.14 | 28.235 ns | 0.1235 ns | 0.1155 ns |  2.86 |

😮

EgorBo (Member, Author) commented Jul 10, 2022

#70707 improved perf here, mainly because we used to use 128MB as the step when probing memory around coreclr to reserve. That step didn't make sense for arm64 and was decreased to 4MB, resulting in more chances to successfully reserve memory near coreclr.
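
To illustrate why the step size matters, here's a simplified Linux-only C++ sketch of probing for a reservation near libcoreclr (this is not the actual CoreCLR executable-memory allocator; MAP_FIXED_NOREPLACE needs Linux 4.17+):

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Try to reserve `size` bytes exactly at `addr`; nullptr if the range is taken.
static void* TryReserveAt(void* addr, size_t size)
{
    void* p = mmap(addr, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

// Probe outward from libcoreclr in `step` increments. With a 128MB step the
// first probes already fall outside the arm64 ±128MB branch range, while a
// 4MB step yields many candidate addresses inside it before we have to fall
// back to an unconstrained reservation (and therefore to jump stubs).
static void* ReserveNearCoreClr(uint8_t* coreclrBase, size_t size, size_t step)
{
    const size_t range = 128 * 1024 * 1024;
    for (size_t distance = step; distance + size <= range; distance += step)
    {
        if (void* p = TryReserveAt(coreclrBase + distance, size))
            return p;
        if (void* p = TryReserveAt(coreclrBase - distance, size))
            return p;
    }
    return nullptr;
}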

There are still ways to improve it, though; moving this to the Future milestone.

EgorBo removed this from the 7.0.0 milestone Jul 10, 2022
EgorBo added this to the Future milestone Jul 10, 2022
EgorBo (Member, Author) commented Oct 29, 2022

Moving this here; apparently it's also an issue on x64 for large apps:

I noticed that BingSNR (when I run it locally on Windows x64) emits 44k jump stubs (44k calls to allocJumpStubBlock). It happens because the app itself is quite big and its working set is 7-10GB when running locally for a benchmark (thus we likely have multiple loader heaps). I also noticed that the process of emitting jump stubs is quite hot; here is a flamegraph for a randomly selected time frame after start (50s-60s):

[screenshot: flamegraph of jump-stub allocation, 50s-60s after start]

Can we do anything about this? E.g., just like in #64148, emit 64-bit addresses to precode slots directly in methods.

jkotas (Member) commented Oct 30, 2022

I noticed that the process of emitting jump stubs is quite hot; here is a flamegraph for a randomly selected time frame after start (50s-60s)

Notice that the expensive path goes into HostCodeHeap. HostCodeHeap is used for DynamicMethods. Each dynamic method gets its own set of jump stubs that are all freed when the dynamic method is collected. That is how we ensure the jump stubs do not leak when dynamic methods are collected. It means the cost of the jump stubs is not amortized for dynamic methods, which I think is why they are expensive.

emit 64bit addresses to precode slots directly in methods

Yes, I think it would make sense for dynamic methods at least. (Alternatively, we may be able to come up with some sort of ref-counting scheme for jump stubs in dynamic methods so that their cost gets amortized.)
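
Purely as an illustration of the ref-counting idea (none of these types or helpers exist in CoreCLR), one possible shape is to share a stub per target across the dynamic methods in a code heap and free it only when the last user is collected:

#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical primitives that emit/free a raw ldr/br/target stub block.
void* EmitJumpStub(uintptr_t target);
void  FreeJumpStub(void* stub);

class RefCountedJumpStubCache
{
    struct Entry
    {
        void*    stub     = nullptr;
        uint32_t refCount = 0;
    };

    std::mutex lock_;
    std::unordered_map<uintptr_t, Entry> entries_; // keyed by branch target

public:
    // Called when a dynamic method needs a stub to reach `target`.
    void* Acquire(uintptr_t target)
    {
        std::lock_guard<std::mutex> guard(lock_);
        Entry& e = entries_[target];
        if (e.refCount++ == 0)
            e.stub = EmitJumpStub(target);
        return e.stub;
    }

    // Called when a dynamic method that used `target` is collected.
    void Release(uintptr_t target)
    {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = entries_.find(target);
        if (it != entries_.end() && --it->second.refCount == 0)
        {
            FreeJumpStub(it->second.stub);
            entries_.erase(it);
        }
    }
};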

EgorBo (Member, Author) commented Oct 30, 2022

HostCodeHeap is used for DynamicMethods

Ah, so for this specific project it's the same problem of seemingly redundant dynamic methods, which they might fix.

jkotas (Member) commented Oct 30, 2022

Right, there are two different concerns: (1) is the given usage of dynamic methods warranted, and (2) does the runtime behave efficiently for large projects with a lot of dynamic methods? It is still worth fixing (2) even if the answer to (1) is negative for BingSNR.
