
Optimize jump stubs on arm64 #62302

Open
Tracked by #77010
EgorBo opened this issue Dec 2, 2021 · 19 comments

EgorBo (Member) commented Dec 2, 2021

On x64 we emit the following code for jump stubs:

mov rax, 123456789abcdef0h
jmp rax

as I understand from this comment in the runtime source:

// mov rax, 123456789abcdef0h 48 b8 xx xx xx xx xx xx xx xx
// jmp rax ff e0

while on arm64 we do a PC-relative memory load from a data slot placed right after the stub:

ldr x16, [pc, #8]
br  x16
[target address]

// +0: ldr x16, [pc, #8]
// +4: br x16
// +8: [target address]

I'm just wondering whether it wouldn't be faster to do what x64 does and emit the constant directly, even if it takes 4 instructions to populate it...

mov     x8, #9044
movk    x8, #9268, lsl #16
movk    x8, #61203, lsl #32
movk    x8, #43981, lsl #48
br      x8

I'm asking because I have a feeling it could be a bottleneck, if I'm reading the TE traces (Plaintext benchmark) correctly:
[screenshot: TE Plaintext trace]

cc @dotnet/jit-contrib @jkotas

EgorBo added the tenet-performance label Dec 2, 2021
dotnet-issue-labeler bot added the area-VM-coreclr and untriaged labels Dec 2, 2021
EgorBo (Member, Author) commented Dec 2, 2021

Also, doesn't arm64 allow us to do just br [pc, #8]? (It doesn't — br only accepts a register operand.)

echesakov (Contributor) commented:

If the distance between pCode and the target is smaller than 128MB, I guess we could do

b <pcRelDistance>
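
For illustration, here's a minimal C++ sketch (not CoreCLR's actual allocator logic) of the reachability check and encoding for such an unconditional b, whose signed 26-bit immediate is scaled by 4 and therefore covers ±128MB:

#include <cstdint>

// Can a single ARM64 "b <imm26>" reach `target` from `pc`?
// The immediate is a signed 26-bit word offset, i.e. a byte range of
// [-0x8000000, +0x7FFFFFC] with 4-byte alignment.
bool FitsInDirectBranch(uintptr_t pc, uintptr_t target)
{
    intptr_t offset = (intptr_t)(target - pc);
    return (offset % 4 == 0) &&
           offset >= -(intptr_t)0x8000000 &&
           offset <= (intptr_t)0x7FFFFFC;
}

// Encode the branch when it fits: opcode 0b000101 in the top six bits,
// imm26 = byte offset / 4 in the low 26 bits.
uint32_t EncodeDirectBranch(intptr_t offset)
{
    return 0x14000000u | (((uint32_t)offset >> 2) & 0x03FFFFFFu);
}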

BruceForstall (Member) commented:

Wouldn't it take 4 instructions to populate an address constant in the worst case?

kunalspathak (Member) commented:

We are not good at hoisting address-constant materialization, so if this is part of a loop, we might regress.

jkotas (Member) commented Dec 2, 2021

Note that we require the target to be atomically patchable without suspending execution. That makes it close to impossible to split this into multiple instructions.

I agree that the whole scheme for how we deal with the precodes and back patching is likely very suboptimal on non-x86 architectures (and maybe even on current x86). I think the more optimal path may look like this:

  • Loading of the indirection is inlined into the JITed code. The direct call to an indirect jump that we have there today is replaced with an indirect call, which should be an improvement. The JITed code will look like this:
...
    ldr    x16, [pc + 0x1230] // Indirection cell that lives in the local data section
    blr    x16
...
  • Keep track of the indirection cells that live in the local data sections. If tiering comes up with a new copy of a method, we need to patch them all to point to the new method (see the sketch after this list).

  • We can also consider emitting a direct call to the actual method for cases where we know the target method is not going to change because it has reached the final tier. Such direct calls would have to be treated as if the method were inlined by profiler ReJIT.
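
As an illustration of both points, here's a minimal C++ sketch (the type and function names are hypothetical, not CoreCLR's): the current stub can be retargeted with one aligned 64-bit store to its data slot, which arm64 performs atomically, whereas a mov/movk/movk/movk sequence would have to be rewritten instruction by instruction and could be observed half-patched; the second part shows the per-method bookkeeping for repointing tracked indirection cells when tiering produces a new body.

#include <atomic>
#include <cstdint>
#include <vector>

// Layout of the current arm64 jump stub. The two instructions never change;
// only the naturally aligned 8-byte slot at +8 does.
struct JumpStub
{
    uint32_t ldr_x16;                 // +0: ldr x16, [pc, #8]
    uint32_t br_x16;                  // +4: br  x16
    std::atomic<uintptr_t> target;    // +8: [target address]
};

// Retargeting is a single atomic store: other threads observe either the old
// or the new target, never a mix.
void RetargetStub(JumpStub* stub, uintptr_t newTarget)
{
    stub->target.store(newTarget, std::memory_order_release);
}

// Bookkeeping for the proposed scheme: remember every indirection cell emitted
// into JITed code for a method and repoint all of them when tiering produces
// a new body.
struct MethodIndirectionCells
{
    std::vector<std::atomic<uintptr_t>*> cells;

    void OnNewTierBody(uintptr_t newEntryPoint)
    {
        for (std::atomic<uintptr_t>* cell : cells)
            cell->store(newEntryPoint, std::memory_order_release);
    }
};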

jkotas (Member) commented Dec 2, 2021

I'm just wondering if it's not faster to do what x64 does and emit the const directly even if it takes 3 instructions

My bet would be that the bottleneck is caused more by the call + indirect jump combination than by the memory load. Patterns like that used to cause pipeline stalls on x86 in the past, and I think it is likely that they are a problem for arm64 too.

EgorBo (Member, Author) commented Dec 3, 2021

@jkotas thanks for the detailed explanation! 👍

Wouldn't it take 4 instructions to populate an address constant in the worst case?

@BruceForstall Right, I was wondering whether it's still faster despite that, because otherwise I'd expect native compilers to always prefer a memory load from the data section over 4 movs (e.g. https://godbolt.org/z/cWYsTq6P6). I played locally with the llvm-mca tool targeting -mcpu=apple-a13:

[screenshot: llvm-mca output for the two sequences]

From what I read, the 4-mov sequence takes about 3x fewer cycles.

EgorBo closed this as completed Dec 3, 2021
jkotas (Member) commented Dec 3, 2021

I think we should look into optimizing the jump stubs and friends for arm64. I agree with your initial observation that there is likely a bottleneck.

jkotas reopened this Dec 3, 2021
EgorBo (Member, Author) commented Dec 6, 2021

I guess we are also more likely to hit a jump stub on arm64; quoting jump-stubs.md:

The need for jump stubs only arises when jumps of greater than 2GB range (on x64; 128MB on arm64) are required

so even pretty simple TE benchmarks hit that.

EgorBo (Member, Author) commented Jan 16, 2022

Just noticed that a completely empty void Main() {} program (in TieredCompilation=0 mode) emits just one jump-stub on x64 (for ProcessCLRException) and 35 on arm64.

EgorBo (Member, Author) commented Jan 16, 2022

The following methods request a jump-stub on arm64 during compilation of a completely empty program in TC=0:

getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::StelemRef, sig=void *(class System.Array,int32,object)
getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::StelemRef, sig=void *(class System.Array,int32,object)
getNextJumpStub from System.AppContext::Setup, sig=void *(char**,char**,int32)
getNextJumpStub from System.Collections.Generic.Dictionary`2[__Canon,__Canon]::.ctor, sig=instance void *(int32,class System.Collections.Generic.IEqualityComparer`1<!0>)
getNextJumpStub from System.Collections.Generic.Dictionary`2[__Canon,__Canon]::Initialize, sig=instance int32 *(int32)
getNextJumpStub from System.Collections.Generic.Dictionary`2[__Canon,__Canon]::Initialize, sig=instance int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::GetPrime, sig=int32 *(int32)
getNextJumpStub from System.Collections.HashHelpers::.cctor, sig=void *()
getNextJumpStub from System.Collections.HashHelpers::.cctor, sig=void *()
getNextJumpStub from System.Collections.Generic.EqualityComparer`1[__Canon]::get_Default, sig=class System.Collections.Generic.EqualityComparer`1<!0> *()
getNextJumpStub from System.Collections.Generic.EqualityComparer`1[__Canon]::.cctor, sig=void *()
getNextJumpStub from System.Collections.Generic.EqualityComparer`1[__Canon]::.cctor, sig=void *()
getNextJumpStub from System.Collections.Generic.ComparerHelpers::CreateDefaultEqualityComparer, sig=object *(class System.Type)
getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::ChkCastAny, sig=object *(void*,object)
getNextJumpStub from System.Environment::.cctor, sig=void *()
getNextJumpStub from System.Environment::.cctor, sig=void *()
getNextJumpStub from System.Threading.AutoreleasePool::CreateAutoreleasePool, sig=void *()
getNextJumpStub from System.StartupHookProvider::ProcessStartupHooks, sig=void *()
getNextJumpStub from System.StartupHookProvider::ProcessStartupHooks, sig=void *()
getNextJumpStub from System.AppContext::TryGetSwitch, sig=bool *(string,bool&)
getNextJumpStub from System.AppContext::TryGetSwitch, sig=bool *(string,bool&)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::get_EventListenersLock, sig=object *()
getNextJumpStub from System.Runtime.InteropServices.Marshal::GetFunctionPointerForDelegate, sig=native int *(class System.Delegate)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::AddEventSource, sig=void *(class System.Diagnostics.Tracing.EventSource)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::AddEventSource, sig=void *(class System.Diagnostics.Tracing.EventSource)
getNextJumpStub from System.Diagnostics.Tracing.EventListener::AddEventSource, sig=void *(class System.Diagnostics.Tracing.EventSource)
getNextJumpStub from Program::Main, sig=int32 *(string[])
getNextJumpStub from System.Runtime.Loader.AssemblyLoadContext::OnProcessExit, sig=void *()
getNextJumpStub from System.Diagnostics.Tracing.EventListener::DisposeOnShutdown, sig=void *()
getNextJumpStub from System.Runtime.CompilerServices.CastHelpers::StelemRef_Helper_NoCacheLookup, sig=void *(object&,void*,object)
getNextJumpStub from System.Threading.Monitor::IsEntered, sig=bool *(object)
getNextJumpStub from System.GC::SuppressFinalize, sig=void *(object)

None of them do that on x64.

EgorBo (Member, Author) commented Jan 16, 2022

Apparently all FCalls use jump-stubs, e.g.:

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System;

public class Program
{
    static void Main() => CallCos(3.14);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static double CallCos(double d) => Math.Cos(d);
}

        00000000          stp     fp, lr, [sp,#-16]!
        00000000          mov     fp, sp
        00000000          bl      System.Math:Cos(double):double  ;; <--- jump stub
        00000000          ldp     fp, lr, [sp],#16
        00000000          ret     lr

This explains why some microbenchmarks are slow: almost all Math.* functions basically go through a double call (the call plus the jump stub).

EgorBo (Member, Author) commented Jan 16, 2022

@jakobbotsch suggested changing these constants

static const int32_t CoreClrLibrarySize = 100 * 1024 * 1024;
// This constant represent the max size of the virtual memory that this allocator
// will try to reserve during initialization. We want all JIT-ed code and the
// entire libcoreclr to be located in a 2GB range.
static const int32_t MaxExecutableMemorySize = 0x7FFF0000;
to 10MB and 128MB, and it helped managed code reach FCalls via relocs 😮
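
For reference, the local experiment presumably amounted to something like this (values taken from the comment above; this is not the committed fix):

// Local experiment only: shrink both reservations so that JITed code and
// libcoreclr end up within the arm64 ±128MB pc-relative branch range,
// letting FCalls be reached via a reloc instead of a jump stub.
static const int32_t CoreClrLibrarySize = 10 * 1024 * 1024;        // was 100 * 1024 * 1024
static const int32_t MaxExecutableMemorySize = 128 * 1024 * 1024;  // was 0x7FFF0000 (~2GB)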

EgorBo (Member, Author) commented Jan 16, 2022

using BenchmarkDotNet.Attributes; 
using BenchmarkDotNet.Running; 
 
public class Program 
{ 
    static void Main(string[] args) => 
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args); 
 
    [Benchmark] 
    [Arguments(3.14)] 
    public double Test(double d) => Math.Cos(d) * Math.Sin(d) * Math.Tan(d);  // 3 InternalCalls
}
|  Method |        Job |               Toolchain |    d |      Mean |     Error |    StdDev | Ratio |
|-------- |----------- |------------------------ |----- |----------:|----------:|----------:|------:|
|    Test | Job-UWEEFQ |      /Core_Root/corerun | 3.14 |  9.884 ns | 0.0076 ns | 0.0071 ns |  1.00 |
|    Test | Job-HATVTO | /Core_Root_base/corerun | 3.14 | 28.235 ns | 0.1235 ns | 0.1155 ns |  2.86 |

😮

EgorBo (Member, Author) commented Jul 10, 2022

#70707 improved perf here, mainly because we used to use 128MB as the step when probing memory around coreclr to reserve. That step didn't make sense for arm64 and was decreased to 4MB, resulting in more chances to successfully reserve memory near coreclr.
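
To illustrate why the step size matters, here's a simplified Linux-only C++ sketch of probing for a reservation near libcoreclr (this is not the actual CoreCLR executable-memory allocator; MAP_FIXED_NOREPLACE needs Linux 4.17+):

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Try to reserve `size` bytes exactly at `addr`; nullptr if the range is taken.
static void* TryReserveAt(void* addr, size_t size)
{
    void* p = mmap(addr, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

// Probe outward from libcoreclr in `step` increments. With a 128MB step the
// first probes already fall outside the arm64 ±128MB branch range, while a
// 4MB step yields many candidate addresses inside it before we have to fall
// back to an unconstrained reservation (and therefore to jump stubs).
static void* ReserveNearCoreClr(uint8_t* coreclrBase, size_t size, size_t step)
{
    const size_t range = 128 * 1024 * 1024;
    for (size_t distance = step; distance + size <= range; distance += step)
    {
        if (void* p = TryReserveAt(coreclrBase + distance, size))
            return p;
        if (void* p = TryReserveAt(coreclrBase - distance, size))
            return p;
    }
    return nullptr;
}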

There are still ways to improve it, though; moving this to the Future milestone.

EgorBo removed this from the 7.0.0 milestone Jul 10, 2022
EgorBo added this to the Future milestone Jul 10, 2022
EgorBo (Member, Author) commented Oct 29, 2022

Moving this here; apparently it's also an issue on x64 for large apps:

I noticed that BingSNR (when I run it locally on Windows x64) emits 44k jump stubs (44k calls to allocJumpStubBlock). It happens because the app itself is quite big and its working set is 7-10GB when running locally for a benchmark (thus we likely have multiple loader heaps). I also noticed that the process of emitting jump stubs is quite hot; here is a flamegraph for a randomly selected time frame after start (50s-60s):

[screenshot: flamegraph of jump-stub allocation, 50s-60s after start]

Can we do anything about this? E.g., just like in #64148, emit 64-bit addresses to precode slots directly in methods.

jkotas (Member) commented Oct 30, 2022

I noticed that the process of emitting jump stubs is quite hot; here is a flamegraph for a randomly selected time frame after start (50s-60s)

Notice that the expensive path goes into HostCodeHeap. HostCodeHeap is used for DynamicMethods. Each dynamic method gets its own set of jump stubs that are all freed when the dynamic method is collected. That is how we ensure the jump stubs do not leak when dynamic methods are collected. It means the cost of the jump stubs is not amortized for dynamic methods, which I think is why they are expensive.

emit 64bit addresses to precode slots directly in methods

Yes, I think it would make sense for dynamic methods at least. (Alternatively, we may be able to come up with some sort of ref-counting scheme for jump stubs in dynamic methods so that their cost gets amortized.)
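
Purely as an illustration of the ref-counting idea (none of these types or helpers exist in CoreCLR), one possible shape is to share a stub per target across the dynamic methods in a code heap and free it only when the last user is collected:

#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical primitives that emit/free a raw ldr/br/target stub block.
void* EmitJumpStub(uintptr_t target);
void  FreeJumpStub(void* stub);

class RefCountedJumpStubCache
{
    struct Entry
    {
        void*    stub     = nullptr;
        uint32_t refCount = 0;
    };

    std::mutex lock_;
    std::unordered_map<uintptr_t, Entry> entries_; // keyed by branch target

public:
    // Called when a dynamic method needs a stub to reach `target`.
    void* Acquire(uintptr_t target)
    {
        std::lock_guard<std::mutex> guard(lock_);
        Entry& e = entries_[target];
        if (e.refCount++ == 0)
            e.stub = EmitJumpStub(target);
        return e.stub;
    }

    // Called when a dynamic method that used `target` is collected.
    void Release(uintptr_t target)
    {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = entries_.find(target);
        if (it != entries_.end() && --it->second.refCount == 0)
        {
            FreeJumpStub(it->second.stub);
            entries_.erase(it);
        }
    }
};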

EgorBo (Member, Author) commented Oct 30, 2022

HostCodeHeap is used for DynamicMethods

Ah, so for this specific project it's the same problem of seemingly redundant dynamic methods, which they might fix.

jkotas (Member) commented Oct 30, 2022

Right, there are two different concerns: (1) is the given usage of dynamic methods warranted, and (2) does the runtime behave efficiently for large projects with a lot of dynamic methods? It is still worth fixing (2) even if the answer to (1) is negative for BingSNR.
