Runtime<->JIT reloc hinting mechanism needs work  #60712

@jakobbotsch

Description

In #60228 I made the JIT generate lea instructions for any handle that the runtime returns an IMAGE_REL_BASED_REL32 hint for. We saw some regressions due to this, e.g. #60626.

After investigating this further I have determined that the issue is similar to one we saw in #49549. We create a constant pointing to kernel32!GetStdHandle, and we try to use rip-relative addressing for this constant. The constant does not end up being reachable, which hits this code path in the runtime that permanently turns off rip-relative addressing for the remaining duration of the process.

The issue is not completely the same as #49549, however. For calls we always assume they are reachable with rip-relative addressing; for constant handles, we are using the getRelocTypeHint function to figure out if this is the case.

For this particular case getRelocTypeHint is returning IMAGE_REL_BASED_REL32 for an address that ends up not being reachable. The reason is that the runtime treats a 4GB range around coreclr.dll as the preferred range and returns IMAGE_REL_BASED_REL32 for any address within it. However, if jitted code is placed at the beginning of the range and a handle is at the end, the two can be almost 4GB apart, which is beyond the +-2GB reach of a 32-bit rip-relative displacement. In that case we hit the path above that turns off rip-relative addressing.
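
To make the mismatch concrete, here is a minimal C++ sketch, not the runtime's actual code. `fitsInRel32` models the x64 rule that a rip-relative operand is a signed 32-bit displacement from the address of the next instruction. `PreferredRange` is a hypothetical stand-in for the range check behind getRelocTypeHint, and the base address is made up.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// x64 rip-relative operands encode a signed 32-bit displacement that is
// added to the address of the *next* instruction.
constexpr bool fitsInRel32(std::uint64_t nextInstr, std::uint64_t target)
{
    // Unsigned subtraction wraps; reading the result as signed gives the delta.
    return static_cast<std::int64_t>(target - nextInstr) >= std::numeric_limits<std::int32_t>::min()
        && static_cast<std::int64_t>(target - nextInstr) <= std::numeric_limits<std::int32_t>::max();
}

// Hypothetical model of the preferred range: a 4 GB window starting at `base`.
struct PreferredRange
{
    std::uint64_t base;

    constexpr bool contains(std::uint64_t addr) const
    {
        // Wraps to a huge value when addr < base, so one compare is enough.
        return addr - base < 0x100000000ULL;
    }
};

void checkRangeVersusReach()
{
    constexpr PreferredRange range{0x00007FFD00000000ULL};       // made-up base
    constexpr std::uint64_t code   = range.base + 0x1000;        // jitted code near the start
    constexpr std::uint64_t handle = range.base + 0xF0000000ULL; // handle near the end

    // Both addresses pass the range check, so a range-only hint says REL32 ...
    assert(range.contains(code) && range.contains(handle));
    // ... but the displacement between them does not fit in 32 bits.
    assert(!fitsInRel32(code, handle));
    // A handle less than 2 GB away from the code is reachable.
    assert(fitsInRel32(code, range.base + 0x70000000ULL));
}
```

The point of the sketch is that the range check looks only at the target address, while reachability depends on the distance between the target and the code that references it.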

After speaking to folks it seems there are conflicting assumptions about what getRelocTypeHint can be used for on both the JIT side and the runtime side. For the runtime side, it is assumed that this function is called only for addresses in the current loader heap and for addresses in coreclr.dll. The runtime tries to allocate memory from loader heaps in a circular fashion, which means that under the above assumption we only end up turning off rip-relative addressing once we have pretty much run out of memory in the preferred range.

However, the JIT uses the check more generally, taking a REL32 hint to mean that the address will be within +-2GB of the code. My change exacerbated this, but it was already the case before the change. In particular, any indir is checked for rip-relative addressing using this function. Because of this it is quite simple to write a program that reliably turns off rip-relative addressing without actually allocating much of the preferred range, as the reproduction below shows.

Reproduction Steps

The following program reliably turns off rip-relative addressing permanently for the remaining duration of the process. The issue predates #60228: even with that PR reverted, this example hits the case. Note that the address could come from anywhere, e.g. it could be a pointer into static memory from a native image that would likely be in the preferred range as well. However, the VirtualAlloc with an explicit address makes the repro simple and reliable.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Threading;

public unsafe class Program
{
    public static void Main()
    {
        for (int i = 0; i < 100; i++)
        {
            Foo();
            if (i >= 35)
                Thread.Sleep(30);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Foo()
    {
        Volatile.Read(ref NearbyMemory[15]);
    }

    private static readonly byte* NearbyMemory = AllocUnreachableMemoryInPreferredRange();

    private static byte* AllocUnreachableMemoryInPreferredRange()
    {
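        // Start searching 0x90000000 bytes (more than 2 GB) past this method's code,
        // i.e. beyond rel32 reach but likely still inside the 4 GB preferred range.
        delegate*<byte*> codeAddr = &AllocUnreachableMemoryInPreferredRange;
        byte* start = (byte*)codeAddr + 0x90000000;

        // Probe page by page until VirtualAlloc accepts an explicit address.
        for (byte* addr = start; ; addr += 0x1000)
        {
            // 0x1000 | 0x2000 = MEM_COMMIT | MEM_RESERVE, 0x04 = PAGE_READWRITE
            IntPtr alloced = VirtualAlloc((IntPtr)addr, 0x1000, 0x1000 | 0x2000, 0x04);
            if (alloced != IntPtr.Zero)
                return (byte*)alloced;
        }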
    }

    [DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, nint dwSize, uint flAllocationType, uint flProtect);
}

Expected behavior

We should not be turning off rip-relative addressing globally unless we are actually running out of memory in the preferred range.

Actual behavior

We do turn it off. We see three JITs of Foo in the above:

; Assembly listing for method Program:Foo()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   do-not-enreg[] "OutgoingArgSpace"
;
; Lcl frame size = 32

G_M24659_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]
                                                ;; bbWeight=1    PerfScore 1.75
G_M24659_IG02:              ;; offset=000AH
       48B908B28E2AFD7F0000 mov      rcx, 0x7FFD2A8EB208
       BA03000000           mov      edx, 3
       E8320BA35F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       B90F000000           mov      ecx, 15
       4863C9               movsxd   rcx, ecx
       48030D53F32400       add      rcx, qword ptr [reloc classVar[0x2a8ecf18]]
       E8FEBBFEFF           call     System.Threading.Volatile:Read(byref):ubyte
       90                   nop
                                                ;; bbWeight=1    PerfScore 5.25
G_M24659_IG03:              ;; offset=0033H
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.75

; Total bytes of code 57, prolog size 10, PerfScore 14.45, instruction count 14, allocated bytes for code 57 (MethodHash=999a9fac) for method Program:Foo()
; ============================================================

; Assembly listing for method Program:Foo()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V01 tmp1         [V01    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;
; Lcl frame size = 0

G_M24659_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M24659_IG02:              ;; offset=0000H
       8B0500000000         mov      eax, dword ptr [(reloc 0x7ffdba69000f)]
                                                ;; bbWeight=1    PerfScore 2.00
G_M24659_IG03:              ;; offset=0006H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 7, prolog size 0, PerfScore 3.70, instruction count 2, allocated bytes for code 7 (MethodHash=999a9fac) for method Program:Foo()
; ============================================================

Hit jump stub overflow
; Assembly listing for method Program:Foo()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;# V00 OutArgs      [V00    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V01 tmp1         [V01    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;
; Lcl frame size = 0

G_M24659_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M24659_IG02:              ;; offset=0000H
       48B80F0069BAFD7F0000 mov      rax, 0x7FFDBA69000F
       3900                 cmp      dword ptr [rax], eax
                                                ;; bbWeight=1    PerfScore 3.25
G_M24659_IG03:              ;; offset=000CH
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 13, prolog size 0, PerfScore 5.55, instruction count 3, allocated bytes for code 13 (MethodHash=999a9fac) for method Program:Foo()
; ============================================================

"Hit jump stub overflow" is a simple printf I added to the runtime code that turns off rip-relative addressing.
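
The second and third listings can be read as a retry sequence: a rip-relative attempt, an overflow, then a recompile with an absolute address. The C++ sketch below models that sequence under stated assumptions. `RuntimeModel`, `relocHint`, and `jit` are hypothetical names, the base address is made up, and the logic is a simplification of the behavior described in this issue, not the runtime's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum class RelocHint { Absolute, Rel32 };
enum class Encoding  { Absolute, RipRelative };

// Hypothetical model of the runtime side of the hinting protocol.
struct RuntimeModel
{
    std::uint64_t rangeBase;  // start of an assumed 4 GB preferred range
    bool allowRel32;          // process-wide; never re-enabled once cleared
    int  overflowCount;

    explicit RuntimeModel(std::uint64_t base)
        : rangeBase(base), allowRel32(true), overflowCount(0) {}

    // Stand-in for getRelocTypeHint: looks at the target only, not at the code address.
    RelocHint relocHint(std::uint64_t target) const
    {
        const bool inRange = target - rangeBase < 0x100000000ULL;
        return (allowRel32 && inRange) ? RelocHint::Rel32 : RelocHint::Absolute;
    }

    static bool fitsInRel32(std::uint64_t from, std::uint64_t to)
    {
        const std::int64_t delta = static_cast<std::int64_t>(to - from);
        return delta >= std::numeric_limits<std::int32_t>::min()
            && delta <= std::numeric_limits<std::int32_t>::max();
    }

    // Compile one method that references `target` and ends up at `codeAddr`.
    Encoding jit(std::uint64_t codeAddr, std::uint64_t target)
    {
        if (relocHint(target) == RelocHint::Rel32)
        {
            if (fitsInRel32(codeAddr, target))
                return Encoding::RipRelative;
            // Overflow: disable rel32 for the whole process and compile again.
            allowRel32 = false;
            ++overflowCount;
        }
        return Encoding::Absolute;
    }
};

void checkStickyFallback()
{
    RuntimeModel rt(0x00007FFD00000000ULL);
    const std::uint64_t code       = rt.rangeBase + 0x1000;
    const std::uint64_t nearHandle = rt.rangeBase + 0x2000;
    const std::uint64_t farHandle  = rt.rangeBase + 0x90000000ULL;

    assert(rt.jit(code, nearHandle) == Encoding::RipRelative);
    assert(rt.jit(code, farHandle) == Encoding::Absolute);   // overflow, then retry
    assert(rt.overflowCount == 1);
    assert(rt.jit(code, nearHandle) == Encoding::Absolute);  // the switch is sticky
}
```

The last assertion is the costly part: after one unreachable target, even trivially reachable targets get the longer absolute encoding.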

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

IMO, it would be ideal if the runtime could guess roughly where the jitted code will be located before the JIT has given it a size back and it has done the allocation. Then the reloc hint address range could be based on that.

Alternatively, as suggested by @jkotas, we could eagerly reserve memory before jitting and then back out of the unused memory.
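
One way to picture the reserve-then-back-out idea is the C++ sketch below. It is only an illustration under assumptions: `CodeHeapModel`, `reserve`, `commit`, and `reachableFromReservation` are hypothetical names, and a real code heap is considerably more involved. The point is that once the code address is known up front, the hint can be based on actual distance rather than on a fixed range.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

bool fitsInRel32(std::uint64_t from, std::uint64_t to)
{
    const std::int64_t delta = static_cast<std::int64_t>(to - from);
    return delta >= std::numeric_limits<std::int32_t>::min()
        && delta <= std::numeric_limits<std::int32_t>::max();
}

// Hypothetical bump allocator over a single code heap region.
class CodeHeapModel
{
public:
    CodeHeapModel(std::uint64_t base, std::uint64_t size)
        : m_next(base), m_end(base + size), m_reserved(0) {}

    // Reserve an upper bound before jitting so the code address is known early.
    // Returns 0 if the region cannot hold `maxBytes`.
    std::uint64_t reserve(std::uint64_t maxBytes)
    {
        if (m_end - m_next < maxBytes)
            return 0;
        m_reserved = maxBytes;
        return m_next;
    }

    // After jitting, keep only what was used and give the rest back.
    void commit(std::uint64_t usedBytes)
    {
        assert(usedBytes <= m_reserved);
        m_next += usedBytes;
        m_reserved = 0;
    }

    std::uint64_t next() const { return m_next; }

private:
    std::uint64_t m_next;
    std::uint64_t m_end;
    std::uint64_t m_reserved;
};

// The displacement changes monotonically with the instruction address, so a
// target reachable from both ends of the reservation is reachable from
// anywhere inside it.
bool reachableFromReservation(std::uint64_t start, std::uint64_t maxBytes, std::uint64_t target)
{
    return fitsInRel32(start, target) && fitsInRel32(start + maxBytes, target);
}

void checkReserveAndBackOut()
{
    const std::uint64_t base = 0x00007FFD00000000ULL; // made-up address
    CodeHeapModel heap(base, 0x10000);

    const std::uint64_t start = heap.reserve(0x1000);
    assert(start == base);
    assert(reachableFromReservation(start, 0x1000, base + 0x70000000ULL));
    assert(!reachableFromReservation(start, 0x1000, base + 0x90000000ULL));

    heap.commit(57);                       // the method turned out to need 57 bytes
    assert(heap.reserve(0x1000) == base + 57);
}
```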

Another short-term alternative might be to halve the preferred address range so that we get +-1 GB around coreclr.dll; this should mean that the entire region is reachable regardless of where the code ends up.
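
The arithmetic behind halving the range can be checked with a small C++ sketch. It is a simplified model: `Window` is a hypothetical half-open range around a single made-up center address, whereas the real range surrounds a module with nonzero size.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

constexpr bool fitsInRel32(std::uint64_t nextInstr, std::uint64_t target)
{
    return static_cast<std::int64_t>(target - nextInstr) >= std::numeric_limits<std::int32_t>::min()
        && static_cast<std::int64_t>(target - nextInstr) <= std::numeric_limits<std::int32_t>::max();
}

// Half-open window [center - radius, center + radius).
struct Window
{
    std::uint64_t center;
    std::uint64_t radius;

    constexpr std::uint64_t first() const { return center - radius; }
    constexpr std::uint64_t last()  const { return center + radius - 1; }
};

// The displacement is monotonic in both addresses, so the two extreme pairs
// bound every other pair. The "next instruction" address may sit one byte
// past the window when the instruction is the last thing in it.
constexpr bool everyPairReachable(const Window& w)
{
    return fitsInRel32(w.first(), w.last())
        && fitsInRel32(w.last() + 1, w.first());
}

void checkHalvedRange()
{
    constexpr std::uint64_t GB = 1ULL << 30;
    constexpr std::uint64_t center = 0x00007FFD80000000ULL; // made-up address

    // +-2 GB (4 GB total): code and data can be too far apart.
    assert(!everyPairReachable(Window{center, 2 * GB}));
    // +-1 GB (2 GB total): any two addresses are within rel32 reach.
    assert(everyPairReachable(Window{center, 1 * GB}));
}
```

In this model the +-1 GB window is exactly the largest one for which the check passes, which is why halving the current range is sufficient.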

In any case, as long as the final scheme works in the vast majority of practical cases, that should be OK; for pathological cases we can always fall back to turning off rip-relative addressing.

Fixing this on the JIT side is also a possibility. In that case the JIT should change to ensure that it only uses the hint function for very particular data addresses (e.g. static field addresses, managed method entry points). If we go that way, I'm not really sure what the point is of checking the preferred range at all on the runtime side, compared to just checking whether rip-relative addressing is still allowed.

cc @dotnet/jit-contrib, @jkotas, @janvorli
