25 changes: 20 additions & 5 deletions src/coreclr/runtime/arm64/StubDispatch.S
@@ -16,15 +16,30 @@
  // Macro that generates code to check a single cache entry.
  .macro CHECK_CACHE_ENTRY entry
      // Check a single entry in the cache.
-     // x9 : Cache data structure. Also used for target address jump.
+     // x9 : Cache data structure
      // x10 : Instance MethodTable*
      // x11 : Indirection cell address, preserved
-     // x12 : Trashed
-     ldr x12, [x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16))]
+     // x12, x13 : Trashed
+     //
+     // Use ldp to load both m_pInstanceType and m_pTargetCode in a single instruction.
+     // On ARM64 two separate ldr instructions can be reordered across a control dependency,
+     // which means a concurrent atomic cache entry update (via stlxp) could be observed as a
+     // torn read (new type, old target). ldp is single-copy atomic for the pair on FEAT_LSE2
+     // hardware (ARMv8.4+). The cbz guard ensures correctness on pre-LSE2 hardware too:
+     // a torn read can only produce a zero target (entries go from 0,0 to type,target),
+     // so we treat it as a cache miss.
+     .if (OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16)) > 504
+         // ldp's signed immediate offset must be in [-512,504] for 64-bit registers.
+         // Use add to reach far entries in the 32/64 slot stubs.
+         add x12, x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16))
+         ldp x12, x13, [x12]
Comment on lines +31 to +35
Copilot AI Mar 31, 2026

For stubs with 32/64 entries, OFFSETOF__InterfaceDispatchCache__m_rgEntries is 0x20 (see src/coreclr/vm/arm64/asmconstants.h:289), so entries >= 30 fall into the add+ldp path. In the 64-entry stub that adds an extra instruction on the (common) mismatch path for over half the probes, which could regress interface-dispatch hot-path throughput. Consider restructuring to avoid the per-entry add, e.g. by splitting the probe sequence into two ranges and computing an adjusted base once, so that ldp can keep using in-range immediate offsets.
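
The cutoff at entry 30 follows from the largest immediate that ldp accepts for 64-bit registers. A minimal C sketch of that arithmetic, assuming the 0x20 entries offset quoted in this comment and the 16-byte entry stride used by the macro; `ENTRIES_OFFSET`, `entry_offset`, `needs_add`, and `count_far_entries` are hypothetical names for illustration and do not come from the runtime sources:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 0x20 is the m_rgEntries offset quoted in the review comment,
   16 is the entry stride used by the macro, 504 is the largest ldp immediate. */
enum { ENTRIES_OFFSET = 0x20, ENTRY_SIZE = 16, LDP_MAX_IMM = 504 };

/* Byte offset of a cache entry from the cache base held in x9. */
static int entry_offset(int entry) {
    return ENTRIES_OFFSET + entry * ENTRY_SIZE;
}

/* True when the macro's .if condition holds, i.e. add + ldp is emitted. */
static bool needs_add(int entry) {
    return entry_offset(entry) > LDP_MAX_IMM;
}

/* Number of entries in a stub with `slots` entries that take the add + ldp path. */
static int count_far_entries(int slots) {
    int far = 0;
    for (int e = 0; e < slots; e++) {
        if (needs_add(e)) {
            far++;
        }
    }
    return far;
}

/* Sanity check of the boundary: entry 29 still fits, entry 30 does not. */
static void check_cutoff(void) {
    assert(!needs_add(29));
    assert(needs_add(30));
}
```

Under these assumptions entry 29 sits at offset 496 and still fits in the immediate, while entry 30 sits at 512 and does not. That gives 34 far entries in a 64-entry stub and 2 in a 32-entry stub.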
+     .else
+         ldp x12, x13, [x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16))]
+     .endif
      cmp x10, x12
      bne 0f
-     ldr x9, [x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16) + 8)]
-     br x9
+     cbz x13, 0f
+     br x13
  0:
  .endm
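
The comment block in this macro argues that the cbz guard makes a torn read harmless. The sketch below models that argument in C, under the comment's own assumption that an entry only ever goes from (0, 0) to (type, target). `probe` and `check_observable_states` are hypothetical names; the function mirrors the cmp/bne/cbz/br sequence on an already-loaded pair and is not a runtime API:

```c
#include <assert.h>
#include <stdint.h>

/* Models one CHECK_CACHE_ENTRY probe on the pair that was loaded into x12/x13.
   Returns the address to branch to, or 0 to fall through to the next entry. */
static uintptr_t probe(uintptr_t instance_type,
                       uintptr_t seen_type, uintptr_t seen_target) {
    if (seen_type != instance_type) {
        return 0;               /* cmp x10, x12 ; bne 0f */
    }
    if (seen_target == 0) {
        return 0;               /* cbz x13, 0f */
    }
    return seen_target;         /* br x13 */
}

/* Walks the four pairs a reader could observe while a writer fills an empty
   entry. Assumes type and target are both non-zero. */
static void check_observable_states(uintptr_t type, uintptr_t target) {
    assert(probe(type, 0, 0) == 0);              /* still empty */
    assert(probe(type, type, 0) == 0);           /* torn: new type, old target */
    assert(probe(type, 0, target) == 0);         /* torn: old type, new target */
    assert(probe(type, type, target) == target); /* fully published */
}
```

Only the fully published pair results in a branch; every other observable state is treated as a miss. The model says nothing about an entry being overwritten in place with a different non-zero pair, so the guard depends on that transition not occurring.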

25 changes: 20 additions & 5 deletions src/coreclr/runtime/arm64/StubDispatch.asm
@@ -13,15 +13,30 @@
      MACRO
      CHECK_CACHE_ENTRY $entry
      ;; Check a single entry in the cache.
-     ;; x9 : Cache data structure. Also used for target address jump.
+     ;; x9 : Cache data structure
      ;; x10 : Instance MethodTable*
      ;; x11 : Indirection cell address, preserved
-     ;; x12 : Trashed
-     ldr x12, [x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16))]
+     ;; x12, x13 : Trashed
+     ;;
+     ;; Use ldp to load both m_pInstanceType and m_pTargetCode in a single instruction.
+     ;; On ARM64 two separate ldr instructions can be reordered across a control dependency,
+     ;; which means a concurrent atomic cache entry update (via stlxp) could be observed as a
+     ;; torn read (new type, old target). ldp is single-copy atomic for the pair on FEAT_LSE2
+     ;; hardware (ARMv8.4+). The cbz guard ensures correctness on pre-LSE2 hardware too:
+     ;; a torn read can only produce a zero target (entries go from 0,0 to type,target),
+     ;; so we treat it as a cache miss.
+     IF (OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16)) > 504
+         ;; ldp's signed immediate offset must be in [-512,504] for 64-bit registers.
+         ;; Use add to reach far entries in the 32/64 slot stubs.
+         add x12, x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16))
+         ldp x12, x13, [x12]
Comment on lines +28 to +32
Copilot AI Mar 31, 2026

For 32/64-entry stubs, OFFSETOF__InterfaceDispatchCache__m_rgEntries is 0x20 (src/coreclr/vm/arm64/asmconstants.h:289), so entries >= 30 will always take the add+ldp sequence. This adds an extra instruction on each mismatch for every probe past entry 29, which is over half of the probes in the 64-entry stub. Consider splitting the probe loop into two ranges, with a second base computed once, so that later ldp instructions use an in-range immediate offset and the hot path avoids the repeated add.
+     ELSE
+         ldp x12, x13, [x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16))]
+     ENDIF
      cmp x10, x12
      bne %ft0
-     ldr x9, [x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16) + 8)]
-     br x9
+     cbz x13, %ft0
+     br x13
  0
      MEND
