
Fix ARM64 interface dispatch cache torn read#126346

Open
MichalStrehovsky wants to merge 1 commit into dotnet:main from MichalStrehovsky:fix/arm64-interface-dispatch-torn-read

Conversation

@MichalStrehovsky
Member

On ARM64, the CHECK_CACHE_ENTRY macro read m_pInstanceType and m_pTargetCode from a cache entry using two separate ldr instructions separated by a control dependency (cmp/bne). ARM64's weak memory model does not order loads across control dependencies, so the hardware can speculatively satisfy the second load (target) before the first (type) commits. When a concurrent thread atomically populates the entry via stlxp/casp (UpdateCacheEntryAtomically), the reader can observe the new m_pInstanceType but the old m_pTargetCode (0), then br to address 0.

Fix by using ldp to load both fields in a single instruction (single-copy atomic on FEAT_LSE2 / ARMv8.4+ hardware), plus a cbz guard to catch torn reads on pre-LSE2 hardware where ldp pair atomicity is not architecturally guaranteed.
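The invariant the fix restores can be modeled in portable C. This is an illustrative sketch, not the runtime's code: the entry is shrunk to two 32-bit fields packed into one lock-free 64-bit atomic, so a single atomic load stands in for the 16-byte ldp/casp pair, and the names `publish`/`probe` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a dispatch cache entry. The real entry is two
 * 64-bit words read with ldp; here each field is 32 bits so the pair
 * fits one lock-free 64-bit atomic and one load observes both fields
 * together (no torn read is possible). */
static _Atomic uint64_t g_entry; /* low 32 bits: type, high 32 bits: target */

/* Writer side, analogous to UpdateCacheEntryAtomically: both fields
 * become visible as a unit. */
void publish(uint32_t type, uint32_t target) {
    atomic_store_explicit(&g_entry, ((uint64_t)target << 32) | type,
                          memory_order_release);
}

/* Reader side, analogous to the fixed CHECK_CACHE_ENTRY: one load of
 * the pair, then a zero check on target (the cbz guard), then the type
 * compare. Returns the target on a hit, 0 on a miss. */
uint32_t probe(uint32_t instance_type) {
    uint64_t e = atomic_load_explicit(&g_entry, memory_order_relaxed);
    uint32_t target = (uint32_t)(e >> 32);
    if (target == 0)
        return 0;                  /* empty or mid-update entry: miss */
    if ((uint32_t)e != instance_type)
        return 0;                  /* cmp/bne equivalent: type mismatch */
    return target;                 /* hit: the br x13 destination */
}
```

With two separate non-atomic loads in `probe`, a concurrent `publish` could be observed half-applied; the single pair load makes the "type matches but target is 0" state either impossible (atomic pair) or detectable (the zero check), which mirrors the ldp + cbz combination in the patch.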

Fixes #126345


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment


Pull request overview

Fixes a potential torn-read race in the ARM64 cached interface dispatch fast-path that could lead to branching to address 0 when reading a concurrently populated cache entry.

Changes:

  • Replace two independent loads of cache-entry fields with a single ldp pair load to avoid reordering across control dependencies on ARM64.
  • Add a cbz guard on the loaded target to treat observed torn reads (type updated, target still 0) as a cache miss on pre-LSE2 hardware.
  • Mirror the changes in both the GAS (.S) and ARMASM (.asm) implementations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

src/coreclr/runtime/arm64/StubDispatch.S
    Updates the ARM64 stub macro to use ldp + cbz to avoid torn cache-entry reads.
src/coreclr/runtime/arm64/StubDispatch.asm
    Applies the same logic to the ARMASM variant to keep the two implementations consistent.

Comment on lines +31 to +35
.if (OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16)) > 504
// ldp's signed immediate offset must be in [-512,504] for 64-bit registers.
// Use add to reach far entries in the 32/64 slot stubs.
add x12, x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16))
ldp x12, x13, [x12]

Copilot AI Mar 31, 2026


For stubs with 32/64 entries, OFFSETOF__InterfaceDispatchCache__m_rgEntries is 0x20 (see src/coreclr/vm/arm64/asmconstants.h:289), so entries >= 30 fall into the add+ldp path. That adds an extra instruction on the (common) mismatch path for over half the probes, which could regress interface-dispatch hot-path throughput. Consider restructuring to avoid per-entry add (e.g., split the probe sequence into two ranges using an adjusted base once, so ldp can keep using immediate offsets in-range).
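The arithmetic behind this comment can be checked directly. A small sketch using the constants stated in the review (entries start at 0x20, each entry is 16 bytes, and ldp's scaled signed immediate for 64-bit registers spans [-512, 504] in multiples of 8); the function names are illustrative, not runtime code:

```c
#include <stdbool.h>

/* Constants from the review comment: entry array base offset 0x20,
 * 16-byte entries, ldp 64-bit immediate range [-512, 504], step 8. */
enum { ENTRIES_OFFSET = 0x20, ENTRY_SIZE = 16,
       LDP_IMM_MIN = -512, LDP_IMM_MAX = 504 };

/* Byte offset of cache entry i from the cache base pointer. */
long entry_offset(int i) { return ENTRIES_OFFSET + (long)i * ENTRY_SIZE; }

/* Can ldp reach this offset with its signed, 8-byte-scaled immediate? */
bool fits_ldp_imm(long off) {
    return off >= LDP_IMM_MIN && off <= LDP_IMM_MAX && (off % 8) == 0;
}
```

Entry 29 sits at offset 496 and is the last one ldp can reach directly; entry 30 lands at 512, one step past the limit, so in the 32- and 64-entry stubs every probe from entry 30 on pays the extra add, which is the regression the comment flags.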

Comment on lines +28 to +32
IF (OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16)) > 504
;; ldp's signed immediate offset must be in [-512,504] for 64-bit registers.
;; Use add to reach far entries in the 32/64 slot stubs.
add x12, x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16))
ldp x12, x13, [x12]

Copilot AI Mar 31, 2026


For 32/64-entry stubs, OFFSETOF__InterfaceDispatchCache__m_rgEntries is 0x20 (src/coreclr/vm/arm64/asmconstants.h:289), so entries >= 30 will always take the add+ldp sequence. This adds an extra instruction on each mismatch for a large fraction of probes. Consider splitting the probe loop into two ranges with a second base computed once so later ldp uses an in-range immediate offset, avoiding repeated add in the hot path.
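One way to read this suggestion: compute a second base register once, displaced so that every far entry's offset relative to it falls back inside ldp's immediate window, exploiting the negative half of the range. The sketch below is an interpretation under the same assumed constants (base 0x20, 16-byte entries, immediate range [-512, 504]), not the actual stub code:

```c
#include <stdint.h>

enum { ENT_OFFSET = 0x20, ENT_SIZE = 16, LDP_MIN = -512, LDP_MAX = 504 };

/* Byte offset of entry i from the original cache base. */
static long off_of(int i) { return ENT_OFFSET + (long)i * ENT_SIZE; }

/* Pick one displacement D for entries [first, last] so that every
 * entry's offset relative to (base + D) fits ldp's immediate window.
 * Returns -1 if the range is wider than the window (1016 bytes), in
 * which case a single adjusted base cannot cover it. */
long pick_displacement(int first, int last) {
    long lo = off_of(first), hi = off_of(last);
    if (hi - lo > LDP_MAX - LDP_MIN)
        return -1;
    long d = hi - LDP_MAX;     /* smallest D that brings the top entry in range */
    if (d < 0) d = 0;
    /* Entry offsets are multiples of 16 and LDP_MAX+8 is a multiple of 8,
     * so d is already a valid multiple of 8 here. */
    return d;
}
```

For a 64-entry stub, entries 30..63 span offsets 512..1040; a displacement of 536 maps them to [-24, 504], all reachable by immediate, so a single extra add per stub (rather than one per far probe) would suffice.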

@MichalStrehovsky
Member Author

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).



Development

Successfully merging this pull request may close these issues.

Crash in interface dispatch on ARM64

3 participants