Fix ARM64 interface dispatch cache torn read #126346
MichalStrehovsky wants to merge 1 commit into dotnet:main from
Conversation
On ARM64, the CHECK_CACHE_ENTRY macro read m_pInstanceType and m_pTargetCode from a cache entry using two separate ldr instructions separated by a control dependency (cmp/bne). ARM64's weak memory model does not order loads across control dependencies, so the hardware can speculatively satisfy the second load (target) before the first (type) commits. When a concurrent thread atomically populates the entry via stlxp/casp (UpdateCacheEntryAtomically), the reader can observe the new m_pInstanceType but the old m_pTargetCode (0), then br to address 0.

Fix by using ldp to load both fields in a single instruction (single-copy atomic on FEAT_LSE2 / ARMv8.4+ hardware), plus a cbz guard to catch torn reads on pre-LSE2 hardware, where ldp pair atomicity is not architecturally guaranteed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
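The fixed fast-path check can be sketched in C++ as a minimal single-threaded model (the struct and function names here are hypothetical illustrations, not actual runtime code): both fields are read as a pair, and a zero target — which can only be observed as half of a torn read of an entry being populated concurrently — is treated as a cache miss rather than branched to.

```cpp
#include <cstdint>

// Hypothetical model of one dispatch cache entry:
// { m_pInstanceType, m_pTargetCode }.
struct CacheEntry {
    uintptr_t type;
    uintptr_t target;
};

// Sketch of the fixed probe: in the asm both fields come from a single
// ldp, and the cbz guard sends a zero target to the miss path instead
// of executing `br 0`.
uintptr_t probe(const CacheEntry& e, uintptr_t objType) {
    uintptr_t type = e.type;        // asm: ldp x12, x13, [cache + offset]
    uintptr_t target = e.target;
    if (type != objType) return 0;  // mismatch: try next entry / slow path
    if (target == 0) return 0;      // cbz guard: torn read observed, treat as miss
    return target;                  // asm: br x13
}
```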
Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib
Pull request overview
Fixes a potential torn-read race in the ARM64 cached interface dispatch fast-path that could lead to branching to address 0 when reading a concurrently populated cache entry.
Changes:
- Replace two independent loads of cache-entry fields with a single ldp pair load to avoid reordering across control dependencies on ARM64.
- Add a cbz guard on the loaded target to treat observed torn reads (type updated, target still 0) as a cache miss on pre-LSE2 hardware.
- Mirror the changes in both the GAS (.S) and ARMASM (.asm) implementations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/runtime/arm64/StubDispatch.S | Updates ARM64 stub macro to use ldp + cbz to avoid torn cache-entry reads. |
| src/coreclr/runtime/arm64/StubDispatch.asm | Same logic as above for the ARMASM variant to keep implementations consistent. |
```asm
.if (OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16)) > 504
    // ldp's signed immediate offset must be in [-512,504] for 64-bit registers.
    // Use add to reach far entries in the 32/64 slot stubs.
    add x12, x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + (\entry * 16))
    ldp x12, x13, [x12]
```
For stubs with 32/64 entries, OFFSETOF__InterfaceDispatchCache__m_rgEntries is 0x20 (see src/coreclr/vm/arm64/asmconstants.h:289), so entries >= 30 fall into the add+ldp path. That adds an extra instruction on the (common) mismatch path for over half the probes, which could regress interface-dispatch hot-path throughput. Consider restructuring to avoid per-entry add (e.g., split the probe sequence into two ranges using an adjusted base once, so ldp can keep using immediate offsets in-range).
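The encoding constraint behind this comment can be checked with a small C++ sketch (helper names are illustrative): ldp with 64-bit registers uses a 7-bit signed immediate scaled by 8, so only multiple-of-8 byte offsets in [-512, 504] are encodable, and with the stated base of 0x20 and 16-byte entries, entry 30 is the first to fall out of range.

```cpp
// ldp with 64-bit registers encodes a 7-bit signed immediate scaled by 8:
// byte offsets must be multiples of 8 in [-512, 504].
bool ldp64_offset_encodable(long off) {
    return off >= -512 && off <= 504 && off % 8 == 0;
}

// Assumed from the review comment: m_rgEntries starts at offset 0x20 and
// each cache entry is 16 bytes.
long entry_offset(int entry) {
    return 0x20 + entry * 16L;
}
```

Entry 29 yields offset 496 (encodable), entry 30 yields 512 (out of range), matching the "entries >= 30" observation above.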
```asm
IF (OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16)) > 504
    ;; ldp's signed immediate offset must be in [-512,504] for 64-bit registers.
    ;; Use add to reach far entries in the 32/64 slot stubs.
    add x12, x9, #(OFFSETOF__InterfaceDispatchCache__m_rgEntries + ($entry * 16))
    ldp x12, x13, [x12]
```
For 32/64-entry stubs, OFFSETOF__InterfaceDispatchCache__m_rgEntries is 0x20 (src/coreclr/vm/arm64/asmconstants.h:289), so entries >= 30 will always take the add+ldp sequence. This adds an extra instruction on each mismatch for a large fraction of probes. Consider splitting the probe loop into two ranges with a second base computed once so later ldp uses an in-range immediate offset, avoiding repeated add in the hot path.
/azp run runtime-nativeaot-outerloop

Azure Pipelines successfully started running 1 pipeline(s).
Fixes #126345