Remove CORINFO_HELP_ASSIGN_BYREF#128542
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
There was a problem hiding this comment.
Pull request overview
This PR removes the CORINFO_HELP_ASSIGN_BYREF / JIT_ByRefWriteBarrier helper from the JIT/EE boundary and ReadyToRun format, and replaces the JIT’s “cpobj with interleaved GC pointers” lowering/codegen path with either (a) decomposition into per-slot ASSIGN_REF write barriers (for small GC pointer counts) or (b) the existing bulk write-barrier helper (CORINFO_HELP_BULK_WRITEBARRIER). It also bumps the ReadyToRun major version to reflect the serialized format change.
Changes:
- Delete the
ASSIGN_BYREFhelper surface area (helper IDs, write-barrier stubs/AV markers, and per-arch asm implementations) across VM, runtime, NativeAOT EH helper marking, and AOT tooling. - Remove the
CpObj*block op kinds and all backendgenCodeForCpObjimplementations; introduceTryDecomposeBlockStoreAsIndirs+LowerBlockStoreAsGcBulkCopyCallas the new lowering strategy for GC-containing struct copies. - Bump ReadyToRun major version to 19 and reserve helper ID
0x32(formerlyByRefWriteBarrier) to avoid reuse.
Reviewed changes
Copilot reviewed 70 out of 70 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/vm/wasm/helpers.cpp | Removes wasm helper end marker and AV location globals for byref write barrier. |
| src/coreclr/vm/threads.cpp | Stops registering CORINFO_HELP_ASSIGN_BYREF helper in InitThreadManager; updates AMD64 comment. |
| src/coreclr/vm/riscv64/asmhelpers.S | Removes JIT_ByRefWriteBarrier implementation; adjusts comments about incremented regs. |
| src/coreclr/vm/loongarch64/asmhelpers.S | Removes JIT_ByRefWriteBarrier implementation; adjusts comments about incremented regs. |
| src/coreclr/vm/jitinterface.h | Removes RhpByRefAssignRef* declarations/mapping and JIT_ByRefWriteBarrier decl. |
| src/coreclr/vm/i386/jithelp.S | Deletes x86 JIT_ByRefWriteBarrier helper macro/entry and its inclusion in the group. |
| src/coreclr/vm/i386/jithelp.asm | Deletes x86 JIT_ByRefWriteBarrier helper macro/entry and its inclusion in the group. |
| src/coreclr/vm/excep.cpp | Removes byref write barrier end marker, AV locations, and helper-range checks. |
| src/coreclr/vm/arm64/patchedcode.S | Removes ARM64 JIT_ByRefWriteBarrier; updates comments for remaining helpers/return stubs. |
| src/coreclr/vm/arm64/patchedcode.asm | Removes ARM64 JIT_ByRefWriteBarrier; updates comments for remaining helpers/return stubs. |
| src/coreclr/vm/arm/stubs.cpp | Removes ARM byref barrier variants from stub declarations/mapping/validation. |
| src/coreclr/vm/arm/patchedcode.S | Removes ARM JIT_ByRefWriteBarrier region from patched code blob. |
| src/coreclr/vm/arm/asmhelpers.S | Removes ARM byref write barrier helper macro and descriptor entries. |
| src/coreclr/vm/amd64/jithelpers_fast.S | Removes AMD64 Unix JIT_ByRefWriteBarrier implementation. |
| src/coreclr/vm/amd64/JitHelpers_Fast.asm | Removes AMD64 Windows JIT_ByRefWriteBarrier implementation. |
| src/coreclr/tools/Common/JitInterface/CorInfoHelpFunc.cs | Removes CORINFO_HELP_ASSIGN_BYREF from managed mirror enum. |
| src/coreclr/tools/Common/Internal/Runtime/ReadyToRunConstants.cs | Reserves 0x32 with a “do not reuse” comment in tool-side helper enum. |
| src/coreclr/tools/aot/ILCompiler.RyuJit/JitInterface/CorInfoImpl.RyuJit.cs | Drops mapping from ASSIGN_BYREF to ByRefWriteBarrier R2R helper. |
| src/coreclr/tools/aot/ILCompiler.Reflection.ReadyToRun/ReadyToRunSignature.cs | Removes signature pretty-printing for ByRefWriteBarrier. |
| src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs | Drops mapping from ASSIGN_BYREF to ByRefWriteBarrier R2R helper. |
| src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/JitHelper.cs | Removes ReadyToRunHelper.ByRefWriteBarrier entry point logic. |
| src/coreclr/runtime/riscv64/WriteBarriers.S | Removes RhpByRefAssignRef implementation/AV alternate entry. |
| src/coreclr/runtime/portable/WriteBarriers.cpp | Removes portable stub for RhpByRefAssignRef. |
| src/coreclr/runtime/loongarch64/WriteBarriers.S | Removes RhpByRefAssignRef; updates comment about incremented regs in remaining helper. |
| src/coreclr/runtime/i386/WriteBarriers.S | Removes RhpByRefAssignRef implementation and AV alternate entries. |
| src/coreclr/runtime/i386/WriteBarriers.asm | Removes RhpByRefAssignRef implementation and AV alternate entries. |
| src/coreclr/runtime/arm64/WriteBarriers.S | Removes RhpByRefAssignRefArm64 implementation/AV alternate entry. |
| src/coreclr/runtime/arm64/WriteBarriers.asm | Removes RhpByRefAssignRefArm64 implementation/AV alternate entry. |
| src/coreclr/runtime/arm/WriteBarriers.S | Removes RhpByRefAssignRef implementation and AV labels. |
| src/coreclr/runtime/amd64/WriteBarriers.S | Removes RhpByRefAssignRef implementation and AV alternate entries (incl. Apple unwind workaround). |
| src/coreclr/runtime/amd64/WriteBarriers.asm | Removes RhpByRefAssignRef implementation and AV alternate entries. |
| src/coreclr/nativeaot/Runtime/EHHelpers.cpp | Removes byref write barrier AV locations from NativeAOT helper-range detection. |
| src/coreclr/jit/utils.cpp | Removes CORINFO_HELP_ASSIGN_BYREF from helper call property initialization. |
| src/coreclr/jit/targetx86.h | Removes x86 ASSIGN_BYREF register/killset documentation and masks. |
| src/coreclr/jit/targetwasm.h | Removes wasm byref write barrier reg/killset defines. |
| src/coreclr/jit/targetriscv64.h | Removes RISCV64 byref write barrier reg/killset defines. |
| src/coreclr/jit/targetloongarch64.h | Removes LoongArch64 byref write barrier reg/killset defines. |
| src/coreclr/jit/targetarm64.h | Removes ARM64 byref write barrier reg/killset defines. |
| src/coreclr/jit/targetarm.h | Removes ARM byref write barrier reg/killset defines. |
| src/coreclr/jit/targetamd64.h | Removes AMD64 byref write barrier killset defines and related commentary. |
| src/coreclr/jit/lsraxarch.cpp | Removes CpObj-specific LSRA register assignment logic for xarch. |
| src/coreclr/jit/lsrariscv64.cpp | Removes CpObj-specific LSRA register assignment logic for riscv64. |
| src/coreclr/jit/lsraloongarch64.cpp | Removes CpObj-specific LSRA register assignment logic for loongarch64. |
| src/coreclr/jit/lsrabuild.cpp | Removes CpObj killset handling that depended on ASSIGN_BYREF. |
| src/coreclr/jit/lsraarmarch.cpp | Removes CpObj-specific LSRA register assignment logic for arm/arm64. |
| src/coreclr/jit/lowerxarch.cpp | Changes GC-containing struct copy lowering to decompose-or-bulk-helper; adjusts stack/off-heap handling. |
| src/coreclr/jit/lowerwasm.cpp | Uses decompose-or-bulk-helper for GC-containing struct copies on wasm. |
| src/coreclr/jit/lowerriscv64.cpp | Uses decompose-or-bulk-helper for GC-containing struct copies on riscv64. |
| src/coreclr/jit/lowerloongarch64.cpp | Uses decompose-or-bulk-helper for GC-containing struct copies on loongarch64. |
| src/coreclr/jit/lowerarmarch.cpp | Uses decompose-or-bulk-helper for GC-containing struct copies on arm/arm64. |
| src/coreclr/jit/lower.h | Replaces TryLowerBlockStoreAsGcBulkCopyCall with LowerBlockStoreAsGcBulkCopyCall and adds TryDecomposeBlockStoreAsIndirs. |
| src/coreclr/jit/lower.cpp | Implements decomposition into per-slot indirections and the bulk write-barrier fallback lowering. |
| src/coreclr/jit/gentree.h | Removes CpObj-specific BlkOpKind enum values. |
| src/coreclr/jit/gentree.cpp | Removes debug printing for CpObj BlkOpKind values. |
| src/coreclr/jit/emit.cpp | Removes special-case GC-reg-kill handling for ASSIGN_BYREF. |
| src/coreclr/jit/codegenxarch.cpp | Removes CpObj codegen path and related block-kind dispatch. |
| src/coreclr/jit/codegenwasm.cpp | Removes byref write barrier signature comment. |
| src/coreclr/jit/codegenriscv64.cpp | Removes CpObj codegen path and related block-kind dispatch. |
| src/coreclr/jit/codegenloongarch64.cpp | Removes CpObj codegen path and related block-kind dispatch. |
| src/coreclr/jit/codegencommon.cpp | Removes helper killset case for ASSIGN_BYREF. |
| src/coreclr/jit/codegenarmarch.cpp | Removes CpObj block-kind dispatch. |
| src/coreclr/jit/codegenarm64.cpp | Removes CpObj codegen path implementation. |
| src/coreclr/jit/codegenarm.cpp | Removes CpObj codegen path implementation. |
| src/coreclr/jit/codegen.h | Removes genCodeForCpObj declaration. |
| src/coreclr/inc/readytorunhelpers.h | Removes READYTORUN_HELPER_ByRefWriteBarrier mapping macro entry. |
| src/coreclr/inc/readytorun.h | Bumps R2R major/minimum major to 19; reserves helper slot 0x32; documents the format break. |
| src/coreclr/inc/jithelpers.h | Removes CORINFO_HELP_ASSIGN_BYREF helper entry. |
| src/coreclr/inc/jiteeversionguid.h | Updates JITEEVersionIdentifier to reflect the interface change. |
| src/coreclr/inc/corinfo.h | Removes CORINFO_HELP_ASSIGN_BYREF from CorInfoHelpFunc. |
| docs/design/coreclr/botr/readytorun-format.md | Updates R2R helper enum docs to reserve 0x32 and note removal. |
|
@EgorBot -amd -arm using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
public class Benchmarks
{
private MyStruct a1;
private MyStruct a2;
[Benchmark]
public void Copy()
{
a1 = a2;
}
}
record struct MyStruct(long N1, long N2, long N3, long N4, long N5, string A, string B); |
9beff5e to
c845f6c
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/coreclr/jit/lower.cpp:10167
- LowerBlockStoreAsGcBulkCopyCall now unconditionally drops a GT_IND source without preserving/handling volatile semantics (previously this path explicitly refused volatile sources/stores). If
datais a volatile indirection (e.g. reading from a volatile struct field), removing the GT_IND loses the volatile flag, and the resulting CORINFO_HELP_BULK_WRITEBARRIER call is not wrapped with the usual memory barrier nodes (compare with LowerBlockStoreAsHelperCall). This risks incorrect volatile ordering/semantics on non-xarch targets. Consider either (1) keeping the earlier bail-out for volatile source/store and forcing decomposition in those cases, or (2) explicitly inserting the appropriate BARRIER_* nodes around the helper call and preserving volatile-source behavior.
// or layouts with too many GC pointers to decompose efficiently). Volatile
// semantics are preserved by the helper call itself — the call acts as a
// compiler memory barrier and the helper uses atomic per-slot stores.
//
// Arguments:
// blk - The block store node to lower
//
void Lowering::LowerBlockStoreAsGcBulkCopyCall(GenTreeBlk* blk)
{
assert(blk->OperIs(GT_STORE_BLK));
assert(!blk->OperIsInitBlkOp());
assert(blk->GetLayout()->HasGCPtr());
assert(!blk->IsAddressNotOnHeap(m_compiler));
GenTree* dest = blk->Addr();
GenTree* data = blk->Data();
if (data->OperIs(GT_IND))
{
// Drop GT_IND nodes
BlockRange().Remove(data);
data = data->AsIndir()->Addr();
}
c845f6c to
ee79d4d
Compare
ee79d4d to
d805a92
Compare
d805a92 to
28e1813
Compare
28e1813 to
1b597b2
Compare
1b597b2 to
0177e27
Compare
|
@EgorBot -amd -arm using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
public class Benchmarks
{
// Heap-resident sources/destinations (class fields).
private S_2GC_NoGap _2g_nogap_src, _2g_nogap_dst;
private S_2GC_Gap8 _2g_g8_src, _2g_g8_dst;
private S_2GC_Gap32 _2g_g32_src, _2g_g32_dst;
private S_2GC_Gap64 _2g_g64_src, _2g_g64_dst;
private S_2GC_Gap128 _2g_g128_src, _2g_g128_dst;
private S_2GC_Gap1K _2g_g1k_src, _2g_g1k_dst;
private S_2GC_Gap2K _2g_g2k_src, _2g_g2k_dst;
private S_1GC_Gap64 _1g_g64_src, _1g_g64_dst;
private S_1GC_Gap1K _1g_g1k_src, _1g_g1k_dst;
private S_3GC_Gap64 _3g_g64_src, _3g_g64_dst;
private S_3GC_Gap1K _3g_g1k_src, _3g_g1k_dst;
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Sink<T>(in T _) { }
// ===== Heap target: dst is class field (write barriers needed) =====
[Benchmark] public void Heap_2GC_NoGap() => _2g_nogap_dst = _2g_nogap_src;
[Benchmark] public void Heap_2GC_Gap8() => _2g_g8_dst = _2g_g8_src;
[Benchmark] public void Heap_2GC_Gap32() => _2g_g32_dst = _2g_g32_src;
[Benchmark] public void Heap_2GC_Gap64() => _2g_g64_dst = _2g_g64_src;
[Benchmark] public void Heap_2GC_Gap128() => _2g_g128_dst = _2g_g128_src;
[Benchmark] public void Heap_2GC_Gap1K() => _2g_g1k_dst = _2g_g1k_src;
[Benchmark] public void Heap_2GC_Gap2K() => _2g_g2k_dst = _2g_g2k_src;
[Benchmark] public void Heap_1GC_Gap64() => _1g_g64_dst = _1g_g64_src;
[Benchmark] public void Heap_1GC_Gap1K() => _1g_g1k_dst = _1g_g1k_src;
[Benchmark] public void Heap_3GC_Gap64() => _3g_g64_dst = _3g_g64_src;
[Benchmark] public void Heap_3GC_Gap1K() => _3g_g1k_dst = _3g_g1k_src;
// ===== Stack target: dst is a local (no barriers; SIMD-unroll path) =====
[Benchmark] public void Stack_2GC_NoGap() { var l = _2g_nogap_src; Sink(in l); }
[Benchmark] public void Stack_2GC_Gap8() { var l = _2g_g8_src; Sink(in l); }
[Benchmark] public void Stack_2GC_Gap32() { var l = _2g_g32_src; Sink(in l); }
[Benchmark] public void Stack_2GC_Gap64() { var l = _2g_g64_src; Sink(in l); }
[Benchmark] public void Stack_2GC_Gap128() { var l = _2g_g128_src; Sink(in l); }
[Benchmark] public void Stack_2GC_Gap1K() { var l = _2g_g1k_src; Sink(in l); }
[Benchmark] public void Stack_2GC_Gap2K() { var l = _2g_g2k_src; Sink(in l); }
[Benchmark] public void Stack_1GC_Gap64() { var l = _1g_g64_src; Sink(in l); }
[Benchmark] public void Stack_1GC_Gap1K() { var l = _1g_g1k_src; Sink(in l); }
[Benchmark] public void Stack_3GC_Gap64() { var l = _3g_g64_src; Sink(in l); }
[Benchmark] public void Stack_3GC_Gap1K() { var l = _3g_g1k_src; Sink(in l); }
}
// ---- Layouts (Sequential + Pack=8 so the GC slots stay where declared) ----
// 2 GC slots, no gap (16B total)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_NoGap { public string A; public string B; }
// 2 GC slots, 8B gap (24B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_Gap8 { public string A; public long G; public string B; }
// 2 GC slots, 32B gap (48B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_Gap32 { public string A; public InlineArray4<long> G; public string B; }
// 2 GC slots, 64B gap (80B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_Gap64 { public string A; public InlineArray8<long> G; public string B; }
// 2 GC slots, 128B gap (144B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_Gap128 { public string A; public InlineArray16<long> G; public string B; }
// 2 GC slots, 1024B gap (~1040B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_Gap1K { public string A; public InlineArray16<InlineArray8<long>> G; public string B; }
// 2 GC slots, 2048B gap (~2064B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_2GC_Gap2K { public string A; public InlineArray16<InlineArray16<long>> G; public string B; }
// 1 GC slot + 64B padding (72B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_1GC_Gap64 { public string A; public InlineArray8<long> Pad; }
// 1 GC slot + 1024B padding (~1032B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_1GC_Gap1K { public string A; public InlineArray16<InlineArray8<long>> Pad; }
// 3 GC slots with 64B gaps between (152B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_3GC_Gap64
{
public string A; public InlineArray8<long> G1;
public string B; public InlineArray8<long> G2;
public string C;
}
// 3 GC slots with 1024B gaps between (~2072B)
[StructLayout(LayoutKind.Sequential, Pack = 8)]
public struct S_3GC_Gap1K
{
public string A; public InlineArray16<InlineArray8<long>> G1;
public string B; public InlineArray16<InlineArray8<long>> G2;
public string C;
} |
…ollLimit When a struct copy with GC pointers has an off-heap destination but is too large to unroll inline, the isNotHeap check was previously gated on 'size <= copyBlockUnrollLimit', so doCpObj stayed true and the code fell into LowerBlockStoreAsGcBulkCopyCall - which asserts !isNotHeap. Restructure to match the armarch pattern: always clear doCpObj for isNotHeap, and only set gtBlkOpGcUnsafe when small enough to unroll. Larger off-heap copies now fall through to LowerBlockStoreAsHelperCall (memcpy), which is correct since memcpy is a leaf helper (no GC during the call) and both src/dst slots are GC-tracked via local var info. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@EgorBot -arm -aws_arm -amd -windows_intel using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
public class Benchmarks
{
MyStruct src, dst;
[Benchmark]
public void Heap() => dst = src;
}
public struct MyStruct
{
public string A;
public InlineArray4<long> G;
public string B;
} |
|
@dotnet/jit-contrib @jkotas is this an acceptable trade-off? Clearly this is a size increase given how comfortable the previous call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
mov ecx, 5
rep movsqbut rep movsq can be up to 10x slower, see benchmark in the description. Also, see #128542 (comment) SPMI diffs, jit-diff (MihuBot). Quite a bit a hit for Tier0 codegen, mainly because it spills-reloads everything. I tried to just unconditionally use the bulk-barrier for tier0, but it gets a penalty from these explicit null checks. There are cases where I can improve codegen by tunning LSRA and modelling registers used by write barrier more precisely (presumably). Pros:
Cons:
So I think we can improve certain patterns in follow ups here (to reduce redundant stack spill-reloads). Also, we can tune the bulk barrier a bit. |
|
LGTM |
This PR attempts to remove
CORINFO_HELP_ASSIGN_BYREF(which is basicallyCORINFO_HELP_CHECKED_ASSIGN_REFwith a custom calling convention), it is replaced by eitherCORINFO_HELP_ASSIGN_REForCORINFO_HELP_CHECKED_ASSIGN_REF. As a side effect we no longer use slowrep movsqon x64 for gaps between GC slots and may use SIMD. The motivation thatCORINFO_HELP_ASSIGN_BYREFwas very useful for large struct copies, but we now have a bulk write barrier for those.NOTE: A lot of code (>1000-2000 LOC) can be removed from CoreCLR/NAOT side as dead as well since
CORINFO_HELP_ASSIGN_BYREFis unused now, but that's for another PR and might require R2R bump.Benchmark
INTEL XEON PLATINUM 8573C
AMD EPYC 9V45 2.60GHz
Apple M4
Sligtly less difference on arm64 where we don't have slow
movsqfor nongc gaps.