Reduce contention in UnwindInfoTable seen during high volume LCG method creation#128619
Reduce contention in UnwindInfoTable seen during high volume LCG method creation#128619eduardo-vp wants to merge 7 commits into
Conversation
|
Tagging subscribers to this area: @agocke |
There was a problem hiding this comment.
Pull request overview
This PR targets performance/scalability under high-volume dynamic method (LCG) creation by reducing lock contention and avoiding avoidable allocations in CoreCLR’s dynamic method and unwind-info publishing paths.
Changes:
- Pre-seeds
LCGMethodResolver::m_DynamicCodePointerswith an inline “first” node to avoid chunk allocation in the common single-pointer case. - Reworks
UnwindInfoTablesynchronization to use per-table locks plus a per-table “flush gate” to reduce contention when flushing pending entries. - Increases pending buffer capacity and switches pending sorting and removal operations to more efficient algorithms.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/coreclr/vm/dynamicmethod.h | Adds an inline initial DynamicCodePointer node and initializes m_DynamicCodePointers to it. |
| src/coreclr/vm/dynamicmethod.cpp | Resets/allocates code pointer records using the inline initial node to avoid allocations under hot locks. |
| src/coreclr/vm/codeman.h | Expands pending buffer and introduces per-table locks, flush gate, and registration-failed state. |
| src/coreclr/vm/codeman.cpp | Implements per-table locking/flush gating, pending sort changes, binary-search removal, and lock-free table installation via CAS. |
Comments suppressed due to low confidence (1)
src/coreclr/vm/codeman.cpp:500
- RemoveFromUnwindInfoTable can incorrectly return after matching a soft-deleted (UnwindData==0) entry in the published table. This is problematic now that we also search the pending buffer: if code addresses are reused, a deleted published entry may still cover the new method’s Begin/End range, causing this path to return early and skip removing the real pending entry. Consider requiring UnwindData != 0 for the published-table match (or otherwise treating deleted matches as not-found so the pending buffer check can run).
if (lo > 0)
{
ULONG i = lo - 1;
if (relativeEntryPoint < RUNTIME_FUNCTION__EndAddress(&unwindInfo->pTable[i], unwindInfo->iRangeStart))
{
if (unwindInfo->pTable[i].UnwindData != 0)
unwindInfo->cDeletedEntries++;
unwindInfo->pTable[i].UnwindData = 0; // Mark the entry for deletion
STRESS_LOG1(LF_JIT, LL_INFO100, "RemoveFromUnwindInfoTable Removed entry 0x%x\n", i);
|
|
||
| // This thread attempts to become the sole flusher for this table by taking | ||
| // the flush gate. If it wins, publish the pending entries, then release the gate, | ||
| // re-check if more entries arrived and loop if so. |
There was a problem hiding this comment.
Do the other threads need to wait for their entries to be flushed?
I do not see any code that does that.
There was a problem hiding this comment.
After AddToUnwindInfoTable finishes, they may need to wait a minimum amount of time. A thread that finds m_flushInProgress set to 1 (there's a flushing thread) just returns and waits for the flushing thread to just loop again and pick up its entries. Some threads essentially defer the work to another thread in this version as opposed to every thread taking the lock and publishing (which generated a lot of contention).
There was a problem hiding this comment.
A thread that finds m_flushInProgress set to 1 (there's a flushing thread) just returns and waits for the flushing thread to just loop again
What's the line of code that makes it wait?
There was a problem hiding this comment.
Ah it doesn't actually wait/block - I meant there's some time until it gets published. The thread just continues.
There was a problem hiding this comment.
It means that the tracing stacktraces on the thread that continues are going to be broken until the flushing thread catches up. This change is regressing tracing reliability.
|
@eduardo-vp you still need minimal repro from us on Linux? based on your comment |
Contributes to #123124.
Changes:
m_DynamicCodePointersinLCGMethodResolver. In most cases the list will contain exactly one pointer but if it starts empty, it does a chunk allocation while taking a lock. The initial pointer avoids that unnecessary chunk allocation for most cases.AddToUnwindInfoTablefinishes (it may stay in the pending buffer and get picked up by another thread).AddToUnwindInfoTableto avoid taking locks unnecessarily.I used both the original xUnit test with 30k methods mentioned in the issue and a separate benchmark that stresses LCG method creation to test these changes. I mostly used the LCG benchmark to investigate this since .NET 10 runs ~150% slower than .NET 9 so it was easier to spot if changes were helpful or not.
For the measurements, I built from release/9.0, release/10.0 and release/10.0 + the changes in this PR since the goal is to backport to net 10.
LCG Stress Benchmark (300K methods)
LCG stress benchmark code
Original regression was ~150%. Still slower than .NET 9 but at the same time I'm dubious if it's a realistic scenario.
Original xUnit Benchmark (30K methods)
Original regression was ~33%. The original benchmark now runs essentially as fast as .NET 9.