Replace HashMap COOP transitions with Epoch-Based Reclamation (EBR)#124307
Replace HashMap COOP transitions with Epoch-Based Reclamation (EBR)#124307AaronRobinsonMSFT wants to merge 9 commits intodotnet:mainfrom
Conversation
HashMap's async mode used GCX_MAYBE_COOP_NO_THREAD_BROKEN to transition into cooperative GC mode on every operation, preventing the GC from freeing obsolete bucket arrays mid-read. Old bucket arrays were queued via SyncClean::AddHashMap and freed during GC pauses. This caused a deadlock: when HashMap::LookupValue() was called while holding the DebuggerController lock, the COOP transition (which is level-equivalent to taking the ThreadStore lock) violated lock ordering constraints, since ThreadStore must be acquired before DebuggerController. Replace both mechanisms with Epoch-Based Reclamation (EBR), based on Fraser's algorithm from 'Practical Lock-Freedom' (UCAM-CL-TR-579): - EnterCriticalRegion/ExitCriticalRegion are simple atomic flag stores with memory barriers -- they never block or trigger GC transitions - Obsolete bucket arrays are queued for deferred deletion and freed once all threads have passed through a quiescent state - An RAII holder (EbrCriticalRegionHolder) replaces GCX_MAYBE_COOP at all 6 call sites in hash.cpp Changes: - New: src/coreclr/vm/ebr.h, ebr.cpp (EbrCollector, ~340 lines) - hash.cpp: Replace 6 GCX_MAYBE_COOP_NO_THREAD_BROKEN with EBR holders, replace SyncClean::AddHashMap with QueueForDeletion - syncclean.hpp/cpp: Remove HashMap-related members and cleanup code - ceemain.cpp: Init g_HashMapEbr at startup, shutdown at EE shutdown - CrstTypes.def: Add CrstEbrThreadList, CrstEbrPending - crsttypes_generated.h: Regenerated with new Crst types - CMakeLists.txt: Add ebr.cpp, ebr.h to build
- Rename memoryBudget/m_pendingSize to memoryBudgetInBytes/m_pendingSizeInBytes - Mark EbrCollector and EbrCriticalRegionHolder as final - Delete move constructors/assignment operators - Move NextObsolete from hash.h (public) to hash.cpp (file-static) - Reuse DeleteObsoleteBuckets for sync-mode path in Rehash - Trim redundant backstory comments at EBR call sites - Remove unused forward decls from syncclean.hpp
|
Tagging subscribers to this area: @agocke |
There was a problem hiding this comment.
Pull request overview
This PR replaces HashMap async-mode protection that relied on per-operation COOP GC transitions and GC-time cleanup with an Epoch-Based Reclamation (EBR) mechanism to avoid lock-ordering deadlocks (notably involving DebuggerController vs ThreadStore/GC transitions).
Changes:
- Introduces a new EBR implementation (
EbrCollector+EbrCriticalRegionHolder) and a global collector for HashMap async mode (g_HashMapEbr). - Updates HashMap async call sites to use EBR critical regions and queues obsolete bucket arrays for deferred deletion via EBR.
- Removes the HashMap-specific deferred cleanup path from
SyncCleanand adds new Crst types for EBR internal locks.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/vm/syncclean.hpp | Removes HashMap cleanup surface from SyncClean. |
| src/coreclr/vm/syncclean.cpp | Removes HashMap obsolete-bucket list tracking and GC-time deletion. |
| src/coreclr/vm/hash.h | Removes NextObsolete helper from the header. |
| src/coreclr/vm/hash.cpp | Adds EBR critical region usage and EBR-based deferred deletion for obsolete buckets. |
| src/coreclr/vm/ebr.h | Adds public EBR APIs (EbrCollector, EbrCriticalRegionHolder) and global collector declaration. |
| src/coreclr/vm/ebr.cpp | Implements the EBR collector, per-thread tracking, and deferred deletion queues. |
| src/coreclr/vm/ceemain.cpp | Initializes/shuts down the global HashMap EBR collector during runtime startup/shutdown. |
| src/coreclr/vm/CMakeLists.txt | Adds EBR sources/headers to the VM build. |
| src/coreclr/inc/crsttypes_generated.h | Adds new CrstEbrPending / CrstEbrThreadList types and metadata. |
| src/coreclr/inc/CrstTypes.def | Declares new EBR Crst types. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| } | ||
|
|
||
| void | ||
| EbrCollector::Shutdown() |
There was a problem hiding this comment.
Do we need this? Our shutdown strategy in many parts of the runtime is to let everything leak and let the OS clean it up at process shutdown.
- QueueForDeletion: leak object on OOM instead of immediate deletion, which could cause use-after-free for concurrent EBR readers. Track leaked count via InterlockedIncrement counter. - Rehash: read obsolete bucket size directly from allocation base instead of calling GetSize with wrong pointer (undefined behavior).
- Shutdown: early-return if !m_initialized instead of asserting - Buckets()/Rehash(): simplify assert to !m_fAsyncMode || InCriticalRegion() - LookupValue: remove GC thread exclusion from EBR critical region - Comment fixes in InsertValue and Rehash deferred deletion
Add EbrCollector::ThreadDetach() to unlink and free per-thread EBR data. Call it from ThreadDetaching() in corhost.cpp, following the existing StressLog::ThreadDetach() pattern. This prevents unbounded growth of the EBR thread list in processes with short-lived threads.
Replace thread_local EbrThreadData* with thread_local EbrThreadData
value, eliminating the OOM failure path in GetOrCreateThreadData().
This removes the risk of null dereference in ExitCriticalRegion()
when the RAII holder unwinds after a failed EnterCriticalRegion().
Shutdown and ThreadDetach now clear the data with = {} instead of
deleting heap memory.
|
@jkotas and @noahfalk The following comment looks suspicious in this new world. This is the only COOP transition we have on this thread and the comment, to me, seems now out of date. runtime/src/coreclr/vm/readytoruninfo.cpp Lines 385 to 405 in c1cb909 |
I don't think we need the transitions to COOP mode any longer. The code can stay as-is, but the critical section can be be updated and the COOP transition removed. I think any ways. |
|
I then have a follow-up question on why |
Your analysis sounds right to me. (Although not sure what you meant about updating the critical section?)
The lock appears designed to prevent concurrent writes. The HashMap supports concurrent reads or concurrent read with write, but not multiple concurrent writes. |
| #ifdef LOGGING | ||
| ShutdownLogging(); | ||
| #endif | ||
| // Shutdown EBR before GC heap to ensure all deferred deletions are drained. |
There was a problem hiding this comment.
Why is it not ok to leak everything just like we leak most other things?
| EbrPendingEntry* pEntry = new (nothrow) EbrPendingEntry(); | ||
| if (pEntry == nullptr) | ||
| { | ||
| // If we can't allocate, we must not delete pObject immediately, because |
There was a problem hiding this comment.
QueueForDeletion is called during Rehash. Can we handle this gracefully and fail the rehash with OOM exception when we are not able to queue the old memory block here?
HashMap's async mode used GCX_MAYBE_COOP_NO_THREAD_BROKEN to transition
into cooperative GC mode on every operation, preventing the GC from
freeing obsolete bucket arrays mid-read. Old bucket arrays were queued
via SyncClean::AddHashMap and freed during GC pauses.
This caused a deadlock: when HashMap::LookupValue() was called while
holding the DebuggerController lock, the COOP transition (which is
level-equivalent to taking the ThreadStore lock) violated lock ordering
constraints, since ThreadStore must be acquired before DebuggerController.
Replace both mechanisms with Epoch-Based Reclamation (EBR), based on
Fraser's algorithm from 'Practical Lock-Freedom' (UCAM-CL-TR-579):
with memory barriers -- they never block or trigger GC transitions
once all threads have passed through a quiescent state
at all 6 call sites in hash.cpp