Replace HashMap COOP transitions with Epoch-Based Reclamation (EBR) by AaronRobinsonMSFT · Pull Request #124307 · dotnet/runtime

AaronRobinsonMSFT · 2026-02-12T00:56:45Z

HashMap's async mode used GCX_MAYBE_COOP_NO_THREAD_BROKEN to transition
into cooperative GC mode on every operation, preventing the GC from
freeing obsolete bucket arrays mid-read. Old bucket arrays were queued
via SyncClean::AddHashMap and freed during GC pauses.

This caused a deadlock: when HashMap::LookupValue() was called while
holding the DebuggerController lock, the COOP transition (which is
level-equivalent to taking the ThreadStore lock) violated lock ordering
constraints, since ThreadStore must be acquired before DebuggerController.

Replace both mechanisms with Epoch-Based Reclamation (EBR), based on
Fraser's algorithm from 'Practical Lock-Freedom' (UCAM-CL-TR-579):

- EnterCriticalRegion/ExitCriticalRegion are simple atomic flag stores
  with memory barriers -- they never block or trigger GC transitions
- Obsolete bucket arrays are queued for deferred deletion and freed
  once all threads have passed through a quiescent state
- An RAII holder (EbrCriticalRegionHolder) replaces GCX_MAYBE_COOP
  at all 6 call sites in hash.cpp
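The three points above can be illustrated with a small, portable C++ sketch. It is not the code added in ebr.cpp: the names `SketchEbr` and `SketchRegionHolder`, the fixed slot array, the `std::function` deleters, and the "free after two epoch advances" rule are all assumptions chosen to keep the example short. It only shows the general shape of epoch-based reclamation: entering and leaving a region are plain atomic stores, and a retired object is freed only after every active reader has moved past the epoch in which it was retired.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative only: hypothetical names, not the PR's EbrCollector.
class SketchEbr final
{
public:
    static constexpr size_t MaxThreads = 8; // assumed fixed slot count

    // Announce "active, having observed epoch e". Stored as e + 1 so 0 means idle.
    void Enter(size_t slot)
    {
        uint64_t e = m_globalEpoch.load(std::memory_order_seq_cst);
        m_slots[slot].store(e + 1, std::memory_order_seq_cst);
    }

    // Announce a quiescent state. A plain atomic store; it never blocks.
    void Exit(size_t slot)
    {
        m_slots[slot].store(0, std::memory_order_seq_cst);
    }

    // Queue a deleter for an object that has already been unlinked.
    void Retire(std::function<void()> deleter)
    {
        std::lock_guard<std::mutex> lock(m_pendingLock);
        m_pending.push_back(Pending{ m_globalEpoch.load(), std::move(deleter) });
    }

    // Advance the epoch if every active slot has observed it, then run
    // deleters retired at least two epochs ago. Returns how many ran.
    size_t TryReclaim()
    {
        std::vector<std::function<void()>> ready;
        {
            std::lock_guard<std::mutex> lock(m_pendingLock);
            uint64_t current = m_globalEpoch.load();
            bool allCaughtUp = true;
            for (const auto& slot : m_slots)
            {
                uint64_t v = slot.load();
                if (v != 0 && v - 1 != current)
                    allCaughtUp = false;
            }
            if (allCaughtUp)
                current = m_globalEpoch.fetch_add(1) + 1;

            std::vector<Pending> keep;
            for (auto& p : m_pending)
            {
                if (p.epoch + 2 <= current)
                    ready.push_back(std::move(p.deleter));
                else
                    keep.push_back(std::move(p));
            }
            m_pending.swap(keep);
        }
        for (auto& d : ready)
            d(); // run deleters outside the lock
        return ready.size();
    }

private:
    struct Pending { uint64_t epoch; std::function<void()> deleter; };

    std::atomic<uint64_t> m_globalEpoch{ 0 };
    std::array<std::atomic<uint64_t>, MaxThreads> m_slots{};
    std::mutex m_pendingLock;
    std::vector<Pending> m_pending;
};

// RAII holder: the region is exited on every path out of the scope.
class SketchRegionHolder final
{
public:
    SketchRegionHolder(SketchEbr& ebr, size_t slot) : m_ebr(ebr), m_slot(slot) { m_ebr.Enter(m_slot); }
    ~SketchRegionHolder() { m_ebr.Exit(m_slot); }
    SketchRegionHolder(const SketchRegionHolder&) = delete;
    SketchRegionHolder& operator=(const SketchRegionHolder&) = delete;

private:
    SketchEbr& m_ebr;
    size_t m_slot;
};
```

In this sketch a reader wraps each lookup in a `SketchRegionHolder`, and a writer calls `Retire` after unlinking an old bucket array. While any reader that entered before the retirement is still inside its region, `TryReclaim` cannot advance far enough to run the deleter.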

Changes:

- New: src/coreclr/vm/ebr.h, ebr.cpp (EbrCollector, ~340 lines)
- hash.cpp: Replace 6 GCX_MAYBE_COOP_NO_THREAD_BROKEN with EBR holders, replace SyncClean::AddHashMap with QueueForDeletion
- syncclean.hpp/cpp: Remove HashMap-related members and cleanup code
- ceemain.cpp: Init g_HashMapEbr at startup, shutdown at EE shutdown
- CrstTypes.def: Add CrstEbrThreadList, CrstEbrPending
- crsttypes_generated.h: Regenerated with new Crst types
- CMakeLists.txt: Add ebr.cpp, ebr.h to build

- Rename memoryBudget/m_pendingSize to memoryBudgetInBytes/m_pendingSizeInBytes
- Mark EbrCollector and EbrCriticalRegionHolder as final
- Delete move constructors/assignment operators
- Move NextObsolete from hash.h (public) to hash.cpp (file-static)
- Reuse DeleteObsoleteBuckets for sync-mode path in Rehash
- Trim redundant backstory comments at EBR call sites
- Remove unused forward decls from syncclean.hpp

dotnet-policy-service · 2026-02-12T00:57:26Z

Tagging subscribers to this area: @agocke
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR replaces HashMap async-mode protection that relied on per-operation COOP GC transitions and GC-time cleanup with an Epoch-Based Reclamation (EBR) mechanism to avoid lock-ordering deadlocks (notably involving DebuggerController vs ThreadStore/GC transitions).

Changes:

- Introduces a new EBR implementation (EbrCollector + EbrCriticalRegionHolder) and a global collector for HashMap async mode (g_HashMapEbr).
- Updates HashMap async call sites to use EBR critical regions and queues obsolete bucket arrays for deferred deletion via EBR.
- Removes the HashMap-specific deferred cleanup path from SyncClean and adds new Crst types for EBR internal locks.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/coreclr/vm/syncclean.hpp | Removes HashMap cleanup surface from `SyncClean`. |
| src/coreclr/vm/syncclean.cpp | Removes HashMap obsolete-bucket list tracking and GC-time deletion. |
| src/coreclr/vm/hash.h | Removes `NextObsolete` helper from the header. |
| src/coreclr/vm/hash.cpp | Adds EBR critical region usage and EBR-based deferred deletion for obsolete buckets. |
| src/coreclr/vm/ebr.h | Adds public EBR APIs (`EbrCollector`, `EbrCriticalRegionHolder`) and global collector declaration. |
| src/coreclr/vm/ebr.cpp | Implements the EBR collector, per-thread tracking, and deferred deletion queues. |
| src/coreclr/vm/ceemain.cpp | Initializes/shuts down the global HashMap EBR collector during runtime startup/shutdown. |
| src/coreclr/vm/CMakeLists.txt | Adds EBR sources/headers to the VM build. |
| src/coreclr/inc/crsttypes_generated.h | Adds new `CrstEbrPending` / `CrstEbrThreadList` types and metadata. |
| src/coreclr/inc/CrstTypes.def | Declares new EBR Crst types. |

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

noahfalk · 2026-02-12T05:39:25Z

src/coreclr/vm/ebr.cpp

+}
+
+void
+EbrCollector::Shutdown()


Do we need this? Our shutdown strategy in many parts of the runtime is to let everything leak and let the OS clean it up at process shutdown.

- QueueForDeletion: leak the object on OOM instead of deleting it immediately, because immediate deletion could cause a use-after-free for concurrent EBR readers. Track the leaked count via an InterlockedIncrement counter.
- Rehash: read the obsolete bucket size directly from the allocation base instead of calling GetSize with the wrong pointer (undefined behavior).
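The leak-on-OOM choice can be sketched as follows. This is a hedged illustration rather than the PR's implementation: `SketchPendingQueue`, `SketchPendingEntry`, and the injectable allocator are hypothetical, the queue is single-threaded for brevity (the commit list mentions a CrstEbrPending lock), and `std::atomic::fetch_add` stands in for InterlockedIncrement.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-ins; the real pending entry and queue live in ebr.cpp.
struct SketchPendingEntry
{
    void* object = nullptr;
    SketchPendingEntry* next = nullptr;
};

// Allocation is injectable only so the failure path can be exercised.
using SketchAllocFn = SketchPendingEntry* (*)();
inline SketchPendingEntry* SketchAlloc() { return new (std::nothrow) SketchPendingEntry(); }
inline SketchPendingEntry* SketchAllocFails() { return nullptr; }

class SketchPendingQueue final
{
public:
    explicit SketchPendingQueue(SketchAllocFn alloc = SketchAlloc) : m_alloc(alloc) {}

    ~SketchPendingQueue()
    {
        // Frees only the bookkeeping nodes; the sketch does not own the objects.
        while (m_head != nullptr)
        {
            SketchPendingEntry* next = m_head->next;
            delete m_head;
            m_head = next;
        }
    }

    // Returns false when the object had to be leaked.
    // Not thread-safe: a real queue would take a lock around the list update.
    bool QueueForDeletion(void* object)
    {
        SketchPendingEntry* entry = m_alloc();
        if (entry == nullptr)
        {
            // Deleting now could free memory a concurrent reader still uses,
            // so the object is intentionally leaked and only counted.
            m_leaked.fetch_add(1, std::memory_order_relaxed);
            return false;
        }
        entry->object = object;
        entry->next = m_head;
        m_head = entry;
        ++m_pendingCount;
        return true;
    }

    size_t PendingCount() const { return m_pendingCount; }
    size_t LeakedCount() const { return m_leaked.load(std::memory_order_relaxed); }

private:
    SketchAllocFn m_alloc;
    SketchPendingEntry* m_head = nullptr;
    size_t m_pendingCount = 0;
    std::atomic<size_t> m_leaked{ 0 };
};
```

The trade-off is a bounded memory leak under allocation failure in exchange for never freeing memory that a reader inside a critical region may still be traversing.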

- Shutdown: early-return if !m_initialized instead of asserting
- Buckets()/Rehash(): simplify assert to !m_fAsyncMode || InCriticalRegion()
- LookupValue: remove GC thread exclusion from EBR critical region
- Comment fixes in InsertValue and Rehash deferred deletion

Add EbrCollector::ThreadDetach() to unlink and free per-thread EBR data. Call it from ThreadDetaching() in corhost.cpp, following the existing StressLog::ThreadDetach() pattern. This prevents unbounded growth of the EBR thread list in processes with short-lived threads.
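A minimal sketch of the unlink step, assuming an intrusive singly linked list guarded by a mutex. `SketchThreadList` and `SketchThreadRecord` are hypothetical names, and `std::mutex` stands in for the CrstEbrThreadList lock; the actual list layout in ebr.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical per-thread record linked into the collector's list.
struct SketchThreadRecord
{
    SketchThreadRecord* next = nullptr;
    bool linked = false;
};

class SketchThreadList final
{
public:
    // Link a record the first time its thread uses the collector.
    void Attach(SketchThreadRecord* record)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        if (record->linked)
            return;
        record->next = m_head;
        m_head = record;
        record->linked = true;
    }

    // Unlink a record when its thread exits, so the list does not grow
    // with every short-lived thread. Safe to call more than once.
    void Detach(SketchThreadRecord* record)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        if (!record->linked)
            return;
        SketchThreadRecord** link = &m_head;
        while (*link != nullptr && *link != record)
            link = &(*link)->next;
        if (*link == record)
            *link = record->next;
        *record = {}; // reset to the unlinked state
    }

    size_t Count()
    {
        std::lock_guard<std::mutex> lock(m_lock);
        size_t n = 0;
        for (SketchThreadRecord* r = m_head; r != nullptr; r = r->next)
            ++n;
        return n;
    }

private:
    std::mutex m_lock;
    SketchThreadRecord* m_head = nullptr;
};
```

Without the detach step, every scan for a quiescent state would walk records belonging to threads that no longer exist.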

Replace thread_local EbrThreadData* with thread_local EbrThreadData value, eliminating the OOM failure path in GetOrCreateThreadData(). This removes the risk of null dereference in ExitCriticalRegion() when the RAII holder unwinds after a failed EnterCriticalRegion(). Shutdown and ThreadDetach now clear the data with = {} instead of deleting heap memory.
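The difference between the pointer and value forms can be sketched like this. The struct fields, the nesting counter, and the function names are assumptions for illustration; only the idea of a `thread_local` value that is reset with `= {}` comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread record; field names are assumptions.
struct SketchThreadData
{
    uint32_t criticalDepth = 0; // assumed nesting counter
    bool registered = false;
};

// Stored by value: the storage exists for every thread without any
// allocation, so there is no "failed to create" state to handle.
thread_local SketchThreadData t_sketchData;

inline void SketchEnterCriticalRegion()
{
    t_sketchData.registered = true; // first use; nothing here can fail
    ++t_sketchData.criticalDepth;
}

inline void SketchExitCriticalRegion()
{
    // Always operates on valid storage, even while unwinding.
    assert(t_sketchData.criticalDepth > 0);
    --t_sketchData.criticalDepth;
}

inline bool SketchInCriticalRegion() { return t_sketchData.criticalDepth > 0; }

// Detach resets the value in place instead of deleting heap memory.
inline void SketchThreadDetach() { t_sketchData = {}; }
```

With a `thread_local` pointer, the exit path would have to tolerate a null pointer whenever the enter path failed to allocate; with a value there is no such case.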

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

AaronRobinsonMSFT · 2026-02-14T01:11:53Z

@jkotas and @noahfalk The following comment looks suspicious in this new world. This is the only COOP transition we have on this thread and the comment, to me, seems now out of date.

runtime/src/coreclr/vm/readytoruninfo.cpp, lines 385 to 405 in c1cb909:

    void ReadyToRunInfo::SetMethodDescForEntryPointInNativeImage(PCODE entryPoint, MethodDesc *methodDesc)
    {
        CONTRACTL
        {
            PRECONDITION(!m_isComponentAssembly);
        }
        CONTRACTL_END;

        // We are entering coop mode here so that we don't do it later inside LookupMap while we are already holding the Crst.
        // Doing it in the other order can block the debugger from running func-evals. For example thread A would acquire the Crst,
        // then block at the coop transition inside LookupMap waiting for the debugger to resume from a break state. The debugger then
        // requests thread B to run a funceval, the funceval tries to load some R2R method calling in here, then it blocks because
        // thread A is holding the Crst.
        GCX_COOP();
        CrstHolder ch(&m_Crst);

        if ((TADDR)m_entryPointToMethodDescMap.LookupValue(PCODEToPINSTR(entryPoint), (LPVOID)PCODEToPINSTR(entryPoint)) == (TADDR)INVALIDENTRY)
        {
            m_entryPointToMethodDescMap.InsertValue(PCODEToPINSTR(entryPoint), methodDesc);
        }
    }

AaronRobinsonMSFT · 2026-02-14T01:17:36Z

> @jkotas and @noahfalk The following comment looks suspicious in this new world. This is the only COOP transition we have on this thread and the comment, to me, seems now out of date.

I don't think we need the transitions to COOP mode any longer. The code can stay as-is, but the critical section can be updated and the COOP transition removed. I think so, anyway.

AaronRobinsonMSFT · 2026-02-14T01:19:34Z

I then have a follow-up question on why ReadyToRunInfo::GetMethodDescForEntryPointInNativeImage doesn't have the critical section for lookup?

noahfalk · 2026-02-14T01:31:20Z

> I don't think we need the transitions to COOP mode any longer. The code can stay as-is, but the critical section can be updated and the COOP transition removed. I think so, anyway.

Your analysis sounds right to me. (Although not sure what you meant about updating the critical section?)

> I then have a follow-up question on why ReadyToRunInfo::GetMethodDescForEntryPointInNativeImage doesn't have the critical section for lookup?

The lock appears designed to prevent concurrent writes. The HashMap supports concurrent reads or concurrent read with write, but not multiple concurrent writes.
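The "lock the writers, leave the readers lock-free" arrangement described here can be sketched with a small fixed-size table. This is not the runtime's HashMap: `SketchMap`, its linear probing, the reserved key 0, and the fixed capacity are assumptions made to keep the example self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical table: lock-free readers, writers serialized by a mutex.
class SketchMap final
{
public:
    // Readers take no lock. Key 0 is reserved to mean "empty slot".
    bool Lookup(uintptr_t key, uintptr_t& value) const
    {
        for (size_t i = 0; i < Capacity; ++i)
        {
            const Slot& s = m_slots[(key + i) % Capacity];
            uintptr_t k = s.key.load(std::memory_order_acquire);
            if (k == 0)
                return false; // end of the probe sequence
            if (k == key)
            {
                value = s.value.load(std::memory_order_relaxed);
                return true;
            }
        }
        return false;
    }

    // Writers are serialized, mirroring a lookup-then-insert under a lock.
    // Returns false if the key is already present or the table is full.
    bool InsertIfAbsent(uintptr_t key, uintptr_t value)
    {
        std::lock_guard<std::mutex> lock(m_writerLock);
        uintptr_t existing = 0;
        if (Lookup(key, existing))
            return false;
        for (size_t i = 0; i < Capacity; ++i)
        {
            Slot& s = m_slots[(key + i) % Capacity];
            if (s.key.load(std::memory_order_relaxed) == 0)
            {
                // Publish the value first, then the key with release, so a
                // reader that sees the key also sees the value.
                s.value.store(value, std::memory_order_relaxed);
                s.key.store(key, std::memory_order_release);
                return true;
            }
        }
        return false;
    }

private:
    static constexpr size_t Capacity = 16;
    struct Slot
    {
        std::atomic<uintptr_t> key{ 0 };
        std::atomic<uintptr_t> value{ 0 };
    };

    Slot m_slots[Capacity];
    std::mutex m_writerLock;
};
```

In this sketch a pure lookup needs no lock because each slot is published with a single release store, while two unsynchronized writers could both claim the same empty slot, which is why insertion holds the mutex.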

jkotas · 2026-02-14T05:57:05Z

src/coreclr/vm/ceemain.cpp

 #ifdef LOGGING
                ShutdownLogging();
 #endif
+                // Shutdown EBR before GC heap to ensure all deferred deletions are drained.


Why is it not ok to leak everything just like we leak most other things?

jkotas · 2026-02-14T06:02:33Z

src/coreclr/vm/ebr.cpp

+    EbrPendingEntry* pEntry = new (nothrow) EbrPendingEntry();
+    if (pEntry == nullptr)
+    {
+        // If we can't allocate, we must not delete pObject immediately, because


QueueForDeletion is called during Rehash. Can we handle this gracefully and fail the rehash with OOM exception when we are not able to queue the old memory block here?
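One way to read this suggestion is to reserve everything that can fail before the new bucket array is published, so that an allocation failure surfaces as an exception while the table is still unchanged. The sketch below shows that ordering only; `SketchTable`, `SketchPendingNode`, and the `failNextNodeAlloc` test hook are hypothetical, and nothing here reflects how hash.cpp or ebr.cpp were actually changed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical node that keeps a retired bucket array alive until reclaimed.
struct SketchPendingNode
{
    std::unique_ptr<std::vector<int>> retired;
};

class SketchTable final
{
public:
    bool failNextNodeAlloc = false; // test hook simulating OOM

    SketchTable() : m_buckets(std::make_unique<std::vector<int>>(4)) {}

    void Rehash()
    {
        // Step 1: do every allocation that can fail before changing state.
        if (failNextNodeAlloc)
        {
            failNextNodeAlloc = false;
            throw std::bad_alloc();
        }
        auto node = std::make_unique<SketchPendingNode>();
        auto bigger = std::make_unique<std::vector<int>>(m_buckets->size() * 2);
        m_pending.reserve(m_pending.size() + 1);

        // Step 2: nothing below can throw, so the old array is always queued.
        node->retired = std::move(m_buckets);
        m_buckets = std::move(bigger);
        m_pending.push_back(std::move(node));
    }

    size_t BucketCount() const { return m_buckets->size(); }
    size_t PendingCount() const { return m_pending.size(); }

private:
    std::unique_ptr<std::vector<int>> m_buckets;
    std::vector<std::unique_ptr<SketchPendingNode>> m_pending;
};
```

With this ordering a failed rehash leaves the table at its previous size and nothing is leaked, at the cost of the caller having to handle the exception.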

AaronRobinsonMSFT added 3 commits February 11, 2026 15:19

Refactor EbrPendingEntry structure: move definition from ebr.h to ebr…

73168b4

….cpp

AaronRobinsonMSFT added this to the 11.0.0 milestone Feb 12, 2026

AaronRobinsonMSFT requested review from jkotas and noahfalk February 12, 2026 00:56

AaronRobinsonMSFT added the area-VM-coreclr label Feb 12, 2026

Copilot AI review requested due to automatic review settings February 12, 2026 00:56

dotnet-policy-service bot assigned AaronRobinsonMSFT Feb 12, 2026

github-project-automation bot added this to AppModel Feb 12, 2026

Copilot started reviewing on behalf of AaronRobinsonMSFT February 12, 2026 00:57

Copilot AI reviewed Feb 12, 2026

Update src/coreclr/vm/hash.cpp

d6acf4e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings February 12, 2026 01:37

Copilot started reviewing on behalf of AaronRobinsonMSFT February 12, 2026 01:38

Copilot AI reviewed Feb 12, 2026

build-analysis bot mentioned this pull request Feb 12, 2026

Cannot find 'arm64-v8a' device dotnet/dnceng#2284

noahfalk reviewed Feb 12, 2026

AaronRobinsonMSFT added 4 commits February 12, 2026 16:35

build-analysis bot mentioned this pull request Feb 13, 2026

"We stopped hearing from agent Azure Pipelines 32. Verify the agent machine is running and has a healthy network connection" dotnet/dnceng#1886

Fix contracts

6aa1f2c

Copilot AI review requested due to automatic review settings February 13, 2026 22:58

Copilot started reviewing on behalf of AaronRobinsonMSFT February 13, 2026 22:58

Copilot AI reviewed Feb 13, 2026

jkotas reviewed Feb 13, 2026

jkotas reviewed Feb 14, 2026

3 participants
