Skip to content

Replace HashMap COOP transitions with Epoch-Based Reclamation (EBR)#124307

Draft
AaronRobinsonMSFT wants to merge 9 commits intodotnet:mainfrom
AaronRobinsonMSFT:ebr-hashmap
Draft

Replace HashMap COOP transitions with Epoch-Based Reclamation (EBR)#124307
AaronRobinsonMSFT wants to merge 9 commits intodotnet:mainfrom
AaronRobinsonMSFT:ebr-hashmap

Conversation

@AaronRobinsonMSFT
Copy link
Member

HashMap's async mode used GCX_MAYBE_COOP_NO_THREAD_BROKEN to transition
into cooperative GC mode on every operation, preventing the GC from
freeing obsolete bucket arrays mid-read. Old bucket arrays were queued
via SyncClean::AddHashMap and freed during GC pauses.

This caused a deadlock: when HashMap::LookupValue() was called while
holding the DebuggerController lock, the COOP transition (which is
level-equivalent to taking the ThreadStore lock) violated lock ordering
constraints, since ThreadStore must be acquired before DebuggerController.

Replace both mechanisms with Epoch-Based Reclamation (EBR), based on
Fraser's algorithm from 'Practical Lock-Freedom' (UCAM-CL-TR-579):

  • EnterCriticalRegion/ExitCriticalRegion are simple atomic flag stores
    with memory barriers -- they never block or trigger GC transitions
  • Obsolete bucket arrays are queued for deferred deletion and freed
    once all threads have passed through a quiescent state
  • An RAII holder (EbrCriticalRegionHolder) replaces GCX_MAYBE_COOP
    at all 6 call sites in hash.cpp

HashMap's async mode used GCX_MAYBE_COOP_NO_THREAD_BROKEN to transition
into cooperative GC mode on every operation, preventing the GC from
freeing obsolete bucket arrays mid-read. Old bucket arrays were queued
via SyncClean::AddHashMap and freed during GC pauses.

This caused a deadlock: when HashMap::LookupValue() was called while
holding the DebuggerController lock, the COOP transition (which is
level-equivalent to taking the ThreadStore lock) violated lock ordering
constraints, since ThreadStore must be acquired before DebuggerController.

Replace both mechanisms with Epoch-Based Reclamation (EBR), based on
Fraser's algorithm from 'Practical Lock-Freedom' (UCAM-CL-TR-579):

- EnterCriticalRegion/ExitCriticalRegion are simple atomic flag stores
  with memory barriers -- they never block or trigger GC transitions
- Obsolete bucket arrays are queued for deferred deletion and freed
  once all threads have passed through a quiescent state
- An RAII holder (EbrCriticalRegionHolder) replaces GCX_MAYBE_COOP
  at all 6 call sites in hash.cpp

Changes:
- New: src/coreclr/vm/ebr.h, ebr.cpp (EbrCollector, ~340 lines)
- hash.cpp: Replace 6 GCX_MAYBE_COOP_NO_THREAD_BROKEN with EBR holders,
  replace SyncClean::AddHashMap with QueueForDeletion
- syncclean.hpp/cpp: Remove HashMap-related members and cleanup code
- ceemain.cpp: Init g_HashMapEbr at startup, shutdown at EE shutdown
- CrstTypes.def: Add CrstEbrThreadList, CrstEbrPending
- crsttypes_generated.h: Regenerated with new Crst types
- CMakeLists.txt: Add ebr.cpp, ebr.h to build
- Rename memoryBudget/m_pendingSize to memoryBudgetInBytes/m_pendingSizeInBytes
- Mark EbrCollector and EbrCriticalRegionHolder as final
- Delete move constructors/assignment operators
- Move NextObsolete from hash.h (public) to hash.cpp (file-static)
- Reuse DeleteObsoleteBuckets for sync-mode path in Rehash
- Trim redundant backstory comments at EBR call sites
- Remove unused forward decls from syncclean.hpp
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @agocke
See info in area-owners.md if you want to be subscribed.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces HashMap async-mode protection that relied on per-operation COOP GC transitions and GC-time cleanup with an Epoch-Based Reclamation (EBR) mechanism to avoid lock-ordering deadlocks (notably involving DebuggerController vs ThreadStore/GC transitions).

Changes:

  • Introduces a new EBR implementation (EbrCollector + EbrCriticalRegionHolder) and a global collector for HashMap async mode (g_HashMapEbr).
  • Updates HashMap async call sites to use EBR critical regions and queues obsolete bucket arrays for deferred deletion via EBR.
  • Removes the HashMap-specific deferred cleanup path from SyncClean and adds new Crst types for EBR internal locks.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/coreclr/vm/syncclean.hpp Removes HashMap cleanup surface from SyncClean.
src/coreclr/vm/syncclean.cpp Removes HashMap obsolete-bucket list tracking and GC-time deletion.
src/coreclr/vm/hash.h Removes NextObsolete helper from the header.
src/coreclr/vm/hash.cpp Adds EBR critical region usage and EBR-based deferred deletion for obsolete buckets.
src/coreclr/vm/ebr.h Adds public EBR APIs (EbrCollector, EbrCriticalRegionHolder) and global collector declaration.
src/coreclr/vm/ebr.cpp Implements the EBR collector, per-thread tracking, and deferred deletion queues.
src/coreclr/vm/ceemain.cpp Initializes/shuts down the global HashMap EBR collector during runtime startup/shutdown.
src/coreclr/vm/CMakeLists.txt Adds EBR sources/headers to the VM build.
src/coreclr/inc/crsttypes_generated.h Adds new CrstEbrPending / CrstEbrThreadList types and metadata.
src/coreclr/inc/CrstTypes.def Declares new EBR Crst types.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 12, 2026 01:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

}

void
EbrCollector::Shutdown()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Our shutdown strategy in many parts of the runtime is to let everything leak and let the OS clean it up at process shutdown.

- QueueForDeletion: leak object on OOM instead of immediate deletion,
  which could cause use-after-free for concurrent EBR readers. Track
  leaked count via InterlockedIncrement counter.
- Rehash: read obsolete bucket size directly from allocation base
  instead of calling GetSize with wrong pointer (undefined behavior).
- Shutdown: early-return if !m_initialized instead of asserting
- Buckets()/Rehash(): simplify assert to !m_fAsyncMode || InCriticalRegion()
- LookupValue: remove GC thread exclusion from EBR critical region
- Comment fixes in InsertValue and Rehash deferred deletion
Add EbrCollector::ThreadDetach() to unlink and free per-thread EBR
data. Call it from ThreadDetaching() in corhost.cpp, following the
existing StressLog::ThreadDetach() pattern. This prevents unbounded
growth of the EBR thread list in processes with short-lived threads.
Replace thread_local EbrThreadData* with thread_local EbrThreadData
value, eliminating the OOM failure path in GetOrCreateThreadData().
This removes the risk of null dereference in ExitCriticalRegion()
when the RAII holder unwinds after a failed EnterCriticalRegion().
Shutdown and ThreadDetach now clear the data with = {} instead of
deleting heap memory.
Copilot AI review requested due to automatic review settings February 13, 2026 22:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

@AaronRobinsonMSFT
Copy link
Member Author

@jkotas and @noahfalk The following comment looks suspicious in this new world. This is the only COOP transition we have on this thread and the comment, to me, seems now out of date.

void ReadyToRunInfo::SetMethodDescForEntryPointInNativeImage(PCODE entryPoint, MethodDesc *methodDesc)
{
CONTRACTL
{
PRECONDITION(!m_isComponentAssembly);
}
CONTRACTL_END;
// We are entering coop mode here so that we don't do it later inside LookupMap while we are already holding the Crst.
// Doing it in the other order can block the debugger from running func-evals. For example thread A would acquire the Crst,
// then block at the coop transition inside LookupMap waiting for the debugger to resume from a break state. The debugger then
// requests thread B to run a funceval, the funceval tries to load some R2R method calling in here, then it blocks because
// thread A is holding the Crst.
GCX_COOP();
CrstHolder ch(&m_Crst);
if ((TADDR)m_entryPointToMethodDescMap.LookupValue(PCODEToPINSTR(entryPoint), (LPVOID)PCODEToPINSTR(entryPoint)) == (TADDR)INVALIDENTRY)
{
m_entryPointToMethodDescMap.InsertValue(PCODEToPINSTR(entryPoint), methodDesc);
}
}

@AaronRobinsonMSFT
Copy link
Member Author

@jkotas and @noahfalk The following comment looks suspicious in this new world. This is the only COOP transition we have on this thread and the comment, to me, seems now out of date.

I don't think we need the transitions to COOP mode any longer. The code can stay as-is, but the critical section can be be updated and the COOP transition removed. I think any ways.

@AaronRobinsonMSFT
Copy link
Member Author

I then have a follow-up question on why ReadyToRunInfo::GetMethodDescForEntryPointInNativeImage doesn't have the critical section for lookup?

@noahfalk
Copy link
Member

noahfalk commented Feb 14, 2026

I don't think we need the transitions to COOP mode any longer. The code can stay as-is, but the critical section can be be updated and the COOP transition removed. I think any ways.

Your analysis sounds right to me. (Although not sure what you meant about updating the critical section?)

I then have a follow-up question on why ReadyToRunInfo::GetMethodDescForEntryPointInNativeImage doesn't have the critical section for lookup?

The lock appears designed to prevent concurrent writes. The HashMap supports concurrent reads or concurrent read with write, but not multiple concurrent writes.

#ifdef LOGGING
ShutdownLogging();
#endif
// Shutdown EBR before GC heap to ensure all deferred deletions are drained.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it not ok to leak everything just like we leak most other things?

EbrPendingEntry* pEntry = new (nothrow) EbrPendingEntry();
if (pEntry == nullptr)
{
// If we can't allocate, we must not delete pObject immediately, because
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QueueForDeletion is called during Rehash. Can we handle this gracefully and fail the rehash with OOM exception when we are not able to queue the old memory block here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants