VSD DispatchCache ResolveCacheElem self-loop (pNext == self) hangs the process on osx-x64 — #128868 present but insufficient #128955

@wtgodbe

Description

The bug

A ResolveCacheElem in the Virtual Stub Dispatch DispatchCache ends up with pNext pointing
at itself. The next DispatchCache::Insert walks that collision chain forever while
holding the cache Crst, so every other thread that needs a VSD resolve blocks on that lock
and the whole process hangs. The runtime under test already contains #128868, so #128868
alone does not fix it.

Evidence

Confirmed at instruction level in two independent, symbolized x86_64 cores from separate CI
builds and two different managed workloads.

Spinning thread (the only thread running CLR code):

DispatchCache::Insert + 208           ; dedup pre-scan, looping on rax = rax->pNext (offset 0x18)
VirtualCallStubManager::ResolveWorker + 2526
VSD_ResolveWorker + 752
ResolveWorkerAsmStub + 113

Corrupt node 0x10d71a2e0: pMT=0x1131be660, token=0x21a00000006, target=0x4227a0e1c8,
pNext=0x10d71a2e0 (== self). The end sentinel is never reached and the inserting thread's
key never matches, so the walk loops forever. The mapping (pMT, token -> target) is valid;
only pNext is corrupt. The process does not crash.

Lock cascade: the spinning thread holds the cache Crst; many other threads are stuck at

__psynch_mutexwait -> CrstBase::Enter -> DispatchCache::Insert + 32
  <- VirtualCallStubManager::ResolveWorker            (dispatch miss)
  <- MethodTable::TryResolveConstraintMethodApprox <- CEEInfo::getCallInfo   (JIT)

A full per-bucket census of the other core showed exactly one cycle among 4096 buckets — a
single isolated 1-node self-loop at a bucket head — with 147/164 threads blocked on the lock.
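A census of this kind can be sketched with Floyd's tortoise-and-hare check applied to each bucket. This is only an illustration over an in-memory copy of the bucket array, not the tooling used on the cores; `Node`, `ChainHasCycle`, `CountCyclicBuckets`, and the explicit `sentinel` parameter are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical minimal node: only the link field matters for the census.
struct Node {
    Node* pNext;
};

// Floyd cycle check for one chain that is expected to end at `sentinel`.
bool ChainHasCycle(const Node* head, const Node* sentinel) {
    const Node* slow = head;
    const Node* fast = head;
    while (fast != sentinel && fast != nullptr) {
        fast = fast->pNext;                       // hare: first step
        if (fast == sentinel || fast == nullptr) return false;
        fast = fast->pNext;                       // hare: second step
        slow = slow->pNext;                       // tortoise: one step
        if (fast == slow) return true;            // met inside a cycle
    }
    return false;
}

// Count buckets whose chain never reaches the sentinel.
std::size_t CountCyclicBuckets(const Node* const* entries, std::size_t bucketCount,
                               const Node* sentinel) {
    std::size_t cyclic = 0;
    for (std::size_t i = 0; i < bucketCount; ++i) {
        if (ChainHasCycle(entries[i], sentinel)) ++cyclic;
    }
    return cyclic;
}
```

A 1-node self-loop at a bucket head is caught on the first iteration, because the hare and the tortoise land on the same node immediately.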

The only writer of pNext is Insert's prepend (elem->pNext = entries[hash]; entries[hash] = elem),
so a self-loop means entries[hash] already equaled the elem being prepended, i.e.
ResolveCacheElem address reuse or double-publish at the bucket head. The producing store was
not captured in the act; the dumps show only the aftermath.
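The mechanism can be reproduced in a toy model. The sketch below is not the runtime's code: `Elem`, `ToyCache`, `Prepend`, and `IsSelfLoop` are simplified, hypothetical stand-ins that keep only the two stores quoted above, and the sentinel is an assumed end-of-chain marker.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for ResolveCacheElem; the real layout differs.
struct Elem {
    void*          pMT;
    std::uintptr_t token;
    void*          target;
    Elem*          pNext;
};

constexpr std::size_t kBuckets = 4096;

struct ToyCache {
    Elem  sentinel{};            // assumed end-of-chain marker
    Elem* entries[kBuckets];

    ToyCache() {
        for (auto& e : entries) e = &sentinel;   // every bucket starts empty
    }

    // The prepend as quoted in the issue text.
    void Prepend(std::size_t hash, Elem* elem) {
        elem->pNext = entries[hash];
        entries[hash] = elem;
    }
};

// True when a node links to itself.
bool IsSelfLoop(const Elem* e) { return e->pNext == e; }
```

Prepending the same address twice into one bucket yields the observed shape: the first call links the node to the sentinel, and the second overwrites that link with the node's own address.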

Suggestion

Bounded walk + cycle guard in DispatchCache::Insert/Lookup and the asm chain-walk: detect
a self-reference / over-length chain during the pNext traversal and treat it as a miss
(re-resolve) or reset the bucket. This eliminates the hang regardless of the root producer.
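A bounded walk along these lines might look like the sketch below. It is an assumption-laden illustration, not a patch: `Elem`, `WalkResult`, `BoundedFind`, and the bound `kMaxChainLength = 64` are hypothetical, and the caller decides whether `Corrupt` means "treat as a miss" or "reset the bucket".

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for ResolveCacheElem; the real layout differs.
struct Elem {
    const void*    pMT;
    std::uintptr_t token;
    void*          target;
    Elem*          pNext;
};

enum class WalkResult { Found, Miss, Corrupt };

// Hypothetical upper bound on a healthy collision chain.
constexpr std::size_t kMaxChainLength = 64;

// Walk one bucket chain, giving up on a self-reference or an over-length chain
// instead of spinning while the cache lock is held.
WalkResult BoundedFind(Elem* head, const Elem* sentinel,
                       const void* pMT, std::uintptr_t token, Elem** found) {
    std::size_t steps = 0;
    for (Elem* e = head; e != sentinel; e = e->pNext) {
        if (e == nullptr) return WalkResult::Corrupt;        // chain fell off the end
        if (e->pMT == pMT && e->token == token) {
            *found = e;
            return WalkResult::Found;
        }
        if (e->pNext == e || ++steps >= kMaxChainLength) {
            return WalkResult::Corrupt;                      // self-loop or longer cycle
        }
    }
    return WalkResult::Miss;
}
```

The self-reference test catches the 1-node loop seen in the dumps in a single step, while the step bound covers longer cycles that a pointer comparison alone would miss.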

To pinpoint the producer, add a checked-build assert in Insert that entries[hash] != elem
before the prepend; a checked CI run would then trap the producing store.
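As a rough sketch of that assert, assuming a simplified bucket array: `Elem`, `PrependChecked`, and the `CHECKED_ASSERT` macro are hypothetical, with the standard `assert` standing in for the runtime's checked-build assertion macro (`_ASSERTE` in the CoreCLR sources).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal node: only the link field matters here.
struct Elem {
    Elem* pNext = nullptr;
};

// Stand-in for the checked-build assertion; a real change would use the
// runtime's own macro rather than <cassert>.
#define CHECKED_ASSERT(cond) assert(cond)

// Prepend with the proposed precondition: the bucket head must not already
// be the element being inserted, otherwise the store below creates a self-loop.
void PrependChecked(Elem** entries, std::size_t hash, Elem* elem) {
    CHECKED_ASSERT(entries[hash] != elem);
    elem->pNext = entries[hash];
    entries[hash] = elem;
}
```

Placing the check immediately before the two stores means a failing run stops with the offending caller still on the stack, rather than at a later walk.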

Caveats

  • Both dumps are x86_64 (the arm64 leg's dotnet ran under Rosetta 2). No native-arm64 hang
    was observed, so no cross-arch claim is made.
  • Source: vm/virtualcallstub.cpp (DispatchCache::Insert/Lookup), VirtualCallStubAMD64.asm.
  • Cores, full per-bucket census, symbolized stacks, and node decode available on request.

Known Issue Error Message

DO NOT USE JSON BELOW IF THIS IS A BUILD BREAK otherwise build analysis will allow pull requests to merge that break the build worse. For a build break, do not use this issue form. Make a regular new issue.

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Report

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 0 | 0 |

Labels

  • area-VM-coreclr
  • blocking-clean-ci: Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
  • os-mac-os-x: macOS aka OSX
  • untriaged: New issue has not been triaged by the area owner
