The bug
A ResolveCacheElem in the Virtual Stub Dispatch DispatchCache ends up with pNext pointing
at itself. The next DispatchCache::Insert walks that collision chain forever while
holding the cache Crst, so every other thread that needs a VSD resolve blocks on that lock →
whole-process hang. The runtime under test already contains #128868, so that change does not fix this hang.
Evidence
Confirmed at instruction level in two independent, symbolized x86_64 cores from separate CI
builds and two different managed workloads.
Spinning thread (the only thread running CLR code):
DispatchCache::Insert + 208 ; dedup pre-scan, looping on rax = rax->pNext (offset 0x18)
VirtualCallStubManager::ResolveWorker + 2526
VSD_ResolveWorker + 752
ResolveWorkerAsmStub + 113
Corrupt node 0x10d71a2e0: pMT=0x1131be660, token=0x21a00000006, target=0x4227a0e1c8,
pNext=0x10d71a2e0 (== self). The end sentinel is never reached and the inserting thread's
key never matches, so the walk loops forever. Mapping (pMT, token -> target) is valid —
only pNext is corrupt. No crash.
Lock cascade: the spinning thread holds the cache Crst; many other threads are stuck at
__psynch_mutexwait -> CrstBase::Enter -> DispatchCache::Insert + 32
<- VirtualCallStubManager::ResolveWorker (dispatch miss)
<- MethodTable::TryResolveConstraintMethodApprox <- CEEInfo::getCallInfo (JIT)
A full per-bucket census of the other core showed exactly one cycle among 4096 buckets — a
single isolated 1-node self-loop at a bucket head — with 147/164 threads blocked on the lock.
The only writer of pNext is Insert's prepend (elem->pNext = entries[hash]; entries[hash] = elem), so a self-loop means entries[hash] already equaled the elem being prepended — i.e.
ResolveCacheElem address reuse / double-publish at the bucket head. The producing store is not
captured in-the-act (the dumps show the aftermath only).
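The prepend argument above can be reproduced in miniature. The following is a minimal sketch, not the runtime's code: `Elem`, `Prepend`, and `ReachesSentinel` are hypothetical stand-ins, and the field order is chosen only so that `pNext` lands at offset 0x18 on a 64-bit target, as in the decoded node. It shows that prepending an element that is already the bucket head leaves `pNext == self`, after which the end sentinel is unreachable.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for ResolveCacheElem (illustrative layout only).
struct Elem {
    void*     pMT;
    uintptr_t token;
    void*     target;
    Elem*     pNext;
};

// Miniature of the prepend quoted in the issue:
//   elem->pNext = entries[hash]; entries[hash] = elem;
inline void Prepend(Elem*& head, Elem* elem) {
    elem->pNext = head;  // if head == elem, this stores elem->pNext = elem
    head = elem;
}

// Returns true if the chain from head reaches the end sentinel within
// maxSteps hops. A self-looped chain never does, so this returns false
// instead of spinning the way an unbounded walk would.
inline bool ReachesSentinel(const Elem* head, const Elem* end, size_t maxSteps) {
    const Elem* cur = head;
    for (size_t i = 0; i < maxSteps; ++i) {
        if (cur == end) return true;
        cur = cur->pNext;
    }
    return cur == end;
}
```

Calling `Prepend` twice with the same element on the same bucket is enough to produce the state seen in the dumps. The sketch says nothing about how the real runtime reaches that state.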
Suggestion
Bounded walk + cycle guard in DispatchCache::Insert/Lookup and the asm chain-walk: detect
a self-reference / over-length chain during the pNext traversal and treat it as a miss
(re-resolve) or reset the bucket. This eliminates the hang regardless of the root producer.
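One possible shape for such a guard is sketched below. This is an illustration under stated assumptions, not a patch against `vm/virtualcallstub.cpp`: `Elem`, `WalkResult`, `BoundedFind`, and the bound `kMaxChain` are all hypothetical names and values. The walk reports a self-reference or an over-length chain as `Corrupt` so the caller can treat it as a miss or reset the bucket.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for ResolveCacheElem (illustrative layout only).
struct Elem {
    void*     pMT;
    uintptr_t token;
    void*     target;
    Elem*     pNext;
};

enum class WalkResult { Hit, Miss, Corrupt };

// Assumed upper bound on a healthy collision chain; not a runtime value.
constexpr size_t kMaxChain = 64;

// Walks the chain from head until the end sentinel, looking for (pMT, token).
// Corruption is checked before the key so that a damaged chain is always
// reported, even when the damaged node would have matched.
inline WalkResult BoundedFind(Elem* head, const Elem* end,
                              void* pMT, uintptr_t token, Elem** found) {
    size_t hops = 0;
    for (Elem* cur = head; cur != end; cur = cur->pNext) {
        if (cur->pNext == cur) return WalkResult::Corrupt;   // 1-node self-loop
        if (++hops > kMaxChain) return WalkResult::Corrupt;  // longer cycle or runaway chain
        if (cur->pMT == pMT && cur->token == token) {
            *found = cur;
            return WalkResult::Hit;
        }
    }
    return WalkResult::Miss;
}
```

The self-reference test catches the exact pattern in the dumps after one hop. The hop bound is a cheaper alternative to full cycle detection and also covers multi-node cycles, at the cost of choosing a limit.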
To pinpoint the producer, add a checked-build assert in Insert that entries[hash] != elem
before the prepend; a checked CI run would then trap the producing store.
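A sketch of that assert, using the standard `assert` as a stand-in for whatever checked-build macro the runtime would use. `Elem`, `IsFreshForBucket`, and `CheckedPrepend` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ResolveCacheElem (illustrative layout only).
struct Elem {
    void*     pMT;
    uintptr_t token;
    void*     target;
    Elem*     pNext;
};

// The precondition the issue proposes asserting: the element being
// prepended must not already be the bucket head.
inline bool IsFreshForBucket(const Elem* head, const Elem* elem) {
    return head != elem;
}

// Prepend with the precondition checked. In a build with asserts enabled,
// this stops at the store that would create the self-loop.
inline void CheckedPrepend(Elem*& head, Elem* elem) {
    assert(IsFreshForBucket(head, elem) && "elem is already the bucket head");
    elem->pNext = head;
    head = elem;
}
```

The check only covers the 1-node self-loop at the bucket head, which is the pattern both dumps show.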
Caveats
- Both dumps are x86_64 (the arm64 leg's dotnet ran under Rosetta 2). No native-arm64 hang was observed, so no cross-arch claim is made.
- Source: vm/virtualcallstub.cpp (DispatchCache::Insert/Lookup), VirtualCallStubAMD64.asm.
- Cores, full per-bucket census, symbolized stacks, and node decode available on request.
Known Issue Error Message
DO NOT USE JSON BELOW IF THIS IS A BUILD BREAK otherwise build analysis will allow pull requests to merge that break the build worse. For a build break, do not use this issue form. Make a regular new issue.
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "",
"ErrorPattern": "",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| 0 | 0 | 0 |