Fix NativeAOT GC roots after universal transition #127640

Merged
MichalStrehovsky merged 4 commits into dotnet:main from MichalStrehovsky:fix-nativeaot-universal-transition-gc-root
May 2, 2026

@MichalStrehovsky
Member

When a GC stack walk starts from a hijacked universal-transition frame, the iterator unwinds through the thunk and then yields the managed caller at the post-call IP. That caller is not actually the active frame yet, so reporting scratch registers from its post-call GC info can expose stale thunk state.

In the failing System.Linq.Tests NativeAOT case, the precise GC root came from REGDISPLAY.pRax while CoffNativeCodeManager::EnumGcRefs was called with isActiveStackFrame=true. RAX contained the resolved interface dispatch target, System.Linq.Enumerable.Iterator<int>.System.Collections.IEnumerator.get_Current, so object validation treated a code pointer as a GC object and fail-fast asserted.

Clear ActiveStackFrame after unwinding the non-EH universal-transition thunk sequence so the yielded managed caller still reports its non-scratch roots and the conservative thunk range, but does not report scratch registers until the thunk has completed.
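
The effect of clearing the flag can be pictured with a toy model. This is only a sketch and not the runtime's StackFrameIterator code: the `Frame` class, the `walk` helper, and the `clear_active_after_thunk` switch are hypothetical names invented for illustration, and the model assumes the walk starts with the active flag set, as it does after a hijack.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    is_thunk: bool = False  # True for universal-transition thunk frames


def walk(frames, clear_active_after_thunk):
    """Yield (frame name, is_active) for each managed frame, innermost first.

    Hypothetical model: thunk frames are unwound through, never yielded.
    """
    active = True
    for frame in frames:
        if frame.is_thunk:
            # The caller reached after this unwind is still waiting for
            # its call to return, so it is not really the active frame.
            if clear_active_after_thunk:
                active = False
            continue
        yield frame.name, active
        active = False  # only the first yielded frame can be active


stack = [
    Frame("universal-transition thunk", is_thunk=True),
    Frame("CastIterator.MoveNext"),
    Frame("caller of MoveNext"),
]

before = list(walk(stack, clear_active_after_thunk=False))
after = list(walk(stack, clear_active_after_thunk=True))
print(before)  # MoveNext is (wrongly) treated as active
print(after)   # MoveNext is yielded as a non-active frame
```

In this model the only difference between the two walks is the flag on the first managed frame, which matches the intent that non-scratch roots are still reported while scratch registers are not.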

Validation: before the fix, the parallel System.Linq.Tests NativeAOT stress loop completed 69 runs with 63 successes and 6 fail-fast crashes; sampled dumps all showed the same pRax code-pointer root. After rebuilding with this fix, the same loop ran for 612.3 seconds at parallelism 4 and completed 132 runs with 132 successes, 0 crashes, and 0 test failures.
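
As a rough sanity check on these numbers, the following back-of-the-envelope calculation estimates how likely 132 clean runs would be if the pre-fix crash rate still applied. It assumes every run is independent and crashes with the observed pre-fix frequency, which is a simplification the PR itself does not claim.

```python
runs_before, crashes_before = 69, 6
runs_after = 132

# Observed pre-fix crash frequency per run.
crash_rate = crashes_before / runs_before

# Probability that all post-fix runs pass if nothing had changed
# (assumes independent runs at the same crash rate).
p_clean = (1 - crash_rate) ** runs_after

print(f"pre-fix crash rate: {crash_rate:.1%}")
print(f"chance of {runs_after} clean runs at that rate: {p_clean:.1e}")
```

Under that assumption the chance of seeing no crashes by luck is well under one in ten thousand, so the clean post-fix loop is unlikely to be a coincidence.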

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib
See info in area-owners.md if you want to be subscribed.

@MichalStrehovsky
Member Author

If anyone wants to follow at home, runfo get-helix-payload -j 969d2271-251c-445a-964a-cd8c8ec0ad6d -w System.Linq.Tests -o c:\hell\969d2271-251c-445a-964a-cd8c8ec0ad6d\System.Linq.Tests\

The fix is from GPT-5.5, the analysis is from Claude 4.7. I liked Claude's analysis better, but then it went into the weeds trying to come up with a fix. GPT-5.5 came up with the same root cause and had a good-looking fix.

Root cause

A NativeAOT GC return-address hijack on RhpCidResolve reports the resolver's IntPtr return value as if it were the interface method's object return value.

What the dump shows

  • The faulting instruction is mov eax, [rdi] in MethodTable::Validate+0x28 with rdi = d98b4820ec834850 (unmapped). NativeAOT's vectored handler converts this AV into a FailFast/Assert, hence the secondary stack you see.
  • The faulting address d98b4820ec834850 is not poisoned memory — it is the little-endian read of the bytes 50 48 83 ec 20 48 8b d9, which is the function prologue (push rax; sub rsp,20h; mov rbx,rcx) of Iterator_1<Int32>::IEnumerator.get_Current at 00007ff7d3377e90. So GC tried to dereference a code pointer as if it were the MethodTable of an object.
  • The "object" pointer (00007ff7d3377e90) was sourced from thread 4's stack at be07d1f0. That address is exactly probe_frame_SP + 0xC0, which is the saved-RAX slot in the GC probe frame built by PUSH_PROBE_FRAME in RhpWaitForGC.
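
The little-endian claim above is easy to check by hand. A minimal sketch that reinterprets the eight prologue bytes quoted in the dump as a 64-bit integer:

```python
# Prologue bytes of get_Current as listed in the dump analysis:
# push rax; sub rsp,20h; mov rbx,rcx
prologue = bytes.fromhex("50 48 83 ec 20 48 8b d9")

# Reading those 8 bytes as a little-endian pointer gives the value
# that ended up in rdi when MethodTable::Validate faulted.
as_pointer = int.from_bytes(prologue, byteorder="little")

print(f"{as_pointer:016x}")  # d98b4820ec834850
```

The result matches the faulting address, which is what identifies the bogus "MethodTable" as the first eight bytes of machine code.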

The chain of events

  1. Worker thread executing CastIterator<__Canon>.MoveNext reaches +0x8C: call qword ptr [r11] (interface dispatch to IEnumerator.get_Current).
  2. First call through this cell → RhpInterfaceDispatchSlow → jmp RhpUniversalTransitionTailCall → call r10 where r10 = RhpCidResolve (a managed function with [RuntimeExport], returning IntPtr).
  3. While RhpCidResolve is running, the suspending thread's GC starts hijacking. Thread::HijackReturnAddressWorker finds the topmost managed frame (RhpCidResolve) and overwrites its return address (which points back into the universal-transition stub, just after call r10) with RhpGcProbeHijack.
  4. RhpCidResolve returns the resolved code pointer (= &Iterator<Int32>::IEnumerator.get_Current = 00007ff7d3377e90) in rax, executes ret, and lands in RhpGcProbeHijack.
  5. RhpGcProbeHijack restores the original return address, sets r11d with PTFF_THREAD_HIJACK | PTFF_SAVE_RAX | …, and tail-jumps to RhpWaitForGC. PUSH_PROBE_FRAME saves all volatile regs, including the code-pointer-valued RAX, at probe_SP + 0xC0 (= be07d1f0).
  6. This very GC scans roots. Walking thread 4's stack: probe frame had PTFF_THREAD_HIJACK → the StackFrameIterator enters with ActiveStackFrame set; the first managed frame yielded is MoveNext at PC = +0x8F (the universal transition is dissolved by its hand-rolled unwind info).
  7. CoffNativeCodeManager::EnumGcRefs is called with isActiveStackFrame = true, so reportScratchSlots = true.
  8. From the JitGCDump output for that exact method: at offset 0x8F (right after call [r11]), Set state of slot 3 at instr offset 0x8f to Live — slot 3 = rax. The decoder dutifully reports rax.
  9. GetRegisterSlot(rax) resolves to the saved-RAX slot in the probe frame → value = 00007ff7d3377e90 (the resolver's code-pointer return value).
  10. GC Promote reads *pObj (= first 8 bytes of get_Current's code) → MT pointer = d98b4820ec834850 → MethodTable::Validate AV.
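
Steps 7 through 9 can be sketched as a small model of scratch-slot reporting. This is a simplification and not the real GC info decoder: `enum_gc_refs`, the `LIVE_FROM` table, and the `rbx` entry with its placeholder value are hypothetical, while the rax slot going live at offset 0x8f and the saved RAX value come from the dump described above. The scratch set is the Windows x64 volatile integer registers.

```python
# Windows x64 volatile (scratch) integer registers.
SCRATCH_REGS = {"rax", "rcx", "rdx", "r8", "r9", "r10", "r11"}

# Simplified liveness table: register -> code offset where its slot turns live.
# rax at 0x8f is from the JitGCDump output; rbx is a hypothetical
# non-scratch root added only to show the contrast.
LIVE_FROM = {"rbx": 0x10, "rax": 0x8F}


def enum_gc_refs(offset, saved_regs, is_active_stack_frame):
    """Return the register roots reported for a frame stopped at `offset`."""
    report_scratch_slots = is_active_stack_frame
    reported = {}
    for reg, live_from in LIVE_FROM.items():
        if offset < live_from:
            continue  # slot not live yet at this offset
        if reg in SCRATCH_REGS and not report_scratch_slots:
            continue  # scratch registers are only trusted on the active frame
        reported[reg] = saved_regs[reg]
    return reported


saved_regs = {
    "rax": 0x00007FF7D3377E90,  # code pointer returned by RhpCidResolve
    "rbx": 0x1000,              # placeholder value, hypothetical
}

as_active = enum_gc_refs(0x8F, saved_regs, is_active_stack_frame=True)
as_caller = enum_gc_refs(0x8F, saved_regs, is_active_stack_frame=False)
print({reg: hex(val) for reg, val in as_active.items()})
print({reg: hex(val) for reg, val in as_caller.items()})
```

With the frame treated as active, the stale rax value is handed to the GC as a root. With the flag cleared, only the non-scratch register is reported.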

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a NativeAOT GC stack-walk correctness issue when the walk is initiated from a hijacked universal transition thunk frame. After unwinding through the thunk to the next managed caller frame, the iterator could previously still mark that managed frame as the active frame, causing scratch registers to be reported using the caller’s post-call GC state. In the reported failure mode, that exposed stale thunk register contents (e.g., a code pointer in RAX) as a “precise GC root”, leading to fail-fast object validation.

Changes:

  • Clear ActiveStackFrame after unwinding a non-EH universal-transition thunk sequence, so the yielded managed caller frame is not treated as the active frame for scratch-register reporting.
  • Preserve the existing behavior of publishing the conservative stack range computed while unwinding the thunk sequence.

@MichalStrehovsky
Member Author

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI review requested due to automatic review settings May 1, 2026 10:53
@MichalStrehovsky
Member Author

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@MichalStrehovsky MichalStrehovsky enabled auto-merge (squash) May 1, 2026 21:17
@MichalStrehovsky MichalStrehovsky merged commit ad4354d into dotnet:main May 2, 2026
109 checks passed