[release/9.0] Enhance createdump to detect alt stack execution#127067
hoyosjs wants to merge 1 commit into dotnet:release/9.0
Pull request overview
Fixes createdump native stack unwinding when crashes occur on an alternate signal stack (SA_ONSTACK) by detecting signal trampoline frames and allowing a single stack-pointer decrease to unwind back onto the original thread stack.
Changes:
- Extend `PAL_VirtualUnwindOutOfProc` to optionally report whether the current frame is a signal trampoline (via `unw_is_signal_frame`).
- Update createdump's `UnwindNativeFrames` monotonic-SP guard to allow one SP decrease when crossing a signal trampoline.
- Plumb the updated unwind API through createdump PAL shim and DAC minimal prototypes.
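The guard change in the list above can be sketched as a small simulation. This is a minimal model, not the createdump code: `Frame`, `CountAcceptedFrames`, and the SP values are hypothetical, and the sketch assumes the allowance applies only to the step immediately after a frame flagged as a signal trampoline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one unwound frame: the SP it reports and whether
// the unwinder flagged it as a signal trampoline.
struct Frame
{
    std::uint64_t sp;
    bool isSignalFrame;
};

// Walks the frames in unwind order and returns how many are accepted before
// the monotonic-SP guard stops the walk. An SP decrease is tolerated only on
// the step right after a signal trampoline frame (assumed policy).
std::size_t CountAcceptedFrames(const std::vector<Frame>& frames)
{
    if (frames.empty())
    {
        return 0;
    }

    std::size_t accepted = 1;
    std::uint64_t previousSp = frames[0].sp;
    bool allowSpDecrease = frames[0].isSignalFrame;

    for (std::size_t i = 1; i < frames.size(); ++i)
    {
        const Frame& frame = frames[i];
        if (frame.sp < previousSp && !allowSpDecrease)
        {
            break; // SP moved backwards without a trampoline: stop the walk
        }
        allowSpDecrease = frame.isSignalFrame; // allowance lasts a single step
        previousSp = frame.sp;
        ++accepted;
    }
    return accepted;
}

// Usage: alternate stack at high addresses, original stack at low addresses.
void DemoGuard()
{
    std::vector<Frame> frames = {
        {0x70001000, false}, {0x70001100, false}, {0x70001200, true},
        {0x10000000, false}, {0x10000100, false}};
    assert(CountAcceptedFrames(frames) == 5); // crosses onto the original stack

    frames[2].isSignalFrame = false;          // trampoline not recognized
    assert(CountAcceptedFrames(frames) == 3); // walk stops early, as before the fix
}
```

The second case in `DemoGuard` mirrors the fallback described under Risk: if the trampoline is not flagged, the walk stops where it did before the change.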
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/pal/src/exception/remote-unwind.cpp | Detects signal trampoline frames in out-of-proc unwinding and surfaces that via a new out-parameter. |
| src/coreclr/pal/inc/pal.h | Updates the PAL API signature for PAL_VirtualUnwindOutOfProc to include isSignalFrame. |
| src/coreclr/debug/daccess/dacfn.cpp | Updates DAC-side minimal prototype and call site to pass the new parameter (null). |
| src/coreclr/debug/createdump/threadinfo.cpp | Allows a single SP decrease after unwinding a signal trampoline to capture the original stack. |
| src/coreclr/debug/createdump/createdumppal.cpp | Updates the dynamically-loaded PAL function pointer signature and forwards the new argument. |
Force-pushed: 6085053 to eb628ba
Use CFI entries in libc trampoline frames to detect signal frames and allow the stack unwind to follow backward SP jumps.
Force-pushed: eb628ba to 5649912
    extern
    BOOL
    - PAL_VirtualUnwindOutOfProc(PT_CONTEXT context, PT_KNONVOLATILE_CONTEXT_POINTERS contextPointers, PULONG64 functionStart, SIZE_T baseAddress, UnwindReadMemoryCallback readMemoryCallback);
    + PAL_VirtualUnwindOutOfProc(PT_CONTEXT context, PT_KNONVOLATILE_CONTEXT_POINTERS contextPointers, PULONG64 functionStart, SIZE_T baseAddress, UnwindReadMemoryCallback readMemoryCallback, BOOL *isSignalFrame);
The cross-OS DAC minimal prototype uses BOOL* isSignalFrame, but the PAL declaration/implementation uses bool*. In C++ this changes the mangled symbol name and can break linking/calls in HOST_WINDOWS/TARGET_UNIX builds. Update the prototype to take bool* (and keep it consistent with pal.h).
Suggested change:

    - PAL_VirtualUnwindOutOfProc(PT_CONTEXT context, PT_KNONVOLATILE_CONTEXT_POINTERS contextPointers, PULONG64 functionStart, SIZE_T baseAddress, UnwindReadMemoryCallback readMemoryCallback, BOOL *isSignalFrame);
    + PAL_VirtualUnwindOutOfProc(PT_CONTEXT context, PT_KNONVOLATILE_CONTEXT_POINTERS contextPointers, PULONG64 functionStart, SIZE_T baseAddress, UnwindReadMemoryCallback readMemoryCallback, bool *isSignalFrame);
This looks like valid feedback.
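The type mismatch flagged in the review comment above can be illustrated in isolation. The sketch below assumes `BOOL` is an `int` typedef, as is conventional for Windows-style headers. `WhichOverload` and `CheckOverloads` are hypothetical names used only for this illustration. Because C++ treats `BOOL*` and `bool*` as different parameter types, two declarations that differ only in that parameter are distinct functions with distinct mangled names.

```cpp
#include <cassert>
#include <type_traits>

// Assumption: stand-in for the PAL's BOOL, modeled here as an int typedef.
typedef int BOOL;

// The two pointer types are unrelated as far as the type system is concerned.
static_assert(!std::is_same<BOOL*, bool*>::value,
              "BOOL* and bool* are distinct pointer types");

// Hypothetical overload pair: the fact that both can coexist shows the
// signatures (and therefore the mangled symbols) differ.
int WhichOverload(BOOL*) { return 1; }
int WhichOverload(bool*) { return 2; }

// Usage: each argument type selects a different function.
void CheckOverloads()
{
    BOOL palFlag = 0;
    bool cppFlag = false;
    assert(WhichOverload(&palFlag) == 1);
    assert(WhichOverload(&cppFlag) == 2);
}
```

A prototype declared with `BOOL*` therefore would not refer to an implementation defined with `bool*`, which is why the suggestion aligns the DAC prototype with pal.h.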
Fixes .NET 9 version of #126981
Description
createdump's UnwindNativeFrames fails to capture the original thread stack when a crash occurs on a thread using an alternate signal stack. The native unwinder's monotonic-SP guard trips when the unwind crosses the signal trampoline back to the original stack, because the SP legitimately decreases. This causes the unwinder to stop early, omitting the original stack memory from the dump.
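To see why the SP is not monotonic in this situation, the following POSIX sketch registers an alternate stack and checks that a handler installed with `SA_ONSTACK` runs inside it. It is an illustration only, not the repro used for this PR: `HandlerRanOnAltStack`, the 64 KiB stack size, and the use of `SIGUSR1` instead of a real fault are all assumptions made for the example.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <signal.h>

// Address of a local inside the handler, used as a proxy for the handler's SP.
// (A plain volatile integer is enough for this single-threaded illustration.)
static volatile std::uintptr_t g_handlerLocal = 0;

static void OnSignal(int)
{
    volatile char marker = 0;
    g_handlerLocal = reinterpret_cast<std::uintptr_t>(&marker);
}

// Returns true if the handler's frame was located inside the alternate stack.
bool HandlerRanOnAltStack()
{
    const std::size_t altSize = 64 * 1024; // assumed size for the sketch
    void* altBase = std::malloc(altSize);
    if (altBase == nullptr)
    {
        return false;
    }

    stack_t altStack{};
    altStack.ss_sp = altBase;
    altStack.ss_size = altSize;
    altStack.ss_flags = 0;
    if (sigaltstack(&altStack, nullptr) != 0)
    {
        std::free(altBase);
        return false;
    }

    struct sigaction action{};
    action.sa_handler = OnSignal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_ONSTACK; // run the handler on the alternate stack
    if (sigaction(SIGUSR1, &action, nullptr) != 0)
    {
        return false;
    }

    g_handlerLocal = 0;
    raise(SIGUSR1); // handler runs before raise() returns in this thread

    const std::uintptr_t lo = reinterpret_cast<std::uintptr_t>(altBase);
    const bool onAltStack = g_handlerLocal >= lo && g_handlerLocal < lo + altSize;

    // Restore defaults and release the alternate stack.
    std::signal(SIGUSR1, SIG_DFL);
    altStack.ss_flags = SS_DISABLE;
    sigaltstack(&altStack, nullptr);
    std::free(altBase);
    return onAltStack;
}
```

Because the alternate stack is an unrelated allocation, it can sit at higher addresses than the thread's own stack, so an unwinder stepping from the handler back into the interrupted code can see the SP go down.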
The fix uses libunwind's `unw_is_signal_frame` in `PAL_VirtualUnwindOutOfProc` to detect signal trampoline frames. When a signal frame is detected, `UnwindNativeFrames` allows a SP decrease, enabling the unwinder to cross back to the original stack and capture its memory.
Customer Impact
Minidumps collected via createdump for crashes on alternate signal stacks are missing the original thread's stack memory. This makes the dumps incomplete and difficult to debug: native frames below the signal handler are absent from the stack walk, and the managed stack can only be retrieved separately via clrstack. Watson and WinDbg both fail to do this automatically.
Regression
We added the monotonic-SP check ~7 years ago to prevent unwinding issues caused by stack corruption.
Testing
The following scenario was tested as a proxy for the customer's issue: P/Invoke into a native library with some frames before hitting a null reference on a secondary thread. Before the fix, the repro shows the unwind bailing out early. With the fix, the unwind crosses the signal trampoline, identifies the libc trampoline, and includes the original stack memory in the dump. I also validated that the fix works both with and without DWARF unwind info in the crashing native library.
Risk
Low. The change is narrowly scoped to createdump's native unwind path and some of the DAC's lazy state machine unwinding. unw_get_proc_info reuses the same cache unw_step populates - no additional remote memory reads. If unw_is_signal_frame returns false due to some libc not marking the trampoline correctly, behavior is identical to before - no regression.