Fix false-positive assertion in CheckRegDisplaySP due to stale cached stack limit#126740
Fix false-positive assertion in CheckRegDisplaySP due to stale cached stack limit#126740
Conversation
Use GetStackLowerBound() instead of GetCachedStackLimit() in the CheckRegDisplaySP debug assertion. On Linux, the main thread's stack grows on demand, so the cached limit set at thread initialization can become stale. GetStackLowerBound() queries the current stack boundary dynamically, preventing false-positive assertion failures. Since GetStackLowerBound() is only available in non-DAC builds (#ifndef DACCESS_COMPILE), DAC builds retain the original GetCachedStackLimit() check. Fixes #124839, #125760, #117918 Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/62ad3d3e-698b-4e2e-bab3-75d91ecb80c6 Co-authored-by: mangod9 <61718172+mangod9@users.noreply.github.com>
|
Tagging subscribers to this area: @agocke |
| // and the cached limit can be stale if the stack has grown past the initially committed boundary. | ||
| // GetStackLowerBound() is only available in non-DAC builds. | ||
| #ifndef DACCESS_COMPILE | ||
| _ASSERTE(pRD->_pThread->IsExecutingOnAltStack() || PTR_VOID(pRD->SP) >= pRD->_pThread->GetStackLowerBound()); |
There was a problem hiding this comment.
PAL is caching the cache limits as well:
runtime/src/coreclr/pal/src/thread/thread.cpp
Lines 2396 to 2405 in 6adf63b
There was a problem hiding this comment.
I think we will need to pass a bool flag all the way to the PAL to fetch non-cached stack limits.
Also, we make sure to use the cached stack limits most of the time to make this check cheap.
| @@ -5687,7 +5687,14 @@ void CheckRegDisplaySP (REGDISPLAY *pRD) | |||
| if (pRD->SP && pRD->_pThread) | |||
| { | |||
| #ifndef NO_FIXED_STACK_LIMIT | |||
There was a problem hiding this comment.
NO_FIXED_STACK_LIMIT is working around the same problem, just for Musl. If we find a better way to fix this, we should delete NO_FIXED_STACK_LIMIT.
|
It's been a long time since I've implemented the stack limits stuff for Unix, so I don't remember all the details, but I am not sure the copilot analysis is correct. I'll experiment locally to see if it is right or not. |
|
My feeling was right. On non-MUSL linux, the pthread_attr_getstack returns hard stack limit. Going over (or I should say below) that limit results in stack overflow. On MUSL based linux, there is no hard limit and the stack grows until it cannot grow anymore due to hitting virtual memory range that's already mapped by something or due to OOM when trying to get physical memory page to backup the next stack page. Looking at the failed legs listed in the actual issue, they were all failing when running with interpreter. So I think it is an issue when we somehow ended up passing a wrong SP to the CheckRegDisplaySP. Could have been due to some issues in the interpreter that were fixed since, as I cannot see recent failures due to this issue. |
|
ok should we close the issue then and wait for a repro? Looks like there was another hit a few weeks ago: #125760 (comment), but doesnt look like the dumps are still available for it. |
|
I think it would be reasonable to close it and wait for a repro with dump. |
On Linux, the main thread's stack grows on demand — the kernel commits new pages below the initial boundary as SP moves lower.
m_CacheStackLimitis set once at thread init viapthread_attr_getstack()and never refreshed. Under JIT stress modes (jitminopts,tailcallstress,interpmode1) that drive deeper recursion, GC/profiler stack walks encounter frames with SP below the stale cached limit, triggering a false-positive_ASSERTEinCheckRegDisplaySP.Description
In
CheckRegDisplaySP(#ifdef DEBUG_REGDISPLAY, checked/debug builds only), replace the staleGetCachedStackLimit()withGetStackLowerBound(), which queries the current committed stack boundary dynamically:GetStackLowerBound()is not available in DAC builds (#ifndef DACCESS_COMPILE), so DAC retains the originalGetCachedStackLimit()path. No release build impact — the entire function is inside#ifdef DEBUG_REGDISPLAY.Fixes #124839