Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stack overflow handling issue in GC stress #56733

Merged
merged 1 commit into from
Aug 3, 2021

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Aug 2, 2021

This change fixes a problem when in GC stress mode 3, GC started to run
on the thread that hit stack overflow due to the GCX_PREEMP in
DebuggerRCThread::DoFavor that is called from the
EEPolicy::HandleFatalStackOverflow. It was causing failures in the CI.
The issue is GC stress specific, the GCX_PREEMP would not start running
GC on the current thread in regular cases.

The fix is to inhibit GC stress in the HandleFatalStackOverflow.

This change fixes a problem when in GC stress mode 3, GC started to run
on the thread that hit stack overflow due to the GCX_PREEMP in
DebuggerRCThread::DoFavor that is called from the
EEPolicy::HandleFatalStackOverflow. It was causing failures in the CI.
The issue is GC stress specific, the GCX_PREEMP would not start running
GC on the current thread in regular cases.

The fix is to inhibit GC stress in the HandleFatalStackOverflow.
@janvorli janvorli added this to the 6.0.0 milestone Aug 2, 2021
@janvorli janvorli self-assigned this Aug 2, 2021
@janvorli
Copy link
Member Author

janvorli commented Aug 2, 2021

This fixes part of the #46279 that was reported on x86 (and that reproed there locally with GC stress). There are also failures on arm / x64 Windows that are not related.

@davidwrighton
Copy link
Member

@janvorli Do you know that this unrelated to the arm32 issue? I've been trying to reproduce that one locally and I can't seem to get anything out of it.

@janvorli
Copy link
Member Author

janvorli commented Aug 2, 2021

The behavior on arm and x64 is different. It just prints "stack overflow." without any stack trace, but it returns correct stack overflow return code. Also, it happens without GC stress while the x86 was GC stress only.
One thing I've just noticed about the arm and x64 runs was that the issue only happens on branches marked as fullpgo.

@janvorli
Copy link
Member Author

janvorli commented Aug 2, 2021

@davidwrighton I was able to repro it locally on x64. Trying again with instrumented runtime to see why it didn't print the call stack. Not sure how often it reproduces though.

@janvorli
Copy link
Member Author

janvorli commented Aug 2, 2021

So, the problem is that the stackwalker cannot walk the stack in the failure case, so it doesn't call the callback to log any frames. In the case I've seen, the stack overflow happened in MethodTable::SanityCheck when the method attempted to make the first call. The prolog starts like this:

coreclr!MethodTable::SanityCheck [F:\git\runtime2\src\coreclr\vm\methodtable.cpp @ 7546]:
 7546 00007fff`0347e498 48895c2408      mov     qword ptr [rsp+8],rbx
 7546 00007fff`0347e49d 48896c2410      mov     qword ptr [rsp+10h],rbp
 7546 00007fff`0347e4a2 4889742418      mov     qword ptr [rsp+18h],rsi
 7546 00007fff`0347e4a7 57              push    rdi
 7546 00007fff`0347e4a8 4883ec20        sub     rsp,20h

The rsp after this is 00000031`78646ff0, that is 16 bytes below the last valid stack page. Then a bit later, there is a call that triggers the stack overflow since its return address would go to the stack guard page.

And this is the top of the stack trace that WinDbg shows:

 # Child-SP          RetAddr           Call Site
00 00000031`78643e48 00007fff```9b3419ce ntdll!ZwWaitForSingleObject+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 211] 
01 00000031`78643e50 00007fff`035e8aa5 KERNELBASE!WaitForSingleObjectEx+0x8e [minkernel\kernelbase\synch.c @ 1328] 
02 00000031`78643ef0 00007fff`035e8d70 coreclr!EEPolicy::HandleFatalStackOverflow+0x17d [F:\git\runtime2\src\coreclr\vm\eepolicy.cpp @ 661] 
03 00000031`78644580 00007fff`0357ca7c coreclr!EEPolicy::HandleStackOverflow+0xb4 [F:\git\runtime2\src\coreclr\vm\eepolicy.cpp @ 147] 
04 00000031`786445c0 00007fff`03bd9e7b coreclr!CLRException::HandlerState::SetupCatch+0x98 [F:\git\runtime2\src\coreclr\vm\clrex.cpp @ 878] 
05 00000031`78644680 00007fff`03b437c0 coreclr!`Object::ValidateInner'::`1'::catch$12+0x9b [F:\git\runtime2\src\coreclr\vm\object.cpp @ 590] 
06 00000031`786446e0 00007fff`03b4173e coreclr!_CallSettingFrame_LookupContinuationIndex+0x20 [d:\agent\_work\4\s\src\vctools\crt\vcruntime\src\eh\amd64\handlers.asm @ 98] 
07 00000031`78644710 00007fff`9dbd1506 coreclr!__FrameHandler4::CxxCallCatchBlock+0x1de [d:\agent\_work\4\s\src\vctools\crt\vcruntime\src\eh\frame.cpp @ 1393] 
08 00000031`786447e0 00007fff`0348f003 ntdll!RcFrameConsolidation+0x6 [minkernel\ntos\rtl\amd64\capture.asm @ 1015] 
09 00000031`78647060 00007fff`0348eec9 coreclr!Object::ValidateInner+0x103 [F:\git\runtime2\src\coreclr\vm\object.cpp @ 514] 
0a 00000031`78647160 00007fff`03487d46 coreclr!Object::Validate+0xb9 [F:\git\runtime2\src\coreclr\vm\object.cpp @ 490] 
0b 00000031`786471a0 00007fff`0367a0b7 coreclr!OBJECTREF::OBJECTREF+0xa6 [F:\git\runtime2\src\coreclr\vm\object.cpp @ 1080] 
0c 00000031`786471e0 00007ffe`a3f6f007 coreclr!JIT_ClassProfile32+0xc7 [F:\git\runtime2\src\coreclr\vm\jithelpers.cpp @ 5264] 
0d 00000031`78647270 00007ffe`a3f6e254 System_Private_CoreLib!System.Text.ValueStringBuilder.AppendFormatHelper(System.IFormatProvider, System.String, System.ParamsArray)+0x737*** WARNING: Unable to verify checksum for F:\git\runtime2\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\System.Private.CoreLib.dll
 [F:\git\runtime2\src\libraries\System.Private.CoreLib\src\System\Text\ValueStringBuilder.AppendFormat.cs @ 282] 
0e 00000031`78647410 00007ffe`a3f6de2b System_Private_CoreLib!System.String.FormatHelper(System.IFormatProvider, System.String, System.ParamsArray)+0x1e4 [F:\git\runtime2\src\libraries\System.Private.CoreLib\src\System\String.Manipulation.cs @ 504] 
0f 00000031`78647700 00007ffe`a3f6dd6b System_Private_CoreLib!System.String.Format(System.String, System.Object)+0x6b [F:\git\runtime2\src\libraries\System.Private.CoreLib\src\System\String.Manipulation.cs @ 444] 
10 00000031`78647780 00007ffe`a3f6dd7c stackoverflow3!TestStackOverflow3.Program.Execute(System.String)+0x14b*** WARNING: Unable to verify checksum for F:\git\runtime2\artifacts\tests\coreclr\windows.x64.Checked\baseservices\exceptions\stackoverflow\stackoverflow3\stackoverflow3.dll
 [F:\git\runtime2\src\tests\baseservices\exceptions\stackoverflow\stackoverflow3.cs @ 26] 
11 00000031`78649740 00007ffe`a3f6dd7c stackoverflow3!TestStackOverflow3.Program.Execute(System.String)+0x15c [F:\git\runtime2\src\tests\baseservices\exceptions\stackoverflow\stackoverflow3.cs @ 26] 
12 00000031`7864b700 00007ffe`a3f6dd7c stackoverflow3!TestStackOverflow3.Program.Execute(System.String)+0x15c [F:\git\runtime2\src\tests\baseservices\exceptions\stackoverflow\stackoverflow3.cs @ 26]

WinDbg had some issue here too, the frame of the failing function is missing, it should be between frames 8 and 9 above. But I don't think it is related.

When our stack walker tries to walk the stack, it starts at the FaultingExceptionFrame, which points to a native frame. I think this is the actual issue, the stack walker is not handling that case. I think the EEPolicy::HandleFatalStackOverflow should call Thread::VirtualUnwindToFirstManagedCallFrame on the context instead of the AdjustContextForJITHelpers. We don't care about native frames in the stack overflow case anyways (for the purpose of reporting).

@janvorli janvorli merged commit 28ce2cc into dotnet:main Aug 3, 2021
@janvorli janvorli deleted the fix-gcstress-stackoverflow-issue branch August 3, 2021 12:43
thaystg added a commit to thaystg/runtime that referenced this pull request Aug 3, 2021
* origin/main: (64 commits)
  [wasm][debugger] Create test Inherited Properties (dotnet#56754)
  Mark new test as incompatible with GC Mark4781_1GcStressIncompatible (dotnet#56739)
  Ensure MetadataEnumResult is sufficiently updated by MetaDataImport::Enum (dotnet#56756)
  [mono] Remove gdb xdebug and binary writer support, it hasn't worked in a while. (dotnet#56759)
  Update windows-requirements.md (dotnet#56476)
  Update doc and generic parameter name for JsonValue.GetValue (dotnet#56639)
  [wasm][debugger] Inspect static class (dotnet#56740)
  Fix stack overflow handling issue in GC stress (dotnet#56733)
  Use ReflectionOnly as serialization mode in case dynamic code runtime feature is not supported (dotnet#56604)
  Move Windows Compat pack to NuGet pack task (dotnet#56686)
  Fix build error when building some packages (dotnet#56767)
  Simplify JIT shutdown logic in crossgen2 (dotnet#56687)
  Fix race in crossdac publishing with PGO (dotnet#56762)
  Add DictionaryKeyPolicy support for EnumConverter [dotnet#47765] (dotnet#54429)
  Use ComWrappers in some Marshal unit-tests and update platform metadata  (dotnet#56595)
  Set `DisableImplicitNamespaceImports_Dotnet=true` to workaround sdk issue (dotnet#56744)
  Make sure ServerGCHeapDetails is up to date (dotnet#56056)
  [libraries] Reenable System.Diagnostics.DiagnosticSorce.Switches.Tests on mobile (dotnet#56737)
  Disable failing arm64 win10 Graphics.FromHdc tests  (dotnet#56732)
  Match xplat event source conditions (dotnet#56435)
  ...
thaystg added a commit to thaystg/runtime that referenced this pull request Aug 4, 2021
…ger_proxy_attribute

* origin/main: (340 commits)
  add RID for Debian 11 (dotnet#56789)
  [wasm] [debugger] Skip thread static field (dotnet#56749)
  Fix timeouts in coreroot_determinism test in GC stress mode (dotnet#56770)
  Use File.OpenHandle in Socket.SendFile directly (dotnet#56777)
  accept empty realm for digest auth (dotnet#56369) (dotnet#56455)
  [wasm][debugger] Create test Inherited Properties (dotnet#56754)
  Mark new test as incompatible with GC Mark4781_1GcStressIncompatible (dotnet#56739)
  Ensure MetadataEnumResult is sufficiently updated by MetaDataImport::Enum (dotnet#56756)
  [mono] Remove gdb xdebug and binary writer support, it hasn't worked in a while. (dotnet#56759)
  Update windows-requirements.md (dotnet#56476)
  Update doc and generic parameter name for JsonValue.GetValue (dotnet#56639)
  [wasm][debugger] Inspect static class (dotnet#56740)
  Fix stack overflow handling issue in GC stress (dotnet#56733)
  Use ReflectionOnly as serialization mode in case dynamic code runtime feature is not supported (dotnet#56604)
  Move Windows Compat pack to NuGet pack task (dotnet#56686)
  Fix build error when building some packages (dotnet#56767)
  Simplify JIT shutdown logic in crossgen2 (dotnet#56687)
  Fix race in crossdac publishing with PGO (dotnet#56762)
  Add DictionaryKeyPolicy support for EnumConverter [dotnet#47765] (dotnet#54429)
  Use ComWrappers in some Marshal unit-tests and update platform metadata  (dotnet#56595)
  ...
@ghost ghost locked as resolved and limited conversation to collaborators Sep 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants