
Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610

Closed
ahsonkhan opened this issue Oct 18, 2019 · 97 comments

Comments

@ahsonkhan
Member

From dotnet/corefx#41753

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries
netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213282&resultId=145033&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================

C:\helix\work\workitem>"C:\helix\work\correlation\dotnet.exe" exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=nonnetcoreapptests -notrait category=nonwindowstests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries [FAIL]
      Timed out at 10/18/2019 11:24:33 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp
      	Process ID: 4708
      	Handle: 912
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(71,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 118.055s
----- end Fri 10/18/2019 11:25:06.32 ----- exit code 1 ----------------------------------------------------------

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw
netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213340&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/82ca2546-9df2-4b32-ab1f-d00835fcbe01/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec ~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw [FAIL]
      Timed out at 10/18/2019 11:07:51 AM after 60000ms waiting for remote process.
      	Process ID: 2283
      	Handle: 892
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(106,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 77.472s
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
----- end Fri Oct 18 11:07:57 UTC 2019 ----- exit code 1 ----------------------------------------------------------
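Both failures above time out inside RemoteInvokeHandle.Dispose, which waits for the child process started by RemoteExecutor.Invoke to exit. A minimal sketch of the RemoteExecutor pattern these cache tests follow (illustrative only; this is not the actual test body in Regex.Cache.Tests.cs):

```csharp
using System.Text.RegularExpressions;
using Microsoft.DotNet.RemoteExecutor;

public class RegexCacheTestSketch
{
    public void Promote_entries_sketch()
    {
        // RemoteExecutor.Invoke launches a separate child process so the test can
        // mutate the static Regex cache (Regex.CacheSize) without affecting other tests.
        using (RemoteExecutor.Invoke(() =>
        {
            Regex.CacheSize = 2;
            Regex.IsMatch("a", "a"); // fill the cache
            Regex.IsMatch("b", "b");
            Regex.IsMatch("a", "a"); // a cache hit should promote this entry
        }))
        {
            // RemoteInvokeHandle.Dispose() waits for the child process to exit; the
            // "Timed out ... after 60000ms waiting for remote process" above is that
            // wait expiring.
        }
    }
}
```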

cc @ViktorHofer

@stephentoub
Member

stephentoub commented Oct 18, 2019

@ahsonkhan, it should have uploaded a dump from the remote process:
"Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp"
Were you able to grab that? It looks like you restarted the legs.

@danmoseley
Member

Previously this was blocked by https://github.com/dotnet/core-eng/issues/7950 (https://github.com/dotnet/corefx/issues/41528). If dumps work now, that's awesome.

@stephentoub
Member

@danmosemsft, the one with the dump is Windows.

@danmoseley
Member

@stephentoub ah, I just saw the RedHat above.

@danmoseley
Member

It looks like https://github.com/dotnet/core-eng/issues/7950 was recently closed, so I would hope to get a dump from that one too.

@stephentoub
Member

so I would hope to get a dump from that one too.

The dump that's relevant here is from the RemoteExecutor.Invoke child process; on Windows we now explicitly P/Invoke to MiniDumpWriteDump in order to save out a dump. That doesn't happen on Linux.
dotnet/arcade#4085
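For context, the Windows-only dump capture mentioned here boils down to a dbghelp.dll P/Invoke along these lines (a rough sketch, not the actual RemoteExecutor code; the dump type and helper names are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

internal static class MiniDumpSketch
{
    private const int MiniDumpWithFullMemory = 0x2; // MINIDUMP_TYPE flag

    [DllImport("dbghelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    public static void Write(Process process, string dumpPath)
    {
        using (FileStream fs = File.Create(dumpPath))
        {
            // Capture a minidump of the (possibly hung) child process so it can be
            // uploaded from the Helix work item and inspected later.
            MiniDumpWriteDump(process.Handle, (uint)process.Id, fs.SafeFileHandle,
                MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
        }
    }
}
```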

@danmoseley
Member

Should we have an issue to find a way to do it on Linux? Presumably deploying and invoking createdump?

@stephentoub
Member

Sure :)

@danmoseley
Member

dotnet/arcade#4153

@ahsonkhan
Member Author

Were you able to grab that? It looks like you restarted the legs.

No, sorry. Will remember to do that next time.

@ViktorHofer
Member

ViktorHofer commented Oct 19, 2019

Updated the link above (edited the post above, as the link was wrong).

0:000> k
 # Child-SP          RetAddr           Call Site
00 00000005`6597c310 00007ffc`becda9b5 coreclr!memset_repmovs+0x9 [d:\agent\_work\2\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 66] 
01 00000005`6597c320 00007ffc`becda417 coreclr!WKS::gc_heap::adjust_limit_clr+0x155 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12087] 
02 (Inline Function) --------`-------- coreclr!WKS::gc_heap::a_fit_segment_end_p+0x134 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12864] 
03 (Inline Function) --------`-------- coreclr!WKS::gc_heap::soh_try_fit+0x1ad [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13002] 
04 00000005`6597c390 00007ffc`becda148 coreclr!WKS::gc_heap::allocate_small+0x247 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13067] 
05 (Inline Function) --------`-------- coreclr!WKS::gc_heap::try_allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13939] 
06 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369] 
07 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x116 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400] 
08 00000005`6597c490 00007ffc`bec54ee7 coreclr!WKS::GCHeap::Alloc+0x148 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35826] 
09 00000005`6597c4f0 00007ffc`bec54732 coreclr!Alloc+0x16f [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 251] 
0a 00000005`6597c550 00007ffc`bec6603c coreclr!AllocateSzArray+0xee [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 543] 
0b (Inline Function) --------`-------- coreclr!AllocateSzArray+0x19 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 421] 
0c 00000005`6597c5f0 00007ffc`bec64d55 coreclr!AllocateObjectArray+0x48 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 964] 
0d 00000005`6597c630 00007ffc`bec54020 coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::CheckGrowth+0xc9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 248] 
0e (Inline Function) --------`-------- coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::Add+0x9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 438] 
0f 00000005`6597c660 00007ffc`bec5399e coreclr!CrossLoaderAllocatorHash<InliningInfoTrackerHashTraits>::Add+0x314 [f:\workspace\_work\1\s\src\vm\crossloaderallocatorhash.inl @ 389] 
10 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInliningDontTakeLock+0x5f3 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 664] 
11 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInlining+0xa46 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 648] 
12 (Inline Function) --------`-------- coreclr!Module::AddInlining+0xa6e [f:\workspace\_work\1\s\src\vm\ceeload.cpp @ 142] 
13 00000005`6597c740 00007ffc`d4649e8a coreclr!CEEInfo::reportInliningDecision+0x110e [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 8319] 
14 00000005`6597cd10 00007ffc`d45c6b13 clrjit!InlineResult::Report+0xde [f:\workspace\_work\1\s\src\jit\inline.cpp @ 749] 
15 (Inline Function) --------`-------- clrjit!InlineResult::{dtor}+0x9 [f:\workspace\_work\1\s\src\jit\inline.h @ 420] 
16 00000005`6597cd50 00007ffc`d45c5041 clrjit!Compiler::fgInline+0x273 [f:\workspace\_work\1\s\src\jit\flowgraph.cpp @ 22027] 
17 00000005`6597ced0 00007ffc`d4622d38 clrjit!Compiler::fgMorph+0x131 [f:\workspace\_work\1\s\src\jit\morph.cpp @ 16632] 
18 00000005`6597d1b0 00007ffc`d4631726 clrjit!Compiler::compCompile+0x368 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 4560] 
19 00000005`6597d2f0 00007ffc`d462fd7d clrjit!Compiler::compCompileHelper+0x296 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6099] 
1a 00000005`6597d3a0 00007ffc`d462fa33 clrjit!Compiler::compCompile+0x21d [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 5427] 
1b 00000005`6597d450 00007ffc`d462ecc2 clrjit!jitNativeCode+0x273 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6727] 
1c 00000005`6597d610 00007ffc`bec79699 clrjit!CILJit::compileMethod+0x92 [f:\workspace\_work\1\s\src\jit\ee_il_dll.cpp @ 319] 
1d (Inline Function) --------`-------- coreclr!invokeCompileMethodHelper+0x63 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12464] 
1e 00000005`6597d680 00007ffc`bec7952c coreclr!invokeCompileMethod+0xdd [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12529] 
1f 00000005`6597d720 00007ffc`bec78dbf coreclr!CallCompileMethodWithSEHWrapper+0x50 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12583] 
20 00000005`6597d7c0 00007ffc`bec77d91 coreclr!UnsafeJitFunction+0xe1f [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 13070] 
21 00000005`6597dbc0 00007ffc`bec77889 coreclr!MethodDesc::JitCompileCodeLocked+0x2e1 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 974] 
22 00000005`6597ddb0 00007ffc`bec771f4 coreclr!MethodDesc::JitCompileCodeLockedEventWrapper+0x33d [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 830] 
23 00000005`6597def0 00007ffc`becbbdca coreclr!MethodDesc::JitCompileCode+0x194 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 769] 
24 00000005`6597df90 00007ffc`bed595a5 coreclr!MethodDesc::PrepareILBasedCode+0xfa [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 398] 
25 (Inline Function) --------`-------- coreclr!MethodDesc::PrepareCode+0xc [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 330] 
26 00000005`6597dfc0 00007ffc`becb9d06 coreclr!CodeVersionManager::PublishVersionableCodeIfNecessary+0x121 [f:\workspace\_work\1\s\src\vm\codeversion.cpp @ 1800] 
27 00000005`6597e140 00007ffc`becb9815 coreclr!MethodDesc::DoPrestub+0x2b6 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1972] 
28 00000005`6597e210 00007ffc`bed4db95 coreclr!PreStubWorker+0x455 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1798] 
29 00000005`6597e420 00007ffc`5f264362 coreclr!ThePreStub+0x55 [F:\workspace\_work\1\s\src\vm\amd64\ThePreStubAMD64.asm @ 22] 
2a 00000005`6597e4d0 000001c6`00010fa8 0x00007ffc`5f264362
2b 00000005`6597e4d8 00000005`6597e518 0x000001c6`00010fa8
2c 00000005`6597e4e0 000001c6`0000cc90 0x00000005`6597e518
2d 00000005`6597e4e8 00000000`ffffffff 0x000001c6`0000cc90
2e 00000005`6597e4f0 00000000`0000003e 0xffffffff
2f 00000005`6597e4f8 00000000`00000003 0x3e
30 00000005`6597e500 00000000`00000000 0x3

@ViktorHofer
Member

cc @jkotas

@ViktorHofer
Member

And for learning purposes, to find the uploaded files (i.e. after a retry) you can use the Helix API directly: https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/files

This lists all uploaded files; in this case I searched for the ".dmp" file and appended it to the URL: "/4708.wrrjax4w.c3b.dmp".
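A small sketch of that workflow in C# (hypothetical helper built around the URLs quoted above; assumes the endpoints need no authentication for public jobs):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

internal static class HelixFileFetcher
{
    private const string WorkItemUrl =
        "https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb" +
        "/workitems/System.Text.RegularExpressions.Tests";

    public static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // "/files" lists every file the work item uploaded; scan it for ".dmp".
            string fileList = await client.GetStringAsync(WorkItemUrl + "/files");
            Console.WriteLine(fileList);

            // Append the dump's file name to download it.
            byte[] dump = await client.GetByteArrayAsync(WorkItemUrl + "/files/4708.wrrjax4w.c3b.dmp");
            Console.WriteLine($"Downloaded {dump.Length} bytes.");
        }
    }
}
```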

@stephentoub
Member

Thanks, @ViktorHofer.

This is happening early on in the invocation of the remote process, as we're looking up the method to invoke:
[image: Visual Studio call-stack screenshot showing the managed frames and their arguments]

@stephentoub stephentoub transferred this issue from dotnet/corefx Oct 19, 2019
@jkotas
Member

jkotas commented Oct 19, 2019

coreclr!memset_repmovs+0x9

This seems to be stuck on a page-in request in the OS. I do not see anything wrong on the runtime side.

The GC is trying to clear memory from 0x000001c600011398 to 0x000001c6000139d8. The clearing is stuck at address 0x000001c600013000 (i.e. a page boundary).

We need to get more dumps for this to see the pattern.

@jkotas
Member

jkotas commented Oct 19, 2019

From just looking at this dump, my bet would be on an infrastructure problem: the physical machine is oversubscribed and super slow.

@jkotas
Member

jkotas commented Oct 19, 2019

we're looking up the method to invoke

Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of GC that happened to be triggered while we're looking up the method to invoke.

@stephentoub
Member

Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of GC that happened to be triggered while we're looking up the method to invoke.

Right, understood. I was pasting the image from VS to show the managed frames that were obscured in the windbg stack trace as well as the arguments to those methods.

@danmoseley
Member

Odd that we've had 3 hangs (in RemoteExecutor, in the Regex cache tests). I'd have thought a slow machine would give a more distributed set of hangs.

@jkotas
Member

jkotas commented Oct 21, 2019

They could each have a different root cause. Do you happen to have dumps for the other 2 hangs?

@ViktorHofer
Member

We don't have more dumps yet, and we only produce them on Windows for these kinds of failures.

@jeffschwMSFT jeffschwMSFT changed the title System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout Oct 21, 2019
@safern
Member

safern commented Oct 21, 2019

@jkotas
Member

jkotas commented Oct 21, 2019

https://helix.dot.net/api/2019-06-17/jobs/f20c2934-638e-4637-809b-b960d0287af0/workitems/System.Text.RegularExpressions.Tests/files/6112.xqszggzs.qit.dmp

We are in the middle of JITing System.Text.RegularExpressions.Regex.InitDefaultMatchTimeout, called from the Regex static constructor. The thread is not blocked for any obvious reason.

So the common theme is:

  • Windows Nano
  • Early during startup, in the middle of the JIT

Keep more dumps coming...

@jkotas
Member

jkotas commented Oct 23, 2019

Another hit in dotnet/coreclr#27375, in the System.Transactions tests. The process is stuck early during startup inside System.Transactions.TransactionsEtwProvider. This hit was on a checked build of the runtime, which allows us to see what happened in the past using the stress log.

The full stress log is here:
log.txt

@jkotas
Member

jkotas commented Oct 23, 2019

The interesting bits of the stress log are:

THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
173c 153.830905200 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 20006be = 00007FFD94C9C678 (System.Globalization.UnicodeCategory) 
173c 152.449501000 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 200037c = 00007FFD94AB3200 (System.Runtime.Intrinsics.Vector128`1) 

1108 110.031347100 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108 110.031328100 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108 110.031286600 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108 110.031261900 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 3, (requested generation = 2, collect_classes = 9) ==========

1108 110.031259000 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108 110.031258700 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108 110.031258500 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108 110.031258300 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108 110.031258000 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108 110.031257700 : `GC`                 Start of GC stack 

1108  61.804306200 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108  61.804288500 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108  61.804249300 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108  61.804225600 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 2, (requested generation = 1, collect_classes = 9) ==========

1108  61.804222700 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108  61.804222500 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108  61.804222300 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108  61.804222000 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108  61.804221800 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108  61.804221600 : `GC`                 Start of GC stack 

1108  13.291604900 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108  13.291578000 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108  13.291504100 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108  13.291459400 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 1, (requested generation = 1, collect_classes = 9) ==========

1108  13.291456300 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108  13.291456000 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108  13.291455800 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108  13.291455500 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108  13.291455300 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108  13.291455100 : `GC`                 Start of GC stack 
1108  13.291396100 : `GC`                 GC is triggered 

173c   0.335993200 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 2000175 = 00007FFD94AB2D30 (System.SpanHelpers) 

@jkotas
Member

jkotas commented Oct 23, 2019

  • The process was making good progress for the first 0.3s, but then the progress stopped.
  • Several GCs were explicitly triggered by LowResourceNotifications at 13s, 61s and 110s.
  • The process started making progress again at 152s and was killed by the timeout shortly after that.

@jkotas
Member

jkotas commented Oct 23, 2019

This confirms my theory that the machine is oversubscribed when this hang/timeout happens. We need to find out what else is going on on the machine that is consuming all the resources.

@safern
Member

safern commented Apr 22, 2020

@jkotas I've seen that we are hitting this again in some PRs. What thread count and parallelism would you recommend if memory is < 1GB? -parallel collections -maxthread 1?

@jkotas
Member

jkotas commented Apr 22, 2020

-maxthread 1 sounds reasonable to me to start with.

Does -parallel collections have any effect when you specify -maxthread 1?

There is still a chance that we find the machine to be in a fine state when the test is starting, and then the expensive background services kick in while the test is running. It may be better to throttle this in xunit itself to be more agile.

It makes me think: we have seen these hangs with remote executor. Maybe we should add the throttling to the remote executor itself (see the sketch after this list):

  • When the machine is oversubscribed, wait until other remote executor processes finish before starting a new one (i.e. do not have more than one running)
  • If the machine is still oversubscribed after that, wait 1 second before starting the remote executor process
  • Make sure to log diagnostics in either of these cases
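
A rough sketch of that throttling idea in C# (hypothetical helper names, not an actual RemoteExecutor API; the oversubscription check is left to the caller, e.g. based on memory load):

```csharp
using System;
using System.Threading;

internal static class RemoteExecutorThrottle
{
    // Machine-wide named semaphore so that only one remote-executor child runs at a
    // time when resources are low. Named semaphores are Windows-only in .NET Core,
    // which matches where these hangs were observed; the name and limit are illustrative.
    private static readonly Semaphore s_gate =
        new Semaphore(initialCount: 1, maximumCount: 1, name: "Global\\RemoteExecutorThrottle");

    public static IDisposable Acquire(Func<bool> isMachineOversubscribed, TimeSpan maxWait)
    {
        bool acquired = false;
        if (isMachineOversubscribed())
        {
            // 1) Wait until other remote executor processes finish before starting a new one.
            acquired = s_gate.WaitOne(maxWait);
            if (!acquired)
                Console.WriteLine("Diagnostics: gave up waiting for other remote executor processes.");

            // 2) If the machine still looks oversubscribed, back off briefly before launching.
            if (isMachineOversubscribed())
            {
                Console.WriteLine("Diagnostics: machine still oversubscribed; delaying start by 1s.");
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }
        }

        return new Releaser(acquired);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly bool _acquired;
        public Releaser(bool acquired) => _acquired = acquired;
        public void Dispose() { if (_acquired) s_gate.Release(); }
    }
}
```

RemoteExecutor.Invoke could then call Acquire(...) right before launching the child process and dispose the returned handle once the child exits.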

@ViktorHofer
Member

It makes me think: We have seen these hangs with remote executor. Maybe we should add the throttling to the remote executor

@jkotas sounds good. Do you have cycles to submit a PR with the proposed changes?

@jkotas
Member

jkotas commented Apr 27, 2020

Sorry, I do not have cycles to deal with this.

I've seen we are hitting this again in some PRs

Do you have the list of recent PRs that are hitting this? The problem may have shifted somewhat compared to what was identified earlier in this issue.

The one that I have seen recently (#35451) has Memory load: 14, which is pretty reasonable. The signature of the problem identified earlier in this issue was very high memory load (like Memory load: 98).

@ViktorHofer
Member

Sorry, I do not have cycles to deal with this.

No problem at all; I thought it was worth asking. @Anipik this might be something worth picking up, as it directly contributes to the CI pass rate. cc @ericstj

@ericstj
Member

ericstj commented Apr 29, 2020

When the machine is oversubscribed, wait until other remote executor processes finish before starting a new one (i.e. do not have more than one running)
If the machine is still oversubscribed after that, wait 1 second before starting the remote executor process
Make sure to log diagnostics in either of these cases

I think it's reasonable to try to address this by throttling the number of concurrent RemoteExecutor processes when resources are low. @jkotas's suggestion above seems like a good starting point. @ViktorHofer / @jkotas are we only running a single test process, such that handling in the static RemoteExecutor would be sufficient, or do we need to consider multiple processes launching RemoteExecutors on the machine?

We'd want to measure the prevalence of this problem before and after a change.

Here's the command I ran to measure:
runfo search-helix -d runtime -c 100 --pr -v "after 60000ms waiting for remote process."

Build Kind Console Log
623148 PR #35566 console.log
622638 Rolling console.log
622949 PR #35600 console.log
623077 PR #35573 console.log
622787 PR #35566 console.log
621930 PR #35421 console.log
622225 PR #35483 console.log
622286 PR #35590 console.log
622860 PR #35285 console.log
622455 PR #35593 console.log
622684 PR #35573 console.log
622514 PR #35592 console.log
621713 PR #35565 console.log
621834 PR #35579 console.log
619501 PR #34973 console.log
621362 PR #31874 console.log
621635 Rolling console.log
622808 PR #35538 console.log
621573 PR #35573 console.log
622206 PR #35589 console.log
619954 PR #35013 console.log
622502 PR #35584 console.log
621439 PR #35567 console.log
620597 PR #35383 console.log
620431 PR #35368 console.log
621318 PR #35316 console.log
621273 PR #35569 console.log
622077 Rolling console.log
622279 PR #35570 console.log
622156 PR #35183 console.log
622081 PR #35246 console.log
621261 PR #35568 console.log
621799 PR #35582 console.log
621334 Rolling console.log
619918 PR #35353 console.log
621680 PR #35549 console.log
615697 PR #31874 console.log
621675 PR #35522 console.log
621600 PR #35575 console.log
621354 PR #32552 console.log
620050 PR #35233 console.log
621188 PR #35567 console.log

@jkotas
Member

jkotas commented Apr 29, 2020

such that handling in the static RemoteExecutor would be sufficient

Yes, handling this in the static RemoteExecutor should be sufficient.

However, I am not convinced that this will address the problem we are seeing recently. If the problem were really caused by an oversubscribed machine, we would see crashes with a variety of stacks, but that's not the case. We are always seeing the crash with EventUnregister on the stack. That means something is going wrong at this specific point, and it is unlikely to be related to an oversubscribed machine.

Trying to gather more data about this failure, as @noahfalk suggested in #35451 (comment), may be a better course of action.

@ericstj
Member

ericstj commented Apr 29, 2020

I see what you mean. All these logs show the following callstack hanging:

[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
      System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
      System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
      System.Diagnostics.Tracing.EventProvider.Dispose()
      System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
      System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
      System.AppContext.OnProcessExit()

@noahfalk do you think the number of occurrences we are seeing here indicates that progress is not being made?

@ericstj
Member

ericstj commented Apr 29, 2020

The frequency of this issue seems too high for it to be a pre-existing problem; I suspect a recent regression. It also seems to be observed in WinForms starting around 4/20, when they took a runtime update from 4.20218.3 to 4.20220.15. So we should look in this time window for a commit that could have introduced this hang.

@noahfalk
Member

@noahfalk do you think the number of occurrences we are seeing here indicate that progress is not being made?

If we are seeing many occurrences of that stack, I agree it raises suspicion that something else is going on. In terms of progress, I don't yet have a good explanation, and we've got contradictory evidence. On the one hand, the dump doesn't indicate that we are waiting on anything other than the OS scheduler to assign a core; however, if that were truly the only issue, I wouldn't expect to see this particular call stack at high volume. Right now I don't think we've got the data to draw a conclusion in either direction. Additional dumps after a time delay and/or an ETW trace could provide that evidence.

@safern
Member

safern commented Apr 29, 2020

The frequency of this issue seems too high for it to be a pre-existing problem; I suspect a recent regression. It also seems to be observed in WinForms starting around 4/20, when they took a runtime update from 4.20218.3 to 4.20220.15. So we should look in this time window for a commit that could have introduced this hang.

Based on this info, we were able to track down that the SHAs in between those two package versions are c409bd0 and 44ed52a.

Then we looked at Kusto data to find the first build that hit this issue, which was https://dev.azure.com/dnceng/public/_build/results?buildId=608569&view=results -- it used 629dba5 merged with master to check out the repo for the build. The builds started failing on April 21st.

So based on this, I can only think of 2 commits in question: 629dba5 and aa5b204

@jkotas @noahfalk do you see anything that pops up there?

@jkotas
Member

jkotas commented Apr 30, 2020

The diagnostic server PR 629dba5 is the obvious culprit. It is closely coupled with the code where this hang occurs.

It is likely that the diagnostic server eventpipe code is doing something bad during shutdown, and that corruption is leading to this hang. We had similar intermittent shutdown bugs in eventpipe before, so it is not that surprising that a new one popped up after a significant eventpipe change.

Let's use this issue to track the original problem with machine oversubscription, and reactivate #35451 to track the hang in EventUnregister.

cc @josalem

@danmoseley
Member

Kudos to @safern and @ericstj for quickly helping narrow it to 2 commits.

@noahfalk
Member

Indeed, thanks for narrowing it down to a few possible commits. I still have little idea what exactly the server change would do to produce the particular result we saw in #35451, but it narrows the context enough that we can investigate.

@josalem
Contributor

josalem commented Apr 30, 2020

Taking a look today! Thanks @safern and @ericstj for triaging.

@ViktorHofer
Member

Triage: we will need to trigger runfo to get current data to decide whether this issue should be closed or further action is necessary.

@josalem
Contributor

josalem commented Jul 20, 2020

@ViktorHofer I think there are a few different issues being discussed in this thread. The last one in the thread was resolved with the same fixes as #35451, I believe. I'm not sure about the other issues discussed here, though.

@danmoseley
Member

@safern is this test now stable? (Also, is there info on the OneNote to remind me how to make such queries myself?)

If the test is stable we can close this issue.

@safern
Member

safern commented Jul 21, 2020

I think we could still hit this issue; it just depends on the available memory at test execution time, but we don't hit it that often.

Also is there info on the OneNote to remind me how to make such queries myself

I don't think so, we can add it.

@Anipik Anipik modified the milestones: 5.0.0, Future Aug 11, 2020
@ViktorHofer
Member

Runfo doesn't list any failures in the last 14 days for the described remote executor hang: https://runfo.azurewebsites.net/search/tests/?bq=definition%3Aruntime++started%3A%7E14&tq=name%3A%22after+60000ms+waiting+for+remote+process%22.

Feel free to open an issue for following up on improvements to RemoteExecutor or the test runner.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020