
Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451

Closed
jkotas opened this issue Apr 25, 2020 · 23 comments · Fixed by #35767
Labels
area-System.Diagnostics.EventLog blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@jkotas
Member

jkotas commented Apr 25, 2020

      Half-way through waiting for remote process.
      Memory load: 14
      
      Image Name                     PID Services                                    
      ========================= ======== ============================================
      svchost.exe                    620 BrokerInfrastructure, DcomLaunch, LSM,      
                                         PlugPlay, Power, SystemEventsBroker         
      svchost.exe                    652 RpcEptMapper, RpcSs                         
      svchost.exe                    776 Dhcp, EventLog, lmhosts, vmictimesync,      
                                         Wcmsvc                                      
      svchost.exe                    808 AeLookupSvc, Appinfo, BITS, CertPropSvc,    
                                         gpsvc, iphlpsvc, LanmanServer, ProfSvc,     
                                         sacsvr, Schedule, SENS, SessionEnv,         
                                         ShellHWDetection, Themes, Winmgmt           
      svchost.exe                    888 EventSystem, FontCache, netprofm, nsi,      
                                         W32Time, WinHttpAutoProxySvc                
      svchost.exe                    976 CryptSvc, Dnscache, LanmanWorkstation,      
                                         NlaSvc, WinRM                               
      svchost.exe                    592 BFE, DPS, MpsSvc, pla                       
      svchost.exe                   1136 AppHostSvc                                  
      svchost.exe                   1164 DiagTrack                                   
      svchost.exe                   1328 TrkWks, UALSVC, UmRdpService,               
                                         vmickvpexchange, vmicshutdown, vmicvss      
      svchost.exe                   1352 W3SVC, WAS                                  
      svchost.exe                   2632 TermService                                 
      svchost.exe                   2672 vmicheartbeat, vmicrdv                      
      Timed out at 4/25/2020 1:55:40 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\h\w\AE3609A5\w\B05309CA\uploads\3476.zuzob0j0.l20.dmp
      	Process ID: 3476
      	Handle: 1004
      	Name: dotnet
      	MainModule: C:\h\w\AE3609A5\p\dotnet.exe
      	StartTime: 4/25/2020 1:54:40 AM
      	TotalProcessorTime: 00:00:00.2343750
      	Threads:
      		Thread #1 (OS 0x560)   
      			[InlinedCallFrame] (System.Environment._Exit)
      			[InlinedCallFrame] (System.Environment._Exit)
      			System.Environment.Exit(Int32)
      			Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
      		Thread #2 (OS 0xA38) [Finalizer] [Background] [MTA]
      			[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      			[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      			System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
      			System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
      			System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
      			System.Diagnostics.Tracing.EventProvider.Dispose()
      			System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
      			System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
      			System.AppContext.OnProcessExit()
      		Thread #3 (OS 0x700) [Thread pool worker] [Background] [MTA]
      		Thread #4 (OS 0xE50) [Thread pool worker] [Background] [MTA]
@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label Apr 25, 2020
@ghost

ghost commented Apr 25, 2020

Tagging subscribers to this area: @tarekgh, @tommcdon
Notify danmosemsft if you want to be subscribed.

@jkotas
Member Author

jkotas commented Apr 25, 2020

Saved dump at \\jkotas9\drops\issue-35451

We are stuck inside GetProcAddress for no apparent reason:

 # Child-SP          RetAddr           Call Site
00 0000000f`f392ef48 00007ffa`a429de39 ntdll!ZwWaitForSingleObject+0xa [d:\blue.obj.amd64fre\minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 198] 
01 0000000f`f392ef50 00007ffa`a429b7e4 ntdll!RtlpWaitOnCriticalSection+0xe1 [d:\blue\minkernel\ntdll\resource.c @ 771] 
02 0000000f`f392f020 00007ffa`a429b6dc ntdll!RtlpEnterCriticalSectionContended+0xa4 [d:\blue\minkernel\ntdll\resource.c @ 1220] 
03 0000000f`f392f060 00007ffa`a429f6be ntdll!LdrpAcquireLoaderLock+0x2c [d:\blue\minkernel\ntdll\ldr.c @ 250] 
04 0000000f`f392f0a0 00007ffa`a17d58f0 ntdll!LdrGetProcedureAddressForCaller+0x19e [d:\blue\minkernel\ntdll\ldrapi.c @ 1901] 
05 0000000f`f392f290 00007ffa`803ee358 KERNELBASE!GetProcAddressForCaller+0x70 [d:\blue\minkernel\kernelbase\module.c @ 1891] 
06 0000000f`f392f2e0 00007ffa`803ee17d coreclr!NDirectMethodDesc::FindEntryPoint+0x4c [F:\workspace\_work\1\s\src\coreclr\src\vm\method.cpp @ 5339] 
07 0000000f`f392f340 00007ffa`803edb8b coreclr!NDirect::NDirectLink+0x119 [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7121] 
08 0000000f`f392f670 00007ffa`804a2ae3 coreclr!NDirectImportWorker+0x7b [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7271] 
09 0000000f`f392f6f0 00007ffa`2093d471 coreclr!NDirectImportThunk+0x43 [F:\workspace\_work\1\s\src\coreclr\src\vm\amd64\AsmHelpers.asm @ 198] 
0a 0000000f`f392f760 000e000f`d1613df0 System_Private_CoreLib!System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)+0x81 [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1286] 

This may be a duplicate of #13610.

@tommcdon added this to the 5.0 milestone Apr 25, 2020
@tommcdon removed the untriaged (New issue has not been triaged by the area owner) label Apr 25, 2020
@tommcdon
Member

@noahfalk for thoughts on what might be causing this

@noahfalk
Member

I'll see if I can take a closer look tomorrow (I'll need to get back on corpnet to access the dump). Off the cuff, that callstack looked like the thread was blocked on the loader lock, so probably some other thread is holding it. If so, we'll spot the culprit in the dump.

@noahfalk
Member

noahfalk commented Apr 29, 2020

@jkotas - I wasn't able to access the dump on your share. I can list directories in \\jkotas9 and \\jkotas9\public, but I get access denied trying to open \\jkotas9\drops. For comparison I am able to access shares on \\cpvsbuild\ so I suspect this is a security configuration issue on your share rather than a broader issue with my credentials or VPN connection.

@noahfalk
Member

Attaching windbg showed the loader lock was unowned:

0:002> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)                 : 0x7ffaa438e8a8 [Type: _RTL_CRITICAL_SECTION *]
    [+0x000] DebugInfo        : 0x7ffaa438ebe0 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
    [+0x008] LockCount        : 0 [Type: long]
    [+0x00c] RecursionCount   : 0 [Type: long]
    [+0x010] OwningThread     : 0x0 [Type: void *]
    [+0x018] LockSemaphore    : 0x168 [Type: void *]
    [+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Best guess: the loader lock was owned earlier, causing the OS to switch the thread out; the lock was later released, and for whatever reason the OS hasn't yet switched the thread back in.

Jan's suggestion of #13610 sounds very plausible to me. If we wanted more data on this failure (or failures of this class), we'd probably need to upgrade the telemetry to determine whether the thread is making progress and what is happening machine-wide. A better set of telemetry to capture for a suspected hung test would be:

  1. a process dump
  2. an ETW trace including stack sampling and context switches, machine-wide, for ~30 sec
  3. another process dump

This would clearly show whether progress was being made, as well as what else the machine is doing. A sketch of the dump-capture half of this is below.
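
A minimal C++ sketch, against the Win32 debug-help API, of what the dump-capture half of that telemetry could look like. WriteProcessDump is an illustrative helper name, and the ETW trace in step 2 is assumed to come from an external collector (e.g. wpr) rather than being shown here:

#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

// Write a full-memory minidump of the target process to the given path.
bool WriteProcessDump(DWORD pid, const wchar_t* path)
{
    HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ | PROCESS_DUP_HANDLE,
                                 FALSE, pid);
    if (process == nullptr)
        return false;

    HANDLE file = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
    {
        CloseHandle(process);
        return false;
    }

    // Full memory so managed state (e.g. the EventUnregister frame) stays inspectable.
    BOOL ok = MiniDumpWriteDump(process, pid, file, MiniDumpWithFullMemory, nullptr, nullptr, nullptr);

    CloseHandle(file);
    CloseHandle(process);
    return ok != FALSE;
}

// Intended sequence for a suspected hang (the ~30 sec machine-wide ETW trace, e.g. via
// "wpr -start CPU" / "wpr -stop trace.etl", would run between the two calls):
//   WriteProcessDump(pid, L"before.dmp");
//   ... collect the trace ...
//   WriteProcessDump(pid, L"after.dmp");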

Closing this as duplicate of #13610

@jkotas reopened this Apr 30, 2020
@jkotas
Member Author

jkotas commented Apr 30, 2020

This was most likely introduced by #33307 according to the CI telemetry. See #13610 (comment) for more context.

@jkotas added the blocking-clean-ci (Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms') label Apr 30, 2020
@safern
Member

safern commented Apr 30, 2020

Another data point: based on the Kusto data I see, this only happens on Windows 7 and Windows 8.1 -- I tried to repro locally on Windows 10, and after 150 runs it didn't repro.

@noahfalk
Member

Thanks @safern! That's very likely to save @josalem from hitting some dead ends.

@josalem
Contributor

josalem commented Apr 30, 2020

this only happens on Windows 7 and Windows 8.1 -- I tried to repro locally on Windows 10, and after 150 runs it didn't repro.

Interesting!

I'll take a look at this today. Thanks for triaging things @safern! I'm not sure what would be causing this issue, but based on the versions being hit, I'm guessing it might be something with Overlapped IO in old versions of Windows. I'll have to do some investigating.

@josalem
Contributor

josalem commented May 1, 2020

I spun up a Win7 VM and I'm trying to catch this, but I can't seem to reproduce the failure. I am running the System.Collections.Concurrent.Tests suite in a loop and having it ping me if a run takes longer than 60 seconds. So far I've noticed that the tests either take almost exactly 30 seconds or almost exactly 65 seconds. My VM has 2 vCPUs and 4 GB of RAM, so unfortunately it's a little bigger than the ones in CI.

I've narrowed the test cycle down to just the System.Collections.Concurrent.Tests.EtwTests test that I saw a failure on in the other thread. That test usually completes in <1 second, but every once in a while takes >10 seconds (1 in 750+ loops so far). I'll leave it running overnight and see if it gets hit more often. I might try artificially bogging down the machine as well and see if that changes things.

@safern, were you ever able to locally reproduce this?

While I try to get a repro set up, I'll take another look at the dump that Jan shared. Based on what I've already seen and what Noah found, I'm not sure there is enough in the dump to root cause this. The thing I found interesting is that the dump shows the DiagnosticsServer is inactive, i.e., it's in the middle of a Poll call that is using WaitForMultipleObjects to wait on a named pipe handle. That PR doesn't include any changes to eventing, and since the monitor configuration isn't enabled, it shouldn't be doing anything new. I'm pretty sure this also means that we haven't reached the VM's shutdown code, so we wouldn't have called DiagnosticServer::Shutdown yet, and it should be fine that the server is waiting. I can try bisecting my changes with ifdefs to see if something is exacerbating the problem, but I can't do too much until I can reliably reproduce the error. I'll keep digging.

@josalem
Contributor

josalem commented May 1, 2020

By "ping" I meant that it literally made a ping sound from the console. I had stepped away from my computer for a while, and it started beeping. It seems like after 1000 loops most of the runs are taking >10 seconds! The xunit output claims the tests ran in 5 seconds (still a big increase over the 0.6 second runtime from the first 900 loops), but the RunTests.cmd script didn't finish within 10 seconds. The logs also say that the tests passed. I'll dig into this more tomorrow with some better logging and see if it actually means anything (and mute my computer before I call it a day 😝).

@josalem
Contributor

josalem commented May 1, 2020

I left my script running overnight. I'm seeing some interesting behavior that I'm going to try and reproduce the rest of today.

I wrote up a PowerShell script that does the following in an infinite loop:

$Now = Get-Date
$job = Start-Job -ScriptBlock {
    # invoke RunTests.cmd scoped to just System.Collections.Concurrent.Tests.EtwTests
}

# Poll the job; once wall-clock time exceeds 10 seconds, beep and log until the job finishes.
while ($job.State -eq 'Running') {
    if ((Get-Date).AddSeconds(-10) -gt $Now) {
        # beep and log
    }
    Start-Sleep -Milliseconds 250   # avoid busy-spinning while polling the job state
}

$Output = $job | Receive-Job
Write-Host $Output

The time check is wall-clock time, so it isn't timing how long xunit claims the tests executed for. For the first ~1000 iterations, the job would complete in <10s and xunit would claim ~0.6s of test time. After that, almost every job took >10s and xunit claimed the tests took 3-6s, which is a 10x increase at the upper end. After ~3000 iterations, it looks like something went haywire: the PowerShell job died but remained in the 'Running' state, so the time-check loop never exited, yet there wasn't a dotnet.exe process running on the system.

I want to build a more robust script that does some more interesting logging, especially better timing. Right now, my thinking is that there might be some kind of OS scheduling issue that we are exacerbating with high volumes of RemoteExecutor tests, which causes the performance of the tests to degrade over time. I'll try running this same experiment with another test suite and see if I see similar behavior. I think it's possible that the tests are hitting a timeout cumulatively as this degradation happens: when we then take a dump, we keep seeing the EventUnregister stack because that method is specifically waiting on the OS to wake it up, but the scheduling degradation is causing a consistent enough delay that our dump logic is invoked before the OS "gets around to it". I need more evidence to back this up, though, since it does seem to happen more often after #33307 despite that change not touching this logic and being inert during the test I used.

@stephentoub
Member

Can we revert the offending PR? This is causing at least one leg in the majority of PRs to fail. It's going to make it much more likely that other regressions slip in, as we're having to rerun lots of legs and merge with failures that might be masking other failures, and it's going to make it more likely devs assume that failures aren't theirs even if they are in fact caused by their PR. Thanks!

@safern
Member

safern commented May 3, 2020

I agree. This is being hit in 50+ runs a day, and we're now at a 0% pass rate in rolling builds as well.

@jkotas
Member Author

jkotas commented May 3, 2020

Reverting in #35767

@jkotas
Member Author

jkotas commented May 3, 2020

I have stepped through the Event Pipe IPC shutdown sequence. I believe that there is a clear HANDLE use-after-close bug:

The handle that is being closed here:

00 CoreCLR!IpcStream::DiagnosticsIpc::Close+0xbb [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 213] 
01 CoreCLR!IpcStreamFactory::ConnectionState::Close+0x42 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.h @ 38] 
02 CoreCLR!IpcStreamFactory::CloseConnections+0x71 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 115] 
03 CoreCLR!DiagnosticServer::Shutdown+0x1e8 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 229] 
04 CoreCLR!EEShutDownHelper+0x1b6 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1212] 
05 CoreCLR!EEShutDown+0x346 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1677] 
06 CoreCLR!CorHost2::UnloadAppDomain2+0xfb [D:\runtime\src\coreclr\src\vm\corhost.cpp @ 942] 

is being waited on - or about to be waited on - here:

00 KERNELBASE!WaitForMultipleObjects [minkernel\kernelbase\synch.c @ 1401] 
01 CoreCLR!IpcStream::DiagnosticsIpc::Poll+0x31f [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 311] 
02 CoreCLR!IpcStreamFactory::GetNextAvailableStream+0x1c2 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 162] 
03 CoreCLR!DiagnosticServer::DiagnosticsServerThread+0x379 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 50] 
04 KERNEL32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
05 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153] 

How does this explain the hang (a standalone sketch of this handle-reuse hazard is at the end of this comment):

  • The handle value gets reused for an auto-reset event allocated by the OS loader code
  • The Wait operation in IpcStream will reset the auto-reset event
  • The OS loader code will hang waiting for the event forever because it was already reset

Why is it more likely to repro on Win7/8:

  • Differences in the OS scheduler and differences in the OS handle allocation algorithm

The shutdown sequence of the event pipe needs careful review. Also, it would be a good idea to minimize the work done during shutdown to only what's absolutely necessary; e.g., these event handles do not really need to be closed, since the OS will close them anyway as part of process shutdown.
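
A minimal standalone C++ sketch of the handle-reuse hazard described above (illustrative only, not the coreclr code; whether the numeric value is actually recycled is up to the OS, which the first printf makes visible):

#include <windows.h>
#include <stdio.h>

int main()
{
    // Stands in for the IPC event handle owned by the diagnostics server thread.
    HANDLE ipcEvent = CreateEventW(nullptr, /*manualReset*/ FALSE, /*initialState*/ FALSE, nullptr);

    // The shutdown path closes the handle while another thread still holds the raw value.
    HANDLE staleValue = ipcEvent;
    CloseHandle(ipcEvent);

    // An unrelated component (standing in for the OS loader) allocates a new auto-reset
    // event; the numeric handle value is often recycled to the one just closed.
    HANDLE loaderEvent = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    printf("stale=%p new=%p reused=%d\n", staleValue, loaderEvent, staleValue == loaderEvent);

    // The new event gets signaled...
    SetEvent(loaderEvent);

    // ...but a wait on the stale value consumes that signal (a successful wait resets an
    // auto-reset event)...
    WaitForSingleObject(staleValue, 0);

    // ...so the legitimate owner now waits forever (1 second timeout here instead of INFINITE).
    DWORD result = WaitForSingleObject(loaderEvent, 1000);
    printf("owner wait result: %lu (WAIT_TIMEOUT = %lu)\n", result, (DWORD)WAIT_TIMEOUT);

    CloseHandle(loaderEvent);
    return 0;
}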

@jkotas
Member Author

jkotas commented May 3, 2020

The dump at \\jkotas9\drops\issue-35451 confirms this: pHandles[0] on thread 3 is 0x168, CriticalSection.LockSemaphore on thread 2 is 0x168

@stephentoub
Member

Thanks, Jan!

@josalem
Contributor

josalem commented May 4, 2020

Oof! That's a tricky one! Thanks for pinpointing the issue @jkotas! I'm writing a fix for it now. Earlier iterations of the PR specifically didn't close all the handles and didn't attempt to cancel any waits, but I recalled #12991, which was "fixed" by closing handles to break out of a blocking wait. I'll come up with a more elegant approach that cancels the blocking Poll call and doesn't attempt subsequent cleanup; a sketch of one such cancellation pattern is below.
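
A minimal C++ sketch of one such cancellation pattern, assuming a dedicated manual-reset shutdown event added to the wait set; the names are illustrative and this is not the actual change in #35767:

#include <windows.h>

// Illustrative state shared between the diagnostics server thread and the shutdown path.
static HANDLE g_ipcEvent = nullptr;       // stands in for the IPC/named-pipe wait handle
static HANDLE g_shutdownEvent = nullptr;  // manual-reset event used only to cancel the wait

// Server loop: wait on the IPC handle *and* the shutdown event, so shutdown never has to
// close a handle that this thread is blocked on.
static DWORD WINAPI ServerThread(LPVOID)
{
    HANDLE waitSet[2] = { g_shutdownEvent, g_ipcEvent };
    for (;;)
    {
        DWORD which = WaitForMultipleObjects(2, waitSet, /*waitAll*/ FALSE, INFINITE);
        if (which == WAIT_OBJECT_0)        // shutdown requested: exit without further cleanup
            return 0;
        if (which == WAIT_OBJECT_0 + 1)
        {
            // ... handle the IPC event / accept the connection ...
        }
    }
}

// Shutdown path: signal the cancellation event and wait for the server thread to exit.
// The event and pipe handles are left for the OS to reclaim at process exit, avoiding the
// close-then-reuse hazard described earlier in the thread.
static void ShutdownServer(HANDLE serverThread)
{
    SetEvent(g_shutdownEvent);
    WaitForSingleObject(serverThread, INFINITE);
}

int main()
{
    g_ipcEvent = CreateEventW(nullptr, /*manualReset*/ FALSE, FALSE, nullptr);
    g_shutdownEvent = CreateEventW(nullptr, /*manualReset*/ TRUE, FALSE, nullptr);
    HANDLE server = CreateThread(nullptr, 0, ServerThread, nullptr, 0, nullptr);

    // ... process runs; diagnostics clients would signal g_ipcEvent ...

    ShutdownServer(server);
    return 0;
}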

@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020