
tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI #2044

Open · runfoapp bot opened this issue Oct 30, 2020 · 14 comments

runfoapp bot commented Oct 30, 2020

Runfo Tracking Issue: tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI

Build | Definition | Kind | Run Name | Console | Core Dump | Test Results | Run Client
1269199 | runtime | PR 55604 | CoreCLR windows x86 Checked no_tiered_compilation @ Windows.10.Amd64.Open | console.log | core dump | | runclient.py

Build Result Summary

Day Hit Count | Week Hit Count | Month Hit Count
0 | 0 | 1
@Dotnet-GitSync-Bot

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented Oct 30, 2020

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@tommcdon (Member)

@josalem

josalem (Contributor) commented Nov 18, 2020

Did some preliminary triage on this that I wanted to document:

All the failures are on win-x64 and present with the following failure in the logs:

          4.9s: ::== Running test: TEST_SuspendDefaultPort
          5.0s: running sub-process: C:\h\w\C3B00A3D\p\CoreRun.exe C:\h\w\C3B00A3D\w\A43B08DC\e\tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.dll 0
          5.0s: subprocess started: True
          5.0s: subprocess PID: 5428
          5.0s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
          5.3s: Starting EventPipeSession over standard connection
          5.3s: Started EventPipeSession over standard connection with session id: 0x1b3a7d660c0
          5.3s: Runtime HAS NOT resumed (expects: true): True
          5.3s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=4; CommandId=1; Reserved=0 }; }
          5.5s: Creating EventPipeEventSource
          5.6s: Received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ 0x00 0x00 0x00 0x00  ]}
          5.6s: Waiting for EEStartupStarted event
          5.8s: Created EventPipeEventSource
          5.8s: Starting processing

The first failure is from October 29th, 2020, which is a full two months after that test was added and the day after the new managed thread pool was merged in.

I've had 3 instances of TEST_SuspendDefaultPort running in a loop for several hours overnight with processor affinity set to 0x1. I haven't been able to reproduce the issue locally.

This test specifically checks the behavior of setting DOTNET_SuspendDefaultPort=1, which causes the default listen Diagnostic Port to be set to suspend mode. This suspends runtime startup in EEStartupHelper until a ResumeRuntime command arrives on the default port. The test sets up an EventPipeEventSource that watches for the EEStartupStarted event (the first event fired in the runtime) and sets a ManualResetEvent when it sees it.
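For orientation, the reader side of that setup looks roughly like the sketch below, written against Microsoft.Diagnostics.NETCore.Client and TraceEvent. This is not the actual test code; the PID, provider name, event-name match, and timeout are placeholders/assumptions for illustration.

// Hedged sketch of the test-harness (reader) side; not the actual code in diagnosticport.cs.
using System;
using System.Diagnostics.Tracing;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;

int childPid = 5428;                                // placeholder: PID of the suspended child process
var sawStartup = new ManualResetEvent(false);       // the MRE discussed above

var client = new DiagnosticsClient(childPid);

// Start an EventPipeSession over the default (suspended) port. The provider name is an
// assumption; EEStartupStarted comes from the runtime's private provider.
using var session = client.StartEventPipeSession(
    new EventPipeProvider("Microsoft-Windows-DotNETRuntimePrivate", EventLevel.Verbose));

var source = new EventPipeEventSource(session.EventStream);
source.Dynamic.All += e =>
{
    if (e.EventName.Contains("EEStartupStart"))
        sawStartup.Set();
};
var processing = Task.Run(() => source.Process());

// Corresponds to the "Sent: { ... CommandSet=4; CommandId=1 ... }" line in the log above:
// tell the suspended runtime to resume startup.
client.ResumeRuntime();

bool sawEvent = sawStartup.WaitOne(TimeSpan.FromSeconds(30));   // the test fails when this times out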

The logs indicate that the ResumeRuntime command was sent and the runtime returned the OK response, but the MRE hasn't been set, meaning we haven't seen the event yet.

That means there are two possible errors happening:

  1. The EEStartupStarted event isn't being fired
  2. The runtime didn't actually resume after the ResumeRuntime command

There are dumps of both the parent and child processes from these failures, but neither seems to be in good condition.

When I open them in windbg I see suspicious stacks:
[image: windbg stack view showing suspicious call stacks]

I think there may be issues with the DAC in the artifacts from the build I got the dumps from.

Opening them in VS gives a better view, but the main thread stack is truncated:
[image: Visual Studio view of the same dump; the main thread's stack is truncated]

This is the child process, which should have sent the EEStartupStarted event. The parallel stacks view in VS implies the runtime did in fact resume. The main thread is not in EEStartupHelper, which is where it would be stuck if the runtime didn't resume. Instead, it is in ReadFile with what appears to be a truncated managed stack under it. My hunch is this is the top of the call to Console.ReadLine from the test code: https://github.com/dotnet/runtime/blob/6f29f0be4f699d6aacde916e3ce15ff0f90c48ea/src/tests/tracing/eventpipe/diagnosticport/diagnosticport.cs#L384
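For context, the kind of keep-alive that would put the main thread there looks like this (a hypothetical sketch, not the literal code behind that link):

// Hypothetical child-process keep-alive; not the literal test code at the link above.
// The subprocess parks its main thread in Console.ReadLine(), which bottoms out in
// kernel32!ReadFile, until the parent writes a line to its stdin telling it to exit.
Console.WriteLine("ready");   // hypothetical handshake with the parent process
Console.ReadLine();           // main thread blocks here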

The evidence points to the EEStartupStarted event sometimes not firing. I'm not sure how that could be the case just yet, but I'm inclined to think the managed threadpool might have changed things somehow. This test started failing intermittently right after that was merged in. I'll focus further investigation on that hypothesis with the hopes that I can reproduce it locally.

josalem (Contributor) commented Nov 23, 2020

I've left the test running in three consoles restricted to one core for >6 days and the issue has not reproduced locally. Since we are getting core dumps from the failure in CI, I'll look at turning on StressLog in these tests to see if I can get more information.

josalem (Contributor) commented Dec 9, 2020

Managed to validate with a different dump that the test is indeed waiting inside the Console.ReadLine() from https://github.com/dotnet/runtime/blob/a7c31798739670fae93e962d1158439a83a118cf/src/tests/tracing/eventpipe/diagnosticport/diagnosticport.cs#L384

0:000> !clrstack
OS Thread Id: 0x1c5c (0)
        Child SP               IP Call Site
000000959357CFA8 00007ffe324c5b04 [InlinedCallFrame: 000000959357cfa8] Interop+Kernel32.ReadFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357CFA8 00007ffda4a5e16a [InlinedCallFrame: 000000959357cfa8] Interop+Kernel32.ReadFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357CF70 00007ffda4a5e16a ILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
000000959357D0B0 00007ffda4a61d1a System.ConsolePal+WindowsConsoleStream.ReadFileNative(IntPtr, System.Span`1, Boolean, Int32 ByRef, Boolean) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1165]
000000959357D170 00007ffda4a61bc7 System.ConsolePal+WindowsConsoleStream.Read(System.Span`1) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1123]
000000959357D1F0 00007ffda4a61a90 System.IO.ConsoleStream.Read(Byte[], Int32, Int32) [/_/src/libraries/System.Console/src/System/IO/ConsoleStream.cs @ 34]
000000959357D260 00007ffe027d534c System.IO.StreamReader.ReadBuffer()
000000959357D2C0 00007ffe027d581d System.IO.StreamReader.ReadLine()
000000959357D310 00007ffda4a6185f System.IO.SyncTextReader.ReadLine() [/_/src/libraries/System.Console/src/System/IO/SyncTextReader.cs @ 76]
000000959357D370 00007ffda4a5c26d diagnosticport.dll!Unknown
000000959357D430 00007ffda4a59982 System.Private.CoreLib.dll!Unknown
000000959357D490 00007ffda4a598f9 diagnosticport.dll!Unknown
000000959357D4E0 00007ffda4a59861 diagnosticport.dll!Unknown

Since we're in managed code, that means the runtime resumed already. This implies that the EEStartupStarted event was either not fired or not received.

safern (Member) commented Dec 10, 2020

Added blocking-clean-ci as I just hit it on a PR.

@josalem it seems like you're actively investigating; if you think the fix will still take some time (more than a couple of hours), can we disable the test in the meantime?

josalem (Contributor) commented Dec 10, 2020

I'm hesitant to turn it off since it only seems to happen in CI, but we can if it is blocking things. I can disable it in the morning. I'm actively looking into this one.


I went digging through the session state for the app and it's in a weird state:
[image: debugger view of the writer app's EventPipe session state]

Ignore the m_numEventsWritten value; it isn't modified anywhere in the code, so it should be 0.

The odd thing is that m_numEventsStored is set to 2, indicating that 2 events were written and neither required a new buffer to be allocated. Presumably a buffer existed at some point, because when I dig further I find that the EventPipeSessionState shows an empty BufferList while m_LastReadSequenceNumber is 2:
[image: EventPipeSessionState with an empty BufferList and m_LastReadSequenceNumber = 2]

Looking into the buffer manager logic, we should only ever deallocate a buffer if it is empty or the session has ended.

Looking at a couple of instances of the issue, the symptom appears to be the same. The runtime is successfully resumed and the EEStartupStart event is fired, but the EventPipeEventSource in the test harness never sees it. This only happens sometimes on Windows x64 Checked, and only since October 29th, two months after the test went in.

CC @sywhang

sywhang (Contributor) commented Dec 10, 2020

@safern I will follow up with a PR to disable this test now.
@josalem lmk if you need help investigating it.

josalem (Contributor) commented Dec 11, 2020

Jotting down a few notes:

  • The writer app's StressLog indicates that the tested behavior is working, i.e., the runtime is suspended on start and is then resumed after the resume command (the first managed thread appears in the log after the command is received). The issue is that the EEStartupStart event isn't being received by the reader app, so it is waiting on the MRE. The reader appears to be in the middle of reading an EventBlock, while the writer app says it already wrote the event and is waiting for more events to be written.
  • The reader app's state indicates that it received data over EventPipe. The EventPipeEventSource knows the writer app's PID, and EventPipeEventSource..ctor() returned meaning that the trace header was at least received. It is difficult to determine what else has been received. The EventCache is empty, but the StackCache isn't. The reader also saw the MetadataBlock containing the metadata for the event we expect to see. The internal stream reader seems to be partway through reading an EventBlock at the time of the dump. It is blocked in a read on the Named Pipe. The runtime shouldn't be sending partial EventBlocks, so this might indicate an angle to investigate. It is possible that the EventBlock being read contains the event we are blocked waiting on.
  • The Windows10.Amd64.Open queue switched to 2 core AMD CPU VMs on October 28th which is when this test started failing intermittently. It had been in the repo for a long time before that without failures. The fact that this test has failed in release/5.0 branches implies the CPU manufacturer change is more likely an influence than the portable threadpool as I previously hypothesized. I'm not sure what a manufacturer change could do to intermittently break this scenario yet. One hypothesis is that a difference in TSC synchronizing between cores with AMD vs Intel could result in the EventPipeBuffer* logic tripping and not performing certain actions like flush, tag events as oldest, or send sequence points. Any of those 3 failing to happen could cause the reader to not dispatch an event even if received.

josalem (Contributor) commented Dec 12, 2020

A little more progress.

The EventPipeEventSource is in the middle of reading an EventBlock from the stream. According to the internal variables for the reader, it has already read the entirety of the EventBlock into the buffer. Now it is attempting to do a 6 byte read to get to an aligned address.

I looked in the internal buffer for the reader and decoded the EventBlock. It contains the event the test is waiting for:

0:007> db 000001920011a068 L31
00000192`0011a068  14 00 01 00 4d 47 e6 dc-04 00 00 00 66 97 34 dd  ....MG......f.4.
00000192`0011a078  04 00 00 00 cf 01 00 dc-38 ff ff ff ff 0f dc 38  ........8......8
00000192`0011a088  01 cd 8e 99 e7 4d 02 01-00 01 02 99 a0 b9 02 01  .....M..........
00000192`0011a098  00    

Header Size:  0x0014 (0n20)
Flags:        0x0001 (0n1) (Variable Length Encoded Compressed Headers)
MinTimestamp: 0x00000004dce6474d (0n20_885_948_237)
MaxTimestamp: 0x00000004dd349766 (0n20_891_080_550)

Event with compression:
 Header:
  Flags byte:  0xcf (1100`1111)
  MetadataID:  0x01 (This is EEStartupStart)
  Sequence#:   0x00 (increment by 1) => 1
  CaptureTID:  0xdc (1101 1100) + 0x38 (0011 1000) ==> 0x1c5c (0001 1100 0101 1100)
  ProcessNum:  ff ff ff ff 0f => 1111`1111 1111`1111 1111`1111 1111`1111 (0n4294967295 => -1)
  ThreadID:    dc 38 => 0x1c5c
  StackID:     01
  TimeStamp:   cd 8e 99 e7 4d => 10011011100111001100100011101001101 => 20885948237
  ActivityID:  N/A
  RelActID:    N/A
  IsSorted:    (0xcf & 64) => 1
  PayloadSize: 0x2
 EventPayload:
  0x0001

Event with compression:
 Header:
  Flag:        0x01
  MetadataID:  0x02 (Didn't validate, but this is probably EEStartupEnd)
  SequenceNum: previous + 1 => 2
  CaptureTID:  same as previous => 0x1c5c
  ProcessNum:  same as previous => -1
  TID:         same as previous => 0x1c5c
  StackId:     same as previous => 0x01
  TimeStamp:   99 a0 b9 02 (didn't convert)
  ActivityID:  N/A
  RelActID:    N/A
  PayloadSize: 0x01
 EventPayload:
  0x00
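As a sanity check on the decode above, here is a minimal sketch of the unsigned varint (LEB128-style) decoding used for the compressed header fields. It is illustrative only, not TraceEvent's actual implementation, but it reproduces the CaptureTID arithmetic (0xdc 0x38 => 0x1c5c):

// Illustrative varint decode; not the actual TraceEvent implementation.
static ulong ReadVarUInt(byte[] bytes, ref int offset)
{
    ulong value = 0;
    int shift = 0;
    byte b;
    do
    {
        b = bytes[offset++];
        value |= (ulong)(b & 0x7F) << shift;   // low 7 bits carry payload
        shift += 7;
    } while ((b & 0x80) != 0);                 // high bit set => more bytes follow
    return value;
}

int pos = 0;
ulong tid = ReadVarUInt(new byte[] { 0xdc, 0x38 }, ref pos);
// tid == 0x1c5c, matching the OS thread id seen in the !clrstack output above.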

This means that the reader is going to be stuck in this 6 byte read indefinitely, because the writer app won't be sending any more events or data. It also means that the events already received won't get dispatched, so our test will keep waiting for them.

Why is this intermittent? EventBlobs are variable length encoded, meaning that a given sequence of events will have a non-deterministic size. More importantly, this means that while an EventPipeBlock might start with padding to be aligned, it won't necessarily end aligned. Specifically, the length of the Block isn't guaranteed to be 4 or 8 byte aligned, the latter of which is expected by EventPipeEventSource.

Why did it not start failing till Oct 29th? This one is harder to rationalize with this hypothesis. It is possible that a difference in TSC speeds or values between Intel and AMD caused the EventBlocks to be problematic lengths more often, but that would be difficult to prove. It is also possible that the test was intermittently failing from before Oct 29th and the AzDO data simply wasn't reflecting that. In either case, I think the timing of this test failing might be a red herring. I'm inclined to think it was the former of those two guesses.

I am fairly confident that this means the runtime is actually behaving correctly. The StressLogs show that the server is in a good state at the timeout and that the runtime resumed correctly. The dump shows that EventPipe correctly sent the 2 expected events and is sleeping until more events are written. The reader app has the full EventBlock containing these events.

The issue is that the events never get dispatched due to the attempt to advance to an 8 byte aligned offset into the internal buffer. I believe the issue is in TraceEvent and not in the runtime.

This is still a hypothesis, but my confidence in it is growing, especially because the logic we are blocked in has the following comment on it:

                // PERF policy
                // In the streaming (non-seekable) case we don't want to buffer any more data than was
                // requested and needed for alignment because this might cause the thread to block waiting 
                // for the unneeded data to arrive. There is probably a better way to do this that can
                // oportunistically buffer if the data is available but this code isn't that sophisticated
                // yet.
                //
                // In the non-streaming (seekable) case we do want to buffer because that lets the
                // reader achieve higher throughput.
                int fillSize = inputStream.CanSeek ? bytes.Length : (position + minimum + (align-1)) & ~(align-1);

I think I could validate this by splicing together a trace by hand that has an EventBlock that causes EventPipeEventSource to get into this state.
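To make that arithmetic concrete, here is the quoted fillSize expression applied to a hypothetical position; the numbers are made up for illustration and not read from the dump:

// Illustrative application of the quoted fillSize expression; position/minimum are made up.
using System;

static int FillSize(int position, int minimum, int align, bool canSeek, int bufferLength)
    => canSeek ? bufferLength : (position + minimum + (align - 1)) & ~(align - 1);

// Streaming (non-seekable) case with 8-byte alignment: if the EventBlock data ends at
// position 50, the reader wants the buffer filled to 56, i.e. it blocks in a 6-byte read
// that the writer never satisfies because no further events are coming.
int fill = FillSize(position: 50, minimum: 1, align: 8, canSeek: false, bufferLength: 4096);
Console.WriteLine(fill);   // 56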

CC @brianrob @sywhang @noahfalk @tommcdon

josalem (Contributor) commented Jan 9, 2021

Confirmed my hypothesis about the cause of the hang. I doctored a nettrace stream to have an EventBlock with a size not divisible by 8. Feeding this to EventPipeEventSource causes it to hang at the spot in my previous comment, waiting for more data before doing the event dispatch. The reason this behavior started abruptly is still unexplained, but seems unimportant. Regardless of the reason, the fix will need to happen in TraceEvent.
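For reference, here is a sketch of how that repro can be driven. The file name and timeout are placeholders, and the doctored trace is pushed through a pipe so that EventPipeEventSource takes the non-seekable, streaming code path quoted earlier:

// Hedged repro sketch: feed a hand-edited .nettrace byte stream to EventPipeEventSource
// over a non-seekable pipe and observe whether Process() dispatches the events or hangs
// in the alignment read.
using System;
using System.IO;
using System.IO.Pipes;
using System.Threading.Tasks;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;

var server = new AnonymousPipeServerStream(PipeDirection.Out);
var client = new AnonymousPipeClientStream(PipeDirection.In, server.ClientSafePipeHandle);

var writer = Task.Run(() =>
{
    using var file = File.OpenRead("doctored.nettrace");   // placeholder file name
    file.CopyTo(server);
    server.Flush();
    // Deliberately keep the pipe open: the hang only manifests while the reader still
    // believes more data may arrive, as in the live CI sessions.
});

var source = new EventPipeEventSource(client);
source.Dynamic.All += e => Console.WriteLine($"dispatched: {e.EventName}");

var processing = Task.Run(() => source.Process());
if (!processing.Wait(TimeSpan.FromSeconds(10)))
    Console.WriteLine("still blocked: the events were read into the buffer but never dispatched");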

josalem referenced this issue in josalem/runtime Jan 21, 2021
* real fix needs to happen in TraceEvent
josalem referenced this issue in dotnet/runtime Jan 25, 2021
* real fix needs to happen in TraceEvent
josalem (Contributor) commented Jan 27, 2021

Failures since the bandaid fix all appear unrelated; they are due to PRs testing linking, which fails this test since it relies on EventSource.

josalem (Contributor) commented Feb 16, 2021

The new failure is not a recurrence of the original issue. It is a failure in a PR that changes floating point number parsing.
