Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh #35884

Closed
v-haren opened this issue May 6, 2020 · 8 comments · Fixed by #40171
Closed
Labels
area-Diagnostics-coreclr GCStress tracking This issue is tracking the completion of other related issues.
Milestone

Comments

@v-haren
Copy link

v-haren commented May 6, 2020

failed in job: runtime-coreclr gcstress0x3-gcstress0xc 20200503.1

Error message

Process terminated. A callback was made on a garbage collected delegate of type 'System.Private.CoreLib!Interop+Advapi32+EtwEnableCallback::Invoke'.
/home/helixbot/work/A9DD08E1/w/A2D60904/e/tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh: line 312: 9131 Aborted $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code: 1
Raw output file: /home/helixbot/work/A9DD08E1/w/A2D60904/e/tracing/eventpipe/Reports/tracing.eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.output.txt
Raw output:
BEGIN EXECUTION
/home/helixbot/work/A9DD08E1/p/corerun eventsourceerror.dll ''
 0.0s: ==TEST STARTING==
 2.6s: Validating clean environment...
 10.3s: Environment was clean.
 11.0s: Started sending sentinel events...
 11.5s: Connecting to EventPipe...
 22.0s: Connected to EventPipe with sessionID '0x7f59bc000fb0'
 22.0s: Creating EventPipeEventSource...
 24.8s: EventPipeEventSource created
 26.0s: Dynamic.All callback registered
 26.0s: Running optional trace validator
 26.1s: Finished running optional trace validator
 26.1s: Starting stream processing...
 31.7s: Saw sentinel event
 31.7s: Starting event generating action...
 31.7s: Stopped sending sentinel events
 36.5s: Stopping event generating action
 36.9s: Sending StopTracing command...
 38.4s: Saw new provider 'IllegalTypesEventSource'
 39.2s: Saw new provider 'Microsoft-DotNETCore-EventPipe'
 39.5s: Saw new provider 'Microsoft-Windows-DotNETRuntimeRundown'
Writing minidump to file /home/helixbot/dotnetbuild/dumps/coredump.9131.dmp
Written 105562112 bytes (25772 pages) to core file
Dump successfully written
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=/home/helixbot/work/A9DD08E1/p
> /home/helixbot/work/A9DD08E1/w/A2D60904/e/tracing/eventpipe/eventsourceerror/eventsourceerror/eventsourceerror.sh
Expected: True
Actual: False


Stack trace
   at tracing_eventpipe._eventsourceerror_eventsourceerror_eventsourceerror_._eventsourceerror_eventsourceerror_eventsourceerror_sh() in /__w/1/s/artifacts/tests/coreclr/Linux.x64.Checked/TestWrappers/tracing.eventpipe/tracing.eventpipe.XUnitWrapper.cs:line 237
@ghost
Copy link

ghost commented May 6, 2020

Tagging subscribers to this area: @Maoni0
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 6, 2020
@ghost
Copy link

ghost commented May 6, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

@tommcdon
Copy link
Member

tommcdon commented May 6, 2020

@josalem

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label May 6, 2020
@tommcdon tommcdon added this to Needs Triage in .NET Core Diagnostics via automation May 6, 2020
@tommcdon tommcdon added this to the 5.0 milestone May 6, 2020
@josalem
Copy link
Contributor

josalem commented May 6, 2020

I'll take a look!

@josalem
Copy link
Contributor

josalem commented May 6, 2020

CC @davmason

It looks like this test has been failing under gcstress since it went in with #35334. It fails across all configurations except Linux arm32 gcstress=0x3 (at least on the linked run). I'm not sure why it's happening, but my hunch is that is has to do with the lifetime of the delegate in the failure path for the EventSource constructor now that EventSource tries to send messages across EventPipe for that path.

@davmason
Copy link
Member

davmason commented May 6, 2020

I'll take a look

@davmason
Copy link
Member

davmason commented May 7, 2020

So, this turned in to way more of an involved investigation than I thought. This turned out to not be a test issue but exposed a fundamental problem with the way EventSources interact with EventPipe.

An EventSource will call dispose on its underlying providers when it is finalized or Dispose is called, and in turn the EventPipeProvider will call EventUnregister which calls EventPipeInternal::DeleteProvider.

Up until this point everything is ok, but inside of EventPipe::DeleteProvider there is a check for Enabled(), which checks if there are any active sessions, and if so will put the providers in a deferred delete state.

if (pProvider != NULL)
{
if (Enabled())
{
// Save the provider until the end of the tracing session.
pProvider->SetDeleteDeferred();
}
else
{
// Delete the provider now.
s_config.DeleteProvider(pProvider);
}

Which means if any other provider is active when the EventSource tries to delete itself from EventPipe, it will get placed in the deferred list. For example, in this test Microsoft-Windows-DotNETRuntime is also active. This is fundamentally flawed for interacting with managed providers, because as soon as they are disposed/finalized the EventProvider's m_etwCallback will be collected and later when EventPipe goes to delete the deferred providers we will attempt to reverse pinvoke to a collected delegate.

I suspect this has gone undetected for so long because we don't have a ton of testing around EventSources and how they interact with EventPipe, and likely the tests that do test the scenario just happen to be lucky and the delegate doesn't get collected in the timeframe. Even this test passes fine under normal circumstances and it takes GCStress to fail.

@noahfalk @josalem @sywhang - I don't fully understand the need to keep providers around until tracing stops. I see it was added in this PR: dotnet/coreclr#11651, but I don't see anything that even attempts to root the EventSource/EventPipeProvider managed objects. So we either need to root those objects, or if the native provider is all we need then we should not attempt to do the reverse pinvoke during deferred deletion.

@BruceForstall
Copy link
Member

This is still failing in the GCStress runs. I submitted PR #38516 to disable it for GCStress while this bug is open.

@tommcdon tommcdon added the tracking This issue is tracking the completion of other related issues. label Jul 8, 2020
@tommcdon tommcdon modified the milestones: 5.0.0, 6.0.0 Jul 8, 2020
kevinwkt pushed a commit to kevinwkt/runtimelab that referenced this issue Jul 15, 2020
.NET Core Diagnostics automation moved this from Needs Triage to Done Jul 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr GCStress tracking This issue is tracking the completion of other related issues.
Projects
Development

Successfully merging a pull request may close this issue.

7 participants