Skip to content

[Tests][UserEvents] Fix tracee discovery race and improve failure diagnostics#125345

Merged
mdh1418 merged 5 commits intodotnet:mainfrom
mdh1418:userevents/improve-test-diagnostics
Mar 10, 2026
Merged

[Tests][UserEvents] Fix tracee discovery race and improve failure diagnostics#125345
mdh1418 merged 5 commits intodotnet:mainfrom
mdh1418:userevents/improve-test-diagnostics

Conversation

@mdh1418
Copy link
Copy Markdown
Member

@mdh1418 mdh1418 commented Mar 9, 2026

#123442

More diagnostics and better tracee timing in response to the failures described in #123442 (comment)

Wait for "Recording started" instead of fixed delay. The 300ms fixed sleep before launching the tracee was insufficient on ARM64 CI where record-trace's startup (proc scan + ring buffer setup) took ~1845ms. I believe the tracee would start in the dead zone between the proc scan and ring buffer enable, so record-trace never discovered it via mmap events. Replace the sleep with a ManualResetEventSlim that gates on record-trace's "Recording started" stdout message, which is printed after PerfSession::enable succeeds.

Improve failure diagnostics

  • Add --log-filter to capture debug-level record-trace diagnostics (mmap discovery, IPC enablement, nettrace event flushing) while suppressing noisy modules. The log is written to a file and uploaded to Helix only on failure.
  • Add DumpTraceeEvents to log a summary of all events from the tracee PID in the nettrace on validation failure, to distinguish "event not captured" from "event captured with unexpected metadata."
  • Add tracee-side console messages confirming event emission completed, and log the tracee exit code.

mdh1418 and others added 3 commits March 9, 2026 19:21
Add console messages to the tracee path to confirm event emission
completed, and log the tracee exit code in the orchestrator. Add a
comment explaining why the tracee timeout is 60s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the 300ms fixed sleep with a synchronization gate that waits for
record-trace to emit 'Recording started' on stdout before launching the
tracee. This message is printed after PerfSession::enable succeeds,
guaranteeing the ring buffers are active and will capture the tracee's
mmap events.

The fixed delay was insufficient on ARM64 CI machines where record-trace
startup (proc scan + ring buffer setup) took ~1845ms, causing the tracee
to start in the dead zone between the proc scan and ring buffer enable.

If record-trace exits before emitting the signal, report the exit code
rather than waiting for the full timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add --log-filter to record-trace to capture debug-level diagnostics for
dotnet process discovery (mmap events, IPC enablement), perf session
lifecycle (capture_environment, enable/disable, parse_all), and nettrace
event block flushing. Suppress noisy modules (ELF metadata loading,
stack unwinding, tracefs) to keep logs focused.

On validation failure, dump a summary of all events from the tracee PID
found in the nettrace (DumpTraceeEvents) to distinguish between 'event
not captured' and 'event captured with unexpected metadata'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mdh1418 mdh1418 requested review from jkotas and noahfalk March 9, 2026 20:19
@mdh1418 mdh1418 self-assigned this Mar 9, 2026
Copilot AI review requested due to automatic review settings March 9, 2026 20:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in the UserEvents tracing tests where the tracee process could start before record-trace had fully initialized its ring buffers, causing events to be missed. The fix replaces a fixed 300ms sleep with event-based synchronization, and adds comprehensive failure diagnostics to aid debugging of any remaining issues.

Changes:

  • Replace the fixed Thread.Sleep(300ms) delay with a ManualResetEventSlim that gates tracee startup on record-trace's "Recording started" stdout message, ensuring ring buffers are active before the tracee launches.
  • Add --log-filter to capture debug-level record-trace diagnostics and a DumpTraceeEvents method that summarizes all tracee events in the nettrace on validation failure.
  • Add tracee-side completion logging ("Tracee finished emitting events.") and tracee exit code reporting for improved failure diagnostics.

@mdh1418
Copy link
Copy Markdown
Member Author

mdh1418 commented Mar 9, 2026

/azp run runtime-coreclr jitstress2-jitstressregs, runtime-coreclr r2r-extra

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 2 pipeline(s).

@mdh1418
Copy link
Copy Markdown
Member Author

mdh1418 commented Mar 9, 2026

I don't think we are getting much value running the userevents tests under jitstress (and gcstress) configurations. So will disable those configurations, and hopefully only issues related with the runtime emitting userevents will surface (hopefully none at all 😅).

@mdh1418
Copy link
Copy Markdown
Member Author

mdh1418 commented Mar 10, 2026

From the latest jitstress2 leg, it looks like everything is working end-to-end, the tracee doesn't start until record-trace emits "Recording started", record trace detects the tracee, sends the command, the tracee emits its event, and record-trace drains its ring buffers while writing to the nettrace.

The only explanation I can come up with that CLI also speculates is that the 2CPU arm64 machine generates an exorbitant amount of mmap events under JITstress configurations, hence all the Found dotnet process 4293 via mapping /memfd:doublemapper events in the record-trace log, and that floods the ring buffers, making it possible that the ring buffer overflow drops the single custommetadata event.

Copilot AI review requested due to automatic review settings March 10, 2026 02:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread src/tests/tracing/userevents/common/UserEventsTestRunner.cs
@mdh1418 mdh1418 merged commit 59dc3d0 into dotnet:main Mar 10, 2026
86 checks passed
@mdh1418 mdh1418 deleted the userevents/improve-test-diagnostics branch March 10, 2026 15:01
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 10, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants