[Tests][UserEvents] Fix tracee discovery race and improve failure diagnostics #125345
Add console messages to the tracee path to confirm event emission completed, and log the tracee exit code in the orchestrator. Add a comment explaining why the tracee timeout is 60s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the 300ms fixed sleep with a synchronization gate that waits for record-trace to emit 'Recording started' on stdout before launching the tracee. This message is printed after PerfSession::enable succeeds, guaranteeing the ring buffers are active and will capture the tracee's mmap events. The fixed delay was insufficient on ARM64 CI machines where record-trace startup (proc scan + ring buffer setup) took ~1845ms, causing the tracee to start in the dead zone between the proc scan and ring buffer enable. If record-trace exits before emitting the signal, report the exit code rather than waiting for the full timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add --log-filter to record-trace to capture debug-level diagnostics for dotnet process discovery (mmap events, IPC enablement), perf session lifecycle (capture_environment, enable/disable, parse_all), and nettrace event block flushing. Suppress noisy modules (ELF metadata loading, stack unwinding, tracefs) to keep logs focused. On validation failure, dump a summary of all events from the tracee PID found in the nettrace (DumpTraceeEvents) to distinguish between 'event not captured' and 'event captured with unexpected metadata'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR fixes a race condition in the UserEvents tracing tests where the tracee process could start before record-trace had fully initialized its ring buffers, causing events to be missed. The fix replaces a fixed 300ms sleep with event-based synchronization, and adds comprehensive failure diagnostics to aid debugging of any remaining issues.
Changes:
- Replace the fixed Thread.Sleep(300ms) delay with a ManualResetEventSlim that gates tracee startup on record-trace's "Recording started" stdout message, ensuring ring buffers are active before the tracee launches.
- Add --log-filter to capture debug-level record-trace diagnostics and a DumpTraceeEvents method that summarizes all tracee events in the nettrace on validation failure.
- Add tracee-side completion logging ("Tracee finished emitting events.") and tracee exit code reporting for improved failure diagnostics.
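The event summary in the second bullet can be pictured with a minimal sketch. This is not the PR's DumpTraceeEvents implementation: the `CapturedEvent` record and the `TraceeEventSummary.Summarize` helper are hypothetical names, and the sketch assumes the nettrace has already been parsed into (PID, provider, event name) tuples by some other means. It groups the tracee's events and counts them, so that an empty result suggests "event not captured", while an unexpected provider or name suggests "event captured with unexpected metadata".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for an event already parsed out of the nettrace.
public readonly record struct CapturedEvent(int ProcessId, string Provider, string Name);

public static class TraceeEventSummary
{
    // Returns one line per distinct (provider, event name) pair emitted by the
    // tracee PID, with a count, sorted for stable log output.
    public static IReadOnlyList<string> Summarize(IEnumerable<CapturedEvent> events, int traceePid)
    {
        List<string> lines = events
            .Where(e => e.ProcessId == traceePid)
            .GroupBy(e => (e.Provider, e.Name))
            .OrderBy(g => g.Key.Provider, StringComparer.Ordinal)
            .ThenBy(g => g.Key.Name, StringComparer.Ordinal)
            .Select(g => $"{g.Key.Provider}/{g.Key.Name}: {g.Count()}")
            .ToList();

        // An explicit marker makes the "nothing captured" case obvious in CI logs.
        if (lines.Count == 0)
        {
            lines.Add($"no events from PID {traceePid}");
        }

        return lines;
    }
}
```

Printing these lines only on validation failure keeps passing runs quiet while still leaving enough context in the log to tell the two failure modes apart.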
/azp run runtime-coreclr jitstress2-jitstressregs, runtime-coreclr r2r-extra
Azure Pipelines successfully started running 2 pipeline(s).
I don't think we are getting much value from running the userevents tests under jitstress (and gcstress) configurations, so I will disable those configurations. Hopefully only issues related to the runtime emitting userevents will then surface (hopefully none at all 😅).
From the latest jitstress2 leg, it looks like everything is working end-to-end: the tracee doesn't start until record-trace emits "Recording started", record-trace detects the tracee and sends the command, the tracee emits its event, and record-trace drains its ring buffers while writing to the nettrace. The only explanation I can come up with, which CLI also speculates, is that the 2-CPU arm64 machine generates an exorbitant amount of mmap events under JITstress configurations, hence all the
#123442
More diagnostics and better tracee timing in response to the failures described in #123442 (comment)
Wait for "Recording started" instead of fixed delay. The 300ms fixed sleep before launching the tracee was insufficient on ARM64 CI where record-trace's startup (proc scan + ring buffer setup) took ~1845ms. I believe the tracee would start in the dead zone between the proc scan and ring buffer enable, so record-trace never discovered it via mmap events. Replace the sleep with a ManualResetEventSlim that gates on record-trace's "Recording started" stdout message, which is printed after PerfSession::enable succeeds.
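The gating described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the test's actual code: `RecordingStartGate`, `GateResult`, and `RecordTraceLauncher.StartAndWait` are hypothetical names, and the `ProcessStartInfo` (executable path and arguments for record-trace) is left to the caller because the real command line is not shown here. The sketch waits on two events at once, so an early record-trace exit is reported immediately instead of after the full timeout.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public enum GateResult { Started, ExitedEarly, TimedOut }

// Hypothetical helper: opens once record-trace prints "Recording started".
public sealed class RecordingStartGate
{
    private readonly ManualResetEventSlim _started = new(false);
    private readonly ManualResetEventSlim _exited = new(false);

    // Feed every stdout line from record-trace here (null marks end of stream).
    public void OnOutputLine(string? line)
    {
        if (line is not null && line.Contains("Recording started", StringComparison.Ordinal))
        {
            _started.Set();
        }
    }

    public void OnProcessExited() => _exited.Set();

    public GateResult WaitForStart(TimeSpan timeout)
    {
        int index = WaitHandle.WaitAny(
            new WaitHandle[] { _started.WaitHandle, _exited.WaitHandle }, timeout);

        // Prefer the start signal if both events were set by the time we woke up.
        if (_started.IsSet)
        {
            return GateResult.Started;
        }

        return index == WaitHandle.WaitTimeout ? GateResult.TimedOut : GateResult.ExitedEarly;
    }
}

public static class RecordTraceLauncher
{
    // startInfo is supplied by the caller; the real record-trace arguments are not assumed here.
    public static GateResult StartAndWait(ProcessStartInfo startInfo, TimeSpan timeout, out Process process)
    {
        startInfo.RedirectStandardOutput = true;
        startInfo.UseShellExecute = false;

        var gate = new RecordingStartGate();
        process = new Process { StartInfo = startInfo, EnableRaisingEvents = true };
        process.OutputDataReceived += (_, e) => gate.OnOutputLine(e.Data);
        process.Exited += (_, _) => gate.OnProcessExited();

        process.Start();
        process.BeginOutputReadLine();

        GateResult result = gate.WaitForStart(timeout);
        if (result == GateResult.ExitedEarly)
        {
            // Report the exit code right away rather than waiting out the timeout.
            Console.WriteLine($"record-trace exited early with code {process.ExitCode}");
        }

        return result;
    }
}
```

Only when the result is `GateResult.Started` would the caller go on to launch the tracee. Keeping the gate separate from the process wiring means the signal-matching logic can be exercised without starting record-trace at all.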
Improve failure diagnostics