Add watchdog instrumentation to OleTxTests.Recovery to capture hang diagnostics#126666
danmoseley merged 2 commits into main
The Recovery test consistently times out (20+ min) under stress modes (fullpgo, jitstress2_jitstressregs), as confirmed from Helix logs across multiple hits. PR #125813 added a 120s child-process timeout, but the main thread still hangs waiting for MSDTC under slow runtimes. This is a libraries-level test exercising MSDTC/OLE transaction recovery; running it under JIT stress provides no additional signal. Skip it on all non-regular CoreCLR test modes.
Fixes #126304
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR aims to prevent System.Transactions.Local CI work item timeouts by skipping the long-running OleTxTests.Recovery test when running under non-regular CoreCLR test modes (e.g., JIT stress configurations).
Changes:
- Add a SkipOnCoreClr attribute to OleTxTests.Recovery, intended to exclude non-regular CoreCLR test modes.
Show a summary per file
| File | Description |
|---|---|
| src/libraries/System.Transactions.Local/tests/OleTxTests.cs | Adds CoreCLR-mode-based skip metadata to avoid stress-mode timeouts in Recovery() |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 1
Note: This review was generated by Copilot.
🤖 Copilot Code Review: PR #126666
Holistic Assessment
Motivation: The [...]
Approach: Adding [...]
Summary: ✅ LGTM. Single-line, well-targeted change that follows established conventions for skipping timeout-sensitive infrastructure tests under stress modes. The attribute semantics are correct, the namespace resolves via the existing [...]
Detailed Findings
✅ Correctness: Attribute semantics are correct
✅ Convention consistency: Matches established pattern
This pattern is identical to the one used in [...]
✅ No new public API surface
The change only modifies a test file. No [...]
💡 Cross-cutting observation: [...]
@copilot instead of disabling this test, add instrumentation that will allow us to get insight into why this test hangs. This instrumentation should start a new background thread at the start of the test. The thread should wait for 5 minutes and then check whether the test completed. If the test did not complete in 5 minutes, crash the process using Environment.FailFast, which will generate a crash dump that can be investigated. Mention #126304 in the fail-fast message.
…hang Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/57e057a0-4388-4473-8c26-c58d16566062 Co-authored-by: jkotas <6668460+jkotas@users.noreply.github.com>
Done in df4852c. I removed the [...]
Just a guess: This can have the same root cause as #105124. The problem is that System.Transactions.Local has a global timeout setting, which is a problem in itself. Moreover, the global timeout setting has a buggy implementation with race conditions (reading the timeout value can also reset it in some situations). We may be seeing bad interactions between different tests reading and writing the global timeout on multiple threads in parallel.
Aha! Didn't see that one. Let's see what evidence we gather |
OleTxTests.Recovery intermittently hangs (20+ min) waiting on MSDTC, killing CI work items. The previous approach skipped the test under stress modes, but the hang also occurs on regular CoreCLR runs.
Changes
- Removed the [SkipOnCoreClr] attribute: skipping doesn't address the root cause and misses failures on regular runs
- Added a watchdog thread that calls Environment.FailFast after 5 minutes if the test hasn't completed, generating a crash dump for offline investigation of "System.Transactions.Tests.OleTxTests.Recovery fails with timeout" #126304
The watchdog thread is a background thread and the completion signal is in a finally block, so it does not interfere with normal test execution.
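For readers who want a concrete picture of the watchdog pattern described in this PR, here is a minimal sketch. It is not the code that was merged: the HangWatchdog class, its Run and RunWithFailFast methods, the thread name, and the exact FailFast message are hypothetical names chosen for illustration. Only Thread, ManualResetEventSlim, and Environment.FailFast are real .NET APIs.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration only; not the code merged in this PR.
public static class HangWatchdog
{
    // Runs testBody while a background thread waits for a completion signal.
    // If the signal is not set within the timeout, onTimeout is invoked.
    public static void Run(Action testBody, TimeSpan timeout, Action onTimeout)
    {
        using var completed = new ManualResetEventSlim(initialState: false);

        var watchdog = new Thread(() =>
        {
            // Wait returns false when the timeout elapses before Set() is called.
            if (!completed.Wait(timeout))
            {
                onTimeout();
            }
        })
        {
            IsBackground = true, // a background thread never keeps the process alive
            Name = "Recovery hang watchdog"
        };
        watchdog.Start();

        try
        {
            testBody();
        }
        finally
        {
            // Signal completion whether the body succeeded or threw,
            // then wait for the watchdog before the event is disposed.
            completed.Set();
            watchdog.Join();
        }
    }

    // Variant matching the PR description: crash the process after 5 minutes
    // so that a dump is produced. The message text is an assumption.
    public static void RunWithFailFast(Action testBody) =>
        Run(testBody, TimeSpan.FromMinutes(5), () =>
            Environment.FailFast("OleTxTests.Recovery did not complete within 5 minutes; see #126304"));
}
```

Passing the timeout action as a parameter keeps the sketch testable without crashing the process; in a real test only the RunWithFailFast shape would be needed.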