Skip to content

Add watchdog instrumentation to OleTxTests.Recovery to capture hang diagnostics#126666

Merged
danmoseley merged 2 commits intomainfrom
fix/disable-oletx-recovery-stress
Apr 9, 2026
Merged

Add watchdog instrumentation to OleTxTests.Recovery to capture hang diagnostics#126666
danmoseley merged 2 commits intomainfrom
fix/disable-oletx-recovery-stress

Conversation

@danmoseley
Copy link
Copy Markdown
Member

@danmoseley danmoseley commented Apr 8, 2026

OleTxTests.Recovery intermittently hangs (20+ min) waiting on MSDTC, killing CI work items. Previous approach skipped the test under stress modes, but the hang also occurs on regular CoreCLR runs.

Changes

var testCompleted = new ManualResetEventSlim(false);
var watchdog = new Thread(() =>
{
    if (!testCompleted.Wait(TimeSpan.FromMinutes(5)))
        Environment.FailFast("OleTxTests.Recovery did not complete within 5 minutes. See https://github.com/dotnet/runtime/issues/126304");
});
watchdog.IsBackground = true;
watchdog.Start();

try { Test(() => { /* ... */ }); }
finally { testCompleted.Set(); }

The watchdog thread is a background thread and the completion signal is in a finally block, so it does not interfere with normal test execution.

The Recovery test consistently times out (20+ min) under stress modes
(fullpgo, jitstress2_jitstressregs) — confirmed from Helix logs across
multiple hits. PR #125813 added a 120s child-process timeout but the
main thread still hangs waiting for MSDTC under slow runtimes.

This is a libraries-level test exercising MSDTC/OLE transaction recovery;
running it under JIT stress provides no additional signal. Skip it on
all non-regular CoreCLR test modes.

Fixes #126304

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 22:06
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 8, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent System.Transactions.Local CI work item timeouts by skipping the long-running OleTxTests.Recovery test when running under non-regular CoreCLR test modes (e.g., JIT stress configurations).

Changes:

  • Add a SkipOnCoreClr attribute to OleTxTests.Recovery intended to exclude non-regular CoreCLR test modes.
Show a summary per file
File Description
src/libraries/System.Transactions.Local/tests/OleTxTests.cs Adds CoreCLR-mode-based skip metadata to avoid stress-mode timeouts in Recovery()

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Note

This review was generated by Copilot.

🤖 Copilot Code Review — PR #126666

Holistic Assessment

Motivation: The Recovery test consistently times out (20+ min) under JIT stress modes (fullpgo, jitstress2_jitstressregs) due to MSDTC operations being extremely slow under stress runtimes. This is a libraries-level test exercising MSDTC distributed transaction recovery — running it under JIT stress provides no additional signal. The motivation is clear and justified.

Approach: Adding [SkipOnCoreClr("...", ~RuntimeTestModes.RegularRun)] is the established pattern for disabling tests under all non-regular CoreCLR stress modes. This matches the exact pattern used in System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsManifestGeneration.Etw.cs for similar timeout-sensitive tests. The approach is correct and minimal.

Summary: ✅ LGTM. Single-line, well-targeted change that follows established conventions for skipping timeout-sensitive infrastructure tests under stress modes. The attribute semantics are correct, the namespace resolves via the existing using Xunit; import, and no new public API surface is introduced.


Detailed Findings

✅ Correctness — Attribute semantics are correct

~RuntimeTestModes.RegularRun is the bitwise complement of RegularRun (= 1), which matches all other flags (JitStress, JitStressRegs, JitMinOpts, TailcallStress, DisableR2R, GCStress3, etc.) — exactly the modes where this test times out. The test will continue to run in RegularRun mode, preserving normal CI coverage. The SkipOnCoreClrAttribute and RuntimeTestModes enum are both in the Xunit namespace, which is already imported at line 7 of the file.

✅ Convention consistency — Matches established pattern

This pattern is identical to the one used in System.Diagnostics.Tracing tests (TestsManifestGeneration.Etw.cs:26, 67) and is documented in docs/workflow/testing/libraries/filtering-tests.md. The skip reason message is descriptive and explains both the symptom and rationale.

✅ No new public API surface

The change only modifies a test file. No ref/ assembly changes, no new public members.

💡 Cross-cutting observation — Promotion tests have similar timeout-sensitive patterns

The Promotion() and Promotion_twice() tests (lines 110–269) also use RemoteExecutor with multiple WaitOne(Timeout) calls and MSDTC operations. Unlike Recovery(), they don't have a RemoteInvokeOptions { TimeOut = 120_000 } guard on the child process, nor a SkipOnCoreClr attribute. If stress modes cause similar slowdowns there, these could also time out — though the Recovery test's crash-and-reenlist flow is uniquely heavy. Worth monitoring, but not blocking for this PR.

Generated by Code Review for issue #126666 ·

@jkotas
Copy link
Copy Markdown
Member

jkotas commented Apr 9, 2026

@copilot instead of disabling this test, add an instrumentation that will allow us to get insight into why this test hangs. This instrumentation should start a new background thread at the start of the test. The thread should wait for 5 minutes and then check whether the test completed. If the test did not complete in 5 minutes, crash the process using Environment.FailFast that's going to generate a crash dump that can be investigated. Mentioned #126304 in the fail fast message.

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 9, 2026

@copilot instead of disabling this test, add an instrumentation that will allow us to get insight into why this test hangs. This instrumentation should start a new background thread at the start of th...

Done in df4852c. I removed the [SkipOnCoreClr] attribute and instead added a background watchdog thread at the start of Recovery(). The thread waits 5 minutes and, if the test hasn't completed by then, calls Environment.FailFast with a message referencing #126304. The testCompleted event is set in a finally block wrapping the Test(...) call so it's always signaled when the test finishes (pass or fail).

Copilot AI changed the title Disable OleTxTests.Recovery under JIT stress configurations Add watchdog instrumentation to OleTxTests.Recovery to capture hang diagnostics Apr 9, 2026
Copilot AI requested a review from jkotas April 9, 2026 05:56
@jkotas
Copy link
Copy Markdown
Member

jkotas commented Apr 9, 2026

Just a guess: This can have the same root cause as #105124 . The problem is that System.Transactions.Local has a global timeout setting that is a problem on itself. Moreover, the global timeout setting has buggy implementation with race conditions (reading the timeout value can reset it too in some situations). We may be seeing bad interactions between different tests reading and writing the global timeout on multiple threads in parallel.

@danmoseley danmoseley enabled auto-merge (squash) April 9, 2026 06:28
@danmoseley
Copy link
Copy Markdown
Member Author

Aha! Didn't see that one. Let's see what evidence we gather

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants