Fix FileSystemWatcher test flakiness on Windows and re-enable 86 disabled tests #125818

Open
danmoseley wants to merge 5 commits into dotnet:main from danmoseley:fix/fsw-test-robustness

@danmoseley danmoseley commented Mar 20, 2026

Categories of failures

The FSW test failures on Windows fall into three categories:

  1. Timeout too short for callback latency — Under thread pool pressure (typical in CI), ReadDirectoryChangesW callbacks arrive late. The 1000ms WaitForExpectedEventTimeout expires before the event is delivered. This is the dominant failure mode: "<EventType> event did not occur as expected".

  2. Stale state between retries — Some tests create files/directories in action() but never clean up. When ExpectEvent retries (re-creating the watcher and re-running the action), the second attempt fails because the file already exists, or the watcher fires on leftover state rather than the fresh action.

  3. ExpectNoEvent using wrong timeout: ExpectNoEvent was hardcoded to use WaitForExpectedEventTimeout (1000ms) instead of WaitForUnexpectedEventTimeout (150ms), making negative tests take several times longer than intended and potentially masking timing issues.

Fixes in this PR (and why each)

No product changes. All changes are in the test utility (FileSystemWatcherTest.cs) and individual test files.

  1. Progressive retry timeouts: ExpectEvent now multiplies WaitForExpectedEventTimeout by the attempt number (1x, 2x, 3x). This directly addresses category 1: if the first attempt's 1000ms is too short under load, the second attempt waits 2000ms and the third waits 3000ms, giving the callback time to arrive without penalizing the common fast case.

  2. Increased SubsequentExpectedWait (10ms to 500ms) — Tests that check multiple event types (e.g., Changed + Created) waited only 10ms for the second event type after catching the first. Under load, the second event arrives later. 500ms provides adequate margin.

  3. Settling delay after enabling watcher (50ms Thread.Sleep) — Added in ExecuteAndVerifyEvents, ExpectEvents, and TryErrorEvent after setting EnableRaisingEvents = true and before running the test action. While Windows registers the watch synchronously, this guards against edge cases on all platforms.

  4. Cleanup in Create event tests: File.Create.cs and Directory.Create.cs tests now delete the created file/directory in a finally block, so retries start from a clean state (category 2).

  5. ExpectNoEvent timeout fix — Changed to use WaitForUnexpectedEventTimeout as intended (category 3).

  6. Re-enabled 86 tests previously disabled by 25 [ActiveIssue] annotations (11 class-level, 14 method-level) referencing #103584 ("Failing test due to no detected IO events in 'System.IO.Tests.FileSystemWatcherTest.ExecuteAndVerifyEvents'") on Windows. Two tests (File_Move_With_Set_Environment_Variable, File_Move_With_Unset_Environment_Variable) are disabled against a new issue for a product-level bug unrelated to timing.
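
The progressive timeout (item 1) and per-attempt cleanup (item 4) can be sketched roughly as follows. This is a minimal illustration rather than the code in FileSystemWatcherTest.cs: `ProgressiveRetry`, `Run`, and all parameter names are hypothetical, and running cleanup in a `finally` is one possible design (it is what the review comments suggest), not necessarily what the PR does.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration only; not part of the PR.
public static class ProgressiveRetry
{
    // Calls `attempt` up to `attempts` times. Attempt n receives a timeout of
    // baseTimeoutMs * n, so the wait grows linearly (1x, 2x, 3x).
    // `cleanup` runs after every attempt, even one that throws, so the next
    // attempt starts from a clean filesystem state.
    public static bool Run(Func<int, bool> attempt, Action? cleanup, int baseTimeoutMs, int attempts = 3)
    {
        for (int n = 1; n <= attempts; n++)
        {
            int effectiveTimeoutMs = baseTimeoutMs * n;
            try
            {
                if (attempt(effectiveTimeoutMs))
                    return true; // Event observed; no further attempts needed.
            }
            finally
            {
                cleanup?.Invoke();
            }
        }

        return false; // Every attempt timed out.
    }
}
```

A caller would pass a delegate that enables a watcher, performs the action, and waits up to the supplied timeout; the fast path still returns as soon as the first attempt succeeds.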

Validation steps

Tested on Intel i9-14900K (24 cores / 32 threads), 64 GB RAM, Windows 11. Tests run on both NTFS and ReFS (dev drive).

Controlled callback delay (proving the fix addresses the root cause):

  • Injected a temporary busy-spin delay into ReadDirectoryChangesCallback to deterministically reproduce the late-event-delivery condition that occurs under thread pool starvation in CI
  • At 900ms delay, baseline (no fixes): 114 event failures across 20 runs (100% failure rate)
  • At 900ms delay, with fixes: 0 event failures across 20 runs

Stress testing with fixes applied (proving stability under realistic CI-like load):

  • Concurrent stress generator running alongside tests: 32 CPU-saturating workers (one per logical processor) doing tight math loops, 8 IO workers (rapid file create/write/delete cycles with 1-64KB random data), and continuous thread pool flooding (~500 short work items queued every 10ms) to simulate thread pool contention
  • Under this load, ran 20 sequential iterations on each filesystem: 0 flaky failures on NTFS, 0 on ReFS
  • 3 parallel test instances running simultaneously under stress: 0 flaky failures
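
A load generator along the lines described above could look something like the sketch below. It is a hypothetical reconstruction, not the generator used for this PR: `StressLoad`, `Start`, and the parameter names are invented here, and the spin count and file naming are arbitrary choices.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stress generator for illustration only; not part of the PR.
public static class StressLoad
{
    // Starts CPU, IO, and thread pool load that runs until `token` is cancelled.
    // cpuWorkers <= 0 means one worker per logical processor.
    public static Task[] Start(string scratchDir, CancellationToken token, int cpuWorkers = 0, int ioWorkers = 8)
    {
        if (cpuWorkers <= 0)
            cpuWorkers = Environment.ProcessorCount;

        var tasks = new List<Task>();

        // CPU-saturating workers: tight math loops on dedicated threads.
        for (int i = 0; i < cpuWorkers; i++)
        {
            tasks.Add(Task.Factory.StartNew(() =>
            {
                double x = 1.0;
                while (!token.IsCancellationRequested)
                    x = Math.Sqrt(x + 1.0);
            }, TaskCreationOptions.LongRunning));
        }

        // IO workers: rapid create/write/delete cycles with 1-64 KB of random data.
        for (int i = 0; i < ioWorkers; i++)
        {
            int id = i; // Capture a per-worker copy of the loop variable.
            tasks.Add(Task.Factory.StartNew(() =>
            {
                var random = new Random(id);
                string path = Path.Combine(scratchDir, "stress_" + id + ".tmp");
                while (!token.IsCancellationRequested)
                {
                    byte[] data = new byte[random.Next(1024, 64 * 1024 + 1)];
                    random.NextBytes(data);
                    File.WriteAllBytes(path, data);
                    File.Delete(path);
                }
            }, TaskCreationOptions.LongRunning));
        }

        // Thread pool flooding: about 500 short work items queued every 10 ms.
        tasks.Add(Task.Factory.StartNew(() =>
        {
            while (!token.IsCancellationRequested)
            {
                for (int i = 0; i < 500; i++)
                    ThreadPool.QueueUserWorkItem(_ => Thread.SpinWait(1000));
                Thread.Sleep(10);
            }
        }, TaskCreationOptions.LongRunning));

        return tasks.ToArray();
    }
}
```

Cancelling the token and then waiting on the returned tasks stops the load. Pointing `scratchDir` at a directory separate from the one under test keeps the generated files out of the watched tree.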

Fixes #103630
Fixes #103584

danmoseley and others added 5 commits March 19, 2026 23:05
- Increase SubsequentExpectedWait from 10ms to 500ms so subsequent
  event-type checks in ExecuteAndVerifyEvents tolerate delayed delivery
  under thread pool starvation.
- Add 50ms settling delay after EnableRaisingEvents=true in
  ExecuteAndVerifyEvents, ExpectEvents, and TryErrorEvent to allow
  OS-specific async startup to complete before the test action runs.
- Implement progressive timeout on retries in ExpectEvent and
  TryErrorEvent: attempt 1 uses the base timeout (1000ms), attempt 2
  uses 2x, attempt 3 uses 3x. This is the key fix: under thread pool
  starvation the ReadDirectoryChangesW callback can be delayed beyond
  the base timeout, but a fixed retry with the same timeout just fails
  again. Progressive timeout gives later attempts enough headroom.

Validated with artificial callback delay injection at 900ms:
- Baseline (no fixes): 114 event failures across 20 runs (100% failure rate)
- With fixes: 0 failures across 20 runs on both NTFS and ReFS

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MultipleFilters and ModifyFiltersConcurrentWithEvents Create tests
pass cleanup: null to ExpectEvent, but the action (File.Create /
Directory.CreateDirectory) is not idempotent: on retry the file/dir
already exists, so no Created event fires and the retry always fails.

The corresponding Delete tests already have proper cleanup (re-creating
the deleted item between retries). Apply the same pattern in reverse:
delete the created item between retries so the next attempt gets a
fresh Created event.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- WaitForExpectedEventTimeout: 1000ms -> 2000ms. With progressive retry
  this gives 2000/4000/6000ms across 3 attempts, enough headroom for
  thread pool starvation even on single-core CI machines.
- WaitForExpectedEventTimeout_NoRetry: 3000ms -> 5000ms. Tests using the
  simple ExpectEvent(WaitHandle, string) overload have no retry loop, so
  a generous single timeout is needed.
- ExpectEvents() collection timeout: 5s -> 10s. This method has no retry
  mechanism and is used by Directory/File Move multi-event tests.

These increases only affect the failure path (WaitOne returns immediately
when the event arrives). Happy-path execution time is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove [ActiveIssue] annotations for dotnet#103584 from 14 test
files (22 annotations total, including class-level disabling of entire
test classes like Directory_Create_Tests, File_Create_Tests, etc.).

Also remove [ActiveIssue] for dotnet#53366 on
FileSystemWatcher_DirectorySymbolicLink_TargetsFile_Fails — the issue
has had zero hits since 2021 and the test is Windows-only.

The robustness improvements in the preceding commits (progressive retry
timeouts, increased base timeouts, retry idempotency cleanup) should
prevent the flakiness that originally motivated disabling these tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ExpectNoEvent defaulted to WaitForExpectedEventTimeout (2000ms) but
negative tests should use the shorter WaitForUnexpectedEventTimeout
(150ms). The codebase already defines this constant for exactly this
purpose but it was never wired up. Using the long timeout slows down
every test that verifies an event does NOT occur.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 05:06
@danmoseley danmoseley requested a review from adamsitnik March 20, 2026 05:07
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR targets Windows flakiness in System.IO.FileSystem.Watcher tests by increasing tolerance to callback latency and retry-related stale state, and then re-enables a large set of tests previously disabled on Windows.

Changes:

  • Adjusts FileSystemWatcherTest timeouts/retry behavior (including progressive timeouts) and fixes ExpectNoEvent default timeout usage.
  • Adds a small post-EnableRaisingEvents settling delay to reduce timing sensitivity across platforms.
  • Re-enables many tests previously disabled via [ActiveIssue], and adds cleanup in some create-related tests to keep retries idempotent.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs | Updates core test helper timeouts/retry logic and adds settling delays. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.unit.cs | Removes Windows [ActiveIssue] suppressions and adds cleanup for create-related retry safety. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.cs | Re-enables a previously disabled init/resume test on Windows. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.SymbolicLink.cs | Re-maps Windows ActiveIssue annotations for specific symlink tests. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.InternalBufferSize.cs | Removes Windows [ActiveIssue] suppression for internal buffer size tests. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.NotifyFilter.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Move.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Delete.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Create.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.File.Changed.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.NotifyFilter.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Move.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Delete.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Create.cs | Removes Windows [ActiveIssue] suppression at the class level. |
| src/libraries/System.IO.FileSystem.Watcher/tests/FileSystemWatcher.Directory.Changed.cs | Removes Windows [ActiveIssue] suppression at the class level. |

Comments suppressed due to low confidence (1)

src/libraries/System.IO.FileSystem.Watcher/tests/Utility/FileSystemWatcherTest.cs:286

  • cleanup() is executed after the assertion. If the assertion fails (i.e., an unexpected event did occur), cleanup won’t run, which can leave state behind for the outer retry wrapper and lead to follow-on failures. Run cleanup in a finally so it executes regardless of assertion outcome.
            bool result = ExecuteAndVerifyEvents(watcher, unExpectedEvents, action, false, expectedPath == null ? null : new string[] { expectedPath }, timeout);
            Assert.False(result, "Expected Event occurred");

            if (cleanup != null)
                cleanup();

Comment on lines +199 to +200
+ // for fast failure in normal conditions. Subsequent attempts double the timeout to tolerate
+ // transient delays (thread pool starvation, slow CI machines, etc.).

Copilot AI Mar 20, 2026

The comment says subsequent retries "double" the timeout, but effectiveTimeout = timeout * attemptsCompleted scales linearly (1x, 2x, 3x). Update the comment to match the actual behavior (or change the implementation if exponential backoff is intended).

Suggested change
- // for fast failure in normal conditions. Subsequent attempts double the timeout to tolerate
- // transient delays (thread pool starvation, slow CI machines, etc.).
+ // for fast failure in normal conditions. Subsequent attempts increase the timeout linearly
+ // with the attempt count to tolerate transient delays (thread pool starvation, slow CI machines, etc.).

Copilot uses AI. Check for mistakes.
Comment on lines 195 to +199
      Thread.Sleep(RetryDelayMilliseconds);
  }

- result = ExecuteAndVerifyEvents(newWatcher, expectedEvents, action, attemptsCompleted == attempts, expectedPaths, timeout);
+ // Use progressively longer timeouts on retries. The first attempt uses the base timeout
+ // for fast failure in normal conditions. Subsequent attempts double the timeout to tolerate

Copilot AI Mar 20, 2026

This retry path recreates FileSystemWatcher instances, but the helper doesn’t dispose watchers created during retries (and it doesn’t return the recreated watcher to the caller). In flaky runs this can leak OS resources/handles and make subsequent retries/tests less reliable. Consider disposing any temporary recreated watchers (and/or redesigning the helper to surface the recreated watcher to the caller).

Comment on lines +203 to 206
+ result = ExecuteAndVerifyEvents(newWatcher, expectedEvents, action, attemptsCompleted == attempts, expectedPaths, effectiveTimeout);

  if (cleanup != null)
      cleanup();

Copilot AI Mar 20, 2026

cleanup() is invoked after ExecuteAndVerifyEvents, but if ExecuteAndVerifyEvents throws (e.g., on the final attempt where it asserts), cleanup won’t run. To keep retries reliable and avoid leaving stale filesystem state behind, run cleanup in a finally around the attempt execution.

Suggested change
- result = ExecuteAndVerifyEvents(newWatcher, expectedEvents, action, attemptsCompleted == attempts, expectedPaths, effectiveTimeout);
- if (cleanup != null)
-     cleanup();
+ try
+ {
+     result = ExecuteAndVerifyEvents(newWatcher, expectedEvents, action, attemptsCompleted == attempts, expectedPaths, effectiveTimeout);
+ }
+ finally
+ {
+     cleanup?.Invoke();
+ }

Comment on lines +24 to 28
+ public const int WaitForExpectedEventTimeout = 2000; // ms to wait for an event to happen
  public const int LongWaitTimeout = 50000; // ms to wait for an event that takes a longer time than the average operation
- public const int SubsequentExpectedWait = 10; // ms to wait for checks that occur after the first.
- public const int WaitForExpectedEventTimeout_NoRetry = 3000;// ms to wait for an event that isn't surrounded by a retry.
+ public const int SubsequentExpectedWait = 500; // ms to wait for checks that occur after the first.
+ public const int WaitForExpectedEventTimeout_NoRetry = 5000;// ms to wait for an event that isn't surrounded by a retry.
  public const int WaitForUnexpectedEventTimeout = 150; // ms to wait for a non-expected event.

Copilot AI Mar 20, 2026

The default timeout here was changed to WaitForUnexpectedEventTimeout, but that constant is currently 150ms. Given the motivation of tolerating delayed callback delivery, 150ms may be too short to reliably detect “unexpected” events that arrive late under load; consider increasing the default/constant (or document why 150ms is sufficient).

Comment on lines 311 to 320
  renamed = WatchRenamed(watcher, expectedPaths);

  watcher.EnableRaisingEvents = true;

+ // Allow the OS-specific watcher implementation to finish async startup
+ // (e.g., ReadDirectoryChangesW registration on Windows, inotify thread on Linux).
+ Thread.Sleep(50);

  action();


Copilot AI Mar 20, 2026

ExecuteAndVerifyEvents sets up handlers and enables EnableRaisingEvents, but there’s no try/finally around action() and the subsequent waits. If action() throws, handlers won’t be detached and EnableRaisingEvents won’t be reset, which can leak state into retries/other tests. Wrap the body in try/finally to always detach handlers and restore EnableRaisingEvents.

Comment on lines +443 to 449
+ // Allow the OS-specific watcher implementation to finish async startup.
+ Thread.Sleep(50);

  action();
- result = errorOccurred.WaitOne(WaitForExpectedEventTimeout);
+ result = errorOccurred.WaitOne(WaitForExpectedEventTimeout * attemptsCompleted);
  watcher.EnableRaisingEvents = false;
  cleanup();

Copilot AI Mar 20, 2026

In TryErrorEvent, the attempt adds an Error handler each time but never removes it, and the recreated watchers from retries aren’t disposed. Over multiple retries this can accumulate handlers and leak watcher resources. Consider detaching the handler and disposing any watcher instances created during retries (and do this in a finally so it also runs when action() throws).
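
The pattern these review comments ask for (detach the handler, reset EnableRaisingEvents, and dispose the per-attempt watcher even when the action throws) might be sketched as below. This is an illustrative example, not a proposed patch: `WatcherAttempt` and `WaitForCreated` are hypothetical names, and it watches for Created events instead of the Error event used by TryErrorEvent.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration only; not part of the PR.
public static class WatcherAttempt
{
    // Runs one attempt with its own watcher and returns true if a Created
    // event arrives within timeoutMs. The watcher is disposed and the handler
    // detached on every exit path, including when `action` throws.
    public static bool WaitForCreated(string directory, Action action, int timeoutMs)
    {
        // Deliberately not disposed: a callback that is still in flight may
        // signal it after this method returns.
        var created = new ManualResetEventSlim(false);
        FileSystemEventHandler handler = (sender, e) => created.Set();

        using (var watcher = new FileSystemWatcher(directory))
        {
            watcher.Created += handler;
            try
            {
                watcher.EnableRaisingEvents = true;
                action();
                return created.Wait(timeoutMs);
            }
            finally
            {
                watcher.EnableRaisingEvents = false;
                watcher.Created -= handler;
            }
        }
    }
}
```

Because each attempt owns its watcher, a retry loop built on this helper would not accumulate handlers or leave OS handles open between attempts.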
