Stabilize 3 ToolTask_Tests flakes with diagnostics + timing fix #13830
Tracks three closely-related flakes from the past 7 days of CI:
* ToolTaskCanChangeCanonicalErrorFormat (23 failing builds, 17 branches)
Add diagnostic _output.WriteLine of Execute() result, ExitCode, and the
full engine.Log so that future failures expose the captured tool output
instead of only the cmd-echo currently shown in xUnit error messages.
* HandleExecutionErrorsWhenToolLogsError (7 failing builds, 6 branches)
Same diagnostic dump plus a Shouldly customMessage on ShouldBeFalse so
we can see why ToolTask sometimes reports success when the tool logs
a canonical error.
* ToolTaskThatTimeoutAndRetry (7 failing builds, 5 branches)
Quadruple the slow-delay (5s -> 20s) and raise the timeout (2s -> 5s) so the
timing gap is wide enough to survive process startup overhead on slow
CI agents. Also include attempt info in the assertion failure message.
These tests share a common 'log race / process timing' failure mode that
is hard to reproduce locally, so the priority is making the next failure
actionable.
Fixes #38
Fixes #39
Fixes #40
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses expert-reviewer feedback on PR #46:
- All 3 tests now log elapsedMs alongside the existing diagnostics so a follow-up PR can tune the bumped budgets in ToolTaskThatTimeoutAndRetry (slowDelay=20s, timeout=5s) back to tighter values once we see actuals.
- HandleExecutionErrorsWhenToolLogsError: dropped the engine.Log dump (MockEngine is wired to _output already, so log lines stream live; the extra dump just duplicated output in the trx).
- All three tests now also log Errors/Warnings/Messages counts, which helps distinguish the async-pipe-drain truncation hypothesis from a canonical-error parser bug.

TODO markers point at issues #38/#39/#40 so the diagnostics get cleaned up once root cause is fixed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates ToolTask_Tests to reduce CI flakiness by adding targeted diagnostics to two flaky tests and by widening the timing gap in the timeout/retry test so slow agents don’t accidentally miss the intended timeout path.
Changes:
- Add per-test diagnostics (Execute result, ExitCode, error/warning/message counters, elapsed time) to make future failures actionable.
- Dump `engine.Log` for `ToolTaskCanChangeCanonicalErrorFormat` to capture truncated/missing output scenarios.
- Increase the delay/timeout budgets in `ToolTaskThatTimeoutAndRetry` and log per-attempt elapsed vs configured timeout.
Review Summary
This is a well-structured diagnostic-instrumentation PR that follows MSBuild test conventions correctly. The approach of adding Stopwatch timing + _output.WriteLine diagnostics to capture flaky test state is sound and non-invasive.
✅ Looks Good
- Test semantics preserved: All original assertions are maintained; new custom messages in `ShouldBe`/`ShouldBeFalse` add context without changing pass/fail behavior.
- Test infrastructure patterns: Correctly uses `ITestOutputHelper` (`_output`), `Shouldly` assertions with custom messages, and `TestEnvironment`.
- `System.Diagnostics.Stopwatch`: Already imported, so there is no missing using.
- `engine.Messages`: Confirmed to be an `int` count on `MockEngine3`, so the interpolated string is correct.
- `ToolTaskCanChangeCanonicalErrorFormat`: Not asserting on `executeResult` is intentional and matches original behavior.
⚠️ Minor Concerns (non-blocking)
- No tracking issue linked: The TODO comments should reference a GitHub issue so the diagnostics don't become permanent.
- Test duration increase: The timeout bump (2s→5s) adds ~3s per timeout-path Theory case. Acceptable short-term but should be tightened once data is collected (as the comments state).
- Line 1 whitespace: Trivial cosmetic change; consider reverting to keep `git blame` clean.
Overall this is a pragmatic approach to debugging a known flaky test. No blocking issues.
Note
🔒 Integrity filter blocked 1 item
The following item was blocked because it doesn't meet the GitHub integrity level.
- #13830
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
  github:
    min-integrity: approved # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13830 · ● 847.9K
c7234f1 to
f50fecc
Thanks for the review. Triage: Addressed (force-pushed):
Not addressed (left as-is):
This PR is test-only ( |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f50fecc to
a32ccbc
/azp run msbuild-pr
Azure Pipelines successfully started running 1 pipeline(s).
24h post-merge telemetry from
ToolTaskCanChangeCanonicalErrorFormat
HandleExecutionErrorsWhenToolLogsError
MessageCount==0 with non-empty exit (the race the PR predicted)
Not observed. Across all instrumented runs (37 instances for the first test, 19 for the second), every passing run had
ToolTaskThatTimeoutAndRetry
Fast path (target: succeed within Timeout)
Slow path (target: be killed by Timeout — elapsed ≈ configured Timeout)
Follow-up draft #13846 shrinks
…13864)

Deterministic rewrite for the `ToolTaskThatTimeoutAndRetry` flake tracked in #13667.

## Background

After the bump in #13830 (slowDelay 20s, timeout 5s) the test still flaked: a 4-day telemetry harvest of dnceng-public PR CI surfaced 2 failures on a no-code Arcade dependency-bump PR (#13863), proving the bumped budget is the floor, not a comfortable ceiling.

## Root cause

The test set `task.Timeout` **once before the loop** and shared it across every attempt. On the `(3, true)` case the follow-up attempts run `ping -n 2` (~1-2s) against the shared 5s budget, leaving only 3-4s of headroom for CI cold-start overhead.

The success-path attempts have **no semantic reason** to be wall-clock-bounded -- the test asserts they succeed regardless of how long the underlying ping/sleep takes. They were only coupled to the timing-out attempt because both used the same `Timeout` field.

## Fix

1. **Set `task.Timeout` per attempt.** Only the attempt expected to time out gets a finite 2s budget; every other attempt uses `Timeout.Infinite`. Removes all wall-clock dependency from the success path.
2. **Tighten `slowDelay` from 20s to 10s.** With Timeout=2s, ping ~10s vs timeout 2s gives 5x headroom -- the tool cannot finish before the timeout fires on any agent. Also halves test wall clock (~4-6s vs ~7-9s).

## Telemetry for post-merge health checks

The test emits a stable-prefix line per attempt so future flake-trend analysis can grep it out of test stdout attachments:

    [TTTAR-TELEMETRY] attempt=1/3 role=timeoutAttempt expectedSuccess=False actualSuccess=False exitCode=-1 elapsedMs=2034 configuredTimeoutMs=2000 slowDelayMs=10000 fastDelayMs=100
    [TTTAR-TELEMETRY] attempt=2/3 role=successAttempt expectedSuccess=True actualSuccess=True exitCode=0 elapsedMs=1521 configuredTimeoutMs=-1 slowDelayMs=10000 fastDelayMs=100

**Health-check recipe** (after the change has soaked for a few days):

1. `[TTTAR-TELEMETRY] ... actualSuccess=False expectedSuccess=True` anywhere in passing-build stdout = flake re-emerged on the success path; investigate immediately.
2. For `role=timeoutAttempt`, `elapsedMs` should be in roughly `[1900, 3000]` ms (timeout fires plus process-kill overhead). A consistent drift to the high end signals process-termination regressions.
3. For `role=successAttempt`, `elapsedMs` is uncapped by design; collect the distribution to know what budget margin we have if we ever want to re-introduce a Timeout on success.

## Verification

5 consecutive local runs × 6 inline cases = **30/30 passing** on `net10.0|x64` and `net48|x86`.

Opening as draft -- happy to iterate on shape.
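The health-check recipe above can be automated against harvested test stdout. The following is a minimal Python sketch, not part of the PR: the function names `parse_telemetry` and `check_attempt` are hypothetical, and the only things taken from the commit message are the `[TTTAR-TELEMETRY]` prefix, the `key=value` field layout, and the `[1900, 3000]` ms window.

```python
import re

TELEMETRY_PREFIX = "[TTTAR-TELEMETRY]"


def parse_telemetry(line):
    """Return the key=value fields of one telemetry line, or None for other lines."""
    if TELEMETRY_PREFIX not in line:
        return None
    payload = line.split(TELEMETRY_PREFIX, 1)[1]
    # Values are kept as strings; callers convert the ones they need.
    return dict(re.findall(r"(\w+)=(\S+)", payload))


def check_attempt(fields, low_ms=1900, high_ms=3000):
    """Apply recipe items 1 and 2 to one parsed attempt; return a list of findings."""
    findings = []
    # Item 1: a success-path attempt that failed means the flake re-emerged.
    if fields["expectedSuccess"] == "True" and fields["actualSuccess"] == "False":
        findings.append("success-path flake")
    # Item 2: the timing-out attempt should finish close to its configured budget.
    if fields["role"] == "timeoutAttempt":
        elapsed = int(fields["elapsedMs"])
        if not low_ms <= elapsed <= high_ms:
            findings.append(f"timeout elapsedMs={elapsed} outside [{low_ms}, {high_ms}]")
    return findings
```

Recipe item 3 (collecting the `successAttempt` distribution) is left out on purpose; it only needs the parsed `elapsedMs` values appended to a list.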
…13734) (#13878)

**Human-written TL;DR for @AlesProkop:** Basically I see in the data that we are failing at around 2s, which correlates to the timeout added in #13351. The hypothesis is that the messages don't all make it to the tool task in time, so instead of terminating on the null message we terminate on timeout, and hence the messages are missing. Since this test is flaky, I want to see how much 15s changes the flakiness. Note that my current feeling is that the flakes are actually legitimate: there is a clear code path introduced in the PR above that would cause this type of behavior, so I need to figure out how the test design needs to evolve to meet it.

---

### Summary

Temporary test-side mitigation for the flake tracked in #13734, specifically affecting `ToolTaskCanChangeCanonicalErrorFormat` (~21.8% failure rate in the last 7 days of `dotnet-msbuild-public-ci`).

### Root cause (from per-failure trace evidence)

The diagnostics added in #13830 show every failure pinned at:

```
Execute()=True, ExitCode=0, Errors=0, Warnings=0, MessageCount=1, elapsedMs ∈ {2108, 2124, 2143, 2151}
```

`MessageCount=1` is just `ToolTask`'s own pre-launch cmd echo: **none** of the spawned tool's stdout lines arrived. `elapsedMs ≈ 2100` is pinned at the 2s `eofTimeoutSec` budget in `ToolTask.WaitForProcessExit` (added in #13351 / Wave 18.6).

The race is not tail-truncation but whole-stream loss: on 2-vCPU CI agents with default `ThreadPool` IOCP min=2 and many parallel test hosts, the I/O completion that drives `AsyncStreamReader → ReceiveStandardErrorOrOutputData → _standardOutputEOF.Set()` can be queued behind other completions for >2s. When `WaitHandle.WaitAll` times out, `LogMessagesFromStandardOutput` drains a still-empty queue.

### Change

Append a sleep to the shell command in the affected test so the writer end of the pipe stays open ~15s after the data is written. This gives the AsyncStreamReader's IOCP completion ample slack to be scheduled before the 2s `WaitAll` budget starts being consumed.

- Windows: `ping 127.0.0.1 -n 16 > nul` (~15s)
- Unix: `sleep 15`

### Why 15s, why temporary

15s is intentionally generous as a back-stop while we confirm the IOCP-starvation hypothesis at CI scale. **Once CI history shows the flake is gone, we should shrink this** to whatever the observed delivery latency suggests (likely 1–3s is enough). I left a comment in the code to that effect.

This is **only a test fix**; `ToolTask.cs` is intentionally untouched. The production-side question (whether `WaitForProcessExit` should perform a synchronous final read on timeout, or extend the 2s budget) is real but separate and worth a follow-up issue.

### Verification

Local run on Windows passed on both TFMs (each ~16s, dominated by the sleep as expected):

```
net10.0|x64 passed (15s 781ms)
net48|x86 passed (16s 032ms)
```

### Follow-ups (not in this PR)

- Apply the same treatment to `OverrideStdOutImportanceToHigh` (same `Assert.Contains "hello world" not found` signature in flake data).
- Investigate `ErrorWhenTextSentToStandardError`, `HandleExecutionErrorsWhenToolLogsError`, `ToolTaskThatTimeoutAndRetry`: different failure signatures, may be real bugs rather than the same race.
- File a production-side issue for the `WaitForProcessExit` 2s data-loss window.
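The failure signature described under "Root cause" can be checked mechanically when triaging further failures. This is a rough Python sketch under stated assumptions: the exact text of the per-test diagnostic line is not shown in this thread, so the comma-separated `key=value` layout below is assumed from the summary, and `parse_diagnostic`, `looks_like_stream_loss`, and the 500 ms slack are hypothetical.

```python
import re

EOF_BUDGET_MS = 2000  # the 2s eofTimeoutSec budget named in the description


def parse_diagnostic(line):
    """Split an assumed 'Execute()=True, ExitCode=0, ...' line into string fields."""
    return dict(re.findall(r"([\w()]+)=(-?\w+)", line))


def looks_like_stream_loss(fields, slack_ms=500):
    """True when a run matches the whole-stream-loss signature:
    the task reports success, only the cmd echo was logged, and the
    elapsed time sits just above the EOF wait budget."""
    elapsed = int(fields.get("elapsedMs", "-1"))
    return (
        fields.get("Execute()") == "True"
        and fields.get("ExitCode") == "0"
        and fields.get("Errors") == "0"
        and fields.get("MessageCount") == "1"
        and EOF_BUDGET_MS <= elapsed <= EOF_BUDGET_MS + slack_ms
    )
```

A failure that does not match (for example, several messages logged or a much shorter elapsed time) would point away from the EOF-wait race and toward one of the other signatures listed under follow-ups.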
Three `ToolTask_Tests` flakes share a common timing / process-output capture failure mode. Combined they account for the loudest cluster of CI noise in `Microsoft.Build.Utilities.UnitTests` over a 7-day flake survey on dnceng-public pipeline 75.

Changes
- `ToolTaskCanChangeCanonicalErrorFormat`: diagnostic-only. Logs `Execute()` return, `ExitCode`, `Errors`, `Warnings`, `MessageCount`, `elapsedMs`, and full `engine.Log` before the assertions, so the next failure is actionable instead of just showing the command line.
- `HandleExecutionErrorsWhenToolLogsError`: diagnostic-only. Same counters + `elapsedMs`. `Shouldly` customMessage on `ShouldBeFalse` so we can see `ExitCode` / `engine.Errors` when `Execute()` unexpectedly returns true. Drops a redundant log dump (the engine already streams to `_output` live).
- `ToolTaskThatTimeoutAndRetry`: widens the slow / fast gap so process startup overhead on slow CI agents can't cause the "slow" path to outrun its timeout: `slowDelay` 5 s -> 20 s, `Timeout` 2 s -> 5 s. Logs per-attempt `elapsedMs` alongside `configuredTimeoutMs` so a follow-up can shrink these back once CI data accumulates.

Likely suspect for the two diagnostic-only tests: async pipe drain race in `ToolTask` (Execute returns before stdout EOF, so engine.Log only captures the cmd echo). Confirmation signal would be `MessageCount==0` with non-empty exit. These PRs make that signal visible.

Risk
Test-only changes. All 12 affected test instances pass locally.
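The slow / fast gap idea can be tried outside MSBuild with any child process. This Python sketch is an illustration only, not the test's code: `run_attempt` is a hypothetical helper, the child is a Python sleep rather than `ping`, and the printed line merely mimics the `elapsedMs` / `configuredTimeoutMs` pairing described above.

```python
import subprocess
import sys
import time


def run_attempt(delay_s, timeout_s):
    """Run a child that sleeps for delay_s under a timeout_s budget.

    Returns (timed_out, elapsed_ms). subprocess.run kills the child
    when the timeout expires, so a timed-out attempt returns shortly
    after the budget is spent rather than after the full delay.
    """
    cmd = [sys.executable, "-c", f"import time; time.sleep({delay_s})"]
    start = time.monotonic()
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True)
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    elapsed_ms = int((time.monotonic() - start) * 1000)
    print(f"elapsedMs={elapsed_ms} configuredTimeoutMs={int(timeout_s * 1000)} timedOut={timed_out}")
    return timed_out, elapsed_ms
```

With a delay several times larger than the timeout, interpreter startup overhead cannot let the slow child finish before the timeout fires; with a delay close to the timeout, the outcome depends on how long startup takes on that machine, which is the kind of sensitivity the widened budgets are meant to remove.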