Stabilize 3 ToolTask_Tests flakes with diagnostics + timing fix by jankratochvilcz · Pull Request #13830 · dotnet/msbuild

jankratochvilcz · 2026-05-21T09:26:29Z

Three ToolTask_Tests flakes share a common timing / process-output capture failure mode. Combined they account for the loudest cluster of CI noise in Microsoft.Build.Utilities.UnitTests over a 7-day flake survey on dnceng-public pipeline 75.

Changes

ToolTaskCanChangeCanonicalErrorFormat — diagnostic-only. Logs Execute() return, ExitCode, Errors, Warnings, MessageCount, elapsedMs, and full engine.Log before the assertions, so the next failure is actionable instead of just showing the command line.

HandleExecutionErrorsWhenToolLogsError — diagnostic-only. Same counters + elapsedMs. Shouldly customMessage on ShouldBeFalse so we can see ExitCode / engine.Errors when Execute() unexpectedly returns true. Drops a redundant log dump (the engine already streams to _output live).

ToolTaskThatTimeoutAndRetry — widens the slow / fast gap so process startup overhead on slow CI agents can't cause the "slow" path to outrun its timeout: slowDelay 5 s -> 20 s, Timeout 2 s -> 5 s. Logs per-attempt elapsedMs alongside configuredTimeoutMs so a follow-up can shrink these back once CI data accumulates.

Likely-suspect for the two diagnostic-only tests: async pipe drain race in ToolTask (Execute returns before stdout EOF, so engine.Log only captures the cmd echo). Confirmation signal would be MessageCount==0 with non-empty exit. These PRs make that signal visible.

Risk

Test-only changes. All 12 affected test instances pass locally.

Tracks three closely-related flakes from the past 7 days of CI: * ToolTaskCanChangeCanonicalErrorFormat (23 failing builds, 17 branches) Add diagnostic _output.WriteLine of Execute() result, ExitCode, and the full engine.Log so that future failures expose the captured tool output instead of only the cmd-echo currently shown in xUnit error messages. * HandleExecutionErrorsWhenToolLogsError (7 failing builds, 6 branches) Same diagnostic dump plus a Shouldly customMessage on ShouldBeFalse so we can see why ToolTask sometimes reports success when the tool logs a canonical error. * ToolTaskThatTimeoutAndRetry (7 failing builds, 5 branches) Quadruple the slow-delay (5s -> 20s) and timeout (2s -> 5s) so the timing gap is wide enough to survive process startup overhead on slow CI agents. Also include attempt info in the assertion failure message. These tests share a common 'log race / process timing' failure mode that is hard to reproduce locally, so the priority is making the next failure actionable. Fixes #38 Fixes #39 Fixes #40 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Addresses expert-reviewer feedback on PR #46: - All 3 tests now log elapsedMs alongside the existing diagnostics so a follow-up PR can tune the bumped budgets in ToolTaskThatTimeoutAndRetry (slowDelay=20s, timeout=5s) back to tighter values once we see actuals. - HandleExecutionErrorsWhenToolLogsError: dropped the engine.Log dump (MockEngine is wired to _output already, so log lines stream live; the extra dump just duplicated output in the trx). - All three tests now also log Errors/Warnings/Messages counts which helps distinguish the async-pipe-drain truncation hypothesis from a canonical-error parser bug. TODO markers point at issues #38/#39/#40 so the diagnostics get cleaned up once root cause is fixed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates ToolTask_Tests to reduce CI flakiness by adding targeted diagnostics to two flaky tests and by widening the timing gap in the timeout/retry test so slow agents don’t accidentally miss the intended timeout path.

Changes:

Add per-test diagnostics (Execute result, ExitCode, error/warning/message counters, elapsed time) to make future failures actionable.
Dump engine.Log for ToolTaskCanChangeCanonicalErrorFormat to capture truncated/missing output scenarios.
Increase the delay/timeout budgets in ToolTaskThatTimeoutAndRetry and log per-attempt elapsed vs configured timeout.

github-actions

Review Summary

This is a well-structured diagnostic-instrumentation PR that follows MSBuild test conventions correctly. The approach of adding Stopwatch timing + _output.WriteLine diagnostics to capture flaky test state is sound and non-invasive.

✅ Looks Good

Test semantics preserved: All original assertions are maintained; new custom messages in ShouldBe/ShouldBeFalse add context without changing pass/fail behavior.
Test infrastructure patterns: Correctly uses ITestOutputHelper (_output), Shouldly assertions with custom messages, and TestEnvironment.
System.Diagnostics.Stopwatch: Already imported — no missing using.
engine.Messages: Confirmed to be an int count on MockEngine3 — the interpolated string is correct.
ToolTaskCanChangeCanonicalErrorFormat: Not asserting on executeResult is intentional and matches original behavior.

⚠️ Minor Concerns (non-blocking)

No tracking issue linked: The TODO comments should reference a GitHub issue so the diagnostics don't become permanent.
Test duration increase: The timeout bump (2s→5s) adds ~3s per timeout-path Theory case. Acceptable short-term but should be tightened once data is collected (as the comments state).
Line 1 whitespace: Trivial cosmetic change — consider reverting to keep git blame clean.

Overall this is a pragmatic approach to debugging a known flaky test. No blocking issues.

Note

🔒 Integrity filter blocked 1 item

The following item were blocked because they don't meet the GitHub integrity level.

#13830 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13830 · ● 847.9K

jankratochvilcz · 2026-05-21T09:38:52Z

Thanks for the review. Triage:

Addressed (force-pushed):

Reverted unintended UTF-8 BOM strip on line 1 of ToolTask_Tests.cs. Diff is now purely the three intended test changes.

Not addressed (left as-is):

TODO comments should reference a GitHub issue. These TODOs (e.g. TODO: re-tighten once telemetry confirms ) are deliberately not tied to an upstream issue — they are anchored to the elapsed-time telemetry this PR introduces, which is the data source we'll use to tighten the bumps. Filing a tracking issue just to point at telemetry that lives in the same diff would be noise.
Theory cases now take ~3s longer due to the 2s → 5s bump. Acknowledged in code comments. The point of the elapsed-time telemetry added here is precisely to gather data to tighten this back down; it is intended as a short-term widening, not a permanent budget.

This PR is test-only (src/Utilities.UnitTests/ToolTask_Tests.cs); production concerns (allocations, Span<T> etc.) don't apply.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jankratochvilcz · 2026-05-21T11:32:54Z

/azp run msbuild-pr

azure-pipelines · 2026-05-21T11:33:06Z

Azure Pipelines successfully started running 1 pipeline(s).

jankratochvilcz · 2026-05-22T14:08:36Z

24h post-merge telemetry from main (dnceng-public pipeline 75, builds 1430301, 1430551, 1430938, 1431045, 1431261):

ToolTaskCanChangeCanonicalErrorFormat

TFM/arch	Runs	min ms	p50 ms	p95 ms	max ms	Failed
net10.0/x64	18	12	46	99	100	0
net472/x86	10	64	90	1154	1667	0

HandleExecutionErrorsWhenToolLogsError

TFM/arch	Runs	min ms	p50 ms	p95 ms	max ms	Failed
net10.0/x64	9	32	36	124	142	0
net472/x86	10	35	54	301	453	0

MessageCount==0 with non-empty exit (the race the PR predicted)

Not observed. Across all instrumented runs (37 instances for the first test, 19 for the second), every passing run had MessageCount > 0. The async-pipe-drain race remains a plausible-but-unconfirmed hypothesis after 24 h — leaving the diagnostics in place for another wave.

ToolTaskThatTimeoutAndRetry

Fast path (target: succeed within Timeout)

TFM/arch	Runs	min ms	p50 ms	p95 ms	max ms	Failed
net10.0/x64	108	102	684	1102	1136	0
net472/x86	60	1030	1048	1226	1370	0

Slow path (target: be killed by Timeout — elapsed ≈ configured Timeout)

TFM/arch	Runs	min ms	p50 ms	p95 ms	max ms	Failed
net10.0/x64	18	5030	5069	5199	5269	0
net472/x86	10	5030	5068	5579	5991	0

Follow-up draft #13846 shrinks Timeout 5 s → 4.5 s (~3.3x of observed fast max) and slowDelay 20 s → 18 s (preserves the 4x slow/fast gap), reclaiming ~17 s of wall-clock per slow attempt.

…13864) Deterministic rewrite for the `ToolTaskThatTimeoutAndRetry` flake tracked in #13667. ## Background After the bump in #13830 (slowDelay 20s, timeout 5s) the test still flaked: a 4-day telemetry harvest of dnceng-public PR CI surfaced 2 failures on a no-code Arcade dependency-bump PR (#13863), proving the bumped budget is the floor, not a comfortable ceiling. ## Root cause The test set `task.Timeout` **once before the loop** and shared it across every attempt. On the `(3, true)` case the follow-up attempts run `ping -n 2` (~1-2s) against the shared 5s budget, leaving only 3-4s of headroom for CI cold-start overhead. The success-path attempts have **no semantic reason** to be wall-clock-bounded -- the test asserts they succeed regardless of how long the underlying ping/sleep takes. They were only coupled to the timing-out attempt because both used the same `Timeout` field. ## Fix 1. **Set `task.Timeout` per attempt.** Only the attempt expected to time out gets a finite 2s budget; every other attempt uses `Timeout.Infinite`. Removes all wall-clock dependency from the success path. 2. **Tighten `slowDelay` from 20s to 10s.** With Timeout=2s, ping ~10s vs timeout 2s gives 5x headroom -- the tool cannot finish before the timeout fires on any agent. Also halves test wall clock (~4-6s vs ~7-9s). ## Telemetry for post-merge health checks The test emits a stable-prefix line per attempt so future flake-trend analysis can grep it out of test stdout attachments: \\\ [TTTAR-TELEMETRY] attempt=1/3 role=timeoutAttempt expectedSuccess=False actualSuccess=False exitCode=-1 elapsedMs=2034 configuredTimeoutMs=2000 slowDelayMs=10000 fastDelayMs=100 [TTTAR-TELEMETRY] attempt=2/3 role=successAttempt expectedSuccess=True actualSuccess=True exitCode=0 elapsedMs=1521 configuredTimeoutMs=-1 slowDelayMs=10000 fastDelayMs=100 \\\ **Health-check recipe** (after the change has soaked for a few days): 1. `[TTTAR-TELEMETRY] ... actualSuccess=False expectedSuccess=True` anywhere in passing-build stdout = flake re-emerged on the success path; investigate immediately. 2. For `role=timeoutAttempt`, `elapsedMs` should be in roughly `[1900, 3000]` ms (timeout fires plus process-kill overhead). A consistent drift to the high end signals process-termination regressions. 3. For `role=successAttempt`, `elapsedMs` is uncapped by design; collect the distribution to know what budget margin we have if we ever want to re-introduce a Timeout on success. ## Verification 5 consecutive local runs ├ù 6 inline cases = **30/30 passing** on `net10.0|x64` and `net48|x86`. Opening as draft -- happy to iterate on shape.

@AlesProkop

…13734) (#13878) **Human-writen TL;DR for @AlesProkop:** Basically I see in the data that we are failing at around 2s which correlates to timeout added in #13351. The hypothesis is that the messages don't all make it in time to the tool task so instead of terminating on the null message, we terminate on timeout and hence the messages are missing. Since this test is flaky, I want to see how much 15s changes the flakiness. Note that my current feeling is that the flakes are actually legitimate - there is a clear code path introduced in the PR above that would cause this type of behavior, so I need to figure out how the test design needs to evolve to meet it. --- ### Summary Temporary test-side mitigation for the flake tracked in #13734, specifically affecting `ToolTaskCanChangeCanonicalErrorFormat` (~21.8% failure rate in the last 7 days of `dotnet-msbuild-public-ci`). ### Root cause (from per-failure trace evidence) The diagnostics added in #13830 show every failure pinned at: ``` Execute()=True, ExitCode=0, Errors=0, Warnings=0, MessageCount=1, elapsedMs ∈ {2108, 2124, 2143, 2151} ``` `MessageCount=1` is just `ToolTask`'s own pre-launch cmd echo — **none** of the spawned tool's stdout lines arrived. `elapsedMs ≈ 2100` is pinned at the 2s `eofTimeoutSec` budget in `ToolTask.WaitForProcessExit` (added in #13351 / Wave 18.6). The race is not tail-truncation but whole-stream loss: on 2-vCPU CI agents with default `ThreadPool` IOCP min=2 and many parallel test hosts, the I/O completion that drives `AsyncStreamReader → ReceiveStandardErrorOrOutputData → _standardOutputEOF.Set()` can be queued behind other completions for >2s. When `WaitHandle.WaitAll` times out, `LogMessagesFromStandardOutput` drains a still-empty queue. ### Change Append a sleep to the shell command in the affected test so the writer end of the pipe stays open ~15s after the data is written. This gives the AsyncStreamReader's IOCP completion ample slack to be scheduled before the 2s `WaitAll` budget starts being consumed. - Windows: `ping 127.0.0.1 -n 16 > nul` (~15s) - Unix: `sleep 15` ### Why 15s, why temporary 15s is intentionally generous as a back-stop while we confirm the IOCP-starvation hypothesis at CI scale. **Once CI history shows the flake is gone, we should shrink this** to whatever the observed delivery latency suggests (likely 1–3s is enough). I left a comment in the code to that effect. This is **only a test fix** — `ToolTask.cs` is intentionally untouched. The production-side question (whether `WaitForProcessExit` should perform a synchronous final read on timeout, or extend the 2s budget) is real but separate and worth a follow-up issue. ### Verification Local run on Windows passed on both TFMs (each ~16s, dominated by the sleep as expected): ``` net10.0|x64 passed (15s 781ms) net48|x86 passed (16s 032ms) ``` ### Follow-ups (not in this PR) - Apply the same treatment to `OverrideStdOutImportanceToHigh` (same `Assert.Contains "hello world" not found` signature in flake data). - Investigate `ErrorWhenTextSentToStandardError`, `HandleExecutionErrorsWhenToolLogsError`, `ToolTaskThatTimeoutAndRetry` — different failure signatures, may be real bugs rather than the same race. - File a production-side issue for the `WaitForProcessExit` 2s data-loss window.

Jan Krivanek and others added 2 commits May 21, 2026 11:09

Copilot AI review requested due to automatic review settings May 21, 2026 09:26

Copilot started reviewing on behalf of jankratochvilcz May 21, 2026 09:26 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs Outdated

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs Outdated

github-actions Bot reviewed May 21, 2026

View reviewed changes

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs Outdated

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs

jankratochvilcz force-pushed the upstream-pr/tooltask-diagnostics branch from c7234f1 to f50fecc Compare May 21, 2026 09:37

Drop fork-specific issue refs from source comments

a32ccbc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jankratochvilcz force-pushed the upstream-pr/tooltask-diagnostics branch from f50fecc to a32ccbc Compare May 21, 2026 09:44

AlesProkop approved these changes May 21, 2026

View reviewed changes

jankratochvilcz merged commit af8e2bb into dotnet:main May 21, 2026
10 checks passed

jankratochvilcz mentioned this pull request May 21, 2026

Add elapsed-time telemetry to RunProcessAndGetOutput test helper #13836

Closed

This was referenced May 22, 2026

[release/10.0.4xx] Source code updates from dotnet/msbuild dotnet/dotnet#6758

Merged

[main] Source code updates from dotnet/msbuild dotnet/dotnet#6760

Merged

jankratochvilcz mentioned this pull request May 22, 2026

Shrink ToolTaskThatTimeoutAndRetry budgets: slowDelay 20s -> 18s, timeout 5s -> 4.5s (follow-up to #13830) #13846

Closed

This was referenced May 25, 2026

Fix or disable flaky tests #13667

Open

Deterministic ToolTaskThatTimeoutAndRetry: set Timeout per-attempt jankratochvilcz/msbuild#50

Draft

Deterministic ToolTaskThatTimeoutAndRetry: set Timeout per-attempt #13864

Merged

greptile-apps Bot mentioned this pull request May 25, 2026

[pull] main from dotnet:main MarcelRaschke/msbuild#353

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilize 3 ToolTask_Tests flakes with diagnostics + timing fix#13830

Stabilize 3 ToolTask_Tests flakes with diagnostics + timing fix#13830
jankratochvilcz merged 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/tooltask-diagnostics

jankratochvilcz commented May 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jankratochvilcz commented May 21, 2026

Uh oh!

jankratochvilcz commented May 21, 2026

Uh oh!

azure-pipelines Bot commented May 21, 2026

Uh oh!

Uh oh!

jankratochvilcz commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jankratochvilcz commented May 21, 2026

Changes

Risk

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Review Summary

✅ Looks Good

⚠️ Minor Concerns (non-blocking)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jankratochvilcz commented May 21, 2026

Uh oh!

jankratochvilcz commented May 21, 2026

Uh oh!

azure-pipelines Bot commented May 21, 2026

Uh oh!

Uh oh!

jankratochvilcz commented May 22, 2026

ToolTaskCanChangeCanonicalErrorFormat

HandleExecutionErrorsWhenToolLogsError

MessageCount==0 with non-empty exit (the race the PR predicted)

ToolTaskThatTimeoutAndRetry

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants