Skip to content

Stabilize 3 ToolTask_Tests flakes with diagnostics + timing fix#13830

Merged
jankratochvilcz merged 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/tooltask-diagnostics
May 21, 2026
Merged

Stabilize 3 ToolTask_Tests flakes with diagnostics + timing fix#13830
jankratochvilcz merged 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/tooltask-diagnostics

Conversation

@jankratochvilcz
Copy link
Copy Markdown
Contributor

Three ToolTask_Tests flakes share a common timing / process-output capture failure mode. Combined they account for the loudest cluster of CI noise in Microsoft.Build.Utilities.UnitTests over a 7-day flake survey on dnceng-public pipeline 75.

Changes

ToolTaskCanChangeCanonicalErrorFormat — diagnostic-only. Logs Execute() return, ExitCode, Errors, Warnings, MessageCount, elapsedMs, and full engine.Log before the assertions, so the next failure is actionable instead of just showing the command line.

HandleExecutionErrorsWhenToolLogsError — diagnostic-only. Same counters + elapsedMs. Shouldly customMessage on ShouldBeFalse so we can see ExitCode / engine.Errors when Execute() unexpectedly returns true. Drops a redundant log dump (the engine already streams to _output live).

ToolTaskThatTimeoutAndRetry — widens the slow / fast gap so process startup overhead on slow CI agents can't cause the "slow" path to outrun its timeout: slowDelay 5 s -> 20 s, Timeout 2 s -> 5 s. Logs per-attempt elapsedMs alongside configuredTimeoutMs so a follow-up can shrink these back once CI data accumulates.

Likely-suspect for the two diagnostic-only tests: async pipe drain race in ToolTask (Execute returns before stdout EOF, so engine.Log only captures the cmd echo). Confirmation signal would be MessageCount==0 with non-empty exit. These PRs make that signal visible.

Risk

Test-only changes. All 12 affected test instances pass locally.

Jan Krivanek and others added 2 commits May 21, 2026 11:09
Tracks three closely-related flakes from the past 7 days of CI:

  * ToolTaskCanChangeCanonicalErrorFormat (23 failing builds, 17 branches)
    Add diagnostic _output.WriteLine of Execute() result, ExitCode, and the
    full engine.Log so that future failures expose the captured tool output
    instead of only the cmd-echo currently shown in xUnit error messages.

  * HandleExecutionErrorsWhenToolLogsError (7 failing builds, 6 branches)
    Same diagnostic dump plus a Shouldly customMessage on ShouldBeFalse so
    we can see why ToolTask sometimes reports success when the tool logs
    a canonical error.

  * ToolTaskThatTimeoutAndRetry (7 failing builds, 5 branches)
    Quadruple the slow-delay (5s -> 20s) and timeout (2s -> 5s) so the
    timing gap is wide enough to survive process startup overhead on slow
    CI agents. Also include attempt info in the assertion failure message.

These tests share a common 'log race / process timing' failure mode that
is hard to reproduce locally, so the priority is making the next failure
actionable.

Fixes #38
Fixes #39
Fixes #40

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses expert-reviewer feedback on PR #46:
- All 3 tests now log elapsedMs alongside the existing diagnostics so a
  follow-up PR can tune the bumped budgets in ToolTaskThatTimeoutAndRetry
  (slowDelay=20s, timeout=5s) back to tighter values once we see actuals.
- HandleExecutionErrorsWhenToolLogsError: dropped the engine.Log dump
  (MockEngine is wired to _output already, so log lines stream live; the
  extra dump just duplicated output in the trx).
- All three tests now also log Errors/Warnings/Messages counts which
  helps distinguish the async-pipe-drain truncation hypothesis from a
  canonical-error parser bug. TODO markers point at issues #38/#39/#40
  so the diagnostics get cleaned up once root cause is fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 09:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates ToolTask_Tests to reduce CI flakiness by adding targeted diagnostics to two flaky tests and by widening the timing gap in the timeout/retry test so slow agents don’t accidentally miss the intended timeout path.

Changes:

  • Add per-test diagnostics (Execute result, ExitCode, error/warning/message counters, elapsed time) to make future failures actionable.
  • Dump engine.Log for ToolTaskCanChangeCanonicalErrorFormat to capture truncated/missing output scenarios.
  • Increase the delay/timeout budgets in ToolTaskThatTimeoutAndRetry and log per-attempt elapsed vs configured timeout.

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs Outdated
Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs
Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This is a well-structured diagnostic-instrumentation PR that follows MSBuild test conventions correctly. The approach of adding Stopwatch timing + _output.WriteLine diagnostics to capture flaky test state is sound and non-invasive.

✅ Looks Good

  • Test semantics preserved: All original assertions are maintained; new custom messages in ShouldBe/ShouldBeFalse add context without changing pass/fail behavior.
  • Test infrastructure patterns: Correctly uses ITestOutputHelper (_output), Shouldly assertions with custom messages, and TestEnvironment.
  • System.Diagnostics.Stopwatch: Already imported — no missing using.
  • engine.Messages: Confirmed to be an int count on MockEngine3 — the interpolated string is correct.
  • ToolTaskCanChangeCanonicalErrorFormat: Not asserting on executeResult is intentional and matches original behavior.

⚠️ Minor Concerns (non-blocking)

  1. No tracking issue linked: The TODO comments should reference a GitHub issue so the diagnostics don't become permanent.
  2. Test duration increase: The timeout bump (2s→5s) adds ~3s per timeout-path Theory case. Acceptable short-term but should be tightened once data is collected (as the comments state).
  3. Line 1 whitespace: Trivial cosmetic change — consider reverting to keep git blame clean.

Overall this is a pragmatic approach to debugging a known flaky test. No blocking issues.

Note

🔒 Integrity filter blocked 1 item

The following item were blocked because they don't meet the GitHub integrity level.

  • #13830 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13830 · ● 847.9K

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs
Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs
Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs Outdated
Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs
@jankratochvilcz jankratochvilcz force-pushed the upstream-pr/tooltask-diagnostics branch from c7234f1 to f50fecc Compare May 21, 2026 09:37
@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

Thanks for the review. Triage:

Addressed (force-pushed):

  • Reverted unintended UTF-8 BOM strip on line 1 of ToolTask_Tests.cs. Diff is now purely the three intended test changes.

Not addressed (left as-is):

  • TODO comments should reference a GitHub issue. These TODOs (e.g. TODO: re-tighten once telemetry confirms ) are deliberately not tied to an upstream issue — they are anchored to the elapsed-time telemetry this PR introduces, which is the data source we'll use to tighten the bumps. Filing a tracking issue just to point at telemetry that lives in the same diff would be noise.
  • Theory cases now take ~3s longer due to the 2s → 5s bump. Acknowledged in code comments. The point of the elapsed-time telemetry added here is precisely to gather data to tighten this back down; it is intended as a short-term widening, not a permanent budget.

This PR is test-only (src/Utilities.UnitTests/ToolTask_Tests.cs); production concerns (allocations, Span<T> etc.) don't apply.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jankratochvilcz jankratochvilcz force-pushed the upstream-pr/tooltask-diagnostics branch from f50fecc to a32ccbc Compare May 21, 2026 09:44
@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

/azp run msbuild-pr

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

24h post-merge telemetry from main (dnceng-public pipeline 75, builds 1430301, 1430551, 1430938, 1431045, 1431261):

ToolTaskCanChangeCanonicalErrorFormat

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 18 12 46 99 100 0
net472/x86 10 64 90 1154 1667 0

HandleExecutionErrorsWhenToolLogsError

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 9 32 36 124 142 0
net472/x86 10 35 54 301 453 0

MessageCount==0 with non-empty exit (the race the PR predicted)

Not observed. Across all instrumented runs (37 instances for the first test, 19 for the second), every passing run had MessageCount > 0. The async-pipe-drain race remains a plausible-but-unconfirmed hypothesis after 24 h — leaving the diagnostics in place for another wave.

ToolTaskThatTimeoutAndRetry

Fast path (target: succeed within Timeout)

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 108 102 684 1102 1136 0
net472/x86 60 1030 1048 1226 1370 0

Slow path (target: be killed by Timeout — elapsed ≈ configured Timeout)

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 18 5030 5069 5199 5269 0
net472/x86 10 5030 5068 5579 5991 0

Follow-up draft #13846 shrinks Timeout 5 s → 4.5 s (~3.3x of observed fast max) and slowDelay 20 s → 18 s (preserves the 4x slow/fast gap), reclaiming ~17 s of wall-clock per slow attempt.

jankratochvilcz added a commit that referenced this pull request May 25, 2026
…13864)

Deterministic rewrite for the `ToolTaskThatTimeoutAndRetry` flake
tracked in #13667.

## Background

After the bump in #13830 (slowDelay 20s, timeout 5s) the test still
flaked: a 4-day telemetry harvest of dnceng-public PR CI surfaced 2
failures on a no-code Arcade dependency-bump PR (#13863), proving the
bumped budget is the floor, not a comfortable ceiling.

## Root cause

The test set `task.Timeout` **once before the loop** and shared it
across every attempt. On the `(3, true)` case the follow-up attempts run
`ping -n 2` (~1-2s) against the shared 5s budget, leaving only 3-4s of
headroom for CI cold-start overhead. The success-path attempts have **no
semantic reason** to be wall-clock-bounded -- the test asserts they
succeed regardless of how long the underlying ping/sleep takes. They
were only coupled to the timing-out attempt because both used the same
`Timeout` field.

## Fix

1. **Set `task.Timeout` per attempt.** Only the attempt expected to time
out gets a finite 2s budget; every other attempt uses
`Timeout.Infinite`. Removes all wall-clock dependency from the success
path.
2. **Tighten `slowDelay` from 20s to 10s.** With Timeout=2s, ping ~10s
vs timeout 2s gives 5x headroom -- the tool cannot finish before the
timeout fires on any agent. Also halves test wall clock (~4-6s vs
~7-9s).

## Telemetry for post-merge health checks

The test emits a stable-prefix line per attempt so future flake-trend
analysis can grep it out of test stdout attachments:

\\\
[TTTAR-TELEMETRY] attempt=1/3 role=timeoutAttempt expectedSuccess=False
actualSuccess=False exitCode=-1 elapsedMs=2034 configuredTimeoutMs=2000
slowDelayMs=10000 fastDelayMs=100
[TTTAR-TELEMETRY] attempt=2/3 role=successAttempt expectedSuccess=True
actualSuccess=True exitCode=0 elapsedMs=1521 configuredTimeoutMs=-1
slowDelayMs=10000 fastDelayMs=100
\\\

**Health-check recipe** (after the change has soaked for a few days):
1. `[TTTAR-TELEMETRY] ... actualSuccess=False expectedSuccess=True`
anywhere in passing-build stdout = flake re-emerged on the success path;
investigate immediately.
2. For `role=timeoutAttempt`, `elapsedMs` should be in roughly `[1900,
3000]` ms (timeout fires plus process-kill overhead). A consistent drift
to the high end signals process-termination regressions.
3. For `role=successAttempt`, `elapsedMs` is uncapped by design; collect
the distribution to know what budget margin we have if we ever want to
re-introduce a Timeout on success.

## Verification

5 consecutive local runs × 6 inline cases = **30/30 passing** on
`net10.0|x64` and `net48|x86`.

Opening as draft -- happy to iterate on shape.
jankratochvilcz added a commit that referenced this pull request May 27, 2026
…13734) (#13878)

**Human-writen TL;DR for @AlesProkop:** Basically I see in the data that
we are failing at around 2s which correlates to timeout added in
#13351. The hypothesis is that the
messages don't all make it in time to the tool task so instead of
terminating on the null message, we terminate on timeout and hence the
messages are missing. Since this test is flaky, I want to see how much
15s changes the flakiness. Note that my current feeling is that the
flakes are actually legitimate - there is a clear code path introduced
in the PR above that would cause this type of behavior, so I need to
figure out how the test design needs to evolve to meet it.

---

### Summary

Temporary test-side mitigation for the flake tracked in #13734,
specifically affecting `ToolTaskCanChangeCanonicalErrorFormat` (~21.8%
failure rate in the last 7 days of `dotnet-msbuild-public-ci`).

### Root cause (from per-failure trace evidence)

The diagnostics added in #13830 show every failure pinned at:

```
Execute()=True, ExitCode=0, Errors=0, Warnings=0, MessageCount=1, elapsedMs ∈ {2108, 2124, 2143, 2151}
```

`MessageCount=1` is just `ToolTask`'s own pre-launch cmd echo — **none**
of the spawned tool's stdout lines arrived. `elapsedMs ≈ 2100` is pinned
at the 2s `eofTimeoutSec` budget in `ToolTask.WaitForProcessExit` (added
in #13351 / Wave 18.6).

The race is not tail-truncation but whole-stream loss: on 2-vCPU CI
agents with default `ThreadPool` IOCP min=2 and many parallel test
hosts, the I/O completion that drives `AsyncStreamReader →
ReceiveStandardErrorOrOutputData → _standardOutputEOF.Set()` can be
queued behind other completions for >2s. When `WaitHandle.WaitAll` times
out, `LogMessagesFromStandardOutput` drains a still-empty queue.

### Change

Append a sleep to the shell command in the affected test so the writer
end of the pipe stays open ~15s after the data is written. This gives
the AsyncStreamReader's IOCP completion ample slack to be scheduled
before the 2s `WaitAll` budget starts being consumed.

- Windows: `ping 127.0.0.1 -n 16 > nul` (~15s)
- Unix:    `sleep 15`

### Why 15s, why temporary

15s is intentionally generous as a back-stop while we confirm the
IOCP-starvation hypothesis at CI scale. **Once CI history shows the
flake is gone, we should shrink this** to whatever the observed delivery
latency suggests (likely 1–3s is enough). I left a comment in the code
to that effect.

This is **only a test fix** — `ToolTask.cs` is intentionally untouched.
The production-side question (whether `WaitForProcessExit` should
perform a synchronous final read on timeout, or extend the 2s budget) is
real but separate and worth a follow-up issue.

### Verification

Local run on Windows passed on both TFMs (each ~16s, dominated by the
sleep as expected):

```
net10.0|x64 passed (15s 781ms)
net48|x86   passed (16s 032ms)
```

### Follow-ups (not in this PR)

- Apply the same treatment to `OverrideStdOutImportanceToHigh` (same
`Assert.Contains "hello world" not found` signature in flake data).
- Investigate `ErrorWhenTextSentToStandardError`,
`HandleExecutionErrorsWhenToolLogsError`, `ToolTaskThatTimeoutAndRetry`
— different failure signatures, may be real bugs rather than the same
race.
- File a production-side issue for the `WaitForProcessExit` 2s data-loss
window.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants