Test: keep shell alive 15s in ToolTaskCanChangeCanonicalErrorFormat (#13734) by jankratochvilcz · Pull Request #13878 · dotnet/msbuild

jankratochvilcz · 2026-05-27T12:42:42Z

Human-writen TL;DR for @AlesProkop: Basically I see in the data that we are failing at around 2s which correlates to timeout added in #13351. The hypothesis is that the messages don't all make it in time to the tool task so instead of terminating on the null message, we terminate on timeout and hence the messages are missing. Since this test is flaky, I want to see how much 15s changes the flakiness. Note that my current feeling is that the flakes are actually legitimate - there is a clear code path introduced in the PR above that would cause this type of behavior, so I need to figure out how the test design needs to evolve to meet it.

Summary

Temporary test-side mitigation for the flake tracked in #13734, specifically affecting ToolTaskCanChangeCanonicalErrorFormat (~21.8% failure rate in the last 7 days of dotnet-msbuild-public-ci).

Root cause (from per-failure trace evidence)

The diagnostics added in #13830 show every failure pinned at:

Execute()=True, ExitCode=0, Errors=0, Warnings=0, MessageCount=1, elapsedMs ∈ {2108, 2124, 2143, 2151}

MessageCount=1 is just ToolTask's own pre-launch cmd echo — none of the spawned tool's stdout lines arrived. elapsedMs ≈ 2100 is pinned at the 2s eofTimeoutSec budget in ToolTask.WaitForProcessExit (added in #13351 / Wave 18.6).

The race is not tail-truncation but whole-stream loss: on 2-vCPU CI agents with default ThreadPool IOCP min=2 and many parallel test hosts, the I/O completion that drives AsyncStreamReader → ReceiveStandardErrorOrOutputData → _standardOutputEOF.Set() can be queued behind other completions for >2s. When WaitHandle.WaitAll times out, LogMessagesFromStandardOutput drains a still-empty queue.

Change

Append a sleep to the shell command in the affected test so the writer end of the pipe stays open ~15s after the data is written. This gives the AsyncStreamReader's IOCP completion ample slack to be scheduled before the 2s WaitAll budget starts being consumed.

Windows: ping 127.0.0.1 -n 16 > nul (~15s)
Unix: sleep 15

Why 15s, why temporary

15s is intentionally generous as a back-stop while we confirm the IOCP-starvation hypothesis at CI scale. Once CI history shows the flake is gone, we should shrink this to whatever the observed delivery latency suggests (likely 1–3s is enough). I left a comment in the code to that effect.

This is only a test fix — ToolTask.cs is intentionally untouched. The production-side question (whether WaitForProcessExit should perform a synchronous final read on timeout, or extend the 2s budget) is real but separate and worth a follow-up issue.

Verification

Local run on Windows passed on both TFMs (each ~16s, dominated by the sleep as expected):

net10.0|x64 passed (15s 781ms)
net48|x86   passed (16s 032ms)

Follow-ups (not in this PR)

Apply the same treatment to OverrideStdOutImportanceToHigh (same Assert.Contains "hello world" not found signature in flake data).
Investigate ErrorWhenTextSentToStandardError, HandleExecutionErrorsWhenToolLogsError, ToolTaskThatTimeoutAndRetry — different failure signatures, may be real bugs rather than the same race.
File a production-side issue for the WaitForProcessExit 2s data-loss window.

…ErrorFormat Temporary mitigation for dotnet#13734. The 2s WaitHandle.WaitAll budget in WaitForProcessExit can be exhausted before AsyncStreamReader's IOCP completion is scheduled under loaded CI conditions, dropping all stdout lines from the test's tool. Appending a sleep after the data write keeps the pipe writer open long enough for the IOCP completion to land before the bounded wait expires. 15s is intentionally generous as a back-stop; once CI confirms the flake is gone we can shrink to whatever the observed delivery latency suggests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR applies a temporary test-only mitigation to reduce flakiness in ToolTaskCanChangeCanonicalErrorFormat by keeping the spawned shell process alive long enough for ToolTask’s async stdout reader to deliver all lines before the bounded post-exit EOF wait can elapse under heavy CI load.

Changes:

Extend the test’s command to keep cmd.exe / sh alive ~15 seconds after emitting the file contents (Windows: ping ... > nul, Unix: sleep 15).
Add explanatory comments documenting the intended temporary nature and the suspected IOCP scheduling/root-cause hypothesis.

jankratochvilcz requested a review from AlesProkop May 27, 2026 12:42

jankratochvilcz mentioned this pull request May 27, 2026

Stabilize three ToolTask flakes with output padding (test-only, #13734) #13875

Closed

jankratochvilcz marked this pull request as ready for review May 27, 2026 12:55

Copilot AI review requested due to automatic review settings May 27, 2026 12:55

Copilot started reviewing on behalf of jankratochvilcz May 27, 2026 12:56 View session

AlesProkop approved these changes May 27, 2026

View reviewed changes

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs

jankratochvilcz enabled auto-merge May 27, 2026 13:22

jankratochvilcz merged commit cc426ef into dotnet:main May 27, 2026
17 checks passed

This was referenced May 28, 2026

[release/10.0.4xx] Source code updates from dotnet/msbuild dotnet/dotnet#6882

Merged

[main] Source code updates from dotnet/msbuild dotnet/dotnet#6885

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test: keep shell alive 15s in ToolTaskCanChangeCanonicalErrorFormat (#13734)#13878

Test: keep shell alive 15s in ToolTaskCanChangeCanonicalErrorFormat (#13734)#13878
jankratochvilcz merged 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-tests-sleep

jankratochvilcz commented May 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jankratochvilcz commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause (from per-failure trace evidence)

Change

Why 15s, why temporary

Verification

Follow-ups (not in this PR)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jankratochvilcz commented May 27, 2026 •

edited

Loading