Test: keep shell alive 15s in ToolTaskCanChangeCanonicalErrorFormat (#13734) #13878
Merged
jankratochvilcz merged 1 commit on May 27, 2026
…ErrorFormat
Temporary mitigation for dotnet#13734. The 2s WaitHandle.WaitAll budget in WaitForProcessExit can be exhausted before AsyncStreamReader's IOCP completion is scheduled under loaded CI conditions, dropping all stdout lines from the test's tool. Appending a sleep after the data write keeps the pipe writer open long enough for the IOCP completion to land before the bounded wait expires. 15s is intentionally generous as a back-stop; once CI confirms the flake is gone we can shrink to whatever the observed delivery latency suggests.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AlesProkop approved these changes on May 27, 2026
Pull request overview
This PR applies a temporary test-only mitigation to reduce flakiness in ToolTaskCanChangeCanonicalErrorFormat by keeping the spawned shell process alive long enough for ToolTask’s async stdout reader to deliver all lines before the bounded post-exit EOF wait can elapse under heavy CI load.
Changes:
- Extend the test's command to keep cmd.exe/sh alive ~15 seconds after emitting the file contents (Windows: ping ... > nul, Unix: sleep 15).
- Add explanatory comments documenting the intended temporary nature and the suspected IOCP scheduling/root-cause hypothesis.
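As a rough illustration of the per-platform keep-alive, the sketch below builds the idle fragment for each shell. It is written in Python rather than the test's own language, and the helper name keep_alive_suffix is hypothetical. How the fragment is joined onto the test's command is not shown in this PR, so the sketch returns only the fragment itself.

```python
def keep_alive_suffix(is_windows: bool, seconds: int = 15) -> str:
    """Return a shell fragment that idles for roughly `seconds` seconds.

    Hypothetical helper for illustration; not part of the MSBuild test code.
    """
    if is_windows:
        # ping sends about one echo request per second, so n requests
        # span roughly n - 1 seconds; output is discarded via "> nul".
        return f"ping 127.0.0.1 -n {seconds + 1} > nul"
    return f"sleep {seconds}"
```

With the default of 15 seconds this yields the two fragments named in the PR description: ping 127.0.0.1 -n 16 > nul on Windows and sleep 15 on Unix.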
This was referenced May 28, 2026
Human-written TL;DR for @AlesProkop: The data shows failures clustering at around 2s, which correlates with the timeout added in #13351. The hypothesis is that not all messages reach the tool task in time, so instead of terminating on the null message we terminate on the timeout, and the messages go missing. Since this test is flaky, I want to see how much a 15s keep-alive changes the flakiness. My current feeling is that the flakes are legitimate: there is a clear code path introduced in the PR above that would cause this behavior, so I need to figure out how the test design should evolve to account for it.
Summary
Temporary test-side mitigation for the flake tracked in #13734, specifically affecting ToolTaskCanChangeCanonicalErrorFormat (~21.8% failure rate in the last 7 days of dotnet-msbuild-public-ci).
Root cause (from per-failure trace evidence)
The diagnostics added in #13830 show every failure pinned at:
- MessageCount=1 is just ToolTask's own pre-launch cmd echo; none of the spawned tool's stdout lines arrived.
- elapsedMs ≈ 2100 is pinned at the 2s eofTimeoutSec budget in ToolTask.WaitForProcessExit (added in #13351 / Wave 18.6).
The race is not tail-truncation but whole-stream loss: on 2-vCPU CI agents with default ThreadPool IOCP min=2 and many parallel test hosts, the I/O completion that drives AsyncStreamReader → ReceiveStandardErrorOrOutputData → _standardOutputEOF.Set() can be queued behind other completions for >2s. When WaitHandle.WaitAll times out, LogMessagesFromStandardOutput drains a still-empty queue.
Change
Append a sleep to the shell command in the affected test so the writer end of the pipe stays open ~15s after the data is written. This gives the AsyncStreamReader's IOCP completion ample slack to be scheduled before the 2s WaitAll budget starts being consumed.
- Windows: ping 127.0.0.1 -n 16 > nul (~15s)
- Unix: sleep 15
Why 15s, why temporary
15s is intentionally generous as a back-stop while we confirm the IOCP-starvation hypothesis at CI scale. Once CI history shows the flake is gone, we should shrink this to whatever the observed delivery latency suggests (likely 1–3s is enough). I left a comment in the code to that effect.
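The interaction between delivery latency, the bounded EOF wait, and the keep-alive can be sketched with a toy model. This is a Python simulation built on threads and an event, not MSBuild code: run_tool and its parameters are hypothetical, and the timings are scaled down so the example runs in about a second. It assumes, as the hypothesis above does, that the EOF budget only starts once the process has exited.

```python
import queue
import threading
import time


def run_tool(delivery_delay: float, eof_budget: float, keep_alive: float = 0.0) -> list:
    """Toy model of the suspected race; every name here is illustrative.

    delivery_delay: how long the reader's completion is stuck behind other work.
    eof_budget:     bounded wait for EOF, started after the process exits.
    keep_alive:     extra time the process stays alive after writing output.
    """
    lines = queue.Queue()
    process_exited = threading.Event()
    eof = threading.Event()

    def reader() -> None:
        time.sleep(delivery_delay)      # completion is delayed under load
        lines.put("line 1")
        lines.put("line 2")
        process_exited.wait()           # EOF is only seen once the pipe closes
        eof.set()

    threading.Thread(target=reader, daemon=True).start()

    time.sleep(keep_alive)              # the appended sleep in the shell command
    process_exited.set()
    eof.wait(timeout=eof_budget)        # bounded wait; may time out

    received = []
    while True:                         # drain whatever has arrived so far
        try:
            received.append(lines.get_nowait())
        except queue.Empty:
            return received


# Delivery slower than the budget, no keep-alive: whole-stream loss.
print(run_tool(delivery_delay=0.5, eof_budget=0.05))                  # []
# Same latency, but the process idles past it: all lines arrive.
print(run_tool(delivery_delay=0.5, eof_budget=0.05, keep_alive=1.0))  # ['line 1', 'line 2']
```

In this model the keep-alive only needs to exceed the delivery latency, which is why the 15s value can shrink once the real latency under CI load has been observed.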
This is only a test fix: ToolTask.cs is intentionally untouched. The production-side question (whether WaitForProcessExit should perform a synchronous final read on timeout, or extend the 2s budget) is real but separate and worth a follow-up issue.
Verification
Local run on Windows passed on both TFMs (each ~16s, dominated by the sleep as expected).
Follow-ups (not in this PR)
- OverrideStdOutImportanceToHigh (same Assert.Contains "hello world" not found signature in flake data).
- ErrorWhenTextSentToStandardError, HandleExecutionErrorsWhenToolLogsError, ToolTaskThatTimeoutAndRetry: different failure signatures, may be real bugs rather than the same race.
- WaitForProcessExit 2s data-loss window.