Skip to content

Test: keep shell alive 15s in ToolTaskCanChangeCanonicalErrorFormat (#13734)#13878

Merged
jankratochvilcz merged 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-tests-sleep
May 27, 2026
Merged

Test: keep shell alive 15s in ToolTaskCanChangeCanonicalErrorFormat (#13734)#13878
jankratochvilcz merged 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-tests-sleep

Conversation

@jankratochvilcz
Copy link
Copy Markdown
Contributor

@jankratochvilcz jankratochvilcz commented May 27, 2026

Human-writen TL;DR for @AlesProkop: Basically I see in the data that we are failing at around 2s which correlates to timeout added in #13351. The hypothesis is that the messages don't all make it in time to the tool task so instead of terminating on the null message, we terminate on timeout and hence the messages are missing. Since this test is flaky, I want to see how much 15s changes the flakiness. Note that my current feeling is that the flakes are actually legitimate - there is a clear code path introduced in the PR above that would cause this type of behavior, so I need to figure out how the test design needs to evolve to meet it.


Summary

Temporary test-side mitigation for the flake tracked in #13734, specifically affecting ToolTaskCanChangeCanonicalErrorFormat (~21.8% failure rate in the last 7 days of dotnet-msbuild-public-ci).

Root cause (from per-failure trace evidence)

The diagnostics added in #13830 show every failure pinned at:

Execute()=True, ExitCode=0, Errors=0, Warnings=0, MessageCount=1, elapsedMs ∈ {2108, 2124, 2143, 2151}

MessageCount=1 is just ToolTask's own pre-launch cmd echo — none of the spawned tool's stdout lines arrived. elapsedMs ≈ 2100 is pinned at the 2s eofTimeoutSec budget in ToolTask.WaitForProcessExit (added in #13351 / Wave 18.6).

The race is not tail-truncation but whole-stream loss: on 2-vCPU CI agents with default ThreadPool IOCP min=2 and many parallel test hosts, the I/O completion that drives AsyncStreamReader → ReceiveStandardErrorOrOutputData → _standardOutputEOF.Set() can be queued behind other completions for >2s. When WaitHandle.WaitAll times out, LogMessagesFromStandardOutput drains a still-empty queue.

Change

Append a sleep to the shell command in the affected test so the writer end of the pipe stays open ~15s after the data is written. This gives the AsyncStreamReader's IOCP completion ample slack to be scheduled before the 2s WaitAll budget starts being consumed.

  • Windows: ping 127.0.0.1 -n 16 > nul (~15s)
  • Unix: sleep 15

Why 15s, why temporary

15s is intentionally generous as a back-stop while we confirm the IOCP-starvation hypothesis at CI scale. Once CI history shows the flake is gone, we should shrink this to whatever the observed delivery latency suggests (likely 1–3s is enough). I left a comment in the code to that effect.

This is only a test fixToolTask.cs is intentionally untouched. The production-side question (whether WaitForProcessExit should perform a synchronous final read on timeout, or extend the 2s budget) is real but separate and worth a follow-up issue.

Verification

Local run on Windows passed on both TFMs (each ~16s, dominated by the sleep as expected):

net10.0|x64 passed (15s 781ms)
net48|x86   passed (16s 032ms)

Follow-ups (not in this PR)

  • Apply the same treatment to OverrideStdOutImportanceToHigh (same Assert.Contains "hello world" not found signature in flake data).
  • Investigate ErrorWhenTextSentToStandardError, HandleExecutionErrorsWhenToolLogsError, ToolTaskThatTimeoutAndRetry — different failure signatures, may be real bugs rather than the same race.
  • File a production-side issue for the WaitForProcessExit 2s data-loss window.

…ErrorFormat

Temporary mitigation for dotnet#13734. The 2s WaitHandle.WaitAll budget in WaitForProcessExit can be exhausted before AsyncStreamReader's IOCP completion is scheduled under loaded CI conditions, dropping all stdout lines from the test's tool. Appending a sleep after the data write keeps the pipe writer open long enough for the IOCP completion to land before the bounded wait expires.

15s is intentionally generous as a back-stop; once CI confirms the flake is gone we can shrink to whatever the observed delivery latency suggests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jankratochvilcz jankratochvilcz marked this pull request as ready for review May 27, 2026 12:55
Copilot AI review requested due to automatic review settings May 27, 2026 12:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies a temporary test-only mitigation to reduce flakiness in ToolTaskCanChangeCanonicalErrorFormat by keeping the spawned shell process alive long enough for ToolTask’s async stdout reader to deliver all lines before the bounded post-exit EOF wait can elapse under heavy CI load.

Changes:

  • Extend the test’s command to keep cmd.exe / sh alive ~15 seconds after emitting the file contents (Windows: ping ... > nul, Unix: sleep 15).
  • Add explanatory comments documenting the intended temporary nature and the suspected IOCP scheduling/root-cause hypothesis.

Comment thread src/Utilities.UnitTests/ToolTask_Tests.cs
@jankratochvilcz jankratochvilcz enabled auto-merge May 27, 2026 13:22
@jankratochvilcz jankratochvilcz merged commit cc426ef into dotnet:main May 27, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants