Skip to content

Fix ToolTask EOF-truncation flake: drain on every wait slice + 30s budget (#13734)#13874

Closed
jankratochvilcz wants to merge 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-eof-drain
Closed

Fix ToolTask EOF-truncation flake: drain on every wait slice + 30s budget (#13734)#13874
jankratochvilcz wants to merge 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-eof-drain

Conversation

@jankratochvilcz
Copy link
Copy Markdown
Contributor

Implements the fix sketched in #13734 for the ToolTaskCanChangeCanonicalErrorFormat flake. Latest survey (May 21-27 vs May 14-21) shows this test regressed from 15% to 22% failure rate despite the diagnostics PR #13830, so the underlying race is real and needs the structural fix the tracker has been proposing.

Root cause (per #13734)

Under ChangeWave 18.6, ToolTask.WaitForProcessExit does:

proc.WaitForExit(int.MaxValue);                                // does NOT drain pipes when a timeout is supplied
WaitHandle.WaitAll(eofEvents, TimeSpan.FromSeconds(2));        // single 2s shot

Process.WaitForExit(timeout) does not wait on async pipes, so everything depends on the 2s WaitAll. The caller then drains the queues exactly once after WaitForProcessExit returns. On a busy CI runner, the AsyncStreamReader thread can be context-switched out for >2s — the wait expires, the queues are flushed once, and anything the reader had not yet delivered is lost. For cmd /C type against a 2-line file, that final BADTHINGHAPPENED line is exactly what gets dropped.

Fix

In src/Utilities/ToolTask.cs::WaitForProcessExit:

  1. Replace the single 2s WaitAll with a 50ms-slice poll loop.
  2. After every slice, drain the queues (LogMessagesFromStandardError/Output). This is the key change: by the time we return, every line the reader has already pushed has been logged, even if EOF never arrives.
  3. Raise the outer budget from 2s to 30s. Still bounded for the grandchild-inherited-pipe case, but no longer racing CI pipe-flush latency. Tools that legitimately take longer than 30s to flush after exit would have already been killed by their own timeout.
  4. On timeout, log a low-importance diagnostic (ToolTask.PipeEofTimeoutOnExit) so the grandchild case is recognizable in the binlog without needing to repro.

The change stays inside the existing Wave18_6 gate; no new wave needed. The legacy path is untouched.

Risk

  • Behavior change is strictly additive in the success path: if EOF arrives within 2s as before, the loop exits at the first slice with eofReached = true and nothing else runs.
  • Order of log messages on the slow path now interleaves slightly differently — previously, all messages arrived in one post-wait flush; now they arrive in 50ms-slice chunks. Per-line content is unchanged.
  • Binlog impact: only the new low-importance diagnostic on the rare timeout path.

Verification

  • dotnet build src/Utilities/Microsoft.Build.Utilities.csproj — clean, 0 warnings.
  • dotnet test ... --filter-method "*ToolTask_Tests*"124/124 passing on net48|x86 and net10.0|x64.
  • 5 consecutive runs × 4 previously regressing tests (ToolTaskCanChangeCanonicalErrorFormat, ErrorWhenTextSentToStandardError, OverrideStdOutImportanceToHigh, HandleExecutionErrorsWhenToolLogsError) × 2 TFMs = 60/60 passing.

Local 100/100 repro coverage will follow before flipping out of draft; for now the goal is to get the change into CI so we can see whether the regressed ToolTaskCanChangeCanonicalErrorFormat ratio drops in next week's survey.

Why draft

Want a few days of post-merge CI data on the regressed tests before declaring victory — per the flaky-test-survey skill, the PR landing is "a hypothesis, not a fix-confirming event." The tracker (#13734) stays open until the next survey cycle.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

…dget

Addresses dotnet#13734 (ToolTaskCanChangeCanonicalErrorFormat regression to 22%).

Under ChangeWave 18.6, WaitForProcessExit waited only 2s for the async stdout/stderr pipes to reach EOF after the process exited, and on timeout the data still buffered inside AsyncStreamReader was lost: the caller drains once after WaitForProcessExit returns, so anything the reader had not yet handed to our queues was never seen.

Fix:

  1. Replace the single 2s WaitAll with a 50ms-slice poll loop.

  2. After every slice, call LogMessagesFromStandardError/Output to drain whatever the reader has already pushed onto the queues. By the time we return, the queues are caught up even if EOF was not reached.

  3. Raise the outer budget to 30s -- still bounded for the grandchild-inherited-pipe case, but no longer racing CI pipe-flush latency.

  4. On timeout, log a low-importance diagnostic (ToolTask.PipeEofTimeoutOnExit) so future occurrences are visible in the binlog.

Verification: 5 consecutive runs x 4 previously regressing ToolTask_Tests x 2 TFMs = 60/60 passing. Full ToolTask_Tests (124 tests) green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

Closing in favor of a test-only fix per maintainer feedback -- the blast radius of changing WaitForProcessExit for every ToolTask consumer is too high relative to the value of fixing one flake. Replacement PR will restructure the affected tests so the asserted lines are no longer the last lines of tool output (which is what the 2 s EOF race drops), plus add .Execute() return-value assertions, try/finally for the temp files, and diagnostic dumps on failure -- the 'Secondary improvements' list from #13734.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant