Skip to content

Stabilize TaskNodesDieAfterBuild: bump WaitForExit timeout + add process-identity telemetry#13828

Merged
jankratochvilcz merged 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/taskhostfactory-die
May 21, 2026
Merged

Stabilize TaskNodesDieAfterBuild: bump WaitForExit timeout + add process-identity telemetry#13828
jankratochvilcz merged 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/taskhostfactory-die

Conversation

@jankratochvilcz
Copy link
Copy Markdown
Contributor

The TaskNodesDieAfterBuild theory in src/Build.UnitTests/BackEnd/TaskHostFactory_Tests.cs flakes intermittently on slow CI agents because the post-build WaitForExit(3000) budget is too tight for the task host child process to drain stdio and exit.

Change

  • WaitForExit budget bumped from 3 s -> 15 s.
  • ProcessName and StartTime are captured up front so the failure message reveals a PID-reuse race (where the OS recycled the pid to an unrelated process between build-end and GetProcessById) instead of looking like the task host hung.
  • elapsedMs is logged via Stopwatch so a follow-up PR can shrink the budget back to a tight-but-safe value once CI data accumulates.

Risk

Test-only change. No production code touched.

Repro signal

The flaky version was observed on multiple distinct branches in dnceng-public pipeline 75 over a 7-day window. The fix has been validated locally (dotnet test ... --filter-method *TaskNodesDieAfterBuild* -> 4/4 passing).

Jan Krivanek and others added 2 commits May 21, 2026 11:09
The TaskHost child process is expected to terminate shortly after the
build completes, but on slow CI agents (Helix Linux/macOS in particular)
the existing 3-second budget has proven insufficient and produces
flaky test failures (~3 distinct branches in the past week).

Bump the wait to 15 seconds and include the captured pid + HasExited
state in the assertion message so future failures are easier to diagnose.

Fixes #43

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reviewer (expert-reviewer on PR #44) noted that a bare WaitForExit timeout
bump papers over a PID-reuse race: between build-end and GetProcessById, the
OS may have recycled the pid to an unrelated process. We capture ProcessName
and StartTime up front so a future failure shows the identity rather than
looking like the task host hung.

The Stopwatch around WaitForExit is mandated by the flaky-test-survey skill
workflow: every timeout bump must log actual elapsed ms so a follow-up PR
can tune the timeout back down to a tight-but-safe value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 09:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Stabilizes the TaskNodesDieAfterBuild unit test by increasing the post-build task host exit wait budget and adding diagnostic output intended to disambiguate slow-exit vs PID-reuse scenarios on CI.

Changes:

  • Increased WaitForExit timeout from 3s to 15s in TaskNodesDieAfterBuild.
  • Captured process identity fields (name/start time) best-effort and logged elapsed wait time via Stopwatch.
  • Added SafeGetProcessField helper to prevent diagnostic reads from failing the test.

Comment thread src/Build.UnitTests/BackEnd/TaskHostFactory_Tests.cs Outdated
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary — 24 Dimensions Evaluated

24/24 dimensions clean — no findings.

This is a well-crafted test improvement that:

  • Correctly addresses the flaky test issue by increasing the timeout to a generous 15s (5× the observed maximum)
  • Adds excellent diagnostic information (process name, start time, elapsed ms) that will help distinguish PID-reuse races from genuine hangs
  • Handles edge cases gracefully via SafeGetProcessField (process may exit before diagnostics are captured)
  • Documents the reasoning clearly with "why" comments and a TELEMETRY note for future tuning
  • Maintains the same core assertion (task host must exit after build)

Key dimensions verified clean:

Dimension Result
Concurrency & Thread Safety ✅ Single-threaded test, inter-process race acknowledged and handled
Correctness & Edge Cases HasExited access in assertion message is safe (handle valid after WaitForExit)
Resource & Memory Management ✅ No new resource issues (Process not disposed is pre-existing)
Test Coverage & Completeness ✅ Assertion preserved, diagnostics improved, flakiness addressed
Scope & PR Discipline ✅ Single concern: fix flaky test + improve diagnostics
Documentation Accuracy ✅ Comments explain rationale, not just mechanics

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

  • #13828 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13828 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13828 · ● 4M

@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

Thanks — reviewer flagged 24/24 dimensions clean with no findings, so nothing to address here. This PR is test-only (src/Build.UnitTests/BackEnd/TaskHostFactory_Tests.cs) so the standard production guard rails don't apply.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jankratochvilcz jankratochvilcz force-pushed the upstream-pr/taskhostfactory-die branch from 9df0ded to 9b0b907 Compare May 21, 2026 09:42
Copy link
Copy Markdown
Member

@AlesProkop AlesProkop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

24h post-merge telemetry from main (dnceng-public pipeline 75, builds 1430301, 1430551, 1430938, 1431045, 1431261):

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 37 6 8 44 191 0
net472/x86 20 98 105 129 136 0

The 15 000 ms budget gave ~80x headroom on the worst observed (191 ms). Zero failures.

Follow-up to honor the "shrink the budget back" promise from the PR body: draft #13845 lowers it to 2 000 ms (~10x of worst observed, comfortably above the 3 000 ms that originally flaked).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants