Skip to content

Shrink ToolTaskThatTimeoutAndRetry budgets: slowDelay 20s -> 18s, timeout 5s -> 4.5s (follow-up to #13830)#13846

Closed
jankratochvilcz wants to merge 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-retry-shrink
Closed

Shrink ToolTaskThatTimeoutAndRetry budgets: slowDelay 20s -> 18s, timeout 5s -> 4.5s (follow-up to #13830)#13846
jankratochvilcz wants to merge 1 commit into
dotnet:mainfrom
jankratochvilcz:jankratochvilcz/flake/tooltask-retry-shrink

Conversation

@jankratochvilcz
Copy link
Copy Markdown
Contributor

Follow-up to #13830, which widened ToolTaskThatTimeoutAndRetry's slow/fast gap (slowDelay 5 s → 20 s, Timeout 2 s → 5 s) and added per-attempt elapsedMs / configuredTimeoutMs telemetry explicitly so a follow-up could tighten the budgets once data accumulated.

24 h of post-merge main runs on dnceng-public pipeline 75 (builds 1430301, 1430551, 1430938, 1431045, 1431261):

Fast path (target: complete within Timeout)

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 108 102 684 1102 1136 0
net472/x86 60 1030 1048 1226 1370 0

Slow path (target: be killed by Timeout)

TFM/arch Runs min ms p50 ms p95 ms max ms Failed
net10.0/x64 18 5030 5069 5199 5269 0
net472/x86 10 5030 5068 5579 5991 0

(Slow elapsed ≈ configured Timeout because the test asserts the timeout terminates the process.)

This PR shrinks:

  • Timeout 5 000 ms → 4 500 ms — ~3.3x of fast max=1370 ms, ~4x of fast p95=1131 ms.
  • slowDelay 20 000 ms → 18 000 ms — preserves the 4x slow/fast gap (essential so the slow path reliably hits the timeout, not natural completion).

For context: the original Timeout=2000 ms before #13830 was only ~1.46x of fast max, which is exactly why it flaked. 4 500 ms keeps a comfortable safety margin while reclaiming ~17 s of wall-clock per slow attempt (and the test runs the slow path for many parameter combos).

The Stopwatch + per-attempt telemetry stays in place so future regressions show up in the test output.

Risk

Test-only change. Single file (src/Utilities.UnitTests/ToolTask_Tests.cs). No production code touched.

Why draft

Keeping in draft for one more day of main data to confirm no new outliers appear before requesting review.

…eout 5s -> 4.5s

24h of post-merge telemetry from dnceng-public pipeline 75 (introduced
by dotnet#13830) shows actual per-attempt elapsedMs distributions:

  Fast path (target: should succeed within Timeout):
    net10.0/x64: min=102 p50=684 p95=1102 max=1136 (108 runs)
    net472/x86:  min=1030 p50=1048 p95=1226 max=1370 (60 runs)

  Slow path (target: should be terminated by Timeout):
    net10.0/x64: 5030-5269ms (essentially = configured 5000ms timeout)
    net472/x86:  5030-5991ms (essentially = configured 5000ms timeout)

Shrink Timeout 5000 -> 4500 (~3.3x of observed fast max=1370, ~4x of
fast p95=1131) and slowDelay 20000 -> 18000 to preserve the same 4x
slow/fast gap. The original 2000ms timeout from before dotnet#13830 was 1.46x
of max — too tight, hence the flake. 4500ms keeps a comfortable safety
margin while reclaiming ~17s of wall-clock per slow attempt.

The Stopwatch + elapsedMs/configuredTimeoutMs telemetry stays in place
so future regressions are visible in test output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

Keeping this in draft for now — 24h on main showed 0 flakes across ~20 runs, but that sample is too small to confirm the shrunk budget is safe (the original flake rate was probably <1%). Will revisit once we have another ~3-7 days of CI data on main + PR validation to confirm the worst-case stays well under the new budget.

@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant