Skip to content

ISSUE-4357: Fix shared _temp race between cancelled and new run-service workers#4371

Open
herbenderbler wants to merge 1 commit intoactions:mainfrom
herbenderbler:ISSUE-4357/tempdirectorymanager-race-condition
Open

ISSUE-4357: Fix shared _temp race between cancelled and new run-service workers#4371
herbenderbler wants to merge 1 commit intoactions:mainfrom
herbenderbler:ISSUE-4357/tempdirectorymanager-race-condition

Conversation

@herbenderbler
Copy link
Copy Markdown

Description

Fixes #4357.

For run-service jobs, JobDispatcher.EnsureDispatchFinished cancelled the previous worker and returned without awaiting its exit. The new Runner.Worker then spawned while the previous worker was still tearing down. Both processes share <work>/_temp, so the exiting worker's TempDirectoryManager.CleanupTempDirectory() wiped the new worker's _runner_file_commands/* pipes mid-job, failing it with Missing file at path: .../set_output_<uuid> and exit code 102.

Implementation Details

  • src/Runner.Listener/JobDispatcher.cs: in the _isRunServiceJob branch of EnsureDispatchFinished, after cancelling the previous worker, await Task.WhenAny(jobDispatch.WorkerDispatch, Task.Delay(45s)) and throw InvalidOperationException on timeout — matching the existing non-run-service branch in the same method.
  • src/Test/L0/Listener/JobDispatcherL0.cs: add EnsureDispatchFinishedAwaitsPreviousWorkerForRunServiceJob regression test (drives the path with a controlled TaskCompletionSource; verified to fail without the fix and pass with it).

Checklist

  • ./dev.sh format is a no-op on the diff.
  • JobDispatcherL0 suite (14 tests) passes locally.
  • Regression test added and verified to fail on main.
  • Self-contained; no public API or _temp layout change.

Copilot AI review requested due to automatic review settings April 22, 2026 06:01
@herbenderbler herbenderbler requested a review from a team as a code owner April 22, 2026 06:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in run-service job dispatch where a canceled worker could still be tearing down (and cleaning <work>/_temp) while a new worker starts, deleting the new worker’s _runner_file_commands/* files mid-job.

Changes:

  • Update JobDispatcher.EnsureDispatchFinished for run-service jobs to cancel the previous worker and wait (up to 45s) for the previous WorkerDispatch to finish before proceeding.
  • Add an L0 regression test that verifies EnsureDispatchFinished does not complete until the previous worker dispatch task completes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/Runner.Listener/JobDispatcher.cs Adds a 45s wait in the run-service cancellation path to prevent overlapping workers from racing on the shared _temp directory.
src/Test/L0/Listener/JobDispatcherL0.cs Adds a regression test that drives the run-service EnsureDispatchFinished path and asserts it awaits the previous worker’s completion.

Comment thread src/Test/L0/Listener/JobDispatcherL0.cs Outdated
Comment thread src/Runner.Listener/JobDispatcher.cs Outdated
Restructure EnsureDispatchFinished so run-service and shutdown cancel paths
fall through to the common try/catch/finally (log faulted WorkerDispatch;
_jobInfos TryRemove/Dispose) instead of returning before it. Dispose
reflected WorkerDispatcher in the regression test.

Fixes actions#4357
@herbenderbler herbenderbler force-pushed the ISSUE-4357/tempdirectorymanager-race-condition branch from 9bfa7b9 to 37dea8e Compare April 22, 2026 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TempDirectoryManager race condition: cancelled Worker wipes _temp directory used by concurrently spawned new Worker

2 participants