fix(batch): reconcile in-flight jobs on restart instead of zombies #1995

Merged: vieiralucas merged 1 commit into main from wt-batch-restart on Jun 27, 2026

@vieiralucas vieiralucas commented Jun 27, 2026

Member

2026-06-27 bug-hunt Tier 4 finding 4.1.

Bug

After a restart, a Batch job restored in a non-terminal state (SUBMITTED/PENDING/RUNNABLE/STARTING/RUNNING) had no background driver (spawn_status_sync / spawn_dependency_waiter) re-spawned, and its backing ECS task was already STOPPed by the ECS reconcile — so it could never reach a terminal state and hung forever. DescribeJobs reported RUNNING indefinitely, dependsOn jobs never released, array parents never aggregated. Same failure shape as #1338/#914/#1752.

Fix

Adds BatchService::reconcile_persisted_jobs(), invoked from main.rs right after the batch snapshot is restored (mirroring ECS reconcile_persisted_tasks). Every non-terminal job is failed with statusReason = "Job interrupted by a fakecloud restart" and a stoppedAt timestamp, then the snapshot is re-saved. The operation is idempotent: already-terminal jobs are left untouched.
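
For readers without the diff open, the reconcile pass described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the `Job` struct, the `JobStatus` enum, the `RESTART_REASON` constant, the free-function signature, and the returned change count are all assumptions made for the example. Only the state names, the `statusReason` string, and the idempotent behaviour come from this PR.

```rust
use std::collections::HashMap;

/// Job states named in this PR, plus the two terminal ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Submitted,
    Pending,
    Runnable,
    Starting,
    Running,
    Succeeded,
    Failed,
}

impl JobStatus {
    /// A job in a terminal state never changes again.
    pub fn is_terminal(self) -> bool {
        matches!(self, JobStatus::Succeeded | JobStatus::Failed)
    }
}

/// Hypothetical, trimmed-down job record (assumed shape, not fakecloud's real struct).
#[derive(Debug, Clone)]
pub struct Job {
    pub status: JobStatus,
    pub status_reason: Option<String>,
    pub stopped_at: Option<i64>, // epoch milliseconds
}

/// The reason string quoted in the PR description.
pub const RESTART_REASON: &str = "Job interrupted by a fakecloud restart";

/// Fails every non-terminal job and returns how many jobs were changed.
/// Terminal jobs are skipped, so a second call changes nothing.
pub fn reconcile_persisted_jobs(jobs: &mut HashMap<String, Job>, now_ms: i64) -> usize {
    let mut changed = 0;
    for job in jobs.values_mut() {
        if job.status.is_terminal() {
            continue;
        }
        job.status = JobStatus::Failed;
        job.status_reason = Some(RESTART_REASON.to_string());
        job.stopped_at = Some(now_ms);
        changed += 1;
    }
    changed
}
```

In this sketch a caller would run the function once after restoring the snapshot and then persist the result. Returning the change count is a design choice of the example only; it makes the "second reconcile is a no-op" property easy to assert.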

Tests

  • Unit: submit -> reconcile -> FAILED + reason; second reconcile is a no-op.
  • Docker-gated e2e: submit a real long-running (sleep 120) container on a persistent server, wait until it's genuinely in-flight, restart(), assert it comes back FAILED with the reason — not a zombie.

Summary by cubic

Reconcile in-flight Batch jobs on restart by failing them with a clear reason, instead of leaving zombies that hang forever and block dependencies. Adds BatchService::reconcile_persisted_jobs() and runs it on server startup so non-terminal jobs move to FAILED with stoppedAt.

  • Bug Fixes
    • On startup, fail any SUBMITTED/PENDING/RUNNABLE/STARTING/RUNNING job with statusReason = "Job interrupted by a fakecloud restart" and set stoppedAt (idempotent).
    • Mirrors ECS task reconciliation to prevent stuck RUNNING states and unblock dependsOn and array aggregations.
    • Tests: unit coverage and Docker e2e verifying a long-running job becomes FAILED after a persistent server restart.

Written for commit 7d8ec85. Summary will update on new commits.

…mbies

After a restart, a job restored in a non-terminal state (SUBMITTED/PENDING/
RUNNABLE/STARTING/RUNNING) had no background driver (status-sync /
dependency-waiter) re-spawned, and its backing ECS task was already STOPPed by
the ECS reconcile — so it could never reach a terminal state and hung forever.
DescribeJobs would report RUNNING indefinitely, dependent jobs never released,
array parents never aggregated.

Add BatchService::reconcile_persisted_jobs(), invoked from main.rs right after
the batch snapshot is restored (mirroring ECS reconcile_persisted_tasks): every
non-terminal job is failed with statusReason "Job interrupted by a fakecloud
restart" and a stoppedAt, then the snapshot is re-saved.

Tests: unit test (submit -> reconcile -> FAILED + reason, idempotent); a
Docker-gated e2e that submits a real long-running container, waits until it's
in-flight, restarts the persistent server, and asserts the job comes back
FAILED rather than a zombie.
@vieiralucas vieiralucas merged commit a1d5a0b into main Jun 27, 2026
104 checks passed
@vieiralucas vieiralucas deleted the wt-batch-restart branch June 27, 2026 11:54