fix(batch): reconcile in-flight jobs on restart instead of zombies by vieiralucas · Pull Request #1995 · faiscadev/fakecloud

vieiralucas · 2026-06-27T11:13:57Z

2026-06-27 bug-hunt Tier 4 finding 4.1.

Bug

After a restart, a Batch job restored in a non-terminal state (SUBMITTED/PENDING/RUNNABLE/STARTING/RUNNING) had no background driver (spawn_status_sync / spawn_dependency_waiter) re-spawned, and its backing ECS task was already STOPPed by the ECS reconcile — so it could never reach a terminal state and hung forever. DescribeJobs reported RUNNING indefinitely, dependsOn jobs never released, array parents never aggregated. Same failure shape as #1338/#914/#1752.
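To make the failure mode concrete, here is a minimal sketch of why one zombie job blocks everything downstream. It is not fakecloud's code: `is_terminal` and `dependencies_settled` are hypothetical names, and the release rule (dependents wait until every dependency is terminal) is an assumption based on the description above.

```rust
/// Batch job statuses that can never change again.
pub fn is_terminal(status: &str) -> bool {
    matches!(status, "SUCCEEDED" | "FAILED")
}

/// Hypothetical check a dependency waiter might poll: dependents are only
/// released once every job they depend on has reached a terminal status.
pub fn dependencies_settled(dependency_statuses: &[&str]) -> bool {
    dependency_statuses.iter().all(|&s| is_terminal(s))
}
```

Under these assumptions, a dependency stuck in RUNNING with no driver left to advance it keeps `dependencies_settled` false forever, which matches the "dependsOn jobs never released" symptom.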

Fix

Adds BatchService::reconcile_persisted_jobs(), invoked from main.rs right after the batch snapshot is restored (mirroring ECS reconcile_persisted_tasks). Every non-terminal job is failed with statusReason = "Job interrupted by a fakecloud restart" and a stoppedAt timestamp, then the snapshot is re-saved. The call is idempotent: already-terminal jobs are left untouched.
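The reconcile pass described above could look roughly like the following. This is a sketch under stated assumptions, not the merged implementation: `BatchState`, `Job`, `JobStatus`, the field names, and the free-function signature are hypothetical stand-ins, and snapshot persistence is left to the caller. Only the status names and the reason string come from the PR.

```rust
use std::collections::HashMap;

/// The five in-flight statuses named in the PR, plus the two terminal ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Submitted,
    Pending,
    Runnable,
    Starting,
    Running,
    Succeeded,
    Failed,
}

impl JobStatus {
    pub fn is_terminal(self) -> bool {
        matches!(self, JobStatus::Succeeded | JobStatus::Failed)
    }
}

/// Reason string quoted in the PR description.
pub const RESTART_REASON: &str = "Job interrupted by a fakecloud restart";

/// Hypothetical job record; the real struct has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub status: JobStatus,
    pub status_reason: Option<String>,
    pub stopped_at: Option<i64>, // epoch milliseconds (assumed unit)
}

/// Hypothetical stand-in for the state restored from the batch snapshot.
#[derive(Debug, Default)]
pub struct BatchState {
    pub jobs: HashMap<String, Job>,
}

/// Fails every non-terminal job and returns how many were changed, so the
/// caller can decide whether the snapshot needs to be re-saved.
pub fn reconcile_persisted_jobs(state: &mut BatchState, now_ms: i64) -> usize {
    let mut changed = 0;
    for job in state.jobs.values_mut() {
        if job.status.is_terminal() {
            continue; // already-terminal jobs are left untouched
        }
        job.status = JobStatus::Failed;
        job.status_reason = Some(RESTART_REASON.to_string());
        job.stopped_at = Some(now_ms);
        changed += 1;
    }
    changed
}
```

Because terminal jobs are skipped, a second call changes nothing and leaves the original stoppedAt in place, which is the idempotence property the unit test checks.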

Tests

Unit: submit -> reconcile -> FAILED + reason; second reconcile is a no-op.
Docker-gated e2e: submit a real long-running (sleep 120) container on a persistent server, wait until it's genuinely in-flight, restart(), assert it comes back FAILED with the reason — not a zombie.

Summary by cubic

Reconcile in-flight Batch jobs on restart by failing them with a clear reason, instead of leaving zombies that hang forever and block dependencies. Adds BatchService::reconcile_persisted_jobs() and runs it on server startup so non-terminal jobs move to FAILED with stoppedAt.

Bug Fixes
- On startup, fail any SUBMITTED/PENDING/RUNNABLE/STARTING/RUNNING job with statusReason = "Job interrupted by a fakecloud restart" and set stoppedAt (idempotent).
- Mirrors ECS task reconciliation to prevent stuck RUNNING states and unblock dependsOn and array aggregations.
- Tests: unit coverage and Docker e2e verifying a long-running job becomes FAILED after a persistent server restart.

Written for commit 7d8ec85. Summary will update on new commits.

…mbies

After a restart, a job restored in a non-terminal state (SUBMITTED/PENDING/RUNNABLE/STARTING/RUNNING) had no background driver (status-sync / dependency-waiter) re-spawned, and its backing ECS task was already STOPPed by the ECS reconcile, so it could never reach a terminal state and hung forever. DescribeJobs would report RUNNING indefinitely, dependent jobs never released, array parents never aggregated.

Add BatchService::reconcile_persisted_jobs(), invoked from main.rs right after the batch snapshot is restored (mirroring ECS reconcile_persisted_tasks): every non-terminal job is failed with statusReason "Job interrupted by a fakecloud restart" and a stoppedAt, then the snapshot is re-saved.

Tests: unit test (submit -> reconcile -> FAILED + reason, idempotent); a Docker-gated e2e that submits a real long-running container, waits until it's in-flight, restarts the persistent server, and asserts the job comes back FAILED rather than a zombie.

vieiralucas merged commit a1d5a0b into main Jun 27, 2026
104 checks passed

vieiralucas deleted the wt-batch-restart branch June 27, 2026 11:54

vieiralucas merged 1 commit into main from wt-batch-restart
