feat(batch): retryStrategy + timeout (batch 5) by vieiralucas · Pull Request #1989 · faiscadev/fakecloud

vieiralucas · 2026-06-27T06:48:32Z

Summary

AWS Batch batch 5 — retry strategies + job timeouts.

- spawn_status_sync loops over attempts: on a non-zero container exit, if retryStrategy.attempts remain, it records the failed attempt under attempts[] and re-launches a fresh ECS task (status RUNNABLE → STARTING); otherwise it fails the job. Attempts are clamped to AWS's 1-10 range.
- timeout.attemptDurationSeconds caps each attempt's poll loop; an overrunning container fails the job with "Job attempt duration exceeded timeout".
- Extracted set_job_terminal for the SUCCEEDED/FAILED write.
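The retry decision described in the first bullet can be sketched as a small pure function. This is a simplified model, not the code from this PR: `run_with_retries`, `clamp_attempts`, `JobResult`, `Attempt`, and the `run_container` closure are hypothetical names, and the real spawn_status_sync presumably launches and polls ECS tasks rather than calling a closure. It assumes that only attempts followed by a retry are recorded under attempts[].

```rust
/// One failed container run that was followed by a retry (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Attempt {
    pub exit_code: i32,
}

#[derive(Debug, PartialEq)]
pub enum JobStatus {
    Succeeded,
    Failed,
}

#[derive(Debug, PartialEq)]
pub struct JobResult {
    pub status: JobStatus,
    /// Failed attempts that were retried, in order.
    pub attempts: Vec<Attempt>,
    /// Exit code of the last container run.
    pub exit_code: i32,
}

/// Clamp a requested retryStrategy.attempts value into the 1..=10 range.
pub fn clamp_attempts(requested: u32) -> u32 {
    requested.clamp(1, 10)
}

/// Run `run_container` until it exits 0 or the attempts are exhausted.
/// The closure receives the 1-based attempt number and returns an exit code.
pub fn run_with_retries<F>(requested_attempts: u32, mut run_container: F) -> JobResult
where
    F: FnMut(u32) -> i32,
{
    let max_attempts = clamp_attempts(requested_attempts);
    let mut recorded = Vec::new();
    let mut attempt_no = 1;
    loop {
        let exit_code = run_container(attempt_no);
        if exit_code == 0 {
            return JobResult { status: JobStatus::Succeeded, attempts: recorded, exit_code };
        }
        if attempt_no < max_attempts {
            // Attempts remain: record the failure and go around again.
            recorded.push(Attempt { exit_code });
            attempt_no += 1;
        } else {
            // Attempts exhausted: the job fails with the last exit code.
            return JobResult { status: JobStatus::Failed, attempts: recorded, exit_code };
        }
    }
}
```

With `run_with_retries(2, |_| 4)`, the model ends in `Failed` with one recorded attempt and a final exit code of 4.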

Test plan

- Docker-gated e2e: retry_strategy_reattempts_a_failing_job (exit 4, attempts=2 → FAILED with one recorded retry + exitCode 4) and timeout_fails_an_overrunning_job (sleep 60, attemptDurationSeconds=2 → FAILED "timeout"). Both pass locally.
- Existing success/fail/array/dependsOn e2e unaffected; `cargo clippy -p fakecloud-batch -p fakecloud-e2e --all-targets -- -D warnings` is clean. Docs updated.
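The behavior exercised by timeout_fails_an_overrunning_job, a poll loop capped by a per-attempt deadline, might look roughly like the sketch below. It is synchronous and uses only the standard library for brevity; the actual implementation is not shown in this PR description and may well be async. `poll_attempt`, `AttemptOutcome`, and the `poll` closure are hypothetical names; only the reason string comes from the PR.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum AttemptOutcome {
    /// The container exited with this code before the deadline.
    Exited(i32),
    /// The attempt ran longer than the configured limit.
    TimedOut,
}

/// Status reason quoted in the PR description.
pub const TIMEOUT_REASON: &str = "Job attempt duration exceeded timeout";

/// Poll until the container exits or `limit` elapses.
/// `poll` returns Some(exit_code) once the container has stopped.
/// `limit` is None when the job has no timeout configured.
pub fn poll_attempt<F>(limit: Option<Duration>, interval: Duration, mut poll: F) -> AttemptOutcome
where
    F: FnMut() -> Option<i32>,
{
    let started = Instant::now();
    loop {
        if let Some(code) = poll() {
            return AttemptOutcome::Exited(code);
        }
        if let Some(limit) = limit {
            if started.elapsed() >= limit {
                return AttemptOutcome::TimedOut;
            }
        }
        std::thread::sleep(interval);
    }
}
```

A caller would map `AttemptOutcome::TimedOut` to a FAILED job with `TIMEOUT_REASON` as the status reason. Because the clock starts inside `poll_attempt`, each retry gets its own full window, which matches the "caps each attempt" wording above.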

Next (final Batch batch)

CloudFormation provisioner for AWS::Batch::ComputeEnvironment/JobQueue/JobDefinition (write-through + snapshot hook) + Terraform acceptance-test coverage.

Summary by cubic

Adds AWS Batch retry strategies and per-attempt timeouts to real job execution. Failed attempts re-launch up to retryStrategy.attempts (1–10) with attempts recorded; overruns fail with a clear timeout reason. Docs updated and e2e coverage added.

New Features
- Honor retryStrategy.attempts: on non‑zero exit, record in attempts[] and re-launch a fresh ECS task (RUNNABLE → STARTING) until attempts are exhausted.
- Honor timeout.attemptDurationSeconds: cap each attempt; on overrun, fail with "Job attempt duration exceeded timeout".
- Set RUNNING on first task start; terminal status writes keep real container.exitCode and statusReason.
Refactors
- Extracted set_job_terminal helper for SUCCEEDED/FAILED writes.
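As an illustration of what a helper like set_job_terminal could centralize, here is a minimal sketch. The signature, the `Job` struct, and the `Status` enum are assumptions made for this example; the PR description only says that the helper performs the SUCCEEDED/FAILED write and that the real container exit code and status reason are kept.

```rust
/// Job states mentioned in this PR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Runnable,
    Starting,
    Running,
    Succeeded,
    Failed,
}

/// Minimal stand-in for a stored job record (hypothetical fields).
#[derive(Debug, PartialEq)]
pub struct Job {
    pub status: Status,
    pub status_reason: Option<String>,
    pub exit_code: Option<i32>,
}

/// Write the terminal state in one place.
/// `exit_code` is None when no container exit was observed, e.g. on timeout.
pub fn set_job_terminal(job: &mut Job, exit_code: Option<i32>, reason: Option<&str>) {
    job.status = match exit_code {
        Some(0) => Status::Succeeded,
        _ => Status::Failed,
    };
    // Keep the real exit code and reason rather than overwriting them.
    job.exit_code = exit_code;
    job.status_reason = reason.map(str::to_owned);
}
```

Routing both the exit path and the timeout path through one function keeps the status, exit code, and reason consistent with each other.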

Written for commit b670f64. Summary will update on new commits.

- spawn_status_sync now loops over attempts: on a non-zero container exit, if retryStrategy.attempts remain it records the failed attempt under attempts[] and re-launches a fresh ECS task (status RUNNABLE -> STARTING), otherwise fails the job. Honors the 1-10 attempts clamp.
- timeout.attemptDurationSeconds caps each attempt's poll loop; an overrunning container fails the job with "Job attempt duration exceeded timeout".
- Extracted set_job_terminal helper for the SUCCEEDED/FAILED write.

Tests (Docker-gated e2e): retry_strategy_reattempts_a_failing_job (exit 4, attempts=2 -> FAILED with one recorded retry) and timeout_fails_an_overrunning_job (sleep 60, attemptDurationSeconds=2 -> FAILED "timeout"). Existing success/fail/array/dependsOn e2e unaffected. Docs updated.

vieiralucas merged commit 643f901 into main Jun 27, 2026
126 of 127 checks passed

vieiralucas deleted the wt-batch-5 branch June 27, 2026 07:30

vieiralucas merged 1 commit into main from wt-batch-5
