feat(batch): retryStrategy + timeout (batch 5) #1989

Merged: vieiralucas merged 1 commit into main from wt-batch-5 on Jun 27, 2026

@vieiralucas (Member) commented Jun 27, 2026

Summary

AWS Batch batch 5 — retry strategies + job timeouts.

  • spawn_status_sync now loops over attempts. On a non-zero container exit, if attempts remain under retryStrategy.attempts, it records the failed attempt under attempts[] and re-launches a fresh ECS task (status RUNNABLE → STARTING); otherwise it fails the job. retryStrategy.attempts is clamped to AWS's 1-10 range.
  • timeout.attemptDurationSeconds caps each attempt's poll loop; an overrunning container fails the job with "Job attempt duration exceeded timeout".
  • Extracted set_job_terminal for the SUCCEEDED/FAILED write.
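
For readers who have not opened the diff, the retry flow described above can be sketched roughly as follows. This is a minimal synchronous illustration, not the code in this PR: `Terminal`, `clamp_attempts`, `run_with_retries`, and the `run_attempt` closure (standing in for launching an ECS task and waiting for its exit code) are all hypothetical names, and the real spawn_status_sync is presumably structured differently.

```rust
/// Terminal outcome of a job in this sketch (hypothetical type, not from the PR).
#[derive(Debug, PartialEq)]
enum Terminal {
    Succeeded,
    Failed { exit_code: i32 },
}

/// Clamp a requested `retryStrategy.attempts` value to the 1-10 range.
fn clamp_attempts(requested: u32) -> u32 {
    requested.clamp(1, 10)
}

/// Run up to `attempts` attempts (after clamping).
///
/// `run_attempt` is an assumed stand-in for "launch a fresh task and return
/// the container exit code"; it receives the 1-based attempt number.
/// Each failed attempt that is followed by a retry is pushed to `recorded`,
/// mirroring the `attempts[]` bookkeeping described in the summary.
fn run_with_retries<F>(attempts: u32, mut run_attempt: F, recorded: &mut Vec<i32>) -> Terminal
where
    F: FnMut(u32) -> i32,
{
    let max = clamp_attempts(attempts);
    for n in 1..=max {
        let exit_code = run_attempt(n);
        if exit_code == 0 {
            return Terminal::Succeeded;
        }
        if n < max {
            // Attempts remain: record this failure and loop to re-launch.
            recorded.push(exit_code);
        } else {
            // No attempts left: the job fails with the last exit code.
            return Terminal::Failed { exit_code };
        }
    }
    unreachable!("max is at least 1, so the loop always returns")
}
```

With `attempts = 2` and a container that always exits with code 4, this sketch records one failed attempt and then returns `Terminal::Failed { exit_code: 4 }`, which is the shape the retry e2e test in the test plan checks for.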

Test plan

  • Docker-gated e2e: retry_strategy_reattempts_a_failing_job (exit 4, attempts=2 → FAILED with one recorded retry + exitCode 4) and timeout_fails_an_overrunning_job (sleep 60, attemptDurationSeconds=2 → FAILED "timeout"). Both pass locally.
  • Existing success/fail/array/dependsOn e2e unaffected; cargo clippy -p fakecloud-batch -p fakecloud-e2e --all-targets -- -D warnings clean. Docs updated.

Next (final Batch batch)

CloudFormation provisioner for AWS::Batch::ComputeEnvironment/JobQueue/JobDefinition (write-through + snapshot hook) + Terraform acceptance-test coverage.


Summary by cubic

Adds AWS Batch retry strategies and per-attempt timeouts to real job execution. Failed attempts re-launch up to retryStrategy.attempts (1–10) with attempts recorded; overruns fail with a clear timeout reason. Docs updated and e2e coverage added.

  • New Features

    • Honor retryStrategy.attempts: on non‑zero exit, record in attempts[] and re-launch a fresh ECS task (RUNNABLE → STARTING) until attempts are exhausted.
    • Honor timeout.attemptDurationSeconds: cap each attempt; on overrun, fail with "Job attempt duration exceeded timeout".
    • Set RUNNING on first task start; terminal status writes keep real container.exitCode and statusReason.
  • Refactors

    • Extracted set_job_terminal helper for SUCCEEDED/FAILED writes.
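
The per-attempt timeout can be pictured as a deadline check inside the poll loop. The sketch below is an assumption-laden illustration rather than the PR's implementation: `AttemptResult`, `poll_attempt`, and the `poll` closure are hypothetical, and it uses blocking `std::thread::sleep` where the real code may well be async. Only the status reason string is taken from the description.

```rust
use std::time::{Duration, Instant};

/// Status reason quoted in the PR description.
const TIMEOUT_REASON: &str = "Job attempt duration exceeded timeout";

/// Result of polling a single attempt (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum AttemptResult {
    /// The container exited with this code before the limit was reached.
    Exited(i32),
    /// The attempt overran `timeout.attemptDurationSeconds`.
    TimedOut(&'static str),
}

/// Poll one attempt until it exits or the optional limit elapses.
///
/// `poll` is an assumed stand-in for "check the task status": it returns
/// `Some(exit_code)` once the container has stopped and `None` while it is
/// still running. `limit` is `None` when the job has no timeout configured.
fn poll_attempt<F>(limit: Option<Duration>, interval: Duration, mut poll: F) -> AttemptResult
where
    F: FnMut() -> Option<i32>,
{
    let started = Instant::now();
    loop {
        if let Some(code) = poll() {
            return AttemptResult::Exited(code);
        }
        if let Some(limit) = limit {
            if started.elapsed() >= limit {
                return AttemptResult::TimedOut(TIMEOUT_REASON);
            }
        }
        std::thread::sleep(interval);
    }
}
```

Checking the exit status before the deadline means a container that finishes right at the limit is still reported with its real exit code; whether the PR makes the same choice is not stated in the description.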

Written for commit b670f64. Summary will update on new commits.


- spawn_status_sync now loops over attempts: on a non-zero container exit, if
  retryStrategy.attempts remain it records the failed attempt under attempts[]
  and re-launches a fresh ECS task (status RUNNABLE -> STARTING), otherwise
  fails the job. Honors the 1-10 attempts clamp.
- timeout.attemptDurationSeconds caps each attempt's poll loop; an overrunning
  container fails the job with "Job attempt duration exceeded timeout".
- Extracted set_job_terminal helper for the SUCCEEDED/FAILED write.

Tests: Docker-gated e2e — retry_strategy_reattempts_a_failing_job (exit 4,
attempts=2 -> FAILED with one recorded retry) and timeout_fails_an_overrunning_job
(sleep 60, attemptDurationSeconds=2 -> FAILED "timeout"). Existing
success/fail/array/dependsOn e2e unaffected. Docs updated.
@vieiralucas merged commit 643f901 into main Jun 27, 2026
126 of 127 checks passed
@vieiralucas deleted the wt-batch-5 branch June 27, 2026 07:30