feat(batch): real ECS-backed SubmitJob execution (batch 2) #1986

Merged: vieiralucas merged 2 commits into main from wt-batch-2 on Jun 27, 2026
@vieiralucas vieiralucas commented Jun 27, 2026

Summary

The headline differentiator for the cycle-5 winner: AWS Batch SubmitJob runs a REAL container, not a fake. Every other free emulator fakes Batch compute (MiniStack jumps straight to SUCCEEDED with no container; LocalStack paywalls it).

  • Wire ECS state and runtime into BatchService (with_ecs). SubmitJob resolves the job definition's containerProperties (plus this submit's containerOverrides), ensures a fakecloud-batch ECS cluster exists, calls RegisterTaskDefinition (mapping vcpus/memory, taken from either the legacy fields or resourceRequirements, to ECS cpu/memory), and then calls RunTask. This reuses all of ECS's real container / portability / k8s handling through its public handle() API, mirroring how autoscaling drives EC2.
  • A background poller maps the ECS task's real lifecycle and container exit code onto the Batch job: SUBMITTED → STARTING → RUNNING → SUCCEEDED (exit 0) or FAILED (carrying the real container.exitCode). Launch is non-blocking: ECS RunTask backgrounds the container pull/start, so the client does not time out.
  • NO auto-success: with no container runtime wired, the job honestly stays SUBMITTED. Auto-succeeding without compute is exactly the rival anti-pattern this Batch implementation avoids.
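
The vcpus/memory mapping in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ContainerSpec`, `requirement`, and `to_ecs_resources` are hypothetical names, and letting resourceRequirements take precedence over the legacy fields is an assumption. The only fixed fact it relies on is that ECS expresses CPU in units where 1 vCPU equals 1024 units, while memory is in MiB on both sides.

```rust
/// ECS expresses CPU in units where 1 vCPU = 1024 units.
const ECS_CPU_UNITS_PER_VCPU: f64 = 1024.0;

/// Hypothetical, simplified view of a Batch container spec
/// (not fakecloud's real type).
#[derive(Default)]
pub struct ContainerSpec {
    /// Legacy `vcpus` field.
    pub vcpus: Option<u32>,
    /// Legacy `memory` field, in MiB.
    pub memory: Option<u32>,
    /// `resourceRequirements` entries as (type, value) pairs,
    /// for example ("VCPU", "0.5") or ("MEMORY", "1024").
    pub resource_requirements: Vec<(String, String)>,
}

/// Looks up the value of one resourceRequirements entry by type.
fn requirement<'a>(spec: &'a ContainerSpec, kind: &str) -> Option<&'a str> {
    spec.resource_requirements
        .iter()
        .find(|(t, _)| t.eq_ignore_ascii_case(kind))
        .map(|(_, v)| v.as_str())
}

/// Returns (ecs_cpu_units, ecs_memory_mib); each is None when unset.
/// Assumption: resourceRequirements wins over the legacy fields.
pub fn to_ecs_resources(spec: &ContainerSpec) -> (Option<u32>, Option<u32>) {
    let vcpus: Option<f64> = requirement(spec, "VCPU")
        .and_then(|v| v.parse::<f64>().ok())
        .or(spec.vcpus.map(f64::from));
    let memory: Option<u32> = requirement(spec, "MEMORY")
        .and_then(|v| v.parse::<u32>().ok())
        .or(spec.memory);

    // Fractional vCPU values (such as "0.5") become whole ECS CPU units.
    let cpu_units = vcpus.map(|v| (v * ECS_CPU_UNITS_PER_VCPU).round() as u32);
    (cpu_units, memory)
}
```

Under these assumptions, a legacy spec with 2 vcpus and 2048 MiB maps to 2048 CPU units and 2048 MiB, and a resourceRequirements value of "0.5" for VCPU maps to 512 CPU units.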

Test plan

  • Unit (10): resource mapping from both shapes, job-def latest-revision + override resolution, no-runtime-stays-SUBMITTED, control-plane lifecycle.
  • Docker-gated e2e batch_real_execution.rs: real alpine container — exit 0 → SUCCEEDED/container.exitCode 0, exit 7 → FAILED/exitCode 7. Both PASS locally with Docker.
  • Control-plane e2e batch.rs still passes; cargo clippy -p fakecloud-batch --bin fakecloud --all-targets -- -D warnings clean; doc-counts passes.
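
For readers unfamiliar with what "latest-revision + override resolution" covers, here is a rough sketch of the two behaviors. All type and function names are hypothetical and do not come from fakecloud-batch. It assumes references of the form `name` or `name:revision` only (ARNs are not handled), that inactive revisions never resolve, that a command override replaces the base command, and that environment overrides are merged by variable name.

```rust
use std::collections::BTreeMap;

/// Hypothetical registry entry for one job definition revision.
pub struct JobDefinition {
    pub name: String,
    pub revision: u32,
    pub active: bool,
}

/// Resolves "name" to the highest active revision, or "name:N" to revision N.
pub fn resolve<'a>(defs: &'a [JobDefinition], reference: &str) -> Option<&'a JobDefinition> {
    // Split off a trailing ":<number>" if present; otherwise treat it all as the name.
    let (name, pinned) = match reference.rsplit_once(':') {
        Some((n, r)) => match r.parse::<u32>() {
            Ok(rev) => (n, Some(rev)),
            Err(_) => (reference, None),
        },
        None => (reference, None),
    };
    defs.iter()
        .filter(|d| d.active && d.name == name)
        .filter(|d| pinned.map_or(true, |rev| d.revision == rev))
        .max_by_key(|d| d.revision)
}

/// Hypothetical subset of containerProperties.
#[derive(Clone)]
pub struct ContainerProperties {
    pub image: String,
    pub command: Vec<String>,
    pub environment: BTreeMap<String, String>,
}

/// Hypothetical subset of containerOverrides.
#[derive(Default)]
pub struct ContainerOverrides {
    pub command: Option<Vec<String>>,
    pub environment: BTreeMap<String, String>,
}

/// Applies a submit's overrides on top of the job definition's properties.
pub fn apply_overrides(base: &ContainerProperties, ov: &ContainerOverrides) -> ContainerProperties {
    // Override variables replace base variables with the same name; others are added.
    let mut environment = base.environment.clone();
    environment.extend(ov.environment.clone());
    ContainerProperties {
        image: base.image.clone(),
        command: ov.command.clone().unwrap_or_else(|| base.command.clone()),
        environment,
    }
}
```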

Surface

Docs page updated to reflect real execution. The batch_real_execution e2e mirrors the EC2/ECS Docker-gated pattern: it panics in CI when Docker is unavailable and skips locally.
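
The Docker-gated pattern described here could look something like the sketch below. This is an illustration under assumptions, not the helper used by the existing EC2/ECS tests: the names `Gate`, `decide`, `docker_available`, and `require_docker` are hypothetical, CI is assumed to be detected through a `CI` environment variable, and Docker availability is assumed to be probed by running `docker info`.

```rust
use std::process::{Command, Stdio};

/// What a Docker-gated test should do in the current environment.
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    Run,
    Skip,
    Fail,
}

/// Pure decision: run with Docker, fail loudly in CI without it, skip locally.
pub fn decide(docker_available: bool, in_ci: bool) -> Gate {
    match (docker_available, in_ci) {
        (true, _) => Gate::Run,
        (false, true) => Gate::Fail,
        (false, false) => Gate::Skip,
    }
}

/// Assumption: `docker info` exiting successfully means a usable daemon.
pub fn docker_available() -> bool {
    Command::new("docker")
        .arg("info")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Returns true when the test body should run; panics in CI without Docker.
pub fn require_docker() -> bool {
    // Assumption: the CI system sets a `CI` environment variable.
    let in_ci = std::env::var_os("CI").is_some();
    match decide(docker_available(), in_ci) {
        Gate::Run => true,
        Gate::Skip => {
            eprintln!("skipping: Docker is not available");
            false
        }
        Gate::Fail => panic!("Docker is required in CI but is not available"),
    }
}
```

A test would start with something like `if !require_docker() { return; }`. Panicking in CI keeps a missing Docker daemon from silently turning the real-execution coverage into a no-op.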

Next batches

Array jobs + dependsOn + retry/timeout, then a CloudFormation AWS::Batch::* provisioner + tfacc.


Summary by cubic

SubmitJob now runs real containers on ECS via fakecloud-ecs, advancing jobs through STARTING/RUNNING and finishing as SUCCEEDED or FAILED from the container’s exit code; without a container runtime they remain SUBMITTED.

  • New Features

    • Resolve container properties + overrides, ensure a fakecloud-batch ECS cluster, register a task definition (map legacy vcpus/memory or resourceRequirements → ECS cpu/memory), then run a task.
    • Background poller syncs ECS task state to job status and records container.exitCode; launch is non-blocking.
    • No auto-success if ECS isn’t available.
  • Refactors

    • Docker-gated e2e now uses resourceRequirements instead of deprecated vcpus/memory; docs updated.

Written for commit 97ca700. Summary will update on new commits.

The headline differentiator: SubmitJob now runs a REAL container, not a fake.

- Wire the ECS state + runtime into BatchService (with_ecs); SubmitJob resolves
  the job definition's containerProperties (+ this submit's containerOverrides),
  ensures a `fakecloud-batch` ECS cluster, RegisterTaskDefinition mapping
  vcpus/memory (legacy fields or resourceRequirements) -> ECS cpu/memory, and
  RunTask — reusing all of ECS's real container / portability / k8s handling
  through its public handle() API (mirrors how autoscaling drives EC2).
- A background poller maps the ECS task's real lifecycle + container exit code
  onto the Batch job: SUBMITTED -> STARTING -> RUNNING -> SUCCEEDED (exit 0) or
  FAILED (carrying the real container.exitCode). The launch is non-blocking
  (ECS RunTask backgrounds the container pull/start, so no client timeout).
- NO auto-success: with no container runtime wired the job stays SUBMITTED
  honestly — exactly the rival anti-pattern Batch beats (MiniStack jumps to
  SUCCEEDED with no compute).
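
One way to picture the poller's lifecycle mapping is as a pure function from the ECS task's lastStatus and the container exit code to a Batch job status. The sketch below is an assumption about the shape of that logic, not the code in this commit: `JobStatus` and `next_status` are hypothetical names, the grouping of intermediate ECS statuses into STARTING and RUNNING is a guess, and treating a stopped task with no exit code as FAILED is also an assumption.

```rust
/// The subset of Batch job statuses used by the lifecycle above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Submitted,
    Starting,
    Running,
    Succeeded,
    Failed,
}

/// Computes the job's next status from the ECS task's `lastStatus`
/// (None when no task exists, for example with no runtime wired)
/// and the container exit code, if one has been reported.
pub fn next_status(
    current: JobStatus,
    ecs_last_status: Option<&str>,
    exit_code: Option<i32>,
) -> JobStatus {
    // Terminal states never change.
    if matches!(current, JobStatus::Succeeded | JobStatus::Failed) {
        return current;
    }
    match ecs_last_status {
        // No task: stay where we are, never auto-succeed.
        None => current,
        Some("PROVISIONING" | "PENDING" | "ACTIVATING") => JobStatus::Starting,
        Some("RUNNING" | "DEACTIVATING" | "STOPPING" | "DEPROVISIONING") => JobStatus::Running,
        Some("STOPPED") => match exit_code {
            Some(0) => JobStatus::Succeeded,
            // Assumption: a non-zero or missing exit code counts as a failure.
            _ => JobStatus::Failed,
        },
        // Unknown status strings leave the job unchanged.
        Some(_) => current,
    }
}
```

Keeping the mapping pure makes it easy to unit test without Docker; the poller then only has to fetch the task state and store the result.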

Tests: unit (resource mapping from both shapes, job-def latest-revision +
override resolution, no-runtime stays SUBMITTED); Docker-gated e2e
(real alpine container: exit 0 -> SUCCEEDED/exitCode 0, exit 7 -> FAILED/exitCode 7).
Docs page updated to reflect real execution.
@vieiralucas vieiralucas merged commit 3648bb5 into main Jun 27, 2026
103 checks passed
@vieiralucas vieiralucas deleted the wt-batch-2 branch June 27, 2026 04:47