Skip to content

Add push_workers_to_ecr.sh script for ECR image deployment#141

Open
arjunrajlab wants to merge 11 commits into
masterfrom
claude/worker-image-registry-build-M4qXH
Open

Add push_workers_to_ecr.sh script for ECR image deployment#141
arjunrajlab wants to merge 11 commits into
masterfrom
claude/worker-image-registry-build-M4qXH

Conversation

@arjunrajlab
Copy link
Copy Markdown
Collaborator

@arjunrajlab arjunrajlab commented May 29, 2026

Summary

scripts/push_workers_to_ecr.sh builds NimbusImage worker Docker images for
linux/amd64 and pushes them to AWS ECR. Each worker gets two tags (short git
SHA + latest); needed base images are built for amd64 and each worker's
FROM is redirected to the ECR copy via docker buildx --build-context (no
Dockerfile edits). Full reference: scripts/PUSH_WORKERS_TO_ECR.md.

Where to run it

Two supported paths:

  • Laptop (macOS). Cross-builds linux/amd64 under QEMU. Fine for standard
    CPU workers; needs bash 4+ (brew install bash, run with /opt/homebrew/bin/bash).
    GPU/ML workers are slow/fragile under emulation — use --skip-ml for a quick
    standard-only run.
  • Native amd64 / EC2 (recommended for bulk + GPU/ML). On a native amd64 (or
    amd64/GPU) host there is no emulation — builds are fast and GPU/ML workers
    build normally. Spin up an ephemeral instance with an IAM instance profile
    that has ECR push permissions (so aws_credentials_prod.sh isn't needed),
    run the script, then terminate it (a spot instance keeps the one-off cost
    low). See "Building on a native amd64 machine" in the doc for the existing-box
    and ephemeral-EC2 recipes.

GPU/ML + piscis coverage

  • The script now builds and pushes GPU/ML workers too (cellpose family,
    sam/sam2 family, stardist, condensatenet, deconwolf, deepcell, cellori). They
    build FROM public nvidia/cuda images, so they need no base redirect and
    push like any other worker.
  • piscis (non-standard predict/ + train/ layout) is handled as two
    pseudo-workers, piscis_predict and piscis_train, both built with the
    piscis dir as context (matching its docker-compose.yaml).
  • New flags: --skip-ml (exclude nvidia/cuda workers incl. piscis) and
    --only-ml (build only those, e.g. on a GPU box). --list now shows a
    GPU/ML column, and the script warns before cross-building any CUDA worker
    under QEMU on a non-amd64 host.
  • Net result: a single ./scripts/push_workers_to_ecr.sh --all on a native
    amd64/GPU EC2 host builds + pushes everything (standard + GPU/ML +
    piscis). build_machine_learning_workers.sh remains the local-dev build path
    (local image tags, no ECR push).

Build parallelism

COMPOSE_PARALLEL_LIMIT in build_workers.sh is now env-overridable (default
1, safe on a memory-constrained laptop). Even a beefy EC2 GPU build instance
can't handle many concurrent heavy CUDA/torch builds, so 2–3 is the practical
ceiling
, not higher:

COMPOSE_PARALLEL_LIMIT=3 ./build_workers.sh

Key implementation details

  • Discovers workers from workers/annotations/ and workers/properties/;
    *_M1 arm64 dev variants are excluded entirely, and test/sample workers are
    skipped by --all (still pushable by explicit name).
  • Per-worker build context is auto-detected (repo-root vs worker-dir); piscis's
    context is pinned via an override map since it can't be inferred.
  • Validates AWS credentials early (sts get-caller-identity); appends -dirty
    to the SHA tag on a dirty working tree; auto-creates ECR repos on first push
    (with a prompt; -y to skip); re-logs-in + retries once on a push auth error;
    keeps a per-worker buildx cache under .cache/buildx/ (--no-cache to
    disable).

Image naming / tags

<acct>.dkr.ecr.<region>.amazonaws.com/<prefix>/<category>/<worker>:<short-git-sha>
<acct>.dkr.ecr.<region>.amazonaws.com/<prefix>/<category>/<worker>:latest

Prod registry: account 677276105390, region us-east-1, prefix nimbus.

Verified offline (syntax, --list, stubbed-aws --dry-run). An actual
GPU/ML build + push should be confirmed on a native amd64/GPU host with live
credentials.

🤖 Generated with Claude Code

Local-only workflow for building one or more worker images and pushing
them to AWS ECR after sourcing AWS credentials. Discovers workers from
the standard workers/annotations/<name> and workers/properties/<sub>/
<name> layouts (piscis is intentionally skipped), tags each image with
both the short git SHA and 'latest', uses a per-worker local buildx
cache, and creates ECR repos on first push (with a confirmation prompt).
Includes --list, --all, --dry-run, --region, --prefix, and -y flags.

Also adds a top-level .gitignore covering the buildx cache and the
existing build.log, and documents the workflow in CLAUDE.md alongside
the other build commands.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ff5dd26a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/push_workers_to_ecr.sh Outdated
-t "$img_sha" \
-t "$img_latest" \
--push \
. 2> >(tee "$stderr_file" >&2)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Build worker-context Dockerfiles from their directories

When pushing any discovered worker whose Dockerfile was written for its own directory as the build context (for example workers/annotations/cellpose/Dockerfile:35 and workers/annotations/random_point/Dockerfile:7 use COPY ./environment.yml, and there is no repo-root environment.yml), this repo-root build context makes Docker look in the wrong place and the build fails before anything can be pushed. --all includes many of these Dockerfiles, and named pushes for those workers are broken unless the context is switched to the worker directory or those files are excluded/rewritten.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in d938a07 (before this review's commit). detect_context() now picks the build context per worker by checking where each COPY/ADD source resolves — repo-root for compose-style workers, worker-dir for ones that COPY ./environment.yml (cellpose, random_point, …). Verified: cellpose → worker dir, blob_metrics_worker → repo root; both build and push.

arjunrajlab and others added 4 commits May 29, 2026 12:26
push_workers_to_ecr.sh always used the repo root as the build context and
had no concept of base images, so:

- Workers whose Dockerfile expects its own directory as context
  (cellpose, stardist, random_point, ...) failed to build under --all
  (Codex P1: no repo-root environment.yml). Now detect_context() picks
  repo-root vs worker-dir per worker by checking where each COPY/ADD
  source resolves.
- Worker `FROM nimbusimage/<base>:latest` images aren't on any registry
  and the local copies are the host arch (arm64), so amd64 cross-builds
  couldn't resolve them. ensure_base_image() now builds the needed base
  for the target platform, pushes it to nimbus/base/<base>, and redirects
  FROM to the ECR copy via `docker buildx --build-context` (no worker
  Dockerfile edits). Add --skip-base to reuse a base already in ECR.

Validated on this Mac: built+pushed amd64 nimbus/base/worker-base and
nimbus/properties/blob_metrics_worker (both confirmed linux/amd64).

Docs (CLAUDE.md): correct creds path, note worker names are directory
names, document base-image build/redirect and --skip-base.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the full ECR push flow in scripts/PUSH_WORKERS_TO_ECR.md: auth via
../AWSDeploy/aws_credentials_prod.sh, prerequisites (bash 4+ on macOS),
all options, image naming/tags, automatic base-image build + FROM redirect,
build-context auto-detection, the per-base-image worker groups, and gotchas
(QEMU emulation on macOS is slow; GPU/ML workers should be built natively on
an x86_64 machine via build_machine_learning_workers.sh; test/sample and _M1
workers are not deployed). Link it from CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ECR repository names and Docker image tags must be lowercase, but some
worker directory names aren't (blob_point_count_3D_projection_worker, the
*_M1 variants). Lowercase the repo component when forming the repo and tags
so these build instead of failing with "repository name must be lowercase".

Also fix ensure_repo() to check the create-repository exit code and return
non-zero on failure, instead of always logging "Created repo" even when the
AWS call errored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Building on a native amd64 machine" section covering (A) running the
existing script on an amd64 Linux box you already have (no emulation, fast,
GPU workers build normally) and (B) spinning up an ephemeral EC2 instance
(amd64 or GPU) with an ECR IAM instance role to build + push everything,
then terminate. Notes that GPU/ML workers go through
build_machine_learning_workers.sh and that AWSDeploy Terraform could
template the build instance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@arjunrajlab
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a821bed23a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

while IFS= read -r f; do
name="$(basename "$(dirname "$f")")"
printf '%s\tannotations\t%s\n' "$name" "$f"
done < <(find workers/annotations -mindepth 2 -maxdepth 2 -name Dockerfile -type f 2>/dev/null | sort)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Exclude M1-only worker Dockerfiles from amd64 pushes

When --all is used, this find includes directories such as workers/annotations/annulus_generator_M1, random_point_annotation_M1, and test_multiple_annotation_M1; those Dockerfiles download Miniconda3-latest-Linux-aarch64.sh, so the script's default --platform linux/amd64 build will fail once it reaches them. Filter out the _M1 worker directories (or only include the production workers) so --all can complete for the documented amd64 ECR workflow.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b2696bf. *_M1 directories are now excluded from discovery (they hardcode the aarch64 Miniconda installer and can't build linux/amd64), so --all no longer breaks on them. Also added a denylist so --all skips the dev/test workers (random_squares, sample_interface, test_multiple_annotation, test_file_creation_worker); those stay pushable by explicit name. Confirmed --all now resolves to 49 buildable workers with no _M1.

Comment thread scripts/push_workers_to_ecr.sh Outdated
Comment on lines +233 to +234
if ! git -C "$REPO_ROOT" diff --quiet 2>/dev/null || \
! git -C "$REPO_ROOT" diff --staged --quiet 2>/dev/null; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include untracked files in dirty SHA tagging

This dirty check only looks at tracked unstaged/staged diffs, so an untracked file that is copied into the Docker build context (for example a newly added worker helper or entrypoint dependency) can be baked into the image while the tag remains the clean git SHA. That breaks the provenance guarantee documented for -dirty; include untracked files in the check before deciding the image corresponds to HEAD.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b2696bf. The dirty check now uses git status --porcelain, which catches tracked changes, staged changes, and untracked (non-gitignored) files. Since the build context is usually the repo root, an untracked file can be COPYed into an image, so it must flip the tag to -dirty. .gitignored artifacts (.cache/, etc.) are still excluded.

arjunrajlab and others added 6 commits May 29, 2026 16:31
…k untracked

Address Codex review on PR #141:

- Exclude `*_M1` worker dirs from discovery. Their Dockerfiles hardcode the
  aarch64 Miniconda installer (Apple-Silicon dev only) and cannot build for
  linux/amd64, so `--all` previously broke when it reached them.
- `--all` now skips dev/test/sample workers (random_squares, sample_interface,
  test_multiple_annotation, test_file_creation_worker) per the deployment
  policy; they remain pushable by explicit name.
- Dirty SHA check now uses `git status --porcelain`, which also catches
  untracked (non-gitignored) files. With the repo-root build context an
  untracked file can be baked into an image, so the `-dirty` provenance tag
  must account for it.

Docs updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…elism

push_workers_to_ecr.sh now covers GPU/ML workers (they build FROM public
nvidia/cuda images, so no base redirect is needed) and piscis, which is
emitted as two pseudo-workers piscis_predict/piscis_train with the build
context pinned to the piscis dir (matching its docker-compose). Adds
--skip-ml/--only-ml, a GPU/ML column in --list, and a QEMU emulation warning
before cross-building any CUDA worker on a non-amd64 host.

Make COMPOSE_PARALLEL_LIMIT in build_workers.sh env-overridable (default 1).
Even an EC2 GPU build instance can't handle many concurrent heavy CUDA/torch
builds, so 2-3 is the practical ceiling, not higher.

Update PUSH_WORKERS_TO_ECR.md and CLAUDE.md: a single --all run on a native
amd64/GPU host now builds+pushes everything (standard + GPU/ML + piscis);
build_machine_learning_workers.sh remains the local-dev (no-ECR) build path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regenerated by the update-worker-docs Claude Code hook after PR creation.
Resolve the target registry from $ECR_REGISTRY when set, defaulting to the
URL derived from the authenticated account + region (which still respects
--region). This is the shared env-var contract with the AWSDeploy consume
side (Celery workers pull prebuilt images from the same registry); the full
contract is documented in AWSDeploy's
doc/Build_and_Push_Worker_Images_to_ECR.md.

No new flag, no behavior change when $ECR_REGISTRY is unset. The override
flows to worker SHA/latest tags, the base-image push, and the FROM
--build-context redirect.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cripts

Capture the deferred consume-side link (make build_workers.sh and
build_machine_learning_workers.sh pull prebuilt images from ECR when
$ECR_REGISTRY is set, falling back to a source build) along with the two
blockers that make it non-trivial: the unknown local image tag Girder
discovery expects, and the lack of a single name mapping across worker dir /
ECR repo / ML-script local tag / compose service name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hanges

generate_worker_docs.py: prefer the concise Docker `description` label for the
REGISTRY one-liner (fall back to the in-app notes only when there's no label),
and make _strip_html replace tags with a space + drop a trailing "Learn more"
link label. This removes the verbose/truncated rows, the leaked "Learn more"
anchor text, and fused sentences (e.g. "z-slices.It") that the regenerated
registry had picked up from the notes fields.

update-worker-docs.sh: only regenerate + commit docs when the branch diff vs.
the base actually touches workers/ or the generator, so an unrelated PR (e.g.
a build-script change) no longer picks up a registry-churn commit.

REGISTRY.md regenerated with the fixed generator (idempotent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants