Refactor ML worker Dockerfiles: mamba + shared base images #132

Draft
arjunrajlab wants to merge 5 commits into master from refactor/ml-worker-dockerfiles-mamba-base-images

@arjunrajlab
Collaborator

Summary

  • Create 2 shared base images (sam2-worker-base, cuda-ml-worker-base) to eliminate redundant conda environment solves, apt-get installs, Miniforge downloads, and NimbusImage clones across 14 ML workers
  • Switch all ML workers from conda to mamba for significantly faster environment resolution
  • Simplify SAM2 worker Dockerfiles from ~70 lines each to ~15 lines (all 5 share identical environment + SAM2 model setup)
  • Fix bugs: sam2_refine/Dockerfile_M1 copy-paste bug (was copying from sam2_propagate/), stardist duplicate deeptile install
  • Standardize all ML workers to arjunrajlaboratory/NimbusImage (replacing old Kitware/UPennContrast references)
  • Switch piscis from Miniconda to Miniforge (gets mamba for free, no more Anaconda ToS acceptance needed)
  • Net result: -689 lines of Dockerfile code, ~50-60% estimated build time reduction

New base images

| Image | CUDA | Used by |
|---|---|---|
| nimbusimage/sam2-worker-base:latest | 12.1 | sam2_automatic_mask_generator, sam2_fewshot, sam2_propagate, sam2_refine, sam2_video |
| nimbusimage/sam2-worker-base-m1:latest | 11.8 | Same 5 workers (M1 variants) |
| nimbusimage/cuda-ml-worker-base:latest | 11.8 | cellpose, cellpose_train, cellposesam, stardist, sam_fewshot, sam_automatic |

Files changed

  • 5 new files: 3 base Dockerfiles + 2 environment.yml files in workers/base_docker_images/
  • 8 deleted files: 5 identical SAM2 environment.yml + 3 cellpose environment.yml (now in base images or pip-installed)
  • 24 modified Dockerfiles + build_machine_learning_workers.sh + REGISTRY.md

Test plan

  • Build the two base images:
      docker build . -f ./workers/base_docker_images/Dockerfile.sam2_worker_base -t nimbusimage/sam2-worker-base:latest
      docker build . -f ./workers/base_docker_images/Dockerfile.cuda_ml_worker_base -t nimbusimage/cuda-ml-worker-base:latest
  • Build a SAM2 worker (e.g., sam2_fewshot_segmentation) and verify it starts correctly
  • Build a cellpose worker and verify labels with docker inspect
  • Run build_machine_learning_workers.sh end-to-end on x86_64
  • For workers with tests (sam2_fewshot_segmentation), run the test suite
  • Verify sam2_video correctly uses the nimbus fork (not standard SAM2)
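
The build-order and label checks in the test plan can be sketched as a small shell helper. This is a minimal sketch, not the contents of build_machine_learning_workers.sh: the function names `build_base_images` and `image_labels` are hypothetical, while the Dockerfile paths and image tags are the ones quoted in the first bullet. It assumes the commands are run from the repository root.

```shell
# Sketch only: function names are hypothetical; paths and tags come from
# the test plan above.

# Build both shared base images before any worker that inherits from them.
# The second build only runs if the first one succeeds.
build_base_images() {
    docker build . \
        -f ./workers/base_docker_images/Dockerfile.sam2_worker_base \
        -t nimbusimage/sam2-worker-base:latest &&
    docker build . \
        -f ./workers/base_docker_images/Dockerfile.cuda_ml_worker_base \
        -t nimbusimage/cuda-ml-worker-base:latest
}

# Print the labels stored in an image's config as a JSON object.
image_labels() {
    docker inspect --format '{{ json .Config.Labels }}' "$1"
}
```

For example, `build_base_images && image_labels nimbusimage/cuda-ml-worker-base:latest` builds both bases and then prints the labels of one of them; any worker tag can be passed to `image_labels` in the same way.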

🤖 Generated with Claude Code

arjunrajlab and others added 2 commits March 1, 2026 13:06
Create shared base images (sam2-worker-base, cuda-ml-worker-base) to
eliminate redundant conda solves across 14 ML workers. Switch all
workers from conda to mamba for faster environment resolution.

- Add SAM2 base image (CUDA 12.1 x86_64 + CUDA 11.8 M1) with shared
  environment, SAM2 model, and checkpoints — used by all 5 SAM2 workers
- Add CUDA ML base image (CUDA 11.8) with common packages, NimbusImage
  client, annotation_utilities — used by cellpose, stardist, SAM1
- Simplify SAM2 worker Dockerfiles from ~70 lines to ~15 lines each
- Simplify cellpose/stardist/SAM1 Dockerfiles to thin layers on base
- Switch piscis from Miniconda to Miniforge (gets mamba for free)
- Switch condensatenet and deconwolf to mamba env create
- Standardize all ML workers to arjunrajlaboratory/NimbusImage
- Fix sam2_refine/Dockerfile_M1 copy-paste bug (was copying from sam2_propagate)
- Fix stardist environment.yml duplicate deeptile install
- Delete 8 redundant environment.yml files (5 SAM2 + 3 cellpose)
- Update build_machine_learning_workers.sh to build base images first

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f104c3741

RUN conda env create --file /environment.yml
# Stardist requires specific pinned package versions, so recreate the environment
COPY ./workers/annotations/stardist/environment.yml /
RUN mamba env remove -n worker -y && mamba env create --file /environment.yml

[P1] Reset shell before removing the inherited worker env

This RUN now executes with the parent image shell from workers/base_docker_images/Dockerfile.cuda_ml_worker_base (SHELL ["conda", "run", "-n", "worker", ...]), so the command is run inside the same worker environment it tries to delete. In that context, mamba env remove -n worker can fail (or leave the layer in a broken state), which blocks building the Stardist image. Use a non-worker shell for the removal/recreate step before switching back to conda run -n worker.
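
One way to read this suggestion is sketched below. It is an illustration under assumptions, not the Dockerfile from this PR: the helper name `build_stardist_sketch` and its tag argument are hypothetical, the Dockerfile is passed to `docker build` on stdin with the repository root as build context, and `mamba` is assumed to be on PATH under plain `/bin/bash` in the base image. The COPY and RUN lines are the ones quoted in the review hunk.

```shell
# Sketch only: `build_stardist_sketch` and its tag argument are hypothetical.
# The Dockerfile is fed to `docker build` on stdin (`-f -`), with the current
# directory (assumed to be the repository root) as the build context.
build_stardist_sketch() {
    docker build -t "$1" -f - . <<'EOF'
FROM nimbusimage/cuda-ml-worker-base:latest

# Leave the inherited `conda run -n worker` shell so the next RUN
# does not execute inside the environment it deletes.
SHELL ["/bin/bash", "-c"]

COPY ./workers/annotations/stardist/environment.yml /
RUN mamba env remove -n worker -y && mamba env create --file /environment.yml

# Restore the base image's `conda run -n worker` SHELL here before any
# later RUN steps that need the recreated environment.
EOF
}
```

The exact arguments of the base image's SHELL instruction are not reproduced here; copy them from Dockerfile.cuda_ml_worker_base when restoring the shell.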

arjunrajlab and others added 3 commits March 1, 2026 20:35
… env

The base image sets SHELL to conda run -n worker, so mamba env remove
would run inside the environment it's trying to delete. Reset to
/bin/bash first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The shared cuda-ml-worker-base image caused CUDA version mismatches
for the Cellpose family: pip-installed cellpose pulls PyTorch with
the default CUDA 12.8, but the base image only has CUDA 11.8. The
original standalone Dockerfiles let conda resolve PyTorch with the
correct CUDA version via each worker's own environment.yml.

Changes:
- Restore standalone Dockerfiles for cellpose, cellpose_train, cellposesam
- Restore their environment.yml files (deleted by prior commit)
- Update Kitware/UPennContrast references to arjunrajlaboratory/NimbusImage
- Update build script to use worker directory as build context for these 3
- Add libstdcxx-ng and LD_LIBRARY_PATH fix to cuda_ml_worker_base
  (still used by stardist and SAM1 workers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
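
A mismatch like the one described in this commit can be checked by asking PyTorch inside a built image which CUDA version it was built against. The helper below is a rough sketch: `torch_cuda_version` is a hypothetical name, and it assumes the image has `conda` on PATH and a conda environment named `worker` (the name used by the base image's SHELL; the standalone Cellpose images may differ).

```shell
# Sketch only: `torch_cuda_version` is a hypothetical helper. It assumes the
# image has `conda` on PATH and a conda environment named `worker`.
torch_cuda_version() {
    # --entrypoint bypasses whatever ENTRYPOINT the worker image defines,
    # so the remaining arguments are passed straight to conda.
    docker run --rm --entrypoint conda "$1" \
        run -n worker python -c 'import torch; print(torch.version.cuda)'
}
```

`torch.version.cuda` reports the CUDA version the installed PyTorch build targets, so it can be compared against the CUDA version of the image (11.8 for cuda-ml-worker-base) without needing a GPU on the build machine.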
The shared sam2-worker-base image approach caused GPU passthrough failures
in girder_worker due to missing NVIDIA Docker labels and PyTorch CUDA
version mismatches. Reverted all 5 SAM2 workers (automatic_mask_generator,
fewshot_segmentation, propagate, refine, video) to standalone Dockerfiles
with their own conda environments and environment.yml files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@arjunrajlab arjunrajlab marked this pull request as draft March 4, 2026 15:47
@arjunrajlab arjunrajlab added the "archived" label (Preserved for reference, not actively being pursued) Mar 4, 2026
@arjunrajlab
Collaborator Author

Status: Archived

This PR is being archived rather than merged or closed. The bulk of the changes were reverted due to GPU passthrough issues with shared Docker base images.

What remains

Only 3 workers (SAM1 x2, stardist) still use the shared cuda-ml-worker-base image. The 8 GPU workers (Cellpose x3, SAM2 x5) were reverted to standalone Dockerfiles.

Learnings preserved

The technical learnings from this effort are documented in todo/ml-worker-build-optimization.md on master, with a tracking entry in todo/TODO_REGISTRY.md.

Future reference

If revisiting ML worker build optimization, start from the TODO doc rather than this PR branch. The key blocker was NVIDIA Container Toolkit GPU passthrough requiring specific Docker label configurations that conflict with shared base images.
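
When revisiting this, a quick way to see whether an image carries any NVIDIA-related labels is to list its labels and filter them. This is only a sketch: `nvidia_labels` is a hypothetical helper, and matching on the substring "nvidia" is an assumption, since this PR does not name the specific labels involved.

```shell
# Sketch only: `nvidia_labels` is a hypothetical helper, and matching on the
# substring "nvidia" is an assumption; the PR does not name the labels involved.
nvidia_labels() {
    # Print each label as key=value on its own line, then keep matching lines.
    # The exit status is non-zero when no label matches.
    docker inspect \
        --format '{{ range $k, $v := .Config.Labels }}{{ printf "%s=%s\n" $k $v }}{{ end }}' \
        "$1" | grep -i nvidia
}
```

Running it against a standalone worker image and against an image built on a shared base shows whether the two differ in the labels they expose.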

arjunrajlab added a commit that referenced this pull request Mar 4, 2026
Create todo/ directory with a registry and detailed doc preserving
the technical learnings from the ML worker build optimization effort
(PR #132), which was archived after GPU passthrough issues forced
reverting most workers to standalone Dockerfiles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>