Refactor ML worker Dockerfiles: mamba + shared base images #132
arjunrajlab wants to merge 5 commits into master
Create shared base images (`sam2-worker-base`, `cuda-ml-worker-base`) to eliminate redundant conda solves across 14 ML workers. Switch all workers from conda to mamba for faster environment resolution.

- Add SAM2 base image (CUDA 12.1 x86_64 + CUDA 11.8 M1) with shared environment, SAM2 model, and checkpoints; used by all 5 SAM2 workers
- Add CUDA ML base image (CUDA 11.8) with common packages, NimbusImage client, and `annotation_utilities`; used by cellpose, stardist, and SAM1
- Simplify SAM2 worker Dockerfiles from ~70 lines to ~15 lines each
- Simplify cellpose/stardist/SAM1 Dockerfiles to thin layers on the base image
- Switch piscis from Miniconda to Miniforge (gets mamba for free)
- Switch condensatenet and deconwolf to `mamba env create`
- Standardize all ML workers to `arjunrajlaboratory/NimbusImage`
- Fix `sam2_refine/Dockerfile_M1` copy-paste bug (was copying from `sam2_propagate`)
- Fix stardist `environment.yml` duplicate deeptile install
- Delete 8 redundant `environment.yml` files (5 SAM2 + 3 cellpose)
- Update `build_machine_learning_workers.sh` to build base images first

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5f104c3741
    RUN conda env create --file /environment.yml
    # Stardist requires specific pinned package versions, so recreate the environment
    COPY ./workers/annotations/stardist/environment.yml /
    RUN mamba env remove -n worker -y && mamba env create --file /environment.yml
Reset shell before removing the inherited `worker` env

This `RUN` now executes with the parent image's shell from `workers/base_docker_images/Dockerfile.cuda_ml_worker_base` (`SHELL ["conda", "run", "-n", "worker", ...]`), so the command runs inside the same `worker` environment it tries to delete. In that context, `mamba env remove -n worker` can fail (or leave the layer in a broken state), which blocks building the Stardist image. Use a non-`worker` shell for the remove/recreate step before switching back to `conda run -n worker`.
… env

The base image sets `SHELL` to `conda run -n worker`, so `mamba env remove` would run inside the environment it's trying to delete. Reset to `/bin/bash` first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
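
The shell reset described in this commit might look like the fragment below. It is a sketch under stated assumptions, not the committed Dockerfile: the `FROM` tag is taken from the PR summary, `mamba` is assumed to be on `PATH` outside the `worker` environment, and `environment.yml` is assumed to name its environment `worker`. The step that restores the base image's shell is left as a comment because the full `SHELL` arguments are not shown in this thread.

```dockerfile
# Sketch only: parent tag taken from the PR summary.
FROM nimbusimage/cuda-ml-worker-base:latest

# The base image's SHELL wraps every RUN in `conda run -n worker`.
# Drop back to plain bash so the next RUN does not execute inside
# the environment it is about to delete.
SHELL ["/bin/bash", "-c"]

# Stardist requires specific pinned package versions, so recreate the environment.
# Assumes `mamba` is on PATH here and that environment.yml declares `name: worker`.
COPY ./workers/annotations/stardist/environment.yml /
RUN mamba env remove -n worker -y && mamba env create --file /environment.yml

# Afterwards, restore the SHELL instruction from
# workers/base_docker_images/Dockerfile.cuda_ml_worker_base (copied verbatim)
# so later RUN steps execute inside the recreated `worker` environment.
```

Because `SHELL` applies to every later `RUN`, `CMD`, and `ENTRYPOINT` written in shell form, the reset has to come before the remove step and the restore has to come after it.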
The shared `cuda-ml-worker-base` image caused CUDA version mismatches for the Cellpose family: pip-installed cellpose pulls PyTorch with the default CUDA 12.8, but the base image only has CUDA 11.8. The original standalone Dockerfiles let conda resolve PyTorch with the correct CUDA version via each worker's own `environment.yml`.

Changes:

- Restore standalone Dockerfiles for cellpose, cellpose_train, and cellposesam
- Restore their `environment.yml` files (deleted by prior commit)
- Update `Kitware/UPennContrast` references to `arjunrajlaboratory/NimbusImage`
- Update build script to use the worker directory as build context for these 3
- Add `libstdcxx-ng` and `LD_LIBRARY_PATH` fix to `cuda_ml_worker_base` (still used by stardist and SAM1 workers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
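
One way to surface this kind of mismatch at build time rather than at run time is to assert which CUDA version the installed PyTorch was built against. The fragment below is a hypothetical guard, not part of this PR. It assumes the image's `SHELL` already runs commands inside the `worker` environment and that the expected version is 11.8, as stated for the base image.

```dockerfile
# Hypothetical build-time guard (not from the PR).
# torch.version.cuda is the CUDA version the installed PyTorch build targets
# (None for CPU-only builds). Fail the build if it is not 11.8.
RUN python -c "import torch; v = torch.version.cuda; assert v == '11.8', f'PyTorch built for CUDA {v}, expected 11.8'"
```

The check only reads a constant compiled into the PyTorch package, so it does not need a GPU on the build machine.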
The shared `sam2-worker-base` image approach caused GPU passthrough failures in girder_worker due to missing NVIDIA Docker labels and PyTorch CUDA version mismatches. Reverted all 5 SAM2 workers (automatic_mask_generator, fewshot_segmentation, propagate, refine, video) to standalone Dockerfiles with their own conda environments and `environment.yml` files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Status: Archived

This PR is being archived rather than merged or closed. The bulk of the changes were reverted due to GPU passthrough issues with shared Docker base images.

What remains

Only 3 workers (SAM1 x2, stardist) still use the shared `cuda-ml-worker-base` image. The 8 GPU workers (Cellpose x3, SAM2 x5) were reverted to standalone Dockerfiles.

Learnings preserved

The technical learnings from this effort are documented in `todo/ml-worker-build-optimization.md` on master, with a tracking entry in `todo/TODO_REGISTRY.md`.

Future reference

If revisiting ML worker build optimization, start from the TODO doc rather than this PR branch. The key blocker was NVIDIA Container Toolkit GPU passthrough requiring specific Docker label configurations that conflict with shared base images.
Create `todo/` directory with a registry and detailed doc preserving the technical learnings from the ML worker build optimization effort (PR #132), which was archived after GPU passthrough issues forced reverting most workers to standalone Dockerfiles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

- Create shared base images (`sam2-worker-base`, `cuda-ml-worker-base`) to eliminate redundant conda environment solves, apt-get installs, Miniforge downloads, and NimbusImage clones across 14 ML workers
- Standardize all ML workers to `arjunrajlaboratory/NimbusImage` (replacing old `Kitware/UPennContrast` references)

New base images

- `nimbusimage/sam2-worker-base:latest`
- `nimbusimage/sam2-worker-base-m1:latest`
- `nimbusimage/cuda-ml-worker-base:latest`

Files changed

- `workers/base_docker_images/`
- `build_machine_learning_workers.sh` + `REGISTRY.md`

Test plan

- `docker build . -f ./workers/base_docker_images/Dockerfile.sam2_worker_base -t nimbusimage/sam2-worker-base:latest` and `docker build . -f ./workers/base_docker_images/Dockerfile.cuda_ml_worker_base -t nimbusimage/cuda-ml-worker-base:latest`
- `docker inspect`
- `build_machine_learning_workers.sh` end-to-end on x86_64

🤖 Generated with Claude Code