Refactor ML worker Dockerfiles: mamba + shared base images by arjunrajlab · Pull Request #132 · arjunrajlaboratory/ImageAnalysisProject

arjunrajlab · 2026-03-01T20:04:23Z

Summary

- Create 2 shared base images (sam2-worker-base, cuda-ml-worker-base) to eliminate redundant conda environment solves, apt-get installs, Miniforge downloads, and NimbusImage clones across 14 ML workers
- Switch all ML workers from conda to mamba for significantly faster environment resolution
- Simplify SAM2 worker Dockerfiles from ~70 lines each to ~15 lines (all 5 share identical environment + SAM2 model setup)
- Fix bugs: sam2_refine/Dockerfile_M1 copy-paste bug (was copying from sam2_propagate/), stardist duplicate deeptile install
- Standardize all ML workers to arjunrajlaboratory/NimbusImage (replacing old Kitware/UPennContrast references)
- Switch piscis from Miniconda to Miniforge (gets mamba for free, no more Anaconda ToS acceptance needed)
- Net result: -689 lines of Dockerfile code, ~50-60% estimated build time reduction
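
To make the "~15 lines" claim concrete, a thin SAM2 worker Dockerfile layered on the shared base might look roughly like the sketch below. Only the base image tag comes from this PR. The worker directory layout, the `entrypoint.py` file name, the `worker` environment name for the SAM2 base, and the label are assumptions for illustration, not the contents of the actual Dockerfiles.

```dockerfile
# Sketch only: a thin worker layer on top of the shared SAM2 base image.
# The base tag is from this PR; paths and file names below are hypothetical.
FROM nimbusimage/sam2-worker-base:latest

# Copy only the worker-specific code. The conda environment, SAM2 model,
# and checkpoints are assumed to be provided by the base image already.
COPY ./workers/annotations/sam2_fewshot_segmentation/entrypoint.py /

# Hypothetical metadata label; the real workers may define different ones.
LABEL description="SAM2 few-shot segmentation worker (illustrative)"

# Run the worker inside the environment inherited from the base image
# (assumed here to be named `worker`).
ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "worker", "python", "/entrypoint.py"]
```

With this layout, everything expensive (environment solve, model download) would live in the base image, and a worker rebuild would only redo the `COPY` layer.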

New base images

| Image | CUDA | Used by |
| --- | --- | --- |
| `nimbusimage/sam2-worker-base:latest` | 12.1 | sam2_automatic_mask_generator, sam2_fewshot, sam2_propagate, sam2_refine, sam2_video |
| `nimbusimage/sam2-worker-base-m1:latest` | 11.8 | Same 5 workers (M1 variants) |
| `nimbusimage/cuda-ml-worker-base:latest` | 11.8 | cellpose, cellpose_train, cellposesam, stardist, sam_fewshot, sam_automatic |
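
As a rough illustration of what such a base image could contain, the sketch below installs Miniforge, solves the shared environment once with mamba, and sets the default shell so that later `RUN` steps execute inside the `worker` environment. The `nvidia/cuda` tag, the apt package list, the `/opt/conda` prefix, the environment file name, and the exact `SHELL` arguments are assumptions; the PR does not show the actual base Dockerfiles.

```dockerfile
# Sketch only: a shared CUDA 11.8 base image that solves the environment once.
# The nvidia/cuda tag, apt packages, and install prefix are assumptions.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends wget ca-certificates git && \
    rm -rf /var/lib/apt/lists/*

# Miniforge ships conda and mamba with conda-forge as the default channel.
RUN wget -q https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O /tmp/miniforge.sh && \
    bash /tmp/miniforge.sh -b -p /opt/conda && \
    rm /tmp/miniforge.sh
ENV PATH=/opt/conda/bin:$PATH

# Hypothetical name for the shared environment file; it is assumed to
# declare `name: worker`.
COPY ./workers/base_docker_images/environment.yml /environment.yml
RUN mamba env create --file /environment.yml && mamba clean --all --yes

# Later RUN instructions (here and in child images) execute inside `worker`.
SHELL ["conda", "run", "-n", "worker", "/bin/bash", "-c"]
```

Note that the `SHELL` setting is inherited by every image built `FROM` this one, which matters for any child Dockerfile that needs to modify or remove the `worker` environment.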

Files changed

- 5 new files: 3 base Dockerfiles + 2 environment.yml files in workers/base_docker_images/
- 8 deleted files: 5 identical SAM2 environment.yml + 3 cellpose environment.yml (now in base images or pip-installed)
- 24 modified Dockerfiles + build_machine_learning_workers.sh + REGISTRY.md

Test plan

- Build the two base images: `docker build . -f ./workers/base_docker_images/Dockerfile.sam2_worker_base -t nimbusimage/sam2-worker-base:latest` and `docker build . -f ./workers/base_docker_images/Dockerfile.cuda_ml_worker_base -t nimbusimage/cuda-ml-worker-base:latest`
- Build a SAM2 worker (e.g., sam2_fewshot_segmentation) and verify it starts correctly
- Build a cellpose worker and verify labels with `docker inspect`
- Run `build_machine_learning_workers.sh` end-to-end on x86_64
- For workers with tests (sam2_fewshot_segmentation), run the test suite
- Verify sam2_video correctly uses the nimbus fork (not standard SAM2)

🤖 Generated with Claude Code

Create shared base images (sam2-worker-base, cuda-ml-worker-base) to eliminate redundant conda solves across 14 ML workers. Switch all workers from conda to mamba for faster environment resolution.

- Add SAM2 base image (CUDA 12.1 x86_64 + CUDA 11.8 M1) with shared environment, SAM2 model, and checkpoints; used by all 5 SAM2 workers
- Add CUDA ML base image (CUDA 11.8) with common packages, NimbusImage client, annotation_utilities; used by cellpose, stardist, SAM1
- Simplify SAM2 worker Dockerfiles from ~70 lines to ~15 lines each
- Simplify cellpose/stardist/SAM1 Dockerfiles to thin layers on base
- Switch piscis from Miniconda to Miniforge (gets mamba for free)
- Switch condensatenet and deconwolf to mamba env create
- Standardize all ML workers to arjunrajlaboratory/NimbusImage
- Fix sam2_refine/Dockerfile_M1 copy-paste bug (was copying from sam2_propagate)
- Fix stardist environment.yml duplicate deeptile install
- Delete 8 redundant environment.yml files (5 SAM2 + 3 cellpose)
- Update build_machine_learning_workers.sh to build base images first

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f104c3741

chatgpt-codex-connector · 2026-03-01T20:11:45Z

workers/annotations/stardist/Dockerfile

-RUN conda env create --file /environment.yml
+# Stardist requires specific pinned package versions, so recreate the environment
+COPY ./workers/annotations/stardist/environment.yml /
+RUN mamba env remove -n worker -y && mamba env create --file /environment.yml


Reset shell before removing the inherited worker env

This `RUN` now executes with the parent image shell from `workers/base_docker_images/Dockerfile.cuda_ml_worker_base` (`SHELL ["conda", "run", "-n", "worker", ...]`), so the command runs inside the same `worker` environment it tries to delete. In that context, `mamba env remove -n worker` can fail (or leave the layer in a broken state), which blocks building the Stardist image. Use a non-worker shell for the removal/recreate step before switching back to `conda run -n worker`.


… env The base image sets SHELL to conda run -n worker, so mamba env remove would run inside the environment it's trying to delete. Reset to /bin/bash first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
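
The fix described in this commit message can be sketched as follows. The `COPY` and `RUN` lines are taken from the reviewed diff and the base image tag from the PR description; the exact arguments of both `SHELL` instructions are assumptions, since the PR text elides them. The sketch also assumes the Stardist `environment.yml` declares `name: worker`, so that `mamba env create` recreates an environment with the same name.

```dockerfile
# Sketch only: reset the shell before removing the inherited environment.
FROM nimbusimage/cuda-ml-worker-base:latest

# Leave the inherited `conda run -n worker` shell so that the environment
# is not in use while it is being removed.
SHELL ["/bin/bash", "-c"]

# Stardist requires specific pinned package versions, so recreate the environment
COPY ./workers/annotations/stardist/environment.yml /
RUN mamba env remove -n worker -y && mamba env create --file /environment.yml

# Switch back so that subsequent RUN steps execute inside the recreated env
# (assumed form of the base image's SHELL instruction).
SHELL ["conda", "run", "-n", "worker", "/bin/bash", "-c"]
```

This relies on `mamba` being on `PATH` in a plain bash shell, which holds if the base image adds the conda `bin` directory to `PATH` via `ENV`.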

The shared cuda-ml-worker-base image caused CUDA version mismatches for the Cellpose family: pip-installed cellpose pulls PyTorch with the default CUDA 12.8, but the base image only has CUDA 11.8. The original standalone Dockerfiles let conda resolve PyTorch with the correct CUDA version via each worker's own environment.yml.

Changes:

- Restore standalone Dockerfiles for cellpose, cellpose_train, cellposesam
- Restore their environment.yml files (deleted by prior commit)
- Update Kitware/UPennContrast references to arjunrajlaboratory/NimbusImage
- Update build script to use worker directory as build context for these 3
- Add libstdcxx-ng and LD_LIBRARY_PATH fix to cuda_ml_worker_base (still used by stardist and SAM1 workers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
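
A mismatch of this kind could be surfaced at build time with a guard like the sketch below. This is not part of the PR: the `EXPECTED_CUDA` build argument and the check itself are hypothetical, and the sketch assumes the base image's inherited shell runs commands inside an environment where `pip` and `python` are available.

```dockerfile
# Sketch only: fail the build if pip pulled a PyTorch built for the wrong CUDA.
FROM nimbusimage/cuda-ml-worker-base:latest

# Hypothetical build argument naming the CUDA version the base image provides.
ARG EXPECTED_CUDA=11.8

# As described in the commit message, a plain pip install lets pip choose
# the PyTorch wheel, which may target a newer CUDA than the base image.
RUN pip install cellpose

# torch.version.cuda is the CUDA version PyTorch was built against
# (None for CPU-only builds). Exit non-zero on a mismatch to stop the build.
RUN python -c "import sys, torch; found = torch.version.cuda or 'none'; print('torch built for CUDA', found); sys.exit(0 if found.startswith('${EXPECTED_CUDA}') else 1)"
```

The guard only detects the problem. The commit above resolves it differently, by letting conda pick a matching PyTorch build through each worker's own environment.yml.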

The shared sam2-worker-base image approach caused GPU passthrough failures in girder_worker due to missing NVIDIA Docker labels and PyTorch CUDA version mismatches. Reverted all 5 SAM2 workers (automatic_mask_generator, fewshot_segmentation, propagate, refine, video) to standalone Dockerfiles with their own conda environments and environment.yml files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
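
The commit message does not say which NVIDIA labels were missing. As a hedged illustration only, a standalone GPU worker image might declare its GPU requirements along these lines. The label shown is the legacy nvidia-docker convention and is an assumption about what this project's workers use; the base image tag is also an assumption.

```dockerfile
# Sketch only: declare GPU requirements in a standalone worker image.
# The PR text does not name the labels involved; the label below is the
# legacy nvidia-docker convention and is an assumption.
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

LABEL com.nvidia.volumes.needed="nvidia_driver"

# Environment variables read by the NVIDIA Container Toolkit runtime.
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
```

Whatever labels the workers actually rely on, the labels present on a built image can be checked with `docker inspect --format '{{ json .Config.Labels }}' <image>`, in line with the label check in the test plan.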

arjunrajlab · 2026-03-04T15:48:09Z

Status: Archived

This PR is being archived rather than merged or closed. The bulk of the changes were reverted due to GPU passthrough issues with shared Docker base images.

What remains

Only 3 workers (SAM1 x2, stardist) still use the shared cuda-ml-worker-base image. The 8 GPU workers (Cellpose x3, SAM2 x5) were reverted to standalone Dockerfiles.

Learnings preserved

The technical learnings from this effort are documented in todo/ml-worker-build-optimization.md on master, with a tracking entry in todo/TODO_REGISTRY.md.

Future reference

If revisiting ML worker build optimization, start from the TODO doc rather than this PR branch. The key blocker was NVIDIA Container Toolkit GPU passthrough requiring specific Docker label configurations that conflict with shared base images.

Create todo/ directory with a registry and detailed doc preserving the technical learnings from the ML worker build optimization effort (PR #132), which was archived after GPU passthrough issues forced reverting most workers to standalone Dockerfiles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

arjunrajlab and others added 2 commits March 1, 2026 13:06

Update REGISTRY.md with shared ML base image information

5f104c3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Mar 1, 2026


arjunrajlab and others added 3 commits March 1, 2026 20:35

arjunrajlab marked this pull request as draft March 4, 2026 15:47

arjunrajlab added the archived Preserved for reference, not actively being pursued label Mar 4, 2026

arjunrajlab wants to merge 5 commits into master from refactor/ml-worker-dockerfiles-mamba-base-images
