Speed up Docker build: free disk space, drop broken cache#204
#204 · Abdelsalam-Abbas merged 9 commits into main
Conversation
- Switch from the GHA cache (10 GB limit) to a Docker Hub registry cache (no size limit)
- Split `pixi install` into separate RUN layers per environment
- Move the checkpoint COPY before the pixi installs for better cache reuse
- Build-only mode (`push: false`) for testing on this branch
Test build speed on self-hosted runner with ample disk/memory. Remove cache-to to avoid writing to registry during testing.
GPU runner can't schedule when GPUs are in use. The Docker build doesn't need a GPU, so use standard runner with ~30 GB reclaimed disk.
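A sketch of what that cleanup step could look like, using the `jlumbroso/free-disk-space` action's documented inputs (the exact set of toggles enabled here is an assumption, not necessarily this PR's configuration):

```yaml
# Run before actions/checkout so space is reclaimed before any build work.
- name: Free disk space
  uses: jlumbroso/free-disk-space@main  # review feedback below: pin to a commit SHA
  with:
    tool-cache: true       # hosted tool cache
    android: true          # preinstalled Android SDK
    dotnet: true           # preinstalled .NET SDK
    haskell: true          # preinstalled GHC/Stack
    large-packages: true   # misc. large apt packages
```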
Cache layers to diffuseproject/sampleworks:buildcache on Docker Hub. First build seeds the cache; subsequent builds skip unchanged layers (including the 10 GB checkpoint layer).
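A hedged sketch of the build step, assuming `docker/build-push-action` v5-style inputs; the `buildcache` ref comes from this PR, while the image tag is illustrative:

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: diffuseproject/sampleworks:latest   # illustrative tag
    cache-from: type=registry,ref=diffuseproject/sampleworks:buildcache
    # mode=max also caches intermediate layers, not just the final ones
    cache-to: type=registry,ref=diffuseproject/sampleworks:buildcache,mode=max
```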
Cache backends don't work well for this image size:
- GHA cache: 10 GB limit, too small
- Registry cache: fills runner disk during export
- Inline cache: limited to final layers only

The free-disk-space action alone makes builds reliable at ~15-20 min.
📝 Walkthrough: Added a pre-checkout disk cleanup step in CI and removed GitHub Actions cache directives from the Docker build; in the Dockerfile, moved the external checkpoints COPY ahead of the environment installs.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
.github/workflows/docker.yml (1)
Line 73: Consider wording the digest log as "pushed" for clarity. With `push: true`, "Image pushed with digest …" better reflects the job behavior.

`Dockerfile` (1)

Line 111: Pin the checkpoints source image to an immutable digest. Using `:latest` makes builds non-deterministic and weakens supply-chain reproducibility. Replace it with `@sha256:...` (optionally via an `ARG`) so builds are auditable and repeatable. Suggested change:

```diff
+ARG CHECKPOINTS_IMAGE=diffuseproject/sampleworks-checkpoints@sha256:<resolved_digest>
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=${CHECKPOINTS_IMAGE} /checkpoints/ /checkpoints/
```
Actionable comment:

In `.github/workflows/docker.yml`, line 27: the workflow references the third-party action via the floating ref `jlumbroso/free-disk-space@main`; replace it with a specific commit SHA (e.g., `jlumbroso/free-disk-space@<commit-sha>`) so CI uses a fixed, immutable version.
ℹ️ Review info: configuration defaults · profile CHILL · plan Pro · Run ID 11a2a681-7d4a-45d6-900b-dfd4832cfe1e
📒 Files selected for processing (2): `.github/workflows/docker.yml`, `Dockerfile`
marcuscollins
left a comment
@Abdelsalam-Abbas just left a couple comments/questions, but LGTM.
Abdelsalam-Abbas
left a comment
Nit: the image digest echo was changed from "pushed" to "built", but since push: true is set, "pushed" is the correct wording. This should stay as-is (or be reverted to the original).
Reverting to the original PR #204 ordering (checkpoints COPY before pixi installs) to capture `df -h` readings and verify the disk-usage theory. This will likely fail on 75 GB runners, but the `df` output will confirm why.
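One way to capture those readings (a debugging sketch, not the committed Dockerfile; the environment names and pixi's `-e/--environment` flag match what is discussed elsewhere in this PR):

```dockerfile
# Interleave df -h with the expensive steps so the build log records
# free disk before and after each layer is written.
RUN df -h /
RUN pixi install -e boltz && df -h /
RUN pixi install -e protenix && df -h /
RUN pixi install -e rf3 && df -h /
```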
Separate RUN commands for each pixi environment create overlay layers that duplicate shared conda packages (numpy, CUDA libs, etc.), consuming ~37 GB vs ~14 GB in a single layer. This causes disk-full failures on CI runners with 72 GB disks (ubuntu-latest provisions either 72 or 145 GB non-deterministically). Reverts the split introduced in #204 while keeping the other optimizations (free-disk-space, checkpoint layer ordering).
…ll (#206)

## Summary
- Combines the three separate `RUN pixi install` commands back into a single `RUN`
- Separate RUNs create overlay layers that duplicate shared conda packages (numpy, CUDA libs, etc.) across environments -- measured **~37 GB** (3 layers) vs **~14 GB** (1 layer)
- `ubuntu-latest` non-deterministically provisions runners with 72 GB or 145 GB disks. On 72 GB runners, the split-RUN approach exceeds available space during build

### Root cause investigation
The split was introduced in #204 for better layer caching. However, the three pixi environments share many conda packages, and overlay layers store full copies of files per layer. This ~23 GB overhead pushes the build past the disk limit on smaller runners.

Disk usage measured via `df -h` inside Docker build steps:

| Metric | Split RUN (3 layers) | Single RUN (1 layer) |
|--------|---------------------|----------------------|
| boltz | 8 GB | 8 GB |
| protenix | 12 GB (new layer) | 4 GB (shared pkgs deduped) |
| rf3 | 12 GB (new layer) | 2 GB (shared pkgs deduped) |
| **Total pixi disk** | **~37 GB** | **~14 GB** |

## Test plan
- [x] Verified split-RUN fails on 72 GB runners (jobs 70663224672, 70670232201)
- [x] Verified single-RUN disk usage via `df -h` debugging (job 70682099334)
- [x] Confirmed checkpoint image unchanged since March 25 (same SHA across all builds)

## Summary by CodeRabbit
* **Chores**
  * Consolidated multiple environment setup commands into a single layer, improving build performance, reducing container image overhead, and improving dependency caching, with no functional changes.
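The consolidated install might look like this (a sketch assuming pixi's `-e/--environment` flag and the environment names measured above; the real Dockerfile may differ):

```dockerfile
# One RUN -> one overlay layer: shared conda packages are written once
# instead of being duplicated across three per-environment layers.
RUN pixi install -e boltz \
 && pixi install -e protenix \
 && pixi install -e rf3
```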
Summary
- Added the `jlumbroso/free-disk-space` action to reclaim ~30 GB on the runner, preventing OOM/disk-full crashes
- Removed the GHA build cache (`type=gha`), which was counterproductive: the 10 GB limit caused unpredictable cache evictions on our ~20 GB image, leading to build times ranging 25-71 min

Tested builds on this branch: 12-20 min (runner variance) vs baseline 28 min average (up to 71 min with cache thrashing). No more disk-full failures.
Closes https://app.clickup.com/t/86e0pc15n
Test plan
- Verified builds on `ubuntu-latest` (runs #24156031787, #24159607975)