Speed up Docker build: free disk space, drop broken cache #204

Merged: Abdelsalam-Abbas merged 9 commits into main from speed-up-docker-build on Apr 9, 2026

Abdelsalam-Abbas commented Apr 8, 2026

Summary

  • Add jlumbroso/free-disk-space action to reclaim ~30 GB on the runner, preventing OOM/disk-full crashes
  • Remove GHA cache (type=gha), which was counterproductive: the 10 GB limit caused unpredictable cache evictions on our ~20 GB image, leading to build times ranging from 25 to 71 min
  • Split pixi install into separate RUN layers per environment for better local cache reuse
  • Move checkpoint COPY before pixi installs so the ~10 GB checkpoint layer stays cached when only dependencies change

Tested builds on this branch: 12-20 min (runner variance) vs baseline 28 min average (up to 71 min with cache thrashing). No more disk-full failures.
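
For readers without the diff open, here is a minimal sketch of the workflow shape this summary describes. It is a fragment under stated assumptions, not a copy of .github/workflows/docker.yml: the job name, the action versions (`@v4`, `@v3`, `@v6`), and the image tag are hypothetical, and the registry login step is omitted. The free-disk-space inputs follow the settings listed in the CodeRabbit walkthrough (tool-cache disabled; Android, .NET, Haskell, large-packages and swap cleanup enabled).

```yaml
# Sketch only: job name, action versions, and the image tag are assumptions.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Runs before checkout so the reclaimed space is available to the build.
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main  # floating ref; the review asks for a pinned commit SHA
        with:
          tool-cache: false      # leave the hosted tool cache in place
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: true

      - uses: actions/checkout@v4            # version is an assumption

      - uses: docker/setup-buildx-action@v3  # version is an assumption

      - name: Build and push
        id: build-push
        uses: docker/build-push-action@v6    # version is an assumption
        with:
          context: .
          push: true
          tags: diffuseproject/sampleworks:latest  # hypothetical tag
          # No cache-from / cache-to entries: the type=gha cache was removed.
```

With no cache backend configured, every CI build starts cold, so the per-layer ordering in the Dockerfile mainly benefits local rebuilds.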

Closes https://app.clickup.com/t/86e0pc15n

Test plan

  • Verified build succeeds with free-disk-space on ubuntu-latest (runs #24156031787, #24159607975)
  • Confirmed GHA cache removal doesn't regress build time (cache was net-negative)
  • Tested the registry cache alternative: the runner disk is too small for cache export, so it is not viable

Summary by CodeRabbit

  • Chores
    • Added automatic disk cleanup step on CI runners to prevent storage-related build failures.
    • Removed CI-level cache integration for Docker builds to simplify caching behaviour.
    • Reorganized image build steps and split dependency installs into separate layers to reduce unnecessary rebuilds.
    • Kept the final build output message showing the pushed image digest.

- Switch from GHA cache (10 GB limit) to Docker Hub registry cache (unlimited)
- Split pixi install into separate RUN layers per environment
- Move checkpoint COPY before pixi installs for better cache reuse
- Build-only mode (push: false) for testing on this branch

Test build speed on self-hosted runner with ample disk/memory.
Remove cache-to to avoid writing to registry during testing.

GPU runner can't schedule when GPUs are in use. The Docker build
doesn't need a GPU, so use standard runner with ~30 GB reclaimed disk.

Cache layers to diffuseproject/sampleworks:buildcache on Docker Hub.
First build seeds the cache; subsequent builds skip unchanged layers
(including the 10 GB checkpoint layer).

Cache backends don't work well for this image size:
- GHA cache: 10 GB limit, too small
- Registry cache: fills runner disk during export
- Inline cache: limited to final layers only

The free-disk-space action alone makes builds reliable at ~15-20 min.

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6f5d5d6-4f52-4b18-841e-f1a4141f7274

📥 Commits

Reviewing files that changed from the base of the PR and between b22f0a0 and c8b7b70.

📒 Files selected for processing (1)
  • .github/workflows/docker.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/docker.yml

Walkthrough

Added a pre-checkout disk cleanup step in CI and removed GitHub Actions cache directives from the Docker build; in the Dockerfile, moved external checkpoints COPY earlier and split Pixi dependency installation into three separate RUN layers.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Actions Workflow (.github/workflows/docker.yml) | Added jlumbroso/free-disk-space pre-checkout step (tool-cache disabled, Android/.NET/Haskell/large-packages/swap cleanup enabled); removed cache-from: type=gha and cache-to: type=gha,mode=max from the Docker build/push. Final digest log remains unchanged. |
| Dockerfile (Dockerfile) | Moved COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/ to occur before Pixi installs; replaced one chained Pixi install RUN with three separate RUN layers (one per environment) and removed the later duplicate checkpoint COPY. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • marcuscollins

Poem

🐰 I nibbled logs and cleared the space,
split installs with a careful pace.
Checkpoints hopped up, snug in their place,
CI breathes easy, builds embrace. 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the two main changes: adding disk space freeing and removing broken GitHub Actions cache, which are the primary objectives for speeding up Docker builds. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Abdelsalam-Abbas requested review from k-chrispens, marcuscollins and xraymemory and removed request for xraymemory on April 8, 2026 21:58

coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/docker.yml (1)

73-73: Consider wording the digest log as “pushed” for clarity.

With push: true, “Image pushed with digest …” better reflects the job behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/docker.yml at line 73, Update the run step that currently
echoes "Image built with digest ${{ steps.build-push.outputs.digest }}" to
reflect that the image was pushed (since push: true); modify the command which
uses steps.build-push.outputs.digest so the log reads "Image pushed with digest
…" to improve clarity.
Dockerfile (1)

111-111: Pin the checkpoints source image to an immutable digest.

Using :latest makes builds non-deterministic and weakens supply-chain reproducibility. Replace with @sha256:... (optionally via ARG) so builds are auditable and repeatable.

Suggested change
+ARG CHECKPOINTS_IMAGE=diffuseproject/sampleworks-checkpoints@sha256:<resolved_digest>
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=${CHECKPOINTS_IMAGE} /checkpoints/ /checkpoints/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` at line 111, The Dockerfile currently pulls the checkpoints image
using the mutable tag diffuseproject/sampleworks-checkpoints:latest in the COPY
--from stage; change this to reference an immutable digest (e.g.,
diffuseproject/sampleworks-checkpoints@sha256:...) or expose the digest via an
ARG and use that ARG in the COPY --from to ensure deterministic, auditable
builds and reproducible supply chain behavior; update any relevant build docs or
CI to set the ARG if you choose that route.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/docker.yml:
- Line 27: The workflow currently references the third-party action via a
floating ref "jlumbroso/free-disk-space@main"; replace that with a specific
commit SHA to pin the action (e.g., "jlumbroso/free-disk-space@<commit-sha>") so
the CI uses a fixed immutable version, updating the uses entry where "uses:
jlumbroso/free-disk-space@main" appears in the Docker workflow file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 11a2a681-7d4a-45d6-900b-dfd4832cfe1e

📥 Commits

Reviewing files that changed from the base of the PR and between cd0e7e6 and 898d3e5.

📒 Files selected for processing (2)
  • .github/workflows/docker.yml
  • Dockerfile

marcuscollins left a comment

@Abdelsalam-Abbas just left a couple comments/questions, but LGTM.

Abdelsalam-Abbas left a comment

Nit: the image digest echo was changed from "pushed" to "built", but since push: true is set, "pushed" is the correct wording. This should stay as-is (or be reverted to the original).

Abdelsalam-Abbas merged commit 41512b1 into main on Apr 9, 2026
1 check passed
Abdelsalam-Abbas added a commit that referenced this pull request Apr 9, 2026
Reverting to the original PR #204 ordering (checkpoints COPY before pixi
installs) to capture df -h readings and verify disk usage theory. This
will likely fail on 75G runners but the df output will confirm why.
Abdelsalam-Abbas added a commit that referenced this pull request Apr 9, 2026
Separate RUN commands for each pixi environment create overlay layers
that duplicate shared conda packages (numpy, CUDA libs, etc.), consuming
~37 GB vs ~14 GB in a single layer. This causes disk-full failures on
CI runners with 72 GB disks (ubuntu-latest provisions either 72 or
145 GB non-deterministically).

Reverts the split introduced in #204 while keeping the other
optimizations (free-disk-space, checkpoint layer ordering).
marcuscollins pushed a commit that referenced this pull request Apr 9, 2026
…ll (#206)

## Summary
- Combines the three separate `RUN pixi install` commands back into a
single `RUN`
- Separate RUNs create overlay layers that duplicate shared conda
packages (numpy, CUDA libs, etc.) across environments -- measured **~37
GB** (3 layers) vs **~14 GB** (1 layer)
- `ubuntu-latest` non-deterministically provisions runners with 72 GB or
145 GB disks. On 72 GB runners, the split-RUN approach exceeds available
space during build

### Root cause investigation
The split was introduced in #204 for better layer caching. However, the
three pixi environments share many conda packages, and overlay layers
store full copies of files per layer. This ~23 GB overhead pushes the
build past the disk limit on smaller runners.

Disk usage measured via `df -h` inside Docker build steps:

| Metric | Split RUN (3 layers) | Single RUN (1 layer) |
|--------|---------------------|---------------------|
| boltz | 8 GB | 8 GB |
| protenix | 12 GB (new layer) | 4 GB (shared pkgs deduped) |
| rf3 | 12 GB (new layer) | 2 GB (shared pkgs deduped) |
| **Total pixi disk** | **~37 GB** | **~14 GB** |

## Test plan
- [x] Verified split-RUN fails on 72 GB runners (jobs 70663224672,
70670232201)
- [x] Verified single-RUN disk usage via df -h debugging (job
70682099334)
- [x] Confirmed checkpoint image unchanged since March 25 (same SHA
across all builds)
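
Because the runner disk size varies between jobs, it can help to log it in every run. The steps below are a hypothetical sketch, not taken from the repository's workflow: the step names are illustrative, and they report usage on the runner itself rather than inside the Docker build steps where the figures above were measured.

```yaml
# Hypothetical diagnostic steps; names and placement are assumptions.
- name: Show runner disk before build
  run: df -h /

# ... the Docker build step runs here ...

- name: Show disk and Docker usage after build
  if: always()   # still report when the build fails with disk-full
  run: |
    df -h /
    docker system df
```

The `if: always()` condition keeps the second step running after a failed build, which is when the numbers matter most.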

## Summary by CodeRabbit

* **Chores**
  * Streamlined the deployment build process by consolidating multiple
    environment setup commands into a single optimized layer, resulting in
    improved build performance, reduced container image overhead, better
    dependency caching efficiency, and enhanced operational efficiency
    during containerization and deployment cycles while maintaining full
    functionality.

k-chrispens deleted the speed-up-docker-build branch on April 22, 2026 00:26