Speed up Docker build: free disk space, drop broken cache#204
#204 · Abdelsalam-Abbas merged 9 commits into main
Conversation
- Switch from the GHA cache (10 GB limit) to a Docker Hub registry cache (no size limit)
- Split `pixi install` into separate RUN layers per environment
- Move the checkpoint COPY before the pixi installs for better cache reuse
- Build-only mode (`push: false`) for testing on this branch
Test build speed on self-hosted runner with ample disk/memory. Remove cache-to to avoid writing to registry during testing.
GPU runner can't schedule when GPUs are in use. The Docker build doesn't need a GPU, so use standard runner with ~30 GB reclaimed disk.
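A sketch of what that cleanup step could look like, using the `jlumbroso/free-disk-space` action's documented inputs (the exact set of toggles enabled here is an assumption, not necessarily this PR's configuration):

```yaml
# Run before actions/checkout so space is reclaimed before any build work.
- name: Free disk space
  uses: jlumbroso/free-disk-space@main  # review feedback below: pin to a commit SHA
  with:
    tool-cache: true       # hosted tool cache
    android: true          # preinstalled Android SDK
    dotnet: true           # preinstalled .NET SDK
    haskell: true          # preinstalled GHC/Stack
    large-packages: true   # misc. large apt packages
```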
Cache layers to diffuseproject/sampleworks:buildcache on Docker Hub. First build seeds the cache; subsequent builds skip unchanged layers (including the 10 GB checkpoint layer).
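A hedged sketch of the build step, assuming `docker/build-push-action` v5-style inputs; the `buildcache` ref comes from this PR, while the image tag is illustrative:

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    push: true
    tags: diffuseproject/sampleworks:latest   # illustrative tag
    cache-from: type=registry,ref=diffuseproject/sampleworks:buildcache
    # mode=max also caches intermediate layers, not just the final ones
    cache-to: type=registry,ref=diffuseproject/sampleworks:buildcache,mode=max
```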
Cache backends don't work well for this image size:
- GHA cache: 10 GB limit, too small
- Registry cache: fills runner disk during export
- Inline cache: limited to final layers only

The free-disk-space action alone makes builds reliable at ~15-20 min.
📝 Walkthrough: Added a pre-checkout disk cleanup step in CI and removed GitHub Actions cache directives from the Docker build; in the Dockerfile, moved the external checkpoints COPY ahead of the environment installs.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
.github/workflows/docker.yml (1)
Line 73: Consider wording the digest log as "pushed" for clarity. With `push: true`, "Image pushed with digest …" better reflects the job behavior.

`Dockerfile` (1)

Line 111: Pin the checkpoints source image to an immutable digest. Using `:latest` makes builds non-deterministic and weakens supply-chain reproducibility. Replace it with `@sha256:...` (optionally via an `ARG`) so builds are auditable and repeatable. Suggested change:

```diff
+ARG CHECKPOINTS_IMAGE=diffuseproject/sampleworks-checkpoints@sha256:<resolved_digest>
-COPY --from=diffuseproject/sampleworks-checkpoints:latest /checkpoints/ /checkpoints/
+COPY --from=${CHECKPOINTS_IMAGE} /checkpoints/ /checkpoints/
```
Actionable comment:

In `.github/workflows/docker.yml`, line 27: the workflow references the third-party action via the floating ref `jlumbroso/free-disk-space@main`; replace it with a specific commit SHA (e.g., `jlumbroso/free-disk-space@<commit-sha>`) so CI uses a fixed, immutable version.
ℹ️ Review info: configuration defaults · profile CHILL · plan Pro · Run ID 11a2a681-7d4a-45d6-900b-dfd4832cfe1e
📒 Files selected for processing (2): `.github/workflows/docker.yml`, `Dockerfile`
marcuscollins
left a comment
@Abdelsalam-Abbas just left a couple comments/questions, but LGTM.
Abdelsalam-Abbas
left a comment
Nit: the image digest echo was changed from "pushed" to "built", but since push: true is set, "pushed" is the correct wording. This should stay as-is (or be reverted to the original).
Reverting to the original PR #204 ordering (checkpoints COPY before pixi installs) to capture `df -h` readings and verify the disk-usage theory. This will likely fail on 75 GB runners, but the `df` output will confirm why.
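One way to capture those readings (a debugging sketch, not the committed Dockerfile; the environment names and pixi's `-e/--environment` flag match what is discussed elsewhere in this PR):

```dockerfile
# Interleave df -h with the expensive steps so the build log records
# free disk before and after each layer is written.
RUN df -h /
RUN pixi install -e boltz && df -h /
RUN pixi install -e protenix && df -h /
RUN pixi install -e rf3 && df -h /
```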
Separate RUN commands for each pixi environment create overlay layers that duplicate shared conda packages (numpy, CUDA libs, etc.), consuming ~37 GB vs ~14 GB in a single layer. This causes disk-full failures on CI runners with 72 GB disks (ubuntu-latest provisions either 72 or 145 GB non-deterministically). Reverts the split introduced in #204 while keeping the other optimizations (free-disk-space, checkpoint layer ordering).
…ll (#206)

## Summary
- Combines the three separate `RUN pixi install` commands back into a single `RUN`
- Separate RUNs create overlay layers that duplicate shared conda packages (numpy, CUDA libs, etc.) across environments -- measured **~37 GB** (3 layers) vs **~14 GB** (1 layer)
- `ubuntu-latest` non-deterministically provisions runners with 72 GB or 145 GB disks. On 72 GB runners, the split-RUN approach exceeds available space during build

### Root cause investigation
The split was introduced in #204 for better layer caching. However, the three pixi environments share many conda packages, and overlay layers store full copies of files per layer. This ~23 GB overhead pushes the build past the disk limit on smaller runners.

Disk usage measured via `df -h` inside Docker build steps:

| Metric | Split RUN (3 layers) | Single RUN (1 layer) |
|--------|---------------------|----------------------|
| boltz | 8 GB | 8 GB |
| protenix | 12 GB (new layer) | 4 GB (shared pkgs deduped) |
| rf3 | 12 GB (new layer) | 2 GB (shared pkgs deduped) |
| **Total pixi disk** | **~37 GB** | **~14 GB** |

## Test plan
- [x] Verified split-RUN fails on 72 GB runners (jobs 70663224672, 70670232201)
- [x] Verified single-RUN disk usage via `df -h` debugging (job 70682099334)
- [x] Confirmed checkpoint image unchanged since March 25 (same SHA across all builds)

## Summary by CodeRabbit
* **Chores**
  * Consolidated multiple environment setup commands into a single layer, improving build performance, reducing container image overhead, and improving dependency caching, with no functional changes.
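The consolidated install might look like this (a sketch assuming pixi's `-e/--environment` flag and the environment names measured above; the real Dockerfile may differ):

```dockerfile
# One RUN -> one overlay layer: shared conda packages are written once
# instead of being duplicated across three per-environment layers.
RUN pixi install -e boltz \
 && pixi install -e protenix \
 && pixi install -e rf3
```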
Summary
- Added the `jlumbroso/free-disk-space` action to reclaim ~30 GB on the runner, preventing OOM/disk-full crashes
- Removed the GHA build cache (`type=gha`), which was counterproductive: the 10 GB limit caused unpredictable cache evictions on our ~20 GB image, leading to build times ranging 25-71 min

Tested builds on this branch: 12-20 min (runner variance) vs baseline 28 min average (up to 71 min with cache thrashing). No more disk-full failures.
Closes https://app.clickup.com/t/86e0pc15n
Test plan
- Verified builds on `ubuntu-latest` (runs #24156031787, #24159607975)