ci(ocapn-guile-interop): cache the Guix runtime store across runs #3264
Open
kriskowal wants to merge 1 commit into
kriskowal added a commit to kriskowal/garden that referenced this pull request May 15, 2026
9f1ac2d to fa035f5
fa035f5 to c89593c
Mirror of endojs/endo-but-for-bots#258 (head c89593c).
c89593c to 08263cd
Iteration III of the guix-CI resilience pattern. PR #82 (iteration I) established the two-substitute-server pattern and the installer-tarball cache; PR #255 (iteration II) reordered the substitute URLs and widened the polling and timeout windows after a Bordeaux outage exposed the "first server unreachable" slow path. The 2026-05-14 outage exposed a third failure mode that neither prior iteration addressed: both substitute servers degraded simultaneously, with the result that the daemon could not resolve the runtime closure (guile + fibers + websocket + gnutls + gcrypt) from either upstream. Reorder and wider timeouts did not help because no amount of waiting brings up a server that is down.
The existing cache step amortizes only the installer tarball, not the runtime store the daemon resolves at runtime. Each run pays the full substitute-fetch cost end-to-end, and a both-servers-degraded day means that cost cannot be paid at all.
This change adds a second `actions/cache` step that caches the daemon's runtime store (`/gnu/store`) together with the daemon database (`/var/guix/db`). Both paths are root-owned with strict permissions, so the cache targets a runner-owned staging directory containing a zstd tarball of the two store paths; paired `sudo tar` shell steps wrap the actual extract and create, matching the install step's existing `sudo tar --extract` pattern. On a cache hit the daemon's next `guix build` finds the resolved closure already on disk and the daemon DB already records it as valid, so the substitute round-trip is short-circuited entirely. A degraded-substitute-server day no longer blocks the workflow's runtime path.
The cache key includes the pinned Guix version (a version bump may change the on-disk DB schema) and a hash of the workflow file (the package set and daemon configuration both live in the workflow, so any change to either forces a fresh snapshot). A `restore-keys` prefix lets a workflow edit that does not actually invalidate the store (a comment tweak, a timeout bump) still seed from the prior snapshot and re-save at job end.
The daemon is stopped across the restore extract so the on-disk store and the daemon's in-memory view of it cannot diverge mid-flight; a divergence would surface later as missing-store-item errors from `guix build`. The snapshot step at the end does not need to stop the daemon because tar captures the SQLite DB as a point-in-time copy at the filesystem layer.
The change is additive to iteration II's reorder + widen; both mitigations stay in place and remain load-bearing for the one-server-degraded case where the cache key changes (a workflow edit that flushes the snapshot leaves the daemon dependent on substitute fetches for the next run).
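
The staging-directory pattern described above could look roughly like the following shell sketch. It is an illustration under stated assumptions, not the workflow's actual steps: the function names `snapshot_store` and `restore_store`, the archive name `store.tar`, and the `COMPRESS` and `SUDO` overrides are all hypothetical, and the root directory is a parameter so the functions can be tried against an unprivileged scratch tree.

```shell
#!/bin/sh
# Sketch only: names, archive layout, and overrides are assumptions for
# illustration; they are not copied from the workflow in this PR.
set -eu

# Compression program handed to tar; the PR describes a zstd tarball.
COMPRESS="${COMPRESS:-zstd}"
# Privilege wrapper; set SUDO="" to exercise the functions on scratch paths.
SUDO="${SUDO-sudo}"

# snapshot_store ROOT STAGING
# Archive ROOT/gnu/store and ROOT/var/guix/db into STAGING/store.tar.
snapshot_store() {
  root="$1"; staging="$2"
  mkdir -p "$staging"
  $SUDO tar --create --use-compress-program="$COMPRESS" \
    --file="$staging/store.tar" \
    --directory="$root" gnu/store var/guix/db
  # Make the calling (runner) user the owner of the archive in staging.
  $SUDO chown "$(id -u):$(id -g)" "$staging/store.tar"
}

# restore_store ROOT STAGING
# Unpack STAGING/store.tar under ROOT; do nothing when no archive exists.
restore_store() {
  root="$1"; staging="$2"
  [ -f "$staging/store.tar" ] || return 0   # cache miss: nothing to restore
  $SUDO tar --extract --use-compress-program="$COMPRESS" \
    --file="$staging/store.tar" --directory="$root"
}

# Hypothetical use on a runner, assuming the daemon is a systemd unit
# named guix-daemon and /home/runner/guix-stage is the cached directory:
#   sudo systemctl stop guix-daemon
#   restore_store / /home/runner/guix-stage
#   sudo systemctl start guix-daemon
#   ... run the job ...
#   snapshot_store / /home/runner/guix-stage
```

Keeping the privileged work inside the two `tar` calls means the cache step itself only ever touches the runner-owned staging directory, which is the property the commit message relies on.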
08263cd to 5036d22
Description
Adds a second `actions/cache` step to the `ocapn-guile-interop` workflow that caches the `guix-daemon` runtime store (`/gnu/store`) and the daemon database (`/var/guix/db`) across runs, so a workflow run on which both substitute servers are degraded can still resolve the Guile + fibers + websocket + gnutls + gcrypt closure from the local store rather than blocking on substitute fetches.
Third iteration of the guix-CI resilience pattern. `246c6a6c` (iteration I) added the second substitute server and the installer-tarball cache. `0ec70c6d` (iteration II) reordered the substitute URLs and widened the polling and timeout windows after a Bordeaux outage exposed the "first server unreachable" slow path. The 2026-05-14 outage exposed a third failure mode neither prior addressed: both substitute servers degraded simultaneously, with the result that the daemon could not resolve the runtime closure from either upstream. Reorder and wider timeouts did not help because no amount of waiting brings up a server that is down.
The existing installer-tarball cache amortizes only the installer, not the runtime store. Each run pays the full substitute-fetch cost end-to-end, and a both-servers-degraded day means that cost cannot be paid at all. The new cache step targets the runtime store itself: on a cache hit the daemon's next `guix build` finds the resolved closure already on disk and the daemon DB already records it as valid, so the substitute round-trip is short-circuited entirely.
The two cached paths are root-owned with strict permissions, so the `actions/cache` step targets a runner-owned staging directory containing a zstd tarball of the two store paths; paired `sudo tar` shell steps wrap the extract and the create, matching the install step's existing `sudo tar --extract` pattern. The cache key includes the pinned Guix version (a version bump may change the on-disk DB schema) and a hash of the workflow file (the package set and daemon configuration both live in the workflow, so any change to either forces a fresh snapshot). A `restore-keys` prefix lets a workflow edit that does not actually invalidate the store still seed from the prior snapshot and re-save at job end.
The daemon is stopped across the restore extract so the on-disk store and the daemon's in-memory view of it cannot diverge mid-flight; a divergence would surface later as missing-store-item errors from `guix build`. The snapshot step at the end does not need to stop the daemon because tar captures the SQLite DB as a point-in-time copy at the filesystem layer.
The change is additive to iteration II's reorder + widen; both mitigations stay in place and remain load-bearing for the one-server-degraded case where the cache key changes (a workflow edit that flushes the snapshot leaves the daemon dependent on substitute fetches for the next run).
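
The key scheme described above (pinned version plus workflow hash, with a version-only restore prefix) can be sketched in shell as follows. This is only an illustration: in the workflow itself the hash would more likely come from the `hashFiles()` expression, and the `guix-store-` prefix, the function names, and the `X.Y.Z` version placeholder are assumptions rather than values taken from this PR.

```shell
#!/bin/sh
# Sketch only: the key layout and function names are hypothetical.
set -eu

# cache_key VERSION WORKFLOW_FILE
# Exact key: changes whenever the pinned version or the file content changes.
cache_key() {
  version="$1"; workflow="$2"
  digest=$(sha256sum "$workflow" | cut -d' ' -f1)
  printf 'guix-store-%s-%s\n' "$version" "$digest"
}

# restore_prefix VERSION
# Prefix for restore-keys: matches any snapshot saved for the same version,
# regardless of which workflow revision produced it.
restore_prefix() {
  printf 'guix-store-%s-\n' "$1"
}

# Hypothetical use:
#   cache_key X.Y.Z .github/workflows/ocapn-guile-interop.yml
#   restore_prefix X.Y.Z
```

Because every exact key begins with the version-only prefix, a workflow edit misses on the exact key but can still be seeded from the most recent snapshot for the same Guix version, while a version bump changes the prefix and forces a cold start.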
Security Considerations
No change to the trust boundary. Substitute servers and their authorized keys are untouched. The cached payload is the daemon's own resolved store; it is keyed by the pinned Guix version and the workflow file's content hash, so a change to the package set or daemon configuration invalidates the cache rather than letting a stale closure persist across configurations. The `sudo tar` invocations operate on a runner-owned staging path; the privileged step is exactly the extract and the create, mirroring the install step's existing pattern.
Scaling Considerations
Cache hits trade a substitute-fetch round-trip (variable, multi-minute on a slow day) for a local tar extract (bounded, seconds). Cache misses regenerate the snapshot at job end and pay the substitute cost for that one run, falling back to iteration II's behavior. Cache storage is sized by the runtime closure (Guile + fibers + websocket + gnutls + gcrypt and their transitive dependencies), which is small relative to GitHub's per-repository cache budget; the prior installer-tarball cache continues to live alongside it.
Documentation Considerations
No user-facing or API-facing surface changes. The cache step's comments explain the staging-directory pattern and the daemon-stop-across-restore invariant adjacent to the lines they govern.
Testing Considerations
The change is CI-workflow-only and is exercised by the workflow itself on every PR. Cache-hit behavior can be observed once a baseline run has populated the cache; cache-miss behavior is exercised by any workflow edit that changes the key, and degraded-substitute behavior can only be confirmed during an actual upstream outage. The PR is opened as draft so the next slow-substitute day can validate the behavior on the bot-side workflow before flagging for review.
Compatibility Considerations
No effect on consumers of any published package. The change is internal to the CI workflow.
Upgrade Considerations
None.