
ci(ocapn-guile-interop): cache the Guix runtime store across runs #3264

Open
kriskowal wants to merge 1 commit into master from kriskowal-ocapn-guile-interop-cache-store

Conversation

@kriskowal

Member

Description

Adds a second actions/cache step to the ocapn-guile-interop workflow that caches the guix-daemon runtime store (/gnu/store) and the daemon database (/var/guix/db) across runs. A run during which both substitute servers are degraded can then resolve the Guile + fibers + websocket + gnutls + gcrypt closure from the local store rather than blocking on substitute fetches.

Third iteration of the guix-CI resilience pattern. 246c6a6c (iteration I) added the second substitute server and the installer-tarball cache. 0ec70c6d (iteration II) reordered the substitute URLs and widened the polling and timeout windows after a Bordeaux outage exposed the "first server unreachable" slow path. The 2026-05-14 outage exposed a third failure mode that neither prior iteration addressed: both substitute servers degraded simultaneously, so the daemon could not resolve the runtime closure from either upstream. Reordering and wider timeouts did not help because no amount of waiting brings up a server that is down.

The existing installer-tarball cache amortizes only the installer, not the runtime store. Each run pays the full substitute-fetch cost end-to-end, and a both-servers-degraded day means that cost cannot be paid at all. The new cache step targets the runtime store itself: on a cache hit the daemon's next guix build finds the resolved closure already on disk and the daemon DB already records it as valid, so the substitute round-trip is short-circuited entirely.

The two cached paths are root-owned with strict permissions, so the actions/cache step targets a runner-owned staging directory containing a zstd tarball of the two store paths; paired sudo tar shell steps wrap the extract and the create, matching the install step's existing sudo tar --extract pattern. The cache key includes the pinned Guix version (a version bump may change the on-disk DB schema) and a hash of the workflow file (the package set and daemon configuration both live in the workflow, so any change to either forces a fresh snapshot). A restore-keys prefix lets a workflow edit that does not actually invalidate the store still seed from the prior snapshot and re-save at job end.
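The key scheme described above can be sketched in shell. This is only an illustration under stated assumptions: the `guix-store` prefix, the helper names `cache_key` and `restore_prefix`, and the version `1.4.0` are placeholders rather than values taken from the workflow, and the workflow itself may compute the hash with an expression instead of `sha256sum`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical key layout: <prefix>-<guix version>-<sha256 of workflow file>.
# The "guix-store" prefix is an illustrative name, not taken from the PR.
cache_key() {
  local guix_version="$1" workflow_file="$2" digest
  digest="$(sha256sum "$workflow_file" | cut -d' ' -f1)"
  printf 'guix-store-%s-%s\n' "$guix_version" "$digest"
}

# Restore prefix: the same key minus the workflow hash, so any snapshot
# taken under the same Guix version can seed the store.
restore_prefix() {
  printf 'guix-store-%s-\n' "$1"
}

# Demo with a throwaway file standing in for the workflow file.
# 1.4.0 is a placeholder, not the version pinned by the workflow.
workflow="$(mktemp)"
printf 'name: example\n' > "$workflow"
key="$(cache_key 1.4.0 "$workflow")"
prefix="$(restore_prefix 1.4.0)"
case "$key" in
  "$prefix"*) echo "would fall back to snapshots under: $prefix" ;;
esac
rm -f "$workflow"
```

Any edit to the workflow file changes the digest and so misses the exact key, while the version-only prefix still matches older snapshots. A Guix version bump changes the prefix too, which is what keeps a snapshot with a possibly different DB schema from being restored.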

The daemon is stopped across the restore extract so the on-disk store and the daemon's in-memory view of it cannot diverge mid-flight; a divergence would surface later as missing-store-item errors from guix build. The snapshot step at the end does not need to stop the daemon because tar captures the SQLite DB as a point-in-time copy at the filesystem layer.
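The stop, extract, start ordering can be sketched as follows. The sketch assumes a systemd unit named `guix-daemon` and a zstd archive whose members are relative to `/`; the function name `restore_store` and the `SUDO` and `SVC` override variables are hypothetical, added only so the sequence can be dry-run without root.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumptions (not from the PR): a systemd unit called guix-daemon and an
# archive whose members are relative to the filesystem root.
# SUDO and SVC are overridable so the sequence can be exercised without root.
SUDO="${SUDO-sudo}"
SVC="${SVC-systemctl}"

restore_store() {
  local archive="$1" root="${2:-/}"
  if [ ! -f "$archive" ]; then
    # Cache miss: nothing to extract, so the daemon is left running.
    echo "cache miss: no snapshot at $archive"
    return 0
  fi
  # Stop the daemon first so it never observes a half-extracted store.
  $SUDO $SVC stop guix-daemon
  # If the extract fails, set -e aborts here and the job fails visibly.
  $SUDO tar --extract --zstd --file "$archive" --directory "$root"
  # Restart only after both the store and the DB are fully on disk.
  $SUDO $SVC start guix-daemon
}
```

On a cache miss the function leaves the daemon untouched, so the run falls through to ordinary substitute fetches.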

The change is additive to iteration II's reorder + widen. Both mitigations stay in place and remain load-bearing whenever the cache key changes while one server is degraded: a workflow edit that flushes the snapshot leaves the daemon dependent on substitute fetches for the next run.

Security Considerations

No change to the trust boundary. Substitute servers and their authorized keys are untouched. The cached payload is the daemon's own resolved store; it is keyed by the pinned Guix version and the workflow file's content hash, so a change to the package set or daemon configuration invalidates the cache rather than letting a stale closure persist across configurations. The sudo tar invocations operate on a runner-owned staging path; the privileged step is exactly the extract and the create, mirroring the install step's existing pattern.

Scaling Considerations

Cache hits trade a substitute-fetch round-trip (variable, multi-minute on a slow day) for a local tar extract (bounded, seconds). Cache misses regenerate the snapshot at job end and pay the substitute cost for that one run, falling back to iteration II's behavior. Cache storage is sized by the runtime closure (Guile + fibers + websocket + gnutls + gcrypt and their transitive dependencies), which is small relative to GitHub's per-repository cache budget; the prior installer-tarball cache continues to live alongside it.
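The job-end snapshot on a cache miss might look roughly like the sketch below. The archive name `guix-store.tar.zst`, the function name `snapshot_store`, and the `SUDO` override are assumptions made for illustration; only the two cached paths and the zstd tarball in a runner-owned staging directory come from the description.

```shell
#!/usr/bin/env bash
set -euo pipefail

# SUDO is overridable so the function can be tried against a scratch root.
SUDO="${SUDO-sudo}"

# Write a zstd tarball of the store and the daemon DB into a staging
# directory that the unprivileged cache step can read.
snapshot_store() {
  local staging="$1" root="${2:-/}"
  local archive="$staging/guix-store.tar.zst"  # hypothetical file name
  mkdir -p "$staging"
  # Only the tar invocation needs root: the two paths are root-owned.
  $SUDO tar --create --zstd --file "$archive" \
    --directory "$root" gnu/store var/guix/db
  # Hand the archive back to the invoking user so the cache step can upload it.
  $SUDO chown "$(id -u):$(id -g)" "$archive"
  echo "$archive"
}
```

A step could call `snapshot_store "$RUNNER_TEMP/guix-store-cache"`, where the directory name is again a placeholder, and point the cache step's path at that directory.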

Documentation Considerations

No user-facing or API-facing surface changes. The cache step's comments explain the staging-directory pattern and the daemon-stop-across-restore invariant adjacent to the lines they govern.

Testing Considerations

The change is CI-workflow-only and is exercised by the workflow itself on every PR. Cache-hit behavior can be observed once a baseline run has populated the cache; cache-miss behavior is exercised by any workflow edit that changes the key, and degraded-substitute behavior can only be confirmed during an actual upstream outage. The PR is opened as draft so the next slow-substitute day can validate the behavior on the bot-side workflow before flagging for review.

Compatibility Considerations

No effect on consumers of any published package. The change is internal to the CI workflow.

Upgrade Considerations

None.

@changeset-bot

changeset-bot Bot commented May 15, 2026


⚠️ No Changeset found

Latest commit: 5036d22

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


kriskowal added a commit to kriskowal/garden that referenced this pull request May 15, 2026
@kriskowal kriskowal force-pushed the kriskowal-ocapn-guile-interop-cache-store branch from 9f1ac2d to fa035f5 Compare May 15, 2026 04:34
@kriskowal kriskowal requested a review from kumavis May 15, 2026 04:34
@kriskowal kriskowal marked this pull request as ready for review May 22, 2026 02:39
@kriskowal kriskowal force-pushed the kriskowal-ocapn-guile-interop-cache-store branch from fa035f5 to c89593c Compare May 22, 2026 02:39
@kriscendobot


Mirror of endojs/endo-but-for-bots#258 (head c89593c).

@kriskowal kriskowal force-pushed the kriskowal-ocapn-guile-interop-cache-store branch from c89593c to 08263cd Compare May 29, 2026 20:02
@kriskowal kriskowal enabled auto-merge May 29, 2026 20:02
Iteration III of the guix-CI resilience pattern. PR #82 (iteration I)
established the two-substitute-server pattern and the installer-tarball
cache; PR #255 (iteration II) reordered the substitute URLs and widened
the polling and timeout windows after a Bordeaux outage exposed the
"first server unreachable" slow path. The 2026-05-14 outage exposed a
third failure mode that neither prior iteration addressed: both
substitute servers degraded simultaneously, with the result that the
daemon could not resolve the runtime closure (guile + fibers + websocket
+ gnutls + gcrypt) from either upstream. Reorder and wider timeouts did
not help because no amount of waiting brings up a server that is down.

The existing cache step amortizes only the installer tarball, not the
runtime store the daemon resolves at runtime. Each run pays the full
substitute-fetch cost end-to-end, and a both-servers-degraded day means
that cost cannot be paid at all.

This change adds a second `actions/cache` step that caches the daemon's
runtime store (`/gnu/store`) together with the daemon database
(`/var/guix/db`). Both paths are root-owned with strict permissions, so
the cache targets a runner-owned staging directory containing a zstd
tarball of the two store paths; paired `sudo tar` shell steps wrap the
actual extract and create, matching the install step's existing
`sudo tar --extract` pattern. On a cache hit the daemon's next
`guix build` finds the resolved closure already on disk and the daemon
DB already records it as valid, so the substitute round-trip is
short-circuited entirely. A degraded-substitute-server day no longer
blocks the workflow's runtime path.

The cache key includes the pinned Guix version (a version bump may
change the on-disk DB schema) and a hash of the workflow file (the
package set and daemon configuration both live in the workflow, so any
change to either forces a fresh snapshot). A `restore-keys` prefix lets
a workflow edit that does not actually invalidate the store (a comment
tweak, a timeout bump) still seed from the prior snapshot and re-save
at job end.

The daemon is stopped across the restore extract so the on-disk store
and the daemon's in-memory view of it cannot diverge mid-flight; a
divergence would surface later as missing-store-item errors from
`guix build`. The snapshot step at the end does not need to stop the
daemon because tar captures the SQLite DB as a point-in-time copy at
the filesystem layer.

The change is additive to iteration II's reorder + widen. Both
mitigations stay in place and remain load-bearing whenever the cache
key changes while one server is degraded: a workflow edit that flushes
the snapshot leaves the daemon dependent on substitute fetches for the
next run.
@kriskowal kriskowal force-pushed the kriskowal-ocapn-guile-interop-cache-store branch from 08263cd to 5036d22 Compare June 3, 2026 05:42