ci(deploy): pull+restart container before register; route deploy through publish-edge only #5
Merged
…y via publish-edge

Two bugs combined to break every prod deploy after PR #1:

1. deploy-hetzner.yml never ran 'docker compose pull && up -d'. It rsynced the website and POSTed the register payload but left the running beava container on whatever image was last manually pulled. Even after publish-edge-image built a new :edge digest, the box stayed on the old binary, and the new pipeline shape (PageView with session_id) hit the old server's diff path, returning 409.
2. The push trigger on deploy-hetzner.yml fired in parallel with publish-edge-image when a single PR touched both server-code paths AND website paths (PR #4 did exactly this). Even fixing bug 1 wouldn't help if deploy fired BEFORE publish-edge finished: the pull would pick up the previous :edge digest.

Fix:

* deploy-hetzner.yml gains a 'Pull latest beava image + restart container' step that runs FIRST: docker compose pull beava + up -d --force-recreate --no-deps beava + 20s health probe loop. Hard-fail with logs if the new image doesn't come up ready.
* deploy-hetzner.yml drops the push trigger entirely. Every deploy chains off publish-edge-image's completion (workflow_run trigger) or workflow_dispatch. One trigger path, no parallel race.
* publish-edge-image.yml drops its path filter and fires on every push to main. Buildx cache makes non-server commits finish in 2-3 min (cache-hit on cargo + image layers, just a manifest re-tag). The cost is a small CI burn for trivial doc commits; the benefit is that deploy always runs behind a fresh publish, with no races.

Trade-off explicit in comments: website-only commits no longer auto-deploy in 30 seconds; they wait for publish-edge's 2-3 min cache-hit cycle. Acceptable for a single-maintainer project where ordering > deploy latency.

Verified manually: SSH'd to box, pulled new :edge (consolidated server), restarted container, then triggered deploy via workflow_dispatch. Run 25518077565 succeeded.
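A minimal sketch of what the pull + restart + health-probe step could look like, written as POSIX shell. This is not the actual step from deploy-hetzner.yml: the function names `wait_until_ready` and `restart_beava`, the `HEALTH_URL` variable, its default URL, and the one-second probe interval are all assumptions made for illustration. Only the `docker compose` invocations are taken from the commit message.

```shell
# Sketch only. Function names, HEALTH_URL and the probe interval are
# assumptions, not the contents of the real workflow step.

# Run a probe command repeatedly until it succeeds or the attempt
# budget runs out.
# Usage: wait_until_ready <attempts> <sleep_seconds> <command> [args...]
wait_until_ready() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Pull the new image, recreate only the beava service, then gate the
# rest of the deploy on the container answering its health probe.
restart_beava() {
  docker compose pull beava || return 1
  docker compose up -d --force-recreate --no-deps beava || return 1
  # 20 probes, one second apart, approximates a 20s window.
  # The health URL is hypothetical.
  if ! wait_until_ready 20 1 \
      curl -fsS -o /dev/null "${HEALTH_URL:-http://localhost:8080/health}"; then
    # Hard-fail with recent container logs so the CI run shows why.
    docker compose logs --tail 100 beava >&2
    return 1
  fi
}
```

Calling `restart_beava` before the rsync and register steps keeps the ordering described above: the register POST only runs once the freshly pulled image reports ready.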
petrpan26 pushed a commit that referenced this pull request May 7, 2026
test_apply_drains_more_than_1024_items_per_iteration in phase12_08_drain_until_empty_test.rs is genuinely flaky on CI under shared runner load. The assertion is 'drained MORE than DRAIN_CAP=1024 items in one event-loop iteration', and runner contention sometimes keeps that watermark below 1024 even though the test logic is sound. Hit it twice on PR #6 (a docs-only PR with zero Rust changes), and it also shows up in PR #5's CI history. Retry 3x via a per-test nextest override. Cite the symptom + reason in the config so future maintainers don't think this is masking a real regression.
Every deploy since PR #1 has been hitting 409 `registration_conflict`. PR #3/4 fixed the server-side bug, but the box never picked up the new image because deploy-hetzner.yml never restarts the container. Compose's `pull_policy: always` only matters when you call `docker compose up`; the workflow only ran rsync + curl POST.
There was also a parallel race: when a single PR touched both server-code and website paths, publish-edge-image and deploy fired simultaneously. Deploy ran first against the still-old image, and register 409'd before the new image was even built.
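One way to confirm the stale-image symptom on the box is to compare the image ID the running container was created from with the ID of the locally pulled tag. The sketch below is illustrative only: `image_is_stale` is a hypothetical helper, and the container name and image reference passed to it are placeholders, not values from this repo. It assumes `docker compose pull` has already been run so the local tag reflects the latest published digest.

```shell
# Sketch only. image_is_stale is a hypothetical helper; the container
# name and image reference are supplied by the caller.
#
# Returns 0 when the running container was created from a different
# image than the one the local tag currently points at, 1 when they
# match, and 2 when either lookup fails.
# Usage: image_is_stale <container> <image_ref>
image_is_stale() {
  local running latest
  # Image ID the container was created from.
  running="$(docker inspect --format '{{.Image}}' "$1")" || return 2
  # Image ID the local tag resolves to after the most recent pull.
  latest="$(docker image inspect --format '{{.Id}}' "$2")" || return 2
  [ "$running" != "$latest" ]
}
```

A mismatch after a pull means the container still runs the old binary and needs to be recreated, which is the situation the rsync + curl-only workflow left the box in.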
Changes
deploy-hetzner.yml
publish-edge-image.yml
Trade-off
Website-only commits no longer auto-deploy in 30s; they wait ~2–3 min for publish-edge's cache-hit cycle. For a single-maintainer project with one prod box, correct ordering matters more than deploy latency.
Verified
Manual repro of the fix steps unblocked prod:
Related