ci(deploy): pull+restart container before register; route deploy through publish-edge only #5

Merged: petrpan26 merged 1 commit into main from fix/deploy-pull-and-restart on May 7, 2026


@petrpan26
Contributor

Every deploy since PR #1 has been hitting 409 `registration_conflict`. PR #3/4 fixed the server-side bug — but the box never picked up the new image because deploy-hetzner.yml never restarts the container. Compose's `pull_policy: always` only matters when you call `docker compose up`; the workflow only ran rsync + curl POST.

There was also a race: when a single PR touched both server-code and website paths, publish-edge-image and deploy fired simultaneously. Deploy ran first against the still-old image, so register returned 409 before the new image was even built.

Changes

deploy-hetzner.yml

  • New first step: `docker compose pull beava && docker compose up -d --force-recreate --no-deps beava` over SSH, plus a 20-iteration health probe (`http://beava:8090/ready`). Hard-fail with logs if the new image doesn't come up ready in 20s.
  • Drop the `push` trigger entirely. Every deploy chains off `workflow_run` (publish-edge succeeded) or manual `workflow_dispatch`. One trigger path.
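
A minimal sketch of what the pull, restart, and probe sequence could look like as shell functions, assuming it runs on the box in the compose project directory and that `http://beava:8090/ready` is reachable from wherever the probe runs. The names `wait_ready` and `restart_and_wait`, the one-second delay between attempts, the curl flags, and the 100-line log tail are illustrative assumptions, not taken from deploy-hetzner.yml.

```shell
# wait_ready ATTEMPTS DELAY CMD...
# Runs CMD up to ATTEMPTS times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
# (Hypothetical helper; the real workflow step may be written differently.)
wait_ready() {
  wr_attempts=$1
  wr_delay=$2
  shift 2
  wr_i=1
  while [ "$wr_i" -le "$wr_attempts" ]; do
    if "$@"; then
      echo "ready after $wr_i attempt(s)"
      return 0
    fi
    sleep "$wr_delay"
    wr_i=$((wr_i + 1))
  done
  echo "not ready after $wr_attempts attempts" >&2
  return 1
}

# restart_and_wait
# Pulls the image, recreates only the beava service, then probes readiness.
# On probe failure it prints recent container logs and returns non-zero.
restart_and_wait() {
  docker compose pull beava || return 1
  docker compose up -d --force-recreate --no-deps beava || return 1
  # Assumption: 20 attempts with a 1 s delay approximates the 20 s budget.
  if ! wait_ready 20 1 curl -fsS -o /dev/null --max-time 2 http://beava:8090/ready; then
    docker compose logs --tail 100 beava >&2
    return 1
  fi
}
```

Passing the probe as a command keeps the retry loop independent of how readiness is checked, so the same loop would work if the probe were run through `docker compose exec` instead of a direct `curl`.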

publish-edge-image.yml

  • Drop the path filter — fires on every push to main. Buildx + cargo caching make non-server commits finish in 2–3 min (cache-hit, manifest re-tag).

Trade-off

Website-only commits no longer auto-deploy in 30s — they wait ~2–3 min for publish-edge's cache-hit cycle. For a single-maintainer project with one prod box, ordering > deploy latency.
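
The ordering guarantee comes from the single trigger path. As a rough illustration only, the decision can be pictured as a guard like the one below. `should_deploy` is a hypothetical helper, not something in the workflow files. It assumes the event name and the upstream run's conclusion are passed in as arguments (in GitHub Actions these would correspond to `github.event_name` and `github.event.workflow_run.conclusion`). How deploy-hetzner.yml actually checks that publish-edge succeeded is not shown in this PR.

```shell
# should_deploy EVENT_NAME [UPSTREAM_CONCLUSION]
# Returns 0 when a deploy should proceed:
#   - manual workflow_dispatch always proceeds
#   - workflow_run proceeds only if the upstream run concluded "success"
#   - anything else (for example push) never deploys
should_deploy() {
  case "$1" in
    workflow_dispatch) return 0 ;;
    workflow_run) [ "${2:-}" = "success" ] ;;
    *) return 1 ;;
  esac
}

# Example usage (values are placeholders):
#   should_deploy workflow_run success && echo "deploying"
```

A `workflow_run` trigger fires when the upstream workflow completes regardless of outcome, which is why a sketch like this checks the conclusion explicitly.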

Verified

Manual repro of the fix steps unblocked prod:

  1. SSH'd to box, `docker compose pull && docker compose up -d --force-recreate beava` → new :edge running
  2. `gh workflow run deploy-hetzner.yml --ref main` → run 25518077565 → success
  3. Pipeline registered cleanly against the new server.

Related

…y via publish-edge

Two bugs combined to break every prod deploy after PR #1:

1. deploy-hetzner.yml never ran 'docker compose pull && up -d'. It
   rsynced the website and POSTed the register payload but left the
   running beava container on whatever image was last manually pulled.
   Even after publish-edge-image built a new :edge digest, the box
   stayed on the old binary — and the new pipeline shape (PageView with
   session_id) hit the old server's diff path, returning 409.

2. The push trigger on deploy-hetzner.yml fired in parallel with
   publish-edge-image when a single PR touched both server-code paths
   AND website paths (PR #4 did exactly this). Even fixing #1 wouldn't
   help if deploy fired BEFORE publish-edge finished — the pull would
   pick up the previous :edge digest.

Fix:

  * deploy-hetzner.yml gains a 'Pull latest beava image + restart
    container' step that runs FIRST: docker compose pull beava + up -d
    --force-recreate --no-deps beava + 20s health probe loop. Hard-fail
    with logs if the new image doesn't come up ready.

  * deploy-hetzner.yml drops the push trigger entirely. Every deploy
    chains off publish-edge-image's completion (workflow_run trigger)
    or workflow_dispatch. One trigger path, no parallel race.

  * publish-edge-image.yml drops its path filter — fires on every push
    to main. Buildx cache makes non-server commits finish in 2-3 min
    (cache-hit on cargo + image layers, just a manifest re-tag). The
    cost is small CI burn for trivial doc commits; the benefit is
    deploy is always behind a fresh publish, no races.

Trade-off explicit in comments: website-only commits no longer
auto-deploy in 30 seconds — they wait for publish-edge's 2-3 min
cache-hit cycle. Acceptable for a single-maintainer project where
ordering > deploy latency.

Verified manually: SSH'd to box, pulled new :edge (consolidated
server), restarted container, then triggered deploy via
workflow_dispatch. Run 25518077565 — succeeded.
@petrpan26 merged commit 82b1b10 into main on May 7, 2026
8 checks passed
@petrpan26 deleted the fix/deploy-pull-and-restart branch on May 7, 2026, 20:00
petrpan26 pushed a commit that referenced this pull request on May 7, 2026:
test_apply_drains_more_than_1024_items_per_iteration in
phase12_08_drain_until_empty_test.rs is genuinely flaky on CI under
shared runner load — the assertion is 'drained MORE than DRAIN_CAP=1024
items in one event-loop iteration', and runner contention sometimes
keeps that watermark below 1024 even though the test logic is sound.
Hit it twice on PR #6 (a docs-only PR with zero Rust changes), and on
PR #5's CI history. Retry 3x via a per-test nextest override.

Cite the symptom + reason in the config so future maintainers don't
think this is masking a real regression.