
ci(deploy): chain deploy after publish-edge to kill version-mismatch race #2

Merged
petrpan26 merged 1 commit into main from ci/sequential-publish-then-deploy on May 7, 2026

@petrpan26
Contributor

Summary

Today's PR #1 merge hit a workflow race: deploy-hetzner and publish-edge-image both fired on the same commit, ran in parallel, and deploy gave up before the new image was published. The pipeline re-register hit HTTP 409 because the old running image didn't honor force=true for destructive diffs.

This chains them: deploy now waits for publish-edge-image to succeed before firing.

What changed

  • deploy-hetzner.yml on: adds workflow_run trigger on publish-edge completion (branches: main).
  • Job-level if: skips the deploy if the upstream failed — a broken image never reaches prod.
  • The original push: trigger stays for website-only changes (path-filtered, no server rebuild needed).
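
The job-level guard can be sketched as a predicate over the triggering event. This is a minimal illustration in Python, not the workflow's actual `if:` expression. The function name `should_run_deploy` is hypothetical, and the sketch assumes the upstream result is read from `workflow_run.conclusion` in the event payload.

```python
def should_run_deploy(event_name, payload):
    """Decide whether the deploy job should run for a triggering event.

    Assumption: `payload` is the event payload as a dict, and a
    workflow_run event carries the upstream result under
    payload["workflow_run"]["conclusion"].
    """
    if event_name != "workflow_run":
        # push and workflow_dispatch are not gated on an upstream run.
        return True
    # Chained run: deploy only when publish-edge concluded successfully.
    conclusion = payload.get("workflow_run", {}).get("conclusion")
    return conclusion == "success"
```

With this shape, a failed or cancelled publish-edge run skips the deploy, while push and manual runs pass through unchanged.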

Behavior matrix

Change shape                             publish-edge fires?   deploy fires
───────────────────────────────────────  ───────────────────   ──────────────────────────────────────────
Server code (crates/**)                  yes                   after publish succeeds
Website only (beava-website/project/**)  no                    immediately on push
Both                                     yes                   after publish succeeds (single deploy run)
register_pipeline.py only                no                    immediately
workflow_dispatch                        n/a                   immediately
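
The matrix can also be read as a small decision function. The sketch below models only the intended behavior listed in the matrix, not how GitHub Actions evaluates triggers. `plan_deploy` and `SERVER_PREFIX` are hypothetical names, and it assumes publish-edge is path-filtered to `crates/**` only.

```python
# Assumption: only server code under crates/ triggers publish-edge.
SERVER_PREFIX = "crates/"


def plan_deploy(changed_paths, manual=False):
    """Return (publish_edge_fires, deploy_timing) for one change.

    publish_edge_fires is None for a manual run, where the question
    does not apply.
    """
    if manual:
        # workflow_dispatch deploys right away, independent of any push.
        return (None, "immediately")
    touches_server = any(path.startswith(SERVER_PREFIX) for path in changed_paths)
    if touches_server:
        # Server change (alone or mixed with website files): wait for the image.
        return (True, "after publish succeeds")
    # Website-only or register_pipeline.py-only: no server rebuild needed.
    return (False, "immediately on push")
```

For example, `plan_deploy(["crates/server/src/main.rs", "beava-website/project/index.html"])` falls into the "Both" row and waits for publish-edge.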

Companion change

docker-compose.prod.yml was already updated on main as f0a02354 (beava:next → beavadev/beava:edge, plus pull_policy: always). Without that, even a post-publish deploy wouldn't pull the new image. Both changes are needed.

Test plan

…-mismatch race

Today's PR-merge incident: deploy-hetzner and publish-edge-image both
fired on the same merge commit and ran in parallel. deploy finished in
~4 min and tried to /register the new pipeline shape (PageView gained
session_id, three feature names changed) against the old running image
that didn't honor force=true for destructive diffs → HTTP 409 conflict.
publish-edge was still building the new image (~20 min release compile)
when deploy gave up.

Fix: deploy now triggers via `workflow_run` on publish-edge-image
completion. The job-level `if:` guard ensures it only runs when the
upstream succeeded — a broken image never reaches the box.

Behavior matrix:

  Change shape                          publish-edge?   deploy fires
  ────────────────────────────────────  ─────────────   ─────────────────
  Server code (crates/**)               yes             after publish ok
  Website only (project/**)             no              immediately on push
  Both                                  yes             after publish ok
  register_pipeline.py only             no              immediately
  Manual workflow_dispatch              —               immediately

Path filter on the push trigger is unchanged — it stays narrow so
website/SDK changes deploy without round-tripping through publish-edge.

Companion change to docker-compose.prod.yml (track :edge + pull_policy:
always) already landed on main as f0a0235 — without that, even a
post-publish deploy wouldn't pick up the new image.
@petrpan26 petrpan26 merged commit 071ead7 into main May 7, 2026
1 check passed
@petrpan26 petrpan26 deleted the ci/sequential-publish-then-deploy branch May 7, 2026 17:33
petrpan26 added a commit that referenced this pull request May 7, 2026
…ugh publish-edge only (#5)

Every deploy since PR #1 has been hitting 409 `registration_conflict`.
PR #3/#4 fixed the server-side bug, but the box never picked up the new
image because **deploy-hetzner.yml never restarts the container**.
Compose's `pull_policy: always` only matters when you call
`docker compose up`; the workflow only ran rsync + curl POST.

Plus a parallel race: when a single PR touches both server-code and
website paths, both publish-edge-image and deploy fired simultaneously,
deploy ran first against the still-old image, register 409'd before the
new image was even built.

## Changes

**deploy-hetzner.yml**
- New first step: `docker compose pull beava && docker compose up -d
--force-recreate --no-deps beava` over SSH, plus a 20-iteration health
probe (`http://beava:8090/ready`). Hard-fail with logs if the new image
doesn't come up ready in 20s.
- Drop the `push` trigger entirely. Every deploy chains off
`workflow_run` (publish-edge succeeded) or manual
`workflow_dispatch`. One trigger path.
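
The health probe amounts to a bounded retry loop. The actual step presumably runs in shell over SSH, so the Python below is only a sketch of the loop's shape. `wait_until_ready` is a hypothetical name, and the one-second delay is an assumption inferred from "20 iterations in 20s".

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, attempts=20, delay_s=1.0, probe=None, sleep=time.sleep):
    """Poll `url` until it answers HTTP 200 or the attempts run out.

    Returns the 1-based attempt number that succeeded and raises
    TimeoutError otherwise. `probe` and `sleep` are injectable so the
    loop can be exercised without a network.
    """
    def http_probe(target):
        try:
            with urllib.request.urlopen(target, timeout=2) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or non-2xx: not ready yet.
            return False

    probe = probe or http_probe
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    raise TimeoutError(f"{url} not ready after {attempts} attempts")
```

A caller would treat the `TimeoutError` as the hard-fail case and dump container logs before exiting non-zero.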

**publish-edge-image.yml**
- Drop the path filter — fires on every push to main. Buildx + cargo
caching make non-server commits finish in 2–3 min (cache-hit, manifest
re-tag).

## Trade-off

Website-only commits no longer auto-deploy in 30s — they wait ~2–3 min
for publish-edge's cache-hit cycle. For a single-maintainer project with
one prod box, ordering > deploy latency.

## Verified

Manual repro of the fix steps unblocked prod:
1. SSH'd to box, `docker compose pull && docker compose up -d --force-recreate beava`
→ new :edge running
2. `gh workflow run deploy-hetzner.yml --ref main` → run 25518077565 →
success
3. Pipeline registered cleanly against the new server.

## Related

- PR #1 introduced the new pipeline shape that exposed the missing pull
step.
- PR #2 added the `workflow_run` chain: necessary but not sufficient
(didn't drop the push trigger; didn't add pull).
- PR #3/#4 fixed the server's diff handling: necessary but not
sufficient (box never got the fix).
- This PR closes the loop.

Co-authored-by: Hoang Phan <hoang.phan@viggle.ai>
petrpan26 added a commit that referenced this pull request May 14, 2026
…of returning the error envelope (#130)

## Summary

`TcpTransport.send_push` was blindly JSON-decoding the response frame
and returning the dict to user code. When the server emitted
`OP_ERROR_RESPONSE` (e.g. `invalid_event` on a type mismatch), the
error envelope was returned as a regular dict, with **no exception**.
Embed mode defaults to TCP, so fire-and-forget pushes were silently
dropped (effectively sent to `/dev/null`) on validation failure.

PR #120 documented this and locked the buggy behaviour with
`test_type_error_at_push.py`. This PR fixes the bug and flips those
tests to assert the correct contract.

## Fix in `python/beava/_transport.py`

`send_push` (lines 443-490) now mirrors `send_get`:
- After `read_frame`, check `frame.op != OP_PUSH` (the success-echo
opcode; the server reuses `OP_PUSH`, not a separate
`OP_PUSH_RESPONSE`).
- If `OP_ERROR_RESPONSE`: parse the JSON body with try/except guards
and `raise RegistrationError(code=err_body["error"]["code"],
message=...)`.
- Fallback: `"unparseable_error"` for bad bytes /
`"unexpected_frame"` for missing code.

Docstring expanded with success/error wire shapes and `Raises:`
section.
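
The response handling described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `_transport.py`: the opcode values, the `Frame` dataclass, the `decode_push_response` helper, and the `RegistrationError` stand-in are all hypothetical, and only the error codes and the `err_body["error"]["code"]` shape come from the description.

```python
import json
from dataclasses import dataclass

# Assumption: placeholder opcode values, not beava's real wire constants.
OP_PUSH = 0x01
OP_ERROR_RESPONSE = 0x7F


class RegistrationError(Exception):
    """Stand-in for beava's exception; only code and message are modelled."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


@dataclass
class Frame:
    """Hypothetical decoded frame: an opcode plus the raw body bytes."""
    op: int
    body: bytes


def decode_push_response(frame):
    """Return the ack dict on success; raise RegistrationError otherwise."""
    if frame.op == OP_PUSH:
        # Success echo: the server reuses OP_PUSH for the ack.
        return json.loads(frame.body)
    if frame.op != OP_ERROR_RESPONSE:
        raise RegistrationError("unexpected_frame", f"unexpected opcode {frame.op:#x}")
    try:
        err_body = json.loads(frame.body)
    except ValueError:
        # ValueError covers both invalid JSON and undecodable bytes.
        raise RegistrationError("unparseable_error", "error body is not valid JSON") from None
    try:
        error = err_body["error"]
        code = error["code"]
    except (KeyError, TypeError):
        raise RegistrationError("unexpected_frame", "error body has no code") from None
    raise RegistrationError(code, error.get("message", ""))
```

Splitting the decode step from the socket I/O keeps the error mapping testable without a server, which is one possible way to structure it.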

## Test flips in `python/tests/test_type_error_at_push.py`

- 5 type-mismatch tests flipped from `_assert_push_error(...)` (which
asserted the buggy return-dict shape) to `with
pytest.raises(RegistrationError) as exc_info: ...; assert
exc_info.value.code == "<code>"`.
- Test #2 (float→int silent accept) unchanged: the server still
legitimately accepts that case via numeric I64↔F64 compat and returns
`ack_lsn`.
- Added 2 new tests with an in-process TCP mock server:
  - `test_push_response_unexpected_opcode_raises`: server replies with a
  bogus opcode → `RegistrationError(code="unexpected_frame")`.
  - `test_push_error_response_with_unparseable_body_raises`: a non-JSON
  error body still raises cleanly (no crash).

## Test plan

- [x] Before fix: 6/6 passed by asserting the buggy return shape.
- [x] After fix: 8/8 passed
(`pytest python/tests/test_type_error_at_push.py`, 21.26s).
- [x] Re-run against current `main` (`b20d2b83`): clean.
- [x] `ruff check` clean.