ci(deploy): chain deploy after publish-edge to kill version-mismatch race#2
Merged
Merged
Conversation
…-mismatch race Today's PR-merge incident: deploy-hetzner and publish-edge-image both fired on the same merge commit and ran in parallel. deploy finished in ~4 min and tried to /register the new pipeline shape (PageView gained session_id, three feature names changed) against the old running image that didn't honor force=true for destructive diffs → HTTP 409 conflict. publish-edge was still building the new image (~20 min release compile) when deploy gave up. Fix: deploy now triggers via `workflow_run` on publish-edge-image completion. The job-level `if:` guard ensures it only runs when the upstream succeeded — a broken image never reaches the box. Behavior matrix: Change shape publish-edge? deploy fires ──────────────────────────────────── ───────────── ───────────────── Server code (crates/**) yes after publish ok Website only (project/**) no immediately on push Both yes after publish ok register_pipeline.py only no immediately Manual workflow_dispatch — immediately Path filter on the push trigger is unchanged — it stays narrow so website/SDK changes deploy without round-tripping through publish-edge. Companion change to docker-compose.prod.yml (track :edge + pull_policy: always) already landed on main as f0a0235 — without that, even a post-publish deploy wouldn't pick up the new image.
petrpan26
added a commit
that referenced
this pull request
May 7, 2026
…ugh publish-edge only (#5) Every deploy since PR #1 has been hitting 409 \`registration_conflict\`. PR #3/4 fixed the server-side bug — but the box never picked up the new image because **deploy-hetzner.yml never restarts the container**. Compose's \`pull_policy: always\` only matters when you call \`docker compose up\`; the workflow only ran rsync + curl POST. Plus a parallel race: when a single PR touches both server-code and website paths, both publish-edge-image and deploy fired simultaneously, deploy ran first against the still-old image, register 409'd before the new image was even built. ## Changes **deploy-hetzner.yml** - New first step: \`docker compose pull beava && up -d --force-recreate --no-deps beava\` over SSH, plus a 20-iter health probe (\`http://beava:8090/ready\`). Hard-fail with logs if the new image doesn't come up ready in 20s. - Drop the \`push\` trigger entirely. Every deploy chains off \`workflow_run\` (publish-edge succeeded) or manual \`workflow_dispatch\`. One trigger path. **publish-edge-image.yml** - Drop the path filter — fires on every push to main. Buildx + cargo caching make non-server commits finish in 2–3 min (cache-hit, manifest re-tag). ## Trade-off Website-only commits no longer auto-deploy in 30s — they wait ~2–3 min for publish-edge's cache-hit cycle. For a single-maintainer project with one prod box, ordering > deploy latency. ## Verified Manual repro of the fix steps unblocked prod: 1. SSH'd to box, \`docker compose pull && up -d --force-recreate beava\` → new :edge running 2. \`gh workflow run deploy-hetzner.yml --ref main\` → run 25518077565 → success 3. Pipeline registered cleanly against the new server. ## Related - PR #1 introduced the new pipeline shape that exposed the missing pull step. - PR #2 added the \`workflow_run\` chain — necessary but not sufficient (didn't drop the push trigger; didn't add pull). - PR #3/#4 fixed the server's diff handling — necessary but not sufficient (box never got the fix). - This PR closes the loop. Co-authored-by: Hoang Phan <hoang.phan@viggle.ai>
petrpan26
added a commit
that referenced
this pull request
May 14, 2026
…of returning the error envelope (#130) ## Summary \`TcpTransport.send_push\` was blindly JSON-decoding the response frame and returning the dict to user code. When the server emitted \`OP_ERROR_RESPONSE\` (e.g. \`invalid_event\` on a type mismatch), the error envelope was returned as a regular dict — **no exception**. Embed mode defaults to TCP, so fire-and-forget pushes silently \`/dev/null\` on validation failure. PR #120 documented this and locked the buggy behaviour with \`test_type_error_at_push.py\`. This PR fixes the bug and flips those tests to assert the correct contract. ## Fix in \`python/beava/_transport.py\` \`send_push\` (lines 443-490) now mirrors \`send_get\`: - After \`read_frame\`, check \`frame.op != OP_PUSH\` (the success-echo opcode — server reuses \`OP_PUSH\`, not a separate \`OP_PUSH_RESPONSE\`). - If \`OP_ERROR_RESPONSE\`: parse the JSON body with try/except guards and \`raise RegistrationError(code=err_body["error"]["code"], message=...)\`. - Fallback: \`"unparseable_error"\` for bad bytes / \`"unexpected_frame"\` for missing code. Docstring expanded with success/error wire shapes and \`Raises:\` section. ## Test flips in \`python/tests/test_type_error_at_push.py\` - 5 type-mismatch tests flipped from \`_assert_push_error(...)\` (which asserted the buggy return-dict shape) to \`with pytest.raises(RegistrationError) as exc_info: ...; assert exc_info.value.code == "<code>"\`. - Test #2 (float→int silent accept) unchanged — server still legitimately accepts that case via numeric I64↔F64 compat, returns ack_lsn. - Added 2 new tests with in-process TCP mock server: - \`test_push_response_unexpected_opcode_raises\` — server replies with bogus opcode → \`RegistrationError(code="unexpected_frame")\`. - \`test_push_error_response_with_unparseable_body_raises\` — non-JSON error body still raises cleanly (no crash). ## Test plan - [x] Before fix: 6/6 passed by asserting the buggy return shape. - [x] After fix: 8/8 passed (\`pytest python/tests/test_type_error_at_push.py\`, 21.26s). - [x] Re-run against current \`main\` (\`b20d2b83\`) — clean. - [x] \`ruff check\` clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Today's PR #1 merge hit a workflow race:
deploy-hetznerandpublish-edge-imageboth fired on the same commit, ran in parallel, anddeploygave up before the new image was published. The pipeline re-register hit HTTP 409 because the old running image didn't honorforce=truefor destructive diffs.This chains them: deploy now waits for
publish-edge-imageto succeed before firing.What changed
deploy-hetzner.ymlon:addsworkflow_runtrigger on publish-edge completion (branches: main).if:skips the deploy if the upstream failed — a broken image never reaches prod.push:trigger stays for website-only changes (path-filtered, no server rebuild needed).Behavior matrix
crates/**)beava-website/project/**)register_pipeline.pyonlyworkflow_dispatchCompanion change
docker-compose.prod.ymlwas already updated on main asf0a02354(beava:next→beavadev/beava:edge+pull_policy: always). Without that, even a post-publish deploy wouldn't pull the new image. Both are needed.Test plan
gh workflow run deploy-hetzner.ymlstill works as the manual escape hatch