Skip to content

fix(register): force=true must replace additive-against-existing descriptors (prod 409 hotfix)#3

Closed
petrpan26 wants to merge 6 commits into
mainfrom
fix/register-force-replaces-additive-changes
Closed

fix(register): force=true must replace additive-against-existing descriptors (prod 409 hotfix)#3
petrpan26 wants to merge 6 commits into
mainfrom
fix/register-force-replaces-additive-changes

Conversation

@petrpan26
Copy link
Copy Markdown
Contributor

@petrpan26 petrpan26 commented May 7, 2026

Root cause

Two diff systems on the server disagree on schema additions:

  • classify_register_diff (apply_shard pre-flight): NewField on existing event → additive. force-required gate passes.
  • compute_diff (legacy, in execute_register_inner): same shape → changed: [ConflictDetail], returns RegisterOutcome::Conflict unconditionally — never reads force.

apply_shard's pre-removal at line 269 was gated force && !diff.destructive.is_empty(). Pure-additive payloads slipped through, hit the legacy path, 409'd. Wire force=true silently dropped.

This was the prod 409 on every deploy-hetzner retry after PR #1 (PageView gained session_id).

Fix

When force=true, also pre-remove descriptors named in NewField{event} and NewAgg{table} additive entries via force_remove_descriptors. NewDescriptor skipped (genuinely-new). Destructive loop unchanged.

Verification

phase13_4_force_register     10/10  (incl. new test)
phase13_4_dry_run_register    4/4
phase13_5_1_07b_register_no_schema  1/1
phase12_7_no_table_surface    3/3

Bundled

  • c54d4499 homepage curl example: /api/features/site (404) → real POST /api/get.
  • 3a8843b8 + a400b70a design assets: org-avatar-280.png and repo-social-preview-1280x640.png in cream variant. No GitHub API for either — upload via Settings UI.

Followup (out of scope)

Additive without force should also succeed — register_check_force_required says it's allowed, but the legacy compute_diff 409s anyway. Real fix is to align compute_diff with classify_register_diff or thread the new diff into execute_register. Tracked separately.

Hoang Phan added 6 commits May 7, 2026 13:49
…ource (RED)

Reproduces the prod 409 from the deploy-hetzner failure on PR #1 merge:
the pipeline added `session_id` to PageView, expected force=true to
replace the descriptor, but the legacy compute_diff in
execute_register_inner unconditionally returned `registration_conflict`
because apply_shard's pre-removal block only ran for destructive entries.

NewField on an existing event source classifies as `additive` in
classify_register_diff but `changed` in the legacy compute_diff. The
two diff systems disagree, force=true is silently dropped, the legacy
path wins, 409.

Test currently fails with the legacy 409 envelope (status=409,
error.code=registration_conflict, diff.changed contains the Tx schema
mismatch). Will go green once apply_shard pre-removes additive-changed
descriptors when force=true.
…xisting descriptors (GREEN)

Two diff systems on the server disagreed on schema additions:

  - classify_register_diff (apply_shard pre-flight, register_validate.rs):
    new field on existing event source → additive[NewField], no destructive.
    register_check_force_required passes since destructive is empty.

  - compute_diff (legacy, registry_diff.rs, called by execute_register_inner):
    same shape → changed[ConflictDetail{reason: schema_mismatch}].
    register.rs:257 returns RegisterOutcome::Conflict unconditionally —
    no force check on this path.

apply_shard's force-handling block at line 269 was gated on
`force && !diff.destructive.is_empty()`, so additive-only payloads
slipped through unchanged. The legacy path then 409'd, and the wire
`force=true` was effectively dropped. Production deploy of
PageView{path,dwell_ms} → PageView{session_id,path,dwell_ms} +
SiteMetrics has been failing on every retry since PR #1 merged.

Fix: when force=true, also walk diff.additive and pre-remove descriptors
named in NewField{event} and NewAgg{table} entries. NewDescriptor is
genuinely-new (not present in the registry) and stays untouched. The
existing destructive loop is unchanged. Pre-removal goes through the
same registry.force_remove_descriptors call as the destructive path,
so state-loss semantics for force=true are unchanged: callers who want
to add a field without losing accumulated state should send the request
without `force` (additive is allowed without force per
register_check_force_required design).

Targeted: phase13_4_force_register passes 10/10 including the new test.
The pipeline showcase showed:
  $ curl https://beava.dev/api/features/site

That route doesn't exist on the live Caddy proxy and never has —
beava's HTTP surface is `POST /api/get` with `{table, key}` body. The
LiveMetrics card a few lines up already hits the real endpoint, but
the prose example drifted at some point. Anyone copy-pasting hit a 404.

Replaced with the real shape:
  $ curl https://beava.dev/api/get -d '{"table":"SiteMetrics","key":""}'
…w (1280x640)

These two surfaces aren't in the rendered website — they're uploaded
manually via GitHub's Settings UI. Committing the canonical renders
to the design-system assets dir so future updates don't have to
redraw them from scratch.

  - org-avatar-280.png: 280x280 square, transparent geometric mascot
    centered on #000. Drop-in for Settings → Profile → upload picture
    on github.com/beava-dev (or petrpan26).

  - repo-social-preview-1280x640.png: 1280x640, same mascot on #000,
    centered. Drop-in for Settings → General → Social preview on
    github.com/beava-dev/beava.

Both follow the brand-github-banner.html spec from the design bundle
(geometric-transparent on near-black canvas).
Per request, both logo surfaces use the cream-on-mascot variant from
the brand-github-banner.html spec instead of the black variant. Same
canvas dimensions (280x280, 1280x640), same centered geometric mascot,
just --beava-cream (#fdfaf4) bg matching the README banner cutover.
@petrpan26
Copy link
Copy Markdown
Contributor Author

Superseded by #4 — folding the apply_shard fix + integration test + bundled docs/design changes into the broader consolidation PR. Single review, single merge for the full prod fix + cleanup story.

@petrpan26 petrpan26 closed this May 7, 2026
petrpan26 added a commit that referenced this pull request May 7, 2026
…nors additive-against-existing (#4)

**Folds in #3** (closed as superseded). Single PR for the prod 409 fix +
the underlying dual-diff cleanup.

## Root cause

Two diff systems disagreed on schema additions:
- `classify_register_diff` (apply_shard pre-flight): NewField →
`additive`, no force needed.
- `compute_diff` (legacy, in `execute_register_inner`): same shape →
`changed`, returned Conflict regardless of force.

`apply_shard`'s pre-removal block was gated on `force &&
!destructive.is_empty()` — additive-only payloads slipped through, hit
the legacy path, 409'd. Wire `force=true` was effectively dropped.

## Fix in 5 commits

| # | Step | What |
|---|---|---|
| `92af5b87` | hotfix RED | integration test reproducing prod 409 |
| `ff8dd73f` | hotfix GREEN | apply_shard pre-removes
additive-against-existing when force=true |
| `29bbe4d7` | refactor 1/3 | `RegisterDiff` gains `already_present`
field + 6 unit tests |
| `e7d66e76` | refactor 2/3 | `execute_register_inner` switches to
`classify_register_diff` |
| `6d801f40` | refactor 3/3 | Delete legacy `compute_diff` +
`ConflictDetail` + `DiffReason` + 25 tests + `RegisterOutcome::Conflict`
|

Plus `97794ce3` fmt cleanup, `f221dc54` homepage curl example fix,
`4505a60b` + `e17cebcc` brand asset renders.

**Net diff: +74 / −1295 lines.**

## Verification

```
cargo clippy --all-targets --features testing -- -D warnings  → clean
beava-core unit tests           599 passed (incl. 6 new classify tests)
phase13_4_force_register         10 passed (incl. force-additive)
phase13_4_dry_run_register        4 passed
phase2_smoke                      6 passed (already_present preserved)
```

Full workspace gate runs in CI on this PR.

## Out of scope

Allowing additive-without-force to apply in-place. Today the UX is "send
`force=true` to land any change against an existing descriptor." Deeper
engine work (in-place schema mutation without state loss) is a separate
effort.

---------

Co-authored-by: Hoang Phan <hoang.phan@viggle.ai>
petrpan26 added a commit that referenced this pull request May 7, 2026
…ugh publish-edge only (#5)

Every deploy since PR #1 has been hitting 409 \`registration_conflict\`.
PR #3/4 fixed the server-side bug — but the box never picked up the new
image because **deploy-hetzner.yml never restarts the container**.
Compose's \`pull_policy: always\` only matters when you call \`docker
compose up\`; the workflow only ran rsync + curl POST.

Plus a parallel race: when a single PR touches both server-code and
website paths, both publish-edge-image and deploy fired simultaneously,
deploy ran first against the still-old image, register 409'd before the
new image was even built.

## Changes

**deploy-hetzner.yml**
- New first step: \`docker compose pull beava && up -d --force-recreate
--no-deps beava\` over SSH, plus a 20-iter health probe
(\`http://beava:8090/ready\`). Hard-fail with logs if the new image
doesn't come up ready in 20s.
- Drop the \`push\` trigger entirely. Every deploy chains off
\`workflow_run\` (publish-edge succeeded) or manual
\`workflow_dispatch\`. One trigger path.

**publish-edge-image.yml**
- Drop the path filter — fires on every push to main. Buildx + cargo
caching make non-server commits finish in 2–3 min (cache-hit, manifest
re-tag).

## Trade-off

Website-only commits no longer auto-deploy in 30s — they wait ~2–3 min
for publish-edge's cache-hit cycle. For a single-maintainer project with
one prod box, ordering > deploy latency.

## Verified

Manual repro of the fix steps unblocked prod:
1. SSH'd to box, \`docker compose pull && up -d --force-recreate beava\`
→ new :edge running
2. \`gh workflow run deploy-hetzner.yml --ref main\` → run 25518077565 →
success
3. Pipeline registered cleanly against the new server.

## Related

- PR #1 introduced the new pipeline shape that exposed the missing pull
step.
- PR #2 added the \`workflow_run\` chain — necessary but not sufficient
(didn't drop the push trigger; didn't add pull).
- PR #3/#4 fixed the server's diff handling — necessary but not
sufficient (box never got the fix).
- This PR closes the loop.

Co-authored-by: Hoang Phan <hoang.phan@viggle.ai>
petrpan26 pushed a commit that referenced this pull request May 7, 2026
…riter

Three review-driven changes on top of the orphan-recovery fix:

1. (test reviewer #3) Wrap `BadMagic` and `UnsupportedVersion` errors
   from `segment::read_header` inside `reuse_orphan` into the structured
   `AlreadyExists` Io error. Pre-fix, a 24-byte file with garbage magic
   or a future format_version would propagate the raw error variant
   from `read_header`, breaking the boot-time error contract uniformity
   (operator gets `BadMagic` here, `AlreadyExists` everywhere else).
   The wrap includes the original error in the message for forensics.
   Makes `open_refuses_orphan_with_bad_magic` and
   `open_refuses_orphan_with_unsupported_format_version` go green.

2. (impl reviewer #1) Document the single-writer invariant on
   `WalWriter::open`. The reuse path opens with `read+write` and takes
   no `flock`/`fcntl(F_SETLK)`. In normal beava deploys (single
   container, restart: unless-stopped, --force-recreate serializing on
   container lifecycle) this never bites, but two concurrent processes
   pointed at the same WAL dir can both pass the size==HEADER_SIZE
   check and start interleaving record bytes. Documented as an
   explicit prerequisite so a future multi-process scenario doesn't
   silently produce CRC corruption.

3. (impl reviewer #7) Update the now-stale comment in `rotation::rotate`
   that said `WalWriter::open` errors loudly on existing files. After
   the orphan-recovery fix it can also REUSE a header-only segment;
   reflect the new contract.

Plus a clippy fix in the new test (`OpenOptions::write(true).append(true)`
collapses to `.append(true)` — the `append` flag implies `write`).

Verified:
  cargo test -p beava-persistence    → all 9 + pre-existing 7 + others green
  cargo clippy -p beava-persistence --all-targets -- -D warnings → clean
  cargo fmt --check -p beava-persistence → clean
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant