fix(sync): federated collection reconciliation + Pull-from-peer + drawer resilience by atomantic · Pull Request #472 · atomantic/PortOS

atomantic · 2026-05-24T14:56:15Z

Follow-up to #468, fixing real issues found running federated media-collection sync across two machines (void ↔ null).

What was broken

Universe/series-linked collections never reconciled across peers. Each machine minted a random UUID for its Universe: X collection, so per-record sync (keyed on id) saw each machine's copy as a different record → integrity reported both sides "local only", and the merge created a duplicate collection on every peer.
"Sync to peer" was push-only → a record the local machine was behind on (e.g. a universe whose canon images live only on the peer) could not be fixed from the machine you were looking at. ("none of the buttons are fixing it")
Sync Details drawer hung on "Loading…" when the record fetch stalled.
The "Live-pushed records" list on the Instances page grew unbounded and added no value.

Fixes

Deterministic linked-collection ids (uc-<universeId> / sc-<seriesId> via linkedCollectionId) so every machine converges on one record. Create paths reclaim the id when a tombstone holds it (re-create after delete). Migration 038 canonicalizes existing ids on each machine, merging (union of items) same-owner duplicates and rewriting mediaCollection subscription recordIds — idempotent; both machines converge independently.
"Pull from peer" — GET /api/peer-sync/record (returns a record's push-payload) + POST /api/peer-sync/pull-record (fetch from peer → applyIncomingPush, which merges the record + background-pulls missing asset bytes). The drawer is now direction-aware: Pull for peer-only, Sync for local-only, both for diverged/assets-missing.
Drawer resilience — 12s timeout + explicit error state; no more permanent spinner.
Removed the "Live-pushed records" list (the Sync Details drawer covers per-record status); kept the peerSubs fetch that SchemaGapBadge/SyncStatusSection still use.

mediaCollections sync stays opt-in per peer (enable "Media Collections" on the Instances page).

Tests

Migration 038 (pure transforms + idempotent up()), deterministic-id + tombstone-reclaim units, GET /record & POST /pull-record routes, pullRecordFromPeer/getRecordPayloadForPeer service, drawer pull/push buttons + reject/timeout resilience, apiPeerSync wrapper. Server 7331, client 493, build + lint clean.

Problem: universe/series-linked collections got a random UUID per machine, so per-record sync (keyed on id) treated each machine's copy of the SAME universe's collection as distinct — integrity showed both as 'local only' and the merge created duplicates on every peer. - linkedCollectionId({universeId|seriesId}) → 'uc-<universeId>' / 'sc-<seriesId>'; findOrCreate{Universe,Series}Collection + findOrCreateCollectionByName mint it so every federated machine derives the SAME id and converges. Standalone collections keep random ids. - Create paths reclaim the deterministic id if a tombstone holds it (re-creating a deleted universe collection) — a plain append would leave two records at one id and listCollections' dedup would keep the tombstone, dropping the live one. - Migration 038 canonicalizes existing linked-collection ids on each machine, merging (union items) any same-owner duplicates and rewriting mediaCollection subscription recordIds. Idempotent; both machines converge independently. - Instances page: removed the unbounded 'Live-pushed records' per-record list (the Sync Details drawer covers per-record status); kept the peerSubs fetch that SchemaGapBadge/SyncStatusSection still need. Tests: migration 038 (11), deterministic-id + tombstone-reclaim units, existing sharing round-trip green. Server 7317, client 488, lint + build clean.

…state) The drawer's record fetch could leave a permanent 'Loading…' spinner if the request hung (the symptom seen opening 'Universe: Clandestiny'). Add: - recordError state + a 12s hard timeout on the record fetch — a stalled request now flips to an explicit error instead of an infinite spinner (the timeout also bumps the generation so a late response is dropped). - CollectionPreview renders a 'Couldn't load this collection — try Refresh.' state on error, distinct from the loaded-but-empty 'not found' case. Tests: +reject→error, +hung-fetch→timeout→error (fake timers). Client 490, build + lint clean.

'Sync to peer' was push-only, so a record the LOCAL machine was BEHIND on (e.g. a universe whose canon images live only on the peer) could never be fixed from the machine you were looking at. Add the mirror: - Server: GET /api/peer-sync/record returns a single record's push-payload (buildPushPayload without a subscription); POST /api/peer-sync/pull-record fetches that payload from a peer and applies it locally via applyIncomingPush — which already merges the record (+ bundled collection) AND background-pulls missing asset bytes. The pulled payload is validated with the same peerSyncPushSchema the inbound /push route uses. - Client: pullRecordFromPeer() API wrapper + a direction-aware Sync Details drawer: 'Pull from peer' for peer-only, 'Sync to peer' for local-only, BOTH for diverged/assets-missing (ambiguous). Reasoned toasts on no-op results. Tests: +GET /record & POST /pull-record routes, +pullRecordFromPeer service (peer-not-found/unreachable/not-on-peer/invalid/applies), +getRecordPayloadForPeer, +drawer pull-button + both-buttons, +apiPeerSync wrapper. Server 7331, client 493, build + lint clean.

… follow-up

Copilot

Pull request overview

This PR fixes real-world issues in federated media-collection sync by making universe/series-linked collection identities converge across peers, adding a receiver-initiated “pull” path for per-record sync, and hardening the Sync Details drawer against hung record fetches while removing an unbounded UI list.

Changes:

Introduces deterministic IDs for universe/series-linked media collections (uc-<universeId> / sc-<seriesId>) plus migration 038 to canonicalize existing data and rewrite affected peer subscriptions.
Adds peer-sync “pull” capability via GET /api/peer-sync/record (serve push payload) and POST /api/peer-sync/pull-record (fetch + apply payload), with client wrapper + drawer actions.
Improves SyncDetailDrawer resilience (12s timeout + explicit error state) and removes the Instances-page “Live-pushed records” list while keeping the underlying peerSubs fetch.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
server/services/sharing/peerSync.js	Adds `getRecordPayloadForPeer` and `pullRecordFromPeer` service functions.
server/services/sharing/peerSync.test.js	Adds unit tests for new peer-sync pull/payload helpers.
server/routes/peerSync.js	Adds `GET /record` and `POST /pull-record` routes.
server/routes/peerSync.test.js	Adds route tests for `/record` and `/pull-record`.
server/services/mediaCollections.js	Switches linked collections to deterministic IDs and reclaims tombstoned IDs on create paths; adds `linkedCollectionId`.
server/services/mediaCollections.test.js	Adds tests for deterministic IDs and tombstone reclaim behavior.
scripts/migrations/038-canonicalize-linked-collection-ids.js	New migration to canonicalize linked collection IDs and rewrite mediaCollection peer subscriptions.
scripts/migrations/038-canonicalize-linked-collection-ids.test.js	Tests for migration transforms, idempotence, and on-disk rewrite.
client/src/services/apiPeerSync.js	Adds `pullRecordFromPeer` API wrapper.
client/src/services/apiPeerSync.test.js	Adds test coverage for the new API wrapper.
client/src/components/sync/SyncDetailDrawer.jsx	Adds direction-aware Pull/Sync buttons and record-fetch timeout/error state.
client/src/components/sync/SyncDetailDrawer.test.jsx	Adds UI tests for pull button visibility/call, and timeout/reject error states.
client/src/pages/Instances.jsx	Removes “Live-pushed records” section while keeping shared peerSubs fetch for status/badges.
.changelog/NEXT.md	Documents the reconciliation + pull-from-peer + drawer resilience + UI removal fixes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- pullRecordFromPeer: add an AbortController timeout (PUSH_TIMEOUT_MS) — peerFetch has no built-in timeout, so a hung peer would otherwise pin the pull (and the UI action) forever; abort maps to peer-unreachable. (Copilot) - SyncDetailDrawer Refresh button now also re-runs loadRecord, so the 'try Refresh' error copy is honest — Refresh was only re-fetching integrity, not the record. (Copilot) - findOrCreateCollectionByName: when backfilling a universe link onto a legacy orphan, adopt the deterministic id (was keeping the random id → reintroduced non-convergent linked ids). Prefers an existing live canonical collection over clobbering it; only reclaims a tombstone at the canonical id. (Copilot) - migration 038: import rename at top instead of a dynamic import('fs/promises') inside writeJsonAtomic. (self-review) Tests: +backfill canonicalization + canonical-preference cases. Server 7333, client 493, build + lint clean.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

pullRecordFromPeer trusted the payload's self-reported sourceInstanceId; a misconfigured/buggy peer returning a record that originated elsewhere would bind our reverse-subscription + asset-pull to a peer we never contacted. Reject when sourceInstanceId != the resolved peer.instanceId.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

…, migration metric - SyncDetailDrawer: PeerRow onRefresh now reloads the record too, not just integrity — a peer-only record pulled local was leaving a stale preview. - pullRecordFromPeer: also assert the returned payload is the requested kind+record.id (not just origin) before applyIncomingPush. - migration 038: surface the previously-dead subscription metric as subsDeduped in the log + return (dropped the meaningless id-map addend).

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

…r load timer - findOrCreateCollectionByName: normalize universeId to UNIVERSE_ID_MAX once (backfill + create) so its id/stored field match findOrCreateUniverseCollection. - migration 038: slice owner ids to the runtime cap before deriving uc-/sc- and storing universeId/seriesId, so a migrated record agrees with what the live service computes (overlong/hand-edited owner id can't desync the row). - SyncDetailDrawer: hold the record-load timeout in a ref and clear it on unmount and rapid re-load, not only when the fetch settles (no accumulating 12s timers). Root cause: linkedCollectionId trusted callers to pre-slice the owner id — the backfill path and migration both forgot. Absorbed the slice INTO linkedCollectionId (and its inlined migration copy) so the derived id is bounded by construction and no future caller can over-run the peer-sync id cap.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

…CreateCollectionByName The presence guard checked universeId.trim() but the slice used the raw value, so a padded id (' u1 ') would store/derive 'uc- u1 ' and not converge with an unpadded caller's 'uc-u1'. Trim before slicing so the guard and the stored/derived value operate on the same string.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

…ed gate - SyncDetailDrawer: a newer loadRecord() overwrote loadTimeoutRef.current, so an earlier fetch's .finally() would clear the NEWER request's timer — disabling its timeout and reintroducing the stuck-'Loading…' failure. Capture each invocation's timer locally, clear only that one, and null the ref only while it still points at us. +regression test. - Instances: removing peerSubsLoaded let SyncStatusSection render snapshot badges as 'behind' before peerSubs resolved (livePushCovered is derived from them). Restore a loaded flag and suppress the 'behind' state until the first listPeerSubscriptions settles — falls through to the neutral 'pending' badge.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

client/src/pages/Instances.jsx:787

peerSubsLoaded is never reset to false when peer.instanceId becomes available (null → value) or changes, so SnapshotSyncBadge can compute behind=true while peerSubs is still the initial [] and before the first listPeerSubscriptions() settles. Set peerSubsLoaded(false) at the start of this effect (when peer.instanceId is truthy) so “behind” stays suppressed until the first fetch completes for the current instanceId.

  useEffect(() => {
    if (!peer.instanceId) {
      setPeerSubs([]);
      setPeerSubsLoaded(true); // no instanceId → nothing to load; don't suppress forever
      return;
    }
    let cancelled = false;
    const refetch = () => listPeerSubscriptions({ peerId: peer.instanceId }, { silent: true })
      .then((r) => {
        if (!cancelled) setPeerSubs(r?.subscriptions || []);
      })
      .catch(() => {
        if (!cancelled) setPeerSubs([]);
      })
      .finally(() => {
        if (!cancelled) setPeerSubsLoaded(true);
      });

…ear stale record - getRecordPayloadForPeer: guard getInstanceId() like pushRecordToPeer — return null (route 404s) when self-identity is missing/UNKNOWN instead of 500ing or emitting a sourceInstanceId='unknown' payload the puller rejects. - Instances: reset peerSubsLoaded=false when instanceId becomes available/changes so 'behind' stays suppressed until that peer's first listPeerSubscriptions settles (the null→value / change edge the initial fix missed). - SyncDetailDrawer: clear the prior record when switching to a different recordId (track lastRecordIdRef) so the header/preview never shows stale metadata while loading — or stays stuck on the wrong record if the new fetch errors. Same-id refresh keeps the record visible. +2 tests.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

…em order, record-pull size cap - migration 038: replace the -Infinity parseMs with mediaCollections.js's null-based compareEarlierWins/compareNewerWins so a corrupted addedAt/createdAt /lastPushedAt LOSES instead of winning 'earliest'. Sort unionItems by itemKey to match runtime mergeCollectionItems (else two installs churn snapshot checksums until the next merge). - pullRecordFromPeer: cap the response with maxBytes (HTTPS shim streaming cap) + a pre-buffer Content-Length check (plain-HTTP path) so a buggy/runaway peer can't drive unbounded memory. +tests.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

The HTTPS-shim cap rejection was caught → null → reported as 'peer-unreachable', misreading an oversized response as offline and hiding the 'payload-too-large' reason the Content-Length path already uses. Surface payload-too-large from the catch when the cap trips. +test.

atomantic added 4 commits May 24, 2026 07:43

docs(changelog): federated collection reconciliation + Pull-from-peer…

d93a714

… follow-up

atomantic requested a review from Copilot May 24, 2026 14:56

Copilot started reviewing on behalf of atomantic May 24, 2026 14:57 View session