Skip to content

fix(sync): federated collection reconciliation + Pull-from-peer + drawer resilience#472

Merged
atomantic merged 13 commits into
mainfrom
fix/federated-collection-reconciliation
May 24, 2026
Merged

fix(sync): federated collection reconciliation + Pull-from-peer + drawer resilience#472
atomantic merged 13 commits into
mainfrom
fix/federated-collection-reconciliation

Conversation

@atomantic
Copy link
Copy Markdown
Owner

Follow-up to #468, fixing real issues found running federated media-collection sync across two machines (void ↔ null).

What was broken

  1. Universe/series-linked collections never reconciled across peers. Each machine minted a random UUID for its Universe: X collection, so per-record sync (keyed on id) saw each machine's copy as a different record → integrity reported both sides "local only", and the merge created a duplicate collection on every peer.
  2. "Sync to peer" was push-only → a record the local machine was behind on (e.g. a universe whose canon images live only on the peer) could not be fixed from the machine you were looking at. ("none of the buttons are fixing it")
  3. Sync Details drawer hung on "Loading…" when the record fetch stalled.
  4. The "Live-pushed records" list on the Instances page grew unbounded and added no value.

Fixes

  • Deterministic linked-collection ids (uc-<universeId> / sc-<seriesId> via linkedCollectionId) so every machine converges on one record. Create paths reclaim the id when a tombstone holds it (re-create after delete). Migration 038 canonicalizes existing ids on each machine, merging (union of items) same-owner duplicates and rewriting mediaCollection subscription recordIds — idempotent; both machines converge independently.
  • "Pull from peer"GET /api/peer-sync/record (returns a record's push-payload) + POST /api/peer-sync/pull-record (fetch from peer → applyIncomingPush, which merges the record + background-pulls missing asset bytes). The drawer is now direction-aware: Pull for peer-only, Sync for local-only, both for diverged/assets-missing.
  • Drawer resilience — 12s timeout + explicit error state; no more permanent spinner.
  • Removed the "Live-pushed records" list (the Sync Details drawer covers per-record status); kept the peerSubs fetch that SchemaGapBadge/SyncStatusSection still use.

mediaCollections sync stays opt-in per peer (enable "Media Collections" on the Instances page).

Tests

Migration 038 (pure transforms + idempotent up()), deterministic-id + tombstone-reclaim units, GET /record & POST /pull-record routes, pullRecordFromPeer/getRecordPayloadForPeer service, drawer pull/push buttons + reject/timeout resilience, apiPeerSync wrapper. Server 7331, client 493, build + lint clean.

atomantic added 4 commits May 24, 2026 07:43
Problem: universe/series-linked collections got a random UUID per machine, so
per-record sync (keyed on id) treated each machine's copy of the SAME universe's
collection as distinct — integrity showed both as 'local only' and the merge
created duplicates on every peer.

- linkedCollectionId({universeId|seriesId}) → 'uc-<universeId>' / 'sc-<seriesId>';
  findOrCreate{Universe,Series}Collection + findOrCreateCollectionByName mint it
  so every federated machine derives the SAME id and converges. Standalone
  collections keep random ids.
- Create paths reclaim the deterministic id if a tombstone holds it (re-creating
  a deleted universe collection) — a plain append would leave two records at one
  id and listCollections' dedup would keep the tombstone, dropping the live one.
- Migration 038 canonicalizes existing linked-collection ids on each machine,
  merging (union items) any same-owner duplicates and rewriting mediaCollection
  subscription recordIds. Idempotent; both machines converge independently.
- Instances page: removed the unbounded 'Live-pushed records' per-record list
  (the Sync Details drawer covers per-record status); kept the peerSubs fetch
  that SchemaGapBadge/SyncStatusSection still need.

Tests: migration 038 (11), deterministic-id + tombstone-reclaim units, existing
sharing round-trip green. Server 7317, client 488, lint + build clean.
…state)

The drawer's record fetch could leave a permanent 'Loading…' spinner if the
request hung (the symptom seen opening 'Universe: Clandestiny'). Add:
- recordError state + a 12s hard timeout on the record fetch — a stalled
  request now flips to an explicit error instead of an infinite spinner (the
  timeout also bumps the generation so a late response is dropped).
- CollectionPreview renders a 'Couldn't load this collection — try Refresh.'
  state on error, distinct from the loaded-but-empty 'not found' case.

Tests: +reject→error, +hung-fetch→timeout→error (fake timers). Client 490,
build + lint clean.
'Sync to peer' was push-only, so a record the LOCAL machine was BEHIND on (e.g.
a universe whose canon images live only on the peer) could never be fixed from
the machine you were looking at. Add the mirror:

- Server: GET /api/peer-sync/record returns a single record's push-payload
  (buildPushPayload without a subscription); POST /api/peer-sync/pull-record
  fetches that payload from a peer and applies it locally via applyIncomingPush
  — which already merges the record (+ bundled collection) AND background-pulls
  missing asset bytes. The pulled payload is validated with the same
  peerSyncPushSchema the inbound /push route uses.
- Client: pullRecordFromPeer() API wrapper + a direction-aware Sync Details
  drawer: 'Pull from peer' for peer-only, 'Sync to peer' for local-only, BOTH
  for diverged/assets-missing (ambiguous). Reasoned toasts on no-op results.

Tests: +GET /record & POST /pull-record routes, +pullRecordFromPeer service
(peer-not-found/unreachable/not-on-peer/invalid/applies), +getRecordPayloadForPeer,
+drawer pull-button + both-buttons, +apiPeerSync wrapper. Server 7331, client
493, build + lint clean.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes real-world issues in federated media-collection sync by making universe/series-linked collection identities converge across peers, adding a receiver-initiated “pull” path for per-record sync, and hardening the Sync Details drawer against hung record fetches while removing an unbounded UI list.

Changes:

  • Introduces deterministic IDs for universe/series-linked media collections (uc-<universeId> / sc-<seriesId>) plus migration 038 to canonicalize existing data and rewrite affected peer subscriptions.
  • Adds peer-sync “pull” capability via GET /api/peer-sync/record (serve push payload) and POST /api/peer-sync/pull-record (fetch + apply payload), with client wrapper + drawer actions.
  • Improves SyncDetailDrawer resilience (12s timeout + explicit error state) and removes the Instances-page “Live-pushed records” list while keeping the underlying peerSubs fetch.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
server/services/sharing/peerSync.js Adds getRecordPayloadForPeer and pullRecordFromPeer service functions.
server/services/sharing/peerSync.test.js Adds unit tests for new peer-sync pull/payload helpers.
server/routes/peerSync.js Adds GET /record and POST /pull-record routes.
server/routes/peerSync.test.js Adds route tests for /record and /pull-record.
server/services/mediaCollections.js Switches linked collections to deterministic IDs and reclaims tombstoned IDs on create paths; adds linkedCollectionId.
server/services/mediaCollections.test.js Adds tests for deterministic IDs and tombstone reclaim behavior.
scripts/migrations/038-canonicalize-linked-collection-ids.js New migration to canonicalize linked collection IDs and rewrite mediaCollection peer subscriptions.
scripts/migrations/038-canonicalize-linked-collection-ids.test.js Tests for migration transforms, idempotence, and on-disk rewrite.
client/src/services/apiPeerSync.js Adds pullRecordFromPeer API wrapper.
client/src/services/apiPeerSync.test.js Adds test coverage for the new API wrapper.
client/src/components/sync/SyncDetailDrawer.jsx Adds direction-aware Pull/Sync buttons and record-fetch timeout/error state.
client/src/components/sync/SyncDetailDrawer.test.jsx Adds UI tests for pull button visibility/call, and timeout/reject error states.
client/src/pages/Instances.jsx Removes “Live-pushed records” section while keeping shared peerSubs fetch for status/badges.
.changelog/NEXT.md Documents the reconciliation + pull-from-peer + drawer resilience + UI removal fixes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread server/services/sharing/peerSync.js
Comment thread client/src/components/sync/SyncDetailDrawer.jsx
Comment thread server/services/mediaCollections.js
- pullRecordFromPeer: add an AbortController timeout (PUSH_TIMEOUT_MS) — peerFetch
  has no built-in timeout, so a hung peer would otherwise pin the pull (and the
  UI action) forever; abort maps to peer-unreachable. (Copilot)
- SyncDetailDrawer Refresh button now also re-runs loadRecord, so the 'try
  Refresh' error copy is honest — Refresh was only re-fetching integrity, not
  the record. (Copilot)
- findOrCreateCollectionByName: when backfilling a universe link onto a legacy
  orphan, adopt the deterministic id (was keeping the random id → reintroduced
  non-convergent linked ids). Prefers an existing live canonical collection
  over clobbering it; only reclaims a tombstone at the canonical id. (Copilot)
- migration 038: import rename at top instead of a dynamic import('fs/promises')
  inside writeJsonAtomic. (self-review)

Tests: +backfill canonicalization + canonical-preference cases. Server 7333,
client 493, build + lint clean.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment thread server/services/sharing/peerSync.js
pullRecordFromPeer trusted the payload's self-reported sourceInstanceId;
a misconfigured/buggy peer returning a record that originated elsewhere
would bind our reverse-subscription + asset-pull to a peer we never
contacted. Reject when sourceInstanceId != the resolved peer.instanceId.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comment thread client/src/components/sync/SyncDetailDrawer.jsx
Comment thread scripts/migrations/038-canonicalize-linked-collection-ids.js Outdated
Comment thread server/services/sharing/peerSync.js
…, migration metric

- SyncDetailDrawer: PeerRow onRefresh now reloads the record too, not just
  integrity — a peer-only record pulled local was leaving a stale preview.
- pullRecordFromPeer: also assert the returned payload is the requested
  kind+record.id (not just origin) before applyIncomingPush.
- migration 038: surface the previously-dead subscription metric as
  subsDeduped in the log + return (dropped the meaningless id-map addend).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comment thread server/services/mediaCollections.js
Comment thread scripts/migrations/038-canonicalize-linked-collection-ids.js Outdated
Comment thread client/src/components/sync/SyncDetailDrawer.jsx Outdated
…r load timer

- findOrCreateCollectionByName: normalize universeId to UNIVERSE_ID_MAX once
  (backfill + create) so its id/stored field match findOrCreateUniverseCollection.
- migration 038: slice owner ids to the runtime cap before deriving uc-/sc- and
  storing universeId/seriesId, so a migrated record agrees with what the live
  service computes (overlong/hand-edited owner id can't desync the row).
- SyncDetailDrawer: hold the record-load timeout in a ref and clear it on unmount
  and rapid re-load, not only when the fetch settles (no accumulating 12s timers).

Root cause: linkedCollectionId trusted callers to pre-slice the owner id — the
backfill path and migration both forgot. Absorbed the slice INTO
linkedCollectionId (and its inlined migration copy) so the derived id is bounded
by construction and no future caller can over-run the peer-sync id cap.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment thread server/services/mediaCollections.js Outdated
…CreateCollectionByName

The presence guard checked universeId.trim() but the slice used the raw value,
so a padded id (' u1 ') would store/derive 'uc- u1 ' and not converge with an
unpadded caller's 'uc-u1'. Trim before slicing so the guard and the
stored/derived value operate on the same string.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment thread client/src/components/sync/SyncDetailDrawer.jsx
Comment thread client/src/pages/Instances.jsx
…ed gate

- SyncDetailDrawer: a newer loadRecord() overwrote loadTimeoutRef.current, so an
  earlier fetch's .finally() would clear the NEWER request's timer — disabling
  its timeout and reintroducing the stuck-'Loading…' failure. Capture each
  invocation's timer locally, clear only that one, and null the ref only while it
  still points at us. +regression test.
- Instances: removing peerSubsLoaded let SyncStatusSection render snapshot badges
  as 'behind' before peerSubs resolved (livePushCovered is derived from them).
  Restore a loaded flag and suppress the 'behind' state until the first
  listPeerSubscriptions settles — falls through to the neutral 'pending' badge.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

client/src/pages/Instances.jsx:787

  • peerSubsLoaded is never reset to false when peer.instanceId becomes available (null → value) or changes, so SnapshotSyncBadge can compute behind=true while peerSubs is still the initial [] and before the first listPeerSubscriptions() settles. Set peerSubsLoaded(false) at the start of this effect (when peer.instanceId is truthy) so “behind” stays suppressed until the first fetch completes for the current instanceId.
  useEffect(() => {
    if (!peer.instanceId) {
      setPeerSubs([]);
      setPeerSubsLoaded(true); // no instanceId → nothing to load; don't suppress forever
      return;
    }
    let cancelled = false;
    const refetch = () => listPeerSubscriptions({ peerId: peer.instanceId }, { silent: true })
      .then((r) => {
        if (!cancelled) setPeerSubs(r?.subscriptions || []);
      })
      .catch(() => {
        if (!cancelled) setPeerSubs([]);
      })
      .finally(() => {
        if (!cancelled) setPeerSubsLoaded(true);
      });

Comment thread server/services/sharing/peerSync.js Outdated
Comment thread client/src/components/sync/SyncDetailDrawer.jsx
…ear stale record

- getRecordPayloadForPeer: guard getInstanceId() like pushRecordToPeer — return
  null (route 404s) when self-identity is missing/UNKNOWN instead of 500ing or
  emitting a sourceInstanceId='unknown' payload the puller rejects.
- Instances: reset peerSubsLoaded=false when instanceId becomes available/changes
  so 'behind' stays suppressed until that peer's first listPeerSubscriptions
  settles (the null→value / change edge the initial fix missed).
- SyncDetailDrawer: clear the prior record when switching to a different recordId
  (track lastRecordIdRef) so the header/preview never shows stale metadata while
  loading — or stays stuck on the wrong record if the new fetch errors. Same-id
  refresh keeps the record visible. +2 tests.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comment thread scripts/migrations/038-canonicalize-linked-collection-ids.js Outdated
Comment thread scripts/migrations/038-canonicalize-linked-collection-ids.js
Comment thread server/services/sharing/peerSync.js
…em order, record-pull size cap

- migration 038: replace the -Infinity parseMs with mediaCollections.js's
  null-based compareEarlierWins/compareNewerWins so a corrupted addedAt/createdAt
  /lastPushedAt LOSES instead of winning 'earliest'. Sort unionItems by itemKey
  to match runtime mergeCollectionItems (else two installs churn snapshot
  checksums until the next merge).
- pullRecordFromPeer: cap the response with maxBytes (HTTPS shim streaming cap)
  + a pre-buffer Content-Length check (plain-HTTP path) so a buggy/runaway peer
  can't drive unbounded memory. +tests.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment thread server/services/sharing/peerSync.js
The HTTPS-shim cap rejection was caught → null → reported as 'peer-unreachable',
misreading an oversized response as offline and hiding the 'payload-too-large'
reason the Content-Length path already uses. Surface payload-too-large from the
catch when the cap trips. +test.
@atomantic atomantic merged commit a6ebed9 into main May 24, 2026
2 checks passed
@atomantic atomantic deleted the fix/federated-collection-reconciliation branch May 24, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants