fix(sync): federated collection reconciliation + Pull-from-peer + drawer resilience#472
Merged
Merged
Conversation
Problem: universe/series-linked collections got a random UUID per machine, so
per-record sync (keyed on id) treated each machine's copy of the SAME universe's
collection as distinct — integrity showed both as 'local only' and the merge
created duplicates on every peer.
- linkedCollectionId({universeId|seriesId}) → 'uc-<universeId>' / 'sc-<seriesId>';
findOrCreate{Universe,Series}Collection + findOrCreateCollectionByName mint it
so every federated machine derives the SAME id and converges. Standalone
collections keep random ids.
- Create paths reclaim the deterministic id if a tombstone holds it (re-creating
a deleted universe collection) — a plain append would leave two records at one
id and listCollections' dedup would keep the tombstone, dropping the live one.
- Migration 038 canonicalizes existing linked-collection ids on each machine,
merging (union items) any same-owner duplicates and rewriting mediaCollection
subscription recordIds. Idempotent; both machines converge independently.
- Instances page: removed the unbounded 'Live-pushed records' per-record list
(the Sync Details drawer covers per-record status); kept the peerSubs fetch
that SchemaGapBadge/SyncStatusSection still need.
Tests: migration 038 (11), deterministic-id + tombstone-reclaim units, existing
sharing round-trip green. Server 7317, client 488, lint + build clean.
…state) The drawer's record fetch could leave a permanent 'Loading…' spinner if the request hung (the symptom seen opening 'Universe: Clandestiny'). Add: - recordError state + a 12s hard timeout on the record fetch — a stalled request now flips to an explicit error instead of an infinite spinner (the timeout also bumps the generation so a late response is dropped). - CollectionPreview renders a 'Couldn't load this collection — try Refresh.' state on error, distinct from the loaded-but-empty 'not found' case. Tests: +reject→error, +hung-fetch→timeout→error (fake timers). Client 490, build + lint clean.
'Sync to peer' was push-only, so a record the LOCAL machine was BEHIND on (e.g. a universe whose canon images live only on the peer) could never be fixed from the machine you were looking at. Add the mirror: - Server: GET /api/peer-sync/record returns a single record's push-payload (buildPushPayload without a subscription); POST /api/peer-sync/pull-record fetches that payload from a peer and applies it locally via applyIncomingPush — which already merges the record (+ bundled collection) AND background-pulls missing asset bytes. The pulled payload is validated with the same peerSyncPushSchema the inbound /push route uses. - Client: pullRecordFromPeer() API wrapper + a direction-aware Sync Details drawer: 'Pull from peer' for peer-only, 'Sync to peer' for local-only, BOTH for diverged/assets-missing (ambiguous). Reasoned toasts on no-op results. Tests: +GET /record & POST /pull-record routes, +pullRecordFromPeer service (peer-not-found/unreachable/not-on-peer/invalid/applies), +getRecordPayloadForPeer, +drawer pull-button + both-buttons, +apiPeerSync wrapper. Server 7331, client 493, build + lint clean.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes real-world issues in federated media-collection sync by making universe/series-linked collection identities converge across peers, adding a receiver-initiated “pull” path for per-record sync, and hardening the Sync Details drawer against hung record fetches while removing an unbounded UI list.
Changes:
- Introduces deterministic IDs for universe/series-linked media collections (
uc-<universeId>/sc-<seriesId>) plus migration 038 to canonicalize existing data and rewrite affected peer subscriptions. - Adds peer-sync “pull” capability via
GET /api/peer-sync/record(serve push payload) andPOST /api/peer-sync/pull-record(fetch + apply payload), with client wrapper + drawer actions. - Improves SyncDetailDrawer resilience (12s timeout + explicit error state) and removes the Instances-page “Live-pushed records” list while keeping the underlying peerSubs fetch.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| server/services/sharing/peerSync.js | Adds getRecordPayloadForPeer and pullRecordFromPeer service functions. |
| server/services/sharing/peerSync.test.js | Adds unit tests for new peer-sync pull/payload helpers. |
| server/routes/peerSync.js | Adds GET /record and POST /pull-record routes. |
| server/routes/peerSync.test.js | Adds route tests for /record and /pull-record. |
| server/services/mediaCollections.js | Switches linked collections to deterministic IDs and reclaims tombstoned IDs on create paths; adds linkedCollectionId. |
| server/services/mediaCollections.test.js | Adds tests for deterministic IDs and tombstone reclaim behavior. |
| scripts/migrations/038-canonicalize-linked-collection-ids.js | New migration to canonicalize linked collection IDs and rewrite mediaCollection peer subscriptions. |
| scripts/migrations/038-canonicalize-linked-collection-ids.test.js | Tests for migration transforms, idempotence, and on-disk rewrite. |
| client/src/services/apiPeerSync.js | Adds pullRecordFromPeer API wrapper. |
| client/src/services/apiPeerSync.test.js | Adds test coverage for the new API wrapper. |
| client/src/components/sync/SyncDetailDrawer.jsx | Adds direction-aware Pull/Sync buttons and record-fetch timeout/error state. |
| client/src/components/sync/SyncDetailDrawer.test.jsx | Adds UI tests for pull button visibility/call, and timeout/reject error states. |
| client/src/pages/Instances.jsx | Removes “Live-pushed records” section while keeping shared peerSubs fetch for status/badges. |
| .changelog/NEXT.md | Documents the reconciliation + pull-from-peer + drawer resilience + UI removal fixes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- pullRecordFromPeer: add an AbortController timeout (PUSH_TIMEOUT_MS) — peerFetch
has no built-in timeout, so a hung peer would otherwise pin the pull (and the
UI action) forever; abort maps to peer-unreachable. (Copilot)
- SyncDetailDrawer Refresh button now also re-runs loadRecord, so the 'try
Refresh' error copy is honest — Refresh was only re-fetching integrity, not
the record. (Copilot)
- findOrCreateCollectionByName: when backfilling a universe link onto a legacy
orphan, adopt the deterministic id (was keeping the random id → reintroduced
non-convergent linked ids). Prefers an existing live canonical collection
over clobbering it; only reclaims a tombstone at the canonical id. (Copilot)
- migration 038: import rename at top instead of a dynamic import('fs/promises')
inside writeJsonAtomic. (self-review)
Tests: +backfill canonicalization + canonical-preference cases. Server 7333,
client 493, build + lint clean.
pullRecordFromPeer trusted the payload's self-reported sourceInstanceId; a misconfigured/buggy peer returning a record that originated elsewhere would bind our reverse-subscription + asset-pull to a peer we never contacted. Reject when sourceInstanceId != the resolved peer.instanceId.
…, migration metric - SyncDetailDrawer: PeerRow onRefresh now reloads the record too, not just integrity — a peer-only record pulled local was leaving a stale preview. - pullRecordFromPeer: also assert the returned payload is the requested kind+record.id (not just origin) before applyIncomingPush. - migration 038: surface the previously-dead subscription metric as subsDeduped in the log + return (dropped the meaningless id-map addend).
…r load timer - findOrCreateCollectionByName: normalize universeId to UNIVERSE_ID_MAX once (backfill + create) so its id/stored field match findOrCreateUniverseCollection. - migration 038: slice owner ids to the runtime cap before deriving uc-/sc- and storing universeId/seriesId, so a migrated record agrees with what the live service computes (overlong/hand-edited owner id can't desync the row). - SyncDetailDrawer: hold the record-load timeout in a ref and clear it on unmount and rapid re-load, not only when the fetch settles (no accumulating 12s timers). Root cause: linkedCollectionId trusted callers to pre-slice the owner id — the backfill path and migration both forgot. Absorbed the slice INTO linkedCollectionId (and its inlined migration copy) so the derived id is bounded by construction and no future caller can over-run the peer-sync id cap.
…CreateCollectionByName
The presence guard checked universeId.trim() but the slice used the raw value,
so a padded id (' u1 ') would store/derive 'uc- u1 ' and not converge with an
unpadded caller's 'uc-u1'. Trim before slicing so the guard and the
stored/derived value operate on the same string.
…ed gate - SyncDetailDrawer: a newer loadRecord() overwrote loadTimeoutRef.current, so an earlier fetch's .finally() would clear the NEWER request's timer — disabling its timeout and reintroducing the stuck-'Loading…' failure. Capture each invocation's timer locally, clear only that one, and null the ref only while it still points at us. +regression test. - Instances: removing peerSubsLoaded let SyncStatusSection render snapshot badges as 'behind' before peerSubs resolved (livePushCovered is derived from them). Restore a loaded flag and suppress the 'behind' state until the first listPeerSubscriptions settles — falls through to the neutral 'pending' badge.
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
client/src/pages/Instances.jsx:787
- peerSubsLoaded is never reset to false when peer.instanceId becomes available (null → value) or changes, so SnapshotSyncBadge can compute behind=true while peerSubs is still the initial [] and before the first listPeerSubscriptions() settles. Set peerSubsLoaded(false) at the start of this effect (when peer.instanceId is truthy) so “behind” stays suppressed until the first fetch completes for the current instanceId.
useEffect(() => {
if (!peer.instanceId) {
setPeerSubs([]);
setPeerSubsLoaded(true); // no instanceId → nothing to load; don't suppress forever
return;
}
let cancelled = false;
const refetch = () => listPeerSubscriptions({ peerId: peer.instanceId }, { silent: true })
.then((r) => {
if (!cancelled) setPeerSubs(r?.subscriptions || []);
})
.catch(() => {
if (!cancelled) setPeerSubs([]);
})
.finally(() => {
if (!cancelled) setPeerSubsLoaded(true);
});
…ear stale record - getRecordPayloadForPeer: guard getInstanceId() like pushRecordToPeer — return null (route 404s) when self-identity is missing/UNKNOWN instead of 500ing or emitting a sourceInstanceId='unknown' payload the puller rejects. - Instances: reset peerSubsLoaded=false when instanceId becomes available/changes so 'behind' stays suppressed until that peer's first listPeerSubscriptions settles (the null→value / change edge the initial fix missed). - SyncDetailDrawer: clear the prior record when switching to a different recordId (track lastRecordIdRef) so the header/preview never shows stale metadata while loading — or stays stuck on the wrong record if the new fetch errors. Same-id refresh keeps the record visible. +2 tests.
…em order, record-pull size cap - migration 038: replace the -Infinity parseMs with mediaCollections.js's null-based compareEarlierWins/compareNewerWins so a corrupted addedAt/createdAt /lastPushedAt LOSES instead of winning 'earliest'. Sort unionItems by itemKey to match runtime mergeCollectionItems (else two installs churn snapshot checksums until the next merge). - pullRecordFromPeer: cap the response with maxBytes (HTTPS shim streaming cap) + a pre-buffer Content-Length check (plain-HTTP path) so a buggy/runaway peer can't drive unbounded memory. +tests.
The HTTPS-shim cap rejection was caught → null → reported as 'peer-unreachable', misreading an oversized response as offline and hiding the 'payload-too-large' reason the Content-Length path already uses. Surface payload-too-large from the catch when the cap trips. +test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #468, fixing real issues found running federated media-collection sync across two machines (void ↔ null).
What was broken
Universe: Xcollection, so per-record sync (keyed onid) saw each machine's copy as a different record → integrity reported both sides "local only", and the merge created a duplicate collection on every peer.Fixes
uc-<universeId>/sc-<seriesId>vialinkedCollectionId) so every machine converges on one record. Create paths reclaim the id when a tombstone holds it (re-create after delete). Migration 038 canonicalizes existing ids on each machine, merging (union of items) same-owner duplicates and rewritingmediaCollectionsubscription recordIds — idempotent; both machines converge independently.GET /api/peer-sync/record(returns a record's push-payload) +POST /api/peer-sync/pull-record(fetch from peer →applyIncomingPush, which merges the record + background-pulls missing asset bytes). The drawer is now direction-aware: Pull forpeer-only, Sync forlocal-only, both fordiverged/assets-missing.peerSubsfetch thatSchemaGapBadge/SyncStatusSectionstill use.mediaCollectionssync stays opt-in per peer (enable "Media Collections" on the Instances page).Tests
Migration 038 (pure transforms + idempotent
up()), deterministic-id + tombstone-reclaim units,GET /record&POST /pull-recordroutes,pullRecordFromPeer/getRecordPayloadForPeerservice, drawer pull/push buttons + reject/timeout resilience, apiPeerSync wrapper. Server 7331, client 493, build + lint clean.