RTC: fix five bugs exposed by enabling WebSockets in e2e tests (stale rest sent to syncManager, WebSocket can publish local bootstrap state before receiving room state, remote CRDT update overwritten by stale queued local store, block/merge identity ambiguous for re-ordered blocks, nested table/array treated stale unchanged values as local intent)#77924
RTC: fix five bugs exposed by enabling WebSockets in e2e tests (stale rest sent to syncManager, WebSocket can publish local bootstrap state before receiving room state, remote CRDT update overwritten by stale queued local store, block/merge identity ambiguous for re-ordered blocks, nested table/array treated stale unchanged values as local intent)#77924danluu wants to merge 5 commits intoWordPress:trunkfrom
Conversation
|
Warning: Type of PR label mismatch To merge this PR, it requires exactly 1 label indicating the type of PR. Other labels are optional and not being checked here.
Read more about Type labels in Gutenberg. Don't worry if you don't have the required permissions to add labels; the PR reviewer should be able to help with the task. |
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message. To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
fa3f2b6 to
1a46ebf
Compare
This is part of an AI fuzzing project, where an AI wrote a fuzzer and then triages bugs from the fuzzer and creates fixes. See #77716 for the tracking issue. As of this writing, there have been no known false positives from this project, but there have been some issues, which are documented in #77716. I expect we’ll see false positives at some point (and may even have one that’s been filed in a PR that hasn’t been inspected by a code owner yet).
What?
Depends on #77920 (add e2e WebSocket tests).
This fixes a set of issues that were exposed by porting the existing RTC e2e tests to WebSockets. Here are videos that show some symptoms from the set of bugs this fixes:
table-duplicate-row-followup-visible-divergence-fullwindow.mp4
ws-title-reload-loss-current-known-fixes-wrong-convergence.mp4
ws-concurrent-list-item-move-loss-test4-annotated.mp4
Two of these videos look very similar to other PRs that are part of #77716, but they still reproduce with all of those fixes applied, so the source of the bugs seems to be distinct.
Only the WebSocket room readiness bug requires WebSockets, and that's a bug in the test WebSocket provider, but just running the e2e tests triggered multiple failures and on debugging and fixing those, there are nearby failures that seem to make sense to bundle into a fix since a lot of the failures are related. Some of these bugs are easier to hit with WebSockets than HTTP, but they're generally bugs that could be hit over WebSockets or HTTP.
AI TEXT
This branch fixes stale-state bugs at the boundaries between the WordPress
entity store, Yjs CRDT documents, and the local WebSocket RTC transport. The
user-visible repros are:
move;
Those symptoms are grouped below by source rather than by surface behavior. The
common failure pattern is that stale state is treated as fresh authority: a REST
record, local bootstrap document, queued editor-store snapshot, or value-based
table row match becomes a CRDT update after newer peer state already exists.
Bug Source 1: CRDT Persistence Saves Re-Applied Stale REST Fields
Cause
persistCRDTDoc()needs to save the entity so the pre-persist hook can refreshmeta._crdt_document. That save is not meant to make the REST title/contentauthoritative. Before this branch, it could still pass the full edited record to
saveEntityRecord(), thensaveEntityRecord()could feed the full REST saveresponse back into
syncManager.update( ..., { isSave: true } ). If the editedrecord or REST response was older than the current Y.Doc, stale title/content
became a fresh collaborative update.
How It Was Introduced
84019935998cfrom WordPress/gutenberg#72262
introduced the post-entity CRDT merge path, including
mergeCrdtBlocks()andconversion between edited post records and Yjs state. That was the point where
the WordPress entity store and the Y.Doc both started carrying values for the
same logical post fields.
be1c20e213efrom WordPress/gutenberg#74637
added saved-entity metadata and refetch behavior for RTC. As part of that, a
normal entity save began calling
syncManager.update( updatedRecord, ..., { isSave: true } ). That is reasonablefor a user save, but unsafe for a save whose only purpose is persisting the CRDT
blob.
c4e4fac0a26from WordPress/gutenberg#75699
removed the Gutenberg-plugin-only guard around collaborative editing paths,
making this save-response sync path generally active.
4961c7e3c3from WordPress/gutenberg#76206
changed save-response updates to use
LOCAL_UNDO_IGNORED_ORIGIN, which avoidedcreating undo levels. It did not change the fact that the full REST response
could still be applied as a new CRDT update.
Reproduced Symptom
Same-user title reload loss:
persistence save writes the old REST title back into the CRDT and broadcasts
it to Browser A.
Evidence:
/private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-title-reload-loss-current-known-fixes-wrong-convergence.mp4;/private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-title-reload-loss-current-known-fixes-wrong-convergence-action-log.md;test/e2e/specs/editor/collaboration/websocket/collaboration-same-user-title-reload-loss.spec.ts;packages/core-data/src/test/resolvers.js.Fix
persistCRDTDoc()now saves only the entity key andmeta, and passes__unstableSkipSyncUpdate.saveEntityRecord()can still mark the Y.Doc saved,but it does not apply REST title/content fields back into the collaborative
document for this persistence-only save.
Bug Source 2: WebSocket Join Readiness Published Local Bootstrap State Too Early
Cause
The local WebSocket provider used by the RTC e2e suite treated an open socket as
effectively ready. A joining peer sent its local Yjs state in the
joinmessage, and the relay stored and broadcast that state before the peer had
received a canonical room snapshot. A reload or late join could therefore turn
local bootstrap/REST state into room history.
This source is WebSocket-transport-specific. The other bug sources in this
report are core-data/sync/merge bugs that WebSockets make easier to trigger but
do not inherently require WebSockets.
How It Was Introduced
77132ffead0added the branch-local RTC WebSocket e2e suite and test provider. There is no
upstream WordPress PR for this local test-provider commit. Its first relay
stored
message.stateduringjoinand immediately broadcast it. The provideralso emitted
connectedon socket open and flushed queued messages before asnapshot/peer-state boundary existed.
The same bug source was fixed in this branch by
c72cbb4a0e8,which changed the relay to maintain room Y.Doc state, ask existing peers for
missing state via state vectors, and resolve provider readiness only after the
snapshot or peer-state response has been applied.
Reproduced Symptoms
The title reload repro above depends on this source to make the stale state
easy to hit: the reloaded browser joins while it has REST/bootstrap state and
must not publish that state before it receives the room's current Y.Doc.
The same readiness boundary also protects list/table tests from starting user
actions while a peer's document can still be local bootstrap state.
Fix
The WebSocket relay now stores only updates that apply to the room Y.Doc and
uses Yjs state vectors for join synchronization. The provider resolves
readiness after it has applied a snapshot or a peer-state update. Joining peers
that received remote state mark the Y.Doc with provider-synced metadata, discard
queued local bootstrap messages, and prevent local sync-manager bootstrap
updates from becoming room history.
Bug Source 3: Remote Reconciliation Allowed Stale Local Echoes
Cause
Remote CRDT updates are reconciled into the editor store by computing changes
from the Y.Doc and dispatching
EDIT_ENTITY_RECORD. Local editor-store updatesflow the other direction through
syncManager.update(). Before this branch,there was no per-key guard for the window where a remote
blocksupdate hadentered the Y.Doc but the edited record had not fully caught up. A scheduled
local update based on the pre-remote edited record could echo the stale value
back into the Y.Doc.
For
blocks, that stale value is a whole block tree, not an operation. Adelayed whole-tree write can erase a non-conflicting remote move.
How It Was Introduced
84019935998cfrom WordPress/gutenberg#72262
introduced the bidirectional entity/CRDT conversion for post blocks. The code
was designed around snapshots of entity fields. That made the editor/CRDT
handoff generic, but it also meant a list-item move reached the CRDT layer as
"here is the current
blocksarray" rather than "move this item after thatitem".
The original merge path did not carry the pre-edit base record into
applyChangesToCRDTDoc(), and the sync manager did not track which top-levelkeys were still reconciling from remote state. Those missing happens-before
edges are what allowed stale whole-key echoes.
Reproduced Symptom
Concurrent list-item move loss:
Alpha, Beta, Gamma, Delta, Epsilon, Zeta.Betadown with the normal toolbar button.Epsilonup with the normal toolbar button.Alpha, Gamma, Beta, Epsilon, Delta, Zeta.Evidence:
/private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-concurrent-list-item-move-loss-test4-annotated.mp4;/private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-concurrent-list-item-move-loss-test4-annotated-action-log.md;test/e2e/specs/editor/collaboration/collaboration-stress.spec.ts(
two users concurrently move list items), imported by the WebSocket suite;packages/sync/src/test/manager.tsandpackages/core-data/src/utils/test/crdt-blocks.ts.Fix
Local store-to-CRDT updates now carry the pre-edit
baseRecord. The syncmanager tracks remote key versions and keeps keys in a reconciling set while
remote changes are being applied to the edited record. Non-save local updates
for those keys are filtered if a newer remote version arrived after the local
update was scheduled.
Bug Source 4: Block Reorder Merges Lacked Identity/Base Rebasing
Cause
Even after stale local echoes are filtered, block moves still need a way to
preserve non-conflicting remote edits. The original block merge used left/right
equality sweeps over whole arrays. That can update the changed middle of a list,
but it does not by itself rebase a delayed local move over a remote move or
preserve a remote edit inside a block that moved.
List items do have stable block
clientIds during an editing session, so themerge can use them as an identity source when the base and current arrays have
the same set of blocks.
How It Was Introduced
84019935998cfrom WordPress/gutenberg#72262
introduced
mergeCrdtBlocks()with a generic array-diff approach. That was areasonable starting point for many block edits, but it intentionally avoided
block-specific operation semantics.
The list-move failure exposed the missing rebase step. A branch-local fix in
c72cbb4a0e8added base-record propagation and
clientIdreorder support.87a680ece2ethen fixed the follow-up case where rebasing the order could still overwrite a
remote edit inside a moved block by treating base-unchanged fields as stale, not
local intent.
Reproduced Symptom
This source contributes to the same concurrent list-item move repro as Bug
Source 3. The additional unit evidence is:
does not overwrite remote list item content while rebasing a delayed movein
packages/core-data/src/utils/test/crdt-blocks.ts.Fix
mergeCrdtBlocks()can rebase block arrays byclientIdwhen the base,incoming, and current arrays contain the same block identities. During the
field merge, values unchanged from the base are skipped so a delayed local move
does not overwrite remote edits inside the moved blocks.
Bug Source 5: Nested Table Array/Object Merges Had No Durable Row Identity
Cause
core/tablerows and cells are nested schema-aware attributes. Unlike listitems, table rows/cells do not have durable block
clientIds. When two rowscontain the same visible value, a value-based two-way merge can match the wrong
duplicate. A stale local snapshot that says "one
samerow remains" can besatisfied by deleting a remotely edited duplicate and keeping the stale
unchanged duplicate.
Nested object merging had a second version of the same authority bug: while a
base object was available, the merge still applied every key present in the
incoming object. A key unchanged from the base could overwrite a remote sibling
edit in the current Y.Map.
How It Was Introduced
09a21c64b5b9from WordPress/gutenberg#76913
changed
core/tablecells from plain replacement values into schema-aware Yjsstructures. This fixed important cell-level merge cases, but table rows and
cells still had no durable identity. Matching remained value/position based.
a6bfd3e55432from WordPress/gutenberg#77164
improved array attribute stability with a left/right sweep. That preserved
Yjs object identity for many structural edits, but it was still a two-way diff
between the incoming snapshot and current CRDT value. It did not use the local
pre-edit base value to identify which duplicate row was deleted.
The duplicate-row follow-up repro is the bad case for that design: Browser A's
stale local table body says one duplicate remains; Browser B has already edited
the other duplicate and added a cell to it.
Reproduced Symptom
Duplicate table row follow-up divergence:
anchor,same,same.edited-duplicate.extra.anchor / emptyandedited-duplicate / extra.save/reload can then converge the editors to content that differs from saved
raw post content.
Evidence:
/private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/table-duplicate-row-followup-visible-divergence-fullwindow.mp4;/private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/table-duplicate-row-followup-visible-divergence-fullwindow-action-log.md;preserves a remote-edited duplicate table row when a stale local snapshot deletes the earlier duplicate rowand
preserves remote sibling fields when a stale local nested object changes another fieldin
packages/core-data/src/utils/test/crdt-blocks.ts;test/e2e/specs/editor/collaboration/websocket/collaboration-table-followups.spec.ts.Fix
Nested schema-aware array merges now receive the pre-edit base value. When the
current Y.Array still has the base shape, the merge applies structural changes
relative to that base and skips elements/fields unchanged from the base. For an
ambiguous single deletion among duplicate values, it prefers deleting the
duplicate whose current Yjs value still equals the base row, preserving the row
that has remote edits. Nested object merges also skip keys unchanged from the
base so stale sibling fields do not clobber remote edits.
This is intentionally conservative. If the current Y.Array no longer has the
base shape, the merge falls back to the older two-way behavior. Durable row/cell
identity would be a better long-term model for more complex simultaneous table
structure edits.
Fix Plan
meta, anduse
__unstableSkipSyncUpdateso REST fields do not become collaborativeupdates.
room snapshot or peer-state response before publishing local bootstrap state.
scheduled writes for keys being reconciled from remote state are dropped.
clientIdrebasing for block arrays where stable identity exists.follow-up work beyond the reproduced bugs fixed here.
END AI TEXT