Skip to content

RTC: fix five bugs exposed by enabling WebSockets in e2e tests (stale rest sent to syncManager, WebSocket can publish local bootstrap state before receiving room state, remote CRDT update overwritten by stale queued local store, block/merge identity ambiguous for re-ordered blocks, nested table/array treated stale unchanged values as local intent)#77924

Open
danluu wants to merge 5 commits intoWordPress:trunkfrom
danluu:codex/rtc-websocket-e2e-explanation-20260502-pr

Conversation

@danluu
Copy link
Copy Markdown
Contributor

@danluu danluu commented May 5, 2026

This is part of an AI fuzzing project, where an AI wrote a fuzzer and then triages bugs from the fuzzer and creates fixes. See #77716 for the tracking issue. As of this writing, there have been no known false positives from this project, but there have been some issues, which are documented in #77716. I expect we’ll see false positives at some point (and may even have one that’s been filed in a PR that hasn’t been inspected by a code owner yet).

What?

Depends on #77920 (add e2e WebSocket tests).

This fixes a set of issues that were exposed by porting the existing RTC e2e tests to WebSockets. Here are videos that show some symptoms from the set of bugs this fixes:

table-duplicate-row-followup-visible-divergence-fullwindow.mp4
ws-title-reload-loss-current-known-fixes-wrong-convergence.mp4
ws-concurrent-list-item-move-loss-test4-annotated.mp4

Two of these videos look very similar to other PRs that are part of #77716, but they still reproduce with all of those fixes applied, so the source of the bugs seems to be distinct.

Only the WebSocket room readiness bug requires WebSockets, and that's a bug in the test WebSocket provider, but just running the e2e tests triggered multiple failures and on debugging and fixing those, there are nearby failures that seem to make sense to bundle into a fix since a lot of the failures are related. Some of these bugs are easier to hit with WebSockets than HTTP, but they're generally bugs that could be hit over WebSockets or HTTP.

AI TEXT

This branch fixes stale-state bugs at the boundaries between the WordPress
entity store, Yjs CRDT documents, and the local WebSocket RTC transport. The
user-visible repros are:

  • same-user unsaved title loss after another browser session reloads;
  • concurrent list-item moves where each browser can lose the other user's
    move;
  • duplicate table row follow-up divergence after normal table toolbar edits.

Those symptoms are grouped below by source rather than by surface behavior. The
common failure pattern is that stale state is treated as fresh authority: a REST
record, local bootstrap document, queued editor-store snapshot, or value-based
table row match becomes a CRDT update after newer peer state already exists.

Bug Source 1: CRDT Persistence Saves Re-Applied Stale REST Fields

Cause

persistCRDTDoc() needs to save the entity so the pre-persist hook can refresh
meta._crdt_document. That save is not meant to make the REST title/content
authoritative. Before this branch, it could still pass the full edited record to
saveEntityRecord(), then saveEntityRecord() could feed the full REST save
response back into syncManager.update( ..., { isSave: true } ). If the edited
record or REST response was older than the current Y.Doc, stale title/content
became a fresh collaborative update.

How It Was Introduced

84019935998c
from WordPress/gutenberg#72262
introduced the post-entity CRDT merge path, including mergeCrdtBlocks() and
conversion between edited post records and Yjs state. That was the point where
the WordPress entity store and the Y.Doc both started carrying values for the
same logical post fields.

be1c20e213e
from WordPress/gutenberg#74637
added saved-entity metadata and refetch behavior for RTC. As part of that, a
normal entity save began calling
syncManager.update( updatedRecord, ..., { isSave: true } ). That is reasonable
for a user save, but unsafe for a save whose only purpose is persisting the CRDT
blob.

c4e4fac0a26
from WordPress/gutenberg#75699
removed the Gutenberg-plugin-only guard around collaborative editing paths,
making this save-response sync path generally active.

4961c7e3c3
from WordPress/gutenberg#76206
changed save-response updates to use LOCAL_UNDO_IGNORED_ORIGIN, which avoided
creating undo levels. It did not change the fact that the full REST response
could still be applied as a new CRDT update.

Reproduced Symptom

Same-user title reload loss:

  1. Browser A and Browser B open the same draft as the same user.
  2. Browser A changes the title but does not save.
  3. Browser B makes a companion edit, reloads, and rejoins the WebSocket room.
  4. Expected: both browsers keep Browser A's unsaved title.
  5. Failing behavior: Browser B briefly receives the unsaved title, then a stale
    persistence save writes the old REST title back into the CRDT and broadcasts
    it to Browser A.

Evidence:

  • video:
    /private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-title-reload-loss-current-known-fixes-wrong-convergence.mp4;
  • action log:
    /private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-title-reload-loss-current-known-fixes-wrong-convergence-action-log.md;
  • Playwright regression:
    test/e2e/specs/editor/collaboration/websocket/collaboration-same-user-title-reload-loss.spec.ts;
  • unit coverage:
    packages/core-data/src/test/resolvers.js.

Fix

persistCRDTDoc() now saves only the entity key and meta, and passes
__unstableSkipSyncUpdate. saveEntityRecord() can still mark the Y.Doc saved,
but it does not apply REST title/content fields back into the collaborative
document for this persistence-only save.

Bug Source 2: WebSocket Join Readiness Published Local Bootstrap State Too Early

Cause

The local WebSocket provider used by the RTC e2e suite treated an open socket as
effectively ready. A joining peer sent its local Yjs state in the join
message, and the relay stored and broadcast that state before the peer had
received a canonical room snapshot. A reload or late join could therefore turn
local bootstrap/REST state into room history.

This source is WebSocket-transport-specific. The other bug sources in this
report are core-data/sync/merge bugs that WebSockets make easier to trigger but
do not inherently require WebSockets.

How It Was Introduced

77132ffead0
added the branch-local RTC WebSocket e2e suite and test provider. There is no
upstream WordPress PR for this local test-provider commit. Its first relay
stored message.state during join and immediately broadcast it. The provider
also emitted connected on socket open and flushed queued messages before a
snapshot/peer-state boundary existed.

The same bug source was fixed in this branch by
c72cbb4a0e8,
which changed the relay to maintain room Y.Doc state, ask existing peers for
missing state via state vectors, and resolve provider readiness only after the
snapshot or peer-state response has been applied.

Reproduced Symptoms

The title reload repro above depends on this source to make the stale state
easy to hit: the reloaded browser joins while it has REST/bootstrap state and
must not publish that state before it receives the room's current Y.Doc.

The same readiness boundary also protects list/table tests from starting user
actions while a peer's document can still be local bootstrap state.

Fix

The WebSocket relay now stores only updates that apply to the room Y.Doc and
uses Yjs state vectors for join synchronization. The provider resolves
readiness after it has applied a snapshot or a peer-state update. Joining peers
that received remote state mark the Y.Doc with provider-synced metadata, discard
queued local bootstrap messages, and prevent local sync-manager bootstrap
updates from becoming room history.

Bug Source 3: Remote Reconciliation Allowed Stale Local Echoes

Cause

Remote CRDT updates are reconciled into the editor store by computing changes
from the Y.Doc and dispatching EDIT_ENTITY_RECORD. Local editor-store updates
flow the other direction through syncManager.update(). Before this branch,
there was no per-key guard for the window where a remote blocks update had
entered the Y.Doc but the edited record had not fully caught up. A scheduled
local update based on the pre-remote edited record could echo the stale value
back into the Y.Doc.

For blocks, that stale value is a whole block tree, not an operation. A
delayed whole-tree write can erase a non-conflicting remote move.

How It Was Introduced

84019935998c
from WordPress/gutenberg#72262
introduced the bidirectional entity/CRDT conversion for post blocks. The code
was designed around snapshots of entity fields. That made the editor/CRDT
handoff generic, but it also meant a list-item move reached the CRDT layer as
"here is the current blocks array" rather than "move this item after that
item".

The original merge path did not carry the pre-edit base record into
applyChangesToCRDTDoc(), and the sync manager did not track which top-level
keys were still reconciling from remote state. Those missing happens-before
edges are what allowed stale whole-key echoes.

Reproduced Symptom

Concurrent list-item move loss:

  1. Browser A and Browser B open the same list:
    Alpha, Beta, Gamma, Delta, Epsilon, Zeta.
  2. Browser A moves Beta down with the normal toolbar button.
  3. Browser B moves Epsilon up with the normal toolbar button.
  4. Expected merged order:
    Alpha, Gamma, Beta, Epsilon, Delta, Zeta.
  5. Failing behavior: each browser keeps its own move and loses the other move.

Evidence:

  • clearer attempt-4 video:
    /private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-concurrent-list-item-move-loss-test4-annotated.mp4;
  • action log:
    /private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/ws-concurrent-list-item-move-loss-test4-annotated-action-log.md;
  • Playwright regression:
    test/e2e/specs/editor/collaboration/collaboration-stress.spec.ts
    (two users concurrently move list items), imported by the WebSocket suite;
  • sync-manager and block-merge unit coverage:
    packages/sync/src/test/manager.ts and
    packages/core-data/src/utils/test/crdt-blocks.ts.

Fix

Local store-to-CRDT updates now carry the pre-edit baseRecord. The sync
manager tracks remote key versions and keeps keys in a reconciling set while
remote changes are being applied to the edited record. Non-save local updates
for those keys are filtered if a newer remote version arrived after the local
update was scheduled.

Bug Source 4: Block Reorder Merges Lacked Identity/Base Rebasing

Cause

Even after stale local echoes are filtered, block moves still need a way to
preserve non-conflicting remote edits. The original block merge used left/right
equality sweeps over whole arrays. That can update the changed middle of a list,
but it does not by itself rebase a delayed local move over a remote move or
preserve a remote edit inside a block that moved.

List items do have stable block clientIds during an editing session, so the
merge can use them as an identity source when the base and current arrays have
the same set of blocks.

How It Was Introduced

84019935998c
from WordPress/gutenberg#72262
introduced mergeCrdtBlocks() with a generic array-diff approach. That was a
reasonable starting point for many block edits, but it intentionally avoided
block-specific operation semantics.

The list-move failure exposed the missing rebase step. A branch-local fix in
c72cbb4a0e8
added base-record propagation and clientId reorder support.
87a680ece2e
then fixed the follow-up case where rebasing the order could still overwrite a
remote edit inside a moved block by treating base-unchanged fields as stale, not
local intent.

Reproduced Symptom

This source contributes to the same concurrent list-item move repro as Bug
Source 3. The additional unit evidence is:

  • does not overwrite remote list item content while rebasing a delayed move
    in packages/core-data/src/utils/test/crdt-blocks.ts.

Fix

mergeCrdtBlocks() can rebase block arrays by clientId when the base,
incoming, and current arrays contain the same block identities. During the
field merge, values unchanged from the base are skipped so a delayed local move
does not overwrite remote edits inside the moved blocks.

Bug Source 5: Nested Table Array/Object Merges Had No Durable Row Identity

Cause

core/table rows and cells are nested schema-aware attributes. Unlike list
items, table rows/cells do not have durable block clientIds. When two rows
contain the same visible value, a value-based two-way merge can match the wrong
duplicate. A stale local snapshot that says "one same row remains" can be
satisfied by deleting a remotely edited duplicate and keeping the stale
unchanged duplicate.

Nested object merging had a second version of the same authority bug: while a
base object was available, the merge still applied every key present in the
incoming object. A key unchanged from the base could overwrite a remote sibling
edit in the current Y.Map.

How It Was Introduced

09a21c64b5b9
from WordPress/gutenberg#76913
changed core/table cells from plain replacement values into schema-aware Yjs
structures. This fixed important cell-level merge cases, but table rows and
cells still had no durable identity. Matching remained value/position based.

a6bfd3e55432
from WordPress/gutenberg#77164
improved array attribute stability with a left/right sweep. That preserved
Yjs object identity for many structural edits, but it was still a two-way diff
between the incoming snapshot and current CRDT value. It did not use the local
pre-edit base value to identify which duplicate row was deleted.

The duplicate-row follow-up repro is the bad case for that design: Browser A's
stale local table body says one duplicate remains; Browser B has already edited
the other duplicate and added a cell to it.

Reproduced Symptom

Duplicate table row follow-up divergence:

  1. Browser A creates a 1-column table with rows anchor, same, same.
  2. Browser B edits the second duplicate row to edited-duplicate.
  3. Browser B inserts a column after that row and types extra.
  4. Browser A deletes the earlier duplicate row.
  5. Expected: both browsers show anchor / empty and
    edited-duplicate / extra.
  6. Failing behavior: Browser B can keep an extra stale duplicate row. Follow-up
    save/reload can then converge the editors to content that differs from saved
    raw post content.

Evidence:

  • video:
    /private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/table-duplicate-row-followup-visible-divergence-fullwindow.mp4;
  • action log:
    /private/tmp/gutenberg-ws-current-known-fixes-videos-20260505/table-duplicate-row-followup-visible-divergence-fullwindow-action-log.md;
  • unit regressions:
    preserves a remote-edited duplicate table row when a stale local snapshot deletes the earlier duplicate row
    and
    preserves remote sibling fields when a stale local nested object changes another field
    in packages/core-data/src/utils/test/crdt-blocks.ts;
  • Playwright regression:
    test/e2e/specs/editor/collaboration/websocket/collaboration-table-followups.spec.ts.

Fix

Nested schema-aware array merges now receive the pre-edit base value. When the
current Y.Array still has the base shape, the merge applies structural changes
relative to that base and skips elements/fields unchanged from the base. For an
ambiguous single deletion among duplicate values, it prefers deleting the
duplicate whose current Yjs value still equals the base row, preserving the row
that has remote edits. Nested object merges also skip keys unchanged from the
base so stale sibling fields do not clobber remote edits.

This is intentionally conservative. If the current Y.Array no longer has the
base shape, the merge falls back to the older two-way behavior. Durable row/cell
identity would be a better long-term model for more complex simultaneous table
structure edits.

Fix Plan

  1. Keep CRDT persistence saves narrow: save only entity key plus meta, and
    use __unstableSkipSyncUpdate so REST fields do not become collaborative
    updates.
  2. Keep the WebSocket snapshot/readiness protocol: joining peers must receive a
    room snapshot or peer-state response before publishing local bootstrap state.
  3. Keep remote-key reconciliation filtering in the sync manager: stale local
    scheduled writes for keys being reconciled from remote state are dropped.
  4. Keep base-record propagation through store-to-CRDT updates.
  5. Keep block clientId rebasing for block arrays where stable identity exists.
  6. Keep base-aware nested array/object merging for schema-aware attributes.
  7. Treat durable identity for structured attributes such as table rows/cells as
    follow-up work beyond the reproduced bugs fixed here.

END AI TEXT

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Warning: Type of PR label mismatch

To merge this PR, it requires exactly 1 label indicating the type of PR. Other labels are optional and not being checked here.

  • Required label: Any label starting with [Type].
  • Labels found: .

Read more about Type labels in Gutenberg. Don't worry if you don't have the required permissions to add labels; the PR reviewer should be able to help with the task.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: danluu <danluu@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant