RTC: fix stale block snapshot overwriting newer state#77876
RTC: fix stale block snapshot overwriting newer state#77876danluu wants to merge 3 commits intoWordPress:trunkfrom
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message. To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
Warning: Type of PR label mismatch To merge this PR, it requires exactly 1 label indicating the type of PR. Other labels are optional and not being checked here.
Read more about Type labels in Gutenberg. Don't worry if you don't have the required permissions to add labels; the PR reviewer should be able to help with the task. |
057793c to
fdd92d7
Compare
| const clientId = getBlockClientId( block ); | ||
|
|
||
| if ( ! clientId || seenIds.has( clientId ) ) { | ||
| return null; |
There was a problem hiding this comment.
this seems to contradict the name of the function, or the return was meant to be written inside a callback?
There was a problem hiding this comment.
Not sure if the new name is too verbose, but codex gave this a very verbose name that describes what it does (this apparently intentionally returns null unless every block has a non-empty unique ID, in which case there's a non-ambiguous identity match). With the fixed name, the function is still odd in a way that points to a serious problem.
The approach is a stop-gap due to this being a relatively small PR that doesn't do major surgery. I'm not sure there's a really good fix here with the overall design as it is, but I've been trying to avoid suggesting any kind of major changes because I don't know anything about the history or intent of the project or code. There's a combination of factors here that makes this fraught:
- Save write full snapshots (and can therefore overwrite a lot of stuff when stale)
- No server-side CAS; last write to the server wins, so the client has to enforce correctness
clientIDis local to a session- IDs are discarded on serialization
- Serialized HTML doesn't carry any operations; a missing X due to intentional deletion vs. a stale snapshot can't be distinguished
This makes it impossible to, in general, tell which parts of a snapshot are stale or not when the editor is saving and RTC "wants to" merge individual edits.
This PR fixes a hole in the logic here, but I'm not sure if this class of bug can be eliminated without a major change in the approach.
I'm not really a distributed systems person in the sense of working on the algorithms that make the distributed stuff correct, but if I think back to when I was in school and they talked about Lamport clocks, instead of using the actual timestamp/clock here, the system can (as in, does under some circumstances) assume that the arrival order is causal. This doesn't seem right in the general case and the thing I'm not sure about is how one would patch all such holes without saving more information. It seems that the lack of this information is intentional / by design.
fdd92d7 to
78ce698
Compare
78ce698 to
41df616
Compare
This is part of an AI fuzzing project, where an AI wrote a fuzzer and then triages bugs from the fuzzer and creates fixes. See #77716 for the tracking issue. As of this writing, there have been no known false positives from this project, but there have been some issues, which are documented in #77716. I expect we’ll see false positives at some point (and may even have one that’s been filed in a PR that hasn’t been inspected by a code owner yet).
What?
stale-content-overwrite-repro.mp4
BEGIN AI GENERATED TEXT
Two editor windows for the same WordPress account, including support/SU-style sessions, can edit the same post or page. Window A saves current content. Window B was opened before A's save and still has stale local editor state. If B makes a small edit and saves, the REST
contentbody can be replaced by B's stale full body plus B's small edit, dropping A's already-saved content.This is the same-account/support-session content/title-loss class. It is distinct from the large-update "Connection lost" issue, which matches #77669.
Repros
Manual browser repro:
Alpha,Beta.same-account-current-*and save.Alphawithsame-account-stale-*and save.Beta, but no longer contains A's saved marker.Committed unit-level repros on
try/stale-content-overwrite-pr:packages/core-data/src/utils/test/crdt-stale-top-level-blocks.test.tspackages/core-data/src/test/entities.jscontentfrom merged blocks, and save-time merging with the latest persisted CRDT record.Committed browser repro:
test/e2e/specs/editor/collaboration/collaboration-same-user-stale-content-overwrite.spec.tswp.data, stub requests, inject faults, alter clocks, or synthesize blocks in the browser.Browser command:
Known-fixes status
Checked against:
/Users/danluu/dev/fuzz/gutenberg-stale-content-overwrite-known/Users/danluu/dev/fuzz/gutenberg-fuzz-all-local-known-fixes-clonetry/fuzz-all-local-known-fixes-clone6a1a8d30794Focused known-base command:
WP_BASE_URL=http://localhost:8910 npm run test:e2e -- test/e2e/specs/editor/collaboration/collaboration-same-user-stale-save-content-loss.spec.ts --project=chromium --grep "reproduces saved content loss"Result: still reproduces. The failure had final REST content containing
same-user-second-session-*and the stale initial paragraph, but missingsame-user-customer-saved-*.The known fixes include stale nested/table/rich-text fixes, but they do not fix this same-account stale save path.
Video
Local repro video:
The video was generated against the known-fixes base on
http://localhost:8910. It shows both editor windows side-by-side and an annotation log. Window A saves a new paragraph, then window B saves a stale body; final REST content contains B's edit and is missing A's marker.Failure mechanism
The editor stores post content as a full serialized body in the REST
contentfield. With collaboration enabled, Gutenberg also persists a CRDT document in post meta.The stale window's local state can lag behind the current saved server state. Before this fix,
prePersistPostTypecreated the persisted CRDT meta from the stale local sync document during B's save. It did not first fetch and merge the latest saved server record/CRDT document. The subsequentPUTtherefore sent stale fullcontent, and WordPress accepted it as the latest version.There was a second merge hazard inside the block CRDT path. Gutenberg receives full block snapshots, not granular "I changed only block X" operations. When a stale snapshot touched one block, unchanged sibling blocks and missing remote top-level blocks could be interpreted as local updates/deletes. That could discard remote additions after a save-time CRDT rebase.
Introduction history
This was introduced as an architectural gap across the RTC/Yjs save and persistence work, not as a single obvious typo.
The original design direction came from WordPress/gutenberg#68483, "[Yjs Collab] Reliable sync with the backend." That PR called out the same class of risk: earlier collaboration did not reliably sync backend state with the Yjs document, collaborative documents were not preserved across sessions, and concurrent changes could overwrite each other. It proposed reconciling backend state into the Yjs document and persisting that rebased document.
The production path then landed incrementally:
contentsnapshots while a side Yjs document tried to track collaborative state.prePersistPostTypeserializes the local Y.Doc throughSyncManager#createPersistedCRDTDocand stores it in post meta as_crdt_document. This made saves preserve collaboration state, but it also meant a stale editor window could serialize and persist its stale local Y.Doc during a normal toolbar save. There was no save-time fetch/merge of the latest server record before forming the RESTPUT.blocksvalue," and 22e067b02438 / #75448, "Use Y.Text for title, content and excerpt," brought serializedcontent,title, andexcerptinto the CRDT path. That made the stale full serialized body another value that could be written from B's local state unless the save path first merged the latest persisted server state.Put together, the bug appears once these conditions are all true: post/page saves still send full serialized
content; RTC persists a local Y.Doc into post meta on save; persisted CRDT docs are reconciled on load/refetch rather than immediately before every conflicting save; and full block/content snapshots are merged without a stale-base comparison. Same-account/support sessions make the timing realistic because the stale editor can save before it observes the other window's saved state.This bug is a gap in that architecture, not a regression from the two recent fixes called out in the assignment:
The same-account stale-save bug remains when a stale editor has a local CRDT document but has not incorporated the latest saved server CRDT document before its save.
Initial fix plan
The first plan was to fix
mergeCrdtBlocksonly:contentfrom merged blocks whenblocksare present.This improved lower-level CRDT behavior but did not fix the browser repro by itself, because the stale same-account window may never receive A's save through live sync before B saves.
Audit: Linus Torvalds
The block-merge-only plan was too clever and too low-level. The actual bug is a stale save overwriting newer persistent state. Fixing only an internal merge algorithm assumes the stale window already has the newer state locally. That assumption is exactly what the bug disproves. The save path must not blindly write an old full body when the server has moved on.
Audit: Kyle Kingsbury / Jepsen
This is a lost-update anomaly. The system has no explicit compare-and-set on the post body and no user-visible conflict check. A last-writer-wins
PUTis unsafe when clients send full snapshots. A correct mitigation needs to read the current server state, merge against it, and write a result that includes both updates, or else refuse the write.Audit: Dan Luu
The realistic support-session shape matters more than a unit-level CRDT proof. Same-account sessions often do not look like ordinary two-user collaboration in the UI, and the repro must show a user typing and pressing Save. The fix also needs a false-positive guard: do not rewrite the first user's save from an older local CRDT snapshot when the server has not changed.
Revised fix plan
The PR branch implements the revised plan:
content,title, orexcerpt.blocksandcontentare present, derivecontentfrom the merged block tree rather than trusting the stale serialized full-string snapshot.False-positive analysis
This is not the large-update connection loss issue: the repro uses tiny paragraph edits and no network faults.
This is not a direct REST last-writer smoke test: the browser repro uses normal editor UI actions for both writes. REST is only used for test setup and final observation.
This is not the #77865 autosave/revision bug: the page is published and the repro uses toolbar saves.
This is not the #77866 no-persisted-CRDT duplicate-table bug: the repro uses ordinary paragraph blocks and same-account stale editor state.
The first save is verified to contain A's marker before B saves. The final failure is therefore not "A failed to save"; it is B's stale save overwriting A's already-saved content.
END AI GENERATED TEXT