fix(editor): preserve U+00A0 non-breaking space (#3037) #7585
JohnMcLear merged 3 commits into ether:develop
Non-breaking spaces were silently normalized to regular spaces at every ingestion point, so typed, pasted, or imported nbsps never reached the changeset and users could not glue words against line-wrap in French or other languages that require nbsp typography.

Removed the four strip sites that replaced U+00A0 with U+0020:
- src/node/db/Pad.ts cleanText
- src/static/js/contentcollector.ts textify
- src/static/js/ace2_inner.ts textify
- src/static/js/ace2_inner.ts importText raw-text guard

Updated both processSpaces functions (domline and ExportHtml) to tokenize U+00A0 as a separate unit, emit it verbatim as `&nbsp;`, and treat it as content (not whitespace) for the run-collapse bookkeeping so adjacent regular-space runs aren't miscounted.

Added backend round-trip tests for spliceText and setText, and extended the cleanText case table. Updated the existing contentcollector and importexport specs whose expectations encoded the previous buggy behavior; they now assert genuine nbsp preservation.

Verified manually in Firefox: clipboard U+00A0 → paste → pad → getText returns c2 a0; getHTML emits `100&nbsp;km`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
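The run-collapse bookkeeping described in this commit can be sketched roughly as follows. This is a simplified illustration, not the code in domline.ts or ExportHtml.ts: the name `renderSpaces` is hypothetical, and the sketch assumes the input is a single line of plain text with no HTML tags or entities (which the real functions have to skip over).

```typescript
// Hypothetical, simplified stand-in for the wrapping branch of processSpaces.
// Assumption: `line` is plain text without HTML tags or entities.
const NBSP = '\u00a0';

function renderSpaces(line: string): string {
  // Tokenize into single regular spaces, single U+00A0 characters,
  // and runs of everything else.
  const parts: string[] = line.match(/ |\u00a0|[^ \u00a0]+/g) ?? [];

  let endOfLine = true;
  let beforeSpace = false;
  // Walk right to left: a regular space becomes &nbsp; when it sits at the
  // end of the line or directly before another regular space, so the last
  // space of an interior run stays a plain, breakable space.
  for (let i = parts.length - 1; i >= 0; i--) {
    if (parts[i] === ' ') {
      if (endOfLine || beforeSpace) parts[i] = '&nbsp;';
      endOfLine = false;
      beforeSpace = true;
    } else {
      // U+00A0 is emitted as &nbsp; but counted as content, so it resets
      // the "before a space" state instead of extending the run.
      if (parts[i] === NBSP) parts[i] = '&nbsp;';
      endOfLine = false;
      beforeSpace = false;
    }
  }
  // A regular space at the very start of the line is also made non-breaking.
  if (parts[0] === ' ') parts[0] = '&nbsp;';
  return parts.join('');
}

console.log(renderSpaces('100\u00a0km')); // 100&nbsp;km
console.log(renderSpaces('a  b'));        // a&nbsp; b
console.log(renderSpaces('a \u00a0 b'));  // a &nbsp; b
```

In the last call the two regular spaces are separated by a U+00A0, so neither counts as "before a space" and both stay plain; that is the miscounting the commit message says the separate tokenization avoids.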
/review
Review Summary by Qodo
Preserve U+00A0 non-breaking space through ingestion and display pipeline
Walkthroughs
Description
• Remove U+00A0 (non-breaking space) normalization at four ingestion points
  - Pad.cleanText, contentcollector.textify, ace2_inner.textify, ace2_inner.importText
• Update processSpaces functions to tokenize and preserve U+00A0 as &nbsp;
  - Both domline.ts and ExportHtml.ts now treat nbsp as content, not whitespace
• Add backend round-trip tests for spliceText and setText with U+00A0
• Update 14 existing test cases to assert genuine nbsp preservation instead of collapse
Diagram
flowchart LR
Input["User input/paste/import<br/>with U+00A0"]
Ingestion["Ingestion layer<br/>cleanText, textify, importText"]
Changeset["Changeset layer<br/>transparent handling"]
Display["Display/Export<br/>processSpaces functions"]
Output["Output<br/>getText/getHTML"]
Input -->|Previously stripped| Ingestion
Input -->|Now preserved| Ingestion
Ingestion -->|U+00A0 safe| Changeset
Changeset -->|Tokenize & emit| Display
Display -->|&nbsp; in HTML| Output
File Changes
2. src/node/types/PadType.ts
Code Review by Qodo
…d-back

processSpaces is a lossy one-way display transform: leading/trailing spaces and all-but-the-last of a run get rendered as `&nbsp;` so HTML doesn't collapse them. When incorporateUserChanges reads text back from the DOM, those display-artifact nbsps were being stored in the changeset model instead of being normalized back to plain spaces.

This broke handleReturnIndentation, whose /^ *(?:)/ regex only matches ASCII spaces: auto-indent after `foo:\n` produced 4 spaces instead of the expected prev-indent (2) + THE_TAB (4) = 6, because the previous line's model had nbsps where it used to have spaces.

Fix: in contentcollector.textify, collapse any [ \xa0]+ run back to plain spaces UNLESS the run is pure U+00A0 AND strictly interior to word chars. That preserves user-intended typographic nbsps like "100 km" while undoing the one-way display transform.

Updated 7 contentcollector tests and 7 importexport tests whose assertions needed to reflect the new rule (boundary/mixed runs collapse; pure-interior nbsp runs preserve).

Fixes the Playwright regression in indentation.spec.ts:117 that the previous commit introduced.
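To illustrate why display-artifact nbsps in the model break auto-indent, the snippet below applies the regex quoted above to two versions of the same line. The helper `leadingIndent` is a hypothetical name for illustration; it is not the actual handleReturnIndentation code, only the regex is taken from the commit message.

```typescript
// The indentation regex quoted in the commit message. It matches only
// leading U+0020 characters; the empty group (?:) matches nothing.
const INDENT_RE = /^ *(?:)/;

// Hypothetical helper, for illustration only.
function leadingIndent(line: string): string {
  const m = INDENT_RE.exec(line);
  return m ? m[0] : '';
}

// What the model should hold: two plain leading spaces.
const modelWithSpaces = '  foo:';
// What was read back from the DOM: the display transform had turned the
// first leading space into U+00A0.
const modelWithArtifact = '\u00a0 foo:';

console.log(leadingIndent(modelWithSpaces).length);   // 2
console.log(leadingIndent(modelWithArtifact).length); // 0
```

Because the match stops at the first character that is not U+0020, a single artifact nbsp at the start of the line hides the whole indent from the regex.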
/review
Persistent review updated to latest commit 52b4411
…er text node

Addresses Qodo code review feedback on PR ether#7585.

## Bug fix: nbsp lost at DOM text-node boundary

The previous approach ran the "collapse display-artifact nbsp" rule inside textify(), which is called per individual DOM TEXT_NODE. A user-intended nbsp sitting at a text-node boundary (e.g., `<span>100</span><span>&nbsp;km </span>`) was incorrectly seen as non-interior (before === '' for the second text node) and normalized back to a regular space.

Fix: move the canonicalization out of textify() and run it on each fully assembled line string inside cc.finish(). The rule remains:

    [ \xa0]+ run -> plain spaces UNLESS pure U+00A0 AND strictly interior to non-ws chars

It is length-preserving, so attribute offsets and line lengths are unaffected.

Added a regression test (contentcollector.spec.ts) for the cross-span case.

## Docs concern

Reverted the type-only addition of spliceText to PadType. spliceText is an existing Pad runtime method; the backend test now uses a cast (`(pad as any).spliceText`) so the PR does not expand the declared public type surface, avoiding a separate documentation requirement.
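A minimal sketch of the canonicalization rule, written only from the description in this commit message, shows why it has to run on the assembled line rather than per text node. The function name `canonicalizeSpaces` is hypothetical and this is not the code in cc.finish().

```typescript
// Hypothetical sketch of the rule: a run of U+0020 / U+00A0 characters
// becomes plain spaces UNLESS it is pure U+00A0 and has a non-whitespace
// character on both sides. Each character maps to exactly one character,
// so the result always has the same length as the input.
function canonicalizeSpaces(s: string): string {
  const isNonWs = (c: string | undefined): boolean =>
    c !== undefined && /\S/.test(c);

  return s.replace(/[ \u00a0]+/g, (run: string, offset: number) => {
    const pureNbsp = run.indexOf(' ') === -1;
    const interior =
      isNonWs(s[offset - 1]) && isNonWs(s[offset + run.length]);
    // Keep user-intended typographic nbsps; undo display artifacts.
    return pureNbsp && interior ? run : run.replace(/\u00a0/g, ' ');
  });
}

// Per text node (the earlier, buggy placement): the nbsp is the first
// character of the second node, so it looks non-interior and is lost.
const perNode = ['100', '\u00a0km'].map(canonicalizeSpaces).join('');
console.log(perNode === '100 km'); // true (plain U+0020)

// Per assembled line (the fix): the same nbsp sits between "0" and "k".
const perLine = canonicalizeSpaces('100\u00a0km');
console.log(perLine === '100\u00a0km'); // true (U+00A0 preserved)
```

Leading artifact nbsps such as `'\u00a0\u00a0foo:'` still collapse to plain spaces under this sketch, since they have no character before them.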
/review
Persistent review updated to latest commit ae356a5
Summary
Fixes #3037. Non-breaking spaces (U+00A0) typed, pasted, or imported into a pad were silently normalized to ordinary spaces at every ingestion point, so they never reached the changeset. The bug broke French typography rules and prevented authors from gluing words against line-wrap.
The earlier attempt (#4177) only touched the display path (`domline.processSpaces`); by the time that code runs, the nbsp has already been destroyed upstream, so it couldn't help. This change fixes the four ingestion-side strip sites and teaches both `processSpaces` functions to handle nbsp correctly on the display/export side.
Changes
Ingestion fixes: removed `.replace(/\xa0/g, ' ')` from:
- `src/node/db/Pad.ts`: server-side `cleanText` (hit by `spliceText`, default pad content, HTML import)
- `src/static/js/contentcollector.ts`: `textify` for text pulled out of the DOM
- `src/static/js/ace2_inner.ts`: the editor-side `textify`
- `src/static/js/ace2_inner.ts`: removed `\xa0` from the `importText(dontProcess=true)` guard regex
Display/export fixes: taught `processSpaces` to tokenize U+00A0 separately, emit it verbatim as `&nbsp;`, and treat it as content (not whitespace) for the run-collapse bookkeeping so adjacent regular-space runs aren't miscounted:
- `src/static/js/domline.ts` (live editor rendering)
- `src/node/utils/ExportHtml.ts` (HTML export)
Why the changeset layer is safe
The changeset wire format parses only the op list (before `$`) with a regex; the text bank after `$` is read by numeric offset/length via `StringIterator`. U+00A0 is a single BMP code unit, so `makeSplice`, `opsFromText`, and `pack`/`unpack` all handle it transparently. No storage changes or migration are needed.
All supported DB backends accept U+00A0: MySQL defaults to `utf8mb4`, Postgres uses `text`, MSSQL uses `NTEXT`, others serialize as UTF-8 JSON. Existing pads already store Cyrillic, CJK, and emoji; nbsp is strictly simpler.
Tests
- `src/tests/backend/specs/Pad.ts`: new cleanText cases + spliceText/setText round-trip tests for U+00A0.
- `src/tests/backend/specs/contentcollector.ts`: updated 7 tests whose names said "preserved" but whose assertions encoded the old nbsp→space collapse; they now assert real preservation.
- `src/tests/backend/specs/api/importexport.ts`: updated 7 end-to-end tests (both `wantText` and `wantHTML`) to reflect genuine nbsp round-tripping through rehype + contentcollector + ExportHtml.
Manual verification
Confirmed in Firefox: clipboard containing U+00A0 → paste into pad → `getText` returns `c2 a0`, `getHTML` returns `100&nbsp;km`.
Test plan
- `mocha tests/backend/specs/Pad.ts tests/backend/specs/contentcollector.ts`: 128 passing
- `mocha tests/backend/specs/api/importexport.ts`: 108 passing
- `tsc --noEmit`: clean in our code (pre-existing errors in `plugin_packages/zod`, `ip-address` only)
Out of scope (follow-ups)
- `text/plain` clipboard data if the Chromium paste bug is still present.
🤖 Generated with Claude Code