Skip to content

fix: read text-align from inline style on HTML import#183

Merged
JohnMcLear merged 2 commits intomainfrom
fix/import-style-text-align
May 8, 2026
Merged

fix: read text-align from inline style on HTML import#183
JohnMcLear merged 2 commits intomainfrom
fix/import-style-text-align

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

Problem

getLineHTMLForExport already serializes the line's alignment as an inline style attribute on <p> (or merges it into the style of an existing <h1><h6>). But collectContentPre (the import-side hook) was only reading the legacy <left>/<center>/<right>/<justify> tag names. So any HTML round-trip — pad → /export/html/import — silently lost the alignment, and the same was true for any other HTML import path that puts alignment in style="text-align:...".

Fix

In static/js/shared.js, also parse text-align: <left|center|right|justify> out of context.styl (the inline style attribute that etherpad core's contentcollector already exposes) and assign it to lineAttributes.align. The (?:^|;|\s) anchor avoids accidentally matching vertical-align.

Why it matters now

Currently up for review at ether/etherpad#7568 — native DOCX export and import without LibreOffice. With native DOCX in core, the round-trip path is:

pad → getLineHTMLForExport → <p style="text-align:..."> → html-to-docx → DOCX
DOCX → mammoth → <p style="text-align:..."> → setPadHTML → contentcollector → ep_align.collectContentPre

The export side already works (alignment is on the way out). This PR closes the import side.

Test

I bumped static/js/shared.js only; no test surface changed and the existing behavior (collecting alignment from the legacy <left>/<center>/etc. tag names) is preserved. End-to-end coverage lives in the etherpad PR's src/tests/backend/specs/import.ts round-trip suite.

When a pad is HTML-exported, ep_align's getLineHTMLForExport wraps
content in <p style='text-align:...'> (and modifies <h1..h6> tags
in-place to add the same style). Importing that HTML or any HTML
that uses style='text-align:..' on block elements should re-apply
the corresponding line attribute -- but collectContentPre was only
reading the legacy <left>/<center>/<right>/<justify> tag names, so
imports silently dropped alignment.

Pick up text-align from the inline style attribute (etherpad core's
contentcollector already passes the parsed style as context.styl)
so a round-trip through HTML or DOCX preserves alignment.

Closes ether/etherpad#7538 (alignment side of the round-trip)
Refs: PR ether/etherpad#7568
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Add backend test cases that set pad HTML using
<p style="text-align:..."> directly (the modern form) and assert
the re-exported HTML preserves the alignment. Covers all four
values: left, center, right, justify.

Without the collectContentPre style-parsing fix in the previous
commit, these would all fail because contentcollector was passing
the inline style through context.styl but ep_align was only
reading the legacy <left>/<center>/<right>/<justify> tag names.
JohnMcLear added a commit to JohnMcLear/etherpad that referenced this pull request May 8, 2026
mammoth doesn't expose Word's paragraph alignment (`<w:jc>`) when it
converts a docx to HTML -- there's no equivalent in its style-mapping
machinery. To keep alignment through DOCX round-trips we walk the
docx's document.xml directly, pull the `w:val` from each `<w:p>`'s
`<w:jc>`, and inject `style="text-align:..."` onto the matching
block element in mammoth's output by document order.

Word's w:jc accepts more values than CSS text-align; we map left/
start, center, right/end, both/justify/distribute and skip the rest
(start/end take left/right because we don't track ltr/rtl from the
docx for now).

Combines with the upstream ep_align PR (ether/ep_align#183) for the
full round-trip: this PR makes the mammoth output carry the
alignment style; ep_align#183 makes the importer pick it up.

Closes the alignment side of ether#7538.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear merged commit b4e6538 into main May 8, 2026
3 checks passed
@JohnMcLear JohnMcLear deleted the fix/import-style-text-align branch May 8, 2026 15:56
JohnMcLear added a commit to ether/etherpad that referenced this pull request May 8, 2026
* feat(export): native DOCX export via html-to-docx (opt-in)

Addresses #7538. The current DOCX export path shells out to LibreOffice,
which means every deployment that wants a Word download either installs
soffice (~500 MB) or loses that export. This PR adds a pure-JS
alternative: render the HTML via the existing exporthtml pipeline, then
feed it to the `html-to-docx` library in-process to produce a valid
.docx buffer — no soffice required, no subprocess spawn, no temp file
dance for the DOCX case.

Behavior:
- `settings.nativeDocxExport` (default `false`) gates the new path so
  existing deployments see zero behavior change.
- When enabled, `type === 'docx'` requests skip the LibreOffice branch,
  run `html-to-docx(html)`, and return the buffer with the
  `application/vnd.openxmlformats-officedocument.wordprocessingml.document`
  content-type.
- If the native converter throws, the handler falls through to the
  existing LibreOffice path — so flipping the flag on is safe even on a
  mixed-installation where soffice is still present as a backstop.
- Other export formats (pdf, odt, rtf, txt, html, etherpad) are
  unchanged.

Files:
- `src/package.json`: `html-to-docx` dep (pure JS, no binary reqs)
- `src/node/handler/ExportHandler.ts`: new DOCX branch gated on the
  setting, with fall-through on error
- `src/node/utils/Settings.ts`, `settings.json.template`,
  `settings.json.docker`, `doc/docker.md`: wire up the new setting +
  env var (`NATIVE_DOCX_EXPORT`)
- `src/tests/backend/specs/export.ts`: two new tests — asserts the
  exported buffer is a valid ZIP (PK\x03\x04 signature) and the
  response carries the correct content-type — both with
  `settings.soffice = 'false'` to prove the path doesn't need soffice
  at all.

Out of scope for this PR:
- Native PDF export (would need a PDF rendering step — separate
  undertaking, and the issue acknowledges the `pdfkit`/puppeteer size
  trade-off).

Closes #7538

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(7538): skip native DOCX test when html-to-docx isn't installed

The upgrade-from-latest-release CI job installs deps from the previous
release's package.json (before this PR adds html-to-docx) and then
git-checkouts this branch's code without re-running pnpm install.
Under that one workflow the new test can't find the module and fails
on the LibreOffice fallback, masking that the native path actually
works in every normal install.

Guard the describe block with require.resolve('html-to-docx'); Mocha's
this.skip() on before cascades to the sibling its. Regular backend
tests (pnpm install against this branch's lockfile) still exercise it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(7538): spec for soffice-free DOCX/PDF export and DOCX import

Captures the agreed scope expansion of PR #7568: replace the flag-gated
native DOCX path with a soffice-first selection cascade, add native PDF
export via pdfkit + a small htmlparser2-driven walker, and add native
DOCX import via mammoth. Also defines a shared HTML sanitizer
(stripRemoteImages) used by both export converters to close the
SSRF surface that Qodo flagged on the html-to-docx path.

The spec drops the nativeDocxExport setting and its env var; with
soffice configured, behavior is unchanged, and with soffice null,
docx/pdf export and docx import all work in-process. odt/doc/rtf
(and pdf import) keep needing soffice and are documented as such.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(7538): implementation plan for native DOCX/PDF and DOCX import

Bite-sized TDD task breakdown of the soffice-free export/import work:
rebase, deps, sanitizer, PDF walker, mammoth wrapper, ExportHandler
cascade, route guard, ImportHandler branch, UI fix, flag rollback,
verification + Qodo reply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(7538): add pdfkit, htmlparser2, mammoth deps

Pure-JS, no native binaries:
- pdfkit ^0.18.0  (PDF rendering)
- htmlparser2 ^12 (SAX parser used by walker + sanitizer)
- mammoth ^1.12   (DOCX -> HTML for native import)
- @types/pdfkit ^0.17 (dev)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(7538): add stripRemoteImages HTML sanitizer

Drops <img src=> elements pointing at non-data, non-relative URLs to
prevent the DOCX/PDF converters from making outbound requests via
plugin-modified HTML. Closes Qodo finding #4 against the
html-to-docx path; will be wired into both export branches in
the cascade refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(7538): native PDF export via pdfkit + htmlparser2 walker

Renders pad HTML to a PDF Buffer in-process: headings, paragraphs,
lists, links, inline emphasis, data:-URI images. Remote images are
explicitly skipped at the walker (defense-in-depth on top of the
shared stripRemoteImages sanitizer).

PDFs are emitted with compress:false so accessibility/SEO indexers
that don't FlateDecode can still extract text. Pads are small enough
that the size cost is negligible.

Walker is 167 LOC, well under the spec's 500-LOC bail-out
threshold for switching to pdfmake+html-to-pdfmake+jsdom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(7538): native DOCX import via mammoth

Wraps mammoth.convertToHtml so a soffice-less Etherpad can ingest
.docx files. Images are coerced to data: URIs at the converter
boundary so the import pipeline never sees a remote src=.

Includes a tiny generated DOCX fixture (heading, paragraph, list)
under tests/backend/specs/fixtures/ for both this wrapper test and
the upcoming end-to-end import test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(7538): soffice-first cascade in ExportHandler

Replaces the flag-gated DOCX branch with a deterministic dispatch:
soffice if configured, native DOCX/PDF otherwise, 5xx on native
error. Both native paths run plugin-modified HTML through
stripRemoteImages first.

Test changes:
- existing native DOCX block now sets soffice=null (was 'false', a
  truthy non-null string that sidestepped the route guard); fixes
  Qodo finding #3.
- new native PDF integration tests assert %PDF- header and
  application/pdf content-type with soffice=null.
- new negative test: with soffice=null, /export/odt still returns
  the 'not enabled' message.
- the legacy 500-on-export-error test now uses /bin/false so it
  exercises the soffice error path explicitly (the cascade dropped
  the ad-hoc 'false' string; .doc has no native path so this still
  works as a soffice error probe).

Integration tests for native DOCX/PDF currently fail because the
/export route guard still treats both formats as soffice-only;
the next commit fixes that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): allow docx/pdf through export guard without soffice

Tightens the no-soffice block to ['odt','doc'] only — formats with
no native path. docx and pdf are handed to ExportHandler, which
dispatches to the in-process converters. Closes Qodo finding #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(7538): native DOCX import path in ImportHandler

When soffice is null and the upload is .docx, run mammoth and feed
the resulting HTML through setPadHTML. Other office formats
(pdf/odt/doc/rtf) are explicitly rejected with uploadFailed instead
of silently falling through to the ASCII-only path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): always show DOCX/PDF export links

Native paths (#7538) make DOCX and PDF available regardless of
soffice presence, so unconditionally render those links. ODT still
gates on exportAvailable. Closes Qodo finding #2 on the UI side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(7538): drop nativeDocxExport flag

Selection is now purely soffice-presence-driven (cascade in
ExportHandler). The opt-in setting and its NATIVE_DOCX_EXPORT env
var are no longer needed -- soffice configured means soffice path;
soffice null means native path (DOCX, PDF, and DOCX import).
Reverts the additive surface introduced earlier in this PR.

Also updates the SOFFICE doc row to reflect that null no longer
means 'plain text and HTML only' -- docx/pdf export and docx
import now work natively without soffice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(7538): tighten link annotation assertion

CodeQL flagged the loose 'raw.includes("etherpad.org")' as
'incomplete URL substring sanitization' (a false positive in test
context, but worth fixing). Match the full /URI (host) form
instead -- it's both more accurate (we're verifying the PDF link
annotation structure) and CodeQL-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): cleaner DOCX/PDF output + round-trip test coverage

DOCX:
- New extractBody helper drops <head>/<style> and the leading
  newline inside <body> so html-to-docx doesn't render CSS or
  prefix paragraphs with empty space.
- New wrapLooseLines pre-processor wraps loose pad lines in <p>
  before the converter sees them. html-to-docx renders <br>
  outside <p> as a new <w:p> (full empty line in Word); inside
  <p> it correctly emits <w:br/> (soft break). Etherpad's HTML
  uses bare <br> for every line, so this was making single
  Enters look like double Enters in the Word output.

PDF:
- Walker SKIP_TAGS rejects head/style/script/title/meta/link
  content -- prior version dumped CSS into the rendered PDF.
- New breakLine() helper combines flushLine() with moveDown(1).
  pdfkit's text('', false) closes the continued run but does
  NOT advance the cursor, so consecutive runs were stacking at
  the same y-coordinate. <br>, end-of-block, and list items
  now use breakLine().
- ontext collapses runs of whitespace and drops pure-whitespace
  text nodes so pretty-printed source HTML doesn't render its
  formatting newlines.

Round-trip:
- New backend test: pad text -> DOCX export -> DOCX import ->
  new pad. Asserts content survives the trip.
- New PDF sanity test: extracts visible text from the PDF stream
  and asserts the source pad text appears verbatim.
- 6 new unit tests for extractBody and wrapLooseLines plus 1 for
  PDF walker SKIP_TAGS coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): CodeQL ReDoS + import.ts type error

- BR_PARA_RE was /(?:\s*<br>\s*){2,}/ -- two adjacent \s* runs can
  match the same chars, so on '<br>\t<br>\t<br>...' the regex
  backtracks exponentially. Re-anchored to match a fixed first <br>
  followed by one or more additional <br>s, so each whitespace run
  has exactly one home.
- import.ts: fetchBuffer was typed Promise<Buffer> but call sites
  chained .expect(200) on it, which only works on supertest's Test
  object. Return the Test (typed any) so the chain is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): plugin-aware HTML cleanup, PDF text-align, monospace

ep_headings2/ep_align emit one heading-styled blank-line block after
every styled line in the pad ('<h1 style=text-align:right></h1>'),
which both html-to-docx and our pdfkit walker render as a full empty
paragraph. Plus the pdfkit walker had no support for text-align or
monospace, so right/center alignment and 'code' lines rendered the
same as plain body text.

- New dropEmptyBlocks helper strips empty h1-h6/p/code/pre/div/
  blockquote wrappers in preprocessing. Iterates so nested empties
  collapse too. Applied before both DOCX and PDF conversion.
- PDF walker now reads style='text-align:left|center|right|justify'
  on block elements (h1-6, p, div) and passes it as pdfkit's align
  option. align is applied once per continued run, then reset on
  flushLine so the next block can pick up its own value.
- PDF walker handles <code>, <tt>, <kbd>, <samp> as inline monospace
  (Courier) and <pre> as block monospace (Courier + breakLine on
  open/close).

11 new unit tests:
- 4 for dropEmptyBlocks (heading wrappers, code, nesting,
  pass-through)
- 1 for PDF text-align (compares the BT matrix x for left vs right)
- 2 for Courier in <code> and <pre>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): separate adjacent heading-style blocks on import

Round-trip bug: ep_headings2 emits <h1>/<h2>/<code> in pad HTML.
Mammoth round-trips them as adjacent <h1>A</h1><h2>B</h2><p>C</p>
on import. Etherpad's server-side content collector has a default
_blockElems set of just {div, p, pre, li}, and ep_headings2 only
registers the CLIENT-side aceRegisterBlockElements -- not the
server-side ccRegisterBlockElements. So h1/h2/code end up being
treated as inline by the importer, and adjacent blocks merge into
a single pad line.

Fix: insert <br> after </h1>...</h6>/</code> when followed by
another block. Server-side workaround keeps this PR self-contained
regardless of plugin version. The right long-term fix is to extend
ep_plugin_helpers' lineAttribute factory to register both hooks
(filed as a follow-up).

Tests:
- 5 unit tests for separateAdjacentHeadingBlocks
- New end-to-end round-trip test asserts H1+H2+P land on three
  separate pad lines after the import path.

Plus the prior PDF text-align/Courier/code commit also included
here:
- code/tt/kbd/samp inherit text-align from style attribute
- pre inherits text-align too

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): blank-line round-trip + DOCX code monospace + a==c tests

DOCX round-trip dropping blank pad lines:
- wrapLooseLines now emits an explicit <p></p> marker for each blank
  line in a <br> run, instead of collapsing all gaps into a single
  paragraph break. (N consecutive <br>s -> 1 paragraph boundary +
  N-2 empty <p></p> markers, mapping to N-1 blank pad lines.)
- mammoth's docxBufferToHtml now passes ignoreEmptyParagraphs:false
  so the empty <w:p> entries survive the import side. mammoth's
  default of true was silently dropping them.
- dropEmptyBlocks no longer strips <p></p> -- that's the meaningful
  marker for the round-trip. Empty <h1>/<code>/<pre>/<div>/
  <blockquote> are still stripped (plugin noise).

DOCX <code> rendering as monospace:
- New applyMonospaceToCode wraps code/tt/kbd/samp/pre content in a
  <span style="font-family:'Courier New', monospace">. html-to-docx
  honors that and emits <w:rFonts w:ascii="Courier New".../>, which
  Word renders as Courier. The bare <code> tag is otherwise just
  a no-op for html-to-docx.
- Applied only on the DOCX export path (PDF walker already handles
  monospace via Courier font selection).

Round-trip tests:
- New a==c suite: txt, etherpad, html, docx -- export from src,
  import to dst, re-export and compare against the meaningful
  invariant (line text for binary formats; trimmed body for HTML).
- HTML test tolerates one trailing <br> per round-trip because
  setPadHTML appends a final <p> on import; this is pre-existing
  core behavior, not our bug.
- DOCX test normalizes trailing newline run (same reason).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): preserve <w:jc> alignment through mammoth round-trip

mammoth doesn't expose Word's paragraph alignment (`<w:jc>`) when it
converts a docx to HTML -- there's no equivalent in its style-mapping
machinery. To keep alignment through DOCX round-trips we walk the
docx's document.xml directly, pull the `w:val` from each `<w:p>`'s
`<w:jc>`, and inject `style="text-align:..."` onto the matching
block element in mammoth's output by document order.

Word's w:jc accepts more values than CSS text-align; we map left/
start, center, right/end, both/justify/distribute and skip the rest
(start/end take left/right because we don't track ltr/rtl from the
docx for now).

Combines with the upstream ep_align PR (ether/ep_align#183) for the
full round-trip: this PR makes the mammoth output carry the
alignment style; ep_align#183 makes the importer pick it up.

Closes the alignment side of #7538.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): preserve <a href> inside <code>/<pre> in DOCX export

html-to-docx silently drops <a href> children of <code>/<pre> tags
(and of styled <span>s, but the <code> wrapper is the active offender
here). The pad-export HTML produced by ep_headings2 + ep_align uses
<code style='text-align:right'>...<a href>...</a>...</code> for each
'Code'-style line, which lost its links on every DOCX export.

Workaround: applyMonospaceToCode now drops the code/pre/tt/kbd/samp
wrapper entirely. The non-anchor content gets wrapped in monospace
spans; anchors are emitted unstyled so they keep their hyperlink. For
block-level usage (<pre>, or <code> with an inline style attr) we
emit a wrapping <p> and forward the text-align style. Run BEFORE
wrapLooseLines so the <p> doesn't get double-wrapped.

Tests added:
- inline <code> -> just a styled span (no <code> wrapper)
- <code style='text-align:right'> -> <p style> wrap
- <pre> -> always block-wrapped
- <tt>/<kbd>/<samp> -> inline span only
- regression: <a href> inside <code> survives html-to-docx round-trip
  with both the URL in word/_rels/document.xml.rels AND a <w:hyperlink>
  in the document body

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(7538): HTML import line-break drift between blocks

Etherpad's HTML export wraps each pad line in <p>...</p> (or <h1>,
<code>, etc.) and then appends a <br> between lines. The closing
block tag already ends the line for contentcollector, so the trailing
<br> is redundant -- and on import the server collector counts BOTH
as line breaks, doubling every blank line between paragraphs and
inserting an extra blank between adjacent headings.

Two fixes, both gated on the runtime block-element registry so they
don't double-trigger when the underlying plugin already handles
adjacency:

1. HTML import path now runs the new collapseRedundantBrAfterBlocks
   helper before setPadHTML. Drops a single <br> immediately
   following </p>/</h1-6>/</code>/</pre>/</div>/</blockquote>/</ul>/
   </ol>/</li>/</table>/</tr>/</td>/</th>. Multiple consecutive
   <br>s after a block keep all but the first (the rest still
   represent intentional blank lines).

2. The DOCX-import separateAdjacentHeadingBlocks workaround now
   checks whether 'h1' is in the runtime ccRegisterBlockElements
   set before inserting <br>s. When ep_headings2 has the new server
   hook (per ep_plugin_helpers#14 + the upcoming ep_headings2 PR),
   the workaround correctly stays out of the way -- otherwise it
   adds an extra blank line per heading transition.

Also fixed a subtle ts-check failure on the import.ts test changes
and a leftover implicit-any in ImportDocxNative's alignment
preserver.

Tests added:
- collapseRedundantBrAfterBlocks: 5 unit tests (each block tag,
  whitespace tolerance, multiple <br> keeping intentional blanks)
- HTML import: 'does not introduce a blank line between H1 and H2',
  'preserves blank-line count between H1 and H2 (realistic shape)'
  reproduces the 5-blanks-where-2-expected bug from the user's
  round-trip pad.

1054 backend tests pass locally (the 6 failures are the pre-existing
favicon/webaccess send@1.x dotfile-path issue from running under
.claude/, doesn't reach CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(7538): skip plugin-dependent HTML import tests on no-plugin CI

The two new HTML-import-adjacency tests assume ep_headings2 (or
another plugin) has registered h1/h2 as server-side block elements
via ccRegisterBlockElements. Without that, contentcollector treats
<h1>/<h2> as inline and adjacent ones merge into a single pad line
-- making the assertions inapplicable.

CI's backend-tests job runs without plugins installed, so guard
the describe block with a runtime hooks.callAll() check and skip
when h1 isn't a registered block. Local dev with ep_headings2 (and
the local plugin patch wiring ccRegisterBlockElements) still
exercises both tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant