fix: read text-align from inline style on HTML import by JohnMcLear · Pull Request #183 · ether/ep_align

JohnMcLear · 2026-05-08T15:19:29Z

Problem

getLineHTMLForExport already serializes the line's alignment as an inline style attribute on <p> (or merges it into the style of an existing <h1>–<h6>). But collectContentPre (the import-side hook) was only reading the legacy <left>/<center>/<right>/<justify> tag names. So any HTML round-trip — pad → /export/html → /import — silently lost the alignment, and the same was true for any other HTML import path that puts alignment in style="text-align:...".

Fix

In static/js/shared.js, also parse text-align: <left|center|right|justify> out of context.styl (the inline style attribute that etherpad core's contentcollector already exposes) and assign it to lineAttributes.align. The (?:^|;|\s) anchor avoids accidentally matching vertical-align.

Why it matters now

Currently up for review at ether/etherpad#7568 — native DOCX export and import without LibreOffice. With native DOCX in core, the round-trip path is:

pad → getLineHTMLForExport → <p style="text-align:..."> → html-to-docx → DOCX
DOCX → mammoth → <p style="text-align:..."> → setPadHTML → contentcollector → ep_align.collectContentPre

The export side already works (alignment is on the way out). This PR closes the import side.

Test

I bumped static/js/shared.js only; no test surface changed and the existing behavior (collecting alignment from the legacy <left>/<center>/etc. tag names) is preserved. End-to-end coverage lives in the etherpad PR's src/tests/backend/specs/import.ts round-trip suite.

When a pad is HTML-exported, ep_align's getLineHTMLForExport wraps content in <p style='text-align:...'> (and modifies <h1..h6> tags in-place to add the same style). Importing that HTML or any HTML that uses style='text-align:..' on block elements should re-apply the corresponding line attribute -- but collectContentPre was only reading the legacy <left>/<center>/<right>/<justify> tag names, so imports silently dropped alignment. Pick up text-align from the inline style attribute (etherpad core's contentcollector already passes the parsed style as context.styl) so a round-trip through HTML or DOCX preserves alignment. Closes ether/etherpad#7538 (alignment side of the round-trip) Refs: PR ether/etherpad#7568

qodo-code-review · 2026-05-08T15:19:33Z

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Add backend test cases that set pad HTML using <p style="text-align:..."> directly (the modern form) and assert the re-exported HTML preserves the alignment. Covers all four values: left, center, right, justify. Without the collectContentPre style-parsing fix in the previous commit, these would all fail because contentcollector was passing the inline style through context.styl but ep_align was only reading the legacy <left>/<center>/<right>/<justify> tag names.

mammoth doesn't expose Word's paragraph alignment (`<w:jc>`) when it converts a docx to HTML -- there's no equivalent in its style-mapping machinery. To keep alignment through DOCX round-trips we walk the docx's document.xml directly, pull the `w:val` from each `<w:p>`'s `<w:jc>`, and inject `style="text-align:..."` onto the matching block element in mammoth's output by document order. Word's w:jc accepts more values than CSS text-align; we map left/ start, center, right/end, both/justify/distribute and skip the rest (start/end take left/right because we don't track ltr/rtl from the docx for now). Combines with the upstream ep_align PR (ether/ep_align#183) for the full round-trip: this PR makes the mammoth output carry the alignment style; ep_align#183 makes the importer pick it up. Closes the alignment side of ether#7538. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(export): native DOCX export via html-to-docx (opt-in) Addresses #7538. The current DOCX export path shells out to LibreOffice, which means every deployment that wants a Word download either installs soffice (~500 MB) or loses that export. This PR adds a pure-JS alternative: render the HTML via the existing exporthtml pipeline, then feed it to the `html-to-docx` library in-process to produce a valid .docx buffer — no soffice required, no subprocess spawn, no temp file dance for the DOCX case. Behavior: - `settings.nativeDocxExport` (default `false`) gates the new path so existing deployments see zero behavior change. - When enabled, `type === 'docx'` requests skip the LibreOffice branch, run `html-to-docx(html)`, and return the buffer with the `application/vnd.openxmlformats-officedocument.wordprocessingml.document` content-type. - If the native converter throws, the handler falls through to the existing LibreOffice path — so flipping the flag on is safe even on a mixed-installation where soffice is still present as a backstop. - Other export formats (pdf, odt, rtf, txt, html, etherpad) are unchanged. Files: - `src/package.json`: `html-to-docx` dep (pure JS, no binary reqs) - `src/node/handler/ExportHandler.ts`: new DOCX branch gated on the setting, with fall-through on error - `src/node/utils/Settings.ts`, `settings.json.template`, `settings.json.docker`, `doc/docker.md`: wire up the new setting + env var (`NATIVE_DOCX_EXPORT`) - `src/tests/backend/specs/export.ts`: two new tests — asserts the exported buffer is a valid ZIP (PK\x03\x04 signature) and the response carries the correct content-type — both with `settings.soffice = 'false'` to prove the path doesn't need soffice at all. Out of scope for this PR: - Native PDF export (would need a PDF rendering step — separate undertaking, and the issue acknowledges the `pdfkit`/puppeteer size trade-off). Closes #7538 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(7538): skip native DOCX test when html-to-docx isn't installed The upgrade-from-latest-release CI job installs deps from the previous release's package.json (before this PR adds html-to-docx) and then git-checkouts this branch's code without re-running pnpm install. Under that one workflow the new test can't find the module and fails on the LibreOffice fallback, masking that the native path actually works in every normal install. Guard the describe block with require.resolve('html-to-docx'); Mocha's this.skip() on before cascades to the sibling its. Regular backend tests (pnpm install against this branch's lockfile) still exercise it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(7538): spec for soffice-free DOCX/PDF export and DOCX import Captures the agreed scope expansion of PR #7568: replace the flag-gated native DOCX path with a soffice-first selection cascade, add native PDF export via pdfkit + a small htmlparser2-driven walker, and add native DOCX import via mammoth. Also defines a shared HTML sanitizer (stripRemoteImages) used by both export converters to close the SSRF surface that Qodo flagged on the html-to-docx path. The spec drops the nativeDocxExport setting and its env var; with soffice configured, behavior is unchanged, and with soffice null, docx/pdf export and docx import all work in-process. odt/doc/rtf (and pdf import) keep needing soffice and are documented as such. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(7538): implementation plan for native DOCX/PDF and DOCX import Bite-sized TDD task breakdown of the soffice-free export/import work: rebase, deps, sanitizer, PDF walker, mammoth wrapper, ExportHandler cascade, route guard, ImportHandler branch, UI fix, flag rollback, verification + Qodo reply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(7538): add pdfkit, htmlparser2, mammoth deps Pure-JS, no native binaries: - pdfkit ^0.18.0 (PDF rendering) - htmlparser2 ^12 (SAX parser used by walker + sanitizer) - mammoth ^1.12 (DOCX -> HTML for native import) - @types/pdfkit ^0.17 (dev) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(7538): add stripRemoteImages HTML sanitizer Drops <img src=> elements pointing at non-data, non-relative URLs to prevent the DOCX/PDF converters from making outbound requests via plugin-modified HTML. Closes Qodo finding #4 against the html-to-docx path; will be wired into both export branches in the cascade refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(7538): native PDF export via pdfkit + htmlparser2 walker Renders pad HTML to a PDF Buffer in-process: headings, paragraphs, lists, links, inline emphasis, data:-URI images. Remote images are explicitly skipped at the walker (defense-in-depth on top of the shared stripRemoteImages sanitizer). PDFs are emitted with compress:false so accessibility/SEO indexers that don't FlateDecode can still extract text. Pads are small enough that the size cost is negligible. Walker is 167 LOC, well under the spec's 500-LOC bail-out threshold for switching to pdfmake+html-to-pdfmake+jsdom. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(7538): native DOCX import via mammoth Wraps mammoth.convertToHtml so a soffice-less Etherpad can ingest .docx files. Images are coerced to data: URIs at the converter boundary so the import pipeline never sees a remote src=. Includes a tiny generated DOCX fixture (heading, paragraph, list) under tests/backend/specs/fixtures/ for both this wrapper test and the upcoming end-to-end import test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(7538): soffice-first cascade in ExportHandler Replaces the flag-gated DOCX branch with a deterministic dispatch: soffice if configured, native DOCX/PDF otherwise, 5xx on native error. Both native paths run plugin-modified HTML through stripRemoteImages first. Test changes: - existing native DOCX block now sets soffice=null (was 'false', a truthy non-null string that sidestepped the route guard); fixes Qodo finding #3. - new native PDF integration tests assert %PDF- header and application/pdf content-type with soffice=null. - new negative test: with soffice=null, /export/odt still returns the 'not enabled' message. - the legacy 500-on-export-error test now uses /bin/false so it exercises the soffice error path explicitly (the cascade dropped the ad-hoc 'false' string; .doc has no native path so this still works as a soffice error probe). Integration tests for native DOCX/PDF currently fail because the /export route guard still treats both formats as soffice-only; the next commit fixes that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): allow docx/pdf through export guard without soffice Tightens the no-soffice block to ['odt','doc'] only — formats with no native path. docx and pdf are handed to ExportHandler, which dispatches to the in-process converters. Closes Qodo finding #2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(7538): native DOCX import path in ImportHandler When soffice is null and the upload is .docx, run mammoth and feed the resulting HTML through setPadHTML. Other office formats (pdf/odt/doc/rtf) are explicitly rejected with uploadFailed instead of silently falling through to the ASCII-only path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): always show DOCX/PDF export links Native paths (#7538) make DOCX and PDF available regardless of soffice presence, so unconditionally render those links. ODT still gates on exportAvailable. Closes Qodo finding #2 on the UI side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(7538): drop nativeDocxExport flag Selection is now purely soffice-presence-driven (cascade in ExportHandler). The opt-in setting and its NATIVE_DOCX_EXPORT env var are no longer needed -- soffice configured means soffice path; soffice null means native path (DOCX, PDF, and DOCX import). Reverts the additive surface introduced earlier in this PR. Also updates the SOFFICE doc row to reflect that null no longer means 'plain text and HTML only' -- docx/pdf export and docx import now work natively without soffice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(7538): tighten link annotation assertion CodeQL flagged the loose 'raw.includes("etherpad.org")' as 'incomplete URL substring sanitization' (a false positive in test context, but worth fixing). Match the full /URI (host) form instead -- it's both more accurate (we're verifying the PDF link annotation structure) and CodeQL-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): cleaner DOCX/PDF output + round-trip test coverage DOCX: - New extractBody helper drops <head>/<style> and the leading newline inside <body> so html-to-docx doesn't render CSS or prefix paragraphs with empty space. - New wrapLooseLines pre-processor wraps loose pad lines in <p> before the converter sees them. html-to-docx renders <br> outside <p> as a new <w:p> (full empty line in Word); inside <p> it correctly emits <w:br/> (soft break). Etherpad's HTML uses bare <br> for every line, so this was making single Enters look like double Enters in the Word output. PDF: - Walker SKIP_TAGS rejects head/style/script/title/meta/link content -- prior version dumped CSS into the rendered PDF. - New breakLine() helper combines flushLine() with moveDown(1). pdfkit's text('', false) closes the continued run but does NOT advance the cursor, so consecutive runs were stacking at the same y-coordinate. <br>, end-of-block, and list items now use breakLine(). - ontext collapses runs of whitespace and drops pure-whitespace text nodes so pretty-printed source HTML doesn't render its formatting newlines. Round-trip: - New backend test: pad text -> DOCX export -> DOCX import -> new pad. Asserts content survives the trip. - New PDF sanity test: extracts visible text from the PDF stream and asserts the source pad text appears verbatim. - 6 new unit tests for extractBody and wrapLooseLines plus 1 for PDF walker SKIP_TAGS coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): CodeQL ReDoS + import.ts type error - BR_PARA_RE was /(?:\s*<br>\s*){2,}/ -- two adjacent \s* runs can match the same chars, so on '<br>\t<br>\t<br>...' the regex backtracks exponentially. Re-anchored to match a fixed first <br> followed by one or more additional <br>s, so each whitespace run has exactly one home. - import.ts: fetchBuffer was typed Promise<Buffer> but call sites chained .expect(200) on it, which only works on supertest's Test object. Return the Test (typed any) so the chain is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): plugin-aware HTML cleanup, PDF text-align, monospace ep_headings2/ep_align emit one heading-styled blank-line block after every styled line in the pad ('<h1 style=text-align:right></h1>'), which both html-to-docx and our pdfkit walker render as a full empty paragraph. Plus the pdfkit walker had no support for text-align or monospace, so right/center alignment and 'code' lines rendered the same as plain body text. - New dropEmptyBlocks helper strips empty h1-h6/p/code/pre/div/ blockquote wrappers in preprocessing. Iterates so nested empties collapse too. Applied before both DOCX and PDF conversion. - PDF walker now reads style='text-align:left|center|right|justify' on block elements (h1-6, p, div) and passes it as pdfkit's align option. align is applied once per continued run, then reset on flushLine so the next block can pick up its own value. - PDF walker handles <code>, <tt>, <kbd>, <samp> as inline monospace (Courier) and <pre> as block monospace (Courier + breakLine on open/close). 11 new unit tests: - 4 for dropEmptyBlocks (heading wrappers, code, nesting, pass-through) - 1 for PDF text-align (compares the BT matrix x for left vs right) - 2 for Courier in <code> and <pre> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): separate adjacent heading-style blocks on import Round-trip bug: ep_headings2 emits <h1>/<h2>/<code> in pad HTML. Mammoth round-trips them as adjacent <h1>A</h1><h2>B</h2><p>C</p> on import. Etherpad's server-side content collector has a default _blockElems set of just {div, p, pre, li}, and ep_headings2 only registers the CLIENT-side aceRegisterBlockElements -- not the server-side ccRegisterBlockElements. So h1/h2/code end up being treated as inline by the importer, and adjacent blocks merge into a single pad line. Fix: insert <br> after </h1>...</h6>/</code> when followed by another block. Server-side workaround keeps this PR self-contained regardless of plugin version. The right long-term fix is to extend ep_plugin_helpers' lineAttribute factory to register both hooks (filed as a follow-up). Tests: - 5 unit tests for separateAdjacentHeadingBlocks - New end-to-end round-trip test asserts H1+H2+P land on three separate pad lines after the import path. Plus the prior PDF text-align/Courier/code commit also included here: - code/tt/kbd/samp inherit text-align from style attribute - pre inherits text-align too Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): blank-line round-trip + DOCX code monospace + a==c tests DOCX round-trip dropping blank pad lines: - wrapLooseLines now emits an explicit <p></p> marker for each blank line in a <br> run, instead of collapsing all gaps into a single paragraph break. (N consecutive <br>s -> 1 paragraph boundary + N-2 empty <p></p> markers, mapping to N-1 blank pad lines.) - mammoth's docxBufferToHtml now passes ignoreEmptyParagraphs:false so the empty <w:p> entries survive the import side. mammoth's default of true was silently dropping them. - dropEmptyBlocks no longer strips <p></p> -- that's the meaningful marker for the round-trip. Empty <h1>/<code>/<pre>/<div>/ <blockquote> are still stripped (plugin noise). DOCX <code> rendering as monospace: - New applyMonospaceToCode wraps code/tt/kbd/samp/pre content in a <span style="font-family:'Courier New', monospace">. html-to-docx honors that and emits <w:rFonts w:ascii="Courier New".../>, which Word renders as Courier. The bare <code> tag is otherwise just a no-op for html-to-docx. - Applied only on the DOCX export path (PDF walker already handles monospace via Courier font selection). Round-trip tests: - New a==c suite: txt, etherpad, html, docx -- export from src, import to dst, re-export and compare against the meaningful invariant (line text for binary formats; trimmed body for HTML). - HTML test tolerates one trailing <br> per round-trip because setPadHTML appends a final <p> on import; this is pre-existing core behavior, not our bug. - DOCX test normalizes trailing newline run (same reason). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): preserve <w:jc> alignment through mammoth round-trip mammoth doesn't expose Word's paragraph alignment (`<w:jc>`) when it converts a docx to HTML -- there's no equivalent in its style-mapping machinery. To keep alignment through DOCX round-trips we walk the docx's document.xml directly, pull the `w:val` from each `<w:p>`'s `<w:jc>`, and inject `style="text-align:..."` onto the matching block element in mammoth's output by document order. Word's w:jc accepts more values than CSS text-align; we map left/ start, center, right/end, both/justify/distribute and skip the rest (start/end take left/right because we don't track ltr/rtl from the docx for now). Combines with the upstream ep_align PR (ether/ep_align#183) for the full round-trip: this PR makes the mammoth output carry the alignment style; ep_align#183 makes the importer pick it up. Closes the alignment side of #7538. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): preserve <a href> inside <code>/<pre> in DOCX export html-to-docx silently drops <a href> children of <code>/<pre> tags (and of styled <span>s, but the <code> wrapper is the active offender here). The pad-export HTML produced by ep_headings2 + ep_align uses <code style='text-align:right'>...<a href>...</a>...</code> for each 'Code'-style line, which lost its links on every DOCX export. Workaround: applyMonospaceToCode now drops the code/pre/tt/kbd/samp wrapper entirely. The non-anchor content gets wrapped in monospace spans; anchors are emitted unstyled so they keep their hyperlink. For block-level usage (<pre>, or <code> with an inline style attr) we emit a wrapping <p> and forward the text-align style. Run BEFORE wrapLooseLines so the <p> doesn't get double-wrapped. Tests added: - inline <code> -> just a styled span (no <code> wrapper) - <code style='text-align:right'> -> <p style> wrap - <pre> -> always block-wrapped - <tt>/<kbd>/<samp> -> inline span only - regression: <a href> inside <code> survives html-to-docx round-trip with both the URL in word/_rels/document.xml.rels AND a <w:hyperlink> in the document body Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(7538): HTML import line-break drift between blocks Etherpad's HTML export wraps each pad line in <p>...</p> (or <h1>, <code>, etc.) and then appends a <br> between lines. The closing block tag already ends the line for contentcollector, so the trailing <br> is redundant -- and on import the server collector counts BOTH as line breaks, doubling every blank line between paragraphs and inserting an extra blank between adjacent headings. Two fixes, both gated on the runtime block-element registry so they don't double-trigger when the underlying plugin already handles adjacency: 1. HTML import path now runs the new collapseRedundantBrAfterBlocks helper before setPadHTML. Drops a single <br> immediately following </p>/</h1-6>/</code>/</pre>/</div>/</blockquote>/</ul>/ </ol>/</li>/</table>/</tr>/</td>/</th>. Multiple consecutive <br>s after a block keep all but the first (the rest still represent intentional blank lines). 2. The DOCX-import separateAdjacentHeadingBlocks workaround now checks whether 'h1' is in the runtime ccRegisterBlockElements set before inserting <br>s. When ep_headings2 has the new server hook (per ep_plugin_helpers#14 + the upcoming ep_headings2 PR), the workaround correctly stays out of the way -- otherwise it adds an extra blank line per heading transition. Also fixed a subtle ts-check failure on the import.ts test changes and a leftover implicit-any in ImportDocxNative's alignment preserver. Tests added: - collapseRedundantBrAfterBlocks: 5 unit tests (each block tag, whitespace tolerance, multiple <br> keeping intentional blanks) - HTML import: 'does not introduce a blank line between H1 and H2', 'preserves blank-line count between H1 and H2 (realistic shape)' reproduces the 5-blanks-where-2-expected bug from the user's round-trip pad. 1054 backend tests pass locally (the 6 failures are the pre-existing favicon/webaccess send@1.x dotfile-path issue from running under .claude/, doesn't reach CI). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(7538): skip plugin-dependent HTML import tests on no-plugin CI The two new HTML-import-adjacency tests assume ep_headings2 (or another plugin) has registered h1/h2 as server-side block elements via ccRegisterBlockElements. Without that, contentcollector treats <h1>/<h2> as inline and adjacent ones merge into a single pad line -- making the assertions inapplicable. CI's backend-tests job runs without plugins installed, so guard the describe block with a runtime hooks.callAll() check and skip when h1 isn't a registered block. Local dev with ep_headings2 (and the local plugin patch wiring ccRegisterBlockElements) still exercises both tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JohnMcLear merged commit b4e6538 into main May 8, 2026
3 checks passed

JohnMcLear deleted the fix/import-style-text-align branch May 8, 2026 15:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: read text-align from inline style on HTML import#183

fix: read text-align from inline style on HTML import#183
JohnMcLear merged 2 commits intomainfrom
fix/import-style-text-align

JohnMcLear commented May 8, 2026

Uh oh!

qodo-code-review Bot commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JohnMcLear commented May 8, 2026

Problem

Fix

Why it matters now

Test

Uh oh!

qodo-code-review Bot commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant