Skip to content

fix(docx-core): emit paragraph-level markers outside <w:r> on rebuild#109

Merged
stevenobiajulu merged 1 commit intomainfrom
106-paragraph-level-markers-20260427
Apr 28, 2026
Merged

fix(docx-core): emit paragraph-level markers outside <w:r> on rebuild#109
stevenobiajulu merged 1 commit intomainfrom
106-paragraph-level-markers-20260427

Conversation

@stevenobiajulu
Copy link
Copy Markdown
Member

Summary

Fixes #106. Restores w:commentRangeStart / w:commentRangeEnd and w:bookmarkStart / w:bookmarkEnd on rebuild output by atomizing paragraph-internal markers and routing them outside the synthetic <w:r> wrapper that buildSingleRun emits for run-level leaves.

PR #101 traded these markers away to fix a "Word can't open" regression (issue #94), accepting that comment ranges (the highlighted span) and paragraph-internal bookmark spans were lost on rebuild output. This PR puts them back — at the schema-correct position (siblings of <w:r>, not children) — so Word still opens the file and the spans survive.

Mechanics

  • New PARAGRAPH_LEVEL_TAGS set + isParagraphLevelLeaf() predicate in atomizer.ts.
  • Ancestry-sensitive atomization: only emit a marker atom when the element sits inside a <w:p>. Body/table-sibling placements stay out of the atom stream and are still handled by the existing scaffold-strip block (avoids misattachment via assignParagraphIndices, which only assigns indices to atoms with a w:p ancestor).
  • Rebuild-only gate: new atomizeParagraphLevelMarkers flag on AtomizeTreeOptions, set true only on the rebuild path. The inplace path's bookmark reconciliation breaks if bookmarkStart/End atoms enter the stream (orphaned-bookmark warnings, round-trip safety failures).
  • buildRunContentWithParagraphMarkers(): walks a RunGroup left to right, buffers run-level atoms, flushes the buffer to <w:r> blocks via subGroupByRPr + buildSingleRun on each marker, and emits markers as bare elements. Always sub-groups by per-atom rPr — group.rPr is captured from the first atom and is intentionally suppressed for moved groups, so trusting it would bleed formatting across segments.
  • Bookmark integrity: applies enforceConsumerCompatibility on the rebuilt body (mirrors inplace mode) to dedupe bookmark names/IDs and synthesize recovery starts/ends for orphans. Required because cross-context bookmark pairs (one half paragraph-internal, other body-level) appear unbalanced after rebuild.

Tests

  • Strengthen "Comment added on revised side" in rebuild-auxiliary-merge.test.ts to verify markers are present and that their immediate parent is one of w:p / w:ins / w:del / w:moveFrom / w:moveTo (DOM-based, not regex). Original "must NOT appear inside <w:r>" regex regression stays as a floor.
  • New "Cross-paragraph comment span" — markers preserved across paragraph boundary with matching w:id.
  • New "Sibling-style scaffold markers" — body-level bookmarks don't leak into reconstructed paragraphs.
  • New "Inplace regression — comment span" — verifies inplace mode still succeeds without falling back to rebuild when revised has a spanning comment.
  • Extend synthetic-docx-fixture.ts with commentSpanParagraphs, bookmarkOnParagraph, siblingBookmarkBefore.

Counts: docx-core 1094 pass (was 1091, +3 new), docx-mcp 638 pass, full lint clean.

Out of scope (follow-ups will be filed)

  • moveFromRange* / moveToRange* — collides with synthetic emission in wrapWithMoveFrom/To; needs scaffold-cleanup + dedup work.
  • permStart / permEnd — same placement rule per ECMA-376; mechanical extension once helper exists, but lacks fixture coverage.
  • buildDocumentPreservingStructure cross-paragraph comment-range stripping in inplace mode — separate code path; explicitly deferred by issue Reconstruct paragraph-level markers (commentRange, bookmark) outside <w:r> in rebuild mode #106 itself.

Peer review

Reviewed by Codex CLI (with repo access) and Gemini CLI in parallel before implementation; both reviews informed the final design — particularly Codex's catches on:

  • ancestry-sensitive atomization (vs naive tag-only)
  • gating to rebuild-only via options flag (shouldStartNewParagraph keeps paragraphIndex === undefined atoms in the current paragraph, so body-level markers would otherwise misattach)
  • always sub-grouping by rPr in the marker-aware helper (overriding group.rPr because moved groups suppress rPr-splitting)
  • DOM-based test assertions (the existing "comment added" scenario produces an Inserted range, so the marker's parent is <w:ins> — not raw <w:p>)

Test plan

Fixes: #106
Ref: #101

Restores commentRange* and bookmark* on rebuild output by atomizing
paragraph-internal markers and routing them outside the synthetic <w:r>
wrapper that buildSingleRun emits for run-level leaves.

PR #101 traded these markers away to fix a Word-can't-open regression
(issue #94), accepting that the highlighted span on commentRange and
paragraph-internal bookmark spans were lost on rebuild output. This
PR puts them back — but at the schema-correct position (siblings of
<w:r>, not children) — so Word still opens the file and the spans
survive.

Mechanics:
- Add PARAGRAPH_LEVEL_TAGS (commentRangeStart/End, bookmarkStart/End) and
  isParagraphLevelLeaf() in atomizer.ts.
- Make atomization ancestry-sensitive: only emit a marker atom when the
  element sits inside a <w:p>. Body/table-sibling placements stay out of
  the atom stream and are still handled by the existing scaffold-strip
  block in the reconstructor (avoids misattachment via assignParagraph-
  Indices, which only assigns indices to atoms with a w:p ancestor).
- Gate atomization with the new atomizeParagraphLevelMarkers flag in
  AtomizeTreeOptions, set true only on the rebuild path. The inplace
  path's bookmark reconciliation breaks if bookmarkStart/End atoms enter
  the stream (orphaned-bookmark warnings, round-trip safety failures).
- Add buildRunContentWithParagraphMarkers(): walks a RunGroup left to
  right, buffers run-level atoms, flushes the buffer to <w:r> blocks via
  subGroupByRPr+buildSingleRun on each marker, and emits markers as bare
  elements. Always sub-groups by per-atom rPr — group.rPr is captured
  from the first atom and is intentionally suppressed for moved groups,
  so trusting it would bleed formatting across segments.
- Apply enforceConsumerCompatibility on the rebuilt body to dedupe
  bookmark names/IDs and synthesize recovery starts/ends for orphans
  (mirrors the post-processing already used in inplace mode). This is
  required because cross-context bookmark pairs (one half paragraph-
  internal, the other body-level) appear unbalanced after rebuild.

Tests:
- Strengthen the existing "Comment added on revised side" assertion in
  rebuild-auxiliary-merge.test.ts to verify markers are present and that
  their immediate parent is one of w:p / w:ins / w:del / w:moveFrom /
  w:moveTo (DOM-based, not regex). The original "must NOT appear inside
  <w:r>" regex regression stays as a floor.
- Add cross-paragraph comment-span test, sibling-style scaffold-marker
  test, and an inplace-mode regression test for the comment-span
  scenario. Extend synthetic-docx-fixture.ts with commentSpanParagraphs,
  bookmarkOnParagraph, and siblingBookmarkBefore options.

Move-range markers (moveFromRange*, moveToRange*) and permStart/permEnd
are out of scope here — moveRange collides with the synthetic emission
in wrapWithMoveFrom/To and permRange lacks fixture coverage. Tracked in
follow-up issues.

Reviewed by Codex CLI (with repo access) and Gemini CLI in parallel
before implementation; both reviews informed the final design (ancestry-
sensitive atomization, rebuild-only gate, sub-group-by-rPr override of
group.rPr).

Fixes: #106
Ref: #101
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
site Ready Ready Preview, Comment Apr 27, 2026 11:23pm

Request Review

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 86.95652% with 15 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ocx-core/src/integration/synthetic-docx-fixture.ts 65.78% 12 Missing and 1 partial ⚠️
packages/docx-core/src/atomizer.ts 95.65% 1 Missing ⚠️
...re/src/baselines/atomizer/documentReconstructor.ts 97.82% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@stevenobiajulu stevenobiajulu merged commit d878bd5 into main Apr 28, 2026
23 checks passed
stevenobiajulu added a commit that referenced this pull request Apr 28, 2026
* chore(release): bump workspace versions to 0.9.1

Patch release covering bug fixes accumulated since v0.9.0:

- fix(docx-core): merge auxiliary parts in rebuild mode (#94, #101)
- fix(docx-core): emit paragraph-level markers outside <w:r> on rebuild (#109)
- fix(docx-core): preserve reply comments in rebuild mode (#108, #112)
- test(docx-mcp): harden vitest path allowlist against /tmp symlink topology (#105)
- ci: SHA-pin all GitHub Actions, add dependency review and Dependabot
- ci: fix PR title check for acronyms and fork PRs (#88)
- chore(skills): sync docx-editing to 0.3.0 (#95)

Generated via scripts/bump_version.mjs.

* fix(docx-mcp): declare @xmldom/xmldom dependency explicitly

The bumped package-lock.json regenerated cleanly and exposed a latent
bug: docx-mcp directly imports @xmldom/xmldom in tag_parser.ts but did
not declare it as a dependency. It was previously masked by npm
hoisting from docx-core's transitive dep — the workspace happened to
work because the lockfile carried a stale hoisted xmldom@0.8.11.

Pinning to ^0.8.11 (matching what was already running in production via
the silent hoist) preserves observable behavior. tag_parser depends on
xmldom 0.8's lenient parsing of cross-nested tags
(<b><i>text</b></i>) — xmldom 0.9 throws on such input. Migrating
docx-mcp's tag_parser to xmldom 0.9 is its own concern; this commit
keeps the 0.9.1 release a pure version bump.

docx-core continues to use xmldom ^0.9.9 (PR #75); the two versions
coexist under their respective package's node_modules. No DOM nodes
cross the docx-core/docx-mcp boundary, so the version split is safe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reconstruct paragraph-level markers (commentRange, bookmark) outside <w:r> in rebuild mode

1 participant