fix(latex): fully unwrap deeply nested formatting macros#3249

Merged
PeterStaar-IBM merged 4 commits into docling-project:main from Smeet23:fix/nested-formatting-macros
Apr 17, 2026

Conversation

@Smeet23
Contributor

@Smeet23 Smeet23 commented Apr 7, 2026

Summary

Fixes #3207

Two related bugs when formatting macros are nested inside each other:

Bug 1 — Colour name leaks into extracted text

_nodes_to_text fell through to its generic else-branch for \textcolor, which concatenates all arguments including the colour name. For example:

\section{\textcolor{blue}{\textbf{\textsc{[SEP]}}}}

Previously produced heading text "blue [SEP]" instead of "[SEP]".
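
A minimal sketch of the leak, for illustration only. It uses a hypothetical `Macro` dataclass in place of the parser's real node types; `Macro`, `naive_text`, `fixed_text`, and `COLOR_MACROS` are assumed names, not identifiers from the docling codebase. The only point it demonstrates is that concatenating every argument keeps the colour name, while recursing into the last argument alone does not.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Macro:
    """Hypothetical stand-in for a parsed LaTeX macro node."""
    name: str
    args: List[List["Node"]] = field(default_factory=list)


Node = Union[str, Macro]

# Assumed classification: the first argument is a colour, not text.
COLOR_MACROS = {"textcolor", "colorbox"}


def naive_text(nodes: List[Node]) -> str:
    """Generic branch: joins every argument, so the colour name leaks."""
    parts: List[str] = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        else:
            parts.extend(naive_text(arg) for arg in node.args)
    return " ".join(p for p in parts if p)


def fixed_text(nodes: List[Node]) -> str:
    """Colour macros recurse into their last (text) argument only."""
    parts: List[str] = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        elif node.name in COLOR_MACROS and node.args:
            parts.append(fixed_text(node.args[-1]))
        else:
            parts.extend(fixed_text(arg) for arg in node.args)
    return " ".join(p for p in parts if p)


# \textcolor{blue}{\textbf{\textsc{[SEP]}}}
heading = [
    Macro("textcolor", [["blue"], [Macro("textbf", [[Macro("textsc", [["[SEP]"]])]])]])
]
print(naive_text(heading))  # blue [SEP]
print(fixed_text(heading))  # [SEP]
```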

Bug 2 — Inline paragraph broken into fragments

\textsc, \textsf, \textrm, \textnormal, \mbox, \textcolor, and \colorbox are listed in MACROS_STRUCTURAL. When encountered mid-sentence, _process_macro_node_inline flushed the text buffer and created a new doc node, splitting a single sentence into separate paragraphs.
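
The fragmenting behaviour can be sketched with a toy token stream. Everything here is an assumption made for illustration: the `(kind, content)` token shape, the `STRUCTURAL` and `INLINE_STYLE` sets, and `build_paragraphs` are hypothetical and far simpler than `_process_macro_node_inline`. The sketch only shows why checking the inline-style set before the structural flush path keeps a sentence in one piece.

```python
from typing import List, Tuple

# Hypothetical token: ("text", s) for plain text, (macro_name, s) for a
# macro whose single text argument is s.
Token = Tuple[str, str]

STRUCTURAL = {"section", "textsc", "mbox"}  # illustrative subset
INLINE_STYLE = {"textsc", "mbox"}           # checked first after the fix


def build_paragraphs(tokens: List[Token], inline_first: bool) -> List[str]:
    paragraphs: List[str] = []
    buffer: List[str] = []

    def flush() -> None:
        # Emit the accumulated inline text as one paragraph.
        if buffer:
            paragraphs.append("".join(buffer))
            buffer.clear()

    for kind, content in tokens:
        if kind == "text":
            buffer.append(content)
        elif inline_first and kind in INLINE_STYLE:
            # Fixed path: style macros are accumulated inline.
            buffer.append(content)
        elif kind in STRUCTURAL:
            # Flush path: the macro becomes its own node.
            flush()
            paragraphs.append(content)
        else:
            buffer.append(content)
    flush()
    return paragraphs


tokens = [("text", "The "), ("textsc", "Bert"), ("text", " model works.")]
print(build_paragraphs(tokens, inline_first=False))  # ['The ', 'Bert', ' model works.']
print(build_paragraphs(tokens, inline_first=True))   # ['The Bert model works.']
```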

Changes

  • handlers/macros.py — Add explicit MACROS_TEXT_STYLE and textcolor/colorbox branches in _process_macro_node_inline before the MACROS_STRUCTURAL flush path, so they are accumulated inline (same pattern as MACROS_TEXT_FORMATTING).

  • utils/text.py — Add matching branches in _nodes_to_text so that MACROS_TEXT_STYLE macros extract only their text argument, and textcolor/colorbox skip the colour argument and recurse into the text-content argument only.

Test plan

  • New test test_latex_nested_formatting_macros covering all five patterns (textsc inline, textcolor+textbf inline, deep three-level nesting, textbf wrapping textsc, heading with nested textcolor)
  • All 66 existing latex backend tests pass
  • No raw LaTeX commands (\textbf, \textsc, \textcolor, …) appear in converted output
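
The last check in the test plan could be expressed roughly as follows. This is a sketch under stated assumptions, not the test added in this PR: `leaked_commands` and `RAW_COMMAND` are hypothetical names, and the pattern simply treats any backslash followed by ASCII letters as a leaked command.

```python
import re
from typing import List

# A backslash followed by letters, e.g. \textbf or \textcolor.
RAW_COMMAND = re.compile(r"\\[A-Za-z]+")


def leaked_commands(text: str) -> List[str]:
    """Return every raw LaTeX command name still present in converted text."""
    return RAW_COMMAND.findall(text)


assert leaked_commands("The Bert model uses [SEP] tokens.") == []
assert leaked_commands(r"The \textsc{Bert} model") == [r"\textsc"]
```

A test built on such a helper would assert that the list is empty for the converted output of each of the five patterns.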

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

DCO Check Passed

Thanks @Smeet23, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Apr 7, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
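
For reference, the title rule above can be tried locally with Python's `re` module. This is a minimal sketch: `is_conventional` is a hypothetical helper, and it assumes Mergify's `~=` operator behaves like an ordinary regular-expression search, which the leading `^` anchors to the start of the title.

```python
import re

# The pattern is copied verbatim from the merge-protection rule.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


print(is_conventional("fix(latex): fully unwrap deeply nested formatting macros"))  # True
print(is_conventional("Fully unwrap nested macros"))  # False
```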

@Smeet23 Smeet23 force-pushed the fix/nested-formatting-macros branch from d48ff45 to 57e84f4 Compare April 7, 2026 17:08
@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 85.00000% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/latex/utils/text.py | 81.25% | 9 Missing ⚠️ |


PeterStaar-IBM
PeterStaar-IBM previously approved these changes Apr 8, 2026
@PeterStaar-IBM
Member

@Smeet23 Can you fix the merge conflicts?

@Smeet23 Smeet23 force-pushed the fix/nested-formatting-macros branch from 9e644c3 to 9b2d166 Compare April 9, 2026 11:05
@Smeet23
Contributor Author

Smeet23 commented Apr 9, 2026

Hi @PeterStaar-IBM — merge conflicts have been resolved. Here is what happened and what was done:

Conflict source: tests/test_backend_latex.py had a conflict between the two \vspace/\hspace leak tests that were merged into main (from another PR) and our new test_latex_nested_formatting_macros test. Both additions were to the end of the same file, so git flagged it as a conflict.

Resolution: Rebased onto the latest main and kept all three tests — test_vspace_argument_does_not_leak, test_hspace_argument_does_not_leak, and test_latex_nested_formatting_macros. Also moved a stray import re (which was inside the test function body) to the top-level imports where it belongs.

The branch is now clean and up to date with main. Ready for re-review when you have a moment.

PeterStaar-IBM
PeterStaar-IBM previously approved these changes Apr 12, 2026
@PeterStaar-IBM
Member

@Smeet23 Please run `uv run pre-commit run --all-files` and fix all errors you encounter; we are close to getting this merged!

@Smeet23
Contributor Author

Smeet23 commented Apr 12, 2026

Done @PeterStaar-IBM! Pre-commit is now clean:

  • Ruff formatter — Passed
  • Ruff linter — Fixed the one remaining C901 complexity error in _nodes_to_text by extracting the macro-handling logic into a _macro_node_to_text helper method (complexity dropped from 31 → within limit)
  • MyPy — The only remaining mypy error is the pre-existing playwright.sync_api import in html_backend.py, unrelated to this PR

Ready for merge!

@Smeet23 Smeet23 force-pushed the fix/nested-formatting-macros branch from e742925 to 931f0bf Compare April 14, 2026 10:30
@PeterStaar-IBM
Member

@Smeet23 Can you redo the DCO? Then we merge this.

smeetagrawal23-sys and others added 4 commits April 15, 2026 14:41

fix(latex): fully unwrap deeply nested formatting macros

Two related bugs when formatting macros are nested:

1. `\textcolor{color}{...}` extracted the color name alongside the
   text content because `_nodes_to_text` fell through to the generic
   else branch, which concatenates all arguments. E.g.
   `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
   "blue [SEP]" instead of "[SEP]".

2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
   `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
   encountered mid-sentence `_process_macro_node_inline` flushed the
   text buffer and called `_process_macro`, which creates a new doc
   node. This broke inline paragraphs into fragments.

Fix:
- Add explicit handlers for MACROS_TEXT_STYLE and textcolor/colorbox
  in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
  path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Add matching handlers in `_nodes_to_text` so colour names are
  skipped and only the text-content argument is returned.

Fixes docling-project#3207

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

fix(latex): fully unwrap deeply nested formatting macros

Two related bugs when formatting macros are nested inside each other:

1. `\textcolor{color}{...}` extracted the color name alongside the
   text content because `_nodes_to_text` fell through to the generic
   else branch, which concatenates all arguments. E.g.
   `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
   "blue [SEP]" instead of "[SEP]".

2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
   `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
   encountered mid-sentence `_process_macro_node_inline` flushed the
   text buffer and called `_process_macro`, which creates a new doc
   node. This broke inline paragraphs into fragments.

Fix:
- Add MACROS_COLOR_INLINE constant for textcolor/colorbox to keep
  all macro classifications in one place (constants.py).
- Add explicit handlers for MACROS_TEXT_STYLE and MACROS_COLOR_INLINE
  in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
  path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Merge the identical MACROS_TEXT_FORMATTING and MACROS_TEXT_STYLE
  branches in `_nodes_to_text` into a single branch.
- Use argnlist[-1] instead of reversed() iteration for
  MACROS_COLOR_INLINE since the text content is always the last arg,
  consistent with _extract_macro_arg.

Fixes docling-project#3207

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

refactor(latex): extract _macro_node_to_text to reduce complexity

Split the macro-handling branch of `_nodes_to_text` into a dedicated
`_macro_node_to_text` helper so that cyclomatic complexity stays within
the ruff C901 limit (was 31, now < 30 for both methods).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

fix(latex): migrate nested-formatting test to tests/test_latex/

Upstream reorganised all latex tests from tests/test_backend_latex.py
into tests/test_latex/. Move test_latex_nested_formatting_macros to
tests/test_latex/test_macros.py and fix ruff-reported style nits.

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
@Smeet23 Smeet23 force-pushed the fix/nested-formatting-macros branch from fcabc97 to fb2c708 Compare April 15, 2026 09:12
@Smeet23
Contributor Author

Smeet23 commented Apr 15, 2026

Hi @cau-git and @PeterStaar-IBM — all CI checks are now passing and the DCO sign-off has been fixed across all commits. Could you please take a final look and merge if everything looks good? Happy to address any remaining concerns. Thanks!

@PeterStaar-IBM PeterStaar-IBM merged commit 101233e into docling-project:main Apr 17, 2026
25 checks passed
orbisai0security pushed a commit to orbisai0security/docling that referenced this pull request May 5, 2026
…ject#3249)

* fix(latex): fully unwrap deeply nested formatting macros

Two related bugs when formatting macros are nested:

1. `\textcolor{color}{...}` extracted the color name alongside the
   text content because `_nodes_to_text` fell through to the generic
   else branch, which concatenates all arguments. E.g.
   `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
   "blue [SEP]" instead of "[SEP]".

2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
   `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
   encountered mid-sentence `_process_macro_node_inline` flushed the
   text buffer and called `_process_macro`, which creates a new doc
   node. This broke inline paragraphs into fragments.

Fix:
- Add explicit handlers for MACROS_TEXT_STYLE and textcolor/colorbox
  in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
  path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Add matching handlers in `_nodes_to_text` so colour names are
  skipped and only the text-content argument is returned.

Fixes docling-project#3207

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* fix(latex): fully unwrap deeply nested formatting macros

Two related bugs when formatting macros are nested inside each other:

1. `\textcolor{color}{...}` extracted the color name alongside the
   text content because `_nodes_to_text` fell through to the generic
   else branch, which concatenates all arguments. E.g.
   `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
   "blue [SEP]" instead of "[SEP]".

2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
   `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
   encountered mid-sentence `_process_macro_node_inline` flushed the
   text buffer and called `_process_macro`, which creates a new doc
   node. This broke inline paragraphs into fragments.

Fix:
- Add MACROS_COLOR_INLINE constant for textcolor/colorbox to keep
  all macro classifications in one place (constants.py).
- Add explicit handlers for MACROS_TEXT_STYLE and MACROS_COLOR_INLINE
  in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
  path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Merge the identical MACROS_TEXT_FORMATTING and MACROS_TEXT_STYLE
  branches in `_nodes_to_text` into a single branch.
- Use argnlist[-1] instead of reversed() iteration for
  MACROS_COLOR_INLINE since the text content is always the last arg,
  consistent with _extract_macro_arg.

Fixes docling-project#3207

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* refactor(latex): extract _macro_node_to_text to reduce complexity

Split the macro-handling branch of `_nodes_to_text` into a dedicated
`_macro_node_to_text` helper so that cyclomatic complexity stays within
the ruff C901 limit (was 31, now < 30 for both methods).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

* fix(latex): migrate nested-formatting test to tests/test_latex/

Upstream reorganised all latex tests from tests/test_backend_latex.py
into tests/test_latex/. Move test_latex_nested_formatting_macros to
tests/test_latex/test_macros.py and fix ruff-reported style nits.

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

---------

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: OrbisAI Security <mediratta01.pally@gmail.com>

Development

Successfully merging this pull request may close these issues.

fix: deeply nested formatting macros not fully unwrapped in LaTeX backend

4 participants