fix(latex): fully unwrap deeply nested formatting macros#3249
fix(latex): fully unwrap deeply nested formatting macros#3249PeterStaar-IBM merged 4 commits intodocling-project:mainfrom
Conversation
|
✅ DCO Check Passed Thanks @Smeet23, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
d48ff45 to
57e84f4
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
faa1bde to
9e644c3
Compare
|
@Smeet23 Can you fix the merge conflicts? |
9e644c3 to
9b2d166
Compare
|
Hi @PeterStaar-IBM — merge conflicts have been resolved. Here is what happened and what was done: Conflict source: Resolution: Rebased onto the latest The branch is now clean and up to date with |
|
@Smeet23 Please run |
|
Done @PeterStaar-IBM! Pre-commit is now clean:
Ready for merge! |
e742925 to
931f0bf
Compare
|
@Smeet23 Can you redo the DCO? Then we merge this. |
Two related bugs when formatting macros are nested:
1. `\textcolor{color}{...}` extracted the color name alongside the
text content because `_nodes_to_text` fell through to the generic
else branch, which concatenates all arguments. E.g.
`\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
"blue [SEP]" instead of "[SEP]".
2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
`\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
encountered mid-sentence `_process_macro_node_inline` flushed the
text buffer and called `_process_macro`, which creates a new doc
node. This broke inline paragraphs into fragments.
Fix:
- Add explicit handlers for MACROS_TEXT_STYLE and textcolor/colorbox
in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Add matching handlers in `_nodes_to_text` so colour names are
skipped and only the text-content argument is returned.
Fixes docling-project#3207
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Two related bugs when formatting macros are nested inside each other:
1. `\textcolor{color}{...}` extracted the color name alongside the
text content because `_nodes_to_text` fell through to the generic
else branch, which concatenates all arguments. E.g.
`\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
"blue [SEP]" instead of "[SEP]".
2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
`\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
encountered mid-sentence `_process_macro_node_inline` flushed the
text buffer and called `_process_macro`, which creates a new doc
node. This broke inline paragraphs into fragments.
Fix:
- Add MACROS_COLOR_INLINE constant for textcolor/colorbox to keep
all macro classifications in one place (constants.py).
- Add explicit handlers for MACROS_TEXT_STYLE and MACROS_COLOR_INLINE
in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Merge the identical MACROS_TEXT_FORMATTING and MACROS_TEXT_STYLE
branches in `_nodes_to_text` into a single branch.
- Use argnlist[-1] instead of reversed() iteration for
MACROS_COLOR_INLINE since the text content is always the last arg,
consistent with _extract_macro_arg.
Fixes docling-project#3207
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Split the macro-handling branch of `_nodes_to_text` into a dedicated `_macro_node_to_text` helper so that cyclomatic complexity stays within the ruff C901 limit (was 31, now < 30 for both methods). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Upstream reorganised all latex tests from tests/test_backend_latex.py into tests/test_latex/. Move test_latex_nested_formatting_macros to tests/test_latex/test_macros.py and fix ruff-reported style nits. Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
fcabc97 to
fb2c708
Compare
|
Hi @cau-git and @PeterStaar-IBM — all CI checks are now passing and the DCO sign-off has been fixed across all commits. Could you please take a final look and merge if everything looks good? Happy to address any remaining concerns. Thanks! |
…ject#3249) * fix(latex): fully unwrap deeply nested formatting macros Two related bugs when formatting macros are nested: 1. `\textcolor{color}{...}` extracted the color name alongside the text content because `_nodes_to_text` fell through to the generic else branch, which concatenates all arguments. E.g. `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text "blue [SEP]" instead of "[SEP]". 2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when encountered mid-sentence `_process_macro_node_inline` flushed the text buffer and called `_process_macro`, which creates a new doc node. This broke inline paragraphs into fragments. Fix: - Add explicit handlers for MACROS_TEXT_STYLE and textcolor/colorbox in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush path) so they are accumulated inline like MACROS_TEXT_FORMATTING. - Add matching handlers in `_nodes_to_text` so colour names are skipped and only the text-content argument is returned. Fixes docling-project#3207 Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com> * fix(latex): fully unwrap deeply nested formatting macros Two related bugs when formatting macros are nested inside each other: 1. `\textcolor{color}{...}` extracted the color name alongside the text content because `_nodes_to_text` fell through to the generic else branch, which concatenates all arguments. E.g. `\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text "blue [SEP]" instead of "[SEP]". 2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and `\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when encountered mid-sentence `_process_macro_node_inline` flushed the text buffer and called `_process_macro`, which creates a new doc node. This broke inline paragraphs into fragments. Fix: - Add MACROS_COLOR_INLINE constant for textcolor/colorbox to keep all macro classifications in one place (constants.py). - Add explicit handlers for MACROS_TEXT_STYLE and MACROS_COLOR_INLINE in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush path) so they are accumulated inline like MACROS_TEXT_FORMATTING. - Merge the identical MACROS_TEXT_FORMATTING and MACROS_TEXT_STYLE branches in `_nodes_to_text` into a single branch. - Use argnlist[-1] instead of reversed() iteration for MACROS_COLOR_INLINE since the text content is always the last arg, consistent with _extract_macro_arg. Fixes docling-project#3207 Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com> * refactor(latex): extract _macro_node_to_text to reduce complexity Split the macro-handling branch of `_nodes_to_text` into a dedicated `_macro_node_to_text` helper so that cyclomatic complexity stays within the ruff C901 limit (was 31, now < 30 for both methods). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com> * fix(latex): migrate nested-formatting test to tests/test_latex/ Upstream reorganised all latex tests from tests/test_backend_latex.py into tests/test_latex/. Move test_latex_nested_formatting_macros to tests/test_latex/test_macros.py and fix ruff-reported style nits. Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com> --------- Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com> Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: OrbisAI Security <mediratta01.pally@gmail.com>
Summary
Fixes #3207
Two related bugs when formatting macros are nested inside each other:
Bug 1 — Colour name leaks into extracted text
_nodes_to_textfell through to its generic else-branch for\textcolor, which concatenates all arguments including the colour name. For example:Previously produced heading text
"blue [SEP]"instead of"[SEP]".Bug 2 — Inline paragraph broken into fragments
\textsc,\textsf,\textrm,\textnormal,\mbox,\textcolor, and\colorboxare listed inMACROS_STRUCTURAL. When encountered mid-sentence,_process_macro_node_inlineflushed the text buffer and created a new doc node, splitting a single sentence into separate paragraphs.Changes
handlers/macros.py— Add explicitMACROS_TEXT_STYLEandtextcolor/colorboxbranches in_process_macro_node_inlinebefore theMACROS_STRUCTURALflush path, so they are accumulated inline (same pattern asMACROS_TEXT_FORMATTING).utils/text.py— Add matching branches in_nodes_to_textso thatMACROS_TEXT_STYLEmacros extract only their text argument, andtextcolor/colorboxskip the colour argument and recurse into the text-content argument only.Test plan
test_latex_nested_formatting_macroscovering all five patterns (textsc inline, textcolor+textbf inline, deep three-level nesting, textbf wrapping textsc, heading with nested textcolor)\textbf,\textsc,\textcolor, …) appear in converted output