@SimJeg
Contributor

@SimJeg SimJeg commented Dec 19, 2024

Hi,

This PR adds markdown text formatting for docx documents (italic, bold, underline and hyperlinks). I included a new tests/data/docx/unit_test_formatting.docx document to illustrate it. Using the latest docling main, the output of export_to_markdown is:

italic
bold
underline
hyperlink
italic and bold hyperlink
italic bold underline and hyperlink on the same line

with this PR it becomes:

italic
bold
underline
hyperlink
italic and bold hyperlink
italic bold underline and hyperlink on the same line

@mergify

mergify bot commented Dec 19, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
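The title rule above can be checked locally before pushing. This is a minimal sketch that assumes the `~=` operator behaves like a regular-expression search on the PR title; the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


if __name__ == "__main__":
    print(is_conventional("feat(docx): add text formatting and hyperlink support"))
    print(is_conventional("Enable markdown text formatting for docx"))
```

The first title carries a type and an optional scope, so it passes; the second has no type prefix and would fail the rule.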

Signed-off-by: SimJeg <sjegou@nvidia.com>
@SimJeg SimJeg force-pushed the docx-markdown-formatting branch from a221428 to 7f9464b Compare December 19, 2024 11:09
@SimJeg
Contributor Author

SimJeg commented Dec 19, 2024

Note: for underline I used the <u> / </u> tags that are not rendered on GitHub 😅

@SimJeg
Contributor Author

SimJeg commented Dec 26, 2024

@maxmnemonic @PeterStaar-IBM do you need any additional info for this PR?

@dolfim-ibm
Member

@SimJeg this is an interesting feature, but we should introduce it with an option to enable or disable it, because not all output formats will be compatible with markdown styling. There could also be some consideration on whether to propagate text styling in the Docling document format, but the option will be needed either way.

@SimJeg
Contributor Author

SimJeg commented Jan 6, 2025

Hi @dolfim-ibm,

Indeed, a different function should be applied for HTML, for instance. I can add an argument to the convert function (e.g. style=[None, "markdown", "html"]).

As there are several options to do this and I don't know the docling API very well, I'll wait for your confirmation before pushing updates.

@SimJeg
Contributor Author

SimJeg commented Jan 13, 2025

@dolfim-ibm any update on this?

@dolfim-ibm
Member

We actually are considering something similar to what you are proposing.

Adding the option for the format at convert time (with default None) is good, but we would like to have them in the PipelineOptions for the MS Word backend, since it will be something specific to it.

We will soon post more details, but the above is the general idea.

@cau-git
Member

cau-git commented Feb 7, 2025

@SimJeg We will implement a design as proposed here: #894
Then, this work will be able to make use of it.

@SimJeg
Contributor Author

SimJeg commented Feb 7, 2025 via email

@PeterStaar-IBM PeterStaar-IBM requested review from vagenas and removed request for maxmnemonic February 27, 2025 13:56
@PeterStaar-IBM
Member

@vagenas Can you have a look here and see how this intersects with our new concept of INLINE groups? I wonder if we need to extend docling-doc with BOLD, ITALIC, UNDERLINE and STRIPED groups and adapt this PR.

FYI: @cau-git @dolfim-ibm

@vagenas vagenas self-assigned this Feb 28, 2025
@vagenas
Member

vagenas commented Mar 17, 2025

Hi @SimJeg, with docling-project/docling-core#182 we introduced —as beta— a Serialization API operating against the DoclingDocument. This also includes formatting.

This test code shows how the various formatting options can be set.

👉 Can you update your PR so that it sets these formatting options when adding the respective items to the DoclingDocument?

The actual export to the various output formats should not be part of this PR as it will be taken care of by the new Serialization API — e.g. the Markdown export is already using the new API & automatically exports bold, italics, strikethrough, and hyperlinks.

@mergify

mergify bot commented Mar 31, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@vagenas
Member

vagenas commented Mar 31, 2025

Hi @SimJeg, where do you stand on the discussed updates?

To provide some more context on our example snippet:

  • we use formatting & hyperlink options to specify how individual items are to be formatted
  • we create an inline group to indicate that multiple items (that may be differently formatted) should actually be interpreted as parts of a single "inline" component instead of separate "paragraphs" (details here)

Hope that explains this a bit better.

Looking forward to your updates — would be great to have the DOCX backend updated this week! 🙌

Signed-off-by: SimJeg <sjegou@nvidia.com>
@SimJeg
Contributor Author

SimJeg commented Mar 31, 2025

@vagenas currently looking at it. I started by merging the current main and noticed that the following code

from docling.document_converter import DocumentConverter

source = "/path/to/docling/tests/data/docx/unit_test_formatting.docx"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

now returns &lt;u&gt; instead of <u> for underline.
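One plausible reading of this change, stated here as an assumption rather than something confirmed in the thread, is that the export now HTML-escapes item text, so tags embedded directly in the text are emitted literally. Python's standard `html.escape` produces the same substitution:

```python
import html

# Text carrying an inline tag, as the first version of this PR produced it.
raw = "<u>underline</u>"

# Escaping turns the angle brackets into entities, so the tag is no longer markup.
escaped = html.escape(raw)
print(escaped)  # &lt;u&gt;underline&lt;/u&gt;
```

If that is what happens, embedding tags in the text cannot work, which is consistent with the request to set formatting options on the items instead.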

@SimJeg
Contributor Author

SimJeg commented Mar 31, 2025

My current proposal is mainly to update the handle_text_elements function with 1 line:

+ text = self.format_paragraph(paragraph)

where format_paragraph handles formatting. The resulting text can then be used by 3 different functions depending on the style (self.add_listitem, doc.add_text or self.add_header).

In docx, a paragraph is a list of runs, each of which can have a different font (bold, italic and underline). This implies that to use the new Formatting option, my self.format_paragraph function should now return a list of tuples (text, format, hyperlink) that should then be handled by the 3 different functions. Is that correct?
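To make the question concrete, here is a minimal sketch of such a function. It is not the code in this PR: `Fmt` and `Run` are hypothetical stand-ins for the formatting option and for a docx run, and merging adjacent runs that share the same formatting is an added assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fmt:
    """Hypothetical stand-in for a formatting option (not docling's class)."""
    bold: bool = False
    italic: bool = False
    underline: bool = False


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a docx run: text plus its font flags."""
    text: str
    fmt: Fmt = field(default_factory=Fmt)
    hyperlink: Optional[str] = None


Element = Tuple[str, Fmt, Optional[str]]


def format_paragraph(runs: List[Run]) -> List[Element]:
    """Return (text, format, hyperlink) tuples, merging adjacent runs
    that share the same formatting and hyperlink."""
    elements: List[Element] = []
    for run in runs:
        if not run.text:
            continue  # skip empty runs
        if elements and elements[-1][1:] == (run.fmt, run.hyperlink):
            prev_text, fmt, link = elements[-1]
            elements[-1] = (prev_text + run.text, fmt, link)
        else:
            elements.append((run.text, run.fmt, run.hyperlink))
    return elements
```

Each tuple can then be passed to whichever of the three functions matches the paragraph style.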

@vagenas
Member

vagenas commented Mar 31, 2025

@SimJeg

  1. underline is indeed deliberately not considered by the Markdown export.
  2. to get that you'll need to create an inline group as mentioned above.

E.g. you could do something like this once you have your paragraph_elements:

parent: NodeItem = self.parents.get(self.get_level() - 1)
if len(paragraph_elements) > 1:
    parent = doc.add_group(
        label=GroupLabel.INLINE, parent=parent,
    )

and then pass that parent in your doc.add_text() invocations (please also strip the text there, as I think most Markdown interpreters only apply formatting on strings with no leading/trailing whitespace).

@SimJeg
Contributor Author

SimJeg commented Mar 31, 2025

@vagenas what's wrong with the code I shared where I (try to) use inline groups? (we posted almost simultaneously)

For stripping, my (deleted) format_text made sure the leading and trailing whitespace was preserved, because in Word you can have a text like "italic bold" where the space between "italic" and "bold" is itself in italic. If you don't preserve the whitespace, this becomes "italicbold" without spacing. It seems that export_to_markdown instead inserts \n\n between them by default, but that's not correct.

@vagenas
Member

vagenas commented Mar 31, 2025

@SimJeg

what's wrong with the code I shared where I (try to) use inline groups?

Well, you don't want a separate inline group for each paragraph element. Instead, you want a single inline group for the whole paragraph (when it comprises more than one element), so the snippet I shared should be used right after getting the paragraph_elements, not inside a for-loop over paragraph_elements.
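The placement can be sketched with a toy tree. The `Node` class and `add_paragraph` helper below are hypothetical and only mirror the idea; they are not the DoclingDocument API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; stands in for a document item or group."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def add_paragraph(parent: Node, elements: List[str]) -> Node:
    """Attach a paragraph's elements under `parent`.

    The inline group is created once, before the loop, and only when
    the paragraph has more than one element.
    """
    target = parent
    if len(elements) > 1:
        target = Node(label="inline")
        parent.children.append(target)
    for text in elements:
        target.children.append(Node(label="text", text=text))
    return target
```

Creating the group inside the loop would instead yield one group per element, which is the behavior described as wrong above.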

For stripping, my (deleted) format_text made sure the leading and trailing whitespace was preserved, because in Word you can have a text like "italic bold" where the space between "italic" and "bold" is itself in italic. If you don't preserve the whitespace, this becomes "italicbold" without spacing. It seems that export_to_markdown instead inserts \n\n between them by default, but that's not correct.

The exporter will add a single space between inline elements (not \n\n). Formatted spaces may be technically possible in Word, but they appear problematic in Markdown.
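A rough sketch of that behavior, under the assumption that each inline element is stripped, wrapped in its Markdown markers, and then joined with a single space. The tuple shape and the `render_inline` name are illustrative only, not the Serialization API.

```python
from typing import List, Optional, Tuple

# (text, bold, italic, hyperlink): an illustrative shape, not docling's.
Element = Tuple[str, bool, bool, Optional[str]]


def render_inline(elements: List[Element]) -> str:
    """Render the elements of one inline group as a single Markdown line."""
    parts = []
    for text, bold, italic, link in elements:
        core = text.strip()
        if not core:
            continue  # whitespace-only elements carry no visible content
        if bold and italic:
            core = f"***{core}***"
        elif bold:
            core = f"**{core}**"
        elif italic:
            core = f"*{core}*"
        if link:
            core = f"[{core}]({link})"
        parts.append(core)
    return " ".join(parts)


line = render_inline([
    ("italic ", False, True, None),  # trailing space is itself italic in Word
    ("bold", True, False, None),
    (" underline and ", False, False, None),
    ("hyperlink", False, False, "https://github.com/DS4SD/docling"),
    (" on the same line", False, False, None),
])
print(line)
```

Without the strip, the first element would render as `*italic *`, which CommonMark does not treat as emphasis because the closing marker follows whitespace.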

Signed-off-by: SimJeg <sjegou@nvidia.com>
@SimJeg
Contributor Author

SimJeg commented Mar 31, 2025

Well, you don't want to have a separate inline group for each paragraph element

I fixed it thanks, the parent was indeed not on the right side of the for loop 😅
I will move forward and update all other doc.add_* using the for loop

Signed-off-by: SimJeg <sjegou@nvidia.com>
@SimJeg
Contributor Author

SimJeg commented Mar 31, 2025

@vagenas I now handled lists too and updated tests/data/docx/unit_test_formatting.docx to have associated tests. For title, headers and equations, I did not change anything. The output looks good:

*italic*

**bold**

underline

[hyperlink](https://github.com/DS4SD/docling)

[***italic and bold hyperlink***](https://github.com/DS4SD/docling)

*italic* **bold** underline and [hyperlink](https://github.com/DS4SD/docling) on the same line

- *Italic bullet 1*
- **Bold bullet 2**
- Underline bullet 3

Your review is welcome

SimJeg added 4 commits March 31, 2025 16:38
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
@SimJeg
Contributor Author

SimJeg commented Apr 1, 2025

@vagenas I also added 2 lines for a missing feature: handling headers and footers in MS Word documents (see #632). I added the header of the first section and the footer of the last section, and updated tests/data/docx/unit_test_formatting.docx to include a header and a footer.

A better implementation would handle all sections properly, but the following code did not work (I did not look deeply into walk_linear, however).

for section in self.docx_obj.sections:
    doc = self.walk_linear(section.header._element, self.docx_obj, doc)
    for e in section.iter_inner_content():
        doc = self.walk_linear(e._element, self.docx_obj, doc)  # does not add anything
    doc = self.walk_linear(section.footer._element, self.docx_obj, doc)

Signed-off-by: SimJeg <sjegou@nvidia.com>
@SimJeg
Contributor Author

SimJeg commented Apr 2, 2025

@vagenas any feedback? Could you run the tests? It would be great to merge today if possible.

SimJeg added 3 commits April 2, 2025 13:47
Member

@vagenas vagenas left a comment

Have left some inline comments for final improvements.

SimJeg and others added 3 commits April 2, 2025 17:20
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Contributor

@rateixei rateixei left a comment

Hey, thanks for this! So in this approach, paragraphs that contain equations wouldn’t get any formatting, is this correct?

For a first iteration, that’s fine. But I think it should be doable to merge the two approaches — the equation extraction does a similar loop through the paragraph elements. The only problem is that, afaik, the docx library doesn’t have an Equation element like HyperLink for example, so I don’t know if iter_inner_content would catch equation fields. So, for a v2, we'd need a more flexible way to iterate through the paragraph elements.

@vagenas
Member

vagenas commented Apr 3, 2025

Thanks for the valuable input @rateixei — formatting in special case of equations to be addressed in follow-up iteration.

@vagenas vagenas merged commit bfcab3d into docling-project:main Apr 3, 2025
8 checks passed
@vagenas
Member

vagenas commented Apr 3, 2025

Thanks for this nice contribution @SimJeg! 🙌

rateixei pushed a commit that referenced this pull request Apr 8, 2025
* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
ceberam pushed a commit that referenced this pull request Apr 8, 2025
…ded to text (#1295)

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inhenrited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inhenrited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(docx): Improve text parsing (#1268)

* chore: bump version to 2.28.4 [skip ci]

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Improve text parsing

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix: Tesseract OCR CLI can't process images composed with numbers only (#1201)

fix wrong type text extracted by tesseract_ocr_cli_model

Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Flexibilize heading detection

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Fix trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Remove trailing space

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* docs: add visual grounding example (#1270)

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat(docx): add text formatting and hyperlink support (#630)

* feat: Enable markdown text formatting for docx

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix imports

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use Formatting

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle hyperlink

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle formatting properly for DocItemLabel.PARAGRAPH

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline group

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle bullet lists

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Strip elements

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run black and mypy

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Handle header and footer

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Use inline_fmt everywhere

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Run precommit

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Address feedback

Signed-off-by: SimJeg <sjegou@nvidia.com>

* Fix add_list_item

Signed-off-by: SimJeg <sjegou@nvidia.com>

* fix minor bugs, mark helper methods internal

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* fix(pptx): check if picture shape has an image attached (#1316)

Check if picture shape has an image attached in pptx backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* chore: update lock file (#1315)

chore: update lock

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* docs: add plugins docs (#1319)

add plugin docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* feat: handle <code> tags as code blocks (#1320)

handle <code> tags as code blocks

Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Identify headers through inhenrited style

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Log warning message instead of print

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

* Adding new latex symbols, simplifying how equations are added to text

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

---------

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: FernandoSSI <fernandosi2005@gmail.com>
Co-authored-by: Guilhem VERMOREL <83694424+guilhemvermorel@users.noreply.github.com>
Co-authored-by: gvl4 <Guilhem.VERMOREL@3ds.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Simon Jégou <SimJeg@users.noreply.github.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Maxim Lysak <101627549+maxmnemonic@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Fernando Santos <121275806+FernandoSSI@users.noreply.github.com>
benichou pushed a commit to benichou/docling that referenced this pull request Jun 20, 2025
…t#630)
Signed-off-by: Benichou <fbenichou@deloitte.ca>