feat: Enable markdown text formatting for docx #630
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Signed-off-by: SimJeg <sjegou@nvidia.com>
a221428 to
7f9464b
Compare
|
Note: for underline I used the |
|
@maxmnemonic @PeterStaar-IBM do you need any additional info for this PR ? |
|
@SimJeg this is an interesting feature, but we should introduce it with an option for enable/disable, because not all output formats will be compatible with markdown styling. There could also be some consideration on whether to propagate text styling in the Docling document format, but the option will be needed. |
|
Hi @dolfim-ibm, Indeed, a different function should be applied for HTML, for instance. I can add an argument to the convert function (e.g. style=[None, "markdown", "html"]). As there are several ways to do this and I don't know the docling API very well, I'll wait for your confirmation before pushing updates. |
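To make the proposed `style` argument concrete, here is a minimal sketch of such a switch. The helper name `apply_style` and its parameters are assumptions made for illustration only; this is not docling's API, and the thread later moves to setting formatting on the DoclingDocument and leaving the output syntax to the serializers.

```python
from typing import Optional


def apply_style(text: str, bold: bool = False, italic: bool = False,
                style: Optional[str] = None) -> str:
    """Wrap ``text`` in markers for the requested output style.

    Hypothetical helper: ``style`` is None (plain text), "markdown" or "html".
    """
    if style is None or not text:
        # No styling requested (the proposed default), or nothing to wrap.
        return text
    if style == "markdown":
        if bold:
            text = f"**{text}**"
        if italic:
            text = f"*{text}*"
        return text
    if style == "html":
        if bold:
            text = f"<b>{text}</b>"
        if italic:
            text = f"<i>{text}</i>"
        return text
    raise ValueError(f"unknown style: {style!r}")


print(apply_style("hello", bold=True, style="markdown"))
print(apply_style("hello", bold=True, italic=True, style="html"))
```

Keeping `None` as the default means existing callers would see unchanged plain-text output unless they opt in.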
|
@dolfim-ibm any update on it ? |
|
We actually are considering something similar to what you are proposing. Adding the option for the format at convert time (with default None) is good, but we would like to have them in the PipelineOptions for the MS Word backend, since it will be something specific to it. We will soon post more details, but the above is the general idea. |
|
Thanks for the update!
On Fri, Feb 7, 2025, 16:24, Christoph Auer ***@***.***> wrote:
@SimJeg We will implement a design as proposed here: #894
Then, this work will be able to make use of it.
|
|
@vagenas Can you have a look here and see how this intersects with our new concept of INLINE groups? I wonder if we need to extend docling-doc with BOLD, ITALIC, UNDERLINE and STRIPED groups and adapt this PR. FYI: @cau-git @dolfim-ibm |
|
Hi @SimJeg, with docling-project/docling-core#182 we introduced —as beta— a Serialization API operating against the DoclingDocument. This also includes formatting. This test code shows how the various formatting options can be set. 👉 Can you update your PR so that it sets these formatting options when adding the respective items to the DoclingDocument? The actual export to the various output formats should not be part of this PR as it will be taken care of by the new Serialization API — e.g. the Markdown export is already using the new API & automatically exports bold, italics, strikethrough, and hyperlinks. |
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates: Wonderful, this rule succeeded. When test data is updated, we require two reviewers.
|
Signed-off-by: SimJeg <sjegou@nvidia.com>
|
Hi @SimJeg, where do you stand on the discussed updates? To provide some more context on our example snippet:
Hope that explains this a bit better. Looking forward to your updates — would be great to have the DOCX backend updated this week! 🙌 |
Signed-off-by: SimJeg <sjegou@nvidia.com>
|
@vagenas currently looking at it. I started by merging the current main and noticed that the following code
from docling.document_converter import DocumentConverter
source = "/path/to/docling/tests/data/docx/unit_test_formatting.docx"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
now returns |
|
My current proposal is mainly to update the
+ text = self.format_paragraph(paragraph)
where
In docx, a paragraph is a list of different runs, each of which can have a different font (bold, italic and underline). This implies that to use the new |
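The point about runs can be illustrated with a small stand-alone sketch. The `Run` dataclass below is a simplified stand-in for a docx run, not the python-docx class, and `merge_runs` is a hypothetical helper; real runs may also report an inherited (unset) value for a font flag, which this sketch ignores by assuming plain booleans.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class Run:
    """Stand-in for a docx run: a span of text with one font setting."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False


def merge_runs(runs: List[Run]) -> List[Run]:
    """Merge consecutive runs that share the same formatting flags."""
    merged: List[Run] = []
    for (bold, italic, underline), group in groupby(
        runs, key=lambda r: (r.bold, r.italic, r.underline)
    ):
        text = "".join(r.text for r in group)
        if text:  # skip groups that contain only empty runs
            merged.append(Run(text, bold, italic, underline))
    return merged


# A word split across two bold runs collapses into a single bold element.
paragraph = [Run("Hello "), Run("wor", bold=True), Run("ld", bold=True), Run("!")]
print(merge_runs(paragraph))
```

Each merged element would then map to one text item carrying a single formatting setting, which is why a paragraph with mixed fonts needs more than one item.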
E.g. you could do something like this once you have your parent: NodeItem = self.parents.get(self.get_level() - 1)
if len(paragraph_elements) > 1:
    parent = doc.add_group(
        label=GroupLabel.INLINE, parent=parent,
    )
and then pass that |
|
@vagenas what's wrong with the code I shared where I (try to) use inline groups? (we posted almost simultaneously) For stripping, my (deleted) |
Well, you don't want to have a separate inline group for each paragraph element; instead you want a single inline group for the whole paragraph (in case it comprises more than one element), so the snippet I shared should be used right after getting the
The exporter will add a single space between inline elements (not |
Signed-off-by: SimJeg <sjegou@nvidia.com>
I fixed it thanks, the parent was indeed not on the right side of the for loop 😅 |
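The fix discussed here, creating one inline group before the loop rather than one per element, can be sketched with a toy tree. `Node`, `add_paragraph` and `export` are stand-ins written for this illustration and are not docling-core classes or methods; joining non-inline children with a blank line is likewise an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Stand-in for a document tree node (a group or a text item)."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def add_paragraph(parent: Node, elements: List[str]) -> None:
    """Attach paragraph elements to ``parent``, wrapping them in a single
    inline group only when more than one non-empty element remains."""
    elements = [e.strip() for e in elements if e.strip()]
    if len(elements) > 1:
        # Create the group once, before the loop, not once per element.
        group = Node(label="inline")
        parent.children.append(group)
        parent = group
    for element in elements:
        parent.children.append(Node(label="text", text=element))


def export(node: Node) -> str:
    """Join inline children with one space, other children with a blank line."""
    if node.label == "text":
        return node.text
    sep = " " if node.label == "inline" else "\n\n"
    return sep.join(export(child) for child in node.children)


body = Node(label="body")
add_paragraph(body, ["Some ", "**bold**", " text."])
add_paragraph(body, ["A single-element paragraph."])
print(export(body))
```

Because the exporter supplies the space between inline elements, each element is stripped before it is added; otherwise the output would contain doubled spaces.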
Signed-off-by: SimJeg <sjegou@nvidia.com>
|
@vagenas I now handled lists too and updated Your review is welcome |
Signed-off-by: SimJeg <sjegou@nvidia.com>
|
@vagenas I also added 2 lines for a missing feature: handling headers and footers in MS Word documents (see #632). I added the header of the first section and footer of the last section and updated A better implementation would be to handle all sections properly but the following code did not work (I did not look deeply into
for section in self.docx_obj.sections:
    doc = self.walk_linear(section.header._element, self.docx_obj, doc)
    for e in section.iter_inner_content():
        doc = self.walk_linear(e._element, self.docx_obj, doc)  # does not add anything
    doc = self.walk_linear(section.footer._element, self.docx_obj, doc)
|
Signed-off-by: SimJeg <sjegou@nvidia.com>
|
@vagenas any feedback ? could you run the tests ? Would be great to merge today if possible |
…tting Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
vagenas
left a comment
Have left some inline comments for final improvements.
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
rateixei
left a comment
Hey, thanks for this! So in this approach, paragraphs that contain equations wouldn’t get any formatting, is this correct?
For a first iteration, that’s fine. But I think it should be doable to merge the two approaches — the equation extraction does a similar loop through the paragraph elements. The only problem is that, afaik, the docx library doesn’t have an Equation element like HyperLink for example, so I don’t know if iter_inner_content would catch equation fields. So, for a v2, we'd need a more flexible way to iterate through the paragraph elements.
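One possible shape for the "more flexible way to iterate" is to walk the paragraph's XML children directly and dispatch on the tag. The sketch below uses only the standard library on a hand-written paragraph fragment; the fragment, the `classify_children` helper and the kind names are assumptions for illustration, not code from this PR or from the docx library.

```python
import xml.etree.ElementTree as ET

# WordprocessingML and Office Math namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Hand-written paragraph: a bold run, an inline equation and a hyperlink.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}" xmlns:m="{M}">
  <w:r><w:rPr><w:b/></w:rPr><w:t>Energy:</w:t></w:r>
  <m:oMath><m:r><m:t>E=mc^2</m:t></m:r></m:oMath>
  <w:hyperlink><w:r><w:t>source</w:t></w:r></w:hyperlink>
</w:p>
"""


def classify_children(paragraph: ET.Element):
    """Yield (kind, text) for each direct child of a w:p element, in order."""
    kinds = {
        f"{{{W}}}r": "run",
        f"{{{W}}}hyperlink": "hyperlink",
        f"{{{M}}}oMath": "equation",
    }
    for child in paragraph:
        kind = kinds.get(child.tag, "other")
        text = "".join(child.itertext()).strip()
        yield kind, text


paragraph = ET.fromstring(PARAGRAPH_XML)
print(list(classify_children(paragraph)))
```

Working at this level keeps equations in document order alongside runs and hyperlinks, so formatting and equation handling could share one loop.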
|
Thanks for the valuable input @rateixei — formatting in special case of equations to be addressed in follow-up iteration. |
|
Thanks for this nice contribution @SimJeg! 🙌 |
* feat: Enable markdown text formatting for docx
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix imports
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use Formatting
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle hyperlink
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle formatting properly for DocItemLabel.PARAGRAPH
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline group
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle bullet lists
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Strip elements
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run black and mypy
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Handle header and footer
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Use inline_fmt everywhere
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Run precommit
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Address feedback
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* Fix add_list_item
  Signed-off-by: SimJeg <sjegou@nvidia.com>
* fix minor bugs, mark helper methods internal
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: SimJeg <sjegou@nvidia.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Hi,
This PR adds markdown text formatting for docx documents (italic, bold, underline and hyperlinks). I included a new tests/data/docx/unit_test_formatting.docx document to illustrate it. Using the latest docling main, the output of export_to_markdown is:
with this PR it becomes: