Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…ox methods Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
Waiting for:
This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
✅ DCO Check Passed
Thanks @ceberam, all your commits are properly signed off. 🎉
Related Documentation
1 document(s) may need updating based on files changed in this PR:
Docling
What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -110,6 +110,7 @@
- **Inline equations**: Multiple equations within text-containing paragraphs are preserved as distinct formula items
- Previously, multiple sibling equations in a single paragraph were concatenated into a single LaTeX string, but this has been fixed to maintain each equation as a separate document item
- **Inline Equations in List Items**: Inline formulas appearing in list items (both bulleted and numbered) are correctly processed and preserved in markdown exports. When a list item contains inline equations, they are exported with LaTeX `$` delimiters alongside the surrounding text
+ - **Checkboxes**: Checkboxes in DOCX files using Word 2010+ XML elements (`w14:checkbox`) are detected and parsed. Text with checkboxes is labeled as `CHECKBOX_SELECTED` or `CHECKBOX_UNSELECTED` in the resulting DoclingDocument. Checkbox symbols (☐, ☑, ☒, etc.) are automatically removed from the text content
- **Notes**: Header/footer are automatically detected as FURNITURE layer. CLI/Serve API exports only BODY. [Example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/e596ee79-fc7f-43a4-90e2-74891e0cf12f).
Codecov Report
❌ Patch coverage is
Description
This PR adds support for parsing checkboxes in `docx` documents. Text with checkboxes is now properly identified and labeled as `CHECKBOX_SELECTED` or `CHECKBOX_UNSELECTED` in the resulting DoclingDocument.

Changes
- … `w14:checkbox`)
- … `DocItemLabel.CHECKBOX_SELECTED` or `DocItemLabel.CHECKBOX_UNSELECTED`
- … `_get_paragraph_elements()` (also fixes textbox test)

Testing
- … `tests/data/docx/docx_checkboxes.docx` (12 checkboxes: 8 selected, 4 unselected)
- `test_checkbox_detection_and_parsing()`: Verifies checkbox detection and proper labeling
- `test_checkbox_labels_in_tables()`: Verifies checkboxes in table cells
- … `documents` fixture for efficiency

Resolves #858
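For reviewers unfamiliar with the Word 2010+ checkbox markup, the following is a minimal, standalone sketch of how a `w14:checkbox` content control can be recognized in WordprocessingML and mapped to a selected or unselected label. It is not the code in this PR: it uses only the standard library on a hand-written XML fragment, the helper names (`classify_paragraph`, `classify_paragraphs`) are hypothetical, and the labels are plain strings standing in for the `DocItemLabel` members named above.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and the Word 2010 extensions.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"
NS = {"w": W, "w14": W14}

# Glyphs Word typically renders for checkbox states (assumed set for this sketch).
CHECKBOX_SYMBOLS = "☐☑☒"

# Hand-written fragment for illustration, not taken from the PR's test data.
SAMPLE = f"""
<w:body xmlns:w="{W}" xmlns:w14="{W14}">
  <w:p>
    <w:sdt>
      <w:sdtPr><w14:checkbox><w14:checked w14:val="1"/></w14:checkbox></w:sdtPr>
      <w:sdtContent><w:r><w:t>☒</w:t></w:r></w:sdtContent>
    </w:sdt>
    <w:r><w:t> Accept terms</w:t></w:r>
  </w:p>
  <w:p>
    <w:sdt>
      <w:sdtPr><w14:checkbox><w14:checked w14:val="0"/></w14:checkbox></w:sdtPr>
      <w:sdtContent><w:r><w:t>☐</w:t></w:r></w:sdtContent>
    </w:sdt>
    <w:r><w:t> Subscribe</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Plain paragraph</w:t></w:r></w:p>
</w:body>
"""


def classify_paragraph(p):
    """Return (label, text) for one w:p element (hypothetical helper)."""
    text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    checkbox = p.find(".//w14:checkbox", NS)
    if checkbox is None:
        return "TEXT", text.strip()
    # Drop the rendered checkbox glyph so only the caption text remains.
    text = "".join(ch for ch in text if ch not in CHECKBOX_SYMBOLS).strip()
    checked = checkbox.find("w14:checked", NS)
    val = checked.get(f"{{{W14}}}val") if checked is not None else None
    selected = val in ("1", "true")
    return ("CHECKBOX_SELECTED" if selected else "CHECKBOX_UNSELECTED"), text


def classify_paragraphs(xml_text):
    """Classify every direct w:p child of the root element."""
    root = ET.fromstring(xml_text)
    return [classify_paragraph(p) for p in root.iterfind("w:p", NS)]


if __name__ == "__main__":
    for label, text in classify_paragraphs(SAMPLE):
        print(label, "|", text)
```

The sketch treats a checkbox without a `w14:checked` child as unselected, which is an assumption made here for simplicity rather than behavior confirmed by this PR.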
Checklist: