feat(docx): add checkbox parsing support#3349

Open
ceberam wants to merge 4 commits into main from feat/docx-checkbox

@ceberam
Member

@ceberam ceberam commented Apr 22, 2026

Description

This PR adds support for parsing checkboxes in docx documents. Text with checkboxes is now properly identified and labeled as CHECKBOX_SELECTED or CHECKBOX_UNSELECTED in the resulting DoclingDocument.

Changes

  • Checkbox detection and parsing: Added methods to detect checkboxes in DOCX files using Word 2010+ XML elements (w14:checkbox)
  • Automatic symbol removal: Checkbox symbols (☐, ☑, ☒, etc.) are automatically removed from text content
  • Proper labeling: Text items with checkboxes are labeled with DocItemLabel.CHECKBOX_SELECTED or DocItemLabel.CHECKBOX_UNSELECTED
  • Code quality improvements:
    • Fixed missing return statement in _get_paragraph_elements() (also fixes textbox test)
    • Refactored duplicate code in text element handling (~25 lines removed)
    • Added Google-style docstrings to all new methods
  • Comprehensive tests: Added tests for checkbox detection in both paragraphs and table cells
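
As a rough illustration of the detection, symbol removal, and labeling described above, the following sketch uses only the Python standard library rather than the backend's actual methods. The function name parse_checkbox_paragraph, the sample XML, and the plain-string labels (standing in for DocItemLabel.CHECKBOX_SELECTED and DocItemLabel.CHECKBOX_UNSELECTED) are assumptions made for this example and are not taken from the PR's code.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespaces: "w" is the main schema, "w14" the Word 2010 extensions.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"
NS = {"w": W, "w14": W14}

# Symbols to strip from the text (an illustrative subset).
CHECKBOX_SYMBOLS = "☐☑☒"


def parse_checkbox_paragraph(paragraph):
    """Return (label, text) for a w:p element.

    label is None when the paragraph holds no w14:checkbox element.
    Hypothetical helper for illustration only.
    """
    text = "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))
    checkbox = paragraph.find(".//w14:checkbox", NS)
    if checkbox is None:
        return None, text.strip()

    # The checked state lives in <w14:checked w14:val="..."/>.
    checked = checkbox.find("w14:checked", NS)
    value = checked.get(f"{{{W14}}}val", "0") if checked is not None else "0"
    label = "CHECKBOX_SELECTED" if value in ("1", "true") else "CHECKBOX_UNSELECTED"

    # Drop the checkbox glyphs so that only the option text remains.
    cleaned = text.translate({ord(c): None for c in CHECKBOX_SYMBOLS}).strip()
    return label, cleaned


# Hand-written sample resembling a document body (assumed, not from the test file).
SAMPLE = f"""
<w:body xmlns:w="{W}" xmlns:w14="{W14}">
  <w:p>
    <w:sdt>
      <w:sdtPr><w14:checkbox><w14:checked w14:val="1"/></w14:checkbox></w:sdtPr>
      <w:sdtContent><w:r><w:t>☒</w:t></w:r></w:sdtContent>
    </w:sdt>
    <w:r><w:t> Option A</w:t></w:r>
  </w:p>
  <w:p>
    <w:sdt>
      <w:sdtPr><w14:checkbox><w14:checked w14:val="0"/></w14:checkbox></w:sdtPr>
      <w:sdtContent><w:r><w:t>☐</w:t></w:r></w:sdtContent>
    </w:sdt>
    <w:r><w:t> Option B</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Plain text</w:t></w:r></w:p>
</w:body>
""".strip()

root = ET.fromstring(SAMPLE)
results = [parse_checkbox_paragraph(p) for p in root.findall("w:p", NS)]
for label, text in results:
    print(label, repr(text))
```

The sketch reads the state from the w14:checked element instead of inferring it from the glyph, because the glyph used for each state is configurable in Word.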

Testing

  • Test file: tests/data/docx/docx_checkboxes.docx (12 checkboxes: 8 selected, 4 unselected)
  • Added test_checkbox_detection_and_parsing(): Verifies checkbox detection and proper labeling
  • Added test_checkbox_labels_in_tables(): Verifies checkboxes in table cells
  • All tests use the documents fixture for efficiency
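
The kind of count check implied by the test file description (8 selected, 4 unselected) could look roughly like the sketch below. The Item dataclass and count_checkbox_labels are hypothetical stand-ins for parsed text items and a test helper, and the string labels mirror the label names mentioned above; none of this is the PR's actual test code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a parsed text item with a label."""

    label: str
    text: str


def count_checkbox_labels(items):
    """Return (selected, unselected) counts over items exposing a .label attribute."""
    counts = Counter(item.label for item in items)
    return counts["CHECKBOX_SELECTED"], counts["CHECKBOX_UNSELECTED"]


# Synthetic items shaped like the described test document: 8 selected, 4 unselected,
# plus one non-checkbox item that must not be counted.
items = [Item("CHECKBOX_SELECTED", f"selected {i}") for i in range(8)]
items += [Item("CHECKBOX_UNSELECTED", f"unselected {i}") for i in range(4)]
items.append(Item("TEXT", "plain paragraph"))

selected, unselected = count_checkbox_labels(items)
assert (selected, unselected) == (8, 4)
```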

Resolves #858

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

ceberam added 4 commits April 22, 2026 13:02
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…ox methods

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
@ceberam ceberam added the enhancement (New feature or request) and docx (issue related to docx backend) labels Apr 22, 2026
@ceberam ceberam changed the title from "feat(docx): add checkbox parsing support to MsWordDocumentBackend" to "feat(docx): add checkbox parsing support" Apr 22, 2026
@mergify
Contributor

mergify Bot commented Apr 22, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@github-actions
Contributor

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

@dosubot

dosubot Bot commented Apr 22, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
View Suggested Changes
@@ -110,6 +110,7 @@
         - **Inline equations**: Multiple equations within text-containing paragraphs are preserved as distinct formula items
     - Previously, multiple sibling equations in a single paragraph were concatenated into a single LaTeX string, but this has been fixed to maintain each equation as a separate document item
     - **Inline Equations in List Items**: Inline formulas appearing in list items (both bulleted and numbered) are correctly processed and preserved in markdown exports. When a list item contains inline equations, they are exported with LaTeX `$` delimiters alongside the surrounding text
+      - **Checkboxes**: Checkboxes in DOCX files using Word 2010+ XML elements (`w14:checkbox`) are detected and parsed. Text with checkboxes is labeled as `CHECKBOX_SELECTED` or `CHECKBOX_UNSELECTED` in the resulting DoclingDocument. Checkbox symbols (☐, ☑, ☒, etc.) are automatically removed from the text content
 - **Notes**: Header/footer are automatically detected as FURNITURE layer. CLI/Serve API exports only BODY. [Example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/e596ee79-fc7f-43a4-90e2-74891e0cf12f).
 
 ---


@ceberam ceberam self-assigned this Apr 22, 2026
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/msword_backend.py | 94.28% | 2 Missing ⚠️ |


Labels

docx (issue related to docx backend), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

No translation of option buttons when converting docx to MarkDown

1 participant