Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: include text from shapes in docx #2510

Merged
merged 8 commits into from
Feb 14, 2024

Conversation

ds-filipknefel
Copy link
Contributor

Reported bug: Text from docx shapes is not included in the partition output.
Fix: Extend docx partition to search for text tags nested inside structures responsible for creating the shape.

@ds-filipknefel
Copy link
Contributor Author

@scanny

I've settled on this xpath for now: w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r

It'll find first occurence of w:drawing/wp:inline for each w:r directly under current paragraph and return w:r from inside it.

  • Each single image should be contained within w:r directly under paragraph so I think w:r/ a reasonable limitation
  • In case both choice and fallback use w:drawing (I think it's unlikely but maybe not impossible) we need to select the first one descendant::wp:inline[ancestor::w:drawing][1]
  • We take wp:inline to only include inline shapes
  • At the end I take w:r at the end because we want to use .text method of C_R the same way we do for other regions.

I've added two example documents and one test for each.

Also this was reported as a bug but I wonder if this should be considered a fix or a feature, what's your take? I'll update changelog accordingly.

Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few remarks :)

unstructured/partition/docx.py Outdated Show resolved Hide resolved
test_unstructured/partition/docx/test_docx.py Outdated Show resolved Hide resolved
test_unstructured/partition/docx/test_docx.py Outdated Show resolved Hide resolved
@scanny
Copy link
Collaborator

scanny commented Feb 6, 2024

Hi @ds-filipknefel that XPath expression looks good to me. I think the trailing //w:r is fine because the scope is so narrow by then that the search will be quick.

Btw, to understand the possible parent-child relationships in the XML you can consult the XML Schema for Open XML here: https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd. That said I feel pretty confident this XPath will do just what we want.

I'd call this a feature, albeit a small one. This is extending the range of what partition_docx() will detect. There are quite a few other corner-caes items it still won't detect and I don't believe we have either promised that it does or committed to make it do so (given the large engineering investment required). So I'd say we celebrate each new extension of its detection capability as a new goodness :)

Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think this is good to go. One remark on test that can now be simplified.

If we're ready, you can change it from Draft to "ready" status and let me know and I'll approve it :)

test_unstructured/partition/docx/test_docx.py Outdated Show resolved Hide resolved
@ds-filipknefel ds-filipknefel changed the title fix: include text from shapes in docx feat: include text from shapes in docx Feb 8, 2024
@ds-filipknefel ds-filipknefel marked this pull request as ready for review February 8, 2024 09:38
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Amazingly compact :)

One small nit but approving in advance.

CHANGELOG.md Show resolved Hide resolved
@ds-filipknefel ds-filipknefel added this pull request to the merge queue Feb 14, 2024
Merged via the queue into main with commit f048695 Feb 14, 2024
43 checks passed
@ds-filipknefel ds-filipknefel deleted the fix/docx-include-text-from-shapes branch February 14, 2024 18:14
@ds-filipknefel ds-filipknefel linked an issue Feb 19, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat/.docx include inline shape text in partition
2 participants