@qued commented Mar 9, 2023

Stopgap fix for a bug that causes the parsing procedure to ignore PDF elements that are not contained within the bounds of an inferred or specified layout element.
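
To illustrate the failure mode, here is a minimal sketch of a containment filter. It is not the actual unstructured_inference code: `Box` and `split_by_containment` are hypothetical names, and the coordinates are made up. It only shows how text whose bounding box falls outside every layout box ends up dropped, and why a page with no layout elements yields no text at all.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned rectangle: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, other: "Box") -> bool:
        # True only if `other` lies entirely inside this box.
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)


def split_by_containment(text_boxes, layout_boxes):
    """Partition text boxes into those inside some layout box and the rest."""
    kept, orphaned = [], []
    for tb in text_boxes:
        if any(lb.contains(tb) for lb in layout_boxes):
            kept.append(tb)
        else:
            orphaned.append(tb)
    return kept, orphaned


# One layout region and two text boxes; the second sits below the region.
layout = [Box(0, 0, 100, 50)]
words = [Box(10, 10, 40, 20), Box(10, 60, 40, 70)]
kept, orphaned = split_by_containment(words, layout)
print(len(kept), len(orphaned))
```

A parser that returns only `kept` silently loses everything in `orphaned`. With an empty `layout_boxes` list, every text box is orphaned, which would match the blank output described in the testing steps.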

Testing:

With the following linked file, loremipsum-flat.pdf, try the code:

from unstructured_inference.inference.layout import DocumentLayout

# Parse the PDF, then join the text of every element found on the first page.
doc = DocumentLayout.from_file('loremipsum-flat.pdf')
print(' '.join(el.text for el in doc.pages[0].elements))

Observe that the output is blank on the main branch, but contains text on this branch.

@qued requested review from MthwRobinson and benjats07 March 9, 2023 04:50

@benjats07 left a comment

LGTM

@qued merged commit 4814a72 into main Mar 10, 2023
@qued deleted the fix/ocr-when-no-elements branch March 10, 2023 17:13