@qued (Contributor) commented Mar 22, 2023

Updated the OCR logic to be aware of image elements.

LayoutParser only deals with text objects, so this PR removes LayoutParser from the internals and replaces the components. (LayoutParser remains a dependency because of detectron2.)

Testing:

Run:

from unstructured_inference.inference.layout import DocumentLayout

doc = DocumentLayout.from_file('sample-docs/loremipsum-flat.pdf')

doc.pages[0].elements should contain elements with the text of the document.
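
The check above can be made mechanical. The following is a minimal sketch, not part of this PR: `elements_with_text` and `FakeElement` are hypothetical names, and the sketch assumes each element exposes a `text` attribute, which has not been verified against the library's element classes. A stand-in class is used so the snippet runs without the library or the sample PDF.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FakeElement:
    """Stand-in for a layout element; only the assumed `text` attribute is modeled."""
    text: Optional[str] = None


def elements_with_text(elements: Iterable[object]) -> List[object]:
    """Return the elements whose `text` attribute is a non-blank string."""
    found = []
    for element in elements:
        text = getattr(element, "text", None)
        if isinstance(text, str) and text.strip():
            found.append(element)
    return found


# Stand-in data: one text element, one image-like element with no text, one blank.
sample = [FakeElement("Lorem ipsum dolor sit amet"), FakeElement(None), FakeElement("   ")]
with_text = elements_with_text(sample)
print(f"{len(with_text)} of {len(sample)} elements carry text")
```

With the library installed, the same helper could presumably be applied as `elements_with_text(doc.pages[0].elements)`, provided the attribute name assumption holds.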

@mallorih left a comment

LGTM

@benjats07 (Contributor) left a comment

LGTM

@ajjimeno (Contributor) commented

LGTM!

In the function at line 158 of the file linked below, I think that if no model name is passed, get_model() will raise an exception.

https://github.com/Unstructured-IO/unstructured-inference/blob/f23a13af897d8e7348b7948bd4b237f5bd0722e6/unstructured_inference/inference/layout.py
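
To make the concern concrete, here is a minimal sketch built entirely from hypothetical stand-ins: `MODELS`, `UnknownModelError`, `load_unguarded`, `load_guarded`, and this version of `get_model` are illustrative and are not the repository's code. It assumes the lookup rejects a missing name; whether the real get_model() behaves this way is exactly the open question.

```python
from typing import Callable, Dict, Optional


class UnknownModelError(Exception):
    """Hypothetical error raised when a model name is not registered."""


# Hypothetical registry mapping model names to loader callables.
MODELS: Dict[str, Callable[[], str]] = {"default": lambda: "default-model"}


def get_model(name: Optional[str]) -> str:
    """Hypothetical lookup that rejects missing or unknown names."""
    if name is None or name not in MODELS:
        raise UnknownModelError(f"Unknown model: {name!r}")
    return MODELS[name]()


def load_unguarded(model_name: Optional[str] = None) -> str:
    # Passes None straight through, so the lookup raises when no name is given.
    return get_model(model_name)


def load_guarded(model_name: Optional[str] = None) -> str:
    # Falls back to an assumed default name before the lookup.
    return get_model(model_name or "default")
```

If the real lookup behaves like the stand-in, a guard of the second kind (or a default argument on the caller) would avoid the exception when no model name is supplied.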

@qued merged commit 4a52922 into main Mar 25, 2023
@qued deleted the feat/better-image-support branch March 25, 2023 23:06
5 participants