@christinestraub christinestraub commented Nov 15, 2023

Closes #285.

Summary

  • support extracting elements of type Picture and Figure
  • add an ElementType class that holds the element type constants, and replace the hard-coded element type strings with those constants
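The constants-class idea from the summary can be sketched roughly as below. This is only an illustration, not the code merged in this PR: the attribute names, the `IMAGE_TYPES` set, the stand-in `Element` dataclass, and the `image_elements` helper are all assumptions made for the example. Only the type names Picture and Figure come from the description above.

```python
from dataclasses import dataclass


class ElementType:
    """Hypothetical constants class; not the library's actual definition."""

    PICTURE = "Picture"
    FIGURE = "Figure"
    TEXT = "Text"  # assumed extra type, included only for contrast


# Assumed grouping: element types that should be treated as images.
IMAGE_TYPES = frozenset({ElementType.PICTURE, ElementType.FIGURE})


@dataclass
class Element:
    """Stand-in for a detected layout element (not the library's class)."""

    type: str


def image_elements(elements):
    """Return the elements whose type is one of the image types, in order."""
    return [el for el in elements if el.type in IMAGE_TYPES]
```

Comparing against named constants instead of repeating string literals means a misspelled type name fails loudly as an `AttributeError` rather than silently matching nothing.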

Testing

PDF: algebra-graph-level1-1.pdf

from unstructured_inference.inference.layout import DocumentLayout

doc = DocumentLayout.from_file(
    filename="algebra-graph-level1-1.pdf",
    extract_images_in_pdf=True,
)

@benjats07 benjats07 left a comment

LGTM
In the future we may explore the possibility of renaming Picture and Figure to Image to simplify this.

@cragwolfe cragwolfe merged commit f35b830 into main Nov 16, 2023
@cragwolfe cragwolfe deleted the feat/285-improve-image-extraction branch November 16, 2023 22:16
Development

Successfully merging this pull request may close these issues.

Enhancement: improve image extraction by supporting all types of image elements detected by detection models
