enhancement: bring back embedded images in PDF's #194

@christinestraub

While the fix in the PR that addressed #176 was a big step forward in merging OCR bboxes+text with layout-model bboxes+text, we are still dropping the "Image" element category, which was lost in https://github.com/Unstructured-IO/unstructured/pull/1142/files / de19ace. The comment there reads:

        # Skip extracted images for this purpose, we don't have the text from them and they
        # don't provide good text bounding boxes.

Users may want the Image bbox info anyway, even if there is no text. One thing we are doing correctly now, and that shouldn't change, is the handling of full-page embedded-image PDFs (e.g. the IRS PDF in the ingest tests).

Extracted bboxes:
[image: main PMC6312790_1_extracted]

Final bboxes:
[image: main PMC6312790_1_final]
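
To make the request concrete, here is a rough sketch of the selection rule I have in mind. It is not the current `unstructured` code: `Element`, `keep_extracted_images`, and the `full_page_ratio` threshold are all hypothetical names invented for illustration, and it assumes the current full-page behavior amounts to not emitting a separate Image element when an embedded image covers (nearly) the whole page.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x1, y1, x2, y2) in page coordinates
BBox = Tuple[float, float, float, float]


@dataclass
class Element:
    """Hypothetical stand-in for an extracted layout element."""
    category: str
    bbox: BBox
    text: str = ""


def area(bbox: BBox) -> float:
    """Area of a bbox; degenerate boxes count as zero."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def keep_extracted_images(
    extracted: List[Element],
    page_size: Tuple[float, float],
    full_page_ratio: float = 0.9,  # assumed threshold, not a project setting
) -> List[Element]:
    """Return the Image elements worth keeping, even when they carry no text.

    Images covering at least `full_page_ratio` of the page are skipped,
    so full-page embedded-image PDFs keep their current handling.
    """
    page_area = page_size[0] * page_size[1]
    kept = []
    for el in extracted:
        if el.category != "Image":
            continue  # only embedded images are considered here
        if page_area > 0 and area(el.bbox) / page_area >= full_page_ratio:
            continue  # full-page image: leave existing behavior untouched
        kept.append(el)
    return kept


page = (612.0, 792.0)
elements = [
    Element("Image", (100.0, 100.0, 300.0, 250.0)),  # inline figure, no text
    Element("Image", (0.0, 0.0, 612.0, 792.0)),      # full-page scan
    Element("Title", (50.0, 50.0, 500.0, 80.0), "Intro"),
]
kept = keep_extracted_images(elements, page)
print([el.bbox for el in kept])
```

The kept Image elements would then be added alongside the merged OCR and layout-model elements rather than replacing any of them; how that merge should actually happen is left open here.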

Labels

enhancement (New feature or request)
