enhancement: bring back embedded images in PDFs #194

While the PR that addressed [#176](https://github.com/Unstructured-IO/unstructured-inference/issues/176) was a big step forward in merging OCR bboxes+text with layout-model bboxes+text, we are still dropping the "Image" element category, which was lost in https://github.com/Unstructured-IO/unstructured/pull/1142/files / https://github.com/Unstructured-IO/unstructured-inference/commit/de19ace101632ed5a8b8e0fed62cef0795fd55ac. The code comment there reads:

>             # Skip extracted images for this purpose, we don't have the text from them and they
>             # don't provide good text bounding boxes.

Users may want the Image bbox info even when there is no text. One thing we handle correctly now, and that should not change, is PDFs made up of full-page embedded images (e.g. the [IRS PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/IRS-form-1987.pdf) in the ingest tests).

**Extracted bboxes:**
![main PMC6312790_1_extracted](https://github.com/Unstructured-IO/unstructured-inference/assets/9475974/28f73da2-9a15-444d-9986-105a19f4cc49)

**Final bboxes:**
![main PMC6312790_1_final](https://github.com/Unstructured-IO/unstructured-inference/assets/9475974/b4a8e008-383c-4023-85b4-68861f98d5d1)
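
To make the request concrete, here is a minimal sketch of the behavior being asked for: keep extracted Image regions as elements even though they carry no text, but skip images that cover (nearly) the whole page, since those are already handled correctly. `Region`, `keep_embedded_images`, and the `full_page_threshold` value are hypothetical names and numbers chosen for illustration; they are not the unstructured-inference API, and the real fix may look quite different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for a layout element (not the library's class)."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: Optional[str] = None
    category: Optional[str] = None

    @property
    def area(self) -> float:
        # Clamp to zero so a degenerate box never yields a negative area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def keep_embedded_images(
    extracted: List[Region],
    page_width: float,
    page_height: float,
    full_page_threshold: float = 0.9,  # assumed cutoff, not taken from the library
) -> List[Region]:
    """Return the extracted Image regions that should survive without text."""
    page_area = page_width * page_height
    kept = []
    for region in extracted:
        if region.category != "Image":
            continue
        # Full-page embedded images are left to the existing handling.
        if page_area > 0 and region.area / page_area >= full_page_threshold:
            continue
        kept.append(region)
    return kept


# Example: one figure-sized image, one full-page scan, one text block.
regions = [
    Region(50, 50, 250, 200, category="Image"),
    Region(0, 0, 612, 792, category="Image"),
    Region(50, 220, 560, 260, text="Caption", category="Text"),
]
kept = keep_embedded_images(regions, page_width=612, page_height=792)
```

In this sketch only the figure-sized image is kept; the full-page image is excluded so the current behavior for documents like the IRS PDF stays unchanged. The kept regions could then be appended to the final layout alongside the merged text elements.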


