Feat/save embedded images in pdf #208

christinestraub · 2023-09-11T23:10:02Z

Addresses unstructured issue #1332. This PR will work together with unstructured PR #1371.
This PR also addresses "true" embedded images issue #215.

Summary

Add functionality to extract and save images from the page
- add the extract_images method to the PageLayout class
- pass parameters related to extracting images from the page
- add Python script to evaluate image extraction with various PDF processing libraries
Add functionality to get only "true" embedded images when extracting elements from PDF pages
- add functionality to extract image objects (LTImage) from a PDF layout element parsed by pdfminer.high_level.extract_pages
- update logic to determine ImageTextRegion in load_pdf()
Update the layout visualization script to be able to show only image elements if needed
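For reviewers unfamiliar with the nested layout that pdfminer returns, the "extract image objects from a layout element" step can be pictured as a recursive walk. The sketch below is not the code in this PR: `iter_objects_of_type` is a hypothetical helper, and it is written against plain Python iterables so it runs without pdfminer installed.

```python
from typing import Any, Iterator, Type


def iter_objects_of_type(element: Any, target_type: Type) -> Iterator[Any]:
    """Yield every object of `target_type` nested anywhere inside `element`.

    Hypothetical helper for illustration only; it is not part of
    unstructured-inference or pdfminer.
    """
    if isinstance(element, target_type):
        yield element
        return
    # Strings are iterable but never contain layout objects; skip them
    # to avoid recursing into single characters forever.
    if isinstance(element, (str, bytes)):
        return
    # Container elements are iterable; leaf elements are not.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from iter_objects_of_type(child, target_type)
```

With pdfminer.six installed, passing each page yielded by `pdfminer.high_level.extract_pages` together with `pdfminer.layout.LTImage` as `target_type` should yield the image objects nested inside figures on that page.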

The following documents can be used for testing and evaluation.

Testing

from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory_path>,
)

Evaluation

// Extracting Images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

// Layout Visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_only

NOTE: To reproduce the original results for comparison, you need to replace the lines with the following code snippet

_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)

qued

Overall looks good, just a pair of small change requests. Meanwhile I haven't run the test code yet, so that's my next step.

unstructured_inference/inference/layout.py

qued

Several things on further inspection / after running test code:

If I understand the motivation behind this issue correctly, I think we want the embedded image, meaning the image that is actually stored in the PDF. This appears to identify the image location, then use the bounding box information to crop that portion of the rendered PDF and save it. This results in something potentially different in format and resolution than what the PDF stores internally. (@cragwolfe correct me if I'm off base here)
Running the test code results in a few things that seem off to me. It produces outputs figure-1-x.jpg where x ranges from 1 to 14, skipping 5. I get the logged error message showing that the exception handler triggers (presumably on image 5). Is it expected that we get an error on this sample doc?
Additionally output images 3 and 4 appear to be duplicates, 6 and 7 appear to be duplicates, 8 and 9 appear to be duplicates, 10 and 11 appear to be duplicates, and 13 and 14 appear to be duplicates.
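One way to confirm the duplicate pairs listed above without opening each file is to group the saved outputs by a hash of their bytes. This is only a sketch: `find_duplicate_files` is a hypothetical helper, and it only catches byte-identical files, so two crops from slightly different bounding boxes would not be reported.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicate_files(directory: str, pattern: str = "*.jpg") -> List[List[str]]:
    """Group files in `directory` whose contents are byte-for-byte identical.

    Hypothetical helper for illustration; returns one list of file names
    per group of duplicates, in sorted file-name order.
    """
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]
```

Pointing it at the image output directory used in the test code should list any groups of identical figure-1-x.jpg files.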

christinestraub · 2023-09-13T17:41:11Z

Several things on further inspection / after running test code:

If I understand the motivation behind this issue correctly, I think we want the embedded image, meaning the image that is actually stored in the PDF. This appears to identify the image location, then use the bounding box information to crop that portion of the rendered PDF and save it. This results in something potentially different in format and resolution than what the PDF stores internally. (@cragwolfe correct me if I am wrong please)

Running the test code results in a few things that seem off to me. It produces outputs figure-1-x.jpg where x ranges from 1 to 14, skipping 5. I get a logged error message showing that the exception handler triggers (presumably on image 5). Is it expected that we get an error on this sample doc?

Additionally output images 3 and 4 appear to be duplicates, 6 and 7 appear to be duplicates, 8 and 9 appear to be duplicates, 10 and 11 appear to be duplicates, and 13 and 14 appear to be duplicates.

The resolution (width, height) of the rendered PDF page image depends on the dpi (https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L612-L624), and the bounding boxes of the image elements are calculated using the same dpi (https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L597-L603), so I don't think the format and resolution would differ from what the PDF stores internally.
Yes, it is expected that we get an error on this sample doc. The width of the cropped image is 0 for image 5 (figure-1-5.jpg), so saving it fails with ValueError: cannot write empty image as JPEG.
This is an issue in the pdfminer library (pdfminer.high_level.extract_pages). @crag and I talked about that issue in this Slack channel.
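To make the dpi argument and the zero-width failure concrete, here is a minimal sketch of the coordinate conversion. It is not the code at the links above: `pdf_bbox_to_pixels` and `has_area` are hypothetical names, and the sketch assumes bounding boxes arrive in PDF points (72 per inch) with a bottom-left origin, as pdfminer reports them, while the rendered page image uses a top-left origin.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space units per inch

Box = Tuple[int, int, int, int]


def pdf_bbox_to_pixels(
    bbox: Tuple[float, float, float, float],
    page_height: float,
    dpi: int,
) -> Box:
    """Convert a bottom-left-origin bbox in points to a top-left-origin pixel box.

    Hypothetical helper; returns (left, top, right, bottom) for a page
    rendered at `dpi`.
    """
    x0, y0, x1, y1 = bbox
    scale = dpi / POINTS_PER_INCH
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the vertical axis: the top edge of the box is its larger y value.
    top = round((page_height - y1) * scale)
    bottom = round((page_height - y0) * scale)
    return left, top, right, bottom


def has_area(box: Box) -> bool:
    """Return True only if the pixel box has positive width and height."""
    left, top, right, bottom = box
    return right > left and bottom > top
```

Checking `has_area` before cropping and saving would let the caller skip a degenerate box such as the zero-width one reported for image 5, instead of relying on the exception handler.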

Talked over the use case with Crag, former implementation is probably okay with some improvements.

christinestraub · 2023-09-19T17:49:03Z

@cragwolfe @qued Reverted back to the former implementation (extracting embedded images by cropping each page image based on the image elements bounding boxes) with some improvements.

christinestraub · 2023-09-20T19:54:16Z

@cragwolfe @qued As of now, some extracted image elements are missing in the final result and are not being saved. This issue will be addressed by issue #219.

Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).

### Summary

- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`

### Testing

from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)

christinestraub added 9 commits September 11, 2023 02:31

feat: add functionality to extract and save images from the page

5c6853d

feat: pass parameters related to extracting images from the page

eeff6c6

feat: update error handling

3e6d3a0

chore: update logging text

33912c3

Merge branch 'main' into feat/save-embedded-images-in-pdf

fea710c

refactor: PageLayout.extract_images

42db5be

test: add test case for PageLayout.extract_images

d06e955

chore: update changelog & version

fd978b5

feat: update visualization script to show only image elements

889a6ad

christinestraub marked this pull request as ready for review September 12, 2023 19:37

christinestraub added 2 commits September 12, 2023 12:40

Merge branch 'main' into feat/save-embedded-images-in-pdf

7564e63

# Conflicts: # CHANGELOG.md

chore: update version

e40d802

christinestraub requested review from qued and cragwolfe September 12, 2023 19:41

christinestraub mentioned this pull request Sep 12, 2023

Feat/1332 save embedded images in pdf Unstructured-IO/unstructured#1371

Merged

qued reviewed Sep 12, 2023

unstructured_inference/inference/layout.py (outdated, resolved)

unstructured_inference/inference/layout.py (outdated, resolved)

qued previously requested changes Sep 13, 2023

christinestraub added 2 commits September 13, 2023 10:50

feat: specify general exception

4de48ad

refactor: move analysis, supplement_with_ocr_elements, `extract_images_in_pdf`, and `image_output_dir_path` to the signature

201fedb

christinestraub requested a review from qued September 13, 2023 18:20

christinestraub added 9 commits September 13, 2023 11:22

Merge branch 'main' into feat/save-embedded-images-in-pdf

61ec2c3

# Conflicts: # CHANGELOG.md

chore: update changelog & version

b1c9a98

Merge branch 'main' into feat/save-embedded-images-in-pdf

ac11384

# Conflicts: # CHANGELOG.md

feat: add functionality to extract image objects from a PDF layout object parsed by `pdfminer`

a0913f2

refactor: renaming...

59c4c5d

feat: get only writable images as ImageTextRegion

9f2e327

feat: add image_raw_data property to ImageTextRegion and `LayoutElement` & always keep embedded images unless they are full page images

887adf8

refactor: extracted_is_image

1ffadca

feat: update merge_inferred_layout_with_extracted_layout to always keep embedded images

e6f3985

christinestraub added 9 commits September 18, 2023 10:05

feat: invert image colors

7b829d2

feat: add python script to evaluate image extraction with various pdf processing libraries

8a7e818

test: fix lint errors

26d355e

chore: update changelog & version

197ae7e

test: update test case for PageLayout.extract_images

759765e

test: fix lint error

f5c75af

feat: invert image colors using opencv because PIL only works on `L` and `RGB` mode images

acb9ebb

test: fix lint error

de89f74

feat: revert back to extracting embedded images by cropping each page image based on the image elements bounding boxes

c35a883

christinestraub added 8 commits September 19, 2023 13:44

feat: revert back to the original `merge_inferred_layout_with_extracted_layout` logic

6b7a21a

chore: update changelog

576b976

Merge branch 'main' into feat/save-embedded-images-in-pdf

4c7bb37

# Conflicts: # CHANGELOG.md # test_unstructured_inference/inference/test_layout.py # unstructured_inference/__version__.py

refactor: test_extract_images

8954cd7

refactor: revert type checking for PageLayout.layout

12c1b9c

feat: update embedded-image-extraction script to add pdf filename to the output directory path

9fb3f02

chore: update README file for examples/layout_analysis

b534799

chore: add README file for examples/image-extraction

ec47989

christinestraub added 2 commits September 20, 2023 13:13

Merge branch 'main' into feat/save-embedded-images-in-pdf

542898e

chore: update changelog & version

8245e1b

cragwolfe approved these changes Sep 20, 2023

cragwolfe merged commit b9f032c into main Sep 21, 2023
5 of 8 checks passed

cragwolfe deleted the feat/save-embedded-images-in-pdf branch September 21, 2023 05:09

christinestraub mentioned this pull request Sep 21, 2023

enhancement: Get only "true" embedded images when extracting elements from PDF pages #215

Closed

cragwolfe mentioned this pull request Aug 26, 2023

docs: how to view bounding box and ordering output Unstructured-IO/unstructured#1208

Closed
