Feat/save embedded images in pdf #208

Merged 41 commits on Sep 21, 2023

@christinestraub christinestraub commented Sep 11, 2023

Addresses unstructured issue #1332. This PR will work together with unstructured PR #1371.
This PR also addresses "true" embedded images issue #215.

Summary

  • Add functionality to extract and save images from the page
    • add the extract_images method to the PageLayout class
    • pass parameters related to extracting images from the page
    • add Python script to evaluate image extraction with various PDF processing libraries
  • Add functionality to get only "true" embedded images when extracting elements from PDF pages
    • add functionality to extract image objects (LTImage) from a PDF layout element parsed by pdfminer.high_level.extract_pages
    • update logic to determine ImageTextRegion in load_pdf()
  • Update the layout visualization script so it can show only image elements if needed
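
The LTImage collection step in the summary can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `iter_images` and `images_in_pdf` are hypothetical names, and the sketch assumes that image objects sit inside iterable layout containers (such as LTFigure) returned by pdfminer.high_level.extract_pages.

```python
from typing import Any, Iterator, List, Tuple, Type


def iter_images(element: Any, image_types: Tuple[Type, ...]) -> Iterator[Any]:
    """Yield every nested object that is an instance of one of image_types."""
    if isinstance(element, image_types):
        yield element
        return
    # Strings are iterable but never contain layout children; skip them.
    if isinstance(element, (str, bytes)):
        return
    try:
        children = iter(element)  # containers iterate over their children
    except TypeError:
        return  # leaf element that is not an image
    for child in children:
        yield from iter_images(child, image_types)


def images_in_pdf(path: str) -> List[Any]:
    """Collect LTImage objects from every page (requires pdfminer.six)."""
    # Imported lazily so the helper above stays usable without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return [
        image
        for page in extract_pages(path)
        for image in iter_images(page, (LTImage,))
    ]
```

Calling `images_in_pdf("sample-docs/embedded-images.pdf")` would return the image objects pdfminer reports for that file; whether that matches the set of "true" embedded images depends on how the PDF was produced.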

The following documents can be used for testing and evaluation.

Testing

from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory_path>",  # replace with your output directory
)

Evaluation

# Extracting images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

# Layout visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_only

NOTE: To reproduce the original results for comparison, replace the lines with the following code snippet:

_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)

@christinestraub christinestraub marked this pull request as ready for review September 12, 2023 19:37
@qued qued left a comment

Overall looks good, just a pair of small change requests. I haven't run the test code yet, so that's my next step.

qued previously requested changes Sep 13, 2023
@qued qued left a comment

Several things on further inspection / after running test code:

  1. If I understand the motivation behind this issue correctly, I think we want the embedded image, meaning the image that is actually stored in the PDF. This appears to identify the image location, then use the bounding box information to crop that portion of the rendered PDF and save it. This results in something potentially different in format and resolution than what the PDF stores internally. (@cragwolfe correct me if I'm off base here)
  2. Running the test code results in a few things that seem off to me. It produces outputs figure-1-x.jpg where x ranges from 1 to 14, skipping 5. I get the logged error message showing that the exception handler triggers (presumably on image 5). Is it expected that we get an error on this sample doc?
  3. Additionally output images 3 and 4 appear to be duplicates, 6 and 7 appear to be duplicates, 8 and 9 appear to be duplicates, 10 and 11 appear to be duplicates, and 13 and 14 appear to be duplicates.
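
One way to double-check the duplicate observation in point 3 is to hash the saved files. The following is a minimal sketch under assumptions: `find_duplicate_files` is a hypothetical helper, the `*.jpg` pattern is taken from the figure-1-x.jpg names above, and only byte-identical files are reported, so crops that merely look alike would be missed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicate_files(directory: str, pattern: str = "*.jpg") -> List[List[str]]:
    """Group file names in `directory` whose contents are byte-identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only the digests shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]
```

Pointing it at the image output directory after running the test code would list each group of identical outputs.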

christinestraub commented Sep 13, 2023

  1. The resolution (width, height) of the rendered PDF page image depends on the dpi (https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L612-L624), and the bounding boxes of the image elements are calculated using the same dpi (https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layout.py#L597-L603), so I don't think the result would differ in format or resolution from what the PDF stores internally.
  2. Yes, it is expected that we get an error on this sample doc. The width of the cropped image is 0 for image 5 (figure-1-5.jpg), so saving it raises an error: ValueError: cannot write empty image as JPEG.
  3. This is an issue with the pdfminer library (pdfminer.high_level.extract_pages). @crag and I talked about it in this Slack channel.
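
The dpi scaling in point 1 and the empty-crop error in point 2 can be illustrated with a small sketch. It is not the code in layout.py: `pdf_bbox_to_pixels` and `crop_box_or_none` are hypothetical names, and the sketch assumes bounding boxes arrive in PDF points (72 per inch) with the origin at the bottom-left of the page, while the rendered page image has its origin at the top-left.

```python
from typing import Optional, Tuple

PdfBox = Tuple[float, float, float, float]   # (x0, y0, x1, y1) in PDF points
PixelBox = Tuple[int, int, int, int]         # (left, top, right, bottom) in pixels


def pdf_bbox_to_pixels(bbox: PdfBox, page_height: float, dpi: float) -> PixelBox:
    """Scale a PDF-point box to pixel coordinates of a page rendered at `dpi`."""
    scale = dpi / 72.0  # 72 PDF points per inch
    x0, y0, x1, y1 = bbox
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the vertical axis: PDF y grows upward, image y grows downward.
    top = round((page_height - y1) * scale)
    bottom = round((page_height - y0) * scale)
    return left, top, right, bottom


def crop_box_or_none(bbox: PdfBox, page_height: float, dpi: float) -> Optional[PixelBox]:
    """Return the pixel box, or None if the crop would have zero width or height."""
    left, top, right, bottom = pdf_bbox_to_pixels(bbox, page_height, dpi)
    if right - left <= 0 or bottom - top <= 0:
        return None  # saving such a crop as JPEG would fail
    return left, top, right, bottom
```

Skipping boxes for which `crop_box_or_none` returns None is one possible way to avoid the JPEG write error instead of relying on the exception handler.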

@qued qued dismissed their stale review September 18, 2023 14:36

Talked over the use case with Crag; the former implementation is probably okay with some improvements.

@christinestraub

@cragwolfe @qued Reverted to the former implementation (extracting embedded images by cropping each page image based on the image elements' bounding boxes), with some improvements.

christinestraub commented Sep 20, 2023

@cragwolfe @qued As of now, some extracted image elements are missing in the final result and are not being saved. This issue will be addressed by issue #219.

@cragwolfe cragwolfe merged commit b9f032c into main Sep 21, 2023
5 of 8 checks passed
@cragwolfe cragwolfe deleted the feat/save-embedded-images-in-pdf branch September 21, 2023 05:09
cragwolfe pushed a commit to Unstructured-IO/unstructured that referenced this pull request Sep 22, 2023
Addresses
[#1332](#1332)
with `unstructured-inference` PR
[#208](Unstructured-IO/unstructured-inference#208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing


from unstructured.partition.pdf import partition_pdf

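f_path = "example-docs/embedded-images.pdf"
strategy = "hi_res"  # assumption: the original snippet leaves `strategy` undefined; "hi_res" is used here for illustration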

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",  # replace with your output directory
)