Feat/save embedded images in pdf #208
# Conflicts: # CHANGELOG.md
Overall looks good, just a pair of small change requests. Meanwhile I haven't run the test code yet, so that's my next step.
Several things on further inspection / after running test code:
- If I understand the motivation behind this issue correctly, I think we want the embedded image, meaning the image that is actually stored in the PDF. This appears to identify the image location, then use the bounding box information to crop that portion of the rendered PDF and save it. This results in something potentially different in format and resolution than what the PDF stores internally. (@cragwolfe correct me if I'm off base here)
- Running the test code results in a few things that seem off to me. It produces outputs `figure-1-x.jpg`, where `x` ranges from 1 to 14, skipping 5. I get the logged error message showing that the exception handler triggers (presumably on image 5). Is it expected that we get an error on this sample doc?
- Additionally, output images 3 and 4 appear to be duplicates, as do 6 and 7, 8 and 9, 10 and 11, and 13 and 14.
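
The duplicate pairs noted above can be confirmed mechanically rather than by eye. The following is a minimal sketch and not part of this PR: `group_identical_files` is a hypothetical helper that groups files in an output directory by the SHA-256 digest of their bytes, and the `figure-*.jpg` pattern is assumed from the output names reported above. It only detects byte-identical files, so two crops that look the same but were encoded differently would not be grouped.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="figure-*.jpg"):
    """Return lists of file names whose contents are byte-for-byte identical.

    Hypothetical helper for inspecting extraction output; it is not part of
    unstructured-inference.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]
```

Calling `group_identical_files("<output dir>")` returns one list per set of identical files, which makes it easy to tell whether the same region was saved more than once.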
…mages_in_pdf`, and `image_output_dir_path` to the signature
# Conflicts: # CHANGELOG.md
# Conflicts: # CHANGELOG.md
…ject parsed by `pdfminer`
…lement` & always keep embedded images unless they are full page images
…keep embedded images
Talked over the use case with Crag; the former implementation is probably okay with some improvements.
… processing libraries
…`L` and `RGB` mode images
… image based on the image elements bounding boxes
@cragwolfe @qued Reverted to the former implementation (extracting embedded images by cropping each page image based on the image elements' bounding boxes) with some improvements.
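
The cropping approach described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `plan_crops` is a hypothetical helper, bounding boxes are assumed to be `(x1, y1, x2, y2)` tuples in the pixel coordinates of the rendered page image, and the `figure-<page>-<index>.jpg` naming mirrors the output file names reported earlier in this thread.

```python
import math


def plan_crops(page_number, page_size, bboxes):
    """Map image-element bounding boxes to (file name, crop box) pairs.

    Hypothetical helper. Boxes are assumed to be (x1, y1, x2, y2) in pixel
    coordinates of the rendered page, with the origin at the top left.
    """
    width, height = page_size
    crops = []
    for index, (x1, y1, x2, y2) in enumerate(bboxes, start=1):
        # Round outward to whole pixels, then clamp to the page bounds.
        left = max(0, math.floor(x1))
        upper = max(0, math.floor(y1))
        right = min(width, math.ceil(x2))
        lower = min(height, math.ceil(y2))
        if right <= left or lower <= upper:
            # Skip degenerate or fully off-page boxes instead of raising.
            continue
        name = f"figure-{page_number}-{index}.jpg"
        crops.append((name, (left, upper, right, lower)))
    return crops
```

With Pillow, each pair could then be applied as `page_image.crop(box)` followed by a save under `name`, since `Image.crop` takes a `(left, upper, right, lower)` tuple. As the review above points out, the result is a crop of the rendered page, so its format and resolution can differ from the image stored inside the PDF.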
# Conflicts: # CHANGELOG.md # test_unstructured_inference/inference/test_layout.py # unstructured_inference/__version__.py
…o the output directory path
@cragwolfe @qued As of now, some extracted image elements are missing in the final result and are not being saved. This will be addressed in issue #219.
Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).

### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`

### Testing

    from unstructured.partition.pdf import partition_pdf

    f_path = "example-docs/embedded-images.pdf"

    # default image output directory
    elements = partition_pdf(
        f_path,
        strategy=strategy,
        extract_images_in_pdf=True,
    )

    # specific image output directory
    elements = partition_pdf(
        f_path,
        strategy=strategy,
        extract_images_in_pdf=True,
        image_output_dir_path=<directory path>,
    )
Addresses unstructured issue #1332. This PR will work together with unstructured PR #1371.
This PR also addresses the "true" embedded images issue #215.

### Summary
- … `extract_images` method to the `PageLayout` class
- … (`LTImage`) from a PDF layout element parsed by `pdfminer.high_level.extract_pages`
- … `ImageTextRegion` in `load_pdf()`
- … layout visualization script to be able to show only image elements if needed

The following documents can be used for testing and evaluation.

### Testing
### Evaluation
NOTE: To reproduce the original results for comparison, you need to replace the lines with the following code snippet