enhancement: Save embedded images in PDF's separately as images #1332
Labels
enhancement
New feature or request
Comments
This was referenced Sep 12, 2023
cragwolfe pushed a commit to Unstructured-IO/unstructured-inference that referenced this issue Sep 21, 2023
Addresses unstructured issue [#1332](Unstructured-IO/unstructured#1332). This PR will work together with unstructured PR [#1371](Unstructured-IO/unstructured#1371). This PR also addresses `"true" embedded images` issue #215.

### Summary

- Add functionality to extract and save images from the page
  - add the `extract_images` method to the `PageLayout` class
  - pass parameters related to extracting images from the page
  - add Python script to evaluate image extraction with various PDF processing libraries
- Add functionality to get only "true" embedded images when extracting elements from PDF pages
  - add functionality to extract image objects (`LTImage`) from a `PDF layout element` parsed by `pdfminer.high_level.extract_pages`
  - update logic to determine `ImageTextRegion` in `load_pdf()`
- Update the `layout visualization` script to be able to show only image elements if needed

The following documents can be used for testing and evaluation.

- [Captur-1317-5_ENG-p23.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/Captur-1317-5_ENG-p23.pdf)
- [23-BERKSHIRE.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/23-BERKSHIRE.pdf)
- [main.PMC6312790-p1.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12675967/main.PMC6312790_1-1.pdf)

### Testing

```
from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory_path>",  # placeholder: replace with a real directory
)
```

### Evaluation

```
# Extracting Images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

# Layout Visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_only
```

**NOTE:** To reproduce the original results for comparison, you need to replace [the lines](https://github.com/Unstructured-IO/unstructured-inference/blob/feat/save-embedded-images-in-pdf/unstructured_inference/inference/layout.py#L650-L659) with the following code snippet:

```
_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)
```
cragwolfe pushed a commit that referenced this issue Sep 22, 2023
Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).

### Summary

- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`

### Testing

```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# NOTE: `strategy` must be defined before running these calls

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",  # placeholder: replace with a real directory
)
```
The goal of this issue is to update the library to save images embedded in PDFs as separate image files, given some directory path. Each image could be saved as `<pdf-basename-or-"figure">-pageN-figN.jpg`. This filename should be recorded in the metadata for the Image element. The default would be to not do this, of course.
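As a rough illustration of the proposed naming scheme, here is a minimal sketch. The function name `image_filename` and its parameters are hypothetical, not part of the library. It assumes 1-based page and figure numbers and falls back to the literal prefix `figure` when no PDF path is available.

```python
from pathlib import Path


def image_filename(pdf_path, page_number, figure_number):
    """Build a name of the form <pdf-basename-or-"figure">-pageN-figN.jpg.

    Hypothetical helper illustrating the scheme proposed in this issue.
    """
    # Use the PDF's base name (without extension) when a path is given.
    prefix = Path(pdf_path).stem if pdf_path else "figure"
    return f"{prefix}-page{page_number}-fig{figure_number}.jpg"
```

For example, `image_filename("docs/report.pdf", 3, 1)` gives `report-page3-fig1.jpg`, and `image_filename(None, 1, 2)` gives `figure-page1-fig2.jpg`.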