enhancement: Save embedded images in PDF's separately as images #1332

christinestraub · 2023-09-07T18:42:23Z

The goal of this issue is to update the library to save images embedded in PDFs as separate image files, given some directory path. Each image could be saved as `<pdf-basename-or-"figure">-pageN-figN.jpg`. This filename should be recorded in the metadata for the Image element. The default would be to not do this, of course.
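The naming scheme proposed above could be sketched as follows. This is only an illustration: the helper name `image_filename` is hypothetical and not part of the library, and 1-based page and figure numbering is an assumption.

```python
from pathlib import Path
from typing import Optional


def image_filename(pdf_path: Optional[str], page_number: int, figure_number: int) -> str:
    """Build '<pdf-basename-or-"figure">-pageN-figN.jpg' (hypothetical helper)."""
    # Fall back to the literal prefix "figure" when no source filename is known.
    prefix = Path(pdf_path).stem if pdf_path else "figure"
    # Assumption: page and figure numbers are passed in already 1-based.
    return f"{prefix}-page{page_number}-fig{figure_number}.jpg"


print(image_filename("sample-docs/embedded-images.pdf", 1, 2))
print(image_filename(None, 3, 1))
```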

Addresses unstructured issue [#1332](Unstructured-IO/unstructured#1332). This PR will work together with unstructured PR [#1371](Unstructured-IO/unstructured#1371). This PR also addresses the `"true" embedded images` issue #215.

### Summary

- Add functionality to extract and save images from the page
  - add the `extract_images` method to the `PageLayout` class
  - pass parameters related to extracting images from the page
  - add a Python script to evaluate image extraction with various PDF processing libraries
- Add functionality to get only "true" embedded images when extracting elements from PDF pages
  - add functionality to extract image objects (`LTImage`) from a `PDF layout element` parsed by `pdfminer.high_level.extract_pages`
  - update the logic to determine `ImageTextRegion` in `load_pdf()`
- Update the `layout visualization` script to be able to show only image elements if needed

The following documents can be used for testing and evaluation.

- [Captur-1317-5_ENG-p23.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/Captur-1317-5_ENG-p23.pdf)
- [23-BERKSHIRE.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/23-BERKSHIRE.pdf)
- [main.PMC6312790-p1.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12675967/main.PMC6312790_1-1.pdf)

### Testing

```
from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory_path>",  # placeholder: replace with a real directory
)
```

### Evaluation

```
// Extracting Images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

// Layout Visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_oly
```

**NOTE:** To reproduce the original results for comparison, you need to replace [the lines](https://github.com/Unstructured-IO/unstructured-inference/blob/feat/save-embedded-images-in-pdf/unstructured_inference/inference/layout.py#L650-L659) with the following code snippet:

```
_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)
```
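The idea of pulling only image objects out of a nested layout tree can be illustrated with a small depth-first walk. This is a library-free sketch, not the code from the PR: `iter_images` and the `is_image` predicate are hypothetical names, and the traversal simply assumes that container objects are iterable while leaf objects are not.

```python
from typing import Any, Callable, Iterator


def iter_images(element: Any, is_image: Callable[[Any], bool]) -> Iterator[Any]:
    """Yield every nested object for which is_image() is true, depth first."""
    if is_image(element):
        yield element
        return
    # Strings iterate into themselves forever, so treat them as leaves.
    if isinstance(element, (str, bytes)):
        return
    # Containers (pages, figures) are iterable; other leaf objects are not.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from iter_images(child, is_image)


class FakeImage:
    """Stand-in for an image object in this self-contained example."""


first, second = FakeImage(), FakeImage()
page = [[first, object()], [[second]], "some text"]
found = list(iter_images(page, lambda el: isinstance(el, FakeImage)))
print(len(found))
```

With pdfminer.six, the same walk would presumably be driven by the pages returned from `pdfminer.high_level.extract_pages` and a predicate such as `lambda el: isinstance(el, LTImage)`, since figure containers are iterable and `LTImage` objects are leaves.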

Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).

### Summary

- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if `el.metadata.image_path` is `True`

### Testing

```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# `strategy` is not defined in this snippet; it is assumed to be set by the caller.

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",  # placeholder: replace with a real directory
)
```
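Since the summary above adds `image_path` to element metadata, a caller could collect the saved files roughly as follows. This is a sketch under assumptions: `saved_image_paths` is a hypothetical helper, and the elements here are plain stand-in objects rather than real `partition_pdf` output.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def saved_image_paths(elements: Iterable[Any]) -> List[str]:
    """Return the image_path of every element whose metadata has one set."""
    paths = []
    for el in elements:
        # Tolerate elements without metadata or without an image_path attribute.
        metadata = getattr(el, "metadata", None)
        image_path = getattr(metadata, "image_path", None)
        if image_path:
            paths.append(image_path)
    return paths


# Stand-in elements; real ones would come from partition_pdf(...).
elements = [
    SimpleNamespace(metadata=SimpleNamespace(image_path="out/figure-page1-fig1.jpg")),
    SimpleNamespace(metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(metadata=SimpleNamespace()),
]
print(saved_image_paths(elements))
```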

christinestraub added the enhancement New feature or request label Sep 7, 2023

christinestraub self-assigned this Sep 7, 2023

This was referenced Sep 12, 2023

Feat/save embedded images in pdf Unstructured-IO/unstructured-inference#208

Merged

Feat/1332 save embedded images in pdf #1371

Merged

christinestraub closed this as completed Sep 25, 2023
