Feat/1332 save embedded images in pdf #1371

christinestraub · 2023-09-11T20:04:22Z

Addresses #1332 with unstructured-inference PR #208.

Summary

Add image_path to element metadata
Pass parameters related to extracting images in PDF
Preserve image elements that would otherwise be dropped as garbage text when el.metadata.image_path is set
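The third point can be illustrated with a minimal sketch. It is not the code from this PR: `keep_element` and `looks_like_garbage` are hypothetical names, the garbage heuristic is a placeholder, and the elements are plain stand-in objects rather than real `unstructured` elements. The only assumption taken from the summary is that an element exposes `metadata.image_path`.

```python
from types import SimpleNamespace


def looks_like_garbage(text):
    # Placeholder heuristic for illustration only: text with no
    # alphanumeric characters is treated as garbage.
    return not any(ch.isalnum() for ch in text)


def keep_element(el, is_garbage_text):
    """Return True if the element should survive garbage-text filtering.

    Hypothetical helper: an element is kept when its metadata carries a
    non-empty image_path, even if its text looks like garbage.
    """
    image_path = getattr(getattr(el, "metadata", None), "image_path", None)
    if image_path:
        return True
    return not is_garbage_text(el.text)


# Stand-in elements (not real unstructured Element objects).
elements = [
    SimpleNamespace(text="Hello world", metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(text="~~ ##", metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(text="~~ ##", metadata=SimpleNamespace(image_path="figures/img-1.jpg")),
]

kept = [el for el in elements if keep_element(el, looks_like_garbage)]
```

With these stand-ins, the garbage-text element without an image path is dropped, while the one that points at a saved image is preserved for downstream use.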

Testing

from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"
# Assumption: image extraction is exercised with the "hi_res" strategy;
# the original snippet left `strategy` undefined.
strategy = "hi_res"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="path/to/image/output/dir",  # replace with your directory
)
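After partitioning, the saved image locations should be readable from each element's metadata. The sketch below assumes only that elements expose `metadata.image_path` as described in the summary; `collect_image_paths` is a hypothetical helper, and the sample objects are stand-ins so the snippet runs without `unstructured` installed.

```python
from types import SimpleNamespace


def collect_image_paths(elements):
    """Return the non-empty metadata.image_path values, in element order."""
    paths = []
    for el in elements:
        image_path = getattr(getattr(el, "metadata", None), "image_path", None)
        if image_path:
            paths.append(image_path)
    return paths


# Stand-ins for the list returned by partition_pdf (hypothetical paths).
sample_elements = [
    SimpleNamespace(text="Title", metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(text="", metadata=SimpleNamespace(image_path="out/figure-1-1.jpg")),
    SimpleNamespace(text="Body", metadata=SimpleNamespace()),
]

image_paths = collect_image_paths(sample_elements)
```

In real use, the list returned by `partition_pdf` would be passed in place of `sample_elements`.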

Addresses unstructured issue [#1332](Unstructured-IO/unstructured#1332). This PR works together with unstructured PR [#1371](Unstructured-IO/unstructured#1371). It also addresses the `"true" embedded images` issue #215.

### Summary

- Add functionality to extract and save images from the page
  - add the `extract_images` method to the `PageLayout` class
  - pass parameters related to extracting images from the page
  - add Python script to evaluate image extraction with various PDF processing libraries
- Add functionality to get only "true" embedded images when extracting elements from PDF pages
  - add functionality to extract image objects (`LTImage`) from a `PDF layout element` parsed by `pdfminer.high_level.extract_pages`
  - update logic to determine `ImageTextRegion` in `load_pdf()`
- Update the `layout visualization` script to be able to show only image elements if needed

The following documents can be used for testing and evaluation.

- [Captur-1317-5_ENG-p23.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/Captur-1317-5_ENG-p23.pdf)
- [23-BERKSHIRE.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/23-BERKSHIRE.pdf)
- [main.PMC6312790-p1.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12675967/main.PMC6312790_1-1.pdf)

### Testing

```
from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path="path/to/image/output/dir",  # replace with your directory
)
```

### Evaluation

```
// Extracting Images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

// Layout Visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_only
```

**NOTE:** To reproduce the original results for comparison, replace [the lines](https://github.com/Unstructured-IO/unstructured-inference/blob/feat/save-embedded-images-in-pdf/unstructured_inference/inference/layout.py#L650-L659) with the following code snippet:

```
_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)
```

cragwolfe

LGTM!
hitting a bunch of unrelated issues with ingest tests : /

christinestraub added 5 commits September 11, 2023 03:44

feat: add image_path to element metadata

621ef71

feat: pass parameters related to extracting images in PDF

75144b1

feat: preserve image elements for other downstream use cases

1b024ce

chore: add example doc

1df3cf0

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

4d8beac

christinestraub mentioned this pull request Sep 12, 2023

Feat/save embedded images in pdf Unstructured-IO/unstructured-inference#208

Merged

christinestraub marked this pull request as ready for review September 12, 2023 19:53

christinestraub requested review from cragwolfe and qued September 12, 2023 19:58

christinestraub and others added 14 commits September 12, 2023 13:00

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

295c337

chore: update changelog & version

63f4d3d

test: fix lint errors

bcbc3d5

test: update test cases

d48febb

feat: preserve image elements if el.metadata.image_path is True

249de7b

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

c47c594

# Conflicts: # CHANGELOG.md # unstructured/__version__.py

Feat/1332 save embedded images in pdf <- Ingest test fixtures update (#…

b328b0b

…1398) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

7cd3344

# Conflicts: # CHANGELOG.md # unstructured/__version__.py

chore: update changelog & version

f08acbb

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

a3e8980

# Conflicts: # CHANGELOG.md # unstructured/__version__.py # unstructured/documents/elements.py # unstructured/partition/common.py

chore: update changelog & version

117f7ce

feat: pass extra params to process_file_with_model only if set

3cd5682

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

38f3736

# Conflicts: # CHANGELOG.md

chore: update changelog

eb65ee8

cragwolfe and others added 4 commits September 20, 2023 22:40

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

90a7d47

make pip-compile

49edd79

pip compile again

f015a44

Feat/1332 save embedded images in pdf <- Ingest test fixtures update (#…

5a1eff4

…1483) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: cragwolfe <cragwolfe@users.noreply.github.com>

cragwolfe approved these changes Sep 21, 2023

add spuriously removed notion test .json?

e00e3d5

christinestraub and others added 2 commits September 21, 2023 13:18

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

2e7b359

Merge branch 'main' into feat/1332-save-embedded-images-in-pdf

21ce813

cragwolfe enabled auto-merge (squash) September 22, 2023 08:42

cragwolfe merged commit 2d95172 into main Sep 22, 2023
39 checks passed

cragwolfe deleted the feat/1332-save-embedded-images-in-pdf branch September 22, 2023 09:16
