Feat/1332 save embedded images in pdf #1371

Merged
26 commits merged into main from feat/1332-save-embedded-images-in-pdf on Sep 22, 2023

@christinestraub (Collaborator) commented Sep 11, 2023

Addresses #1332 with unstructured-inference PR #208.

Summary

  • Add image_path to element metadata
  • Pass through the parameters that control image extraction from PDFs
  • Preserve image elements that would otherwise be dropped as garbage text when el.metadata.image_path is set
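
The third bullet can be sketched roughly as below. This is an illustration only, not the code in this PR: `is_garbage_text` and `filter_elements` are hypothetical names, the garbage check is a stand-in heuristic, and the elements are simple stand-in objects rather than real `unstructured` elements.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def is_garbage_text(text: str) -> bool:
    """Hypothetical heuristic: text with no alphanumeric characters is garbage."""
    return not any(ch.isalnum() for ch in text)


def filter_elements(elements: Iterable[Any]) -> List[Any]:
    """Drop garbage-text elements unless they carry a saved image path."""
    kept = []
    for el in elements:
        image_path = getattr(el.metadata, "image_path", None)
        # An element with an image_path is kept even if its text looks like garbage.
        if image_path or not is_garbage_text(el.text):
            kept.append(el)
    return kept


# Stand-in elements (assumed shape: .text and .metadata.image_path).
elements = [
    SimpleNamespace(text="Quarterly results", metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(text="~~ ##", metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(text="~~ ##", metadata=SimpleNamespace(image_path="figures/figure-1-1.jpg")),
]
kept = filter_elements(elements)
```

Only the second stand-in element is dropped here; the third survives because its `image_path` is set.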

Testing

from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"
# Assumption: "hi_res" is used here because image extraction goes through the layout model
strategy = "hi_res"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",  # replace with a real directory
)
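
After a run like the above, the saved images could be located through the new metadata field. A minimal sketch, assuming each returned element exposes `metadata.image_path` as described in the summary; `saved_image_paths` is a hypothetical helper, and the demo objects and the path in them are made-up stand-ins for real `partition_pdf` output.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Iterable, List


def saved_image_paths(elements: Iterable[Any]) -> List[Path]:
    """Return the image_path of every element that has one set."""
    paths = []
    for el in elements:
        image_path = getattr(getattr(el, "metadata", None), "image_path", None)
        if image_path:
            paths.append(Path(image_path))
    return paths


# Stand-ins for the elements returned by partition_pdf (assumed shape).
demo_elements = [
    SimpleNamespace(text="Some title", metadata=SimpleNamespace(image_path=None)),
    SimpleNamespace(text="", metadata=SimpleNamespace(image_path="images/figure-1-1.jpg")),
]
demo_paths = saved_image_paths(demo_elements)

# Paths that were recorded in metadata but are not present on disk.
missing = [p for p in demo_paths if not p.exists()]
```

Checking `missing` against an actual run is a quick way to confirm that every recorded `image_path` points at a file that was written.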

cragwolfe pushed a commit to Unstructured-IO/unstructured-inference that referenced this pull request Sep 21, 2023
Addresses unstructured issue
[#1332](Unstructured-IO/unstructured#1332).
This PR will work together with unstructured PR
[#1371](Unstructured-IO/unstructured#1371).
This PR also addresses `"true" embedded images` issue #215.

### Summary
- Add functionality to extract and save images from the page
  - Add the `extract_images` method to the `PageLayout` class
  - Pass parameters related to extracting images from the page
- Add a Python script to evaluate image extraction with various PDF
processing libraries
- Add functionality to get only "true" embedded images when extracting
elements from PDF pages
  - Add functionality to extract image objects (`LTImage`) from a PDF
layout element parsed by `pdfminer.high_level.extract_pages`
  - Update the logic that determines `ImageTextRegion` in `load_pdf()`
- Update the layout visualization script so it can show only image
elements if needed
The following documents can be used for testing and evaluation.
- [Captur-1317-5_ENG-p23.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/Captur-1317-5_ENG-p23.pdf)
- [23-BERKSHIRE.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/23-BERKSHIRE.pdf)
- [main.PMC6312790-p1.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12675967/main.PMC6312790_1-1.pdf)

### Testing
```
from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory_path>",  # replace with a real directory
)
```
### Evaluation
```
# Extracting images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

# Layout visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_only
```
**NOTE:** To reproduce the original results for comparison, replace [these
lines](https://github.com/Unstructured-IO/unstructured-inference/blob/feat/save-embedded-images-in-pdf/unstructured_inference/inference/layout.py#L650-L659)
with the following code snippet:
```
_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)
```
cragwolfe and others added 4 commits September 20, 2023 22:40
…1483)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: cragwolfe <cragwolfe@users.noreply.github.com>
@cragwolfe (Contributor) left a comment

LGTM!
hitting a bunch of unrelated issues with ingest tests : /

@cragwolfe cragwolfe enabled auto-merge (squash) September 22, 2023 08:42
@cragwolfe cragwolfe merged commit 2d95172 into main Sep 22, 2023
39 checks passed
@cragwolfe cragwolfe deleted the feat/1332-save-embedded-images-in-pdf branch September 22, 2023 09:16