enhancement: Save embedded images in PDF's separately as images #1332

Closed
christinestraub opened this issue Sep 7, 2023 · 0 comments

@christinestraub
Collaborator

The goal of this issue is to update the library to save images embedded in PDFs as separate image files, given a directory path. Each image could be saved as `<pdf-basename-or-"figure">-pageN-figN.jpg`, and this filename should be recorded in the metadata of the corresponding Image element. By default, of course, images would not be saved.
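The naming scheme proposed above could be sketched roughly as follows. This is only an illustration: the helper names `image_filename` and `image_output_path` are hypothetical and not part of the library, and the 1-based page and figure numbering is an assumption.

```python
from pathlib import Path
from typing import Optional


def image_filename(pdf_path: Optional[str], page_number: int, figure_number: int) -> str:
    """Build `<pdf-basename-or-"figure">-pageN-figN.jpg` (hypothetical helper)."""
    # Fall back to the literal "figure" when no source filename is known.
    basename = Path(pdf_path).stem if pdf_path else "figure"
    return f"{basename}-page{page_number}-fig{figure_number}.jpg"


def image_output_path(
    output_dir: str,
    pdf_path: Optional[str],
    page_number: int,
    figure_number: int,
) -> Path:
    """Join the output directory with the generated filename (hypothetical helper)."""
    return Path(output_dir) / image_filename(pdf_path, page_number, figure_number)
```

The resulting path is the kind of value that would then be stored in the Image element's metadata.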

@christinestraub christinestraub added the enhancement New feature or request label Sep 7, 2023
@christinestraub christinestraub self-assigned this Sep 7, 2023
cragwolfe pushed a commit to Unstructured-IO/unstructured-inference that referenced this issue Sep 21, 2023
Addresses `unstructured` issue [#1332](Unstructured-IO/unstructured#1332).
This PR works together with `unstructured` PR [#1371](Unstructured-IO/unstructured#1371).
It also addresses the "true" embedded images issue #215.

### Summary
- Add functionality to extract and save images from the page
  - Add the `extract_images` method to the `PageLayout` class
  - Pass parameters related to extracting images from the page
- Add a Python script to evaluate image extraction with various PDF processing libraries
- Add functionality to get only "true" embedded images when extracting elements from PDF pages
- Add functionality to extract image objects (`LTImage`) from a PDF layout element parsed by `pdfminer.high_level.extract_pages`
  - Update the logic that determines `ImageTextRegion` in `load_pdf()`
- Update the layout visualization script so that it can show only image elements if needed
The following documents can be used for testing and evaluation.
- [Captur-1317-5_ENG-p23.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/Captur-1317-5_ENG-p23.pdf)
- [23-BERKSHIRE.pdf](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/23-BERKSHIRE.pdf)
- [main.PMC6312790-p1.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12675967/main.PMC6312790_1-1.pdf)

### Testing
```
from unstructured_inference.inference.layout import DocumentLayout

f_path = "sample-docs/embedded-images.pdf"

# default image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
)

# specific image output directory
doc = DocumentLayout.from_file(
    filename=f_path,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory_path>",  # placeholder: replace with a real directory
)
```
### Evaluation
```
# Extracting images
$ PYTHONPATH=. python examples/image-extraction/embedded-image-extraction.py Captur-1317-5_ENG-p23.pdf unstructured

# Layout visualization
$ PYTHONPATH=. python examples/layout_analysis/visualization.py Captur-1317-5_ENG-p23.pdf image_only
```
**NOTE:** To reproduce the original results for comparison, replace [these lines](https://github.com/Unstructured-IO/unstructured-inference/blob/feat/save-embedded-images-in-pdf/unstructured_inference/inference/layout.py#L650-L659)
with the following code snippet:
```
_text, element_class = (
    (element.get_text(), EmbeddedTextRegion)
    if hasattr(element, "get_text")
    else (None, ImageTextRegion)
)
```
cragwolfe pushed a commit that referenced this issue Sep 22, 2023
Addresses [#1332](#1332) with `unstructured-inference` PR [#208](Unstructured-IO/unstructured-inference#208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements that would otherwise be ignored due to garbage text if `el.metadata.image_path` is set
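The last point can be pictured with a small filter like the one below. This is a sketch under stated assumptions, not the code from this PR: the `Element` dataclass, the `drop_garbage` function, and the `is_garbage` predicate are hypothetical stand-ins, and `image_path` is modelled as a flat attribute rather than as `el.metadata.image_path`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element."""

    category: str
    text: str
    image_path: Optional[str] = None


def drop_garbage(
    elements: List[Element],
    is_garbage: Callable[[str], bool],
) -> List[Element]:
    """Drop elements with garbage text, but keep images that were saved to disk."""
    kept = []
    for el in elements:
        # An image whose file was written is worth keeping even if its text is noise.
        saved_image = el.category == "Image" and bool(el.image_path)
        if saved_image or not is_garbage(el.text):
            kept.append(el)
    return kept
```

The design choice here is that a saved image file carries value on its own, so the quality of any text extracted from the image region should not decide whether the element survives.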
### Testing


```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"
# Assumption: image extraction relies on the layout-based "hi_res" strategy
strategy = "hi_res"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",  # placeholder: replace with a real directory
)
```