enhancement: allow setting image block crop padding parameter #2415

christinestraub · 2024-01-17T21:36:04Z

Closes #2320.

Summary

In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped.
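As a rough illustration of why padding helps, the sketch below expands a detected bounding box by a horizontal and a vertical pad before cropping. The function name `pad_box_for_crop`, the `(x1, y1, x2, y2)` tuple layout, and the clamping to the page size are assumptions made for this example; it is not the code added by this PR.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def pad_box_for_crop(
    box: Box, h_pad: float, v_pad: float, page_width: float, page_height: float
) -> Box:
    """Expand a box by the given pads, keeping it inside the page."""
    x1, y1, x2, y2 = box
    return (
        max(0.0, x1 - h_pad),          # grow left, but not past the page edge
        max(0.0, y1 - v_pad),          # grow up
        min(page_width, x2 + h_pad),   # grow right
        min(page_height, y2 + v_pad),  # grow down
    )


padded = pad_box_for_crop(
    (100, 50, 300, 200), h_pad=40, v_pad=20, page_width=320, page_height=400
)
print(padded)  # (60, 30, 320, 220)
```

With a pad of 0 the box is returned unchanged, which matches the default chosen in this PR ("feat: set default padding to 0").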

Testing

PDF: LM339-D_2-2.pdf
Set two environment variables, EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD and EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD
(e.g. EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40, EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20), then run:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="LM339-D_2-2.pdf",
    extract_image_block_types=["image"],
)
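One way to set the two variables from Python rather than from the shell is sketched below. The helper names `crop_padding` and `extract_with_padding` are hypothetical, and the sketch assumes the library reads the variables when the partition runs rather than at import time. The `strategy` and `extract_image_block_output_dir` arguments are taken from the test snippet posted later in this thread.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def crop_padding(horizontal: int, vertical: int) -> Iterator[None]:
    """Temporarily set the two padding variables named in this PR."""
    names = {
        "EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD": str(horizontal),
        "EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD": str(vertical),
    }
    previous = {name: os.environ.get(name) for name in names}
    os.environ.update(names)
    try:
        yield
    finally:
        # Restore whatever was set before entering the block.
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


def extract_with_padding(filename: str, output_dir: str):
    # Imported lazily so the helper above works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    with crop_padding(horizontal=40, vertical=20):
        return partition_pdf(
            filename=filename,
            strategy="hi_res",
            extract_image_block_types=["image"],
            extract_image_block_output_dir=output_dir,
        )
```

Scoping the variables to a `with` block makes it easy to compare padded and unpadded crops in the same session.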

Coniferish · 2024-01-18T17:04:20Z

@christine
This is what I'm using to test this, but I'm not seeing a difference in the image that's saved from the main branch and this one:

from unstructured.partition.pdf import partition_pdf
filename = "example-docs/LM339-D_2-2.pdf"
elements = partition_pdf(
    filename=filename,
    extract_image_block_types=["image"],
    strategy="hi_res",
    extract_image_block_output_dir="output/"
)

Do I need to define some additional param to increase the padding?

unstructured/partition/pdf_image/pdf_image_utils.py

Coniferish · 2024-01-18T17:06:36Z

Do the .rst files in docs/source/ need to be updated with these changes?

christinestraub · 2024-01-18T19:12:47Z

> @christine This is what I'm using to test this, but I'm not seeing a difference in the image that's saved from the main branch and this one:
>
> from unstructured.partition.pdf import partition_pdf
> filename = "example-docs/LM339-D_2-2.pdf"
> elements = partition_pdf(
>     filename=filename,
>     extract_image_block_types=["image"],
>     strategy="hi_res",
>     extract_image_block_output_dir="output/"
> )
>
> Do I need to define some additional param to increase the padding?

@Coniferish To see a difference in the image, you'll need to set two environment variables: EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD and EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD.
I updated the testing instructions in the description.

christinestraub · 2024-01-18T20:18:52Z

> Do the .rst files in docs/source/ need to be updated with these changes?

Updated the partition.rst file in docs/source/ :) Thanks!

Coniferish

LGTM once we get that docker test passing :)

Want to check something before approval

Coniferish · 2024-01-18T20:59:21Z

@christinestraub, one last question: I want to make sure that this should NOT update the element.text that's detected even if the image now includes the full text (in this example "Output" is now visible, but it's still not a part of element.text)

christinestraub · 2024-01-18T23:02:22Z

> @christinestraub, one last question: I want to make sure that this should NOT update the element.text that's detected even if the image now includes the full text (in this example "Output" is now visible, but it's still not a part of element.text)

I didn't expect this to update element.text. @cragwolfe What do you think of this? Do we need to update element.text as well when extracting by adding padding to the detected image block?

cragwolfe · 2024-01-19T06:25:56Z

> > @christinestraub, one last question: I want to make sure that this should NOT update the element.text that's detected even if the image now includes the full text (in this example "Output" is now visible, but it's still not a part of element.text)
>
> I didn't expect to update the element.text. @cragwolfe What do you think of this? Do we need to update element.text as well when extracting by adding padding to the detected image block?

I think the primary intent of this PR is with respect to the image artifacts saved, so we do not need to update the .text in this PR. It's a good callout though, and worth tracking in another GitHub issue. But there is a fair bit of nuance: it seems fine to extend the OCR .text, but we also want to be careful not to duplicate other text if the bbox overlaps another element. Or, don't allow a bbox overlap in the first place for OCR, and only extend to min(EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD, distance_to_next_bbox).
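The `min(pad, distance_to_next_bbox)` idea could be sketched as follows. Everything here is hypothetical: the function names, the `(x1, y1, x2, y2)` box layout, and the choice to consider only neighbors that share some vertical extent with the image block. It handles horizontal padding only and is not part of this PR.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def vertically_overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share some vertical extent."""
    return a[1] < b[3] and b[1] < a[3]


def clamped_horizontal_pads(
    box: Box, others: Iterable[Box], pad: float
) -> Tuple[float, float]:
    """Return (left_pad, right_pad), each limited by the gap to the nearest neighbor."""
    left, right = pad, pad
    for other in others:
        if not vertically_overlaps(box, other):
            continue  # a box above or below cannot be hit by horizontal growth
        if other[2] <= box[0]:    # neighbor lies entirely to the left
            left = min(left, box[0] - other[2])
        elif other[0] >= box[2]:  # neighbor lies entirely to the right
            right = min(right, other[0] - box[2])
    return left, right


image_box = (100, 100, 200, 200)
neighbors = [(210, 120, 260, 180), (0, 300, 90, 350)]
print(clamped_horizontal_pads(image_box, neighbors, pad=40))  # (40, 10)
```

In this example the right pad shrinks to 10 because a neighbor starts 10 units to the right, while the box further down the page does not limit the left pad.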

cragwolfe

LGTM!

christinestraub added 5 commits January 17, 2024 11:09

refactor: move utility function pad_element_bboxes() to pdf_image_utils.py

ee8a3b0

feat: add functionality to pad identified image blocks before cropping

dee38bc

test: add unit test for pad_bbox()

6bd85d0

chore: update changelog & version

f8f2755

feat: set default padding to 0

23cfdf1

christinestraub requested review from Coniferish, badGarnet, cragwolfe and qued January 17, 2024 21:36

test: fix lint errors

0301199

Merge branch 'main' into feat/2320-extract-image-block-content-clippped

8bf0682

christinestraub added 2 commits January 17, 2024 21:10

Merge branch 'main' into feat/2320-extract-image-block-content-clippped

565b1c8

# Conflicts: # CHANGELOG.md

chore: bump version

82be65e

Merge branch 'main' into feat/2320-extract-image-block-content-clippped

af86fca

Coniferish reviewed Jan 18, 2024

unstructured/partition/pdf_image/pdf_image_utils.py

feat: fix env var names

8b443cd

chore: update document .rst file

780a9b0

Coniferish previously approved these changes Jan 18, 2024

Coniferish self-requested a review January 18, 2024 20:54

Merge branch 'main' into feat/2320-extract-image-block-content-clippped

282a62b

# Conflicts: # CHANGELOG.md # unstructured/__version__.py

cragwolfe approved these changes Jan 19, 2024

christinestraub added this pull request to the merge queue Jan 19, 2024

Merged via the queue into main with commit 7378a37 Jan 19, 2024

christinestraub deleted the feat/2320-extract-image-block-content-clippped branch January 19, 2024 07:16
