@christinestraub
Contributor

@christinestraub christinestraub commented Jan 17, 2024

Closes #2320.

Summary

In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped.

Testing

  • PDF: LM339-D_2-2.pdf
  • Set two environment variables, EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD and EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD
    (e.g. EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40, EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20)
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="LM339-D_2-2.pdf",
    extract_image_block_types=["image"],
)
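
To illustrate what the two padding variables are meant to do, here is a minimal sketch of widening a detected crop box by a horizontal and vertical pad. The helper name `padded_crop_box`, the default of 0 when a variable is unset, and the clamping to the page image bounds are all assumptions made for illustration; this is not the code from the PR.

```python
import os


def padded_crop_box(bbox, image_size, env=os.environ):
    """Hypothetical helper (not part of unstructured): widen a crop box by
    the pads read from the two environment variables."""
    x1, y1, x2, y2 = bbox
    width, height = image_size
    # Assumption: an unset variable means no padding.
    h_pad = int(env.get("EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD", 0))
    v_pad = int(env.get("EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD", 0))
    # Assumption: clamp so the padded box never leaves the page image.
    return (
        max(0, x1 - h_pad),
        max(0, y1 - v_pad),
        min(width, x2 + h_pad),
        min(height, y2 + v_pad),
    )


# A plain dict stands in for os.environ so the example has no side effects.
example_env = {
    "EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD": "40",
    "EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD": "20",
}
box = padded_crop_box((100, 50, 300, 200), (320, 400), example_env)
```

With the values from the testing notes (40 and 20), the box grows by 40 px on the left and right and 20 px on the top and bottom, except where it would run past the edge of the page image.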

@Coniferish
Contributor

@christine
This is what I'm using to test this, but I'm not seeing a difference in the image that's saved from the main branch and this one:

from unstructured.partition.pdf import partition_pdf
filename = "example-docs/LM339-D_2-2.pdf"
elements = partition_pdf(
    filename=filename,
    extract_image_block_types=["image"],
    strategy="hi_res",
    extract_image_block_output_dir="output/"
)

Do I need to define some additional param to increase the padding?

@Coniferish
Contributor

Do the .rst files in docs/source/ need to be updated with these changes?

@christinestraub
Contributor Author

christinestraub commented Jan 18, 2024

@Coniferish To see a difference in the image, you'll need to set two environment variables EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD and EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD.
I updated the testing instructions in the description.

@christinestraub
Contributor Author

Updated the partition.rst file in docs/source/ :) Thanks!

Coniferish previously approved these changes Jan 18, 2024
Contributor

@Coniferish Coniferish left a comment

LGTM once we get that docker test passing :)

@Coniferish Coniferish self-requested a review January 18, 2024 20:54
@Coniferish Coniferish dismissed their stale review January 18, 2024 20:54

Want to check something before approval

@Coniferish
Contributor

Coniferish commented Jan 18, 2024

@christinestraub, one last question: I want to make sure that this should NOT update the element.text that's detected even if the image now includes the full text (in this example "Output" is now visible, but it's still not a part of element.text)

@christinestraub
Contributor Author

I didn't expect this to update element.text. @cragwolfe What do you think? Do we also need to update element.text when padding is added to the detected image block during extraction?

# Conflicts:
#	CHANGELOG.md
#	unstructured/__version__.py
@cragwolfe
Contributor

I think the primary intent of this PR is the saved image artifacts, so we do not need to update the .text in this PR. It's a good callout, though, and worth tracking in another GitHub issue. There is a fair bit of nuance: it seems fine to extend the OCR .text, but we also want to be careful not to duplicate other text if the bbox overlaps another element. Alternatively, don't allow a bbox overlap for OCR in the first place, and only extend to min(EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD, distance_to_next_bbox).
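
The min(pad, distance_to_next_bbox) idea could look roughly like the sketch below. It is only an illustration of the suggestion, not something implemented in this PR: the function name `capped_horizontal_pad`, the (x1, y1, x2, y2) box layout, and the rule that only boxes sharing some vertical extent count as neighbours are all assumptions.

```python
def capped_horizontal_pad(bbox, other_bboxes, requested_pad):
    """Hypothetical helper: return (left_pad, right_pad), each limited to
    min(requested_pad, gap to the nearest neighbouring box on that side)."""
    x1, y1, x2, y2 = bbox
    left_pad = right_pad = requested_pad
    for ox1, oy1, ox2, oy2 in other_bboxes:
        # Assumption: a box with no vertical overlap cannot be hit by
        # horizontal padding, so it is ignored.
        if oy2 <= y1 or oy1 >= y2:
            continue
        if ox1 >= x2:  # neighbour lies to the right
            right_pad = min(right_pad, ox1 - x2)
        elif ox2 <= x1:  # neighbour lies to the left
            left_pad = min(left_pad, x1 - ox2)
    return left_pad, right_pad


neighbours = [
    (210, 120, 300, 180),  # 10 px to the right, vertically overlapping
    (0, 300, 90, 350),     # below the block, ignored
    (30, 110, 70, 150),    # 30 px to the left, vertically overlapping
]
pads = capped_horizontal_pad((100, 100, 200, 200), neighbours, 40)
```

Here a requested pad of 40 is reduced to 30 on the left and 10 on the right, so the extended box stops at the edge of each neighbouring element instead of overlapping it. The vertical pad could be capped the same way.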

Contributor

@cragwolfe cragwolfe left a comment

LGTM!

@christinestraub christinestraub added this pull request to the merge queue Jan 19, 2024
Merged via the queue into main with commit 7378a37 Jan 19, 2024
@christinestraub christinestraub deleted the feat/2320-extract-image-block-content-clippped branch January 19, 2024 07:16

Development

Successfully merging this pull request may close these issues.

bug: some extracted images have content clipped

4 participants