@christinestraub christinestraub commented Jul 12, 2023

Summary

Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy.

  • Convert a PDF to images a small chunk of pages at a time instead of loading every page into memory at once
  • Add a wrapper function, convert_pdf_to_images, around the pdf2image library, implemented as a Python generator, and use it in _partition_pdf_or_image_with_ocr
  • Reset the file's position to the beginning after the file is read in convert_to_bytes
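
For readers unfamiliar with the pattern, here is a minimal sketch of the chunked, generator-based conversion and the file rewind described above. It is not the code merged in this PR: the names page_ranges, iter_pdf_images and read_and_rewind, and the default chunk size of 10, are hypothetical and chosen only for illustration. The only pdf2image calls assumed are pdfinfo_from_path and convert_from_path with their first_page and last_page parameters.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def page_ranges(total_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) pairs covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def iter_pdf_images(filename: str, chunk_size: int = 10):
    """Hypothetical stand-in for convert_pdf_to_images: yield page images lazily."""
    # Imported lazily so the other helpers work without pdf2image installed.
    import pdf2image

    total_pages = pdf2image.pdfinfo_from_path(filename)["Pages"]
    for first, last in page_ranges(total_pages, chunk_size):
        # Only the pages of the current chunk are rendered by each call,
        # so the caller never holds the whole document as images at once.
        yield from pdf2image.convert_from_path(
            filename, first_page=first, last_page=last
        )


def read_and_rewind(file: BinaryIO) -> bytes:
    """Hypothetical stand-in for convert_to_bytes: read everything, then rewind."""
    data = file.read()
    # Without this seek, a later reader of the same file object would get b"".
    file.seek(0)
    return data


if __name__ == "__main__":
    print(list(page_ranges(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
    buf = io.BytesIO(b"%PDF-1.7")
    read_and_rewind(buf)
    print(buf.tell())  # 0
```

Because iter_pdf_images is a generator, a caller that runs OCR on one image at a time and then discards it keeps peak memory roughly proportional to the chunk size rather than to the page count.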

Testing

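from unstructured.partition.pdf import partition_pdf

# 400-page PDF that previously triggered the out-of-memory error
filename = "example-docs/pdf2image-memory-error-test-400p.pdf"
elements = partition_pdf(filename, strategy="ocr_only")
print("\n\n".join(str(el) for el in elements))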

@christinestraub christinestraub marked this pull request as ready for review July 12, 2023 22:27

@qued qued left a comment

LGTM, I like the generator implementation! One formatting tweak, and a general (nonblocking) suggestion for the function interface.

@christinestraub christinestraub requested a review from qued July 14, 2023 18:38
@qued qued merged commit 5b7ae29 into main Jul 14, 2023
@qued qued deleted the fix/521-pdf2image-memeory-error branch July 14, 2023 20:08
Development

Successfully merging this pull request may close these issues.

Out of memory processing pdf

3 participants