Fix/521 pdf2image memory error #924

christinestraub · 2023-07-12T22:11:50Z

Summary

Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy.

Add functionality to convert a PDF in small chunks of pages at a time for the ocr_only strategy
Create a wrapper function, convert_pdf_to_images, around the pdf2image library using a Python generator, and apply it in _partition_pdf_or_image_with_ocr
Reset the file's current position to the beginning after reading the file in convert_to_bytes
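
The changes listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `iter_pages_in_chunks`, `convert_pdf_to_images_sketch`, `read_all_and_rewind`, the `chunk_size` default, and the `dpi` value are hypothetical names and values chosen for the example. The only real library calls used are `pdf2image.pdfinfo_from_path` and `pdf2image.convert_from_path` with its `first_page` and `last_page` parameters.

```python
import io
from typing import BinaryIO, Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_pages_in_chunks(
    page_count: int,
    render_range: Callable[[int, int], List[T]],
    chunk_size: int = 10,
) -> Iterator[T]:
    """Yield rendered pages one at a time, rendering one chunk per call.

    `render_range(first, last)` is a hypothetical callback that renders the
    1-indexed, inclusive page range and returns the pages as a list.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, page_count + 1, chunk_size):
        last = min(first + chunk_size - 1, page_count)
        # Only this chunk's pages are rendered here; earlier chunks can be
        # garbage-collected once the consumer has dropped them.
        yield from render_range(first, last)


def convert_pdf_to_images_sketch(filename: str, chunk_size: int = 10, dpi: int = 200):
    """Hypothetical pdf2image wrapper built on the generator above."""
    import pdf2image  # imported lazily so the helper above runs without it

    page_count = pdf2image.pdfinfo_from_path(filename)["Pages"]

    def render(first: int, last: int):
        return pdf2image.convert_from_path(
            filename, dpi=dpi, first_page=first, last_page=last
        )

    return iter_pages_in_chunks(page_count, render, chunk_size)


def read_all_and_rewind(file: BinaryIO) -> bytes:
    """Read a file-like object fully, then reset its position to the start."""
    data = file.read()
    file.seek(0)  # leave the stream usable for the next reader
    return data
```

Because the generator yields individual images, a caller can OCR each page inside a plain `for` loop and let it go out of scope before the next chunk is rendered, which is what keeps peak memory bounded by roughly one chunk rather than the whole document.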

Testing

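from unstructured.partition.pdf import partition_pdf

# 400-page example document used to reproduce the memory error
filename = "example-docs/pdf2image-memory-error-test-400p.pdf"
elements = partition_pdf(filename, strategy="ocr_only")
print("\n\n".join([str(el) for el in elements]))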

qued

LGTM, I like the generator implementation! One formatting tweak, and a general (nonblocking) suggestion for the function interface.

unstructured/partition/pdf.py

christinestraub added 4 commits July 12, 2023 14:02

feat: add functionality to convert a PDF in small chunks of pages at a time for `ocr_only` strategy

657927d

refactor: set the file's current position to the beginning after reading the file in `convert_to_bytes`

55b9f3b

chore: update changelog & version

d1be11f

Merge branch 'main' into fix/521-pdf2image-memeory-error

7b52eb7

# Conflicts: # CHANGELOG.md # unstructured/__version__.py

christinestraub marked this pull request as ready for review July 12, 2023 22:27

christinestraub requested review from cragwolfe and qued July 12, 2023 22:27

christinestraub mentioned this pull request Jul 12, 2023

Out of memory processing pdf #521

Closed

feat: create a wrapper function convert_pdf_to_images for `pdf2image` library using Python generator, apply it to `_partition_pdf_or_image_with_ocr`

9f07339

qued approved these changes Jul 14, 2023

qued and others added 3 commits July 14, 2023 12:35

Merge branch 'main' into fix/521-pdf2image-memeory-error

1ef3f97

refactor: yield individual images from the generator

19e699b

Merge branch 'main' into fix/521-pdf2image-memeory-error

bb02e9d

christinestraub requested a review from qued July 14, 2023 18:38

qued reviewed Jul 14, 2023

fix: correct return type of the generator function

4b4adfb

qued merged commit 5b7ae29 into main Jul 14, 2023

qued deleted the fix/521-pdf2image-memeory-error branch July 14, 2023 20:08
