@ajjimeno

This PR addresses the following issues:

  • Fix a memory leak in DonutProcessor when using large images in numpy format
  • Use the correct settings when the beam search size is > 1
  • Fix a bug that, in very rare cases, left the last element predicted by Chipper with bbox = None
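
For reviewers who want to check the third item on their own output, here is a minimal sketch of a helper that lists which elements lack a bounding box. It assumes only that each element exposes a `bbox` attribute; `FakeElement` and `elements_missing_bbox` are hypothetical names used for illustration and are not part of unstructured_inference.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for a predicted layout element."""

    text: str
    bbox: Optional[Any] = None


def elements_missing_bbox(elements: Iterable[Any]) -> List[int]:
    """Return the positions of elements whose `bbox` is None or absent."""
    return [
        index
        for index, element in enumerate(elements)
        if getattr(element, "bbox", None) is None
    ]


# The last element has no bbox, mimicking the rare failure described above.
sample = [
    FakeElement("Title", bbox=(0, 0, 10, 10)),
    FakeElement("Last element"),
]
print(elements_missing_bbox(sample))
```

An empty list means every element carries a bbox; any index returned points at an element that would trip downstream code expecting coordinates.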

@ajjimeno

Example test snippet. Without the proposed changes it runs out of memory; if there is enough memory, it instead breaks because an element has no bbox. It uses the proposed document from sample-docs, so please change the location to match your setup.

On a Mac with a lot of memory, the memory problem might not show up; I did not see the memory leak on a Mac. On Linux with 16 GB of RAM, e.g. on a c6i.2xlarge instance, it runs out of memory pretty quickly.

from unstructured_inference.inference.layout import DocumentLayout
from unstructured_inference.models.base import get_model

# Load the Chipper model
model = get_model("chipper")
# Process the sample PDF at 300 DPI; adjust the path to your local copy
doc = DocumentLayout.from_file("sample-docs/patent.pdf", model, pdf_image_dpi=300)
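
To put a number on the memory behaviour when running the snippet above, one option is to record the process's peak resident set size before and after the call. This is a rough sketch, not part of the PR: `peak_rss_mib` and `run_and_report` are hypothetical helper names, and it relies on the Unix-only standard-library `resource` module, so it will not run on Windows.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return usage / divisor


def run_and_report(fn, *args, **kwargs):
    """Call fn and return (result, growth of peak RSS in MiB)."""
    before = peak_rss_mib()
    result = fn(*args, **kwargs)
    after = peak_rss_mib()
    return result, after - before


# Tiny stand-in workload; replace with the real call when testing.
result, growth = run_and_report(sum, [1, 2, 3])
print(f"result={result}, peak RSS grew by {growth:.1f} MiB")
```

With the snippet above, the call could be wrapped as `run_and_report(DocumentLayout.from_file, "sample-docs/patent.pdf", model, pdf_image_dpi=300)`. Peak RSS only ever increases, so this shows how high memory climbed during the call, not whether it was released afterwards.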

@ajjimeno ajjimeno changed the title First commit fix: memory leak on chipper processor, beam search parameters, and bbox bug Oct 17, 2023
@cragwolfe cragwolfe merged commit b1dba87 into main Oct 17, 2023
@cragwolfe cragwolfe deleted the bug/chipper-out-of-memory branch October 17, 2023 06:57