@christinestraub christinestraub commented Nov 24, 2023

Summary

This PR is the second part of the pdfminer refactor, which moves pdfminer processing from the unstructured-inference repo to the unstructured repo. The first part was done in Unstructured-IO/unstructured-inference#294. This PR adds the logic to merge the extracted layout with the inferred layout.

The updated workflow for the hi_res strategy:

  • pass the document (as data/filename) to the inference repo to get inferred_layout (DocumentLayout)
  • pass the inferred_layout returned from the inference repo, together with the document (as data/filename), to the pdfminer_processing module, which first opens the document (creating a temp file/dir as needed) and splits it into pages
    • if is_image is True, return the passed inferred_layout (DocumentLayout) unchanged
    • if is_image is False:
      • get extracted_layout (TextRegions) from the passed document (data/filename) using pdfminer
      • merge extracted_layout (TextRegions) with the passed inferred_layout (DocumentLayout)
      • return the inferred_layout (DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout)
  • pass merged_layout and the document (as data/filename) to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits it into pages (for a PDF file, the PDF pages are converted to image pages)
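
The branching in the pdfminer_processing step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Page`, `DocumentLayout`, `merge_layouts_sketch`, `extract_page`, and `merge_page` are hypothetical stand-ins, and the real merge operates on TextRegions and LayoutElements rather than plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    # Stand-in for a page of layout elements; strings keep the sketch small.
    elements: List[str] = field(default_factory=list)


@dataclass
class DocumentLayout:
    # Hypothetical stand-in, not the unstructured-inference class of the same name.
    pages: List[Page] = field(default_factory=list)


def merge_layouts_sketch(
    inferred_layout: DocumentLayout,
    is_image: bool,
    extract_page: Callable[[int], List[str]],
    merge_page: Callable[[List[str], List[str]], List[str]],
) -> DocumentLayout:
    """Return the inferred layout, merged with extracted text for non-image input."""
    if is_image:
        # Images have no embedded text to extract, so nothing is merged.
        return inferred_layout

    for page_number, page in enumerate(inferred_layout.pages, start=1):
        # Assumed: extraction is done page by page (by pdfminer in the real flow).
        extracted = extract_page(page_number)
        # Assumed: the merge result replaces the page's elements in place.
        page.elements = merge_page(extracted, page.elements)

    return inferred_layout
```

Passing the extraction and merge steps in as callables keeps the sketch independent of any real pdfminer or unstructured-inference API; the returned object is the same inferred layout with updated elements, matching the description above.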

This PR also bumps the dependency to unstructured-inference==0.7.17, since this branch relies on the pdfminer refactor in unstructured-inference.

Note

This PR also fixes issue #2164 by extracting elements with pdfminer in a way similar to the fast strategy workflow.

TODO

  • refactor image extraction to move it from the unstructured-inference repo to the unstructured repo
  • improve natural reading order by applying the current default xycut sorting to the elements extracted by pdfminer

@benjats07 benjats07 left a comment

LGTM!

# Conflicts:
#	CHANGELOG.md
#	requirements/base.txt
#	requirements/ingest/embed-aws-bedrock.txt
#	requirements/ingest/embed-huggingface.txt
#	requirements/ingest/embed-openai.txt