partition_pdf should support documents in two column format #356

Closed
MthwRobinson opened this issue Mar 10, 2023 · 6 comments

@MthwRobinson
Contributor

The goal of this issue is to add support for two column documents to partition_pdf. Currently, partition_pdf will process the documents but the elements are not always in the correct order. The attached document can be used for testing.
Dense-Passage-Retrieval-for-Open-Domain-Question-Answering.pdf

MthwRobinson added the "enhancement" (New feature or request) and "python" (Pull requests that update Python code) labels on Mar 10, 2023
@hsm207
Contributor

hsm207 commented May 12, 2023

Does this issue also imply adding support for docs with more than two columns?

@MthwRobinson
Contributor Author

@hsm207 - Yes, this will support multi-column documents (two or more columns). In the meantime, you can pass in strategy="fast" (if the text is extractable) or strategy="ocr_only" (if the text is not extractable) for better results on multi-column documents.
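
As a rough illustration of the workaround above, the sketch below picks one of the two strategies and passes it to partition_pdf. The helper names choose_strategy and partition_multicolumn and the text_is_extractable flag are hypothetical and only for illustration; partition_pdf and its strategy argument are the parts taken from the library. The import is deferred so the strategy helper can be used without the library installed.

```python
from typing import Any, List


def choose_strategy(text_is_extractable: bool) -> str:
    """Pick the interim strategy suggested in this thread for multi-column PDFs."""
    # "fast" reads the embedded text layer; "ocr_only" is for scanned/image-only PDFs.
    return "fast" if text_is_extractable else "ocr_only"


def partition_multicolumn(filename: str, text_is_extractable: bool) -> List[Any]:
    """Hypothetical wrapper: partition a PDF with the strategy chosen above."""
    # Imported lazily; requires the `unstructured` package with its PDF extras.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=filename,
        strategy=choose_strategy(text_is_extractable),
    )
```

Assuming the test document attached to this issue has an extractable text layer, a call would look like partition_multicolumn("Dense-Passage-Retrieval-for-Open-Domain-Question-Answering.pdf", text_is_extractable=True).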

@CShorten

CShorten commented May 23, 2023

This will be huge, so excited for what this will unlock with putting scientific papers into LLM systems!

@Kate-Lyndegaard

Hi,
Is this improvement available yet?

@walsha2
Contributor

walsha2 commented Sep 7, 2023

Is this improvement available, yet?

From tracking the commit history in the unstructured-inference repo, it looks like a promising first attempt was added in #112 and reverted in #120, with no action since then, I presume?

Maybe @MthwRobinson or @qued can comment on the status of this. Is a more sophisticated sorting method the right approach, or is that too fragile? It seems like sorting is required no matter what, especially if the text is directly extracted from a PDF.

@MthwRobinson was on the right track, and I think a few more cracks at "smart sorting" might yield a solution that at least does no harm to single-column layouts while also handling a larger set of multi-column PDFs.

Not sure if there is some other means of solving the multi-column problem that is being tossed around internally but has yet to be disseminated? Wanted to ask first before digging any deeper myself.
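
To make the "smart sorting" idea concrete, here is a minimal sketch of one possible column-aware ordering. It is not the approach from #112 or from the library; the Box type, its fields, and sort_reading_order are hypothetical names, and coordinates are assumed to have the origin at the top-left with y growing downward. Boxes are grouped into columns wherever a horizontal gap separates them, then each column is read top to bottom.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical text block with a bounding box (origin top-left, y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def sort_reading_order(boxes: List[Box]) -> List[Box]:
    """Group boxes into columns by horizontal gaps, then read each column top to bottom."""
    columns: List[List[Box]] = []
    right_edge = float("-inf")  # right-most x1 seen in the current column
    for box in sorted(boxes, key=lambda b: b.x0):
        if not columns or box.x0 >= right_edge:
            # The box starts at or beyond the current column's right edge: new column.
            columns.append([box])
            right_edge = box.x1
        else:
            # The box overlaps the current column horizontally: same column.
            columns[-1].append(box)
            right_edge = max(right_edge, box.x1)

    ordered: List[Box] = []
    for column in columns:  # columns are already left to right
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


two_column_page = [
    Box("right-top", 320, 50, 580, 200),
    Box("left-bottom", 30, 220, 290, 400),
    Box("left-top", 30, 50, 290, 200),
    Box("right-bottom", 320, 220, 580, 400),
]
reading_order = [b.text for b in sort_reading_order(two_column_page)]
```

On a single-column page every box overlaps horizontally, so this reduces to a plain top-to-bottom sort. A full-width element such as a title bridges the gap and collapses everything into one column, which is one example of why a simple heuristic like this can be fragile.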

@cragwolfe
Contributor

Yes, there was a significant improvement in reflecting "natural reading order" in #1161, which was released in 0.10.6.

However, there are a couple of open issues to further improve on this: #1233 and #1209.
