partition_pdf should support documents in two column format #356
Comments
Does this issue also imply adding support for docs with more than 2 columns?
@hsm207 - Yes, this will support multicolumn (two or more columns). In the meantime, you can pass in
This will be huge. So excited for what this will unlock for putting scientific papers into LLM systems!
Hi,
In tracking the commit history in the Maybe @MthwRobinson or @qued can comment on the status of this. Is a more sophisticated sorting method the right approach, or is that too fragile? Sorting seems to be required no matter what, especially if the text is extracted directly from a PDF. @MthwRobinson was on the right track, and I think a few more cracks at "smart sorting" might yield a solution that at least does no harm to single-column layouts while also handling a larger set of multi-column PDFs. Is there some other means of solving the multi-column problem that is being discussed internally but has not yet been shared? I wanted to ask first before digging any deeper myself.
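To make the "smart sorting" idea concrete, here is a minimal sketch of one possible column-aware ordering. It is not the approach used in partition_pdf; the `Box` class, `column_bands`, and `reading_order` are hypothetical names, and the sketch assumes each element has a bounding box with y growing downward. Elements whose x-ranges overlap are merged into one column band, so a single-column page collapses to one band and keeps a plain top-to-bottom order.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a layout element; not a library class."""
    text: str
    x0: float
    y0: float  # top edge; y grows downward in this sketch
    x1: float
    y1: float


def column_bands(boxes):
    """Merge horizontally overlapping x-ranges into column bands."""
    bands = []
    for x0, x1 in sorted((b.x0, b.x1) for b in boxes):
        if bands and x0 <= bands[-1][1]:
            # Overlaps the current band: widen it if needed.
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            # A horizontal gap separates this range: start a new band.
            bands.append([x0, x1])
    return bands


def reading_order(boxes):
    """Sort by column band (left to right), then top to bottom within a band."""
    bands = column_bands(boxes)

    def key(b):
        col = next(i for i, (lo, hi) in enumerate(bands) if lo <= b.x0 <= hi)
        return (col, b.y0, b.x0)

    return sorted(boxes, key=key)


two_col = [
    Box("R2", 320, 200, 550, 250),
    Box("L1", 50, 100, 280, 150),
    Box("R1", 320, 100, 550, 150),
    Box("L2", 50, 200, 280, 250),
]
print([b.text for b in reading_order(two_col)])
```

One known limitation of this sketch: a full-width element such as a title overlaps every column, so all bands merge into one and the result falls back to top-to-bottom, left-to-right order. A sturdier version would need to treat wide elements separately.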
The goal of this issue is to add support for two column documents to partition_pdf. Currently, partition_pdf will process the documents but the elements are not always in the correct order. The attached document can be used for testing.
Dense-Passage-Retrieval-for-Open-Domain-Question-Answering.pdf
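As an illustration of how elements can end up out of order on a two column page, the sketch below sorts made-up elements top to bottom and then left to right, ignoring columns. This is an assumption about one plausible cause, not a description of what partition_pdf does internally; `naive_order` and the coordinates are hypothetical.

```python
# Hypothetical (text, x0, y0) triples: left column at x=50, right column at x=320.
elements = [
    ("left para 1", 50, 100),
    ("left para 2", 50, 200),
    ("right para 1", 320, 100),
    ("right para 2", 320, 200),
]


def naive_order(items):
    """Sort top to bottom, then left to right, with no notion of columns."""
    return [text for text, x0, y0 in sorted(items, key=lambda e: (e[2], e[1]))]


print(naive_order(elements))
# ['left para 1', 'right para 1', 'left para 2', 'right para 2']
```

The two columns are interleaved line by line, whereas a reader would finish the left column before starting the right one.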