`partition_pdf` should support documents in two column format #356

MthwRobinson · 2023-03-10T17:20:41Z

The goal of this issue is to add support for two column documents to `partition_pdf`. Currently, `partition_pdf` will process the documents but the elements are not always in the correct order. The attached document can be used for testing.
Dense-Passage-Retrieval-for-Open-Domain-Question-Answering.pdf


hsm207 · 2023-05-12T17:42:13Z

Does this issue also imply adding support for docs with more than 2 columns?

MthwRobinson · 2023-05-12T17:44:19Z

@hsm207 - Yes, this will support multicolumn (two or more columns). In the meantime, you can pass in `strategy="fast"` (if the text is extractable) or `strategy="ocr_only"` (if the text is not extractable) for better results on multi-column documents.
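For anyone trying the workaround, here is a minimal sketch of how the two suggested strategies might be selected. `choose_strategy` and `partition_multicolumn_pdf` are hypothetical helper names made up for illustration, and whether the text is extractable is assumed to be something the caller already knows; only `partition_pdf` and its `filename` and `strategy` arguments come from the library.

```python
def choose_strategy(text_is_extractable: bool) -> str:
    """Pick the strategy suggested above (hypothetical helper, not part of unstructured)."""
    # "fast" reads the embedded text layer; "ocr_only" falls back to OCR.
    return "fast" if text_is_extractable else "ocr_only"


def partition_multicolumn_pdf(filename: str, text_is_extractable: bool):
    """Partition a multi-column PDF with the suggested strategy (hypothetical wrapper)."""
    # Imported lazily so the helper above can be used without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=filename,
        strategy=choose_strategy(text_is_extractable),
    )
```

Calling `partition_multicolumn_pdf("paper.pdf", text_is_extractable=True)` would then return the list of elements; `paper.pdf` is a placeholder path.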

CShorten · 2023-05-23T16:30:40Z

This will be huge, so excited for what this will unlock with putting scientific papers into LLM systems!

Kate-Lyndegaard · 2023-07-05T08:17:20Z

Hi,
Is this improvement available yet?

walsha2 · 2023-09-07T23:32:55Z

> Is this improvement available yet?

Tracking the commit history in the unstructured-inference repo, it looks like a promising first attempt was added in #112 and reverted in #120, with no action since then, I presume?

Maybe @MthwRobinson or @qued can comment on the status of this. Is a more sophisticated sorting method the right approach, or is that too fragile? Seems like sorting is required no matter what, especially if the text is directly extracted from a PDF.

@MthwRobinson was on the right track, and I think a few more cracks at "smart sorting" might yield a solution that at least does no harm to single-column layouts while also addressing a larger set of multi-column PDFs.

Not sure if there is some other means of solving the multi-column problem that is being tossed around internally but has yet to be disseminated? Wanted to ask first before digging any deeper myself.
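To make the "smart sorting" idea concrete, here is a rough sketch of one possible column-aware ordering. It is not the approach from #112 or anything in unstructured-inference; the `Box` type, `group_into_columns`, and `reading_order` are hypothetical names, and the sketch assumes each element has a bounding box with `y` increasing down the page. Boxes that overlap horizontally are grouped into one column, so a single-column page collapses to a plain top-to-bottom sort.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical element bounding box; y grows downward (an assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def group_into_columns(boxes: List[Box]) -> List[List[Box]]:
    """Greedily group boxes whose horizontal extents overlap, left to right."""
    columns: List[List[Box]] = []
    right_edge = float("-inf")  # right edge of the column currently being built
    for box in sorted(boxes, key=lambda b: b.x0):
        if columns and box.x0 < right_edge:
            # Overlaps the current column horizontally, so it belongs to it.
            columns[-1].append(box)
            right_edge = max(right_edge, box.x1)
        else:
            # A horizontal gap means a new column starts here.
            columns.append([box])
            right_edge = box.x1
    return columns


def reading_order(boxes: List[Box]) -> List[Box]:
    """Read each column top to bottom, columns left to right."""
    ordered: List[Box] = []
    for column in group_into_columns(boxes):
        ordered.extend(sorted(column, key=lambda b: (b.y0, b.x0)))
    return ordered
```

A known limitation of this sketch: an element that spans the full page width (a title, say) overlaps every column and merges them into one, so a real solution would need to treat such elements separately.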

cragwolfe · 2023-09-08T04:26:27Z

Yes, there was a significant improvement in reflecting "natural reading order" in #1161, which was released in 0.10.6.

However, there are a couple of open issues to further improve on this: #1233 and #1209.

MthwRobinson added the enhancement (New feature or request) and python (Pull requests that update Python code) labels Mar 10, 2023

MthwRobinson assigned mallorih Mar 10, 2023

MthwRobinson added the reviewed label Mar 10, 2023

MthwRobinson removed the reviewed label Mar 28, 2023

MthwRobinson assigned MthwRobinson and unassigned mallorih May 18, 2023

This was referenced May 18, 2023

fix: correct column ordering for multi-column documents Unstructured-IO/unstructured-inference#112

Merged

release: pull in inference updates; update docs; version bump for release #613

Closed

build(deps): bump unstructured_inference to 0.5.0 #618

Closed

h0lybyte mentioned this issue Sep 4, 2023

[Concept] : Weaviate Migration KBVE/kbve.com#788

Open

walsha2 mentioned this issue Sep 7, 2023

bug: PDF with multiple columns not partitioned correctly #1336

Closed

SimranJha2408 closed this as completed Dec 18, 2023
