Detect tables automagically when Stream is used #102

Closed
vinayak-mehta opened this issue Sep 11, 2018 · 10 comments

@vinayak-mehta
Contributor

By default, Stream currently treats the whole page as a table, which fails when there are two or more Stream-type tables on the same page with different numbers of columns. Change the default to fallback.

12s0324.pdf and how Tabula does this should be a good start.

@vinayak-mehta
Contributor Author

Tabula has an implementation based on Anssi Nurminen's master's thesis; I'm starting from there.

@imri

imri commented Nov 22, 2018

Hi there, thanks for this library! :)

Regarding table detection algorithms - I know that Tabula uses Nurminen's algorithm, but I was wondering - is it the best algorithm that's out there? Do you guys know of any other / better ones?

Thanks a lot - you guys rock!

@vinayak-mehta
Contributor Author

Hey @imri!

There has been a lot of research on detecting and extracting tables from PDFs. All the approaches I've seen are heuristics, and none of them gives 100% table detection accuracy. pdf2table is where I started some time back, and I followed the citations and links from there.

I've seen Tabula and Nurminen's algorithm work really well on tables that don't have both vertical and horizontal ruling lines and instead rely on spaces to form the grid. As Tabula's author states in this comment (tabulapdf/tabula-java#49), it got really good results in ICDAR 2013 and passed most of the table detection tests that Tabula has. The ones it didn't pass were just corner cases, which could be extracted by specifying table areas or column separators.

Since that was 5 years ago, it's possible that other well-performing approaches have been devised. If you come across one, please let us know!
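
For the corner cases mentioned above, Camelot's Stream flavor accepts the same kind of hints through the `table_areas` and `columns` keyword arguments of `camelot.read_pdf`. The sketch below only builds those argument strings; the helper names, the coordinates, and the file name `two_tables.pdf` are made up for illustration.

```python
def to_table_area(x1, y1, x2, y2):
    """Format a box as a Camelot table_areas string.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner
    in PDF coordinate space (origin at the bottom-left), so y1 > y2.
    """
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected top-left and bottom-right corners")
    return f"{x1},{y1},{x2},{y2}"


def to_columns(xs):
    """Format column separator x-coordinates as a Camelot columns string."""
    return ",".join(str(x) for x in sorted(xs))


# Two hypothetical tables on one page, with different column counts.
areas = [to_table_area(50, 750, 300, 600), to_table_area(50, 550, 560, 400)]
column_seps = [to_columns([120, 210]), to_columns([100, 180, 260, 340, 450])]

# With Camelot installed, the strings would be passed like this:
# import camelot
# tables = camelot.read_pdf("two_tables.pdf", flavor="stream",
#                           table_areas=areas, columns=column_seps)
```

Each entry in `columns` pairs with the entry at the same position in `table_areas`, so the two lists should have the same length.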

@imri

imri commented Nov 24, 2018

Thanks for your reply.

Weird, because when I use it on the simplest document I have, it's not working well.
As far as I know, Nurminen's algorithm is part of Tabula's Autodetect Tables feature. When using it on this document: document_800_1

It resulted in the following:
screen shot 2018-11-24 at 6 11 26 pm

As you can see, it unified the 'From:' and the right table into one selection area, which is wrong.

Is this a corner case? Isn't it the simplest case there is?

Thanks,
Imri

@vinayak-mehta
Contributor Author

Nurminen's master's thesis states that after calculating left, middle and right text edges (which are vertical lines that pass through similarly aligned text), each text row is assigned a probability of being part of a table. The comment at NurminenDetectionAlgorithm.java#L237 says that Tabula uses a general heuristic instead: it tries to find the text edge type that intersects the most horizontal text rows and then generates table areas using those text edges.
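
A rough sketch of that "winning edge type" idea follows. It is a loose reading of the description above, not Tabula's or Camelot's actual code; the `(x0, x1, row)` input format, the `tol` and `min_rows` parameters, and the bucketing by `round(x / tol)` are all assumptions made for the example.

```python
from collections import defaultdict


def text_edges(words, tol=2.0):
    """Group word boundaries into left, middle and right text edges.

    words: iterable of (x0, x1, row) tuples, where row is a text row index.
    Returns {edge_type: {x_bucket: set_of_rows}}. Bucketing by
    round(x / tol) is a crude stand-in for real alignment clustering.
    """
    edges = {"left": defaultdict(set),
             "middle": defaultdict(set),
             "right": defaultdict(set)}
    for x0, x1, row in words:
        for kind, x in (("left", x0), ("middle", (x0 + x1) / 2), ("right", x1)):
            edges[kind][round(x / tol)].add(row)
    return edges


def winning_edge_type(words, min_rows=3, tol=2.0):
    """Pick the edge type whose edges intersect the most text rows.

    Only edges that touch at least min_rows rows are counted.
    """
    scores = {}
    for kind, by_x in text_edges(words, tol).items():
        scores[kind] = sum(len(rows) for rows in by_x.values()
                           if len(rows) >= min_rows)
    return max(scores, key=scores.get), scores


# Three rows, two left-aligned columns whose words have varying widths.
words = [
    (10, 40, 0), (100, 130, 0),
    (10, 55, 1), (100, 120, 1),
    (10, 30, 2), (100, 145, 2),
]
best, scores = winning_edge_type(words)
```

Here only the left edges line up across all three rows, so `best` is `"left"`; the middle and right positions differ from row to row and never reach `min_rows`.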

I'm guessing that Tabula extends the table areas by including text rows that share a y-axis overlap with the table areas, because that is what I've done in #206 (see the comment at L180), and the image that you showed is the kind of result I expect using stream. I went with this approach to cover the cases where table columns have different alignments. For example: if a table has 4 columns that are left-aligned and 3 columns that are right-aligned, the left-aligned columns would win the majority due to the sum of their intersections with horizontal text rows, leading to the table area being limited to only the left-aligned columns.

However, in the case that you showed, lattice should be able to work perfectly since the table cells are separated by lines.
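
The y-overlap extension described above can be sketched as follows. This is an illustration of the idea only, not the code from #206; the `(x0, y0, x1, y1)` box format (bottom-left and top-right corners, PDF coordinates) and the sample numbers are assumptions.

```python
def extend_area(area, boxes):
    """Widen a table area to cover text boxes that overlap it vertically.

    area and boxes use (x0, y0, x1, y1) with y0 as the bottom and y1 as
    the top. Only the horizontal extent grows; the vertical range that
    decides the overlap stays fixed.
    """
    ax0, ay0, ax1, ay1 = area
    for bx0, by0, bx1, by1 in boxes:
        if by0 < ay1 and by1 > ay0:  # vertical ranges intersect
            ax0 = min(ax0, bx0)
            ax1 = max(ax1, bx1)
    return (ax0, ay0, ax1, ay1)


# A table detected on the right half of the page.
table = (300, 500, 550, 700)
boxes = [
    (40, 650, 120, 662),  # a "From:" block to the left, level with the table
    (40, 100, 200, 112),  # a footer far below the table
]
extended = extend_area(table, boxes)
```

The block beside the table is pulled into the area while the footer is ignored, which matches the merged selection in the screenshot above.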

@imri

imri commented Nov 25, 2018

Interesting.
Is it possible to detect (and extract) tables using both Stream and Lattice together?

When you say that lattice should work perfectly - I sort of wish to create a generic way to detect and extract tables without having to know which detection method (lattice / stream) is best for a given document - I want to decouple them as much as possible.

Imri

@vinayak-mehta
Contributor Author

vinayak-mehta commented Dec 1, 2018

> Is it possible to detect (and extract) tables using both Stream and Lattice together?

I get your use case; it is not currently possible through the library itself. But I see two possibilities that could be implemented (both heuristics):

  1. As far as I can tell from NurminenDetectionAlgorithm.java, Tabula first filters out all Lattice-type tables from the document and then looks for Stream-type tables, till it cannot find any more tables. Similarly, we can "couple" both flavors into a single one inside Camelot.

  2. We can create a flavor called guess which automatically chooses between Lattice and Stream.
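
Until something like this exists in the library, the second option can be approximated from user code. The sketch below tries Lattice first and falls back to Stream when no tables are found; the function name `read_with_guess` is hypothetical, and the reader is passed in as an argument (expected to be `camelot.read_pdf`) so that the example runs without Camelot or a PDF.

```python
def read_with_guess(path, read_pdf, pages="1"):
    """Try the lattice flavor first, then fall back to stream.

    read_pdf is expected to behave like camelot.read_pdf: it takes
    (path, pages=..., flavor=...) and returns a sized collection of tables.
    Returns (flavor_used, tables).
    """
    tables = read_pdf(path, pages=pages, flavor="lattice")
    if len(tables) > 0:
        return "lattice", tables
    return "stream", read_pdf(path, pages=pages, flavor="stream")


def fake_read_pdf(path, pages="1", flavor="lattice"):
    """Stand-in reader: pretends the page has no ruled tables."""
    return [] if flavor == "lattice" else ["table-1", "table-2"]


flavor, tables = read_with_guess("report.pdf", fake_read_pdf)
```

With Camelot installed, the call would be `read_with_guess("report.pdf", camelot.read_pdf)`. Note that this picks one flavor per call; it does not cover pages that mix ruled and unruled tables, which is what the first option is about.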

@vinayak-mehta
Contributor Author

@imri Let's continue the conversation on the issue I just opened.

@abhilashabhardwaj

@vinayak-mehta, I was wondering if any code has been merged for the 'guess' flavor?
I'm having trouble getting the following identified as a table.
Stream gives the entire page as a table.
edge_tol doesn't work.
I can't use visual debugging to identify coordinates at run time.

image

@ShanksDS

ShanksDS commented Oct 15, 2020

Hi @vinayak-mehta
Given that Camelot is the best for these things, I am trying to extract tables from a huge set of PDFs that look like this:

There's a table in red, then one in blue, and then a third table in green that starts and extends onto the next page. Some pages have 2 tables, some have 1, and so on. The internals of one table (like the blue one) won't necessarily match those of the others (like the red one), either.
I have to automate this and run it on several hundred docs.
I've tried several iterations of the options that were provided, but none seem to work. Does Camelot have any option that I could use? Presently I'm using edge_tol, Col_tol, row_tol and flavor = 'stream'.

image
