Make camelot search for tables in certain page regions #209

Closed
anakin87 opened this issue Nov 28, 2018 · 11 comments

@anakin87

I'm trying to automatically detect and extract tables encapsulated in other tables.

I would like Camelot to search within a certain area. This is not the table area itself but the region of the page where the table resides (see the attached image).
[Attached image: cattura]

How can I make Camelot work this way?
Ideas for development are welcome.

@vinayak-mehta
Contributor

Hi @anakin87! You can specify table areas in read_pdf using the table_areas kwarg. For more information on usage, check out the docs. Please comment if you face any problems.
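
For reference, a minimal sketch of the table_areas usage mentioned above. It assumes Camelot is installed; the helper names, file name, and coordinates are placeholders for illustration and do not come from this thread. Each area is a string of the form "x1,y1,x2,y2", where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space.

```python
def area_string(x1, y1, x2, y2):
    """Format a box as 'x1,y1,x2,y2' (top-left corner, then bottom-right)."""
    return f"{x1},{y1},{x2},{y2}"


def read_tables_in_areas(pdf_path, boxes, pages="1"):
    """Extract tables whose boundaries are given by `boxes` (hypothetical helper)."""
    import camelot  # imported lazily; requires Camelot to be installed

    areas = [area_string(*box) for box in boxes]
    return camelot.read_pdf(pdf_path, pages=pages, flavor="lattice", table_areas=areas)


# Example with placeholder values (not taken from the PDF in this issue):
# tables = read_tables_in_areas("example.pdf", [(316, 499, 566, 337)])
# print(tables[0].df)
```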

@anakin87
Author

anakin87 commented Dec 3, 2018

If I provide table_areas, Camelot interprets them as specific table coordinates.

My problem is that I want to search for tables in a specific area of the page, but I don't know the specific table coordinates. How can I handle this?

@vinayak-mehta
Contributor

I get the issue now. Camelot treats the passed table areas as actual boundaries of the table. This can be an enhancement where the user can pass a table_region so that camelot only processes the text and lines inside the region to form a table. Reopening this.
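
The idea of processing only the text and lines inside a region can be sketched as a simple containment filter. This is an illustration of the concept only, not Camelot's actual implementation; the function names and the sample boxes are hypothetical. Boxes are (x1, y1, x2, y2) with (x1, y1) top-left and (x2, y2) bottom-right in PDF space, where y grows upward.

```python
def inside(box, region):
    """Return True if `box` lies entirely within `region`."""
    bx1, by1, bx2, by2 = box
    rx1, ry1, rx2, ry2 = region
    # x must fall between the region's left and right edges; y between its
    # bottom (ry2) and top (ry1), because y grows upward in PDF space.
    return rx1 <= bx1 and bx2 <= rx2 and ry2 <= by2 and by1 <= ry1


def filter_to_region(boxes, region):
    """Keep only the elements (text or line boxes) found inside the region."""
    return [b for b in boxes if inside(b, region)]


region = (100, 700, 500, 300)
elements = [
    (120, 650, 300, 600),  # inside the region
    (50, 750, 90, 720),    # outside: left of and above the region
    (320, 500, 480, 350),  # inside the region
]
kept = filter_to_region(elements, region)
```

Table detection would then run on the kept elements only, so a single region may still yield more than one table.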

vinayak-mehta reopened this Dec 4, 2018
vinayak-mehta changed the title from "How to make camelot search for tables in certain area (not TABLE_AREAS)?" to "Make camelot search for tables in certain page regions" Dec 4, 2018
vinayak-mehta added this to the v0.8.0 milestone Dec 4, 2018
vinayak-mehta modified the milestones: v0.8.0, v0.7.0 Dec 20, 2018
@vinayak-mehta
Contributor

@anakin87 Can you post a link to that PDF?

@anakin87
Author

anakin87 commented Jan 3, 2019

PIR_Prospetto dOfferta.pdf

I would like to search for tables in a certain region of the page, in order to extract only true tables and not tables that are layout elements.

@vinayak-mehta
Contributor

@anakin87 Thanks for reporting this issue. The current table_areas kwarg for Lattice hardcodes the coordinates of the table boundary, which leads to unwanted text in the extracted table and forces the user to note the exact coordinates while debugging visually. That should not be the case: table_areas should just guide camelot to analyze only that part of the page to find tables using Lattice and Stream.

This is a behavioral bug; I'll push a fix today.

@anakin87
Author

anakin87 commented Jan 3, 2019

I think both options are useful:

  • the user can set the coordinates of the table boundary (current behaviour of table_areas)
  • the user can specify a region in which to search for tables

@vinayak-mehta
Contributor

Hmm, I guess keeping them separate makes sense since a table region could contain two or more table areas too.

@vinayak-mehta
Contributor

@anakin87 Check out the docs for usage details.
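
A hedged sketch of what the resulting usage might look like. It assumes the new keyword is named table_regions (the thread itself only says table_region) and that it accepts the same "x1,y1,x2,y2" strings as table_areas; the helper names, file name, and coordinates are placeholders.

```python
def region_kwargs(regions):
    """Build the keyword argument for region-restricted extraction.

    Assumption: the kwarg is called `table_regions` and takes a list of
    'x1,y1,x2,y2' strings (top-left corner, then bottom-right).
    """
    return {"table_regions": [",".join(str(v) for v in r) for r in regions]}


def read_tables_in_regions(pdf_path, regions, pages="1", flavor="lattice"):
    """Search for tables only inside the given page regions (hypothetical helper)."""
    import camelot  # imported lazily; requires Camelot to be installed

    return camelot.read_pdf(pdf_path, pages=pages, flavor=flavor, **region_kwargs(regions))


# Example with placeholder values:
# tables = read_tables_in_regions("example.pdf", [(170, 370, 560, 270)])
# for table in tables:
#     print(table.df)
```

Unlike table_areas, the region here is only a search window, so the tables found inside it keep their own detected boundaries.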

@anakin87
Author

anakin87 commented Jan 4, 2019

Great!!!

@gyan7611

How do you get the coordinates to pass as the argument to table_areas?
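
One possible approach, sketched under assumptions: Camelot's plotting helper, camelot.plot(table, kind="text"), draws the page text on axes in PDF coordinates, so the corners of an area can be read off the plot. If the coordinates were instead measured on a rendered page image, whose origin is at the top-left, they would need to be scaled and flipped into PDF space, whose origin is at the bottom-left. The function names and sizes below are hypothetical.

```python
def image_to_pdf_point(px, py, image_size, pdf_size):
    """Map a pixel (origin top-left) to PDF space (origin bottom-left)."""
    img_w, img_h = image_size
    pdf_w, pdf_h = pdf_size
    x = px * pdf_w / img_w
    y = pdf_h - py * pdf_h / img_h  # flip the vertical axis
    return x, y


def image_box_to_area(left, top, right, bottom, image_size, pdf_size):
    """Convert a pixel box into an 'x1,y1,x2,y2' area string."""
    x1, y1 = image_to_pdf_point(left, top, image_size, pdf_size)
    x2, y2 = image_to_pdf_point(right, bottom, image_size, pdf_size)
    return f"{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}"


# A 612x792 pt page rendered at twice its size (1224x1584 px):
area = image_box_to_area(200, 300, 1000, 900, (1224, 1584), (612, 792))
```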
