Specify Table Areas Returns Full Page #149

Closed
Tavisius25 opened this issue Oct 15, 2018 · 5 comments

@Tavisius25

Following the advanced usage section of the documentation, I would like to restrict table extraction to a portion of a page using the stream flavor. I am using the 3rd page of the following PDF:
SziniczToxicol.pdf

I read the PDF like this:
table = camelot.read_pdf('SziniczToxicol.pdf', pages='3', flavor='stream', flag_size=True)

Then I visualize the text to understand the table boundaries:
table[0].plot('text')

From the plot, I estimated the upper-left and bottom-right corners of the table to be (79,727) and (537,383) respectively.

Now I attempt to parse this section along with column separators (353 and 474):
table2 = camelot.read_pdf('SziniczToxicol.pdf', pages='3', flavor='stream', table_areas=['79,727,537,384'], columns=['353,473'], flag_size=True)

The attached output CSV file includes text beyond my selection; in fact, it seems to be the full page in 3-column format. Is this due to stream treating the whole page as one table? Am I specifying my selected area correctly? Any help would be great. Thanks for making this great tool.
Toxicol-page-3-table-1.zip
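
For readers following along, here is a minimal sketch of how the area and column strings in the question are put together. It assumes the format described in the Camelot docs: an area is written as x1,y1,x2,y2 with (x1, y1) the top-left and (x2, y2) the bottom-right corner in PDF coordinate space, where the origin is the bottom-left of the page. The helper names `area_string` and `columns_string` are hypothetical and not part of Camelot; the numbers are the estimates quoted in the prose above.

```python
def area_string(top_left, bottom_right):
    """Format a table area as 'x1,y1,x2,y2' (hypothetical helper, not Camelot API).

    PDF coordinate space has its origin at the bottom-left of the page,
    so the top-left corner must carry the larger y value.
    """
    x1, y1 = top_left
    x2, y2 = bottom_right
    if not (x1 < x2 and y1 > y2):
        raise ValueError(
            "expected top-left (x1, y1) and bottom-right (x2, y2) "
            "with x1 < x2 and y1 > y2"
        )
    return f"{x1},{y1},{x2},{y2}"


def columns_string(separators):
    """Join column separator x-coordinates, sorted left to right (hypothetical helper)."""
    return ",".join(str(x) for x in sorted(separators))


# Corner and separator estimates taken from the question text.
area = area_string((79, 727), (537, 383))
cols = columns_string([353, 474])
print(area)
print(cols)
```

Both strings are passed inside lists, one entry per table, as the calls in this thread do. A swapped corner (for example a bottom-left point given first) raises ValueError here instead of producing an area that quietly selects the wrong region.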

@vinayak-mehta
Contributor

vinayak-mehta commented Oct 16, 2018

Looks like a bug, let me look into this.

@charles-haynes

charles-haynes commented Oct 22, 2018

I'm having the same issue, can reproduce it using the example in the docs:

https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas

tables = camelot.read_pdf('table_areas.pdf', flavor='stream', table_areas=['316,499,566,337'])
tables[0].df

returns the entire page.

@vinayak-mehta
Contributor

Sorry for the late response on this and sorry again for a typo in the docs. The keyword argument to specify table areas is table_area and not table_areas. Though now that I think of it, table_areas sounds more right. I've fixed the docs.

Will change it to table_areas in a later release.
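
Since the keyword is table_area now and is planned to become table_areas, code that has to run against both can pick the name from the installed version. The following is only a sketch under assumptions: the helpers `parse_version`, `area_keyword`, and `stream_kwargs` are hypothetical, and the release in which the rename lands is not stated in this thread, so `RENAMED_IN` is a placeholder to check against the changelog.

```python
def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints (assumes purely numeric parts)."""
    return tuple(int(part) for part in text.split(".")[:3])


def area_keyword(installed, renamed_in):
    """Return the keyword name to use for table areas (hypothetical helper).

    Versions before `renamed_in` use 'table_area'; later ones 'table_areas'.
    """
    if parse_version(installed) >= parse_version(renamed_in):
        return "table_areas"
    return "table_area"


def stream_kwargs(area, installed, renamed_in):
    """Build keyword arguments for a stream-flavor read limited to one area."""
    return {"flavor": "stream", area_keyword(installed, renamed_in): [area]}


# ASSUMPTION: placeholder release number, not confirmed anywhere in this thread.
RENAMED_IN = "0.4.0"

kwargs = stream_kwargs("316,499,566,337", "0.2.3", RENAMED_IN)
print(kwargs)

# With Camelot installed, the call would look roughly like this
# (assuming camelot.__version__ is available):
#   import camelot
#   tables = camelot.read_pdf(
#       "table_areas.pdf",
#       **stream_kwargs("316,499,566,337", camelot.__version__, RENAMED_IN),
#   )
```

The symptom reported above (the full page coming back) suggests a misnamed keyword is not rejected, so checking the name up front is cheaper than inspecting the output afterwards.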

@felipeacsi

It works. Thank you!

@cfrejlach

table_area still reads the whole page.
