Given a document, how to ignore the header and set the columns of a table? #13
Comments
For that type of table, you should use
Indeed @chezou, I know that this is related to the area or column options. I looked through the docs, but unfortunately I did not understand how to use such parameters in my case. Could you provide an example of how to use them? For instance, I tried this:
But it still didn't work.
In short, you can extract with
In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4
How to use
@chezou thanks for the help! It would be worth adding this information to the docs!
The spreadsheet flag did the trick. Thanks a lot @chezou
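For anyone landing here later, here is a minimal sketch of the area-plus-spreadsheet call from this thread. It assumes that `area` is ordered (top, left, bottom, right) in PDF points and that recent tabula-py releases name the old `spreadsheet` flag `lattice`. `build_read_options` and `extract_table` are hypothetical helpers for illustration, not part of tabula-py.

```python
def build_read_options(area, lattice=True):
    """Return keyword arguments for tabula.read_pdf (hypothetical helper).

    `area` is assumed to be (top, left, bottom, right) in PDF points.
    """
    top, left, bottom, right = (float(v) for v in area)
    if not (top < bottom and left < right):
        raise ValueError("area must be (top, left, bottom, right)")
    # Assumption: `lattice` is the newer name for the `spreadsheet` flag.
    return {"area": [top, left, bottom, right], "lattice": lattice}


def extract_table(pdf_path, area):
    # Imported lazily so build_read_options works without tabula-py or Java.
    import tabula
    return tabula.read_pdf(pdf_path, **build_read_options(area))


options = build_read_options((337.29, 226.49, 472.85, 384.91))
print(options)
```

Checking the coordinate order up front catches the common mistake of swapping top/left with bottom/right before the Java process is even started.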
I'm trying to use tabula-py to import some info from PDF files, but I'm having problems with the argument 'column'. In my case, I need to use area (so far so good) and also column, since my data is not very well defined. The data I need is a 2-column 'table' positioned in a specific area of the PDF files. I used tabula to determine the positions and everything is working well, except for the column argument. I'm using it this way:
The problem is that the column argument seems to be ignored. I always get the same output (as if there were no column argument), no matter what number I use for 'column'.
@sfinotti Use
Just tried with "columns" and got the error: 'float' object is not iterable. Thanks a lot!
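The 'float' object is not iterable error suggests a single number was passed. As far as I know, `columns` expects an iterable of X coordinates marking column boundaries, so even one boundary has to be wrapped in a list. A small sketch, where `as_column_boundaries` is a hypothetical helper and the coordinate values are made up:

```python
from numbers import Real


def as_column_boundaries(columns):
    """Normalize `columns` to a sorted list of floats (hypothetical helper)."""
    if isinstance(columns, Real):
        # A lone number becomes a one-element list instead of raising.
        columns = [columns]
    return sorted(float(x) for x in columns)


print(as_column_boundaries(300.5))
print(as_column_boundaries([420, 300.5]))
```

The result can then be passed as `columns=as_column_boundaries(...)` alongside `area`. Note that these are positions on the page, not column names.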
I am working with a PDF very similar to this document:
As you can see, the above document has a header. When I try to use tabula-py to extract it, I get everything merged into a single column:
In:
df = read_pdf_table('file.pdf')
Out:
Thus, my question is: how can I ignore the header and get the content of the table? I also tried with these options:
In:
df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
Out:
Nevertheless, it did not work.