Given a document, how to ignore the header and set the columns of a table? #13

Closed
alonsopg opened this issue Jan 17, 2017 · 8 comments

@alonsopg

alonsopg commented Jan 17, 2017

I am working with a PDF very similar to this document:

[Image: screenshot, 2017-01-17 at 11:43:37 a.m.]

As you can see, the above document has a header. When I try to use tabula-py to extract the table, I get everything merged into a single column:

In:

df = read_pdf_table('file.pdf')

Out:

[Image: screenshot, 2017-01-17 at 12:44:00 p.m.]

Thus, my question is: how can I ignore the header and get the content of the table? I also tried the columns option:

In:

df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

Out:


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-4-33ed930c5d2a> in <module>()
      6 
      7 df = read_pdf_table('file.pdf',
----> 8                    columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
          9 
         10 #df = read_pdf_table('/Users/user/Downloads/table.pdf')

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, **kwargs)
     45     args = ["java", "-jar", jar_path] + options + [input_path]
     46 
---> 47     output = subprocess.check_output(args)
     48 
     49     if len(output) == 0:

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    624 
    625     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 626                **kwargs).stdout
    627 
    628 

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    706         if check and retcode:
    707             raise CalledProcessError(retcode, process.args,
--> 708                                      output=stdout, stderr=stderr)
    709     return CompletedProcess(process.args, retcode, stdout, stderr)
    710 

CalledProcessError: Command '['java', '-jar', '/usr/local/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '--columns', 

However, it did not work.

@chezou
Owner

chezou commented Jan 18, 2017

For that type of table, you should use the columns or area option.
This issue may help you: tabulapdf/tabula-java#84

@alonsopg
Author

alonsopg commented Jan 18, 2017

Indeed @chezou, I know that this is related to the area or column options. I looked through the docs, but unfortunately I did not understand how to use these parameters in my case. Could you provide an example of how to use the column or area parameters for this case?

For instance I tried this:
In:

df = read_pdf_table('file.pdf', area = (269.875, 12.75, 790.5, 561))

But it still didn't work.

@chezou
Owner

chezou commented Jan 18, 2017

In short, you can extract the table with the area and spreadsheet options.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use the area option

The tabula-java wiki explains how to specify the area:
https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

Using macOS's Preview, I got the area information:

[Image]

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where the coordinates are obtained as follows:

Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width
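
As a small sketch of the calculation above, the conversion from a selection box to an area tuple could look like this in Python. The helper name area_from_box is hypothetical (it is not part of tabula-py), and the width and height values are back-calculated from the area used in this thread rather than read from the original screenshot.

```python
def area_from_box(left, top, width, height):
    """Convert a selection box into (y1, x1, y2, x2), the order used by -a / area."""
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


# Assumed box: values chosen so the result matches the area used in this thread.
area = area_from_box(left=226.49, top=337.29, width=158.42, height=135.56)
print(tuple(round(v, 2) for v in area))
```

The resulting tuple can then be passed as the area argument, or joined with commas for the -a flag of tabula-java.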

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without the -r option (same as --spreadsheet), it does not work properly.
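
For reference, here is a rough sketch of how the same command line could be assembled from Python, following the args = ["java", "-jar", jar_path] + options + [input_path] pattern visible in the traceback earlier in this thread. The function build_tabula_args is a hypothetical helper, not tabula-py's API; it only uses the -p, -g, -r and -a flags that already appear in this issue.

```python
def build_tabula_args(jar_path, pdf_path, area, pages=None, guess=True, spreadsheet=True):
    """Build the argument list for a tabula-java call (illustrative only)."""
    args = ["java", "-jar", jar_path]
    if pages is not None:
        args += ["-p", str(pages)]
    if guess:
        args.append("-g")
    if spreadsheet:
        # -r is the short form of --spreadsheet
        args.append("-r")
    # The area is passed as "y1,x1,y2,x2"
    args += ["-a", ",".join(str(v) for v in area)]
    args.append(pdf_path)
    return args


cmd = build_tabula_args(
    "./tabula/tabula-0.9.1-jar-with-dependencies.jar",
    "table.pdf",
    (337.29, 226.49, 472.85, 384.91),
)
print(" ".join(cmd))
```

The list could then be handed to subprocess.check_output(cmd), assuming Java and the jar are available at that path.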

@alonsopg
Author

alonsopg commented Jan 26, 2017

@chezou thanks for the help! It would be worth adding this information to the docs!

@jiteshm17

The spreadsheet flag did the trick. Thanks a lot @chezou

@sfinotti

sfinotti commented Dec 29, 2019

I'm trying to use tabula-py to import some info from PDF files, but I'm having problems with the argument 'column'. In my case, I need to use area (so far so good) and also column, since my data is not very well defined.

The data I need is a 2-column 'table' positioned in a specific area of the PDF files. I used tabula to determine the positions and everything is working well, except for the column argument.

I'm using it this way:

def le_2(directory,tab_def,col_def):
    demonstrativos = []
    for filename in os.listdir(directory):
        demo_mes = read_pdf(f"{directory}/{filename}", area=tab_def, columns=col_def, pandas_options={'header':None}, spread=True, guess=False)
        demonstrativos.append(demo_mes)

    return demonstrativos

tab_def=(148.378,13.016,410.922,253.991)
col_def=(186.681)
demo1 = le_2("2019/t1", tab_def, col_def)

The problem is that the column argument seems to be ignored. I always get the same output (as if there were no column argument), no matter what number I use for 'column'.

@chezou
Owner

chezou commented Jan 1, 2020

@sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.

@sfinotti

sfinotti commented Jan 2, 2020

> @sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.

I just tried with "columns" and got the error: 'float' object is not iterable
Then, after changing "col_def" from =(186.681) to =(186.681,), it worked.
So, even if you have only ONE column delimiter, it's necessary to add a "," at the end.

Thanks a lot !!!
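
To illustrate the trailing-comma point above: in Python, parentheses alone do not create a tuple, so (186.681) is just a float, which explains the 'float' object is not iterable error. The sketch below shows the difference and a small defensive helper. The name as_column_list is hypothetical and not part of tabula-py.

```python
def as_column_list(columns):
    """Accept a bare number or an iterable of numbers and return a list of floats."""
    if isinstance(columns, (int, float)):
        return [float(columns)]
    return [float(c) for c in columns]


print(type((186.681)))   # <class 'float'>: parentheses alone do not make a tuple
print(type((186.681,)))  # <class 'tuple'>: the trailing comma does

# Both spellings normalize to the same one-element list.
print(as_column_list((186.681)))
print(as_column_list((186.681,)))
```

Passing a list such as [186.681] avoids the ambiguity altogether.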
