Given a document, how can I ignore the header and set the columns of a table? #13

alonsopg · 2017-01-17T17:55:38Z

I am working with a PDF very similar to this document:

As you can see, the document above has a header. When I try to extract the table with tabula-py, everything ends up merged into a single column:

In:

df = read_pdf_table('file.pdf')

Out:

So my question is: how can I ignore the header and get only the content of the table? I also tried the columns option:

In:

df = read_pdf_table('file.pdf', columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

Out:


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-4-33ed930c5d2a> in <module>()
      6 
      7 df = read_pdf_table('file.pdf',
----> 8                    columns = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
          9 
         10 #df = read_pdf_table('/Users/user/Downloads/table.pdf')

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, **kwargs)
     45     args = ["java", "-jar", jar_path] + options + [input_path]
     46 
---> 47     output = subprocess.check_output(args)
     48 
     49     if len(output) == 0:

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    624 
    625     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 626                **kwargs).stdout
    627 
    628 

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    706         if check and retcode:
    707             raise CalledProcessError(retcode, process.args,
--> 708                                      output=stdout, stderr=stderr)
    709     return CompletedProcess(process.args, retcode, stdout, stderr)
    710 

CalledProcessError: Command '['java', '-jar', '/usr/local/lib/python3.5/site-packages/tabula/tabula-0.9.1-jar-with-dependencies.jar', '--pages', '1', '--guess', '--columns',

However, that did not work either.


chezou · 2017-01-18T00:47:45Z

For that type of table, you should use the columns or area option.
This issue may help you: tabulapdf/tabula-java#84

alonsopg · 2017-01-18T01:02:02Z

Indeed @chezou, I know that this is related to the area or columns options. I looked through the docs, but unfortunately I did not understand how to use those parameters in my case. Could you provide an example of how to use the columns or area parameters for this case?

For instance I tried this:
In:

df = read_pdf_table('file.pdf', area = (269.875, 12.75, 790.5, 561))

But it still didn't work.

chezou · 2017-01-18T01:51:48Z

In short, you can extract the table with the area and spreadsheet options.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use `area` option

The tabula-java wiki explains how to specify the area:
https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

Using macOS's Preview, I got the area information:

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

given

Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without the -r option (same as --spreadsheet), it does not work properly.
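
The calculation above can be sketched in Python. This is only an illustration: `area_from_box` is a hypothetical helper name, not part of tabula-py, and the width and height used below are back-computed from the area in this comment rather than read from the PDF.

```python
def area_from_box(left, top, width, height):
    """Convert a left/top/width/height selection (in PDF points)
    into the (y1, x1, y2, x2) order that the area option expects."""
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


# Illustrative values only (assumed, derived from the area used above).
area = area_from_box(left=226.49, top=337.29, width=158.42, height=135.56)

# Round for display so floating-point noise does not clutter the output.
print(tuple(round(v, 2) for v in area))
```

The resulting tuple can then be passed as the `area` argument, in the same way as the hard-coded tuple in the `read_pdf` call earlier in this comment.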

alonsopg · 2017-01-26T13:47:19Z

@chezou thanks for the help! It would be worth adding this information to the docs!

jiteshm17 · 2019-06-23T14:52:00Z

The spreadsheet flag did the trick. Thanks a lot @chezou

sfinotti · 2019-12-29T21:25:40Z

I'm trying to use tabula-py to import some info from PDF files, but I'm having problems with the argument 'column'. In my case, I need to use area (so far so good) and also column, since my data is not very well defined.

The data I need is a 2-column 'table' positioned in a specific area of the PDF files. I used tabula to determine the positions and everything is working well, except for the column argument.

I'm using it this way:

import os

from tabula import read_pdf

def le_2(directory,tab_def,col_def):
    demonstrativos = []
    for filename in os.listdir(directory):
        demo_mes = read_pdf(f"{directory}/{filename}", area=tab_def, columns=col_def, pandas_options={'header':None}, spread=True, guess=False)
        demonstrativos.append(demo_mes)

    return demonstrativos

tab_def=(148.378,13.016,410.922,253.991)
col_def=(186.681)
demo1 = le_2("2019/t1", tab_def, col_def)

The problem is that the column argument seems to be ignored. I always get the same output (as if there were no column argument), no matter what number I use for 'column'.

chezou · 2020-01-01T06:39:49Z

@sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.
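
As a rough sketch of that constraint, the options can be assembled and checked before calling tabula-py. `build_read_pdf_options` is a hypothetical helper, not part of the library, and the keyword names `lattice`, `guess`, `area`, and `columns` are assumed to match the installed tabula-py version (this thread's older examples use `spreadsheet` for the same mode).

```python
def build_read_pdf_options(area=None, columns=None, lattice=False):
    """Hypothetical helper: build keyword arguments for tabula.read_pdf
    and refuse the combination described above."""
    if lattice and columns is not None:
        raise ValueError("the columns option does not work with lattice mode")
    # guess=False so the explicit area/columns are not overridden by detection.
    options = {"lattice": lattice, "guess": False}
    if area is not None:
        options["area"] = tuple(area)        # (y1, x1, y2, x2)
    if columns is not None:
        options["columns"] = list(columns)   # X coordinates of column boundaries
    return options


options = build_read_pdf_options(
    area=(148.378, 13.016, 410.922, 253.991),
    columns=[186.681],
)
# With tabula-py installed, this would be used as:
#   tabula.read_pdf("file.pdf", **options)
```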

sfinotti · 2020-01-02T16:53:49Z

> @sfinotti Use columns instead of column. Note that the columns option doesn't work with lattice mode.

I just tried with "columns" and got the error: 'float' object is not iterable
Then, after changing "col_def" from =(186.681) to =(186.681,), it worked out.
So, even if you have only ONE column delimiter, it's necessary to add a "," at the end.

Thanks a lot !!!
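
For anyone hitting the same error, here is a small plain-Python illustration of why the trailing comma matters. `as_columns` is a hypothetical helper, not something tabula-py provides.

```python
single = (186.681)      # parentheses alone: this is just a float
as_tuple = (186.681,)   # trailing comma: a one-element tuple
as_list = [186.681]     # a list sidesteps the pitfall entirely

print(type(single).__name__, type(as_tuple).__name__, type(as_list).__name__)


def as_columns(value):
    """Hypothetical helper: accept a lone number or a sequence and
    always return a list of floats that is safe to iterate over."""
    if isinstance(value, (int, float)):
        return [float(value)]
    return [float(v) for v in value]
```

Passing `as_columns(col_def)` instead of `col_def` would let the call accept either form.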

alonsopg closed this as completed Jan 26, 2017

alonsopg reopened this Jan 26, 2017

alonsopg closed this as completed Jan 26, 2017

vikjam mentioned this issue Nov 1, 2017

Format data vikjam/ui-policy#1

Open
