issue while using tabula-py #7

Closed
sidmohan opened this issue Nov 5, 2016 · 20 comments

@sidmohan

sidmohan commented Nov 5, 2016

I need urgent help with tabula automation. I read your article at the link below and installed pip as well as tabula-py (see the image below):

https://github.com/chezou/tabula-py

But how do I proceed after that? When I try to execute the lines below from a Python script, it gives an error. Kindly help.

#!/usr/bin/python
#!/usr/bin/perl
#!/usr/bin/perl -d:ptkdb

import fileinput, sys, os ,subprocess, io

from tabula import read_pdf_table
df=read_pdf_table("TAJ.pdf")

image

@chezou
Owner

chezou commented Nov 5, 2016

If possible, could you give me the URL of your PDF?

How about opening it with the guess=False option, like this:

read_pdf_table("TAJ.pdf", guess=False)

@sidmohan
Author

sidmohan commented Nov 5, 2016

Please find the link here:
http://datasheets.avx.com/TAJ.pdf

@sidmohan
Author

sidmohan commented Nov 5, 2016

read_pdf_table("TAJ.pdf", guess=False) did not work for me. Is there any other way of giving the PDF as input? Am I using the correct means to execute tabula?

#!/usr/bin/python
#!/usr/bin/perl
#!/usr/bin/perl -d:ptkdb

import fileinput, sys, os ,subprocess, io

from tabula import read_pdf_table
df=read_pdf_table("TAJ.pdf")

@sidmohan sidmohan closed this as completed Nov 5, 2016
@chezou
Owner

chezou commented Nov 5, 2016

In the current version, tabula-py doesn't handle Java exceptions well. tabula-py depends on tabula-java. I tried reading the PDF with tabula-java and found that it is too complex for tabula-java to parse in its entirety.

I got the following error with tabula-java:

$ java -jar tabula-0.9.1-jar-with-dependencies.jar -g TAJ.pdf
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Nov 05, 2016 7:25:56 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.pdmodel.graphics.color.PDSeparation createColorModel
INFO: About to create ColorModel for ICCBased{numberOfComponents: 3}
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
SEVERE: java.lang.IllegalArgumentException: Map size (0) must be >= 1
java.lang.IllegalArgumentException: Map size (0) must be >= 1
    at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:335)
    at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:287)
    at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:169)
    at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:146)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.getRGBImage(PDCcitt.java:189)
    at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:96)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
    at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
    at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:108)
    at technology.tabula.PageIterator.next(PageIterator.java:29)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:144)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)

Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNING: java.lang.NullPointerException
java.lang.NullPointerException
    at technology.tabula.ObjectExtractor$PointComparator.compare(ObjectExtractor.java:410)
    at technology.tabula.ObjectExtractor.strokeOrFillPath(ObjectExtractor.java:254)
    at technology.tabula.ObjectExtractor.strokePath(ObjectExtractor.java:275)
    at org.apache.pdfbox.util.operator.pagedrawer.StrokePath.process(StrokePath.java:47)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:537)
    at org.apache.pdfbox.util.operator.CloseAndStrokePath.process(CloseAndStrokePath.java:45)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
    at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
    at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:108)
    at technology.tabula.PageIterator.next(PageIterator.java:29)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:144)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)

Nov 05, 2016 7:25:57 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.pdmodel.graphics.color.PDSeparation createColorModel
INFO: About to create ColorModel for ICCBased{numberOfComponents: 3}
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
SEVERE: java.lang.IllegalArgumentException: Map size (0) must be >= 1
java.lang.IllegalArgumentException: Map size (0) must be >= 1
    at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:335)
    at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:287)
    at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:169)
    at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:146)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.getRGBImage(PDCcitt.java:189)
    at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:96)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
    at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
    at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
    at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:161)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)

Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.pdmodel.graphics.color.PDSeparation createColorModel
INFO: About to create ColorModel for ICCBased{numberOfComponents: 3}
Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.operator.pagedrawer.Invoke process
SEVERE: java.lang.IllegalArgumentException: Map size (0) must be >= 1
java.lang.IllegalArgumentException: Map size (0) must be >= 1
    at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:335)
    at java.awt.image.IndexColorModel.<init>(IndexColorModel.java:287)
    at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:169)
    at org.apache.pdfbox.pdmodel.graphics.color.PDIndexed.createColorModel(PDIndexed.java:146)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt.getRGBImage(PDCcitt.java:189)
    at org.apache.pdfbox.util.operator.pagedrawer.Invoke.process(Invoke.java:96)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
    at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
    at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
    at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:103)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:161)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:60)

Nov 05, 2016 7:25:58 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BDC
Nov 05, 2016 7:25:59 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EMC
"",Code,EIA S Min.Code,EIA Metric,L±0.20 (0.008),,W+0.20 (0.008) -0.10 (0.004),,H+0.20 (0.008) -0.10 (0.004),,W1±0.20 (0.008),,A+0.30 (0.012)-0.20 (0.008),,,
"",A,1206,3216-18,3.20 (0.126),,1.60 (0.063),,1.60 (0.063),,1.20 (0.047),,0.80 (0.031),,,1.10 (0.043)
"",B,1210,3528-21,3.50 (0.138),,2.80 (0.110),,1.90 (0.075),,2.20 (0.087),,0.80 (0.031),,,1.40 (0.055)
"",C,2312,6032-28,6.00 (0.236),,3.20 (0.126),,2.60 (0.102),,2.20 (0.087),,1.30 (0.051),,,2.90 (0.114)
MARKING,D,2917,7343-31,7.30 (0.287),,4.30 (0.169),,2.90 (0.114),,2.40 (0.094),,1.30 (0.051),,,4.40 (0.173)
"",E,2917,7343-43,7.30 (0.287),,4.30 (0.169),,4.10 (0.162),,2.40 (0.094),,1.30 (0.051),,,4.40 (0.173)
"A, B, C, D, E, U, V CASE",,,,,,,,,,,,,,,
"",U,2924,7361-43,7.30 (0.287),,6.10 (0.240),,4.10 (0.162),,3.10 (0.120),,1.30 (0.051),,,4.40 (0.173)
AVX LOGO Capacitance Value in pF,,,,,,,,,,,,,,,

...snip...

XXXXX ID Code,,
HOW TO ORDER,,
TAJ C 106,M 035 R NJ,—
Type Case Size Capacitance Code,Tolerance Rated DC Voltage Packaging Specification,Additional
See table pF code: 1st two,"K = ±10% 002 = 2.5Vdc R = Pure Tin 7"" Reel Suffix",characters may be
above digits represent,"M = ±20% 004 = 4Vdc S = Pure Tin 13"" Reel NJ = Standard",added for special
significant figures,"006 = 6.3Vdc A = Gold Plating 7"" Reel Suffix",requirements
3rd digit represents,"010 = 10Vdc B = Gold Plating 13"" Reel",V = Dry pack Option(selected codes only)

One thing you should try is opening it with the pages option.

With tabula-java, I can extract the tables with the -p option as follows:

$ java -jar tabula-0.9.1-jar-with-dependencies.jar -g -p 3-6 TAJ.pdf
Nov 05, 2016 7:27:30 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:31 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:33 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:34 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:35 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:36 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:38 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:38 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Nov 05, 2016 7:27:39 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
"",,,,,,,,,ESR,,
AVX,Case,Capacitance,Rated,Rated,Category Category,,DCL,DF,Max.,,100kHz RMS Current (mA)
Part No.,Size,(μF),Voltage,Temperature,Voltage Temperature,,Max.,Max.,@ 100kHz,MSL,
"",,,(V),(oC),(V) (oC),,(μA),(%),(Ω),,25oC 85oC 125oC
"",,,,,2.5 Volt @ 85°C,,,,,,
TAJA336*002#NJ,A,33,2.5,85,1.7 125,,0.8,8,1.7,1,210 189 84
TAJA476*002#NJ,A,47,2.5,85,1.7 125,,0.9,6,3,1,158 142 63
TAJA686*002#NJ,A,68,2.5,85,1.7 125,,1.4,8,1.5,1,224 201 89
TAJA107*002#NJ,A,100,2.5,85,1.7 125,,2.5,30,1.4,1,231 208 93

...snip...

TAJE156*050#NJ,E,15,50,85,33 125,,7.5,6,0.6,11),524 472 210
TAJV156*050#NJ,V,15,50,85,33 125,,7.5,6,0.6,11),645 581 258
TAJV226*050#NJ,V,22,50,85,33 125,,11,8,0.6,11),645 581 258

Unfortunately, the current version of tabula-py has a restriction (#2): it cannot handle multiple tables at once. With the same options that worked for tabula-java, I got the same error you got.

In [3]: from tabula import read_pdf_table

In [4]: df = read_pdf_table("./TAJ.pdf", guess=False, pages="3-6")
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-4-1ebe490eea51> in <module>()
----> 1 df = read_pdf_table("./TAJ.pdf", guess=False, pages="3-6")

/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, options, pages, guess, area, spreadsheet, password, nospreadsheet, silent)
     79     return
     80
---> 81   return pd.read_csv(io.BytesIO(output))

/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644
--> 645         return _read(filepath_or_buffer, kwds)
    646
    647     parser_f.__name__ = name

/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    398         return parser
    399
--> 400     data = parser.read()
    401     parser.close()
    402     return data

/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937
--> 938         ret = self._engine.read(nrows)
    939
    940         if self.options.get('as_recarray'):

/Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1503     def read(self, nrows=None):
   1504         try:
-> 1505             data = self._reader.read(nrows)
   1506         except StopIteration:
   1507             if self._first_chunk:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:25878)()

CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 14
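
The tokenizer message above ("Expected 2 fields in line 5, saw 14") is consistent with several tables of different widths ending up in one CSV stream, which cannot be parsed as a single table. As a minimal sketch of one possible workaround, using only the standard library, consecutive rows that share a field count can be grouped and each group treated as its own table. The helper name split_tables and the sample text are hypothetical, and splitting on field count is only a heuristic, not what tabula-py itself does.

```python
import csv
import io


def split_tables(csv_text):
    """Group consecutive CSV rows with the same field count into tables.

    Heuristic only: a change in the number of fields is treated as the
    start of a new table. Blank lines are skipped.
    """
    tables = []
    current = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # csv.reader yields [] for blank lines
        if current and len(row) != len(current[0]):
            tables.append(current)
            current = []
        current.append(row)
    if current:
        tables.append(current)
    return tables


# Hypothetical sample: a 2-column table followed by a 4-column table.
sample = (
    "XXXXX ID Code,\n"
    "HOW TO ORDER,\n"
    "Part No.,Size,Capacitance,Voltage\n"
    "TAJA336*002#NJ,A,33,2.5\n"
)
tables = split_tables(sample)
print([len(t[0]) for t in tables])
```

Each resulting group has a uniform width, so it could then be handed to a CSV parser on its own.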

@sidmohan
Author

sidmohan commented Nov 5, 2016

Where should I pass the -p option in tabula-py?

@sidmohan
Author

sidmohan commented Nov 5, 2016

Can you please also let me know everything I need to install to run tabula-java, and from which command prompt you are running the command below? I have Cygwin installed on Windows.

java -jar tabula-0.9.1-jar-with-dependencies.jar -g -p 3-6 TAJ.pdf

@sidmohan
Author

sidmohan commented Nov 5, 2016

Requesting input from anyone.

@chezou
Owner

chezou commented Nov 5, 2016

Did you see the tabula-java repo?
https://github.com/tabulapdf/tabula-java

You can download it from here:
https://github.com/tabulapdf/tabula-java/releases

@sidmohan
Author

sidmohan commented Nov 5, 2016

Thanks for the link. What are the steps to install this package?

@sidmohan
Author

Can someone please tell me the difference between stream mode and lattice mode in Tabula, and in which cases lattice mode should be applied?

@wtlnd

wtlnd commented Nov 23, 2016

I made a workaround you can use.
What I did was place the jar file in the project root and use subprocess.call to call it.
This solution works pretty well for me. It produces some errors and warnings, hence the redirect of stderr to devnull, but it works. Here's my method:

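def convert_pdf_to_csv(self, file):
    # Swap only the file extension for the output format, not every "pdf" in the path.
    outfile = os.path.splitext(file)[0] + "." + self.OUTPUT_FORMAT
    # tabula-java logs errors and warnings, so send stdout and stderr to devnull.
    with open(os.devnull, 'w') as devnull:
        subprocess.call(['java', '-jar', self.JARFILE, '-o', outfile, '-f', self.OUTPUT_FORMAT, '-g', '-r', file], stdout=devnull, stderr=subprocess.STDOUT)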

@sidmohan
Author

Thanks for the reply. Can you please explain a bit more about what is happening in subprocess.call(['java', '-jar', self.JARFILE, '-o',

@wtlnd

wtlnd commented Nov 24, 2016

Referencing https://github.com/tabulapdf/tabula-java

I'm pretty new to Python, but AFAIK subprocess.call just lets you call other processes from within Python.
self.JARFILE is the absolute path of "tabula-0.9.1-jar-with-dependencies.jar", because somehow it wouldn't work with a relative path.
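
To make the subprocess.call workaround a little more concrete, here is a minimal sketch that builds the argument list and resolves the jar to an absolute path before running it. The names build_tabula_command and run_tabula and their parameters are hypothetical. The flags (-g, -p, -r, -f, -o) are only the ones that already appear in the commands quoted in this thread, and the jar file name is the one used above.

```python
import os
import subprocess


def build_tabula_command(jar_path, pdf_path, pages=None, guess=False,
                         spreadsheet=False, output_format=None, outfile=None):
    """Build the argv list for a tabula-java call (hypothetical helper)."""
    # An absolute jar path avoids depending on the current working directory.
    cmd = ["java", "-jar", os.path.abspath(jar_path)]
    if guess:
        cmd.append("-g")
    if spreadsheet:
        cmd.append("-r")
    if pages:
        cmd += ["-p", str(pages)]
    if output_format:
        cmd += ["-f", output_format]
    if outfile:
        cmd += ["-o", outfile]
    cmd.append(pdf_path)
    return cmd


def run_tabula(cmd):
    """Run the command and return its exit code.

    stdout and stderr are discarded, so this is meant for commands built
    with outfile set; otherwise the extracted data would be lost.
    """
    with open(os.devnull, "w") as devnull:
        return subprocess.call(cmd, stdout=devnull, stderr=subprocess.STDOUT)


cmd = build_tabula_command("tabula-0.9.1-jar-with-dependencies.jar", "TAJ.pdf",
                           pages="3-6", guess=True)
print(cmd[3:])
```

Passing the arguments as a list, rather than as one shell string, means file names with spaces need no extra quoting. As far as I know, -r corresponds to the spreadsheet option that appears in the read_pdf_table signature in the traceback above.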

@sidmohan
Author

Thanks for the reply. What about self.OUTPUT_FORMAT? Is it setting the output format to PDF? Can complex PDFs be converted this way? Also, how can I pass the mode type, stream mode or lattice mode?

@wtlnd

wtlnd commented Nov 24, 2016

output_format is just a variable I use to set the output format with the -f option. Please read the link I provided to see which formats you can use for your project. I haven't seen anything to do with stream or lattice mode.

@chezou
Owner

chezou commented Jan 9, 2017

@sidmohan I'm sorry, I had introduced a bug and have now fixed it: f1db4ef
Could you upgrade your tabula-py?

@sidmohan
Author

sidmohan commented Jan 9, 2017

OK, thanks. How do you want us to try this?

@chezou
Owner

chezou commented Jan 9, 2017

As follows:

df = read_pdf_table("./TAJ.pdf", guess=False, pages="3-6")

Of course, there may still be some problems related to multiple tables. You should restrict the pages and areas to extract just what you want. For details on the restriction options, see the tabula-java README:
https://github.com/tabulapdf/tabula-java/blob/master/README.md

Alternatively, you can ask in the tabula-java Gitter chat room:
https://gitter.im/tabulapdf/tabula-java?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
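
One way to apply a page restriction is to read one page at a time, so that a page that fails to parse does not hide the pages that succeed. The sketch below is hypothetical: expand_pages and read_each_page are illustrative names, the "all" page spec is not handled, and the reader is passed in as a callable so that any function can be plugged in.

```python
def expand_pages(spec):
    """Expand a page spec such as "3-6" or "1,3-4" into a list of ints."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            pages.extend(range(int(first), int(last) + 1))
        elif part:
            pages.append(int(part))
    return pages


def read_each_page(read_page, spec):
    """Call read_page(page) for every page; collect results and errors."""
    results, errors = {}, {}
    for page in expand_pages(spec):
        try:
            results[page] = read_page(page)
        except Exception as exc:  # keep going so one bad page is not fatal
            errors[page] = exc
    return results, errors


# Stand-in reader for demonstration. A real one might be, as an assumption:
# lambda page: read_pdf_table("./TAJ.pdf", guess=False, pages=str(page))
def fake_reader(page):
    if page == 5:
        raise ValueError("could not parse page")
    return "table from page %d" % page


results, errors = read_each_page(fake_reader, "3-6")
print(sorted(results), sorted(errors))
```

The errors dictionary shows which pages need a narrower area or different options.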

@nitinvijay23

I am facing a similar issue.

tabula-py works fine in a Jupyter notebook, but when I run the same code from the command line, this error occurs: unsupported/disabled operation BDC.

Please help.

@chezou
Owner

chezou commented Apr 7, 2017

@nitinvijay23 This issue has already been closed. Could you create a new issue with more details?

Repository owner locked and limited conversation to collaborators May 29, 2017