
java.lang.OutOfMemoryError: GC overhead limit exceeded #27

Closed

catchcharan opened this issue May 16, 2017 · 2 comments

@catchcharan

Summary of your issue

My input PDF file is very large, around 9,000 pages (extraction works fine if I select only a few pages).

Environment

Tried on both Windows and Linux.

  • python --version: 2.7.13
  • java -version: 1.7 (invoked only through Python)
  • OS and its version: Windows 10 and Linux (tried both)
  • Your PDF URL: none; extraction works fine for a few pages but not for all 9,000 pages.

What did you do when you faced the problem?


Example code:


tabula.convert_into(r"C:\Meher\pricelist.pdf", r"C:\Meher\pricelistoutput.csv", spreadsheet=True, output_format="csv", pages="all")

Output:

tabula.convert_into("C:\Meher\pricelist.pdf", "C:\Meher\pricelistoutput.csv", spreadsheet=True, output_format="csv", pages="all")
May 16, 2017 6:12:14 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at technology.tabula.ObjectExtractor.processTextPosition(ObjectExtractor.java:329)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:504)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:56)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
        at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
        at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:10
        at technology.tabula.PageIterator.next(PageIterator.java:29)
        at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:160)
        at technology.tabula.CommandLineApp.extractFileInto(CommandLineApp.java:
        at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:128)
        at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:10
        at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\tabula\wrapper.py", line 114, in convert_into
    subprocess.check_output(args)
  File "C:\Python27\lib\subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['java', '-jar', 'C:\Python27\lib\site-packages\tabula\tabula-0.9.2-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'CSV', '--outfile', 'C:\Meher\pricelistoutput.csv', '--spreadsheet', 'C:\Meher\pricelist.pdf']' returned non-zero exit status 1

What did you expect to happen?

@chezou (Owner) commented May 17, 2017

I think you can extract specific page ranges in a for loop, and then concatenate the outputs; see the sketch below.
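A minimal sketch of that approach (untested; the 1,000-page chunk size and the per-chunk file names are assumptions, while the paths and call style are taken from the report above):

import tabula

# Assumed chunk size: small enough that each Java run fits in the
# default heap, large enough to keep the number of runs manageable.
CHUNK = 1000
TOTAL_PAGES = 9000
pdf_path = r"C:\Meher\pricelist.pdf"

# Extract each page range into its own CSV file.
parts = []
for start in range(1, TOTAL_PAGES + 1, CHUNK):
    end = min(start + CHUNK - 1, TOTAL_PAGES)
    part = r"C:\Meher\pricelist_%d_%d.csv" % (start, end)
    tabula.convert_into(pdf_path, part, output_format="csv",
                        spreadsheet=True, pages="%d-%d" % (start, end))
    parts.append(part)

# Concatenate the per-chunk CSVs into one file. If the table repeats
# a header row on each page, dedupe those rows as needed afterwards.
with open(r"C:\Meher\pricelistoutput.csv", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            out.write(f.read())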

Otherwise, with tabula-java, you can set the heap size using -Xmx, like the following:

java -Xmx4080m -jar C:\Python27\lib\site-packages\tabula\tabula-0.9.2-jar-with-dependencies.jar --pages all --guess --format CSV --outfile C:\Meher\pricelistoutput.csv --spreadsheet C:\Meher\pricelist.pdf

But I can't guarantee this approach will work. tabula-py is a simple wrapper around tabula-java, and the heap size is a standard tuning point for any Java program. If you want to tune further, you can file an issue on the tabula-java tracker.
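As an alternative to invoking the jar by hand, some tabula-py versions expose a java_options argument on convert_into; whether your installed version has it is an assumption worth checking (e.g. with help(tabula.convert_into)) before relying on this:

import tabula

# Assumes the installed tabula-py accepts java_options on convert_into;
# verify against your version before using this in a script.
tabula.convert_into(
    r"C:\Meher\pricelist.pdf",
    r"C:\Meher\pricelistoutput.csv",
    output_format="csv",
    spreadsheet=True,
    pages="all",
    java_options="-Xmx4080m",  # raise the JVM max heap to ~4 GB
)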

@catchcharan (Author)

Thank you chezou. With -Xmx4080m it was able to fetch 3,000 PDF pages in a single run very quickly, and with a for loop I got it done for all 9,000 pages. I found just one more issue; I will open a new issue for it.
