
java.lang.OutOfMemoryError: GC overhead limit exceeded #27

Closed

catchcharan opened this issue May 16, 2017 · 2 comments

@catchcharan

Summary of your issue

My input PDF file is very large, around 9,000 pages (extraction works fine if I select only a few pages).

Environment

Tried on both Windows and Linux.

  • python --version: 2.7.13
  • java -version: 1.7 (invoked only through Python)
  • OS and its version: Windows 10 and Linux (tried both)
  • Your PDF URL: none; extraction works fine for a few pages but not for all 9,000 pages.

What did you do when you faced the problem?


Example code:


tabula.convert_into(r"C:\Meher\pricelist.pdf", r"C:\Meher\pricelistoutput.csv", spreadsheet=True, output_format="csv", pages="all")

Output:

tabula.convert_into("C:\Meher\pricelist.pdf", "C:\Meher\pricelistoutput.csv", spreadsheet=True, output_format="csv", pages="all")
May 16, 2017 6:12:14 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: i
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at technology.tabula.ObjectExtractor.processTextPosition(ObjectExtractor.java:329)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:504)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:56)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
        at technology.tabula.ObjectExtractor.drawPage(ObjectExtractor.java:153)
        at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:10
        at technology.tabula.PageIterator.next(PageIterator.java:29)
        at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:160)
        at technology.tabula.CommandLineApp.extractFileInto(CommandLineApp.java:
        at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:128)
        at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:10
        at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\tabula\wrapper.py", line 114, in convert_into
    subprocess.check_output(args)
  File "C:\Python27\lib\subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['java', '-jar', 'C:\Python27\lib\site-packages\tabula\tabula-0.9.2-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'CSV', '--outfile', 'C:\Meher\pricelistoutput.csv', '--spreadsheet', 'C:\Meher\pricelist.pdf']' returned non-zero exit status 1

What did you expect to happen?

@chezou (Owner) commented May 17, 2017

I think you can extract specific page ranges in a for loop, and then concatenate the outputs; see the sketch below.
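A minimal sketch of that approach (untested; the 1,000-page chunk size and the per-chunk file names are assumptions, while the paths and call style are taken from the report above):

import tabula

# Assumed chunk size: small enough that each Java run fits in the
# default heap, large enough to keep the number of runs manageable.
CHUNK = 1000
TOTAL_PAGES = 9000
pdf_path = r"C:\Meher\pricelist.pdf"

# Extract each page range into its own CSV file.
parts = []
for start in range(1, TOTAL_PAGES + 1, CHUNK):
    end = min(start + CHUNK - 1, TOTAL_PAGES)
    part = r"C:\Meher\pricelist_%d_%d.csv" % (start, end)
    tabula.convert_into(pdf_path, part, output_format="csv",
                        spreadsheet=True, pages="%d-%d" % (start, end))
    parts.append(part)

# Concatenate the per-chunk CSVs into one file. If the table repeats
# a header row on each page, dedupe those rows as needed afterwards.
with open(r"C:\Meher\pricelistoutput.csv", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            out.write(f.read())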

Otherwise, with tabula-java, you can set the heap size using -Xmx, like the following:

java -Xmx4080m -jar C:\Python27\lib\site-packages\tabula\tabula-0.9.2-jar-with-dependencies.jar --pages all --guess --format CSV --outfile C:\Meher\pricelistoutput.csv --spreadsheet C:\Meher\pricelist.pdf

But I can't guarantee this approach will work. tabula-py is a simple wrapper around tabula-java, and the heap size is a standard tuning point for any Java program. If you want to tune further, you can file an issue on the tabula-java tracker.
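As an alternative to invoking the jar by hand, some tabula-py versions expose a java_options argument on convert_into; whether your installed version has it is an assumption worth checking (e.g. with help(tabula.convert_into)) before relying on this:

import tabula

# Assumes the installed tabula-py accepts java_options on convert_into;
# verify against your version before using this in a script.
tabula.convert_into(
    r"C:\Meher\pricelist.pdf",
    r"C:\Meher\pricelistoutput.csv",
    output_format="csv",
    spreadsheet=True,
    pages="all",
    java_options="-Xmx4080m",  # raise the JVM max heap to ~4 GB
)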

@catchcharan (Author)

Thank you chezou. With -Xmx4080m it was able to fetch 3,000 PDF pages in a single run very quickly, and with a for loop I got it done for all 9,000 pages. I found just one more issue; I will open a new issue for it.
