This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

NullPointerException when parsing large PDF documents #93

Closed
istvano opened this issue Dec 12, 2014 · 3 comments

istvano commented Dec 12, 2014

Hi Guys,

I am having trouble indexing some of our documents. Some of them are over 20 MB.

I receive the following exception.

[2014-12-08 10:48:43,375][WARN ][org.apache.pdfbox.util.PDFStreamEngine] java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.Tika.parseToString(Tika.java:421)
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:389)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:648)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:501)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:542)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:397)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:442)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:157)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:535)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:434)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Could this be because not enough memory is allocated to Tika?

How can I allocate more memory to ES so that Tika can benefit from it too?

Thanks for any hints or help.

@dadoonet
Member

Sorry for the delay in answering. You could try giving more memory to Elasticsearch so that Tika has more RAM to deal with bigger documents.

Have a look at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#setup-configuration
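
The stack trace above shows Tika being called from `AttachmentMapper` on an Elasticsearch thread, so Tika shares the Elasticsearch JVM heap rather than having an allocation of its own. In Elasticsearch 1.x the heap size was usually set through the `ES_HEAP_SIZE` environment variable before starting the node. As a minimal sketch (the class `HeapCheck` and its helper are hypothetical, not part of Elasticsearch or Tika), the following standard-library program reports how much heap the JVM it runs in has been given, which may help when checking what a given heap setting amounts to:

```java
class HeapCheck {
    /** Converts a byte count to whole mebibytes (rounding down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will try to use for the heap (roughly -Xmx).
        System.out.println("Max heap:          " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is the heap currently committed; freeMemory() is the unused part of it.
        System.out.println("Committed heap:    " + toMiB(rt.totalMemory()) + " MiB");
        System.out.println("Free in committed: " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

This only inspects the JVM that runs it, not a live Elasticsearch node, and the reported maximum can differ slightly from the configured value depending on the JVM and garbage collector.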

I wonder if this has been fixed in a more recent version of Tika. We are going to release the plugin with Tika 1.7 (was 1.5). If you still use it, could you test with the new version and reopen the issue if you still get the NPE?

BTW, are you using an old version of the plugin? We now catch Throwable, so you should not see that kind of log.
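
To illustrate what catching a Throwable around content extraction can look like, here is a minimal sketch. It is not the plugin's actual code: the class `SafeParse` and the method `parseOrEmpty` are hypothetical, and the extraction step is passed in as a `Callable` so the example needs nothing beyond the standard library.

```java
import java.util.concurrent.Callable;

class SafeParse {
    /** Runs the extraction and returns its text, or "" if it fails in any way. */
    static String parseOrEmpty(Callable<String> extraction) {
        try {
            String text = extraction.call();
            return text == null ? "" : text;
        } catch (Throwable t) {
            // Deliberately broad: this also swallows Errors such as OutOfMemoryError,
            // so one bad document does not fail the whole indexing request.
            System.err.println("Failed to extract content: " + t);
            return "";
        }
    }

    public static void main(String[] args) {
        // A successful extraction passes its text through unchanged.
        System.out.println(parseOrEmpty(() -> "extracted text"));
        // A failing extraction (here a simulated NPE) yields an empty string.
        System.out.println(parseOrEmpty(() -> { throw new NullPointerException(); }).isEmpty());
    }
}
```

The trade-off of this approach is that a document whose content cannot be extracted is indexed without that content, so such failures are only visible in the logs.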

Closing. Feel free to reopen if needed and/or add more details.

@istvano
Author

istvano commented Feb 11, 2015

Hi,

No worries, thanks for the tip. I will try it. Also, I have upgraded to ES 1.4 and the latest plugin, so I see fewer problems with that!

@dadoonet
Member

Just wait a few minutes. Hopefully a new version will be released :)
