Fatal error with ingest-attachment plugin #22077

Closed
chrduf opened this issue Dec 9, 2016 · 7 comments


@chrduf
commented Dec 9, 2016

I am getting a fatal error when trying to index the attached stripped-down document using the ingest-attachment plugin. This causes my cluster to reboot and does not give me an email notification. The problem appears to be the embedded Visio diagram.

Version: 5.1.1 (Elastic Cloud)
ClusterId: 7e7501
Node: instance-0000000005
Plugin: ingest-attachment

Link to problematic doc: https://1drv.ms/w/s!ApTXXtrEV_GGiosenEfoUSk1rRnuYA

Error Information

Dec 8 21:42:34 ERROR org.elasticsearch.bootstrap.ElasticsearchUncaughtExceptionHandler i5@z0
[2016-12-08T21:42:34,628][ERROR][org.elasticsearch.bootstrap.ElasticsearchUncaughtExceptionHandler] fatal error in thread [elasticsearch[index][T#1]], exiting
java.lang.NoClassDefFoundError: com/graphbuilder/curve/Point
 at java.lang.Class.getDeclaredConstructors0(Native Method) ~[?:1.8.0_72]
 at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) ~[?:1.8.0_72]
 at java.lang.Class.getConstructor0(Class.java:3075) ~[?:1.8.0_72]
 at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[?:1.8.0_72]
 at org.apache.poi.xdgf.util.ObjectFactory.put(ObjectFactory.java:34) ~[?:?]
 at org.apache.poi.xdgf.usermodel.section.geometry.GeometryRowFactory.<clinit>(GeometryRowFactory.java:39) ~[?:?]
 at org.apache.poi.xdgf.usermodel.section.GeometrySection.<init>(GeometrySection.java:55) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XDGFSheet.<init>(XDGFSheet.java:77) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:113) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XDGFShape.<init>(XDGFShape.java:107) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XDGFBaseContents.onDocumentRead(XDGFBaseContents.java:82) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XDGFMasterContents.onDocumentRead(XDGFMasterContents.java:66) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XDGFMasters.onDocumentRead(XDGFMasters.java:101) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XmlVisioDocument.onDocumentRead(XmlVisioDocument.java:106) ~[?:?]
 at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190) ~[?:?]
 at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:79) ~[?:?]
 at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:41) ~[?:?]
 at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:207) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) ~[?:?]
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[?:?]
 at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) ~[?:?]
 at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:311) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:202) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) ~[?:?]
 at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) ~[?:?]
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[?:?]
 at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
 at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:94) ~[?:?]
 at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:91) ~[?:?]
 at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_72]
 at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:91) ~[?:?]
 at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:72) ~[?:?]
 at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.1.1.jar:5.1.1]
 at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.1.1.jar:5.1.1]
 at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.1.1.jar:5.1.1]
 at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.1.1.jar:5.1.1]
 at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:65) ~[elasticsearch-5.1.1.jar:5.1.1]
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527) ~[elasticsearch-5.1.1.jar:5.1.1]
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.1.1.jar:5.1.1]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_72]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_72]
 at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72] 
Caused by: java.lang.ClassNotFoundException: com.graphbuilder.curve.Point
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_72]
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_72]
 at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814) ~[?:1.8.0_72]
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_72]
 ... 47 more
@jasontedor

Member

commented Dec 9, 2016

This is due to a missing transitive dependency: com.github.virtuald:curvesapi:1.04:

_transitive_org.apache.poi:poi-ooxml:3.15
\--- org.apache.poi:poi-ooxml:3.15
     +--- org.apache.poi:poi:3.15
     |    +--- commons-codec:commons-codec:1.10
     |    \--- org.apache.commons:commons-collections4:4.1
     +--- org.apache.poi:poi-ooxml-schemas:3.15
     |    \--- org.apache.xmlbeans:xmlbeans:2.6.0
     |         \--- stax:stax-api:1.0.1
     \--- com.github.virtuald:curvesapi:1.04

It's not ideal, but I think you can work around this for now by dropping this dependency (and any of its transitive dependencies) into plugins/ingest-attachment. Note that this might not be a complete solution if security permissions also need to be added; I'm just trying to see what we can do here in the short term.
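To check whether the class is actually visible to a given class loader, a probe along these lines may help. This is only a minimal sketch: `ClassProbe` and `isLoadable` are hypothetical names, not part of Elasticsearch, Tika, or POI, and the result depends on which class loader you pass in.

```java
final class ClassProbe {
    /**
     * Returns true if the named class can be located by the given loader.
     * The class is not initialized, so no static initializers run.
     */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassProbe.class.getClassLoader();
        // The class name comes from the stack trace in this issue.
        System.out.println("com.graphbuilder.curve.Point loadable: "
                + isLoadable("com.graphbuilder.curve.Point", loader));
    }
}
```

Run with the plugin's jars on the classpath, this would report whether `com.graphbuilder.curve.Point` resolves before a document ever reaches the parser.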

@dadoonet

Member

commented Dec 9, 2016

He can't do that. He is on cloud.
I'm currently reproducing it.

@dadoonet dadoonet self-assigned this Dec 9, 2016

@jasontedor

Member

commented Dec 9, 2016

> He can't do that. He is on cloud.

I didn't notice that but that is indeed unfortunate.

@dadoonet

Member

commented Dec 9, 2016

I can totally reproduce the hard failure locally.
Thanks a lot @chrduf for providing the file which is causing that.
Working on a fix ATM. We have to fix 2 things IMO:

  • catch the exception correctly and just fail the ingest pipeline (basically, don't stop the node)
  • then see whether we can safely add the missing dependency
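
The first point could look roughly like the sketch below. It is not the fix that was eventually committed: `SafeParse`, `ParseFailure`, and `parseOrFail` are hypothetical names, and the parser is modeled as a plain `Function` rather than the real Tika call. It relies only on the fact that `NoClassDefFoundError` is a `LinkageError`, which a plain `catch (Exception e)` does not cover.

```java
import java.util.function.Function;

final class SafeParse {
    /** Hypothetical exception standing in for a failure scoped to one pipeline run. */
    static final class ParseFailure extends RuntimeException {
        ParseFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the parser and converts linkage problems (such as a missing class)
     * into an ordinary runtime exception, so they do not reach the thread's
     * uncaught exception handler as an Error.
     */
    static String parseOrFail(byte[] content, Function<byte[], String> parser) {
        try {
            return parser.apply(content);
        } catch (LinkageError e) {
            throw new ParseFailure("parser is missing a required class: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            parseOrFail(new byte[0], bytes -> {
                throw new NoClassDefFoundError("com/graphbuilder/curve/Point");
            });
        } catch (ParseFailure e) {
            System.out.println("document rejected: " + e.getMessage());
        }
    }
}
```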
@jasontedor

Member

commented Dec 9, 2016

> This causes my cluster to reboot and does not give me an email notification.

We will look into why this is the case (that you're not receiving the email notification).

@dadoonet

Member

commented Dec 9, 2016

@chrduf I'd like to use your file https://1drv.ms/w/s!ApTXXtrEV_GGiosenEfoUSk1rRnuYA as an input for a test case. Would you allow us to do so?

@chrduf

Author

commented Dec 9, 2016

yes, you can use that document

dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Jan 24, 2017
Add missing mime4j library
In some cases (apparently with outlook files), mime4j library is needed.
We removed it in the past which can cause elasticsearch to crash when you are using ingest-attachment (and probably mapper-attachments as well in 2.x series) with a file which requires this library.

Similar problem to the one reported in elastic#22077.
dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Feb 3, 2017
Remove support for Visio and POTM files
Actually we never supported Visio files, but we fail hard (killing a node) when that kind of file is provided.
See elastic#22079 (comment)

This commit excludes Visio parsing from Tika so that it no longer fails and returns empty content instead.

As a side effect, it also removes support for POTM files.

Closes elastic#22077.
dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Feb 17, 2017
Remove support for Visio and potm files
* Send an unsupported document to an ingest pipeline using `ingest-attachment`
* If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not caught.

This commit removes support for Visio and POTM office files.

So elasticsearch is not killed anymore when you run a command like:

```
GET _ingest/pipeline/_simulate
{
  "pipeline" : {
    "processors" : [
      {
        "attachment" : {
          "field" : "file"
        }
      }
    ]
  },
  "docs" : [
    {
      "_source" : {
        "file" : "BASE64CONTENT"
      }
    }
  ]
}
```

The good news is that it no longer kills the node and still extracts the text in the Office document even if it contains Visio content (which is no longer extracted).
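
For reference, the `BASE64CONTENT` placeholder in the request above stands for the base64-encoded bytes of the file. A minimal sketch of building that request body follows; `SimulateBody` and `build` are hypothetical names, and the JSON is assembled by hand only to keep the example free of dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SimulateBody {
    /** Builds the simulate request body with the file bytes base64-encoded into the "file" field. */
    static String build(byte[] fileBytes) {
        // The standard base64 alphabet needs no escaping inside a JSON string.
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"pipeline\":{\"processors\":[{\"attachment\":{\"field\":\"file\"}}]},"
                + "\"docs\":[{\"_source\":{\"file\":\"" + encoded + "\"}}]}";
    }

    public static void main(String[] args) {
        // In practice the bytes would come from the document on disk,
        // for example via java.nio.file.Files.readAllBytes.
        System.out.println(build("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```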

Related to elastic#22077

Backport of elastic#22079 in 5.x branch (5.3)
dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Feb 17, 2017
Remove support for Visio and potm files
* Parse an unsupported document using `mapper-attachments`
* If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not caught.

This commit removes support for Visio and POTM office files.

The good news is that it no longer kills the node and still extracts the text in the Office document even if it contains Visio content (which is no longer extracted).

Related to elastic#22077 and elastic#22079 for mapper-attachments plugin
dadoonet added a commit that referenced this issue Feb 20, 2017
Remove support for Visio and potm files
* Send an unsupported document to an ingest pipeline using `ingest-attachment`
* If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not caught.

This commit removes support for Visio and POTM office files.

So elasticsearch is not killed anymore when you run a command like:

```
GET _ingest/pipeline/_simulate
{
  "pipeline" : {
    "processors" : [
      {
        "attachment" : {
          "field" : "file"
        }
      }
    ]
  },
  "docs" : [
    {
      "_source" : {
        "file" : "BASE64CONTENT"
      }
    }
  ]
}
```

The good news is that it no longer kills the node and still extracts the text in the Office document even if it contains Visio content (which is no longer extracted).

Related to #22077

Backport of #23214 in 5.2 branch
dadoonet added a commit that referenced this issue Feb 20, 2017
Remove support for Visio and potm files
* Parse an unsupported document using `mapper-attachments`
* If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not caught.

This commit removes support for Visio and POTM office files.

The good news is that it no longer kills the node and still extracts the text in the Office document even if it contains Visio content (which is no longer extracted).

Related to #22077 and #22079 for mapper-attachments plugin

Backport of #23214 in 5.2 branch
dadoonet added a commit that referenced this issue Apr 23, 2017
Remove support for Visio and potm files
* Send an unsupported document to an ingest pipeline using `ingest-attachment`
* If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not caught.

This commit removes support for Visio and POTM office files.

So elasticsearch is not killed anymore when you run a command like:

```
GET _ingest/pipeline/_simulate
{
  "pipeline" : {
    "processors" : [
      {
        "attachment" : {
          "field" : "file"
        }
      }
    ]
  },
  "docs" : [
    {
      "_source" : {
        "file" : "BASE64CONTENT"
      }
    }
  ]
}
```

The good news is that it no longer kills the node and still extracts the text in the Office document even if it contains Visio content (which is no longer extracted).

Related to #22077

Backport of #23214 in 5.3 branch

(cherry picked from commit 76a977a)