Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can not extract text from Office documents (`.docx` extension) #16864

Closed
dadoonet opened this Issue Feb 29, 2016 · 4 comments

Comments

Projects
None yet
4 participants
@dadoonet
Copy link
Member

commented Feb 29, 2016

Elasticsearch version: 2.2.0
Description of the problem including expected versus actual behavior: Mapper attachments plugin (or ingest-attachment) works with Text of PDF, but not with the Office formats.
Steps to reproduce:

  1. Install mapper-attachments plugin
  2. Index a Word (.docx document)
  3. Look at logs DEBUG level.

Logs:

[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract [100000] characters of text for [null]: [Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]
...
Caused by: java.lang.IllegalStateException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at org.apache.xmlbeans.XmlBeans.getContextTypeLoader(XmlBeans.java:336)
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)

Analysis:

As recent Office documents are now xml based (.docx, .xlsx...), Tika can not read them anymore in the context of elasticsearch because getClassLoader call is forbidden.

Reported by many users at https://discuss.elastic.co/t/no-hits-when-do-a-text-search-in-an-attachment-for-docx-file/41779

Switching to .doc legacy format works well.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Mar 1, 2016

@rjernst any idea what we can do here? Is this a matter of allowing Tika to use getClassLoader?

@rjernst

This comment has been minimized.

Copy link
Member

commented Mar 9, 2016

Is this a matter of allowing Tika to use getClassLoader?

No, it's not that simple. Adding any permissions still requires executing code needing those permissions inside a doPrivileged block. At a quick glance, it looks like xmlbeans is trying to get the context class loader, but it is hard to tell from the log snippet where this is called from within tika.

@dadoonet Do you have the full stack trace, and/or can you post the base64 of the sample doc which fails? A full recreation would make it easier to investigate where/how to fix the problem, preferably in this case as an addition to the tests (at worst rest tests).

@rmuir

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2016

it is not "at worst rest tests", in this case an integration test is a must. Because fantasyland (unit test environment) has everything on a gigantic classpath: you won't hit failures for violations of classloader boundaries.

once you have a good test, then its easy to "fix". Add getClassLoader to this plugin's security policy file and then add it to tika's sandbox: https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L120

Tika must unfortunately have this special stuff in TikaImpl too, because of the nature of what it does, and because global permissions (core/ security file) are too generous/lenient. So by restricting it, it can only use /tmp and other things, this easily stops e.g. a directory traversal from an xml parser flaw, which is the kind of thing we should expect exist. Plus its just a shit-ton of different jars/code and we gotta keep some kind of leash on that :)

At the same time its not good to just keep adding more exceptions for bad code, without doing something about it. Later as a followup we should look at upgrading tika, they released recently and I know @uschindler also fixed a bunch of problems in some of the parsers themselves like apache POI, those fixes might be in that release and allow some cleanups here.

dadoonet added a commit that referenced this issue Mar 10, 2016

Can not extract text from Office documents (`.docx` extension)
Add REST test for:

* `.doc`
* `.docx`

The later fails with:

```
==> Test Info: seed=DB93397128B876D4; jvm=1; suite=1
Suite: org.elasticsearch.ingest.attachment.IngestAttachmentRestIT
  2> REPRODUCE WITH: gradle :plugins:ingest-attachment:integTest -Dtests.seed=DB93397128B876D4 -Dtests.class=org.elasticsearch.ingest.attachment.IngestAttachmentRestIT -Dtests.method="test {yaml=ingest_attachment/30_files_supported/Test ingest attachment processor with .docx file}" -Des.logger.level=WARN -Dtests.security.manager=true -Dtests.locale=bg -Dtests.timezone=Europe/Athens
FAILURE 4.53s | IngestAttachmentRestIT.test {yaml=ingest_attachment/30_files_supported/Test ingest attachment processor with .docx file} <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: expected [2xx] status code but api [index] returned [400 Bad Request] [{"error":{"root_cause":[{"type":"parse_exception","reason":"Error parsing document in field [field1]"}],"type":"parse_exception","reason":"Error parsing document in field [field1]","caused_by":{"type":"tika_exception","reason":"Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@7f85baa5","caused_by":{"type":"illegal_state_exception","reason":"access denied (\"java.lang.RuntimePermission\" \"getClassLoader\")","caused_by":{"type":"access_control_exception","reason":"access denied (\"java.lang.RuntimePermission\" \"getClassLoader\")"}}}},"status":400}]
   >    at __randomizedtesting.SeedInfo.seed([DB93397128B876D4:53C706AB86441B2C]:0)
   >    at org.elasticsearch.test.rest.section.DoSection.execute(DoSection.java:107)
   >    at org.elasticsearch.test.rest.ESRestTestCase.test(ESRestTestCase.java:395)
   >    at java.lang.Thread.run(Thread.java:745)
```

Related to #16864
@dadoonet

This comment has been minimized.

Copy link
Member Author

commented Mar 10, 2016

I do agree with @rmuir. So I created a branch here: https://github.com/elastic/elasticsearch/tree/fix/16864-attachment-doctypes where you can add also your changes.

It adds a new REST test which test .doc, .docx docs in the ingest-attachment plugin which replaces the mapper-attachments plugin.
We can copy our findings then to mapper-attachments if we are willing to.

The doc test works. The docx test fails.

Let me know if I can help on something else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.