Can not extract text from Office documents (`.docx` extension) #16864
Elasticsearch version: 2.2.0
As recent Office documents are now
Reported by many users at https://discuss.elastic.co/t/no-hits-when-do-a-text-search-in-an-attachment-for-docx-file/41779
No, it's not that simple. Adding any permissions still requires executing code needing those permissions inside a doPrivileged block. At a quick glance, it looks like xmlbeans is trying to get the context class loader, but it is hard to tell from the log snippet where this is called from within tika.
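To illustrate the point about doPrivileged, here is a minimal sketch of wrapping a context class loader lookup in a privileged block. The class and method names (`ContextLoaderSketch`, `contextClassLoader`) are hypothetical and are not taken from tika, xmlbeans, or Elasticsearch. Note that the Security Manager API is deprecated in recent JDKs, so this may compile with warnings.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

public class ContextLoaderSketch {

    // Hypothetical helper: performs the lookup inside a doPrivileged block so that,
    // under a security manager, only this code's own permissions are checked
    // rather than those of every caller on the stack.
    @SuppressWarnings("removal")
    static ClassLoader contextClassLoader() {
        return AccessController.doPrivileged(
            (PrivilegedAction<ClassLoader>) () -> Thread.currentThread().getContextClassLoader());
    }

    public static void main(String[] args) {
        System.out.println("context class loader: " + contextClassLoader());
    }
}
```

Granting `getClassLoader` in a policy file only helps if the code that needs it runs inside a block like this; otherwise the check still fails on less privileged callers further up the stack.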
@dadoonet Do you have the full stack trace, and/or can you post the base64 of the sample doc which fails? A full recreation would make it easier to investigate where/how to fix the problem, preferably in this case as an addition to the tests (at worst rest tests).
It is not "at worst rest tests"; in this case an integration test is a must, because fantasyland (the unit test environment) has everything on one gigantic classpath, so you won't hit failures for violations of classloader boundaries.
Once you have a good test, then it's easy to "fix". Add getClassLoader to this plugin's security policy file and then add it to tika's sandbox: https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L120
Tika must unfortunately have this special stuff in TikaImpl too, because of the nature of what it does, and because global permissions (core/ security file) are too generous/lenient. By restricting it, it can only use /tmp and a few other things, which easily stops e.g. a directory traversal from an xml parser flaw, which is the kind of thing we should expect to exist. Plus it's just a shit-ton of different jars/code and we gotta keep some kind of leash on that :)
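The sandbox idea described above can be sketched roughly as follows: build a restricted `AccessControlContext` that holds only the permissions the parser needs, then run the parsing code under it. This is an illustration under assumptions, not the actual TikaImpl code; `RestrictedContextSketch`, `restrictedContext`, and `runRestricted` are hypothetical names, and the single permission shown is just the `getClassLoader` one discussed in this issue.

```java
import java.security.AccessControlContext;
import java.security.AccessController;
import java.security.Permissions;
import java.security.PrivilegedAction;
import java.security.ProtectionDomain;

public class RestrictedContextSketch {

    // Hypothetical: builds a context that carries only an explicit, small set of permissions.
    @SuppressWarnings("removal")
    static AccessControlContext restrictedContext() {
        Permissions perms = new Permissions();
        // The permission this issue is about; a real sandbox would list a few more.
        perms.add(new RuntimePermission("getClassLoader"));
        perms.setReadOnly();
        return new AccessControlContext(
            new ProtectionDomain[] { new ProtectionDomain(null, perms) });
    }

    // Hypothetical: runs the action with the caller's permissions intersected
    // with the restricted context, so the action can never exceed that set.
    @SuppressWarnings("removal")
    static <T> T runRestricted(PrivilegedAction<T> action) {
        return AccessController.doPrivileged(action, restrictedContext());
    }

    public static void main(String[] args) {
        String result = runRestricted(() -> "parsed");
        System.out.println(result);
    }
}
```

With a security manager installed, code run through `runRestricted` would be denied anything outside the listed permissions even if the plugin's policy file grants more, which is why the permission has to be added in both places.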
At the same time it's not good to just keep adding more exceptions for bad code, without doing something about it. Later, as a followup, we should look at upgrading tika. They released recently, and I know @uschindler also fixed a bunch of problems in some of the parsers themselves, like Apache POI; those fixes might be in that release and allow some cleanups here.
I do agree with @rmuir. So I created a branch here: https://github.com/elastic/elasticsearch/tree/fix/16864-attachment-doctypes where you can also add your changes.
It adds a new REST test which test
Let me know if I can help on something else.