integrate with apache tika? #12
Apache Tika supports a pretty wide range of formats and appears to have many of the same goals—namely, extracting text for a very wide range of formats. It seems like it might be good to integrate the Apache Tika extraction capabilities into textract, not unlike how we use external libraries like
The nice thing about this is that it would provide yet another means of extracting raw text for everything we already support, simply adding another method for doing it. For PDFs, for example, we expose methods to extract via the
The downside is that Tika is written in Java and doesn't appear to be the easiest thing to install for Maven n00bs like me. Python bindings exist but even those carry big caveats about installation.
Random thought: it would be interesting to do an experiment to look at how effective Tika is at extracting text versus the other utilities that are currently included in textract. Given that we often care about the accuracy of word sequences (or even more forgivingly, word frequencies), maybe we can construct a test to see where it makes sense to include Tika and where (if anywhere) it doesn't perform as well.
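One way such an experiment could be scored is sketched below, assuming we already have the text produced by two extractors for the same document. The tokenizer, the function names, and the sample strings are all illustrative assumptions, not part of textract or Tika. It compares word frequencies with cosine similarity, which ignores word order and so matches the "more forgiving" measure mentioned above.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in a piece of extracted text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two word-frequency Counters (0.0 to 1.0)."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[word] * freq_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical outputs from two extractors run on the same document;
# the second one has misread a single word.
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy d0g"
score = cosine_similarity(word_frequencies(reference), word_frequencies(candidate))
```

A score near 1.0 would suggest the two extractors agree on vocabulary; comparing word sequences rather than frequencies would need a stricter measure such as an edit distance over tokens.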
Does this solve the problem?
Using jnius: finally, I remembered a library I had spotted once, called jnius, that should be made exactly for that purpose: using Java libraries from Python, without the need for wrappers, running the whole thing in a JVM, etc. In the end, I opted to do it this way.
Setting things up was pretty straightforward, as it was just a matter of:
pip install cython
From that point, using the library was a breeze:
# If you put the jar in a non-standard location, you need to
# prepare the CLASSPATH before importing jnius
from jnius import autoclass
# Import the Java classes we are going to need
Tika = autoclass('org.apache.tika.Tika')
tika = Tika()
Integrating this with django and celery tasks was straightforward.
Of course, have a look at the Tika API Documentation for more information on the available methods, signatures, etc.
Taken from this source:
@w1kke that sounds really promising! I like the idea of using pyjnius to get this to work. The Java dependency for installation doesn't sound too terrible and I like how actively developed the pyjnius project is.
@chrismattmann sorry to rope you into this conversation, but do you have any thoughts on using pyjnius to autoclass Tika vs using your Python bindings?
Ideally it would be great if we can keep the installation of
@GomesNayagam Can you elaborate a bit more on how Thrift works? Do you have to have a Thrift server process running in order to use Tika with python then? If so, I think I'd prefer to go with either pyjnius or the tika-python options as it makes using textract that much cleaner.
For what it's worth, I did some quick experiments trying to get the Tika detector interface to work in Python using Pyjnius some months ago. See this repo:
But then I ran into this:
After which I gave up on it. Just thought I'd drop those links here, just in case it's useful!
Hi @deanmalmgren thanks for roping me in! To answer your question, I'm open to using pyjnius if it's better than JCC. Right now I'm experimenting with just making Tika available from Python in the easiest manner possible. JCC has its issues mostly having to do with weird configuration that's kind of unknown to me at times. I modeled my tika-python wrapper after Aptivate Tika which to my knowledge about a year ago was really the only well defined effort to expose Tika to Python. It had been forked many times, adapted in various ways, etc.
I'm happy to explore other Python integration facilities, with the ultimate goal of getting this pushed upstream into Apache Tika. I'd like Python support to be a focus there since there are many folks working to improve Tika at Apache and since we maintain other bindings there (e.g., just as a basic .NET facility, even though there are other examples of such bindings e.g., http://kevm.github.io/tikaondotnet/).
That said, in looking at textract, I'm wondering - how are its goals different than Apache Tika's? If they are the same, it would be good to join forces and simply make a great Python version of Tika and then provide that to the community.
Thoughts? Thanks for checking out tika-python. I'm currently using it on the DARPA XDATA project in concert with a simple ETL library (https://github.com/chrismattmann/etllib) and Apache OODT (http://oodt.apache.org/) and Apache Solr (http://lucene.apache.org/solr/).
@deanmalmgren yes, we run the Tika jar as a separate process behind Thrift (a Java service) and use a Python Thrift client to access the functionality. Since our use case is online collaboration, we need a scalable way to deal with this problem. For your case you can go with your approach, or use Python's subprocess module to run the jar in command-line mode and get the result.
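A minimal sketch of the subprocess approach, assuming a `java` binary on the PATH and a Tika App jar downloaded locally. The function names and the jar path are hypothetical; `--text` is the Tika App option for plain-text output.

```python
import subprocess


def build_tika_command(jar_path, document_path, java_bin="java"):
    """Build the argument list for running the Tika App jar in text mode."""
    return [java_bin, "-jar", jar_path, "--text", document_path]


def extract_text(jar_path, document_path):
    """Run the Tika App jar once and return whatever it prints to stdout.

    Every call starts a fresh JVM, so this is simple but slow for many files.
    Raises subprocess.CalledProcessError if the JVM exits with an error.
    """
    result = subprocess.run(
        build_tika_command(jar_path, document_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example call (paths are placeholders):
# text = extract_text("/path/to/tika-app.jar", "report.pdf")
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.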
@chrismattmann Thanks for sharing your thoughts on how you developed the tika python bindings and the pros/cons of pyjnius vs JCC.
This is a great question and admittedly something I have been grappling with quite a bit since I was first made aware of Tika.
One key difference, at least as far as I understand Tika, is that Tika provides one and only one parser class for each document type. Textract, on the other hand, is parser method agnostic meaning that we could have multiple ways of extracting content from the same document type. For example, you can currently either parse PDFs with the
In that vein, it seems natural to extend textract to have Tika support (either by JCC or pyjnius) because it is yet another way to extract text. Provided it's as easy to install textract as it is today (with one
I'm certainly not opposed to creating a "Tika for python" but think that realistically other tools beyond Tika are probably just as good, if not better, at extracting content. The intent here is to pull all those together in one easy to use way.
What do you think about this? Am I full of shit or is this a worthy goal?
Hi @deanmalmgren thanks for your reply. My thoughts are below:
Tika doesn't only provide one type of parser class for each document type. We have all sorts of ways to combine them (e.g., we have a CompositeParser construct that combines various underlying Parsers to form new and more powerful ones; we have FallbackParsers that try and parse first, and if unsuccessful, use an ordered List of Parsers to fall back on; we have ForkParsers which fork out new processes to control parsers, etc.) The mapping of Parsers to MIME types is also something that is a 1...N relationship (each parser declares its supported MIME types and they can be overlapping).
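The fallback idea described here can be illustrated in Python as follows. This is only a sketch of the general pattern, not Tika's FallbackParser API and not textract code; the function name and the `(name, callable)` pairing are assumptions made for the example.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order and return the first usable result.

    An extractor is any callable that takes a path and returns text.
    An exception or an empty result counts as a failure and moves on
    to the next extractor in the list.
    """
    failures = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any extractor error triggers the fallback
            failures.append((name, repr(exc)))
            continue
        if text and text.strip():
            return name, text
        failures.append((name, "empty output"))
    raise RuntimeError("all extractors failed for %s: %r" % (path, failures))
```

With a chain like this, a Tika-backed extractor could sit alongside other utilities for the same document type, with the order of the list deciding which one is preferred.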
As for other tools besides Tika being good or better at extracting content - that is entirely possible and I've seen it. Tika isn't meant to be the best parsing toolkit in the world - its goal is to find all those toolkits and to integrate them. So far the MIME registry we have is compliant with the 1200+ types defined in the IANA registry and we pretty much have parsers for all of those different types, and more and more are being added each day. I have funding from NASA, DARPA and the NSF and various other reimbursable efforts (e.g., with bioinformatics companies, etc.) and so we are working to add more and more parsers and support. We also have a healthy active community of developers at Apache (I would say currently there are between 8 and 10 active developers working on Tika, not only at NASA but at various companies and agencies). I'm about to publish a blog post as well showing where we're taking Tika in some areas, especially concerning Machine Translation (once you identify language and you have the ability to parse text and metadata from it, why not unify the languages and translate the text, the metadata, etc., too?). We are actively working in that area now. Tika really does aim to be the "digital babel fish". Not sure if you saw, but the 1st and 4th chapters from the Tika in Action book are available; the 1st one gives the motivation and case for the "Digital Babel Fish": http://manning.com/mattmann/SampleChapter-1.pdf and has some more insight into my and the team's motivations over the years.
So, at the end of the day, I'm definitely biased and think that the goals for textract are actually quite overlapping with Tika. That doesn't mean you have to be swallowed into the Tika project and you may decide, nope, I'm going forward and doing my own thing. If you do, that's fine and Tika can be one of your dependencies since we want anyone to use it and permissively license it under the Apache License version 2. Heck, we don't even mind people competing with Tika and if you build a better library, more power to ya! But, consider this an invite to our team, since I think your goals and philosophies and your code (that you are developing) would be most welcomed in the Apache Tika project.
If keeping it simple to install things is an objective, then Tika provides two "single jar" executables that you can run to do your parsing. One is the Tika App (tika-app.jar), which will require forking a new JVM each time, but provides a very simple way to feed Tika the file and get back text or metadata. The other is the Tika Server (tika-server.jar), which provides REST-ful services to do things like detection, plain text extraction, html extraction etc. (Tika Server has almost, but not quite, as many endpoints as the Tika App has options). If you start a Tika Server, there's a one-time startup cost, after which it's very quick to send over files and get back the parsed response.
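As a rough sketch of the Tika Server route from Python, assuming a server already running locally on its default port 9998 and using only the standard library: the function names are hypothetical, and the request shape (a PUT of the raw file bytes to the `/tika` endpoint with an `Accept: text/plain` header) should be checked against the documentation for the Tika Server version in use.

```python
import urllib.request


def build_extract_request(document_bytes, server_url="http://localhost:9998"):
    """Build a PUT request that asks Tika Server for plain-text extraction."""
    return urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=document_bytes,
        headers={"Accept": "text/plain"},
        method="PUT",
    )


def extract_text_via_server(document_path, server_url="http://localhost:9998"):
    """Send one file to a running Tika Server and return the extracted text."""
    with open(document_path, "rb") as handle:
        request = build_extract_request(handle.read(), server_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (requires a running server; the file name is a placeholder):
# text = extract_text_via_server("report.pdf")
```

Because the JVM stays up between requests, this avoids the per-file startup cost of the Tika App jar, at the price of having to manage a server process.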
Otherwise, the Tika App jar contains all of Tika, along with the CLI + GUI interfaces. To keep things simple for Java programmers, we provide the OSGi bundle. For Python users, there's something to be said for grabbing the Tika App jar, adding that single jar to your classpath, then calling the normal Tika methods from within that (skipping the CLI classes). That would make it very simple for you to add Tika in, without the need to play with Maven (which, while good, is quite a lot of work to use for a non-Java project).
Thanks everybody for your thoughts on this. I'm still not exactly sure what makes sense here (I'm hoping for an epiphany after a little time to consider the options), but in e6cf734 I started a related projects portion of the documentation so that we can list this and other packages that have similar goals.