integrate with apache tika? #12

Open
deanmalmgren opened this issue Jul 28, 2014 · 11 comments

@deanmalmgren
Owner

Apache Tika supports a pretty wide range of formats and appears to have many of the same goals, namely extracting text from a very wide range of formats. It seems like it might be good to integrate the Apache Tika extraction capabilities into textract, not unlike how we use external libraries like antiword or python-pptx to extract content.

The nice thing about this is that it would provide yet another means of extracting raw text from the formats we already support. For PDFs, for example, we expose methods to extract via the pdfminer python package as well as the command line utility pdftotext, so it would be natural to add another tika extraction method. We could even use tika as a better fallback when there aren't any natively written extraction methods for a format.

The downside is that Tika is written in java and doesn't appear to be the easiest thing to install for maven n00bs like me. Python bindings exist but even those carry big caveats about installation.

Random thought: it would be interesting to do an experiment to look at how effective Tika is at extracting text versus the other utilities that are currently included in textract. Given that we often care about the accuracy of word sequences (or even more forgivingly, word frequencies), maybe we can construct a test to see where it makes sense to include Tika and where (if anywhere) it doesn't perform as well.
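
One rough way to frame that experiment: run two extractors on the same document and compare the word frequencies they produce with cosine similarity. This is only a sketch. The function names are hypothetical (not part of textract or Tika) and the tokenizer is deliberately crude.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in extracted text (crude tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def frequency_similarity(text_a, text_b):
    """Cosine similarity between the word-frequency vectors of two texts.

    1.0 means identical relative word frequencies; 0.0 means no shared words.
    """
    freq_a, freq_b = word_frequencies(text_a), word_frequencies(text_b)
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one extractor produced no words
    return dot / (norm_a * norm_b)
```

Because this ignores word order, it matches the "more forgiving" word-frequency comparison; checking word sequences would need something like an n-gram or edit-distance measure instead.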

This idea came up on twitter (here, here, and here) and I should probably get back to them at some point once we figure out what to do here.

@w1kke

w1kke commented Jul 28, 2014

Does this solve the problem?

Using jnius: finally, I remembered a library I had spotted once, called jnius, that should be made exactly for that purpose: using Java libraries from Python, without the need for wrappers, running the whole thing in a JVM, etc. In the end, I opted for doing it this way.
Setting up pyjnius

Setting things up was pretty straight-forward, as it was just a matter of:

pip install cython
pip install git+https://github.com/kivy/pyjnius.git

Then, I downloaded the tika-app jar, and put it somewhere.

From that point, using the library was a breeze:

# If you put the jar in a non-standard location, you need to
# prepare the CLASSPATH before importing jnius
import os
os.environ['CLASSPATH'] = "/path/to/tika-app.jar"

from jnius import autoclass

# Import the Java classes we are going to need
Tika = autoclass('org.apache.tika.Tika')
Metadata = autoclass('org.apache.tika.metadata.Metadata')
FileInputStream = autoclass('java.io.FileInputStream')

tika = Tika()
meta = Metadata()
text = tika.parseToString(FileInputStream(filename), meta)

That's it! Now, you can just access the text transcript from text, and the file metadata is stored in meta (have a look at the .names() and .get(name) methods).
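
As a small sketch of reading that metadata, the helper below copies it into a plain dict using only the .names() and .get(name) methods mentioned above. The helper name is hypothetical, and it assumes pyjnius hands back names() as an iterable of strings.

```python
def metadata_to_dict(meta):
    """Copy a Tika Metadata-like object into a plain Python dict.

    Only relies on meta.names() and meta.get(name); any object exposing
    those two methods works, which also makes this easy to test.
    """
    return {name: meta.get(name) for name in meta.names()}
```

With the snippet above, this would be called as metadata_to_dict(meta) after parseToString has filled meta.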

Integrating this with django and celery tasks was straightforward.

Of course, have a look at the Tika API Documentation for more information on the available methods, signatures, etc.

Taken from this source:
http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html

@GomesNayagam

We run the text extraction process (using Tika) as a microservice (Thrift), and my Python client, or any other client, can use the Apache Thrift client to call it. That keeps the complexity out of the client while leaving the native JVM performance, etc., intact. It is my personal choice, of course.

@deanmalmgren
Owner Author

@w1kke that sounds really promising! I like the idea of using pyjnius to get this to work. The java dependency for installation doesn't sound too terrible and I like how actively developed the pyjnius project is.

@chrismattmann sorry to rope you into this conversation, but do you have any thoughts on using pyjnius to autoclass Tika vs using your python bindings?

Ideally it would be great if we can keep the installation of textract as simple and reliable as possible, with one apt-get install command and one pip install textract command.

@GomesNayagam Can you elaborate a bit more on how Thrift works? Do you have to have a Thrift server process running in order to use Tika with python then? If so, I think I'd prefer to go with either pyjnius or the tika-python options as it makes using textract that much cleaner.

@bitsgalore

For what it's worth, I did some quick experiments trying to get the Tika detector interface to work in Python using Pyjnius some months ago. See this repo:

https://github.com/bitsgalore/tikadetect

But then I ran into this:

bitsgalore/tikadetect#1

After which I gave up on it. Just thought I'd drop those links here, just in case it's useful!

@chrismattmann

Hi @deanmalmgren thanks for roping me in! To answer your question, I'm open to using pyjnius if it's better than JCC. Right now I'm experimenting with just making Tika available from Python in the easiest manner possible. JCC has its issues mostly having to do with weird configuration that's kind of unknown to me at times. I modeled my tika-python wrapper after Aptivate Tika which to my knowledge about a year ago was really the only well defined effort to expose Tika to Python. It had been forked many times, adapted in various ways, etc.

I'm happy to explore other Python integration facilities, with the ultimate goal of getting this pushed upstream into Apache Tika. I'd like Python support to be a focus there since there are many folks working to improve Tika at Apache and since we maintain other bindings there (e.g., a basic .NET facility, even though there are other examples of such bindings, e.g., http://kevm.github.io/tikaondotnet/).

That said, in looking at textract, I'm wondering - how are its goals different than Apache Tika's? If they are the same, it would be good to join forces and simply make a great Python version of Tika and then provide that to the community.

Thoughts? Thanks for checking out tika-python. I'm currently using it on the DARPA XDATA project in concert with a simple ETL library (https://github.com/chrismattmann/etllib) and Apache OODT (http://oodt.apache.org/) and Apache Solr (http://lucene.apache.org/solr/).

@GomesNayagam

@deanmalmgren yes, we run the Tika jar as a separate process behind Thrift (a Java service) and use a Python Thrift client to access the functionality. Since our use case is online collaboration, we need a way to handle this at scale. For your case, you can go with your approach or use Python's subprocess module to run the jar in command line mode and get the result.
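
The subprocess route could look roughly like this. It is a sketch, not checked against any particular Tika release: the function names are hypothetical, and it assumes java is on the PATH and that Tika App's --text option prints the extracted plain text to stdout.

```python
import subprocess


def build_tika_command(jar_path, document_path):
    """Build the command line for running Tika App in plain-text mode."""
    return ["java", "-jar", jar_path, "--text", document_path]


def extract_text_with_tika_app(jar_path, document_path):
    """Run Tika App in a new JVM and return the extracted text.

    Raises subprocess.CalledProcessError if the java process fails.
    Decoding as UTF-8 is an assumption; the actual output encoding may
    depend on the platform and on Tika's options.
    """
    result = subprocess.run(
        build_tika_command(jar_path, document_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Each call starts a fresh JVM, so this is simple to install but pays the JVM startup cost per document.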

@deanmalmgren
Owner Author

@bitsgalore thanks for sharing the links; that's great to know.

@deanmalmgren
Owner Author

@chrismattmann Thanks for sharing your thoughts on how you developed the tika python bindings and the pros/cons of pyjnius vs JCC.

> That said, in looking at textract, I'm wondering - how are its goals different than Apache Tika's? If they are the same, it would be good to join forces and simply make a great Python version of Tika and then provide that to the community.

This is a great question and admittedly something I have been grappling with quite a bit since I was first made aware of Tika.

One key difference, at least as far as I understand Tika, is that Tika provides one and only one parser class for each document type. Textract, on the other hand, is parser method agnostic, meaning that we could have multiple ways of extracting content from the same document type. For example, you can currently either parse PDFs with the pdftotext command line utility or with the pdfminer python package, and this can be controlled with the --method command line argument or the method kwarg to textract.process. Sometimes there is a tradeoff on accuracy vs performance and I think it's important to give users flexibility when parsing content.
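
To make the "method agnostic" idea concrete, here is a minimal sketch of dispatching on both file extension and method. It is not textract's actual implementation: the registry, the decorator, and the placeholder extractors are all hypothetical and do no real extraction.

```python
# Hypothetical registry: (file extension, method name) -> extraction function.
_EXTRACTORS = {}
_DEFAULT_METHOD = {}


def register(extension, method, default=False):
    """Decorator that registers an extraction function for an extension."""
    def decorator(func):
        _EXTRACTORS[(extension, method)] = func
        if default or extension not in _DEFAULT_METHOD:
            _DEFAULT_METHOD[extension] = method
        return func
    return decorator


def process(path, method=None):
    """Dispatch to the extractor chosen by file extension and method."""
    extension = path.rsplit(".", 1)[-1].lower()
    if extension not in _DEFAULT_METHOD:
        raise ValueError(f"no extractor registered for .{extension}")
    method = method or _DEFAULT_METHOD[extension]
    try:
        extractor = _EXTRACTORS[(extension, method)]
    except KeyError:
        raise ValueError(f"unknown method {method!r} for .{extension}") from None
    return extractor(path)


@register("pdf", "pdftotext", default=True)
def _pdf_via_pdftotext(path):
    return f"<text of {path} via pdftotext>"  # placeholder, no real extraction


@register("pdf", "pdfminer")
def _pdf_via_pdfminer(path):
    return f"<text of {path} via pdfminer>"  # placeholder, no real extraction
```

Under this layout, a Tika-backed extractor would just be one more registered method for each extension it can handle.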

In that vein, it seems natural to extend textract to have tika support (either by JCC or pyjnius) because it is yet another way to extract text. Provided it's as easy to install textract as it is today (with one apt-get install command and one pip install command), this seems like a great addition.

I'm certainly not opposed to creating a "Tika for python" but think that realistically other tools beyond Tika are probably just as good, if not better, at extracting content. The intent here is to pull all those together in one easy to use way.

What do you think about this? Am I full of shit or is this a worthy goal?

@chrismattmann

Hi @deanmalmgren thanks for your reply. My thoughts are below:

> One key difference, at least as far as I understand Tika, is that Tika provides one and only one parser class for each document type. Textract, on the other hand, is parser method agnostic, meaning that we could have multiple ways of extracting content from the same document type. For example, you can currently either parse PDFs with the pdftotext command line utility or with the pdfminer python package, and this can be controlled with the --method command line argument or the method kwarg to textract.process. Sometimes there is a tradeoff on accuracy vs performance and I think it's important to give users flexibility when parsing content.

Tika doesn't only provide one type of parser class for each document type. We have all sorts of ways to combine them (e.g., we have a CompositeParser construct that combines various underlying Parsers to form new and more powerful ones; we have FallbackParsers that try and parse first, and if unsuccessful, use an ordered List of Parsers to fall back on; we have ForkParsers which fork out new processes to control parsers, etc.) The mapping of Parsers to MIME types is also something that is a 1...N relationship (each parser declares its supported MIME types and they can be overlapping).
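
The fallback idea described above (try one parser, then fall back through an ordered list) can be illustrated in a few lines of Python. This is only a sketch of the concept with a hypothetical function name; it is not Tika's Java implementation or API.

```python
def parse_with_fallback(path, parsers):
    """Try each parser in order and return the first successful result.

    `parsers` is an ordered list of callables that take a path and either
    return extracted text or raise an exception.
    """
    errors = []
    for parser in parsers:
        try:
            return parser(path)
        except Exception as exc:  # any failure moves on to the next parser
            errors.append(f"{getattr(parser, '__name__', parser)}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))
```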

As for other tools besides Tika being good or better at extracting content: that is entirely possible and I've seen it. Tika isn't meant to be the best parsing toolkit in the world; its goal is to find all those toolkits and to integrate them. So far the MIME registry we have is compliant with the 1200+ types defined in the IANA registry, we pretty much have parsers for all of those different types, and more are being added each day. I have funding from NASA, DARPA and the NSF and various other reimbursable efforts (e.g., with bioinformatics companies, etc.) and so we are working to add more and more parsers and support. We also have a healthy, active community of developers at Apache (I would say currently there are between 8-10 active developers working on Tika, not simply at NASA, but at various companies and agencies).

I'm about to publish a blog post as well showing where we're taking Tika in some areas, especially concerning Machine Translation (once you identify the language and you have the ability to parse text and metadata from it, why not unify the languages and translate the text, the metadata, etc., too?). We are actively working in that area now. Tika really does aim to be the "digital babel fish". Not sure if you saw, but the 1st and 4th chapters from the Tika in Action book are available; the 1st one gives the motivation and case for the "Digital Babel Fish": http://manning.com/mattmann/SampleChapter-1.pdf and has some more insight into my and the team's motivations over the years.

So, at the end of the day, I'm definitely biased and think that the goals for textract are actually quite overlapping with Tika. That doesn't mean you have to be swallowed into the Tika project and you may decide, nope, I'm going forward and doing my own thing. If you do, that's fine and Tika can be one of your dependencies since we want anyone to use it and permissively license it under the Apache License version 2. Heck, we don't even mind people competing with Tika and if you build a better library, more power to ya! But, consider this an invite to our team, since I think your goals and philosophies and your code (that you are developing) would be most welcomed in the Apache Tika project.

Cheers!

@Gagravarr

If keeping it simple to install things is an objective, then Tika provides two "single jar" executables that you can run to do your parsing. One is the Tika App (tika-app.jar), which requires forking a new JVM each time, but provides a very simple way to feed Tika the file and get back text or metadata. The other is the Tika Server (tika-server.jar), which provides RESTful services to do things like detection, plain text extraction, HTML extraction, etc. (Tika Server has almost, but not quite, as many endpoints as the Tika App has options.) If you start a Tika Server, there's a one-time startup cost, after which it's very quick to send over files and get back the parsed response.
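
A minimal sketch of the Tika Server route from Python, using only the standard library. It assumes a server is already running and that it accepts a PUT to /tika on port 9998 and returns plain text when asked via the Accept header; check the endpoint and port against the Tika Server version you run. The function names are hypothetical.

```python
import urllib.request


def build_tika_request(data, server_url="http://localhost:9998"):
    """Build a PUT request asking a Tika Server for plain text."""
    return urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text_with_tika_server(path, server_url="http://localhost:9998"):
    """Send a file to a running Tika Server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), server_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

The trade-off versus Tika App is the one described above: the server has to be started and kept running, but each document then avoids a fresh JVM start.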

Otherwise, the Tika App jar contains all of Tika, along with the CLI + GUI interfaces. To keep things simple for Java programmers, we provide the OSGi bundle. For Python users, there's something to be said for grabbing the Tika App jar, adding that single jar to your classpath, then calling the normal Tika methods from within that (skipping the CLI classes). That would make it very simple for you to add Tika in, without the need to play with Maven (which, while good, is quite a lot of work to use for a non-Java project).

@deanmalmgren
Owner Author

Thanks everybody for your thoughts on this. I'm still not exactly sure what makes sense here (hoping for an epiphany after a little time to consider the options), but in e6cf734 I started a related projects portion of the documentation so that we can list this (and other) packages that have similar goals.
