TIKA-1993: ObjectRecognitionParser + Tensorflow image recognition with Inception-V3 model as default implementation #125
LOG.warn("{} is not available for service", recogniser.getClass());
return;
}
metadata.set("object.rec.impl", recogniser.getClass().getSimpleName());
How about org.apache.tika.parser.recognition.object.rec.impl? In other words, prefix the key with o.a.t.p.r.
I tried to keep the key shorter.
It's done!
@thammegowda this is fantastic. I think you forgot to attach the default Python scripts, though, since you reference them in the classpath.
@chrismattmann Thanks for checking this. I went ahead and incorporated your feedback regarding the confidence in metadata and the package name in the key.
Thamme, why not just include the script in the classpath?
Also, it would be great to take the docs from the PR and make a wiki page out of them.
Tensorflow is Apache licensed, so we can definitely include the script in the classpath. It will then be archived inside the jar.
Correct me if I am wrong: I can definitely place the script inside and then make a local copy (instead of an HTTP GET) to run it, but I don't see any advantage in doing so.
It means we don't rely on having an Internet connection to run. You can also still use HTTP GET; just use a classpath jar URI.
In other words, one less HTTP call, and if the user has already downloaded the model, no external HTTP calls at all.
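To illustrate the suggestion above, here is a minimal sketch (not the code from this pull request) of bundling a script in the jar and copying it to a local path only when it is missing, so that no HTTP call is needed. The class name `BundledScript`, the method `ensureLocalCopy`, and the resource name in the usage note are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration only; not part of Tika. */
final class BundledScript {

    private BundledScript() {
    }

    /**
     * Copies a classpath resource to {@code target} unless {@code target}
     * already exists. {@code resourceName} is resolved from the classpath root.
     */
    static Path ensureLocalCopy(ClassLoader loader, String resourceName, Path target)
            throws IOException {
        if (Files.exists(target)) {
            return target; // a local copy is already present; nothing to fetch
        }
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found on classpath: " + resourceName);
            }
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent); // make sure the target directory exists
            }
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

A caller might invoke `BundledScript.ensureLocalCopy(BundledScript.class.getClassLoader(), "example/script.py", Paths.get("script.py"))` once before launching the external process; the resource name here is a placeholder, not the path used in the PR.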
…ns correction for static final constants
@chrismattmann Thanks for the clarification. Corrected in the latest commits.
+1 from me, great work.
EDIT: This implementation was removed due to its setup complexity. Use the REST-based implementation instead, which is easy to set up and has the same performance.
Achieved production-grade performance for Tensorflow image recognition using gRPC.
Step 1: Start the Tensorflow service
Step 2: Build an addon jar
Enable Tensorflow
Open Checkout test case
Working on a new pull request too, @thammegowda.
…ier via: (1) GRPC and (2) REST API - Added REST API service python program to resources - Added Docker Build File for REST API service - Added few Test Cases
+ service bind to all interfaces + auto start service
@chrismattmann
1. Start the inception service on port 8764: The API service code is added at
Also, a Docker file is added to set up the environment quickly:
cd tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/
docker build -f InceptionRestDockerfile -t inception-rest-tika .
docker run -p 8764:8764 -it inception-rest-tika
2. Sample configuration: use this configuration to parse JPEG images
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
<mime>image/jpeg</mime>
<params>
<param name="topN" type="int">7</param>
<param name="minConfidence" type="double">0.015</param>
<param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
</params>
</parser>
</parsers>
</properties>
This one by default uses
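A rough sketch of how the `topN` and `minConfidence` parameters in the sample configuration could interact, assuming (the thread does not spell this out) that candidates below `minConfidence` are dropped and at most `topN` of the remainder are kept, highest confidence first. The `RecognisedObjectFilter` and `Scored` names are made up for illustration and are not Tika classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of topN / minConfidence filtering; not Tika code. */
final class RecognisedObjectFilter {

    /** Hypothetical value type: a label plus the classifier's confidence score. */
    static final class Scored {
        final String label;
        final double confidence;

        Scored(String label, double confidence) {
            this.label = label;
            this.confidence = confidence;
        }
    }

    private RecognisedObjectFilter() {
    }

    /**
     * Keeps candidates with confidence >= minConfidence, sorts them by
     * descending confidence, and returns at most topN of them.
     */
    static List<Scored> select(List<Scored> candidates, int topN, double minConfidence) {
        List<Scored> kept = new ArrayList<>();
        for (Scored s : candidates) {
            if (s.confidence >= minConfidence) {
                kept.add(s);
            }
        }
        kept.sort(Comparator.comparingDouble((Scored s) -> s.confidence).reversed());
        int limit = Math.max(0, Math.min(topN, kept.size()));
        return new ArrayList<>(kept.subList(0, limit));
    }
}
```

With the values from the sample configuration (`topN` = 7, `minConfidence` = 0.015), a candidate scored 0.01 would be discarded under this assumed behaviour, and no more than seven labels would be reported per image.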
@thammegowda I tried merging this. First, the error:
Steps:
Can you please confirm this builds?
@chrismattmann What's happening here: this happens when the config file contains additional parameters that are not declared by the parsers. On the other hand, we had another discussion in Jira TIKA-1986 where the point was that parsers cannot anticipate all the runtime parameters of inner services at compile time. In my opinion, this test case is no longer applicable, but I wanted to get comments from @tballison on this.
@chrismattmann fixed
Thanks, appreciate it, working on testing now. @thammegowda
@thammegowda please update the final step in your docker build above; it should read: docker run -p 8764:8764 -it inception-rest-tika
Done. Thanks!
Looking good! @thammegowda First test:
all tests:
All script based tests work. Testing REST now.
For the REST server, here's what I'm getting: on tika-app side with config to point to REST:
on docker side running REST server:
OK so I got it working by upgrading Docker, but the REST service for Tensorflow has some weird error where it won't print the XHTML content.
OK so I can get Metadata
XHTML
Finally fixed it! 2 issues:
I'm going to do a few more tests, then get this committed! Great work @thammegowda. Overall this is an amazing contribution; it will be awesome for Tika users!
Awesome. Lessons learned.
build passed:
committing!
Hi Mr @thammegowda & Mr @chrismattmann Thanks and Regards
Hey folks, if there is a problem with deeplearning4j, we have a gitter channel: The release is coming out which will have the model import. The example from @chrismattmann and team is based on a snapshot. Please communicate with us if there are issues. Thanks!
Hello @agibsonccc, thanks for your concern, but I think you got me wrong. I am talking about this project.
Hi @kkrgithub! Please go to https://issues.apache.org/jira/browse/TIKA-2262 and comment that you are interested; that will add you to the watcher list and thus notify you if any further action is taken on that issue. To contribute to Tika, please look for open issues in JIRA at https://issues.apache.org/jira/browse/TIKA/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel and start digging!
@agibsonccc Sure thing.
We are eagerly waiting for it. Can't wait to integrate it into Tika and run it on Spark clusters to process huge datasets of images. FYI, we are also exploring Image2Text and other OCR models to run them using deeplearning4j.
Summary of changes:
ExternalParser which (if missing) downloads and calls tensorflowimage_classify.py script (the script then downloads the Inception-v3 model)
Quick Setup and Test
tika-parsers/src/test/java/org/apache/tika/parser/recognition/ObjectRecognitionParserTest.java
Demos
Compile package:
mvn clean install #-DskipTests if you don't like to wait for tests
Let's check
NOTE: image/jpeg is supported. PNG coming later.