TIKA-1993: ObjectRecognitionParser + Tensorflow image recognition with Inception-V3 model as default implementation #125

Merged: 13 commits, Aug 14, 2016

thammegowda
Member

Summary of changes:

  • Fixed TIKA-2002 : ExternalParser.check() empties stdout and stderr buffers so no more hanging is expected
  • Added ObjectRecognitionParser, ObjectRecogniser, RecognisedObject - A parser, interface and a model class respectively
  • implemented TensorFlowImageRecParser - an ExternalParser which (if missing) downloads and calls tensorflow image_classify.py script (the script then downloads Inception-v3 model)
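The TIKA-2002 item above concerns a child process hanging because nobody reads its output pipes. The general technique can be sketched in Python as follows. This is only an illustration of draining stdout and stderr, not the Java code in ExternalParser.check(); the helper name `run_and_drain` and the noisy child command are made up for the example.

```python
import subprocess
import sys

def run_and_drain(cmd, timeout=30):
    """Run cmd and read stdout and stderr to completion.

    communicate() consumes both pipes concurrently, so the child cannot
    block on a full pipe buffer while the parent waits for it to exit.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, err = proc.communicate()
    return proc.returncode, out, err

# Hypothetical child that writes more than a typical pipe buffer to both streams.
NOISY_CHILD = [
    sys.executable, "-c",
    "import sys; sys.stdout.write('o' * 200000); sys.stderr.write('e' * 200000)",
]

if __name__ == "__main__":
    code, out, err = run_and_drain(NOISY_CHILD)
    print(code, len(out), len(err))
```

Calling `proc.wait()` without reading either pipe is the pattern that can hang once the child fills a buffer.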

Quick Setup and Test

Demos

Compile the package: mvn clean install # add -DskipTests if you don't want to wait for the tests

Let's check:

java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar \
 --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml \
  tika-parsers/src/test/resources/test-documents/testJPEG.jpg
<meta name="English foxhound" content="0.02759"/>
<meta name="Egyptian cat" content="0.09168"/>
<meta name="collie" content="0.02982"/>
<meta name="bluetick" content="0.06043"/>
<meta name="Border collie" content="0.07553"/>
<meta name="projectile, missile" content="0.00034"/>
<meta name="military uniform" content="0.00763"/>
<meta name="bulletproof vest" content="0.00489"/>
<meta name="assault rifle, assault gun" content="0.92418"/>
<meta name="rifle" content="0.04343"/>
<meta name="power drill" content="0.00470"/>
<meta name="revolver, six-gun, six-shooter" content="0.69355"/>
<meta name="holster" content="0.21180"/>
<meta name="assault rifle, assault gun" content="0.01513"/>
<meta name="rifle" content="0.01053"/>
<meta name="car wheel" content="0.02527"/>
<meta name="convertible" content="0.01338"/>
<meta name="sports car, sport car" content="0.87855"/>
<meta name="beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon" content="0.00903"/>
<meta name="minivan" content="0.01217"/>
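For downstream use, the label/confidence pairs in output like the above can be pulled out and ranked. The sketch below assumes the flat `<meta name=... content=.../>` shape shown here; a regular expression is adequate for that shape but is not a general XHTML parser, and `parse_labels` is a name chosen for this example.

```python
import re

# Matches the self-closing meta tags in the tika-app output shown above.
META_RE = re.compile(r'<meta name="([^"]*)" content="([^"]*)"\s*/>')

def parse_labels(xhtml):
    """Return (label, confidence) pairs sorted by descending confidence."""
    labels = []
    for name, content in META_RE.findall(xhtml):
        try:
            labels.append((name, float(content)))
        except ValueError:
            # Skip meta tags whose content is not a number.
            continue
    return sorted(labels, key=lambda pair: pair[1], reverse=True)

SAMPLE = '''<meta name="English foxhound" content="0.02759"/>
<meta name="Egyptian cat" content="0.09168"/>
<meta name="collie" content="0.02982"/>
'''

if __name__ == "__main__":
    for label, score in parse_labels(SAMPLE):
        print(label, score)
```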

NOTE:

  1. The most efficient way to use tensorflow would be the C++ API via JNI. I haven't had a chance to learn that yet, so help is needed to make this efficient. Otherwise we may wait for the tensorflow folks to offer Java bindings! Right now the script loads and unloads the image recognition model on every call (200MB of disk reads per parse call, which is very inefficient).
  2. The very first call takes a long time because the model is downloaded lazily.
  3. Only image/jpeg is supported. PNG coming later
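Note 1 is the motivation for keeping the model resident in a long-running process. A minimal sketch of "load once, reuse across calls" is below; `load_model` is a hypothetical stand-in that does no real TensorFlow work, and the counter exists only to show how often the expensive step runs.

```python
import functools

# Counts how many times the (pretend) expensive load actually runs.
STATS = {"loads": 0}

@functools.lru_cache(maxsize=1)
def load_model(path):
    """Hypothetical stand-in for reading a large model from disk."""
    STATS["loads"] += 1
    return {"path": path}

def classify(image_bytes, model_path="tensorflow/tf-objectrec-model"):
    """Placeholder classification that reuses the cached model."""
    model = load_model(model_path)
    return {"model": model["path"], "size": len(image_bytes)}

if __name__ == "__main__":
    classify(b"first image")
    classify(b"second image")
    print(STATS["loads"])  # the load ran only once
```

A script that is started fresh for every image cannot benefit from this, since the cache dies with the process.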

LOG.warn("{} is not available for service", recogniser.getClass());
return;
}
metadata.set("object.rec.impl", recogniser.getClass().getSimpleName());
Contributor

How about org.apache.tika.parser.recognition.object.rec.impl? In other words, prefix the key with o.a.t.p.r.

Member Author

I tried to keep the key shorter.
It's done!

@chrismattmann
Contributor

@thammegowda this is fantastic. I think you forgot to attach the default Python scripts, though, since you reference them in the classpath.

@thammegowda
Member Author

@chrismattmann Thanks for checking this. I went ahead and incorporated your feedback regarding the confidence in the metadata and the package name in the key.
The default Python script path is relative to the current directory (users can also change it, TIKA-1508). The program should download the script when it is missing. If you have faced any issues, please paste the error message and I will fix it.

@chrismattmann
Contributor

Thamme, why not just include the script in the classpath?

@chrismattmann
Contributor

Also would be great to take the docs from the PR and make a wiki page out of them

@thammegowda
Member Author

Tensorflow is Apache-licensed, so we can definitely include the script in the classpath. It will then be archived inside the jar.
However,

  1. Even then we will have to copy the script from inside the jar to outside to let the python interpreter run it.
  2. Users will still need the network connection to download the model.

Correct me if I am wrong. I can definitely place the script inside and then make a copy locally (instead of HTTP GET) to run it, but I don't see any advantage in doing so.
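The "copy the script out of the jar" step discussed above can be sketched as follows. A jar is a zip archive, so the example uses Python's `zipfile`; the entry name `scripts/classify_image.py` and the helper `ensure_script` are hypothetical, and this is not the code that ended up in the PR.

```python
import os
import tempfile
import zipfile

def ensure_script(jar_path, entry, dest_path):
    """Copy `entry` out of the jar to dest_path unless it already exists.

    Returns True when a copy was made, False when the file was already there.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with zipfile.ZipFile(jar_path) as jar, jar.open(entry) as src:
        data = src.read()
    with open(dest_path, "wb") as dst:
        dst.write(data)
    return True

def demo():
    """Build a throwaway jar, extract from it twice, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        jar_path = os.path.join(tmp, "demo.jar")
        with zipfile.ZipFile(jar_path, "w") as jar:
            jar.writestr("scripts/classify_image.py", "print('hello')\n")
        dest = os.path.join(tmp, "work", "classify_image.py")
        first = ensure_script(jar_path, "scripts/classify_image.py", dest)
        second = ensure_script(jar_path, "scripts/classify_image.py", dest)
        with open(dest) as handle:
            content = handle.read()
        return first, second, content

if __name__ == "__main__":
    print(demo())
```

With the script bundled this way, no network call is needed for the script itself; only the model download still requires a connection.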

@chrismattmann
Contributor

It means we don't rely on having an Internet connection to run. Also, you can still use HTTP GET; just use a classpath jar URI.

@chrismattmann
Contributor

IOW, one less HTTP call, and if the user has already downloaded the model, no external HTTP calls at all.

@thammegowda
Member Author

@chrismattmann Thanks for the clarification. Corrected in the latest commits.

@chrismattmann
Contributor

+1 from me great work

@thammegowda
Member Author

thammegowda commented Jun 29, 2016

EDIT: This implementation was removed due to its setup complexity. Use the REST-based implementation instead, which is easy to set up and has the same performance.

Achieved production grade performance for tensorflow image recognition using gRPC

Step 1: Start tensorflow service

# pull and start the prebuilt container, forward port 9000
docker run -it -p 9000:9000 tgowda/inception_serving_tika

# Inside the container, start tensorflow service
root@8311ea4e8074:/# /serving/server.sh

Step 2: Build an addon jar

git clone git@github.com:thammegowda/tensorflow-grpc-java.git
cd tensorflow-grpc-java
mvn clean compile assembly:single

# copy the path of target/tensorflow-java-1.0-jar-with-dependencies.jar

Enable Tensorflow

Open org/apache/tika/parser/recognition/tika-config-tflow-addon.xml and set addon file path to the actual file obtained in previous step

Check out the test case ObjectRecognitionParserTest#testAddonJar

CC @chrismattmann

@chrismattmann
Contributor

working on a new pull request too @thammegowda

…ier via: (1) GRPC and (2) REST API

- Added REST API service python program to resources
- Added Docker Build File for REST API service
- Added few Test Cases
+ service bind to all interfaces
+ auto start service
@thammegowda
Member Author

thammegowda commented Jul 25, 2016

@chrismattmann
Updated with the REST API-based integration.

1. Start the inception service on port 8764:

The API service code is added at tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py

Also, a Dockerfile is added to set up the environment quickly:

cd tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/
docker build -f InceptionRestDockerfile -t inception-rest-tika .
docker run -p 8764:8764 -it inception-rest-tika

2. Sample Configuration

Use this configuration to parse JPEG images

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
            <mime>image/jpeg</mime>
            <params>
                <param name="topN" type="int">7</param>
                <param name="minConfidence" type="double">0.015</param>
                <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
            </params>
        </parser>
    </parsers>
</properties>

This one by default uses apiUrl=http://localhost:8764/inception/v3/classify. The way to change the defaults will be documented in wiki later.
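One plausible reading of the `topN` and `minConfidence` parameters in the sample configuration is sketched below: drop labels under the threshold, then keep the highest-scoring ones. This is an assumption about their meaning based on the names and on the REST output later in this thread, not a copy of the parser's logic; `select_labels` is a name invented for the example.

```python
def select_labels(scored, top_n=7, min_confidence=0.015):
    """Keep labels at or above min_confidence, best first, at most top_n of them."""
    kept = [(label, score) for label, score in scored if score >= min_confidence]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]

# Scores taken from the car image example earlier in this thread.
CAR_SCORES = [
    ("car wheel", 0.02527),
    ("convertible", 0.01338),
    ("sports car, sport car", 0.87855),
    ("minivan", 0.01217),
]

if __name__ == "__main__":
    for label, score in select_labels(CAR_SCORES):
        print(label, score)
```

With the defaults from the configuration, only the two labels scoring at least 0.015 survive.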

@chrismattmann
Contributor

@thammegowda I tried merging this.

First, the error:

Results :

Failed tests: 
  ParameterizedParserTest.testBadParam:92 should have thrown exception

Tests run: 201, Failures: 1, Errors: 0, Skipped: 1

Steps:

  1. git checkout --track github/TIKA-1508
  2. git pull
  3. git pull https://github.com/thammegowda/tika.git TIKA-1993
  4. mvn clean install

Can you please confirm this builds?

@thammegowda
Member Author

thammegowda commented Aug 13, 2016

@chrismattmann
I can reproduce the problem.

What's happening here:
This is due to an issue introduced in TIKA-1508 (runtime parameters from the configuration file).

It happens when the config file contains additional parameters that the parser does not declare.
The test case asserts that there must not be any extra parameters (extra params are treated as bad parameters).

Contrary to this case, we had another discussion in Jira TIKA-1986 that parsers cannot anticipate all the runtime parameters of inner services at compile time.

In my opinion, this test case is no longer applicable. But I wanted to get comments from @tballison regarding the same.
The breaking change is due to this comment here in the PR
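The two positions in this comment (reject undeclared parameters versus let them pass through to an inner service) can be contrasted in a small sketch. It is written in Python for brevity and is not Tika's configuration code; `apply_params` and its `strict` flag are hypothetical.

```python
def apply_params(declared, supplied, strict=True):
    """Convert supplied string values using the declared types.

    declared: mapping of parameter name to a converter such as int or float.
    supplied: mapping of parameter name to the raw string from the config.
    In strict mode an undeclared name is an error; otherwise it is returned
    separately so it can be handed to an inner service untouched.
    """
    unknown = sorted(set(supplied) - set(declared))
    if unknown and strict:
        raise ValueError("undeclared parameters: " + ", ".join(unknown))
    resolved = {
        name: declared[name](value)
        for name, value in supplied.items()
        if name in declared
    }
    extras = {name: supplied[name] for name in unknown}
    return resolved, extras

if __name__ == "__main__":
    declared = {"topN": int, "minConfidence": float}
    supplied = {"topN": "7", "minConfidence": "0.015", "apiUrl": "http://localhost:8764"}
    print(apply_params(declared, supplied, strict=False))
```

The strict mode corresponds to what the failing test expects; the lenient mode corresponds to the TIKA-1986 argument that a parser cannot know every runtime parameter in advance.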

@thammegowda
Member Author

@chrismattmann fixed

@chrismattmann
Contributor

thanks, appreciate it, working on testing now. @thammegowda

@chrismattmann
Contributor

chrismattmann commented Aug 13, 2016

@thammegowda please update your final step in the docker build above, it should read:

docker run -p 8764:8764 -it inception-rest-tika

@thammegowda
Member Author

Done. Thanks!

@chrismattmann
Contributor

looking good! @thammegowda

First test:

LMC-053601:tika1.14 mattmann$ java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml   tika-parsers/src/test/resources/test-documents/testJPEG.jpg
WARN  Model doesn't exist at tensorflow/tf-objectrec-model. Expecting the script to download it.
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="English foxhound" content="0.02759"/>
<meta name="Egyptian cat" content="0.09168"/>
<meta name="collie" content="0.02982"/>
<meta name="bluetick" content="0.06043"/>
<meta name="Border collie" content="0.07553"/>
<title/>
</head>
<body><p/>
</body></html>LMC-053601:tika1.14 mattmann$ 

@chrismattmann
Contributor

all tests:

LMC-053601:tika1.14 mattmann$ java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml   tika-parsers/src/test/resources/test-documents/testJPEG.jpg
WARN  Model doesn't exist at tensorflow/tf-objectrec-model. Expecting the script to download it.
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="English foxhound" content="0.02759"/>
<meta name="Egyptian cat" content="0.09168"/>
<meta name="collie" content="0.02982"/>
<meta name="bluetick" content="0.06043"/>
<meta name="Border collie" content="0.07553"/>
<title/>
</head>
<body><p/>
</body></html>LMC-053601:tika1.14 mattmann$ java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml   https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/US_Navy_100714-N-4965F-174_Chief_Mass_Communication_Specialist_Paula_Ludwick%2C_assigned_to_Fleet_Combat_Camera_Group_Pacific%2C_shoots_at_a_target_during_a_Navy_Rifle_Qualification_Course.jpg/220px-thumbnail.jpg
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="projectile, missile" content="0.00034"/>
<meta name="military uniform" content="0.00763"/>
<meta name="bulletproof vest" content="0.00489"/>
<meta name="assault rifle, assault gun" content="0.92418"/>
<meta name="rifle" content="0.04343"/>
<title/>
</head>
<body><p/>
</body></html>LMC-053601:tika1.14 mattmann$ java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-conf https://upload.wikimedia.org/wikipedia/commons/8/8d/Glock17.jpg
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="power drill" content="0.00470"/>
<meta name="revolver, six-gun, six-shooter" content="0.69355"/>
<meta name="holster" content="0.21180"/>
<meta name="assault rifle, assault gun" content="0.01513"/>
<meta name="rifle" content="0.01053"/>
<title/>
</head>
<body><p/>
</body></html>LMC-053601:tika1.14 mattmann$ java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-conf http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="car wheel" content="0.02527"/>
<meta name="convertible" content="0.01338"/>
<meta name="sports car, sport car" content="0.87855"/>
<meta name="beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon" content="0.00903"/>
<meta name="minivan" content="0.01217"/>
<title/>
</head>
<body><p/>
</body></html>LMC-053601:tika1.14 mattmann$ 

All script based tests work. Testing REST now.

@chrismattmann
Contributor

for the REST server here's what I'm getting:

on tika-app side with config to point to REST:

LMC-053601:tika1.14 mattmann$ java -jar tika-app/target/tika-app-1.14-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-rest-config.xml   http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
Exception in thread "main" org.apache.tika.exception.TikaConfigException: Connection to http://localhost:8764 refused
    at org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser.initialize(TensorflowRESTRecogniser.java:93)
    at org.apache.tika.parser.recognition.ObjectRecognitionParser.initialize(ObjectRecognitionParser.java:99)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:569)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:491)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:168)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:147)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:122)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:118)
    at org.apache.tika.cli.TikaCLI.configure(TikaCLI.java:673)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:406)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://localhost:8764 refused
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser.initialize(TensorflowRESTRecogniser.java:88)
    ... 10 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
    ... 17 more
LMC-053601:tika1.14 mattmann$ 

on docker side running REST server:

LMC-053601:tf mattmann$ docker run -p 8764:8764 -it inception-rest-tika
>> Downloading inception-2015-12-05.tgz 100.0%
Succesfully downloaded inception-2015-12-05.tgz 88931400 bytes.
Logs are directed to inception.log
Serving on port 8764
 * Running on http://0.0.0.0:8764/ (Press CTRL+C to quit)
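When the container reports that it is serving but the client sees "Connection refused", a quick way to separate a port-forwarding problem from an application problem is to test the TCP port from the host. A minimal sketch, with `port_is_open` being a name chosen for this example:

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False

def demo():
    """Check a local listener while it is up and again after it is closed."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    while_open = port_is_open("127.0.0.1", port)
    server.close()
    after_close = port_is_open("127.0.0.1", port)
    return while_open, after_close

if __name__ == "__main__":
    print(demo())
```

For the setup in this thread the call would be `port_is_open("localhost", 8764)`; a False result with the container running points at the Docker port mapping rather than at the service.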

@chrismattmann
Contributor

OK so I got it working by upgrading Docker, but the REST service for Tensorflow has some weird error where it won't print the XHTML content.

@chrismattmann
Contributor

chrismattmann commented Aug 14, 2016

OK so I can get -m to work but not -x.

Metadata

LMC-053601:tika1.14 mattmann$ java -cp tika-app/target/tika-app-1.14-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml -m tika-parsers/src/test/resources/test-documents/testJPEG.jpg
INFO  Available = true, API Status = HTTP/1.0 200 OK
INFO  minConfidence = 0.015, topN=7
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
INFO  Recogniser Available = true
Content-Length: 7686
Content-Type: image/jpeg
OBJECT: Egyptian cat (0.09168)
OBJECT: Border collie (0.07553)
OBJECT: bluetick (0.06043)
OBJECT: collie (0.02982)
OBJECT: English foxhound (0.02759)
OBJECT: Siamese cat, Siamese (0.02053)
OBJECT: tabby, tabby cat (0.01826)
X-Parsed-By: org.apache.tika.parser.CompositeParser
X-Parsed-By: org.apache.tika.parser.recognition.ObjectRecognitionParser
org.apache.tika.parser.recognition.object.rec.impl: org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
resourceName: testJPEG.jpg
LMC-053601:tika1.14 mattmann$ 

XHTML

LMC-053601:tika1.14 mattmann$ java -cp tika-app/target/tika-app-1.14-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml -x tika-parsers/src/test/resources/test-documents/testJPEG.jpg
INFO  Available = true, API Status = HTTP/1.0 200 OK
INFO  minConfidence = 0.015, topN=7
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
INFO  Recogniser Available = true

@chrismattmann
Contributor

finally fixed it! 2 issues:

  1. The handler needed startDocument and endDocument calls. That fixed the JSON output and, in turn, the REST and script-based Tensorflow calls.
  2. The recurring problem (still undocumented; we need to fix that!) that you can't concurrently modify the metadata object while doing the ContentHandler work. You have to have an ImmutableMetadata object by the time you do the ContentHandler work.
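Both points can be illustrated with Python's SAX writer, which follows the same event model as the Java ContentHandler. This is an analogue, not the Tika fix itself: the metadata is frozen into a tuple before any events are emitted, and the element events are bracketed by startDocument and endDocument. The function name `render_xhtml` is invented for the example.

```python
import io
from xml.sax.saxutils import XMLGenerator

def render_xhtml(labels):
    """Emit a small XHTML document with one meta tag per (label, score) pair."""
    # Freeze the metadata first; nothing below mutates it.
    frozen = tuple((name, "%.5f" % score) for name, score in labels)

    out = io.StringIO()
    handler = XMLGenerator(out, encoding="UTF-8")
    handler.startDocument()  # writes the XML declaration
    handler.startElement("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    handler.startElement("head", {})
    for name, content in frozen:
        handler.startElement("meta", {"name": name, "content": content})
        handler.endElement("meta")
    handler.endElement("head")
    handler.startElement("body", {})
    handler.endElement("body")
    handler.endElement("html")
    handler.endDocument()  # flushes the underlying stream
    return out.getvalue()

if __name__ == "__main__":
    print(render_xhtml([("Egyptian cat", 0.09168), ("Border collie", 0.07553)]))
```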

I'm going to do a few more tests and then get this committed! Great work @thammegowda. Overall this is an amazing contribution; it will be awesome for Tika users!

@thammegowda
Member Author

Awesome. Lessons learned.

@chrismattmann
Contributor

build passed:

[INFO] --- forbiddenapis:2.0:testCheck (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika1.14/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.14-SNAPSHOT/tika-1.14-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.025 s]
[INFO] Apache Tika core ................................... SUCCESS [ 11.208 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:48 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  1.423 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.323 s]
[INFO] Apache Tika batch .................................. SUCCESS [01:46 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  3.852 s]
[INFO] Apache Tika application ............................ SUCCESS [ 28.489 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 15.415 s]
[INFO] Apache Tika translate .............................. SUCCESS [  1.540 s]
[INFO] Apache Tika server ................................. SUCCESS [ 34.281 s]
[INFO] Apache Tika examples ............................... SUCCESS [  4.990 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  1.458 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.015 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:20 min
[INFO] Finished at: 2016-08-14T11:44:03-07:00
[INFO] Final Memory: 121M/1751M
[INFO] ------------------------------------------------------------------------
LMC-053601:tika1.14 mattmann$ 

committing!

@asfgit asfgit merged commit a1d1a81 into apache:TIKA-1508 Aug 14, 2016
@kkrgithub

Hi Mr @thammegowda & Mr @chrismattmann,
I am a student looking to participate in GSoC 2017, and I would like to contribute to your reported project "Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types".
As far as I understand, this thread discussed improving this module by using the C++ API for tensorflow, so I am wondering whether I should make that improvement to get started on the actual project.
If gRPC is enough, should I consider fixing bugs in deeplearning4j instead? I have gone through all the resources you cited on the "Apache GSoC 2017 ideas" site, and I completed the quick start tutorial for deeplearning4j.
Please help me get further with this.

Thanks and Regards
M kranthi kumar reddy
student @iiit Gwalior India.

@agibsonccc

Hey folks, if there is a problem with deeplearning4j, we have a gitter channel:
https://gitter.im/deeplearning4j/deeplearning4j

The upcoming release will include the model import. The example from @chrismattmann and team is based on a snapshot. Please let us know if there are issues.

Thanks!

@kkrgithub

Hello @agibsonccc, thanks for your concern, but I think you misunderstood me. I am discussing this project:
Supporting Image-to-Text (Image Captioning) in Tika for Image MIME Types (https://issues.apache.org/jira/browse/TIKA-2262?filter=12339687)
I am not sure whether the above project is about deeplearning4j or about Tika. I may be wrong, so any clarification is greatly appreciated.
Thank you!

@thammegowda
Member Author

Hi @kkrgithub! Please go to https://issues.apache.org/jira/browse/TIKA-2262 and comment that you are interested. That will add you to the watcher list, so you will be notified of any further action on that issue.

To contribute to Tika, please look for open issues in JIRA at https://issues.apache.org/jira/browse/TIKA/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel and start digging!

@thammegowda
Member Author

@agibsonccc Sure thing.

The upcoming release will include the model import. The example from @chrismattmann and team is based on a snapshot. Please let us know if there are issues.

We are eagerly waiting for it. Can't wait to integrate it into Tika and run it on Spark clusters to process huge datasets of images.

FYI, we are also exploring Image2Text and other OCR models to run them using deeplearning4j.
