New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIKA-2306: Update Inception v3 to Inception v4 in Object recognition parser #163

Merged
merged 21 commits into from Apr 20, 2017

Conversation

Projects
None yet
5 participants
@KranthiGV
Contributor

KranthiGV commented Mar 29, 2017

Summary:

Object Recognition Parser currently uses Inception V3 model for the object classification task. Google released a newer Inception V4 model [1][2].
It has an improved Top -1 accuracy of 80.2 and Top-5 accuracy of 95.2 [3].

Quick setup and Test:

git clone https://github.com/tensorflow/models/   
export PYTHONPATH="$PYTHONPATH:/models/slim" (replace with your installation directory)   

sudo apt-get install libtcmalloc-minimal4   
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"   
  • NOTE: The last two lines are added due to tensorflow issues tensorflow/tensorflow#6968. It would be removed once it is fixed.

  • It can be evaded by integrating parts of tensorflow/models code into our repository. It has Apache license. So, it can be done.

  • Checkout the test case tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml

Demos:

java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml tika-parsers/src/test/resources/test-documents/testJPEG.jpg

  • The output would include:
<meta name="OBJECT" content="Egyptian cat (0.31143)"/>  
<meta name="OBJECT" content="tabby, tabby cat (0.07072)"/>  

REST API

Start the inception service on 8764 port :

The API service code is added at tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py

Also, a docker file is added to setup the environment quickly

cd tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/
docker build -f InceptionRestDockerfile -t inception-rest-tika .
docker run -p 8764:8764 -it inception-rest-tika
  • Use the config at tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml

Tests and build

Build status

[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................ SUCCESS [0.693s]
[INFO] Apache Tika core .................................. SUCCESS [19.393s]
[INFO] Apache Tika parsers ............................... SUCCESS [1:02.685s]
[INFO] Apache Tika XMP ................................... SUCCESS [0.851s]
[INFO] Apache Tika serialization ......................... SUCCESS [0.924s]
[INFO] Apache Tika batch ................................. SUCCESS [1:53.792s]
[INFO] Apache Tika language detection .................... SUCCESS [2.210s]
[INFO] Apache Tika application ........................... SUCCESS [23.620s]
[INFO] Apache Tika OSGi bundle ........................... SUCCESS [11.271s]
[INFO] Apache Tika translate ............................. SUCCESS [1.161s]
[INFO] Apache Tika server ................................ SUCCESS [26.655s]
[INFO] Apache Tika examples .............................. SUCCESS [3.562s]
[INFO] Apache Tika Java-7 Components ..................... SUCCESS [1.040s]
[INFO] Apache Tika eval .................................. SUCCESS [13.477s]
[INFO] Apache Tika ....................................... SUCCESS [0.037s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4:41.751s
[INFO] Finished at: Wed Mar 29 10:38:00 IST 2017
[INFO] Final Memory: 158M/1535M
[INFO] ------------------------------------------------------------------------

Script based implementation tests

timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml tika-parsers/src/test/resources/test-documents/testJPEG.jpg
WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Egyptian cat" content="0.31143"/>
<meta name="tabby, tabby cat" content="0.07072"/>
<meta name="tiger cat" content="0.04990"/>
<meta name="Siamese cat, Siamese" content="0.02097"/>
<meta name="Border collie" content="0.01930"/>
<title/>
</head>
<body><p/>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
<meta name="resourceName" content="testJPEG.jpg"/>
<meta name="Content-Length" content="7686"/>
<meta name="OBJECT" content="Egyptian cat (0.31143)"/>
<meta name="OBJECT" content="tabby, tabby cat (0.07072)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body><ol id="objects">	<li id="Egyptian cat"> Egyptian cat [eng](confidence = 0.311430 )</li>
	<li id="tabby, tabby cat"> tabby, tabby cat [eng](confidence = 0.070720 )</li>
</ol>

timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/p-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/US_Navy_100714-N-4965F-174_Chief_Mass_Communication_Specialist_Paula_Ludwick%2C_assigned_to_Fleet_Combat_Camera_Group_Pacific%2C_shoots_at_a_target_during_a_Navy_Rifle_Qualification_Course.jpg/220px-thumbnail.jpg
WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="military uniform" content="0.00355"/>
<meta name="bulletproof vest" content="0.00397"/>
<meta name="revolver, six-gun, six-shooter" content="0.00176"/>
<meta name="assault rifle, assault gun" content="0.84119"/>
<meta name="rifle" content="0.08642"/>
<title/>
</head>
<body><p/>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
<meta name="resourceName" content="220px-thumbnail.jpg"/>
<meta name="Content-Length" content="9535"/>
<meta name="OBJECT" content="assault rifle, assault gun (0.84119)"/>
<meta name="OBJECT" content="rifle (0.08642)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body><ol id="objects">	<li id="assault rifle, assault gun"> assault rifle, assault gun [eng](confidence = 0.841190 )</li>
	<li id="rifle"> rifle [eng](confidence = 0.086420 )</li>
</ol>
</body></html>

timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/p-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml https://upload.wikimedia.org/wikipedia/commons/8/8d/Glock17.jpg
WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="hatchet" content="0.00087"/>
<meta name="revolver, six-gun, six-shooter" content="0.89842"/>
<meta name="holster" content="0.02361"/>
<meta name="assault rifle, assault gun" content="0.01820"/>
<meta name="rifle" content="0.00943"/>
<title/>
</head>
<body><p/>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
<meta name="resourceName" content="Glock17.jpg"/>
<meta name="Content-Length" content="368025"/>
<meta name="OBJECT" content="revolver, six-gun, six-shooter (0.89842)"/>
<meta name="OBJECT" content="holster (0.02361)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body><ol id="objects">	<li id="revolver, six-gun, six-shooter"> revolver, six-gun, six-shooter [eng](confidence = 0.898420 )</li>
	<li id="holster"> holster [eng](confidence = 0.023610 )</li>
</ol>
</body></html>

timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/p-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow.xml http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowImageRecParser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="car wheel" content="0.07900"/>
<meta name="convertible" content="0.04067"/>
<meta name="sports car, sport car" content="0.75443"/>
<meta name="beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon" content="0.00955"/>
<meta name="grille, radiator grille" content="0.01366"/>
<title/>
</head>
<body><p/>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowImageRecParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
<meta name="resourceName" content="650x366"/>
<meta name="Content-Length" content="42377"/>
<meta name="OBJECT" content="sports car, sport car (0.75443)"/>
<meta name="OBJECT" content="car wheel (0.07900)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body><ol id="objects">	<li id="sports car, sport car"> sports car, sport car [eng](confidence = 0.754430 )</li>
	<li id="car wheel"> car wheel [eng](confidence = 0.079000 )</li>
</ol>
</body></html>

REST API based implementation tests

timberners@galileo:~/Desktop/gsoc/issues/tika$ java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366
WARN  JBIG2ImageReader not loaded. jbig2 files will be ignored
INFO  Available = true, API Status = HTTP/1.0 200 OK
INFO  minConfidence = 0.015, topN=2
INFO  Recogniser = org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
INFO  Recogniser Available = true
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
<meta name="resourceName" content="650x366"/>
<meta name="Content-Length" content="42377"/>
<meta name="OBJECT" content="sports car, sport car (0.75443)"/>
<meta name="OBJECT" content="car wheel (0.07900)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body><ol id="objects">	<li id="sports car, sport car"> sports car, sport car [en](confidence = 0.754434 )</li>
	<li id="car wheel"> car wheel [en](confidence = 0.079000 )</li>
</ol>
</body></html>
@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Mar 29, 2017

Contributor

@thammegowda @chrismattmann
Please review this PR.

Contributor

KranthiGV commented Mar 29, 2017

@thammegowda @chrismattmann
Please review this PR.

@KranthiGV KranthiGV changed the title from Tika-2306: Update Inception v3 to Inception v4 in Object recognition parser to TIKA-2306: Update Inception v3 to Inception v4 in Object recognition parser Mar 29, 2017

@grossws

This comment has been minimized.

Show comment
Hide comment
@grossws

grossws Mar 29, 2017

Member

Can you, please, reformat all python code according to PEP8 (it's usually sone with IDE or any automated tool)?

Also, https://github.com/apache/tika/pull/163/files#diff-d1b641ce7738b7a40d608caeaa51994cR36 is quite normal while developing but should point to Apache Tika repo afterward.

@thammegowda, you forgot your comment there, in InceptionRestDockerfile ,)

Member

grossws commented Mar 29, 2017

Can you, please, reformat all python code according to PEP8 (it's usually sone with IDE or any automated tool)?

Also, https://github.com/apache/tika/pull/163/files#diff-d1b641ce7738b7a40d608caeaa51994cR36 is quite normal while developing but should point to Apache Tika repo afterward.

@thammegowda, you forgot your comment there, in InceptionRestDockerfile ,)

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Mar 29, 2017

Contributor

@grossws
Thanks for the review, Gribov.
I have just updated the coding style to follow PEP8 as suggested.
Please review.
And yes, I would add a new commit to point to the Apache Tika repo's version as soon as this is merged.

Contributor

KranthiGV commented Mar 29, 2017

@grossws
Thanks for the review, Gribov.
I have just updated the coding style to follow PEP8 as suggested.
Please review.
And yes, I would add a new commit to point to the Apache Tika repo's version as soon as this is merged.

@grossws

This comment has been minimized.

Show comment
Hide comment
@grossws

grossws Mar 29, 2017

Member

@KranthiGV, my first name is Konstantin, but simplier is just to use nickname.

Member

grossws commented Mar 29, 2017

@KranthiGV, my first name is Konstantin, but simplier is just to use nickname.

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Mar 29, 2017

Contributor

Apologies! @grossws
That was an honest mistake.

PS The tests yielded a similar result after changes. So, I am not including them for the sake of brevity.

Contributor

KranthiGV commented Mar 29, 2017

Apologies! @grossws
That was an honest mistake.

PS The tests yielded a similar result after changes. So, I am not including them for the sake of brevity.

@chrismattmann

This comment has been minimized.

Show comment
Hide comment
@chrismattmann

chrismattmann Mar 29, 2017

Contributor

Do the above tests show that Inception v4 is better?

Contributor

chrismattmann commented Mar 29, 2017

Do the above tests show that Inception v4 is better?

@chrismattmann

This comment has been minimized.

Show comment
Hide comment
@chrismattmann

chrismattmann Mar 29, 2017

Contributor

Also all of this documentation should be added to the wiki - it looks great.

Contributor

chrismattmann commented Mar 29, 2017

Also all of this documentation should be added to the wiki - it looks great.

# TODO: Change the URL to Apache/Tika Repo when this PR gets merged
RUN \
wget https://raw.githubusercontent.com/thammegowda/tika/TIKA-1993/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \
wget https://raw.githubusercontent.com/KranthiGV/tika/TIKA-2306/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \

This comment has been minimized.

@chrismattmann

chrismattmann Mar 29, 2017

Contributor

this should be a Tika URL...not personal GitHub.

@chrismattmann

chrismattmann Mar 29, 2017

Contributor

this should be a Tika URL...not personal GitHub.

This comment has been minimized.

@KranthiGV

KranthiGV Mar 29, 2017

Contributor

I would add another commit as soon as this is merged. Thanks.

@KranthiGV

KranthiGV Mar 29, 2017

Contributor

I would add another commit as soon as this is merged. Thanks.

This comment has been minimized.

@thammegowda

thammegowda Apr 14, 2017

Contributor

TODO: when we merge this PR, we modify it and then merge it

@thammegowda

thammegowda Apr 14, 2017

Contributor

TODO: when we merge this PR, we modify it and then merge it

uid_lookup_path = os.path.join(
FLAGS.model_dir, 'imagenet_synset_to_human_label_map.txt')
self.node_lookup = self.load(label_lookup_path, uid_lookup_path)
def create_readable_names_for_imagenet_labels():

This comment has been minimized.

@chrismattmann

chrismattmann Mar 29, 2017

Contributor

shouldn't we just make this a module function since it's defined in two files?

@chrismattmann

chrismattmann Mar 29, 2017

Contributor

shouldn't we just make this a module function since it's defined in two files?

@chrismattmann

This comment has been minimized.

Show comment
Hide comment
@chrismattmann

chrismattmann Mar 29, 2017

Contributor

I made a few comments: overall it looks great. There is no need to maintain back compat with v3 since v3 was available in Tika 1.14. I am good on my end +1 @thammegowda or @grossws if you guys want to commit, go for it.

Contributor

chrismattmann commented Mar 29, 2017

I made a few comments: overall it looks great. There is no need to maintain back compat with v3 since v3 was available in Tika 1.14. I am good on my end +1 @thammegowda or @grossws if you guys want to commit, go for it.

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Mar 29, 2017

Contributor

The comparison of Inception-V3 to Inception-V4 is here https://arxiv.org/pdf/1602.07261.pdf Page 10

Contributor

thammegowda commented Mar 29, 2017

The comparison of Inception-V3 to Inception-V4 is here https://arxiv.org/pdf/1602.07261.pdf Page 10

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Mar 29, 2017

Contributor

@thammegowda, you forgot your comment there, in InceptionRestDockerfile ,)

Yep! forgot to update it after merging the PR.
This time we will correct it. Thanks :-)

Contributor

thammegowda commented Mar 29, 2017

@thammegowda, you forgot your comment there, in InceptionRestDockerfile ,)

Yep! forgot to update it after merging the PR.
This time we will correct it. Thanks :-)

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Mar 29, 2017

Contributor

"Do the above tests show that Inception v4 is better?"

Inception V4 has been proven to have a Top-1 accuracy of 80.2 and Top-5 accuracy of 95.2 whereas Inception V3 has 78.0 and 93.9 respectively.
The aim of this PR is let the Tika community have access to near state-of-art features.

Also all of this documentation should be added to the wiki - it looks great.

Definitely. I am currently working on it.

Contributor

KranthiGV commented Mar 29, 2017

"Do the above tests show that Inception v4 is better?"

Inception V4 has been proven to have a Top-1 accuracy of 80.2 and Top-5 accuracy of 95.2 whereas Inception V3 has 78.0 and 93.9 respectively.
The aim of this PR is let the Tika community have access to near state-of-art features.

Also all of this documentation should be added to the wiki - it looks great.

Definitely. I am currently working on it.

@chrismattmann

This comment has been minimized.

Show comment
Hide comment
@chrismattmann
Contributor

chrismattmann commented Mar 29, 2017

thanks @KranthiGV

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Mar 30, 2017

Contributor

I'm waiting for the PR to be merged. Once it is done, I'd update the wiki.

Contributor

KranthiGV commented Mar 30, 2017

I'm waiting for the PR to be merged. Once it is done, I'd update the wiki.

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Apr 12, 2017

Contributor

@thammegowda @grossws
Can you please review this PR?
It would enable me to work on TIKA-2308 and also adding support for other image MIME types at TIKA-2324.
Since the current and this implementations are different, it would save us the trouble of re-doing a lot of work again when this PR gets merged.

Contributor

KranthiGV commented Apr 12, 2017

@thammegowda @grossws
Can you please review this PR?
It would enable me to work on TIKA-2308 and also adding support for other image MIME types at TIKA-2324.
Since the current and this implementations are different, it would save us the trouble of re-doing a lot of work again when this PR gets merged.

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 14, 2017

Contributor

Reviewing this today.

Contributor

thammegowda commented Apr 14, 2017

Reviewing this today.

@Field
private URI healthUri = URI.create("http://localhost:8764/inception/v3/ping");
private URI healthUri = URI.create("http://localhost:8764/inception/v4/ping");

This comment has been minimized.

@thammegowda

thammegowda Apr 14, 2017

Contributor

Perfect! the very little change expected to see in tika-parsers java source. Nicely done👍

@thammegowda

thammegowda Apr 14, 2017

Contributor

Perfect! the very little change expected to see in tika-parsers java source. Nicely done👍

Show outdated Hide outdated .../resources/org/apache/tika/parser/recognition/tf/InceptionRestDockerfile Outdated
# TODO: Change the URL to Apache/Tika Repo when this PR gets merged
RUN \
wget https://raw.githubusercontent.com/thammegowda/tika/TIKA-1993/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \
wget https://raw.githubusercontent.com/KranthiGV/tika/TIKA-2306/tika-parsers/src/main/resources/org/apache/tika/parser/recognition/tf/inceptionapi.py -O /usr/bin/inceptionapi.py && \

This comment has been minimized.

@thammegowda

thammegowda Apr 14, 2017

Contributor

TODO: when we merge this PR, we modify it and then merge it

@thammegowda

thammegowda Apr 14, 2017

Contributor

TODO: when we merge this PR, we modify it and then merge it

Show outdated Hide outdated ...c/main/resources/org/apache/tika/parser/recognition/tf/classify_image.py Outdated
@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 14, 2017

Contributor

@KranthiGV Apologies for the delay in reviewing this. Great work 👍
Please review my comments. Few minor changes are requested to make it better.
I look forward to reviewing it again after the changes and it will be close to merging to master after that.

Issue #168 might create conflicts with the code, so earlier the better :-)

Contributor

thammegowda commented Apr 14, 2017

@KranthiGV Apologies for the delay in reviewing this. Great work 👍
Please review my comments. Few minor changes are requested to make it better.
I look forward to reviewing it again after the changes and it will be close to merging to master after that.

Issue #168 might create conflicts with the code, so earlier the better :-)

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 14, 2017

Contributor

@KranthiGV Also please pull changes on master branch and merge it with this branch.

Contributor

thammegowda commented Apr 14, 2017

@KranthiGV Also please pull changes on master branch and merge it with this branch.

Merge remote-tracking branch 'origin/master' into TIKA-2306
To pull the latest changes from master
@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Apr 14, 2017

Contributor

@thammegowda
I have made the necessary changes at Reduced disk I/O commit (db8c814)
The performance is compared by running it 50 times on an image.

script.sh:
n=0; while [[ $n -lt 50 ]]; do java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366; n=$((n+1)); done

time ./script.sh

Before changes:
real 4m33.334s

After changes:
real 1m0.736s

TODO after merging:

  1. Update the inceptionapi.py link to Apache's repo link.
  2. Document the usage in Wiki.
Contributor

KranthiGV commented Apr 14, 2017

@thammegowda
I have made the necessary changes at Reduced disk I/O commit (db8c814)
The performance is compared by running it 50 times on an image.

script.sh:
n=0; while [[ $n -lt 50 ]]; do java -jar tika-app/target/tika-app-1.15-SNAPSHOT.jar  --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml http://www.trbimg.com/img-57226a08/turbine/ct-tesla-model-3-unveiling-20160404/650/650x366; n=$((n+1)); done

time ./script.sh

Before changes:
real 4m33.334s

After changes:
real 1m0.736s

TODO after merging:

  1. Update the inceptionapi.py link to Apache's repo link.
  2. Document the usage in Wiki.
@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 14, 2017

Contributor

Thanks, @KranthiGV.
LGTM!

Contributor

thammegowda commented Apr 14, 2017

Thanks, @KranthiGV.
LGTM!

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 19, 2017

Contributor

@tballison Can we merge this before the Tika 1.15 release.
It's ready to be merged, I had it tested.

Contributor

thammegowda commented Apr 19, 2017

@tballison Can we merge this before the Tika 1.15 release.
It's ready to be merged, I had it tested.

@tballison

This comment has been minimized.

Show comment
Hide comment
@tballison

tballison Apr 19, 2017

Contributor

What's the worst that could happen? Go for it...and thank you!

Wait, does this prevent jpeg metadata from being extracted?

Contributor

tballison commented Apr 19, 2017

What's the worst that could happen? Go for it...and thank you!

Wait, does this prevent jpeg metadata from being extracted?

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 19, 2017

Contributor

Wait, does this prevent jpeg metadata from being extracted?

Oh yes! That is the case with existing ObjectRecognition Parser, however ObjectRecognitionParser isnt enabled by default.

How do we run two different parsers for the same file and merge the metadata?

Contributor

thammegowda commented Apr 19, 2017

Wait, does this prevent jpeg metadata from being extracted?

Oh yes! That is the case with existing ObjectRecognition Parser, however ObjectRecognitionParser isnt enabled by default.

How do we run two different parsers for the same file and merge the metadata?

@tballison

This comment has been minimized.

Show comment
Hide comment
@tballison

tballison Apr 19, 2017

Contributor

It looks like you have to turn the ObjectRecognitionParser on via config, etc? So, the default behavior is that the JPEGParser is called and the ObjectRecognitionParser is not called, right?

When I switch the "parser" in the JpegParserTest to AutoDetectParser, tests still pass...Phew...

To merge metadata, see: _TMP_IMAGE_METADATA_PARSER in TesseractOCRParser

Contributor

tballison commented Apr 19, 2017

It looks like you have to turn the ObjectRecognitionParser on via config, etc? So, the default behavior is that the JPEGParser is called and the ObjectRecognitionParser is not called, right?

When I switch the "parser" in the JpegParserTest to AutoDetectParser, tests still pass...Phew...

To merge metadata, see: _TMP_IMAGE_METADATA_PARSER in TesseractOCRParser

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 19, 2017

Contributor

I looks like you have to turn the ObjectRecognitionParser on via config, etc? So, the default behavior is that the JPEGParser is called and the ObjectRecognitionParser is not called, right?

That's right.

To merge metadata, see: _TMP_IMAGE_METADATA_PARSER in TesseractOCRParser

Thanks! I will have a look

Contributor

thammegowda commented Apr 19, 2017

I looks like you have to turn the ObjectRecognitionParser on via config, etc? So, the default behavior is that the JPEGParser is called and the ObjectRecognitionParser is not called, right?

That's right.

To merge metadata, see: _TMP_IMAGE_METADATA_PARSER in TesseractOCRParser

Thanks! I will have a look

@tballison

This comment has been minimized.

Show comment
Hide comment
@tballison

tballison Apr 19, 2017

Contributor

That's right.
Ok, I'm ok with this behavior. Users who want inception lose JPEG metadata (for now). What I don't want to repeat is users who aren't using the ObjectRecognizer losing JPEG metadata.

Bonus points for copy/paste of TesseractOCRParser's metadata insertion, but not required, IMHO.

Contributor

tballison commented Apr 19, 2017

That's right.
Ok, I'm ok with this behavior. Users who want inception lose JPEG metadata (for now). What I don't want to repeat is users who aren't using the ObjectRecognizer losing JPEG metadata.

Bonus points for copy/paste of TesseractOCRParser's metadata insertion, but not required, IMHO.

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 19, 2017

Contributor

Got it. users who aren't using Inception will not lose JPEG metadata.
We can merge this PR and support the composition of JPEG and ObjectRecognition in a new PR.

Contributor

thammegowda commented Apr 19, 2017

Got it. users who aren't using Inception will not lose JPEG metadata.
We can merge this PR and support the composition of JPEG and ObjectRecognition in a new PR.

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Apr 19, 2017

Contributor

That sounds great! I'm looking at _TMP_IMAGE_METADATA_PARSER and the CompositeParser.

Contributor

KranthiGV commented Apr 19, 2017

That sounds great! I'm looking at _TMP_IMAGE_METADATA_PARSER and the CompositeParser.

@thammegowda thammegowda merged commit 09698c6 into apache:master Apr 20, 2017

@thammegowda

This comment has been minimized.

Show comment
Hide comment
@thammegowda

thammegowda Apr 20, 2017

Contributor

@KranthiGV It is now merged. I updated the URL in dockerfile to the ASF repo.

Could you please update the URLs in the wiki https://wiki.apache.org/tika/TikaAndVision
Inception/v3 URLS should now be replaced by inception/v4

Let us know if you need access to wiki.

Contributor

thammegowda commented Apr 20, 2017

@KranthiGV It is now merged. I updated the URL in dockerfile to the ASF repo.

Could you please update the URLs in the wiki https://wiki.apache.org/tika/TikaAndVision
Inception/v3 URLS should now be replaced by inception/v4

Let us know if you need access to wiki.

@KranthiGV

This comment has been minimized.

Show comment
Hide comment
@KranthiGV

KranthiGV Apr 20, 2017

Contributor

Thanks for updating the URL in dockerfile.
I have made the minor changes on Wiki page now (V3 to V4, etc).
I will be updating the results of running the parser by today.
Thank you.

UPDATE: The wiki is up to date now.

Contributor

KranthiGV commented Apr 20, 2017

Thanks for updating the URL in dockerfile.
I have made the minor changes on Wiki page now (V3 to V4, etc).
I will be updating the results of running the parser by today.
Thank you.

UPDATE: The wiki is up to date now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment