TIKA-2520 optimize OptimaizeLangDetector default loadModel() #237

Merged
1 commit merged into apache:branch_1x on May 24, 2018

@mbaechler
Member

mbaechler commented May 24, 2018

When using tika-server, every single call triggers loadModel(), and
this method is very CPU-intensive.

Optimaize uses immutable objects, so we can easily reuse the
default model when no configuration is provided. It's a
20x improvement for my workload.
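
A minimal Java sketch of the idea described above, not the actual patch: the default detector is built once and reused, while a custom configuration still gets its own instance. `SketchDetector`, `SketchLangDetector`, `BUILD_COUNT`, and the priors-based `loadModel` overload are hypothetical stand-ins, not Tika or Optimaize APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an immutable detector built from language profiles. */
final class SketchDetector {
    // Unmodifiable copy; never changed after construction.
    private final Map<String, Double> priors;

    SketchDetector(Map<String, Double> priors) {
        this.priors = Collections.unmodifiableMap(new HashMap<>(priors));
    }

    Map<String, Double> priors() {
        return priors;
    }
}

/** Hypothetical wrapper that reuses one default detector across instances. */
final class SketchLangDetector {
    // Counts how often the expensive build step runs; only here to make the effect visible.
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    // Built once at class initialization and shared by every default loadModel() call.
    private static final SketchDetector DEFAULT_DETECTOR =
            build(Collections.<String, Double>emptyMap());

    private SketchDetector detector;

    private static SketchDetector build(Map<String, Double> priors) {
        BUILD_COUNT.incrementAndGet(); // stands in for the CPU-heavy profile loading
        return new SketchDetector(priors);
    }

    /** Default configuration: reuse the shared immutable instance. */
    SketchLangDetector loadModel() {
        detector = DEFAULT_DETECTOR;
        return this;
    }

    /** Custom priors (assumed overload): build a dedicated instance. */
    SketchLangDetector loadModel(Map<String, Double> priors) {
        detector = build(priors);
        return this;
    }

    SketchDetector detector() {
        return detector;
    }
}
```

With this shape, any number of `new SketchLangDetector().loadModel()` calls pay the build cost only once; sharing is only sound because the shared object is never mutated.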
@chrismattmann

Contributor

chrismattmann commented May 24, 2018

So I tried pulling this patch and testing it; here's what I got:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0 sec - in org.apache.tika.TestXMLEntityExpansion
Running org.apache.tika.mime.MimeTypeTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec - in org.apache.tika.mime.MimeTypeTest
Running org.apache.tika.mime.MimeTypesTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec - in org.apache.tika.mime.MimeTypesTest
Running org.apache.tika.mime.TestMimeTypes
Tests run: 75, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.204 sec - in org.apache.tika.mime.TestMimeTypes
Running org.apache.tika.TestCorruptedFiles
Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0 sec - in org.apache.tika.TestCorruptedFiles

Results :

Failed tests: 
  PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103 pdf_haystack not found in:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2013-05-23T18:30:00Z" />
<meta name="cp:revision" content="1" />
<meta name="extended-properties:AppVersion" content="14.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="16" />
<meta name="extended-properties:Company" content="" />
<meta name="Word-Count" content="16" />
<meta name="dcterms:created" content="2013-05-23T18:30:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="Last-Modified" content="2013-05-23T18:30:00Z" />
<meta name="dcterms:modified" content="2013-05-23T18:30:00Z" />
<meta name="Last-Save-Date" content="2013-05-23T18:30:00Z" />
<meta name="meta:character-count" content="96" />
<meta name="Template" content="Normal.dotm" />
<meta name="Line-Count" content="1" />
<meta name="Paragraph-Count" content="1" />
<meta name="meta:save-date" content="2013-05-23T18:30:00Z" />
<meta name="meta:character-count-with-spaces" content="111" />
<meta name="Application-Name" content="Microsoft Office Word" />
<meta name="modified" content="2013-05-23T18:30:00Z" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="meta:creation-date" content="2013-05-23T18:30:00Z" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="Creation-Date" content="2013-05-23T18:30:00Z" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="Character-Count-With-Spaces" content="111" />
<meta name="Character Count" content="96" />
<meta name="Page-Count" content="1" />
<meta name="Revision-Number" content="1" />
<meta name="Application-Version" content="14.0000" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="publisher" content="" />
<meta name="meta:page-count" content="1" />
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p class="header" />
<p class="header" />
<p class="header" />
<p>Outer_haystack</p>
<p>Outer_haystack</p>
<p><div class="embedded" id="rId8" />
</p>
<p>Outer_haystack</p>
<p />
<p>Outer_haystack</p>
<p />
<p>Outer_haystack</p>
<p><a name="_GoBack" /></p>
<p class="footer" />
<p class="footer" />
<p class="footer" />
<p>attached.pdf</p>
<div class="page"><div class="ocr">dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'

</div>
</div>
<p class="header" />

<p class="header" />

<p class="header" />

<p>Haystack</p>

<p>Needle</p>

<p>Haystack</p>

<p><a name="_GoBack" /></p>

<p class="footer" />

<p class="footer" />

<p class="footer" />

<div source="attachment" class="embedded" id="Test.docx" />
</body></html>

Tests run: 1009, Failures: 1, Errors: 0, Skipped: 30

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  2.378 s]
[INFO] Apache Tika core ................................... SUCCESS [ 28.594 s]
[INFO] Apache Tika parsers ................................ FAILURE [06:06 min]
[INFO] Apache Tika XMP .................................... SKIPPED
[INFO] Apache Tika serialization .......................... SKIPPED
[INFO] Apache Tika batch .................................. SKIPPED
[INFO] Apache Tika language detection ..................... SKIPPED
[INFO] Apache Tika application ............................ SKIPPED
[INFO] Apache Tika OSGi bundle ............................ SKIPPED
[INFO] Apache Tika translate .............................. SKIPPED
[INFO] Apache Tika server ................................. SKIPPED
[INFO] Apache Tika examples ............................... SKIPPED
[INFO] Apache Tika Java-7 Components ...................... SKIPPED
[INFO] Apache Tika eval ................................... SKIPPED
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SKIPPED
[INFO] Apache Tika Natural Language Processing ............ SKIPPED
[INFO] Apache Tika ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:39 min
[INFO] Finished at: 2018-05-24T08:29:28-07:00
[INFO] Final Memory: 67M/961M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project tika-parsers: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :tika-parsers

any idea what's up there?

@mbaechler

Member

mbaechler commented May 24, 2018

I'm a bit surprised, because I built it myself on branch_1x:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.237 s]
[INFO] Apache Tika core ................................... SUCCESS [ 19.513 s]
[INFO] Apache Tika parsers ................................ SUCCESS [03:27 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  1.712 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.491 s]
[INFO] Apache Tika batch .................................. SUCCESS [02:00 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  2.341 s]
[INFO] Apache Tika application ............................ SUCCESS [ 55.028 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 11.241 s]
[INFO] Apache Tika translate .............................. SUCCESS [  2.242 s]
[INFO] Apache Tika server ................................. SUCCESS [ 21.370 s]
[INFO] Apache Tika examples ............................... SUCCESS [ 13.538 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.075 s]
[INFO] Apache Tika eval ................................... SUCCESS [ 20.426 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [02:33 min]
[INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 23.690 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.011 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:57 min
[INFO] Finished at: 2018-05-24T17:36:34+02:00
[INFO] Final Memory: 169M/1667M
[INFO] ------------------------------------------------------------------------

Are you using the same branch?

@kkrugler

Contributor

kkrugler commented May 24, 2018

If two threads create separate LanguageDetector objects, and they don't use priors, they're sharing the same DEFAULT_DETECTOR, yes? If so, this class has state during detection, so that's going to create problems. I think you just want to load the models once, but create a new detector each time.
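
The alternative suggested here (load the models once, but create a new detector for each caller) could look roughly like the following Java sketch. `ProfileCache`, `PerCallDetector`, and `LOAD_COUNT` are hypothetical names used only for illustration; the strings stand in for real language profiles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder that loads language profiles once and hands out fresh detectors. */
final class ProfileCache {
    // Counts expensive loads; only here to make the effect visible.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // Initialization-on-demand holder: the JVM runs this initializer once, thread-safely.
    private static final class Holder {
        static final List<String> PROFILES = loadProfiles();
    }

    private static List<String> loadProfiles() {
        LOAD_COUNT.incrementAndGet(); // stands in for the expensive profile parsing
        return Collections.unmodifiableList(Arrays.asList("en", "fr", "de"));
    }

    /** Cheap per-caller object wrapping the shared, read-only profiles. */
    static PerCallDetector newDetector() {
        return new PerCallDetector(Holder.PROFILES);
    }
}

/** Hypothetical detector; any per-detection state would live in an instance of this class. */
final class PerCallDetector {
    private final List<String> profiles;

    PerCallDetector(List<String> profiles) {
        this.profiles = profiles;
    }

    List<String> profiles() {
        return profiles;
    }
}
```

This keeps the expensive part shared while giving each caller its own detector object, which only matters if the detector holds mutable state during detection.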

@mbaechler

Member

mbaechler commented May 24, 2018

Optimaize's LanguageDetectorImpl says:
"This class is immutable and thus thread-safe"
I know it's a bit brittle to rely on that, because we use LanguageDetectorBuilder, which doesn't enforce this property, but it works.
Maybe we could make a PR to Optimaize to update the LanguageDetectorBuilder javadoc? WDYT?

@kkrugler

Contributor

kkrugler commented May 24, 2018

@mbaechler - the class might be thread-safe, but I think you're trying to use the same instance (the DEFAULT_DETECTOR) for multiple callers.

@chrismattmann

Contributor

chrismattmann commented May 24, 2018

So my problem was Tesseract installed on MacOS X (thanks to @dameikle for pointing this out on list). I turned off Tesseract and then built again and this patch / PR integrated fine:

[INFO] Scanned 2 class file(s) for forbidden API invocations (in 0.04s), 0 error(s).
[INFO] 
[INFO] --- forbiddenapis:2.5:testCheck (default) @ tika-nlp ---
[INFO] Scanning for classes to check...
[INFO] Reading bundled API signatures: jdk-unsafe-1.7
[INFO] Reading bundled API signatures: jdk-deprecated-1.7
[INFO] Reading bundled API signatures: jdk-non-portable
[INFO] Reading bundled API signatures: jdk-internal-1.7
[INFO] Reading bundled API signatures: commons-io-unsafe-2.6
[INFO] Loading classes to check...
[INFO] Scanning classes for violations...
[INFO] Scanned 1 class file(s) for forbidden API invocations (in 0.09s), 0 error(s).
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika-nlp ---
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/tika-nlp/target/tika-nlp-1.19-SNAPSHOT.jar to /Users/mattmann/.m2/repository/org/apache/tika/tika-nlp/1.19-SNAPSHOT/tika-nlp-1.19-SNAPSHOT.jar
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/tika-nlp/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika-nlp/1.19-SNAPSHOT/tika-nlp-1.19-SNAPSHOT.pom
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/tika-nlp/target/tika-nlp-1.19-SNAPSHOT-jar-with-dependencies.jar to /Users/mattmann/.m2/repository/org/apache/tika/tika-nlp/1.19-SNAPSHOT/tika-nlp-1.19-SNAPSHOT-jar-with-dependencies.jar
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika 1.19-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ tika ---
[INFO] 
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce) @ tika ---
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ tika ---
[INFO] 
[INFO] --- maven-site-plugin:3.4:attach-descriptor (attach-descriptor) @ tika ---
[INFO] 
[INFO] --- forbiddenapis:2.5:check (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- forbiddenapis:2.5:testCheck (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.19-SNAPSHOT/tika-1.19-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.515 s]
[INFO] Apache Tika core ................................... SUCCESS [ 28.678 s]
[INFO] Apache Tika parsers ................................ SUCCESS [03:50 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.401 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.925 s]
[INFO] Apache Tika batch .................................. SUCCESS [01:55 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  2.867 s]
[INFO] Apache Tika application ............................ SUCCESS [01:08 min]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 39.266 s]
[INFO] Apache Tika translate .............................. SUCCESS [  7.387 s]
[INFO] Apache Tika server ................................. SUCCESS [ 28.187 s]
[INFO] Apache Tika examples ............................... SUCCESS [ 11.841 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.566 s]
[INFO] Apache Tika eval ................................... SUCCESS [ 30.226 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [01:02 min]
[INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 23.399 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.025 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:59 min
[INFO] Finished at: 2018-05-24T13:19:36-07:00
[INFO] Final Memory: 186M/1733M
[INFO] ------------------------------------------------------------------------

However, @kkrugler's point is well taken. We should address that first before merging.

@chrismattmann chrismattmann self-assigned this May 24, 2018

@chrismattmann chrismattmann requested review from dameikle and tballison May 24, 2018

@mbaechler

Member

mbaechler commented May 24, 2018

> @mbaechler - the class might be thread-safe, but I think you're trying to use the same instance (the DEFAULT_DETECTOR) for multiple callers.

It's not only thread-safe, it's advertised as immutable. Sharing an immutable structure is safe.

@kkrugler

Contributor

kkrugler commented May 24, 2018

@mbaechler - yes, my bad... the version of LanguageDetector we're using doesn't have state, so it is in fact immutable. Because it doesn't support incremental text processing, I had to buffer up text in the call to addText() in OptimaizeLangDetector, versus actually calling the detector, which is also why the hasEnoughText() call just relies on the length of the text, versus any in-flight results.

I'd modified LanguageDetector to get around that (and eventually created Yalder), but I was conflating that code with what we're currently using.

So yes, we can safely share the DEFAULT_DETECTOR - sorry for the noise.
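
A rough Java sketch of the arrangement described in this comment, under stated assumptions: the shared detector holds no mutable state, and the per-caller wrapper keeps the only mutable state (a text buffer). `StatelessDetector`, `BufferingLangDetector`, the toy detection rule, and the `ENOUGH_CHARS` threshold are all hypothetical and are not taken from Tika or Optimaize.

```java
import java.util.Locale;

/** Hypothetical stateless detector: it has no fields, so one instance can be shared. */
final class StatelessDetector {
    /** Toy rule standing in for real n-gram scoring. */
    String detect(CharSequence text) {
        String lower = text.toString().toLowerCase(Locale.ROOT);
        return lower.contains(" the ") ? "en" : "unknown";
    }
}

/** Per-caller wrapper: all mutable state (the buffer) lives here, not in the shared detector. */
final class BufferingLangDetector {
    private static final StatelessDetector SHARED = new StatelessDetector();
    private static final int ENOUGH_CHARS = 20; // hypothetical threshold

    private final StringBuilder buffer = new StringBuilder();

    /** Only buffers the text; detection is deferred until detect() is called. */
    void addText(CharSequence text) {
        buffer.append(text);
    }

    /** Decided from the buffered length alone, not from any in-flight result. */
    boolean hasEnoughText() {
        return buffer.length() >= ENOUGH_CHARS;
    }

    /** Runs the shared detector over everything buffered so far. */
    String detect() {
        return SHARED.detect(buffer);
    }
}
```

Each caller owns its own `BufferingLangDetector`, so two callers never see each other's buffered text even though they share one `StatelessDetector`.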

@chrismattmann chrismattmann merged commit 124a06d into apache:branch_1x May 24, 2018

@chrismattmann

Contributor

chrismattmann commented May 24, 2018

thanks @kkrugler this looks good then, so committed!
I will push to master/2x shortly.

nonas:tika2.0.0 mattmann$ git push -u origin branch_1x
Counting objects: 14, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (14/14), 1.72 KiB | 252.00 KiB/s, done.
Total 14 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:/apache/tika.git
   cdca0f726..7e3e34caf  branch_1x -> branch_1x
Branch 'branch_1x' set up to track remote branch 'branch_1x' from 'origin'.
nonas:tika2.0.0 mattmann$ 
@chrismattmann

Contributor

chrismattmann commented May 24, 2018

I also merged this into 2.x-master:

[INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/2.0.0-SNAPSHOT/tika-2.0.0-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.977 s]
[INFO] Apache Tika core ................................... SUCCESS [ 30.959 s]
[INFO] Apache Tika parsers ................................ SUCCESS [03:25 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.420 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.955 s]
[INFO] Apache Tika batch .................................. SUCCESS [01:58 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  2.731 s]
[INFO] Apache Tika application ............................ SUCCESS [01:07 min]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 31.078 s]
[INFO] Apache Tika translate .............................. SUCCESS [  3.269 s]
[INFO] Apache Tika server ................................. SUCCESS [ 21.436 s]
[INFO] Apache Tika examples ............................... SUCCESS [ 15.475 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  3.467 s]
[INFO] Apache Tika eval ................................... SUCCESS [ 40.324 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [01:02 min]
[INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 25.107 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.030 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:34 min
[INFO] Finished at: 2018-05-24T14:30:18-07:00
[INFO] Final Memory: 203M/1743M
[INFO] ------------------------------------------------------------------------
nonas:tika2.0.0 mattmann$ git push -u origin master
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (11/11), 1.38 KiB | 1.38 MiB/s, done.
Total 11 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To github.com:/apache/tika.git
   e24e6afb1..5c1143b30  master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
nonas:tika2.0.0 mattmann$ 
@mbaechler

Member

mbaechler commented May 25, 2018

Thank you guys, you are really welcoming, and having my PR merged so fast is really a pleasure.

@tballison

Contributor

tballison commented May 25, 2018

@chrismattmann I made the ocr test in master and branch_1x a bit more flexible. Let me know if you're still having problems, or if there is a more robust way to test this. Thank you!

@chrismattmann

Contributor

chrismattmann commented May 25, 2018

@tballison I tried master with your updates and it's still failing :(

Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0 sec - in org.apache.tika.TestCorruptedFiles

Results :

Failed tests: 
  PDFParserTest.testEmbeddedDocsWithOCROnly:1236->TikaTest.assertContains:104 pdf_haystack not found in:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="cp:revision" content="1" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="extended-properties:AppVersion" content="14.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="16" />
<meta name="meta:creation-date" content="2013-05-23T18:30:00Z" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="extended-properties:Company" content="" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="dcterms:created" content="2013-05-23T18:30:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="dcterms:modified" content="2013-05-23T18:30:00Z" />
<meta name="Last-Modified" content="2013-05-23T18:30:00Z" />
<meta name="meta:character-count" content="96" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="meta:save-date" content="2013-05-23T18:30:00Z" />
<meta name="meta:character-count-with-spaces" content="111" />
<meta name="meta:page-count" content="1" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p class="header" />
<p class="header" />
<p class="header" />
<p>Outer_haystack</p>
<p>Outer_haystack</p>
<p><div class="embedded" id="rId8" />
</p>
<p>Outer_haystack</p>
<p />
<p>Outer_haystack</p>
<p />
<p>Outer_haystack</p>
<p><a name="_GoBack" /></p>
<p class="footer" />
<p class="footer" />
<p class="footer" />
<p>attached.pdf</p>
<div class="page"><div class="ocr">dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'

</div>
</div>
<p class="header" />

<p class="header" />

<p class="header" />

<p>Haystack</p>

<p>Needle</p>

<p>Haystack</p>

<p><a name="_GoBack" /></p>

<p class="footer" />

<p class="footer" />

<p class="footer" />

<div source="attachment" class="embedded" id="Test.docx" />
</body></html>

Tests run: 1021, Failures: 1, Errors: 0, Skipped: 30

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.914 s]
[INFO] Apache Tika core ................................... SUCCESS [ 38.078 s]
[INFO] Apache Tika parsers ................................ FAILURE [06:34 min]
[INFO] Apache Tika XMP .................................... SKIPPED
[INFO] Apache Tika serialization .......................... SKIPPED
[INFO] Apache Tika batch .................................. SKIPPED
[INFO] Apache Tika language detection ..................... SKIPPED
[INFO] Apache Tika application ............................ SKIPPED
[INFO] Apache Tika OSGi bundle ............................ SKIPPED
[INFO] Apache Tika translate .............................. SKIPPED
[INFO] Apache Tika server ................................. SKIPPED
[INFO] Apache Tika examples ............................... SKIPPED
[INFO] Apache Tika Java-7 Components ...................... SKIPPED
[INFO] Apache Tika eval ................................... SKIPPED
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SKIPPED
[INFO] Apache Tika Natural Language Processing ............ SKIPPED
[INFO] Apache Tika ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:15 min
[INFO] Finished at: 2018-05-25T08:31:27-07:00
[INFO] Final Memory: 62M/876M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project tika-parsers: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :tika-parsers
@chrismattmann

Contributor

chrismattmann commented May 25, 2018

didn't fix it for branch_1x either :( @tballison


Results :

Failed tests: 
  PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103 pdf_haystack not found in:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2013-05-23T18:30:00Z" />
<meta name="cp:revision" content="1" />
<meta name="extended-properties:AppVersion" content="14.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="16" />
<meta name="extended-properties:Company" content="" />
<meta name="Word-Count" content="16" />
<meta name="dcterms:created" content="2013-05-23T18:30:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="Last-Modified" content="2013-05-23T18:30:00Z" />
<meta name="dcterms:modified" content="2013-05-23T18:30:00Z" />
<meta name="Last-Save-Date" content="2013-05-23T18:30:00Z" />
<meta name="meta:character-count" content="96" />
<meta name="Template" content="Normal.dotm" />
<meta name="Line-Count" content="1" />
<meta name="Paragraph-Count" content="1" />
<meta name="meta:save-date" content="2013-05-23T18:30:00Z" />
<meta name="meta:character-count-with-spaces" content="111" />
<meta name="Application-Name" content="Microsoft Office Word" />
<meta name="modified" content="2013-05-23T18:30:00Z" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="meta:creation-date" content="2013-05-23T18:30:00Z" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="Creation-Date" content="2013-05-23T18:30:00Z" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="Character-Count-With-Spaces" content="111" />
<meta name="Character Count" content="96" />
<meta name="Page-Count" content="1" />
<meta name="Revision-Number" content="1" />
<meta name="Application-Version" content="14.0000" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="publisher" content="" />
<meta name="meta:page-count" content="1" />
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p class="header" />
<p class="header" />
<p class="header" />
<p>Outer_haystack</p>
<p>Outer_haystack</p>
<p><div class="embedded" id="rId8" />
</p>
<p>Outer_haystack</p>
<p />
<p>Outer_haystack</p>
<p />
<p>Outer_haystack</p>
<p><a name="_GoBack" /></p>
<p class="footer" />
<p class="footer" />
<p class="footer" />
<p>attached.pdf</p>
<div class="page"><div class="ocr">dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'

</div>
</div>
<p class="header" />

<p class="header" />

<p class="header" />

<p>Haystack</p>

<p>Needle</p>

<p>Haystack</p>

<p><a name="_GoBack" /></p>

<p class="footer" />

<p class="footer" />

<p class="footer" />

<div source="attachment" class="embedded" id="Test.docx" />
</body></html>

Tests run: 1009, Failures: 1, Errors: 0, Skipped: 30

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  2.496 s]
[INFO] Apache Tika core ................................... SUCCESS [ 35.187 s]
[INFO] Apache Tika parsers ................................ FAILURE [07:03 min]
[INFO] Apache Tika XMP .................................... SKIPPED
[INFO] Apache Tika serialization .......................... SKIPPED
[INFO] Apache Tika batch .................................. SKIPPED
[INFO] Apache Tika language detection ..................... SKIPPED
[INFO] Apache Tika application ............................ SKIPPED
[INFO] Apache Tika OSGi bundle ............................ SKIPPED
[INFO] Apache Tika translate .............................. SKIPPED
[INFO] Apache Tika server ................................. SKIPPED
[INFO] Apache Tika examples ............................... SKIPPED
[INFO] Apache Tika Java-7 Components ...................... SKIPPED
[INFO] Apache Tika eval ................................... SKIPPED
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SKIPPED
[INFO] Apache Tika Natural Language Processing ............ SKIPPED
[INFO] Apache Tika ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:42 min
[INFO] Finished at: 2018-05-25T08:45:25-07:00
[INFO] Final Memory: 66M/751M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project tika-parsers: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :tika-parsers
nonas:tika2.0.0 mattmann$ git branch
  TIKA-1988
  TIKA-1988-new
  TIKA-2016
  TIKA-2298
* branch_1x
  gsoc17
  master
  merge-mattmann-TIKA-1988
nonas:tika2.0.0 mattmann$ 

@tballison

Contributor

tballison commented May 25, 2018

@chrismattmann

Contributor

chrismattmann commented May 26, 2018

yep @tballison that fixed it:

for 2.x/master:

[INFO] --- forbiddenapis:2.5:testCheck (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/2.0.0-SNAPSHOT/tika-2.0.0-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  2.659 s]
[INFO] Apache Tika core ................................... SUCCESS [ 32.386 s]
[INFO] Apache Tika parsers ................................ SUCCESS [05:29 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.089 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.455 s]
[INFO] Apache Tika batch .................................. SUCCESS [01:54 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  2.796 s]
[INFO] Apache Tika application ............................ SUCCESS [ 56.919 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 30.672 s]
[INFO] Apache Tika translate .............................. SUCCESS [  2.907 s]
[INFO] Apache Tika server ................................. SUCCESS [ 23.061 s]
[INFO] Apache Tika examples ............................... SUCCESS [ 10.833 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.372 s]
[INFO] Apache Tika eval ................................... SUCCESS [ 29.789 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [01:01 min]
[INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 23.639 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.018 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:10 min
[INFO] Finished at: 2018-05-25T16:48:59-07:00
[INFO] Final Memory: 174M/1661M
[INFO] ------------------------------------------------------------------------
nonas:tika2.0.0 mattmann$ tesseract
Usage:
  tesseract --help | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
			bypassing hacks that are Tesseract-specific.
OCR Engine modes:
  0    Original Tesseract only.
  1    Cube only.
  2    Tesseract + cube.
  3    Default, based on what is available.

Single options:
  -h, --help            Show this help message.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters to stdout.
nonas:tika2.0.0 mattmann$ 

And also for branch_1x:

[INFO] --- forbiddenapis:2.5:testCheck (default) @ tika ---
[INFO] Skipping execution for packaging "pom"
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
[INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.19-SNAPSHOT/tika-1.19-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.662 s]
[INFO] Apache Tika core ................................... SUCCESS [ 29.080 s]
[INFO] Apache Tika parsers ................................ SUCCESS [05:30 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.053 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  1.455 s]
[INFO] Apache Tika batch .................................. SUCCESS [02:02 min]
[INFO] Apache Tika language detection ..................... SUCCESS [  3.046 s]
[INFO] Apache Tika application ............................ SUCCESS [01:30 min]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 52.497 s]
[INFO] Apache Tika translate .............................. SUCCESS [  4.999 s]
[INFO] Apache Tika server ................................. SUCCESS [ 36.182 s]
[INFO] Apache Tika examples ............................... SUCCESS [ 16.681 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  4.317 s]
[INFO] Apache Tika eval ................................... SUCCESS [ 42.570 s]
[INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [02:05 min]
[INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 35.008 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.036 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15:00 min
[INFO] Finished at: 2018-05-25T17:38:01-07:00
[INFO] Final Memory: 173M/1589M
[INFO] ------------------------------------------------------------------------
nonas:tika2.0.0 mattmann$ git branch
  TIKA-1988
  TIKA-1988-new
  TIKA-2016
  TIKA-2298
* branch_1x
  gsoc17
  master
  merge-mattmann-TIKA-1988
nonas:tika2.0.0 mattmann$ 
@chrismattmann

Contributor

chrismattmann commented May 26, 2018

Thanks, @tballison!
