Tika support #998

JiriOndrusek · 2020-03-30T09:38:58Z

Fix: #799

This PR also changes the Xalan support: it adds SAX capabilities (by extending SAXTransformerFactory).

JiriOndrusek · 2020-04-03T08:47:12Z

@ppalaga The current state is that the Tika parser "works" in JVM and native mode, BUT:

The Tika parser obviously has some issues in Quarkus. Not every parser works. There is a hardcoded list of parsers that are not native-ready in the Quarkus source - https://github.com/quarkusio/quarkus/blob/master/extensions/tika/deployment/src/main/java/io/quarkus/tika/deployment/TikaProcessor.java#L39
Even some of the parsers that should work do not. For example the image parser - I've reported an issue about it: Tika: image parser fails in JVM mode with java.lang.LinkageError quarkusio/quarkus#8375
The following change has to be present in Camel: CAMEL-14833 camel-tika: enhance TikaProducer to be easoly extensible camel#3705 (already merged). I'll prepare a PR for the integration branch once the branch works with the current Camel.

Do you see any direction on what to do with this issue until the integration branch is working?

ppalaga

Good work, thanks @JiriOndrusek !

I have added some suggestions inline.

Except for those, please add a doc page under docs/modules/ROOT/pages/extensions/tika.adoc and document all known peculiarities and config options by which the extension differs from a stock Camel component.

ppalaga · 2020-04-03T10:28:13Z

.github/workflows/pr-build.yaml

@@ -154,11 +154,12 @@ jobs:
              braintree
              compression
              graphql
-              mustache
+              mustache              


Plz remove the trailing whitespace.

ppalaga · 2020-04-03T10:29:18Z

...ployment/src/main/java/org/apache/camel/quarkus/component/tika/deployment/TikaProcessor.java

+    }
+
+    /*
+     * The bean-validator component is programmatically configured by the extension thus


Suggested change

-     * The bean-validator component is programmatically configured by the extension thus
+     * The tika component is programmatically configured by the extension thus

ppalaga · 2020-04-03T10:33:38Z

...ployment/src/main/java/org/apache/camel/quarkus/component/tika/deployment/TikaProcessor.java

+    CamelRuntimeBeanBuildItem tikaComponent(BeanContainerBuildItem beanContainer, TikaRecorder recorder) {
+        return new CamelRuntimeBeanBuildItem(
+                "tika",
+                TikaRecorder.class.getName(),


TikaRecorder looks strange. TikaComponent maybe?

ppalaga · 2020-04-03T10:35:41Z

...ration-tests/tika/src/main/java/org/apache/camel/quarkus/component/tika/it/TikaResource.java

+import javax.ws.rs.core.MediaType;
+import javax.ws.rs.core.Response;
+
+// import org.apache.camel.ProducerTemplate;


Plz remove the commented code.

ppalaga · 2020-04-03T10:39:17Z

integration-tests/tika/src/test/java/org/apache/camel/quarkus/component/tika/it/TikaTest.java

+        Assertions.assertTrue(detectedCharset.name().startsWith(Charset.defaultCharset().name()));
+    }
+
+    //    @Test


Maybe better this?

Suggested change

-    //    @Test
+    @Test
+    @Disabled("https://github.com/quarkusio/quarkus/issues/8375")

ppalaga · 2020-04-03T10:40:49Z

poms/bom-deployment/pom.xml

@@ -34,6 +34,7 @@

    <properties>
        <camel-quarkus.version>1.1.0-SNAPSHOT</camel-quarkus.version><!-- kept in sync with project.version by the release plugin -->
+        <quarkus-version>1.3.0.Final</quarkus-version>


quarkus.version is defined in the top pom so I think this one is not needed?

ppalaga · 2020-04-03T10:42:32Z

poms/bom-deployment/pom.xml

+            <dependency>
+                <groupId>io.quarkus</groupId>
+                <artifactId>quarkus-tika-deployment</artifactId>
+                <version>${quarkus.version}</version>
+            </dependency>


Plz move this one to the top where the mysql driver is and add a link to the quarkus PR where it is fixed in their BOM.

ppalaga · 2020-04-03T10:46:32Z

And please rebase and squash your commits.

ppalaga · 2020-04-03T10:50:49Z

You can add the missing license headers by running mvn process-resources -Pformat
The order of imports can be fixed by re-compiling the failing modules.

JiriOndrusek · 2020-04-03T15:05:54Z

@ppalaga all suggestions are applied, doc created, rebased to integration branch, squashed.

Although I wasn't able to test or compile it, as this branch cannot be compiled.

ppalaga · 2020-06-17T13:47:18Z

@JiriOndrusek Could you please check the failing test?

JiriOndrusek · 2020-06-17T13:49:03Z

@ppalaga I'll look into it

ppalaga

Some comments and suggestions inline.

ppalaga · 2020-06-17T14:07:17Z

...an/runtime/src/main/java/org/apache/camel/quarkus/support/xalan/XalanTransformerFactory.java

+        if (delegate instanceof SAXTransformerFactory) {
+            return (SAXTransformerFactory) delegate;
+        }
+        throw new IllegalArgumentException("Unsupported TransformerFactory feature " + SAXTransformerFactory.FEATURE);


I think it would be more effective to move this check to the constructor and change the type of the delegate field to SAXTransformerFactory thus eliminating the delegateAsSAXTransformerFactory() method. Given that we call TransformerFactory.newInstance( "org.apache.xalan.xsltc.trax.TransformerFactoryImpl", ...) we are quite safe to always get a SAXTransformerFactory
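The suggestion above could look roughly like the following. This is only a minimal sketch: `SaxDelegateSketch` is a hypothetical stand-in for `XalanTransformerFactory`, and the demo uses the JDK default `TransformerFactory` instead of the Xalan implementation named in the comment, so that it runs without extra dependencies.

```java
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;

// Hypothetical stand-in for XalanTransformerFactory; only the delegate handling is shown.
class SaxDelegateSketch {

    // Typed as SAXTransformerFactory, so no cast or helper method is needed later.
    private final SAXTransformerFactory delegate;

    SaxDelegateSketch(TransformerFactory factory) {
        // Fail fast: the type check runs once, at construction time.
        if (!(factory instanceof SAXTransformerFactory)) {
            throw new IllegalArgumentException(
                    "Unsupported TransformerFactory feature " + SAXTransformerFactory.FEATURE);
        }
        this.delegate = (SAXTransformerFactory) factory;
    }

    TransformerHandler newTransformerHandler() throws TransformerConfigurationException {
        return delegate.newTransformerHandler();
    }

    // Assumption: the JDK default factory is used here purely for self-containment.
    static boolean demo() {
        try {
            SaxDelegateSketch sketch = new SaxDelegateSketch(TransformerFactory.newInstance());
            return sketch.newTransformerHandler() != null;
        } catch (TransformerConfigurationException e) {
            return false;
        }
    }
}
```

With the field typed this way, SAX-specific methods can call the delegate directly and the per-call `instanceof` check disappears.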

ppalaga · 2020-06-17T14:23:48Z

extensions/tika/runtime/src/main/doc/limitations.adoc

+While you can use any of the available Tika parsers in JVM mode (https://tika.apache.org/[Apache Tika]),
+only several Tika parses are supported in native mode. See https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA].


Suggested change

-While you can use any of the available Tika parsers in JVM mode (https://tika.apache.org/[Apache Tika]),
-only several Tika parses are supported in native mode. See https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA].
+While you can use any of the available https://tika.apache.org/1.24.1/formats.html[Tika parsers] in JVM mode,
+only some of those are supported in native mode - see the https://quarkus.io/guides/tika[Quarkus Tika guide].

ppalaga · 2020-06-17T14:39:27Z

extensions/tika/runtime/src/main/doc/limitations.adoc

+In order to work in native mode, some properties should be set:
+
+* `quarkus.tika.parsers`
+* optionally `quarkus.tika.parser.*`
+
+Example of `application.properties` follows :
+[source,properties]
+----
+quarkus.tika.parsers = pdf,odf,office
+quarkus.tika.parser.office = org.apache.tika.parser.microsoft.OfficeParser
+----


My understanding is that using quarkus.tika.parsers is recommended (not required) to reduce the application memory and native executable sizes. So we should perhaps also formulate it like that.

It is not clear to me whether each abbreviation used in quarkus.tika.parsers needs to be defined in quarkus.tika.parser.my-abbrev ? Or are there some well known abbreviations?

ppalaga · 2020-06-17T14:39:53Z

extensions/tika/runtime/src/main/doc/limitations.adoc

+quarkus.tika.parser.office = org.apache.tika.parser.microsoft.OfficeParser
+----
+
+For more information about selecting parsers see https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA]


Suggested change

-For more information about selecting parsers see https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA]
+For more information about selecting parsers see https://quarkus.io/guides/tika[Quarkus Tika guide]

ppalaga · 2020-06-17T14:41:35Z

extensions/tika/runtime/src/main/java/org/apache/camel/quarkus/component/tika/TikaRecorder.java

+        return new RuntimeValue<>(new QuarkusTikaComponent(container.instance(TikaParserProducer.class)));
+    }
+
+    @org.apache.camel.spi.annotations.Component("tika")


Is this annotation required given that we produce a named bean above? https://github.com/apache/camel-quarkus/pull/998/files#diff-6af9bcae1d2af7449582ad99e6bdac3cR54

@ppalaga It would make sense to get rid of this annotation.

But without it, execution fails for odf parser with:

Caused by: java.lang.LinkageError: loader constraint violation: loader (instance of ) previously initiated loading for a different type with name "org/w3c/dom/Node"

I'm not sure about the reason for this behavior. There is one related comment: quarkusio/quarkus#8375 (comment)

Strange, that looks like a symptom of some class loading nastiness. Either org/w3c/dom/Node is coming from two jars or it is loaded by two different class loaders.
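One way to narrow down the two possibilities mentioned above is to print which class loader defined the class and where it was loaded from. The following is a minimal diagnostic sketch using only standard JDK calls; `ClassOriginSketch` and `describe` are hypothetical names, not part of this PR.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of the extension.
class ClassOriginSketch {

    // Describes which class loader defined the class and which location it came from.
    static String describe(Class<?> type) {
        ClassLoader loader = type.getClassLoader(); // null means the bootstrap loader
        CodeSource source = type.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        return type.getName()
                + " loader=" + (loader == null ? "bootstrap" : loader.toString())
                + " location=" + (location == null ? "n/a" : location.toString());
    }

    public static void main(String[] args) {
        System.out.println(describe(org.w3c.dom.Node.class));
    }
}
```

Calling `describe` from the different places that touch `org.w3c.dom.Node` and comparing the output would show whether the loaders or the jar locations differ.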

ppalaga · 2020-06-17T14:49:11Z

extensions/tika/runtime/src/main/java/org/apache/camel/quarkus/component/tika/TikaRecorder.java

+
+        private final TikaParserProducer tikaParserProducer;
+
+        public QuarkusTikaComponent(TikaParserProducer tikaParserProducer) {


This is acceptable for now, but as a next step, could we perhaps define some sort of TikaParserProducer interface in Camel and have it there in the Camel Tika component and producer, so that we do not have to subclass here?

JiriOndrusek · 2020-06-18T07:40:45Z

@ppalaga I've fixed Xalan and documentation.

Removing the annotation doesn't work - I'll try to find an explanation.
The last suggestion would need a change in camel-tika. I've created an issue for this - https://issues.apache.org/jira/browse/CAMEL-15207

JiriOndrusek · 2020-06-18T07:53:32Z

closes #799

ppalaga

One more question inline, otherwise looks good.

ppalaga · 2020-06-18T08:13:53Z

integration-tests/tika/src/test/java/org/apache/camel/quarkus/component/tika/it/TikaTest.java

+        Charset detectedCharset = null;
+        try {
+            InputStream bodyIs = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_16));
+            UniversalEncodingDetector encodingDetector = new UniversalEncodingDetector();
+            detectedCharset = encodingDetector.detect(bodyIs, new Metadata());
+        } catch (IOException e1) {
+            Assertions.fail();
+        }
+
+        Assertions.assertTrue(detectedCharset.name().startsWith(StandardCharsets.UTF_16.name()));


I wonder why we need to test UniversalEncodingDetector here? It does not seem to be testing any Camel Quarkus code.

ppalaga · 2020-06-18T08:16:41Z

integration-tests/tika/src/test/java/org/apache/camel/quarkus/component/tika/it/TikaTest.java

+        Charset detectedCharset = null;
+        try {
+            InputStream bodyIs = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_16));
+            UniversalEncodingDetector encodingDetector = new UniversalEncodingDetector();
+            detectedCharset = encodingDetector.detect(bodyIs, new Metadata());
+        } catch (IOException e1) {
+            Assertions.fail();
+        }
+
+        Assertions.assertTrue(detectedCharset.name().startsWith(StandardCharsets.UTF_16.name()));


Same as above: do we need this?

You are right. This is covered by camel itself (https://github.com/apache/camel/blob/master/components/camel-tika/src/test/java/org/apache/camel/component/tika/TikaParseTest.java#L71). I'll remove both parts.

JiriOndrusek · 2020-06-18T08:36:30Z

Both unnecessary asserts are removed.

JiriOndrusek · 2020-06-18T10:33:33Z

@ppalaga I've discovered why the annotation is mandatory.
As you can see from https://github.com/apache/camel/blob/master/core/camel-support/src/main/java/org/apache/camel/support/DefaultComponent.java#L410, if no annotation is present, the configurer is not used, and Camel therefore throws an error saying that not all of the properties were used. The linkage error I cited in my previous comment was my mistake; it was not connected to this issue.
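The mechanism described above, where a lookup only happens when a class-level annotation is found, can be illustrated with plain reflection. This is a simplified sketch and not Camel's actual code: `ComponentName`, `AnnotatedComponent`, `PlainComponent` and `configurerName` are all hypothetical names introduced for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Optional;

class AnnotationLookupSketch {

    // Hypothetical stand-in for org.apache.camel.spi.annotations.Component.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ComponentName {
        String value();
    }

    @ComponentName("tika")
    static class AnnotatedComponent {
    }

    static class PlainComponent {
    }

    // Returns the name a configurer lookup could use, or empty when the annotation is absent.
    static Optional<String> configurerName(Object component) {
        ComponentName annotation = component.getClass().getAnnotation(ComponentName.class);
        return annotation == null ? Optional.empty() : Optional.of(annotation.value());
    }
}
```

In this sketch, the unannotated class yields an empty result, which corresponds to the case where no configurer is resolved and the properties stay unapplied.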

JiriOndrusek mentioned this pull request Mar 30, 2020

Tika support ppalaga/camel-quarkus#3

Open

ppalaga reviewed Apr 3, 2020

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch 2 times, most recently from 3030530 to a19fefc Compare April 3, 2020 15:04

JiriOndrusek changed the base branch from master to camel-master April 3, 2020 15:04

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch from a19fefc to 323d720 Compare April 3, 2020 15:17

gnodet force-pushed the camel-master branch from 97f9b08 to f22d8a1 Compare April 7, 2020 08:55

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch from 323d720 to cf061b7 Compare April 7, 2020 10:19

jamesnetherton force-pushed the camel-master branch 4 times, most recently from 4aed376 to 849e366 Compare April 20, 2020 08:46

jamesnetherton force-pushed the camel-master branch 3 times, most recently from e8543f5 to b862f45 Compare May 4, 2020 07:35

ppalaga force-pushed the camel-master branch 5 times, most recently from 34c1f42 to 4082313 Compare May 12, 2020 16:13

ppalaga force-pushed the camel-master branch 2 times, most recently from 13259a7 to 1f5432f Compare May 15, 2020 07:18

github-actions bot force-pushed the camel-master branch from 1f5432f to 8888ad5 Compare May 18, 2020 02:14

jamesnetherton force-pushed the camel-master branch 3 times, most recently from 7addb1f to dcf750a Compare June 1, 2020 16:12

jamesnetherton force-pushed the camel-master branch from dcf750a to 564b1bf Compare June 8, 2020 08:41

jamesnetherton force-pushed the camel-master branch from 0d5c790 to da2bf8a Compare June 15, 2020 13:50

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch from cf061b7 to 25291ae Compare June 17, 2020 12:14

JiriOndrusek changed the base branch from camel-master to master June 17, 2020 12:15

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch 3 times, most recently from ae443f6 to 384fe34 Compare June 17, 2020 13:10

JiriOndrusek marked this pull request as ready for review June 17, 2020 13:42

ppalaga changed the title ~~799 tika support wip2~~ Tika support Jun 17, 2020

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch from 384fe34 to 318ec10 Compare June 17, 2020 13:53

ppalaga reviewed Jun 17, 2020

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch 2 times, most recently from acb9ae5 to b9615da Compare June 18, 2020 07:39

ppalaga reviewed Jun 18, 2020

Tika support apache#799

02b0870

JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch from b9615da to 02b0870 Compare June 18, 2020 08:36

ppalaga approved these changes Jun 18, 2020

ppalaga merged commit e9bbeec into apache:master Jun 18, 2020
