Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tika support #998

Merged
merged 1 commit into from Jun 18, 2020
Merged

Tika support #998

merged 1 commit into from Jun 18, 2020

Conversation

JiriOndrusek
Copy link
Contributor

@JiriOndrusek JiriOndrusek commented Mar 30, 2020

Fix: #799

PR contains also change in xalan support - adds SAX capabilities (by extension of SAXTransformerFactory)

@JiriOndrusek
Copy link
Contributor Author

@ppalaga Current states is, that tika parser "works" in jvm and native, BUT

Do you see any direction what to do with this issue until integration branch is working?

Copy link
Contributor

@ppalaga ppalaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, thanks @JiriOndrusek !

I have added some suggestions inline.

Except for those, please add a doc page under docs/modules/ROOT/pages/extensions/tika.adoc and document all known peculiarities and config options by which the extension differs from a stock Camel componet.

@@ -154,11 +154,12 @@ jobs:
braintree
compression
graphql
mustache
mustache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz. remove the tailing whitespace

}

/*
* The bean-validator component is programmatically configured by the extension thus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The bean-validator component is programmatically configured by the extension thus
* The tika component is programmatically configured by the extension thus

CamelRuntimeBeanBuildItem tikaComponent(BeanContainerBuildItem beanContainer, TikaRecorder recorder) {
return new CamelRuntimeBeanBuildItem(
"tika",
TikaRecorder.class.getName(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TikaRecorder looks strange. TikaComponent maybe?

import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// import org.apache.camel.ProducerTemplate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz remove the commented code.

Assertions.assertTrue(detectedCharset.name().startsWith(Charset.defaultCharset().name()));
}

// @Test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better this?

Suggested change
// @Test
@Test
@Disabled("https://github.com/quarkusio/quarkus/issues/8375")

@@ -34,6 +34,7 @@

<properties>
<camel-quarkus.version>1.1.0-SNAPSHOT</camel-quarkus.version><!-- kept in sync with project.version by the release plugin -->
<quarkus-version>1.3.0.Final</quarkus-version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quarkus.version is defined in the top pom so I think this one is not needed?

Comment on lines 831 to 835
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-tika-deployment</artifactId>
<version>${quarkus.version}</version>
</dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz move this one to the top where the mysql driver is and add a link to the quarkus PR where it is fixed in their BOM.

@ppalaga
Copy link
Contributor

ppalaga commented Apr 3, 2020

And please rebase and squash your commits.

@ppalaga
Copy link
Contributor

ppalaga commented Apr 3, 2020

You can add the missing license headers by running mvn process-resources -Pformat
The order of imports can be fixed by re-compiling the failing modules.

@JiriOndrusek JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch 2 times, most recently from 3030530 to a19fefc Compare April 3, 2020 15:04
@JiriOndrusek JiriOndrusek changed the base branch from master to camel-master April 3, 2020 15:04
@JiriOndrusek
Copy link
Contributor Author

JiriOndrusek commented Apr 3, 2020

@ppalaga all suggestions are applied, doc created, rebased to integration branch, squashed.

Althoug I wasn't able to test or co,pile it, as this branch can not be compiled.

@JiriOndrusek JiriOndrusek changed the base branch from camel-master to master June 17, 2020 12:15
@JiriOndrusek JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch 3 times, most recently from ae443f6 to 384fe34 Compare June 17, 2020 13:10
@JiriOndrusek JiriOndrusek marked this pull request as ready for review June 17, 2020 13:42
@ppalaga ppalaga changed the title 799 tika support wip2 Tika support Jun 17, 2020
@ppalaga
Copy link
Contributor

ppalaga commented Jun 17, 2020

@JiriOndrusek Could you please check the failing test?

@JiriOndrusek
Copy link
Contributor Author

@ppalaga I'll look into it

Copy link
Contributor

@ppalaga ppalaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and suggestions inline.

Comment on lines 120 to 123
if (delegate instanceof SAXTransformerFactory) {
return (SAXTransformerFactory) delegate;
}
throw new IllegalArgumentException("Unsupported TransformerFactory feature " + SAXTransformerFactory.FEATURE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more effective to move this check to the constructor and change the type of the delegate field to SAXTransformerFactory thus eliminating the delegateAsSAXTransformerFactory() method. Given that we call TransformerFactory.newInstance( "org.apache.xalan.xsltc.trax.TransformerFactoryImpl", ...) we are quite safe to always get a SAXTransformerFactory

Comment on lines 4 to 5
While you can use any of the available Tika parsers in JVM mode (https://tika.apache.org/[Apache Tika]),
only several Tika parses are supported in native mode. See https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
While you can use any of the available Tika parsers in JVM mode (https://tika.apache.org/[Apache Tika]),
only several Tika parses are supported in native mode. See https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA].
While you can use any of the available https://tika.apache.org/1.24.1/formats.html[Tika parsers] in JVM mode,
only some of those are supported in native mode - see the https://quarkus.io/guides/tika[Quarkus Tika guide].

Comment on lines 7 to 22
In order to work in native mode, some properties should be set:

* `quarkus.tika.parsers`
* optionally `quarkus.tika.parser.*`

Example of `application.properties` follows :
[source,properties]
----
quarkus.tika.parsers = pdf,odf,office
quarkus.tika.parser.office = org.apache.tika.parser.microsoft.OfficeParser
----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that using quarkus.tika.parsers is recommended (not required) to reduce the application memory and native executable sizes. So we should perhaps also formulate it like that.

It is not clear to me whether each abbreviation used in quarkus.tika.parsers needs to be defined in quarkus.tika.parser.my-abbrev ? Or are there some well known abbreviations?

quarkus.tika.parser.office = org.apache.tika.parser.microsoft.OfficeParser
----

For more information about selecting parsers see https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For more information about selecting parsers see https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA]
For more information about selecting parsers see https://quarkus.io/guides/tika[Quarkus Tika guide]

return new RuntimeValue<>(new QuarkusTikaComponent(container.instance(TikaParserProducer.class)));
}

@org.apache.camel.spi.annotations.Component("tika")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this annotation required given that we produce a named bean above? https://github.com/apache/camel-quarkus/pull/998/files#diff-6af9bcae1d2af7449582ad99e6bdac3cR54

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ppalaga It would make sense to get rid of this annotation.

But without it, execution fails for odf parser with:

Caused by: java.lang.LinkageError: loader constraint violation: loader (instance of ) previously initiated loading for a different type with name "org/w3c/dom/Node"

I'm not sure about the reason of this behavior. There is a one related comment: quarkusio/quarkus#8375 (comment)

Copy link
Contributor

@ppalaga ppalaga Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange, that looks like a symptom of some class loading nastyness. Either org/w3c/dom/Node is coming from two jars or it is loaded by two different class loaders.


private final TikaParserProducer tikaParserProducer;

public QuarkusTikaComponent(TikaParserProducer tikaParserProducer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is acceptable for now, but as a next step, could we perhaps define some sort of TikaParserProducer interface in Camel and have it there in the Camel Tika component and producer, so that we do not have to subclass here?

@JiriOndrusek JiriOndrusek force-pushed the 799_Tika-support-WIP2 branch 2 times, most recently from acb9ae5 to b9615da Compare June 18, 2020 07:39
@JiriOndrusek
Copy link
Contributor Author

JiriOndrusek commented Jun 18, 2020

@ppalaga I've fixed Xalan and documentation.

@JiriOndrusek
Copy link
Contributor Author

closes #799

Copy link
Contributor

@ppalaga ppalaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question inline, otherwise looks good.

Comment on lines 54 to 63
Charset detectedCharset = null;
try {
InputStream bodyIs = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_16));
UniversalEncodingDetector encodingDetector = new UniversalEncodingDetector();
detectedCharset = encodingDetector.detect(bodyIs, new Metadata());
} catch (IOException e1) {
Assertions.fail();
}

Assertions.assertTrue(detectedCharset.name().startsWith(StandardCharsets.UTF_16.name()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why do we need to test UniversalEncodingDetector here? It does not seem to be testing any Camel Quarkus code.

Comment on lines 70 to 79
Charset detectedCharset = null;
try {
InputStream bodyIs = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_16));
UniversalEncodingDetector encodingDetector = new UniversalEncodingDetector();
detectedCharset = encodingDetector.detect(bodyIs, new Metadata());
} catch (IOException e1) {
Assertions.fail();
}

Assertions.assertTrue(detectedCharset.name().startsWith(StandardCharsets.UTF_16.name()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JiriOndrusek
Copy link
Contributor Author

Both unnecessary asserts are removed.

@JiriOndrusek
Copy link
Contributor Author

@ppalaga I've discovered the reason of mandatory annotation.
As you can see from https://github.com/apache/camel/blob/master/core/camel-support/src/main/java/org/apache/camel/support/DefaultComponent.java#L410, if no annotation is present, then configurer is not used therefore camel throws an error, that not all of the properties are used. As I've cited linkage error in previous comment, it was my fault, it was not connected to this issue.

@ppalaga ppalaga merged commit e9bbeec into apache:master Jun 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tika support
2 participants