Tika support #998
@ppalaga The current state is that the Tika parser "works" in JVM and native mode, BUT
Do you see any direction on what to do with this issue until the integration branch is working?
Good work, thanks @JiriOndrusek !
I have added some suggestions inline.
Except for those, please add a doc page under docs/modules/ROOT/pages/extensions/tika.adoc
and document all known peculiarities and config options by which the extension differs from a stock Camel component.
.github/workflows/pr-build.yaml
Outdated
@@ -154,11 +154,12 @@ jobs:
braintree
compression
graphql
mustache
mustache
Please remove the trailing whitespace.
}

/*
 * The bean-validator component is programmatically configured by the extension thus
Suggested change:
-  * The bean-validator component is programmatically configured by the extension thus
+  * The tika component is programmatically configured by the extension thus
CamelRuntimeBeanBuildItem tikaComponent(BeanContainerBuildItem beanContainer, TikaRecorder recorder) {
    return new CamelRuntimeBeanBuildItem(
            "tika",
            TikaRecorder.class.getName(),
`TikaRecorder` looks strange. `TikaComponent` maybe?
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// import org.apache.camel.ProducerTemplate;
Please remove the commented-out code.
Assertions.assertTrue(detectedCharset.name().startsWith(Charset.defaultCharset().name()));
}

// @Test
Maybe better this?
Suggested change:
- // @Test
+ @Test
+ @Disabled("https://github.com/quarkusio/quarkus/issues/8375")
poms/bom-deployment/pom.xml
Outdated
@@ -34,6 +34,7 @@

<properties>
    <camel-quarkus.version>1.1.0-SNAPSHOT</camel-quarkus.version><!-- kept in sync with project.version by the release plugin -->
    <quarkus-version>1.3.0.Final</quarkus-version>
`quarkus.version` is defined in the top pom so I think this one is not needed?
poms/bom-deployment/pom.xml
Outdated
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-tika-deployment</artifactId>
    <version>${quarkus.version}</version>
</dependency>
Please move this one to the top where the MySQL driver is and add a link to the Quarkus PR where it is fixed in their BOM.
And please rebase and squash your commits.
You can add the missing license headers by running
@ppalaga All suggestions are applied, the doc page is created, and the commits are rebased onto the integration branch and squashed. However, I wasn't able to test or compile it, as this branch cannot be compiled.
@JiriOndrusek Could you please check the failing test?
@ppalaga I'll look into it.
Some comments and suggestions inline.
if (delegate instanceof SAXTransformerFactory) {
    return (SAXTransformerFactory) delegate;
}
throw new IllegalArgumentException("Unsupported TransformerFactory feature " + SAXTransformerFactory.FEATURE);
I think it would be more effective to move this check to the constructor and change the type of the delegate field to `SAXTransformerFactory`, thus eliminating the `delegateAsSAXTransformerFactory()` method. Given that we call `TransformerFactory.newInstance("org.apache.xalan.xsltc.trax.TransformerFactoryImpl", ...)`, we are quite safe to always get a `SAXTransformerFactory`.
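For illustration, a minimal sketch of what the constructor-time check could look like. The class name `SaxDelegatingFactory` and its accessor are hypothetical (the actual class in the PR is not shown in this thread); only the `javax.xml.transform` types are real.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;

// Hypothetical wrapper; the name and shape are assumptions, not the PR's code.
final class SaxDelegatingFactory {

    // Typed as SAXTransformerFactory, so no cast is needed at the call sites.
    private final SAXTransformerFactory delegate;

    SaxDelegatingFactory(TransformerFactory factory) {
        // Fail fast: the check runs once here instead of on every SAX call.
        if (!(factory instanceof SAXTransformerFactory)) {
            throw new IllegalArgumentException(
                    "Unsupported TransformerFactory feature " + SAXTransformerFactory.FEATURE);
        }
        this.delegate = (SAXTransformerFactory) factory;
    }

    SAXTransformerFactory delegate() {
        return delegate;
    }
}
```

With this shape, a misconfigured factory is reported when the wrapper is created rather than later, when the first SAX-specific method is called.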
While you can use any of the available Tika parsers in JVM mode (https://tika.apache.org/[Apache Tika]),
only several Tika parses are supported in native mode. See https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA].
Suggested change:
- While you can use any of the available Tika parsers in JVM mode (https://tika.apache.org/[Apache Tika]),
- only several Tika parses are supported in native mode. See https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA].
+ While you can use any of the available https://tika.apache.org/1.24.1/formats.html[Tika parsers] in JVM mode,
+ only some of those are supported in native mode - see the https://quarkus.io/guides/tika[Quarkus Tika guide].
In order to work in native mode, some properties should be set:

* `quarkus.tika.parsers`
* optionally `quarkus.tika.parser.*`

Example of `application.properties` follows :
[source,properties]
----
quarkus.tika.parsers = pdf,odf,office
quarkus.tika.parser.office = org.apache.tika.parser.microsoft.OfficeParser
----
My understanding is that using `quarkus.tika.parsers` is recommended (not required) to reduce the application memory and native executable sizes. So we should perhaps also formulate it like that.
It is not clear to me whether each abbreviation used in `quarkus.tika.parsers` needs to be defined in `quarkus.tika.parser.my-abbrev`? Or are there some well-known abbreviations?
quarkus.tika.parser.office = org.apache.tika.parser.microsoft.OfficeParser
----

For more information about selecting parsers see https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA]
Suggested change:
- For more information about selecting parsers see https://quarkus.io/guides/tika[QUARKUS - USING APACHE TIKA]
+ For more information about selecting parsers see https://quarkus.io/guides/tika[Quarkus Tika guide]
    return new RuntimeValue<>(new QuarkusTikaComponent(container.instance(TikaParserProducer.class)));
}

@org.apache.camel.spi.annotations.Component("tika")
Is this annotation required given that we produce a named bean above? https://github.com/apache/camel-quarkus/pull/998/files#diff-6af9bcae1d2af7449582ad99e6bdac3cR54
@ppalaga It would make sense to get rid of this annotation. But without it, execution fails for the `odf` parser with:
Caused by: java.lang.LinkageError: loader constraint violation: loader (instance of ) previously initiated loading for a different type with name "org/w3c/dom/Node"
I'm not sure about the reason for this behavior. There is one related comment: quarkusio/quarkus#8375 (comment)
Strange, that looks like a symptom of some class loading nastiness. Either `org/w3c/dom/Node` is coming from two jars or it is loaded by two different class loaders.
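One way to tell the two cases apart might be a small diagnostic like the sketch below. `ClassOriginProbe` is a hypothetical helper, not part of the PR; it uses only JDK calls and reports which loader defined a class and where copies of its `.class` file are visible from a given loader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; the name is an assumption.
final class ClassOriginProbe {

    // Reports the defining class loader and, if known, the jar or directory of a class.
    static String describe(Class<?> type) {
        ClassLoader loader = type.getClassLoader(); // null means the bootstrap loader
        CodeSource source = type.getProtectionDomain().getCodeSource(); // may be null for JDK classes
        return type.getName()
                + " loader=" + (loader == null ? "bootstrap" : loader.toString())
                + " source=" + (source == null ? "n/a" : String.valueOf(source.getLocation()));
    }

    // Lists every location from which the given loader can see the class file.
    // More than one entry hints at the same class being shipped in several jars.
    static List<URL> locations(Class<?> type, ClassLoader searchFrom) {
        String resource = type.getName().replace('.', '/') + ".class";
        try {
            return new ArrayList<>(Collections.list(searchFrom.getResources(resource)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `ClassOriginProbe.describe(org.w3c.dom.Node.class)` from the code path that fails, and again from the code path that works, would show whether two different loaders are involved.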
private final TikaParserProducer tikaParserProducer;

public QuarkusTikaComponent(TikaParserProducer tikaParserProducer) {
This is acceptable for now, but as a next step, could we perhaps define some sort of TikaParserProducer interface in Camel and have it there in the Camel Tika component and producer, so that we do not have to subclass here?
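A rough sketch of the idea, with all names hypothetical (these are not Camel or Tika APIs, and the stand-in parser just decodes UTF-8): the component depends on a small provider interface, so a runtime such as Camel Quarkus could pass its own provider instead of subclassing.

```java
import java.nio.charset.StandardCharsets;

// All names below are hypothetical; they only illustrate the shape of the proposal.
interface TextParser {
    String parse(byte[] content);
}

// The extension point: whoever builds the component decides where parsers come from.
interface TextParserProvider {
    TextParser getParser();
}

final class ParsingComponent {

    private final TextParserProvider provider;

    // Default behavior: the component supplies its own parser (stand-in: UTF-8 decoding).
    ParsingComponent() {
        this(() -> content -> new String(content, StandardCharsets.UTF_8));
    }

    // A runtime can inject its own provider here; no subclass is needed.
    ParsingComponent(TextParserProvider provider) {
        this.provider = provider;
    }

    String process(byte[] body) {
        return provider.getParser().parse(body);
    }
}
```

With something along these lines in Camel, the Quarkus side would only need to hand in a provider backed by its own producer bean.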
@ppalaga I've fixed Xalan and the documentation.
closes #799
One more question inline, otherwise looks good.
Charset detectedCharset = null;
try {
    InputStream bodyIs = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_16));
    UniversalEncodingDetector encodingDetector = new UniversalEncodingDetector();
    detectedCharset = encodingDetector.detect(bodyIs, new Metadata());
} catch (IOException e1) {
    Assertions.fail();
}

Assertions.assertTrue(detectedCharset.name().startsWith(StandardCharsets.UTF_16.name()));
I wonder why we need to test `UniversalEncodingDetector` here? It does not seem to be testing any Camel Quarkus code.
Charset detectedCharset = null;
try {
    InputStream bodyIs = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_16));
    UniversalEncodingDetector encodingDetector = new UniversalEncodingDetector();
    detectedCharset = encodingDetector.detect(bodyIs, new Metadata());
} catch (IOException e1) {
    Assertions.fail();
}

Assertions.assertTrue(detectedCharset.name().startsWith(StandardCharsets.UTF_16.name()));
Same as above: do we need this?
You are right. This is covered by Camel itself (https://github.com/apache/camel/blob/master/components/camel-tika/src/test/java/org/apache/camel/component/tika/TikaParseTest.java#L71). I'll remove both parts.
Both unnecessary asserts are removed.
@ppalaga I've discovered the reason for the mandatory annotation.
Fix: #799
The PR also contains a change in Xalan support: it adds SAX capabilities (by extending SAXTransformerFactory).