TIKA-4038: Remove shading of tika-parsers-standard-package #1130

Merged
merged 1 commit into from
May 11, 2023

Conversation

gastaldi
Member

@tballison tballison merged commit ef8c8ff into apache:main May 11, 2023
1 check passed
@tballison
Contributor

Works on an external dummy project that uses tika-parsers-standard-package as a dependency. Thank you!

@gastaldi gastaldi deleted the parsers_shading branch May 11, 2023 19:03
@sandeshkr419

sandeshkr419 commented Mar 19, 2024

Hi @tballison & @gastaldi,

I was trying to upgrade tika-parsers-standard-package from 2.6 to 2.8/2.9 because, after updating commons-compress to 1.26.0, I ran into problems parsing the iWork files handled by IWorkPackageParser (.pages, .key).

Here are more details: opensearch-project/OpenSearch#12627

When I bumped the Tika dependencies to 2.8.0, 2.9.0, or 2.9.1, I could no longer use the parsers that used to be part of the tika-parsers-standard-package jar, such as HtmlParser and the others listed here.

After the package-structure changes made to tika-parsers-standard-package since 2.7.0, has the way its dependencies are consumed changed as well? Is there any documentation I can refer to on how to consume the individual parser implementations (such as PDFParser and HtmlParser) that are no longer available in tika-parsers-standard-package?

@gastaldi
Member Author

What error are you getting? If it is "cannot access org.apache.tika.parser.AbstractEncodingDetectorParser", make sure you also add a dependency on tika-core.

@sandeshkr419

sandeshkr419 commented Mar 19, 2024

@gastaldi Thanks for the quick reply.

These are the present tika libraries that I'm consuming:

versions << [
  'tika'            : '2.6.0',
  'commonscompress' : '1.24.0'
.
.
.
  api "org.apache.tika:tika-core:${versions.tika}"
  api "org.apache.tika:tika-parsers:${versions.tika}"
  api "org.apache.tika:tika-parsers-standard-package:${versions.tika}"
  api "org.apache.tika:tika-langdetect-optimaize:${versions.tika}"

  api "org.apache.commons:commons-compress:${versions.commonscompress}"

Gradle configuration for reference: https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/build.gradle

Relevant Tika Implementation in usage: https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java

With Tika 2.6.0 and commons-compress 1.24.0:
Everything worked fine.

With Tika 2.6.0 and commons-compress 1.26.0:
IWorkPackageParser-related parsing started throwing exceptions:

org.opensearch.ingest.attachment.TikaDocTests > testFiles FAILED
    java.lang.RuntimeException: parsing of filename: testKeynote.key failed
        at __randomizedtesting.SeedInfo.seed([7E30995C8CE0CC1:6EFE6C139A13FF43]:0)
        at org.opensearch.ingest.attachment.TikaDocTests.assertParseable(TikaDocTests.java:85)
        at org.opensearch.ingest.attachment.TikaDocTests.testFiles(TikaDocTests.java:71)

        Caused by:
        org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.iwork.IWorkPackageParser@3ba82e1d
            at app//org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
            at app//org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:195)
            at app//org.apache.tika.Tika.parseToString(Tika.java:525)
            at app//org.opensearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:122)
            at java.base@21.0.2/java.security.AccessController.doPrivileged(AccessController.java:714)
            at app//org.opensearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:121)
            at app//org.opensearch.ingest.attachment.TikaDocTests.assertParseable(TikaDocTests.java:80)
            ... 1 more

            Caused by:
            java.io.IOException: Resetting to invalid mark
                at java.base/java.io.BufferedInputStream.implReset(BufferedInputStream.java:583)
                at java.base/java.io.BufferedInputStream.reset(BufferedInputStream.java:569)
                at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:97)
                at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
                ... 7 more
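
The innermost cause in the trace above, "Resetting to invalid mark", is the message java.io.BufferedInputStream throws when reset() is called after its mark has been invalidated. The following is a minimal JDK-only sketch of that mechanism, not the Tika or commons-compress code path: the class name MarkResetDemo, the helper markReadReset, and the buffer sizes are made up for illustration, and whether this is exactly what happens inside IWorkPackageParser is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustration only (hypothetical class): reproduces the JDK error message
// seen in the trace, using nothing but java.io.
class MarkResetDemo {

    // Marks the stream, reads bytesToRead bytes one at a time, then resets.
    // Returns "ok" if reset() succeeded, otherwise the IOException message.
    static String markReadReset(int bufferSize, int readLimit, int bytesToRead) {
        byte[] data = new byte[64];
        try (InputStream in =
                 new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            in.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Stays within the read limit: the mark is still valid.
        System.out.println(markReadReset(8, 4, 4)); // prints "ok"
        // Reads past both the read limit and the buffer: the mark is dropped.
        System.out.println(markReadReset(8, 4, 9)); // prints "Resetting to invalid mark"
    }
}
```

BufferedInputStream only guarantees the mark for readLimit bytes, so a caller (or a library underneath it) that reads further before calling reset() can hit this exception.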

With Tika 2.8.0 and commons-compress 1.26.0:

Compilation fails because the following parser packages cannot be resolved:

/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:94: error: package org.apache.tika.parser.html does not exist
        new org.apache.tika.parser.html.HtmlParser(),
                                       ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:95: error: package org.apache.tika.parser.pdf does not exist
        new org.apache.tika.parser.pdf.PDFParser(),
                                      ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:96: error: package org.apache.tika.parser.txt does not exist
        new org.apache.tika.parser.txt.TXTParser(),
                                      ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:97: error: package org.apache.tika.parser.microsoft.rtf does not exist
        new org.apache.tika.parser.microsoft.rtf.RTFParser(),
                                                ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:98: error: package org.apache.tika.parser.microsoft does not exist
        new org.apache.tika.parser.microsoft.OfficeParser(),
                                            ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:99: error: package org.apache.tika.parser.microsoft does not exist
        new org.apache.tika.parser.microsoft.OldExcelParser(),
                                            ^
/Users/kusandes/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:100: error: package org.apache.tika.parser.microsoft.ooxml does not exist
        ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES),
                                                                               ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:101: error: package org.apache.tika.parser.odf does not exist
        new org.apache.tika.parser.odf.OpenDocumentParser(),
                                      ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:102: error: package org.apache.tika.parser.iwork does not exist
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
                                        ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:103: error: package org.apache.tika.parser.xml does not exist
        new org.apache.tika.parser.xml.DcXMLParser(),
                                      ^
/Users/--/workplace/opensearch/OpenSearch/plugins/ingest-attachment/src/main/java/org/opensearch/ingest/attachment/TikaImpl.java:104: error: package org.apache.tika.parser.epub does not exist
        new org.apache.tika.parser.epub.EpubParser(), };
                                       ^
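
A quick way to see which of these parser classes are actually visible at runtime is to probe the classpath directly. The sketch below is a hypothetical diagnostic, not part of Tika or OpenSearch: the class ParserClasspathCheck and its methods are invented for illustration, it uses only Class.forName from the JDK, and the class names are copied from the compiler errors above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic: reports which class names can be loaded
// from the current classpath. Uses only JDK APIs.
class ParserClasspathCheck {

    // Returns true if the named class can be located without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Checks each name and keeps the input order in the result.
    static Map<String, Boolean> check(List<String> classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isOnClasspath(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Class names taken from the compiler errors in this thread.
        List<String> parsers = List.of(
            "org.apache.tika.parser.html.HtmlParser",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.microsoft.OfficeParser",
            "org.apache.tika.parser.iwork.IWorkPackageParser");
        check(parsers).forEach((name, present) ->
            System.out.println((present ? "found   " : "MISSING ") + name));
    }
}
```

Run with the same classpath the plugin uses; any class reported as MISSING points to a parser module that is not being pulled in, directly or transitively.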

@gastaldi
Member Author

No idea what could be causing that; perhaps @tballison knows.

@tballison
Contributor

Sorry, I haven't looked carefully at your Gradle file... Is it pulling in transitive dependencies, like tika-parser-miscoffice-module, for example?

@tballison
Contributor

I think the iWork/commons-compress issue is fixed in 1.26.1. @THausherr, does that sound right? The iWork issue rings a bell...

@THausherr
Contributor

THausherr commented Mar 20, 2024

Yes; although I see that your last improvement wasn't added to 2.9.2, I'll do it. (update: done)

@gastaldi you can test with a snapshot. The last file already has a working fix; the latest improvement will be there in maybe two hours, just look for a file that has today's date.
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/2.9.2-SNAPSHOT/

@tballison
Contributor

Doh! Thank you, @THausherr. I'm happy to cherry-pick that bit as well.

@tballison
Contributor

#1130 (comment)

@sandeshkr419

Thanks @tballison @THausherr, I'm able to upgrade Tika now.
One last question: when is 2.9.2 expected to be released?

@THausherr
Contributor

Some time next week, see the message in the mailing lists: "I'd like to fix TIKA-4211 before the next release. It has been a while since our last 2.x release. What do you think about aiming for starting the voting process early next week? Any other blockers?"

@THausherr
Contributor

Also observe the mass regression tests in https://issues.apache.org/jira/browse/TIKA-4171 . We hit several problems yesterday and these must be solved first.
