NUTCH-2457 Embedded documents likely not correctly parsed by Tika #474

Merged
sebastian-nagel merged 3 commits into apache:master from sebastian-nagel:NUTCH-2457-parse-tika-embedded-docs on Sep 30, 2019

Conversation

@sebastian-nagel
Contributor

  • add unit test for embedded documents

@tballison
Contributor

Embedded documents likely not correctly parsed by Tika

Can we help?

@sebastian-nagel
Contributor Author

Yes, please see my comment in Jira.

@sebastian-nagel force-pushed the NUTCH-2457-parse-tika-embedded-docs branch from f0b23f7 to 653e310 on September 27, 2019 14:49
- remove the needless unit test that checked whether the document under test is opened by parse-tika
- add AutoDetectParser to the ParseContext, so that it is called
  for embedded documents
- do so only if `tika.parse.embedded` is true
  (false disables recursive parsing of embedded documents)
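
For illustration, the property named in the commit message could be set as in the fragment below. This is only a sketch: it assumes the property is read from Nutch's usual Hadoop-style configuration and overridden in `conf/nutch-site.xml`. The file location and the chosen value are assumptions, not taken from this pull request.

```xml
<!-- Assumed location: conf/nutch-site.xml (not stated in this PR) -->
<property>
  <name>tika.parse.embedded</name>
  <!-- true: embedded documents are parsed recursively;
       false: recursive parsing of embedded documents is disabled -->
  <value>true</value>
</property>
```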
@sebastian-nagel force-pushed the NUTCH-2457-parse-tika-embedded-docs branch from 653e310 to c9238a1 on September 30, 2019 11:29
@sebastian-nagel
Contributor Author

(rebased to master, resolved conflicts)

@sebastian-nagel merged commit 9e49c3f into apache:master on Sep 30, 2019
@sebastian-nagel deleted the NUTCH-2457-parse-tika-embedded-docs branch on October 15, 2019 12:18

2 participants