NUTCH-2584 Upgrade parse-tika to use Tika 1.18 #336

Merged

sebastian-nagel
Contributor

(includes patch contributed by Ralf for NUTCH-2583)

In addition to the upgrade, this PR makes the following changes:

  • use the Tika parser (instead of nekohtml) to get the DOM tree of test documents
  • fix HTMLMetaProcessor to extract the no-cache and base-href attributes from the DOM tree as modified by Tika

- apply patch contributed by Ralf
- fix failing unit tests
- use Tika parser to get DOM tree of test documents
- fix HTMLMetaProcessor to extract no-cache and base-href
  attributes on DOM tree modified by Tika
- ignore links from FORM and SOURCE elements which are
  not extracted by Tika parser
- extract meta-refresh redirects from DOM tree normalized by Tika
- add unit test to check whether meta-refresh redirects are
  extracted and parse status holds the redirect target
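
To illustrate the meta-refresh item in the list above, here is a minimal sketch of how a redirect target could be read from a DOM tree. It is not the code from this PR and not Nutch's HTMLMetaProcessor: the class name `MetaRefreshSketch` and its methods are hypothetical, the DOM is built with the JDK's XML parser rather than Tika, and it assumes lowercase or mixed-case `meta` elements carrying `http-equiv` and `content` attributes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
class MetaRefreshSketch {

  /** Depth-first search for the first meta-refresh element; returns its target URL or null. */
  static String findRefreshTarget(Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "meta".equalsIgnoreCase(node.getNodeName())) {
      Element meta = (Element) node;
      // getAttribute returns an empty string when the attribute is absent.
      if ("refresh".equalsIgnoreCase(meta.getAttribute("http-equiv"))) {
        String target = parseRefreshContent(meta.getAttribute("content"));
        if (target != null) {
          return target;
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String target = findRefreshTarget(children.item(i));
      if (target != null) {
        return target;
      }
    }
    return null;
  }

  /** Parses a value such as "5; url=http://example.com/" and returns the URL part, or null. */
  static String parseRefreshContent(String content) {
    int semi = content.indexOf(';');
    if (semi < 0) {
      return null; // only a delay was given, so there is no redirect target
    }
    String rest = content.substring(semi + 1).trim();
    if (rest.toLowerCase(Locale.ROOT).startsWith("url=")) {
      rest = rest.substring(4).trim();
    }
    return rest.isEmpty() ? null : rest;
  }

  /** Builds a DOM from well-formed XHTML (assumption: stands in for the Tika-normalized tree). */
  static Document parseXhtml(String xhtml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    return factory.newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    String xhtml = "<html><head>"
        + "<meta http-equiv=\"refresh\" content=\"5; url=http://example.com/next\"/>"
        + "</head><body/></html>";
    System.out.println(findRefreshTarget(parseXhtml(xhtml)));
  }
}
```

Running `main` prints the extracted target, `http://example.com/next`. Matching element and attribute values case-insensitively is a deliberate choice here, since a normalizing parser may change the case used in the original document.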
@sebastian-nagel sebastian-nagel merged commit 2544fad into apache:master Jun 2, 2018
@sebastian-nagel sebastian-nagel deleted the NUTCH-2583-upgrade-dependencies branch July 16, 2018 15:01