NUTCH-2748 Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb #485

Merged

sebastian-nagel
Contributor

(final solution, commits need to be squashed before merging)

  • new configuration property http.redirect.max.exceeded.skip:
    • if true: skip redirect targets when http.redirect.max is exceeded
    • if false (default): store the redirect targets with status "linked"
  • log whether exceeded redirects are "skipped" or "linked":
    FetcherThread 44 - redirect count exceeded https://en.wikipedia.org/wiki/URL_redirection (skipped)
    or, respectively,
    FetcherThread 44 - redirect count exceeded https://en.wikipedia.org/wiki/URL_redirection (linked)
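
The behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: the class `RedirectExceededSketch`, the `Outcome` enum, and the methods `decide` and `logMessage` are hypothetical names chosen for illustration. Only the property name `http.redirect.max.exceeded.skip`, its default of false, and the log wording are taken from the description. How Nutch determines that `http.redirect.max` is exceeded is not shown in this conversation, so the sketch takes that as a boolean input.

```java
import java.util.Locale;

/** Illustrative sketch only; all names here are hypothetical, not Nutch classes. */
public class RedirectExceededSketch {

    /** What happens to a redirect target, mirroring the "(skipped)" / "(linked)" log suffixes. */
    public enum Outcome { FOLLOW, SKIPPED, LINKED }

    /**
     * @param maxExceeded      whether the redirect limit (http.redirect.max) has been exceeded
     * @param skipWhenExceeded value of http.redirect.max.exceeded.skip (default: false)
     */
    public static Outcome decide(boolean maxExceeded, boolean skipWhenExceeded) {
        if (!maxExceeded) {
            return Outcome.FOLLOW;   // limit not reached: keep following the redirect
        }
        // Limit exceeded: either drop the target or keep it with status "linked".
        return skipWhenExceeded ? Outcome.SKIPPED : Outcome.LINKED;
    }

    /** Builds a message shaped like the log lines quoted in the description. */
    public static String logMessage(String url, Outcome outcome) {
        return "redirect count exceeded " + url
            + " (" + outcome.name().toLowerCase(Locale.ROOT) + ")";
    }

    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/wiki/URL_redirection";
        System.out.println(logMessage(url, decide(true, true)));
        System.out.println(logMessage(url, decide(true, false)));
    }
}
```

With the default setting (false), the exceeded redirect target is kept as "linked"; setting the property to true drops it instead.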
    

sebastian-nagel changed the title from "UTCH-2748 Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb" to "NUTCH-2748 Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb" on Nov 8, 2019
…sting items in CrawlDb

- new configuration property `http.redirect.max.exceeded.skip`:
  * if true: skip redirect targets when `http.redirect.max` is exceeded
  * if false (default): store the redirect targets with status "linked"
- log whether exceeded redirects are "skipped" or "linked"
sebastian-nagel merged commit ac9c435 into apache:master on Dec 2, 2019
sebastian-nagel deleted the NUTCH-2748-redir-exceeded branch on December 2, 2019, 11:52