NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status #724

Conversation

@sebastian-nagel (Contributor):

  • add properties
    • http.robots.503.defer.visits :
      enable/disable the feature (default: enabled)
    • http.robots.503.defer.visits.delay :
      delay to wait before the next trial to fetch the deferred URL
      and the corresponding robots.txt
      (default: wait 5 minutes)
    • http.robots.503.defer.visits.retries :
      max. number of retries before giving up and dropping all URLs from the given host / queue
      (default: give up after the 3rd retry, ie. after 4 attempts)
  • handle HTTP 5xx in robots.txt parser
  • handle delay, retries and dropping queues in Fetcher (see the configuration sketch below)
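
As a rough illustration of these settings, the sketch below shows how a fetcher component might read them from the Hadoop Configuration. The property names and defaults are the ones listed above; the class and field names are hypothetical and not part of this patch.

```java
import org.apache.hadoop.conf.Configuration;

/** Hypothetical holder for the robots.txt defer-visits settings (sketch only). */
public class RobotsDeferVisitsSettings {

  final boolean deferVisits;   // feature switch
  final long deferDelayMs;     // delay before the next attempt
  final int maxDeferRetries;   // retries before the host's queue is dropped

  public RobotsDeferVisitsSettings(Configuration conf) {
    // enable/disable the feature (default: enabled)
    deferVisits = conf.getBoolean("http.robots.503.defer.visits", true);
    // delay before retrying the deferred URL and its robots.txt
    // (default: 5 minutes, assuming the value is given in milliseconds)
    deferDelayMs = conf.getLong("http.robots.503.defer.visits.delay", 5 * 60 * 1000L);
    // max. number of retries before giving up and dropping the queue (default: 3)
    maxDeferRetries = conf.getInt("http.robots.503.defer.visits.retries", 3);
  }
}
```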

Stop queuing fetch items once the timelimit is reached. This applies to

  • re-queued items whose robots.txt request returned a 5xx,
  • redirects (http.redirect.max > 0), and
  • outlinks (fetcher.follow.outlinks.depth > 0).

In a first version, I forgot to verify whether the Fetcher timelimit (fetcher.timelimit.mins) had already been reached before re-queuing the fetch item. This caused a small number of fetcher tasks to end up in an infinite loop. In detail, this happened:

  1. A fetcher thread starts fetching an item and requests the corresponding robots.txt; possibly, the server responds slowly.
  2. The fetcher timelimit is reached and all fetcher queues are flushed.
  3. The robots.txt response arrives. Because it is a 5xx, the fetch item is re-queued and the fetch is delayed for 30 minutes (custom configuration).

Steps 1 and 3 are then repeated until the maximum number of retries is reached. This is fixed now, and I have also made sure that redirects and outlinks are not queued once the timelimit is reached.
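
For illustration, a minimal, self-contained sketch of the kind of check the fix adds; the class and method names are hypothetical, not the actual Nutch code:

```java
/**
 * Sketch only (not the actual Nutch implementation): check the fetcher
 * timelimit before re-queuing a fetch item whose robots.txt request
 * returned a 5xx status.
 */
public class TimelimitCheckSketch {

  /** Absolute deadline in ms, derived once from fetcher.timelimit.mins; <= 0 means "no timelimit". */
  private final long timelimit;

  public TimelimitCheckSketch(long timelimitMins) {
    this.timelimit = timelimitMins > 0
        ? System.currentTimeMillis() + timelimitMins * 60L * 1000L
        : -1L;
  }

  public boolean timelimitExceeded() {
    return timelimit > 0 && System.currentTimeMillis() >= timelimit;
  }

  /**
   * Decide whether a deferred fetch item may be re-queued. Once the timelimit
   * is exceeded the queues are already flushed, so re-queuing would only spin
   * through steps 1 and 3 above; the item must be dropped instead.
   */
  public boolean mayRequeueDeferredItem() {
    return !timelimitExceeded();
  }
}
```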

@lewismc (Member) left a comment:

I like this patch. It's actually something that will help us loop back in with website administrators to notify them of service issues. Thanks for including the metrics; this is really useful. Some minor suggestions from me, @sebastian-nagel.

src/java/org/apache/nutch/fetcher/FetchItemQueues.java (outdated review thread)
@@ -263,6 +283,10 @@ public synchronized int checkExceptionThreshold(String queueid) {
return 0;
}

public int checkExceptionThreshold(String queueid) {
@lewismc (Member):

Same here. Basic Javadoc?
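
For illustration only, a basic Javadoc of the kind requested might look like the sketch below; the wording and the description of the return value are assumptions based on the surrounding diff, not the text that was eventually committed.

```java
public class FetchItemQueuesJavadocSketch {

  /**
   * Check whether the number of exceptions recorded for the given fetch queue
   * has reached the configured threshold and, if so, purge the queue.
   *
   * @param queueid
   *          ID of the fetch queue to check
   * @return number of fetch items purged from the queue, or 0 if the
   *         threshold has not been reached
   */
  public int checkExceptionThreshold(String queueid) {
    return 0; // body omitted; only the Javadoc shape matters in this sketch
  }
}
```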

src/java/org/apache/nutch/fetcher/FetcherThread.java (outdated review thread)
- rename counter to follow naming scheme of other robots.txt related
  counters: `robots_defer_visits_dropped` (see the counter sketch below)
- rename method timelimitReached -> timelimitExceeded
- add Javadoc
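
For context, Fetcher metrics like this one are Hadoop job counters. A hedged sketch of how such a counter could be incremented is shown below; the counter name comes from the commit message above, while the group name "FetcherStatus" and the helper method are assumptions.

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;

public class DeferVisitsCounterSketch {

  /**
   * Sketch: record how many URLs were dropped from a host's queue after the
   * last robots.txt retry. The counter group "FetcherStatus" is an assumption.
   */
  static void reportDroppedQueue(TaskInputOutputContext<?, ?, ?, ?> context,
      long droppedUrls) {
    Counter c = context.getCounter("FetcherStatus", "robots_defer_visits_dropped");
    c.increment(droppedUrls);
  }
}
```
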
@sebastian-nagel (Contributor, Author):

Hi @lewismc, done: updated the metrics wiki page (hitByTimeLimit is already documented), added Javadocs, and renamed the counter to follow the naming convention of the other robots_* counters. Also renamed the method ("timelimitReached" -> "timelimitExceeded").

@lewismc (Member) commented on Jan 17, 2022:

Looks like it failed on Javadoc generation, @sebastian-nagel.

@lewismc (Member) left a comment:

Once the Javadoc is fixed, +1 from me.

@sebastian-nagel force-pushed the NUTCH-2573-suspend-crawling-robotstxt-fails branch from 170e3fc to 7f944c1 on January 17, 2022, 21:03
@sebastian-nagel merged commit f691bae into apache:master on Jan 18, 2022
@sebastian-nagel deleted the NUTCH-2573-suspend-crawling-robotstxt-fails branch on January 19, 2022, 21:14