NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status #724

Conversation

@sebastian-nagel (Contributor):

  • add properties
    • http.robots.503.defer.visits :
      enable/disable the feature (default: enabled)
    • http.robots.503.defer.visits.delay :
      delay to wait before the next trial to fetch the deferred URL
      and the corresponding robots.txt
      (default: wait 5 minutes)
    • http.robots.503.defer.visits.retries :
      max. number of retries before giving up and dropping all URLs from the given host / queue
      (default: give up after the 3rd retry, ie. after 4 attempts)
  • handle HTTP 5xx in robots.txt parser
  • handle delay, retries and dropping queues in Fetcher (see the configuration sketch below)
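
As a rough illustration of these settings, the sketch below shows how a fetcher component might read them from the Hadoop Configuration. The property names and defaults are the ones listed above; the class and field names are hypothetical and not part of this patch.

```java
import org.apache.hadoop.conf.Configuration;

/** Hypothetical holder for the robots.txt defer-visits settings (sketch only). */
public class RobotsDeferVisitsSettings {

  final boolean deferVisits;   // feature switch
  final long deferDelayMs;     // delay before the next attempt
  final int maxDeferRetries;   // retries before the host's queue is dropped

  public RobotsDeferVisitsSettings(Configuration conf) {
    // enable/disable the feature (default: enabled)
    deferVisits = conf.getBoolean("http.robots.503.defer.visits", true);
    // delay before retrying the deferred URL and its robots.txt
    // (default: 5 minutes, assuming the value is given in milliseconds)
    deferDelayMs = conf.getLong("http.robots.503.defer.visits.delay", 5 * 60 * 1000L);
    // max. number of retries before giving up and dropping the queue (default: 3)
    maxDeferRetries = conf.getInt("http.robots.503.defer.visits.retries", 3);
  }
}
```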

Stop queuing fetch items once the timelimit is reached. This applies to

  • re-queued items whose robots.txt request returned a 5xx,
  • redirects (http.redirect.max > 0), and
  • outlinks (fetcher.follow.outlinks.depth > 0).

In a first version, I forgot to verify whether the Fetcher timelimit (fetcher.timelimit.mins) had already been reached before re-queuing the fetch item. This caused a small number of fetcher tasks to end up in an infinite loop. In detail, this happened:

  1. A fetcher thread starts fetching an item and requests the corresponding robots.txt; possibly, the server responds slowly.
  2. The fetcher timelimit is reached and all fetcher queues are flushed.
  3. The robots.txt response arrives. Because it is a 5xx, the fetch item is re-queued and the fetch is delayed for 30 minutes (custom configuration).

Steps 1 and 3 are then repeated until the maximum number of retries is reached. This is fixed now, and I have also made sure that redirects and outlinks are not queued once the timelimit is reached.
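
For illustration, a minimal, self-contained sketch of the kind of check the fix adds; the class and method names are hypothetical, not the actual Nutch code:

```java
/**
 * Sketch only (not the actual Nutch implementation): check the fetcher
 * timelimit before re-queuing a fetch item whose robots.txt request
 * returned a 5xx status.
 */
public class TimelimitCheckSketch {

  /** Absolute deadline in ms, derived once from fetcher.timelimit.mins; <= 0 means "no timelimit". */
  private final long timelimit;

  public TimelimitCheckSketch(long timelimitMins) {
    this.timelimit = timelimitMins > 0
        ? System.currentTimeMillis() + timelimitMins * 60L * 1000L
        : -1L;
  }

  public boolean timelimitExceeded() {
    return timelimit > 0 && System.currentTimeMillis() >= timelimit;
  }

  /**
   * Decide whether a deferred fetch item may be re-queued. Once the timelimit
   * is exceeded the queues are already flushed, so re-queuing would only spin
   * through steps 1 and 3 above; the item must be dropped instead.
   */
  public boolean mayRequeueDeferredItem() {
    return !timelimitExceeded();
  }
}
```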

@lewismc (Member) left a comment:

I like this patch. It's actually something that will help us loop back in with website administrators to notify them of service issues. Thanks for including the metrics; this is really useful. Some minor suggestions from me, @sebastian-nagel.

src/java/org/apache/nutch/fetcher/FetchItemQueues.java (outdated review thread)
@@ -263,6 +283,10 @@ public synchronized int checkExceptionThreshold(String queueid) {
return 0;
}

public int checkExceptionThreshold(String queueid) {
@lewismc (Member):

Same here. Basic Javadoc?
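
For illustration only, a basic Javadoc of the kind requested might look like the sketch below; the wording and the description of the return value are assumptions based on the surrounding diff, not the text that was eventually committed.

```java
public class FetchItemQueuesJavadocSketch {

  /**
   * Check whether the number of exceptions recorded for the given fetch queue
   * has reached the configured threshold and, if so, purge the queue.
   *
   * @param queueid
   *          ID of the fetch queue to check
   * @return number of fetch items purged from the queue, or 0 if the
   *         threshold has not been reached
   */
  public int checkExceptionThreshold(String queueid) {
    return 0; // body omitted; only the Javadoc shape matters in this sketch
  }
}
```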

src/java/org/apache/nutch/fetcher/FetcherThread.java (outdated review thread)
- rename counter to follow naming scheme of other robots.txt related
  counters: `robots_defer_visits_dropped` (see the counter sketch below)
- rename method timelimitReached -> timelimitExceeded
- add Javadoc
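
For context, Fetcher metrics like this one are Hadoop job counters. A hedged sketch of how such a counter could be incremented is shown below; the counter name comes from the commit message above, while the group name "FetcherStatus" and the helper method are assumptions.

```java
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;

public class DeferVisitsCounterSketch {

  /**
   * Sketch: record how many URLs were dropped from a host's queue after the
   * last robots.txt retry. The counter group "FetcherStatus" is an assumption.
   */
  static void reportDroppedQueue(TaskInputOutputContext<?, ?, ?, ?> context,
      long droppedUrls) {
    Counter c = context.getCounter("FetcherStatus", "robots_defer_visits_dropped");
    c.increment(droppedUrls);
  }
}
```
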
@sebastian-nagel (Contributor, Author):

Hi @lewismc, done: updated the metrics wiki page (hitByTimeLimit is already documented), added Javadocs, and renamed the counter to follow the naming convention of the other robots_* counters. Also renamed the method ("timelimitReached" -> "timelimitExceeded").

@lewismc (Member) commented on Jan 17, 2022:

Looks like it failed on Javadoc generation, @sebastian-nagel.

@lewismc (Member) left a comment:

Once the Javadoc is fixed, +1 from me.

@sebastian-nagel force-pushed the NUTCH-2573-suspend-crawling-robotstxt-fails branch from 170e3fc to 7f944c1 on January 17, 2022, 21:03
@sebastian-nagel merged commit f691bae into apache:master on Jan 18, 2022
@sebastian-nagel deleted the NUTCH-2573-suspend-crawling-robotstxt-fails branch on January 19, 2022, 21:14