
Adapting rules for parsing robots.txt file #1042
Closed
michaeldinzinger opened this issue Feb 20, 2023 · 6 comments

Comments

@michaeldinzinger
Contributor

michaeldinzinger commented Feb 20, 2023

Hello all,
while crawling, we ran into a politeness issue. We suppose its cause was a connection timeout when trying to fetch the robots.txt; as a consequence, the other webpages of this host were apparently crawled without any restriction, just as if the robots.txt had returned a 404.

As far as I can see, the logic for parsing the robots.txt file is implemented as follows:

            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if ((code == 403) && (!allowForbidden)) {
                robotRules = FORBID_ALL_RULES; // use forbid all
            } else if (code >= 500) {
                cacheRule = false;
                robotRules = EMPTY_RULES;
            } else robotRules = EMPTY_RULES; // use default rules

This is in HttpRobotRulesParser.java, lines 168-177.

A more suitable logic is described here: https://support.google.com/webmasters/answer/9679690#robots_details
It differentiates between the following cases (a rough sketch follows the list):

  • 403/404/410: no robots.txt is needed for crawling this website, i.e. it can be crawled without restrictions
  • 429/5xx: the fetch error results in no crawling of this website
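
For illustration, here is a rough sketch of how these cases could map onto the constants used in the snippet above, following the Google documentation literally (it is not the actual StormCrawler code, and it leaves the existing http.robots.403.allow behaviour aside):

            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if (code == 403 || code == 404 || code == 410) {
                // no usable robots.txt: the website may be crawled without restrictions
                robotRules = EMPTY_RULES;
            } else if (code == 429 || code >= 500) {
                // fetch error: do not crawl this website, and do not cache the decision
                cacheRule = false;
                robotRules = FORBID_ALL_RULES;
            } else {
                robotRules = EMPTY_RULES; // default
            }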

Please tell me your thoughts on this

@rzo1
Contributor

rzo1 commented Feb 20, 2023

It sounds valid to apply FORBID_ALL_RULES if we encounter "429 Too Many Requests" or an HTTP 5xx.

However, I can also think of use cases in which you would still want to apply EMPTY_RULES in such a case (or just stop being polite at all) ;-)

Maybe we can adjust the default here and add a configuration option to just ignore it (similar to http.robots.403.allow)?

@michaeldinzinger
Contributor Author

Maybe we can adjust the default here and add a configuration option to just ignore it (similar to http.robots.403.allow)?

Thank you, that sounds good :)
Maybe something like http.robots.connectionerror.skip or http.robots.5xx.allow, which defaults to false.
The code could then look like this:

            robotRules = FORBID_ALL_RULES; // forbid all by default
            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if (code == 403 && allowForbidden) {
                robotRules = EMPTY_RULES; // allow all
            } else if (code >= 500) {
                cacheRule = false;
                if (allow5xx) {
                    robotRules = EMPTY_RULES; // allow all
                }
            }
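
To complete the sketch, the new flag would presumably be read in the parser's configuration next to the existing one; something along these lines, assuming a ConfUtils-style boolean lookup (the key name http.robots.5xx.allow is just the proposal above, not an existing option):

            // existing option controlling the 403 behaviour
            allowForbidden = ConfUtils.getBoolean(conf, "http.robots.403.allow", false);
            // proposed option; defaults to false, i.e. forbid all on a 5xx
            allow5xx = ConfUtils.getBoolean(conf, "http.robots.5xx.allow", false);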

@sebastian-nagel
Contributor

Also, the recently published RFC 9309 requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as "complete disallow" (see the "Unreachable Status" section). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is none (i.e. EMPTY_RULES).

Nutch handles 5xx failures and, after a few retries to fetch the robots.txt, suspends crawling content from the given site; see NUTCH-2573 and apache/nutch#724. Since fetch queues are implemented similarly in Nutch and StormCrawler, this mechanism could be ported to StormCrawler. Eventually, it would be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs into the topology again and again.
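
As a rough illustration of that deferral (just a sketch; the metadata key name and the way the tuple would be routed back through the status stream are assumptions, not the Nutch or StormCrawler implementation):

            // Sketch: instead of dropping the tuple when the robots.txt fetch
            // fails with a 5xx, push its next fetch date into the future so the
            // spout does not keep releasing the same URL into the topology.
            // Uses java.time.Instant and java.time.temporal.ChronoUnit.
            Instant deferredUntil = Instant.now().plus(1, ChronoUnit.HOURS); // or 1 day
            metadata.setValue("nextFetchDate", deferredUntil.toString());
            // the tuple would then be emitted on the status stream so that the
            // backend persists the updated date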

@jnioche
Contributor

jnioche commented Feb 23, 2023

Thanks for this discussion people!

Eventually, it would be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs into the topology again and again.

This could be done for an entire host with the mechanism suggested in #867. I have started working on it for the OpenSearch backend in branch 990, but it is still early days.

@sebastian-nagel
Contributor

the mechanism suggested in #867

Nice!

Just as a note: when running the Common Crawl crawls, temporarily suspending fetching from sites with a robots.txt 5xx HTTP status saved a lot of work responding to complaints from webmasters (sent automatically as abuse reports to AWS). This was in combination with a general slow-down (exponential backoff) on HTTP 5xx, 403 Forbidden and 429 Too Many Requests (see NUTCH-2946).
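
For reference, that kind of backoff usually boils down to doubling a delay per consecutive failure, capped at some maximum; a generic sketch, not the NUTCH-2946 implementation:

            // Generic exponential backoff: the base delay is doubled for each
            // consecutive failure (HTTP 5xx, 403 or 429) and capped to avoid
            // overflow and unbounded waits.
            long backoffDelaySeconds(int consecutiveFailures, long baseSeconds, long maxSeconds) {
                long delay = baseSeconds << Math.min(consecutiveFailures, 20);
                return Math.min(delay, maxSeconds);
            }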

@michaeldinzinger
Contributor Author

michaeldinzinger commented Feb 25, 2023

Also, the recently published RFC 9309 requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as "complete disallow" (see the "Unreachable Status" section). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is none (i.e. EMPTY_RULES).

Thank you, very interesting :)
As far as I understand, a possible modification would be to adapt the current handling of HTTP 429 and 5xx and, concretely, to set FORBID_ALL_RULES as the default instead of EMPTY_RULES. This is necessary to meet the requirements of the recently published RFC 9309.
A long-term solution would additionally add the parameters and the underlying mechanism (#867) to retry fetching the robots.txt a few times (in the case of HTTP 503 and maybe also 429) before settling for FORBID_ALL_RULES. Either way, a host would only be temporarily suspended from crawling, because StormCrawler will try to fetch the robots.txt again as soon as it is no longer in the error cache.
And as an add-on for the long-term solution, RFC 9309 would even allow bypassing the suspension of a host caused by a 5xx error on the robots.txt after getting the same server error for e.g. 30 days. But I don't see how this is easily implementable, so maybe it's better to just settle for a short-term solution for now?
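
Just to spell out what such a check might look like (a sketch only; where the per-host timestamp of the first observed 5xx would be persisted is left open, and the names are hypothetical):

            // RFC 9309: a long-lasting 5xx on the robots.txt may be treated as if
            // there were no robots.txt at all, while a recent 5xx remains a
            // complete disallow.
            BaseRobotRules rulesFor5xx(Instant firstErrorSeen, Duration maxErrorPeriod) {
                if (firstErrorSeen != null
                        && Duration.between(firstErrorSeen, Instant.now()).compareTo(maxErrorPeriod) > 0) {
                    return EMPTY_RULES; // e.g. maxErrorPeriod = 30 days
                }
                return FORBID_ALL_RULES;
            }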

michaeldinzinger added a commit to michaeldinzinger/storm-crawler that referenced this issue Apr 13, 2023
michaeldinzinger added commits to michaeldinzinger/storm-crawler that referenced this issue May 22, 2023
jnioche added a commit that referenced this issue May 23, 2023

* Remove injection from crawl topologies in *Search archetypes, fixes #1065
* BasicURLNormalizer.unmangleQueryString() returns invalid results if "&" symbol in a parent's path #1059 (#1062): fix the unmangleQueryString filter to not analyze the full URL path, just the last child
* Removed remaining references to ES in OpenSearch module
* Dependency upgrades, fixes #1066 (#1067)
* Automatic creation of index definitions should use the bolt type (#1069)
* Maven plugin upgrades + better handling of plugin versions
* Bugfix: test jar not attached
* Update maven.yml: v3 version of actions
* Mechanism to retrieve a more generic configuration value if a specific one is not found, fixes #1070 (#1071)
* Batch requests in DeleterBolt, fixes #1072
* Update README.md: link to docker project
* Create DeletionBolt.java for Solr (missing DeletionBolt code), #1050 (#1073)
* SOLR: suppress warnings + minor changes and Javadoc + added deletion to default topology
* Tika 2.8.0, fixes #1066
* Increase the number of redirects to 5 for robots.txt fetching, issue #1058 (#1074)
* Add test coverage reports with JaCoCo and Coveralls, fixes #1075
* #1075: Update GH workflow to reduce log spam by adding -B and --no-transfer-progress Maven options
* Issue #1042: forbid all rules by default, modify the robots.txt parsing logic, parse robots.txt rules only for status code 200, and add test cases (HttpRobotRulesParserTest)
jnioche closed this as completed May 23, 2023