
Adapting rules for parsing robots.txt file #1042
Closed
michaeldinzinger opened this issue Feb 20, 2023 · 6 comments

Comments

@michaeldinzinger
Contributor

michaeldinzinger commented Feb 20, 2023

Hello all,
while crawling, we ran into a politeness issue. We suppose its cause was a connection timeout when trying to fetch the robots.txt; as a consequence, the other webpages of this host were apparently crawled without any restriction, just as if the robots.txt had returned a 404.

As far as I can see, the logic for parsing the robots.txt file is implemented as follows:

            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if ((code == 403) && (!allowForbidden)) {
                robotRules = FORBID_ALL_RULES; // use forbid all
            } else if (code >= 500) {
                cacheRule = false;
                robotRules = EMPTY_RULES;
            } else robotRules = EMPTY_RULES; // use default rules

This is in HttpRobotRulesParser.java, lines 168-177.

A more suitable logic is described here: https://support.google.com/webmasters/answer/9679690#robots_details
It differentiates between the following cases (a rough sketch follows the list):

  • 403/404/410: no robots.txt is needed for crawling this website, i.e. it can be crawled without restrictions
  • 429/5xx: the fetch error results in no crawling of this website
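
For illustration, here is a rough sketch of how these cases could map onto the constants used in the snippet above, following the Google documentation literally (it is not the actual StormCrawler code, and it leaves the existing http.robots.403.allow behaviour aside):

            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if (code == 403 || code == 404 || code == 410) {
                // no usable robots.txt: the website may be crawled without restrictions
                robotRules = EMPTY_RULES;
            } else if (code == 429 || code >= 500) {
                // fetch error: do not crawl this website, and do not cache the decision
                cacheRule = false;
                robotRules = FORBID_ALL_RULES;
            } else {
                robotRules = EMPTY_RULES; // default
            }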

Please tell me your thoughts on this

@rzo1
Contributor

rzo1 commented Feb 20, 2023

It sounds valid to apply FORBID_ALL_RULES if we encounter "429 Too Many Requests" or an HTTP 5xx.

However, I can also think of use cases in which you would still want to apply EMPTY_RULES in such a case (or just stop being polite at all) ;-)

Maybe we can adjust the default here and add a configuration option to just ignore it (similar to http.robots.403.allow)?

@michaeldinzinger
Contributor Author

Maybe we can adjust the default here and add a configuration option to just ignore it (similar to http.robots.403.allow)?

Thank you, that sounds good :)
Maybe something like http.robots.connectionerror.skip or http.robots.5xx.allow, which defaults to false.
The code could then look like this:

            robotRules = FORBID_ALL_RULES; // forbid all by default
            if (code == 200) // found rules: parse them
            {
                String ct = response.getMetadata().getFirstValue(HttpHeaders.CONTENT_TYPE);
                robotRules = parseRules(url.toString(), response.getContent(), ct, agentNames);
            } else if (code == 403 && allowForbidden) {
                robotRules = EMPTY_RULES; // allow all
            } else if (code >= 500) {
                cacheRule = false;
                if (allow5xx) {
                    robotRules = EMPTY_RULES; // allow all
                }
            }
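
To complete the sketch, the new flag would presumably be read in the parser's configuration next to the existing one; something along these lines, assuming a ConfUtils-style boolean lookup (the key name http.robots.5xx.allow is just the proposal above, not an existing option):

            // existing option controlling the 403 behaviour
            allowForbidden = ConfUtils.getBoolean(conf, "http.robots.403.allow", false);
            // proposed option; defaults to false, i.e. forbid all on a 5xx
            allow5xx = ConfUtils.getBoolean(conf, "http.robots.5xx.allow", false);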

@sebastian-nagel
Contributor

Also, the recently published RFC 9309 requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as "complete disallow" (see the "Unreachable Status" section). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is none (i.e. EMPTY_RULES).

Nutch handles 5xx failures and, after a few retries to fetch the robots.txt, suspends crawling content from the given site; see NUTCH-2573 and apache/nutch#724. Since fetch queues are implemented similarly in Nutch and StormCrawler, this mechanism could be ported to StormCrawler. Eventually, it would be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs into the topology again and again.
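
As a rough illustration of that deferral (just a sketch; the metadata key name and the way the tuple would be routed back through the status stream are assumptions, not the Nutch or StormCrawler implementation):

            // Sketch: instead of dropping the tuple when the robots.txt fetch
            // fails with a 5xx, push its next fetch date into the future so the
            // spout does not keep releasing the same URL into the topology.
            // Uses java.time.Instant and java.time.temporal.ChronoUnit.
            Instant deferredUntil = Instant.now().plus(1, ChronoUnit.HOURS); // or 1 day
            metadata.setValue("nextFetchDate", deferredUntil.toString());
            // the tuple would then be emitted on the status stream so that the
            // backend persists the updated date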

@jnioche
Contributor

jnioche commented Feb 23, 2023

Thanks for this discussion people!

Eventually, it would be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs into the topology again and again.

This could be done for an entire host with the mechanism suggested in #867. I have started working on it for the OpenSearch backend in branch 990, but it is still early days.

@sebastian-nagel
Contributor

the mechanism suggested in #867

Nice!

Just as a note: when running the Common Crawl crawls, temporarily suspending fetching from sites with a robots.txt 5xx HTTP status saved a lot of work responding to complaints from webmasters (sent automatically as abuse reports to AWS). This was in combination with a general slow-down (exponential backoff) on HTTP 5xx, 403 Forbidden and 429 Too Many Requests (see NUTCH-2946).
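
For reference, that kind of backoff usually boils down to doubling a delay per consecutive failure, capped at some maximum; a generic sketch, not the NUTCH-2946 implementation:

            // Generic exponential backoff: the base delay is doubled for each
            // consecutive failure (HTTP 5xx, 403 or 429) and capped to avoid
            // overflow and unbounded waits.
            long backoffDelaySeconds(int consecutiveFailures, long baseSeconds, long maxSeconds) {
                long delay = baseSeconds << Math.min(consecutiveFailures, 20);
                return Math.min(delay, maxSeconds);
            }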

@michaeldinzinger
Contributor Author

michaeldinzinger commented Feb 25, 2023

Also, the recently published RFC 9309 requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as "complete disallow" (see the "Unreachable Status" section). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is none (i.e. EMPTY_RULES).

Thank you, very interesting :)
As far as I understand, a possible modification would be to adapt the current handling of HTTP 429 and 5xx and, concretely, to set FORBID_ALL_RULES as the default instead of EMPTY_RULES. This is necessary to meet the requirements of the recently published RFC 9309.
A long-term solution would additionally add the parameters and the underlying mechanism (#867) to retry fetching the robots.txt a few times (in the case of HTTP 503 and maybe also 429) before settling for FORBID_ALL_RULES. Either way, a host would only be temporarily suspended from crawling, because StormCrawler will try to fetch the robots.txt again as soon as it is no longer in the error cache.
And as an add-on for the long-term solution, RFC 9309 would even allow bypassing the suspension of a host caused by a 5xx error on the robots.txt after getting the same server error for e.g. 30 days. But I don't see how this is easily implementable, so maybe it's better to just settle for a short-term solution for now?
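
Just to spell out what such a check might look like (a sketch only; where the per-host timestamp of the first observed 5xx would be persisted is left open, and the names are hypothetical):

            // RFC 9309: a long-lasting 5xx on the robots.txt may be treated as if
            // there were no robots.txt at all, while a recent 5xx remains a
            // complete disallow.
            BaseRobotRules rulesFor5xx(Instant firstErrorSeen, Duration maxErrorPeriod) {
                if (firstErrorSeen != null
                        && Duration.between(firstErrorSeen, Instant.now()).compareTo(maxErrorPeriod) > 0) {
                    return EMPTY_RULES; // e.g. maxErrorPeriod = 30 days
                }
                return FORBID_ALL_RULES;
            }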

michaeldinzinger added a commit to michaeldinzinger/storm-crawler that referenced this issue Apr 13, 2023
michaeldinzinger added commits to michaeldinzinger/storm-crawler that referenced this issue May 22, 2023
jnioche added a commit that referenced this issue May 23, 2023

* Remove injection from crawl topologies in *Search archetypes, fixes #1065
* BasicURLNormalizer.unmangleQueryString() returns invalid results if "&" symbol in a parent's path #1059 (#1062): fix the unmangleQueryString filter to not analyze the full URL path, just the last child
* Removed remaining references to ES in OpenSearch module
* Dependency upgrades, fixes #1066 (#1067)
* Automatic creation of index definitions should use the bolt type (#1069)
* Maven plugin upgrades + better handling of plugin versions
* Bugfix: test jar not attached
* Update maven.yml: v3 version of actions
* Mechanism to retrieve a more generic configuration value if a specific one is not found, fixes #1070 (#1071)
* Batch requests in DeleterBolt, fixes #1072
* Update README.md: link to docker project
* Create DeletionBolt.java for Solr (missing DeletionBolt code), #1050 (#1073)
* SOLR: suppress warnings + minor changes and Javadoc + added deletion to default topology
* Tika 2.8.0, fixes #1066
* Increase the number of redirects to 5 for robots.txt fetching, issue #1058 (#1074)
* Add test coverage reports with JaCoCo and Coveralls, fixes #1075
* #1075: Update GH workflow to reduce log spam by adding -B and --no-transfer-progress Maven options
* Issue #1042: forbid all rules by default, modify the robots.txt parsing logic, parse robots.txt rules only for status code 200, and add test cases (HttpRobotRulesParserTest)
jnioche closed this as completed May 23, 2023