Adapting rules for parsing robots.txt file #1042
Comments
Sounds valid to apply a complete disallow in this case. However, I can also think of use cases in which you would still want to apply the current, more permissive behaviour. Maybe we can adjust the default here and add a configuration option to just ignore it.
Thank you, sounds good:)
Also, the recently published RFC 9309 requires treating a failure to fetch the robots.txt with an HTTP status of 500-599 as a "complete disallow" (see the section on the "Unreachable" status). However, if the 5xx status of the robots.txt is observed over a longer period of time, crawlers may assume that there is none (i.e. EMPTY_RULES). Nutch handles 5xx failures and, after a few retries to fetch the robots.txt, suspends crawling content from the given site. See NUTCH-2573 and apache/nutch#724. Since fetch queues are implemented similarly in Nutch and StormCrawler, this mechanism could be ported to StormCrawler. Eventually, it would be better not to just drop the URLs/tuples but to update nextFetchDate (by adding 1 hour or 1 day) to avoid the spout releasing the same URLs into the topology again and again.
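For illustration only, here is a minimal sketch of the nextFetchDate idea: instead of dropping the tuple on a robots.txt 5xx, the URL is postponed. The class and method names as well as the retry threshold are made up for this example, they are not StormCrawler or Nutch API.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch: push the nextFetchDate of a URL into the future when
// its robots.txt keeps returning 5xx, so the spout does not re-emit it
// immediately. Names and thresholds are assumptions.
public class RobotsDeferral {

    /** 1 hour for the first few 5xx observations, 1 day once they persist. */
    public static Instant deferredNextFetchDate(int consecutive5xx) {
        ChronoUnit unit = consecutive5xx < 3 ? ChronoUnit.HOURS : ChronoUnit.DAYS;
        return Instant.now().plus(1, unit);
    }

    public static void main(String[] args) {
        System.out.println(deferredNextFetchDate(1)); // roughly now + 1 hour
        System.out.println(deferredNextFetchDate(5)); // roughly now + 1 day
    }
}
```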
Thanks for this discussion people!
This could be done for an entire host with the mechanism suggested in #867. I have started working on it for the OpenSearch backend in branch 990, but it is still early days.
Nice! Just as a note: when running the Common Crawl crawls, temporarily suspending fetching from sites whose robots.txt returned a 5xx HTTP status saved a lot of work responding to complaints from webmasters (sent automatically as abuse reports to AWS). This was in combination with a general slow-down (exponential backoff) on HTTP 5xx, 403 Forbidden and 429 Too Many Requests (see NUTCH-2946).
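As a rough illustration of such a slow-down, the following is a minimal, self-contained sketch of a per-host exponential backoff; the class name, base delay and cap are assumptions, not an existing StormCrawler or Nutch API.

```java
import java.time.Duration;

// Illustrative sketch of an exponential backoff per host after HTTP 5xx,
// 403 Forbidden or 429 Too Many Requests. Base delay and cap are made up.
public class HostBackoff {

    private static final Duration BASE = Duration.ofMinutes(1);
    private static final Duration MAX = Duration.ofHours(24);

    /** Delay before the next attempt: BASE * 2^(failures - 1), capped at MAX. */
    public static Duration nextDelay(int consecutiveFailures) {
        int shift = Math.min(Math.max(consecutiveFailures - 1, 0), 20);
        Duration delay = BASE.multipliedBy(1L << shift);
        return delay.compareTo(MAX) > 0 ? MAX : delay;
    }

    public static void main(String[] args) {
        for (int failures = 1; failures <= 6; failures++) {
            System.out.println(failures + " failure(s) -> wait " + nextDelay(failures));
        }
    }
}
```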
Thank you, very interesting :)
Squashed commit messages referencing this issue:

* Remove injection from crawl topologies in *Search archetypes, fixes #1065
* BasicURLNormalizer.unmangleQueryString() returns invalid results if "&" symbol in a parent's path #1059 (#1062): fix the unmangleQueryString filter, do not analyze the full URL path, just the last child; formatting
* Removed remaining references to ES in OpenSearch module
* Dependency upgrades, fixes #1066 (#1067)
* Automatic creation of index definitions should use the bolt type (#1069)
* Maven plugin upgrades + better handling of plugin versions
* Bugfix: test jar not attached
* Update maven.yml to v3 version of actions
* Mechanism to retrieve a more generic value of configuration if a specific one is not found, fixes #1070 (#1071); minor javadoc fix
* Batch requests in DeleterBolt, fixes #1072
* Update README.md link to docker project
* Create DeletionBolt.java for Solr, #1050 (#1073): missing DeletionBolt bolt code; license header added; formatting
* SOLR: suppress warnings + minor changes and Javadoc + added deletion to default topology
* Tika 2.8.0, fixes 1066
* Increase the number of redirects to 5 for Robots.txt fetching (#1074); Issue #1058: allow 5 redirects for Robots.txt fetching; minor variable renaming
* Add test coverage reports with JaCoCo and Coveralls, fixes #1075
* #1075 - Add test coverage reports with JaCoCo
* #1075 - Update GH workflow to reduce log spam by adding -B and --no-transfer-progress maven options
* Rebase - Issue #1042: Forbid all rules by default
* Modify Robots.txt parsing logic and add test cases
* Parse robots.txt rules only for status code 200
* Trying to resolve merge conflicts
* Merge HttpRobotRulesParserTest

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
Co-authored-by: syefimov <syefimov@ptfs.com>
Co-authored-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Hello all,
while crawling, we ran into a politeness issue and we suppose that its cause is a connection timeout that occurred when trying to fetch the robots.txt. We suppose that, as a consequence, the other webpages of this host were crawled without any restriction, just as if there had been a 404 on the robots.txt. As far as I can see, the logic for parsing the robots.txt file is implemented as follows:
In HttpRobotRulesParser.java, lines 168-177.
A more suitable logic would be the one described here: https://support.google.com/webmasters/answer/9679690#robots_details
It should differentiate between the cases: a successful fetch (2xx), a client error such as a 404 (no restrictions apply), and a server error or an unreachable robots.txt (assume crawling is not allowed).
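For illustration only, here is a minimal sketch of that differentiation; the class, enum and method names are hypothetical and this is not the actual HttpRobotRulesParser code.

```java
// Hypothetical sketch of the proposed case differentiation; NOT the actual
// HttpRobotRulesParser code. 3xx is assumed to have been resolved by
// following redirects before this point; 429 would typically be grouped
// with the 5xx case.
public class RobotsTxtOutcome {

    enum Rules { PARSE_FETCHED_RULES, ALLOW_ALL, DISALLOW_ALL }

    /** statusCode is the HTTP status of the robots.txt fetch, or -1 if the
     *  fetch failed altogether (e.g. a connection timeout). */
    static Rules decide(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Rules.PARSE_FETCHED_RULES; // 2xx: obey the fetched rules
        }
        if (statusCode >= 400 && statusCode < 500) {
            return Rules.ALLOW_ALL;           // 4xx: treat as "no robots.txt"
        }
        // 5xx or unreachable: assume a complete disallow, at least until the
        // failure has been observed for long enough (RFC 9309, "Unreachable").
        return Rules.DISALLOW_ALL;
    }

    public static void main(String[] args) {
        System.out.println(decide(200)); // PARSE_FETCHED_RULES
        System.out.println(decide(404)); // ALLOW_ALL
        System.out.println(decide(503)); // DISALLOW_ALL
        System.out.println(decide(-1));  // DISALLOW_ALL (connection timeout)
    }
}
```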
Please tell me your thoughts on this.