[Robots.txt] Clarify behavior when to close blocks of multiple user-agents #390
Hi Sebastian, the spirit of that line in the RFC was to treat rules undefined by the RFC as comments, which, in your example, as you note, merges the bingbot group with the group that follows it. This was introduced because more common RFC-undefined rules such as Sitemap were closing groups when it was plainly obvious that that's not what the site owner meant to do (e.g., a block of disallow rules following without a preceding user-agent line). I think the current implementation is the best solution to that, but I'm open to suggestions if you have a better idea.
Hi @garyillyes, I agree that the RFC needs a clear and simple definition of how unsupported directives are treated. Ignoring them sounds reasonable; I have no better idea, at least for now. The spirit of crawler-commons and its robots.txt parser was always to find a compromise between the standard and common practice, so before we change the current implementation, I'll try to estimate the impact of a change. Eventually, we need to treat directives differently: the crawl-delay directive addresses individual user-agents, while the sitemap directive has global scope.
See also #114.
See also the comments in SimpleRobotRulesParser, line 629: it should be well-defined which instructions close a block of rules.
…gents, closes crawler-commons#390

[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes crawler-commons#114
- do not close rule blocks / groups on directives other than those specified in RFC 9309: groups are only closed on a user-agent line if at least one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override or set the value for a specific agent using a value defined for the wildcard agent
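The grouping rule described in the commit message above can be sketched as a small state machine: a `user-agent` line opens a new group only if at least one allow/disallow rule has been read since the previous `user-agent` line, and unknown directives never close a group. This is a minimal illustrative sketch, not the crawler-commons API; the class name `GroupingSketch` and its return shape are hypothetical.

```java
import java.util.*;

public class GroupingSketch {
    /**
     * Parse robots.txt text into a map from (lower-cased) agent name to
     * the list of allow/disallow rules that apply to it. Hypothetical
     * sketch of the RFC 9309 grouping rule; not the crawler-commons API.
     */
    public static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rulesByAgent = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean sawRuleInGroup = false;
        for (String line : robotsTxt.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash);
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            switch (key) {
                case "user-agent":
                    if (sawRuleInGroup) {
                        // rules were seen since the last user-agent line,
                        // so this line opens a new group
                        currentAgents.clear();
                        sawRuleInGroup = false;
                    }
                    String agent = value.toLowerCase(Locale.ROOT);
                    currentAgents.add(agent);
                    rulesByAgent.putIfAbsent(agent, new ArrayList<>());
                    break;
                case "allow":
                case "disallow":
                    sawRuleInGroup = true;
                    for (String a : currentAgents) {
                        rulesByAgent.get(a).add(key + ": " + value);
                    }
                    break;
                default:
                    // unknown directives (sitemap, crawl-delay, ...) are
                    // ignored here and do NOT close the current group
                    break;
            }
        }
        return rulesByAgent;
    }
}
```

With this rule, a group containing only a `crawl-delay` line (no allow/disallow) is still open when the next `user-agent` line arrives, so the two agent lists merge, which is exactly the bingbot behavior discussed above.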
The RFC reference parser includes a unit test verifying that sitemap and other directives not covered by the RFC are simply ignored and do not close blocks of rules; see robots_test.cc, line 126, which expects that BarBot is disallowed given:
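The test input itself was not captured above; an illustrative reconstruction of the pattern being tested (agent names and URL are placeholders, not the exact test data) might look like:

```
User-agent: FooBot
User-agent: BarBot
Sitemap: https://example.com/sitemap.xml
Disallow: /
```

If the Sitemap line closed the group, the Disallow rule would belong to no agent; since it is ignored, both FooBot and BarBot are disallowed.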
However, this interpretation may contradict the actual usage of certain non-RFC directives; see, for example, https://www.skyrock.com/robots.txt:
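The quoted excerpt was not captured here; a hypothetical fragment following the same pattern (not the actual file contents) would be:

```
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /
```

Under the "ignore unknown directives" interpretation, the Crawl-delay line does not close the bingbot group, so bingbot merges into the wildcard group and inherits `Disallow: /`.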
It's likely not intended to exclude bingbot entirely; in fact, it wouldn't make sense to define a crawl-delay for a user-agent that is entirely excluded. See also the discussion in google/robotstxt#51.