[Robots.txt] Clarify behavior when to close blocks of multiple user-agents #390

Closed
sebastian-nagel opened this issue Oct 20, 2022 · 4 comments · Fixed by #430

@sebastian-nagel
Contributor

The RFC reference parser includes a unit test to verify that sitemap and other directives not covered by the RFC are just ignored and do not close blocks of rules. See robots_test.cc, line 126, which expects that BarBot is disallowed given:

User-agent: BarBot
Sitemap: https://foo.bar/sitemap
User-agent: *
Disallow: /
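
For comparison, a minimal sketch of the same check against the crawler-commons API (using SimpleRobotRulesParser.parseContent and BaseRobotRules.isAllowed; the page URL is made up). Whether our parser agrees with the reference parser here is part of the question:

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class BarBotCheck {
    public static void main(String[] args) {
        String robotsTxt = "User-agent: BarBot\n"
                + "Sitemap: https://foo.bar/sitemap\n"
                + "User-agent: *\n"
                + "Disallow: /\n";
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://foo.bar/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8), "text/plain", "BarBot");
        // The C++ test expects BarBot to be disallowed here (the Sitemap line is
        // ignored and the two user-agent lines form one group); what
        // crawler-commons currently returns is what this issue is about.
        System.out.println(rules.isAllowed("https://foo.bar/page"));
    }
}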

However, this interpretation might contradict the actual usage of certain non-RFC directives, see for example https://www.skyrock.com/robots.txt:

User-agent: *
Disallow: /cache/
Disallow: /demo/
...

User-agent: Twitterbot
Allow: /img/

User-agent: bingbot
Crawl-delay: 30

User-agent: 007ac9
User-agent: AhrefsBot
User-agent: LinqiaBot
User-agent: LinqiaRSSBot
User-agent: SMTBot
User-agent: HaosouSpider
User-agent: OpenindexSpider
User-agent: PetalBot
User-agent: serpstatbot
Disallow: / 

...

It's likely not intended to exclude bingbot entirely; in fact, it wouldn't make sense to define a crawl-delay for a user-agent that is entirely excluded. See also the discussion in google/robotstxt#51.
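
For illustration, a minimal sketch checking bingbot with crawler-commons against an abbreviated (partly hypothetical) version of the file above; whether bingbot comes out as excluded entirely depends on exactly this interpretation:

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class BingbotCheck {
    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /cache/\n\n"
                + "User-agent: bingbot\n"
                + "Crawl-delay: 30\n\n"
                + "User-agent: AhrefsBot\n"
                + "User-agent: SMTBot\n"
                + "Disallow: /\n";
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://www.skyrock.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8), "text/plain", "bingbot");
        // If Crawl-delay is treated as a comment that does not close the group,
        // the bingbot group is merged with the following "Disallow: /" group
        // and bingbot is excluded entirely; otherwise it only gets a crawl delay.
        System.out.println(rules.isAllowed("https://www.skyrock.com/some/page"));
        System.out.println(rules.getCrawlDelay());
    }
}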

@garyillyes

Hi Sebastian,

The spirit of the line in the RFC was to treat rules undefined by the RFC as comments, which in your example, as you note, merges the bingbot group with the group that follows it. This was introduced because more common RFC-undefined rules such as Sitemap were closing groups when it was plainly obvious that that's not what the site owner meant (i.e. a block of disallow rules following without a preceding user-agent). I think the current implementation is the best solution to that, but I'm open to suggestions if you have a better idea.
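
For example, in a (hypothetical) file like

User-agent: *
Sitemap: https://example.com/sitemap.xml
Disallow: /private/

a parser that lets the Sitemap line close the group leaves the Disallow rule without any user-agent, whereas treating the line as a comment keeps the rule attached to the wildcard group, which is almost certainly what the site owner meant.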

@sebastian-nagel
Contributor Author

Hi @garyillyes, I agree that the RFC needs a clear and simple definition of how unsupported directives are treated. Ignoring them sounds reasonable. I have no better idea, at least for now.

The spirit of crawler-commons and its robots.txt parser has always been to find a compromise between the standard and common practice. Before we change the current implementation, I'll try to estimate the impact of such a change. We may eventually need to treat directives differently: the crawl-delay directive addresses individual user-agents, while the sitemap directive has global scope.
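
As an illustration of the scope difference, a minimal sketch against the crawler-commons API (assuming BaseRobotRules.getCrawlDelay() and getSitemaps(); the agent names and URLs are made up):

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class ScopeCheck {
    public static void main(String[] args) {
        String robotsTxt = "User-agent: fastbot\n"
                + "Crawl-delay: 1\n\n"
                + "User-agent: slowbot\n"
                + "Crawl-delay: 30\n\n"
                + "Sitemap: https://example.com/sitemap.xml\n";
        BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                "https://example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8), "text/plain", "slowbot");
        // The crawl delay applies only to the group matched for "slowbot" ...
        System.out.println(rules.getCrawlDelay());
        // ... while sitemap URLs are collected from anywhere in the file.
        System.out.println(rules.getSitemaps());
    }
}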

@sebastian-nagel
Contributor Author

See also #114.

@sebastian-nagel sebastian-nagel added this to the 1.4 milestone May 11, 2023
@sebastian-nagel
Contributor Author

See also the comments in SimpleRobotRulesParser, line 629: it should be well defined which instructions close a block of rules.

sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Jun 16, 2023
…gents, closes crawler-commons#390

[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes crawler-commons#114
- do not close rule blocks / groups on directives other than those
  specified in RFC 9309: groups are only closed on a user-agent line if
  at least one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent
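
Roughly, the grouping rule described in the first bullet point corresponds to a sketch like the following (illustrative only, not the actual SimpleRobotRulesParser code):

import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {
    // Split a robots.txt file into groups of lines: a user-agent line starts
    // a new group only if at least one allow/disallow line was read before;
    // all other directives (Sitemap, Crawl-delay, unknown ones) never close
    // a group.
    public static List<List<String>> groupLines(List<String> lines) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean currentHasRules = false;
        for (String line : lines) {
            String l = line.trim().toLowerCase();
            if (l.startsWith("user-agent:")) {
                if (currentHasRules) {
                    groups.add(current);
                    current = new ArrayList<>();
                    currentHasRules = false;
                }
                current.add(line);
            } else if (l.startsWith("allow:") || l.startsWith("disallow:")) {
                current.add(line);
                currentHasRules = true;
            } else {
                // Sitemap, Crawl-delay, comments, unknown directives:
                // keep the current group open.
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}

Crawl-delay values are then recorded for the agents of the group they appear in, independently of whether that group also contains allow/disallow rules.
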
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Jul 10, 2023
…gents, closes crawler-commons#390

[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes crawler-commons#114
- do not close rule blocks / groups on directives other than those
  specified in RFC 9309: groups are only closed on a user-agent line if
  at least one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent