
Handle robots.txt with missing sections (and implicit master rules) #114

Closed
kkrugler opened this issue Mar 9, 2016 · 1 comment · Fixed by #430
Comments

kkrugler (Contributor) commented Mar 9, 2016

The robots.txt file at http://www.scotsman.com/robots.txt has a number of issues...

  1. There's no blank line between rule sections
  2. The "*" user-agent section is obviously intended to apply to the explicitly named agents (e.g. "bingbot") as well, even though technically it won't.
  3. This isn't really an issue, but we should make sure we process all sitemap refs, as there is one in the middle of the general rule block and a bunch more at the end.
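On the sitemap point: since Sitemap is a non-group directive, a parser can simply collect every Sitemap line wherever it appears in the file. A minimal illustrative sketch (not crawler-commons code; the function name and sample are made up):

```python
def collect_sitemaps(robots_txt):
    """Collect every Sitemap directive, regardless of where it appears
    (inside or outside any user-agent group)."""
    sitemaps = []
    for raw in robots_txt.splitlines():
        # Split on the first colon only, so URLs (which contain ':') survive.
        field, sep, value = raw.partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps

sample = """\
User-agent: *
Disallow: /login
Sitemap: http://www.scotsman.com/archive/sitemap-archive.xml
Disallow: /assets/
"""
print(collect_sitemaps(sample))
```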
kkrugler (Contributor, issue author) commented Mar 9, 2016

Here's a saved copy of the robots.txt file...
scotsman-robots.txt, and the interesting bits from the beginning...

User-Agent: bingbot
Crawl-delay: 1
User-Agent: msnbot
Crawl-delay: 1
User-agent: *
Disallow: /ajax 
Disallow: /login
Disallow: /logout
Sitemap: http://www.scotsman.com/archive/sitemap-archive.xml
Disallow: /assets/
<many more disallows>
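Whether bingbot inherits the disallow rules above depends on when the parser closes a group. A hedged Python sketch (illustrative only, not the crawler-commons implementation) of the "close a group only on a User-agent line after at least one allow/disallow was read" behaviour shows that, with that rule, all three agents end up sharing one group:

```python
def parse_groups(lines):
    """Group robots.txt lines: consecutive User-agent lines share one group,
    and a User-agent line starts a NEW group only if at least one
    allow/disallow rule was read since the last group started.
    Other directives (e.g. Crawl-delay, Sitemap) do not close a group."""
    groups = []                      # list of (agents, rules)
    agents, rules = [], []
    seen_rule = False
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:            # close the previous group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            rules.append((field, value))
            seen_rule = True
    if agents:
        groups.append((agents, rules))
    return groups

robots = """\
User-Agent: bingbot
Crawl-delay: 1
User-Agent: msnbot
Crawl-delay: 1
User-agent: *
Disallow: /ajax
Disallow: /login
"""
groups = parse_groups(robots.splitlines())
```

With this grouping, bingbot, msnbot, and "*" form a single group, so the disallow rules apply to the named agents as the site clearly intended.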

@jnioche jnioche added the robots label Apr 4, 2017
@sebastian-nagel sebastian-nagel added this to the 1.4 milestone Aug 11, 2022
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Jun 16, 2023
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Jun 16, 2023
…gents, closes crawler-commons#390

[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes crawler-commons#114
- do not close rule blocks / groups on other directives than specified
  in RFC 9309: groups are only closed on a user-agent line if at least
  one allow/disallow line was read before
- set Crawl-delay independently from grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent
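The Crawl-delay behaviour in the last bullets can be sketched as a small resolver (a hypothetical helper, assuming the delays and the set of explicitly named agents have already been collected while parsing; this is not the actual crawler-commons API):

```python
def crawl_delay_for(agent, named_agents, delays):
    """Resolve the effective crawl delay for `agent`.

    `named_agents` is the set of agent names that appear explicitly in
    robots.txt; `delays` maps agent name (or "*") to seconds. A wildcard
    delay is only a fallback for agents never named in the file; it never
    overrides or fills in the value for a specifically named agent."""
    if agent in delays:
        return delays[agent]       # the agent's own value always wins
    if agent in named_agents:
        return None                # named but no delay of its own: don't borrow "*"
    return delays.get("*")         # unnamed agents fall back to the wildcard

# Hypothetical values loosely based on the scotsman.com file, plus an
# invented wildcard delay of 5 to show the no-override behaviour.
named = {"bingbot", "msnbot"}
delays = {"bingbot": 1, "msnbot": 1, "*": 5}
print(crawl_delay_for("bingbot", named, delays))    # bingbot keeps its own delay
print(crawl_delay_for("googlebot", named, delays))  # unnamed agent uses the wildcard
```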
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Jul 10, 2023