
[Robots.txt] Close groups of rules as defined in RFC 9309 #430

Conversation

sebastian-nagel
Contributor

This PR addresses #390 and #114.

Changes:

  • rule groups are closed as specified in RFC 9309: on a user-agent line, but only if at least one allow/disallow line was read before (see the sketch after this list)
  • consequently, Sitemap, Crawl-delay and other directives not covered by RFC 9309 do not close a rule group
  • Crawl-delay is set independently of grouping:
    • it is set for a given agent or for the wildcard agent
    • but a value for a specific agent is never set from a value defined for the wildcard agent
  • unit tests are updated or added accordingly
    • for changed tests it is verified that the RFC reference parser (google/robotstxt) behaves the same
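For illustration, a minimal sketch of this group-closing rule in Java (class, helper, and field names are hypothetical and do not mirror crawler-commons' SimpleRobotRulesParser):

    import java.util.List;

    // Sketch of RFC 9309 group closing: a user-agent line closes the
    // current group only if an allow/disallow line was read before.
    class GroupClosingSketch {
        private boolean sawRuleInGroup = false; // allow/disallow read since last user-agent?

        void parse(List<String> lines) {
            for (String line : lines) {
                String[] parts = line.split(":", 2);
                if (parts.length < 2) {
                    continue; // skip empty/malformed lines
                }
                String directive = parts[0].trim().toLowerCase();
                String value = parts[1].trim();
                switch (directive) {
                    case "user-agent":
                        if (sawRuleInGroup) {
                            // at least one allow/disallow was read before:
                            // close the current group and start a new one
                            closeCurrentGroup();
                            sawRuleInGroup = false;
                        }
                        addAgent(value); // otherwise: another agent of the same group
                        break;
                    case "allow":
                    case "disallow":
                        sawRuleInGroup = true;
                        addRule(directive, value);
                        break;
                    default:
                        // Sitemap, Crawl-delay and other directives are not
                        // covered by RFC 9309 and do NOT close the group
                        handleOther(directive, value);
                }
            }
        }

        // Hypothetical stubs standing in for the real bookkeeping.
        private void closeCurrentGroup() {}
        private void addAgent(String agent) {}
        private void addRule(String directive, String path) {}
        private void handleOther(String directive, String value) {}
    }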

This PR should unblock #245 (all unit tests added there should now pass).

…gents, closes crawler-commons#390

[Robots.txt] Handle robots.txt with missing sections (and implicit master rules), fixes crawler-commons#114
- do not close rule blocks / groups on directives other than those
  specified in RFC 9309: groups are only closed on a user-agent line
  if at least one allow/disallow line was read before
- set Crawl-delay independently of grouping, but never override
  or set the value for a specific agent using a value defined for the
  wildcard agent
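For illustration, a made-up robots.txt (not taken from the PR's unit tests) where the old and the RFC 9309 behavior differ:

    User-agent: foobot
    User-agent: barbot                         # same group: no allow/disallow read yet
    Disallow: /private/

    Sitemap: https://example.com/sitemap.xml   # neither line closes the group, so ...
    Crawl-delay: 5
    Allow: /private/public/                    # ... this rule still applies to foobot and barbot

    User-agent: bazbot                         # rules were read before: a new group starts here
    Disallow: /other/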
…gents

- fix unit test broken by introducing compliance with RFC 9309
sebastian-nagel force-pushed the cc-390-114-robots-closing-rule-group branch from 35053f5 to 17e8544 on July 10, 2023, 11:00
…gents

- keep the state whether Crawl-delay is already set for a specific
  agent in a separate variable
- add a unit test to ensure that an already set Crawl-delay is not
  overridden by a (lower) value of another agent
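A minimal sketch of that bookkeeping (field and method names are illustrative, not the actual parser's):

    // Hypothetical sketch: the Crawl-delay of a specific agent is kept
    // in a separate variable and is never set from the wildcard value.
    class CrawlDelayStateSketch {
        private long specificDelay = -1; // from a group matching the specific agent
        private long wildcardDelay = -1; // from a '*' group

        void onCrawlDelay(long delay, boolean inSpecificAgentGroup) {
            if (inSpecificAgentGroup) {
                if (specificDelay < 0) {
                    // keep the first value; a later (possibly lower) value
                    // read for another matching agent must not override it
                    specificDelay = delay;
                }
            } else if (wildcardDelay < 0) {
                wildcardDelay = delay;
            }
        }

        // agentGroupMatched: whether a group for the specific agent matched at all
        long effectiveDelay(boolean agentGroupMatched) {
            // the delay for a specific agent is never filled in
            // from the value defined for the wildcard agent
            return agentGroupMatched ? specificDelay : wildcardDelay;
        }
    }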
@sebastian-nagel
Contributor Author

Thanks for the review, @jnioche!

sebastian-nagel merged commit 871e4e6 into crawler-commons:master on Jul 12, 2023
3 of 4 checks passed