
[Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks #362

Conversation

@sebastian-nagel sebastian-nagel marked this pull request as draft February 21, 2022 21:49
@sebastian-nagel (Contributor, Author) commented:

Thanks, @rzo1! But I should also add a fix. The (failing) unit tests are just an example showing that the prefix matching is somewhat counter-intuitive, in addition to contradicting both the new and the old RFC.

@sebastian-nagel sebastian-nagel modified the milestones: 1.3, 1.4 Jul 14, 2022
@sebastian-nagel sebastian-nagel force-pushed the cc-192-robotstxt-user-agent-string-matching branch from e6b08e3 to 1a40b4a on August 11, 2022 12:16
@sebastian-nagel sebastian-nagel force-pushed the cc-192-robotstxt-user-agent-string-matching branch from 1a40b4a to 97718ef on October 20, 2022 13:08
@sebastian-nagel sebastian-nagel marked this pull request as ready for review October 20, 2022 13:11
@sebastian-nagel (Contributor, Author) commented Oct 20, 2022:

Updated the PR...

To summarize the issues around matching user agent names:

  1. the new RFC 9309 is pretty clear: both the crawler (robot, user-agent) name and the name in the user-agent line are required to be single word tokens (regex [a-zA-Z_-]+), and matching is done literally but case-insensitively on the entire word token.

  2. the first RFC proposal only requires that the user-agent token is a substring of the User-Agent line in the robots.txt:

    The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.

  3. our current implementation is different from 1 and 2:

    a. the user-agent line is split at [ \t,] into tokens
    b. the configured user-agent name is also split into tokens at white space (U+0020)
    c. a user-agent line matches if one of the line tokens (from a) is a prefix of one of the name tokens (from b)
    d. the first matched non-wildcard block of allow/disallow rules is selected

    In short, the current implementation allows crawler developers to lazily reuse the HTTP User-Agent string for the robots.txt parser as well. It does not cover the case where the HTTP User-Agent string is used in the robots.txt file itself.

  4. the current robots.txt parser API allows passing
    a. multiple agent names in a comma-separated string
    b. and/or compound agent names such as "Mozilla Crawlerbot-super 1.0", where every whitespace-separated word is matched

Issues with the current implementation are caused by conflicts when robot names share a common prefix, e.g. googlebot and googlebot-news. But the real trouble (which is what #192 is about) starts when both the user-agent line and the configured name consist of multiple tokens and include common parts such as "Mozilla" (the sketch below contrasts the two matching behaviors).
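
To make the difference concrete, here is a minimal, self-contained sketch (illustrative only, not the project's code; class and method names are made up) contrasting RFC 9309 matching with the legacy prefix matching described in item 3:

```java
public class UserAgentMatchDemo {

    /** RFC 9309: robot name and User-agent line value are single word
     *  tokens, compared literally but case-insensitively. */
    static boolean rfc9309Match(String uaLineValue, String robotName) {
        return uaLineValue.equalsIgnoreCase(robotName);
    }

    /** Legacy behavior (simplified): split the User-agent line at [ \t,]
     *  and the robot name at spaces, then report a match if any line token
     *  is a prefix of any name token. */
    static boolean legacyPrefixMatch(String uaLineValue, String robotName) {
        for (String lineToken : uaLineValue.toLowerCase().split("[ \t,]+")) {
            for (String nameToken : robotName.toLowerCase().split(" ")) {
                if (!lineToken.isEmpty() && nameToken.startsWith(lineToken)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A "User-agent: googlebot" block checked against a robot named "googlebot-news":
        System.out.println(rfc9309Match("googlebot", "googlebot-news"));      // false
        System.out.println(legacyPrefixMatch("googlebot", "googlebot-news")); // true (prefix conflict)
    }
}
```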

Decisions to resolve these issues:

  1. extend the robots parser API and add a method to pass agent names as a collection, following RFC 9309, with no splitting of the names into words/tokens (see the usage sketch after this list).

    This still allows for multiple robot names, including variants which include a version string or similar ("webcrawler/3.0").

  2. by default, user-agent names are matched literally but case-insensitively, following RFC 9309

  3. the old behavior can be restored by calling "setExactUserAgentMatching(false)"

  4. the method which splits the robot name into tokens and performs prefix matching is deprecated but kept for backward-compatibility.

  5. BaseRobotRulesParser: move the documented details about how user-agent names are matched into SimpleRobotRulesParser. The abstract base class shouldn't already define the behavior without a way to actually enforce it.
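
A short usage sketch of these decisions follows. The existence of a collection-based method and of setExactUserAgentMatching is stated above; the exact parseContent overload shown here is an assumption, not quoted from the PR:

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsParsingSketch {
    public static void main(String[] args) {
        byte[] robotsTxt = String.join("\n",
                "User-agent: WebCrawler/3.0",
                "Disallow: /private/").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // Decision 2: by default, names are matched literally but case-insensitively.
        // Decision 1 (signature assumed): agent names are passed as a collection,
        // without splitting them into words/tokens.
        BaseRobotRules rules = parser.parseContent("https://example.org/robots.txt",
                robotsTxt, "text/plain", Collections.singleton("webcrawler/3.0"));
        System.out.println(rules.isAllowed("https://example.org/private/page")); // false

        // Decision 3: legacy prefix matching can be restored explicitly.
        parser.setExactUserAgentMatching(false);
    }
}
```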

@sebastian-nagel (Contributor, Author) commented:

Following the discussion in google/robotstxt#56, the behavior should be changed so that a robot name "foo" also matches all of the following (a minimal sketch is given below):

User-agent: Foo/1.2
User-agent: Foo Bar

Eventually, make the user-agent matching configurable and implement it in a method which can be overridden if a user is stuck with an agent name containing characters outside [a-zA-Z_-], e.g. "Mail.Ru", "Go!Zilla", "360Spider".
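
A minimal standalone sketch of the behavior proposed above (not the PR's implementation): take the leading run of RFC 9309 product-token characters from the User-agent line value and compare it case-insensitively against the robot name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingTokenMatch {
    private static final Pattern PRODUCT_TOKEN = Pattern.compile("^[a-zA-Z_-]+");

    /** True if the leading product token of the User-agent line value
     *  equals the robot name (case-insensitively). */
    static boolean matches(String uaLineValue, String robotName) {
        Matcher m = PRODUCT_TOKEN.matcher(uaLineValue);
        return m.find() && m.group().equalsIgnoreCase(robotName);
    }

    public static void main(String[] args) {
        System.out.println(matches("Foo/1.2", "foo")); // true
        System.out.println(matches("Foo Bar", "foo")); // true
        System.out.println(matches("Foobar", "foo"));  // false: no partial prefix match
    }
}
```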

- add unit test to verify that the rule with the completely
  matched user-agent name is selected, and no partial prefix match
  is preferred (cf. also crawler-commons#192)
- refactor agent name matching and move splitting robotNames string
  at comma into a separate method to be called once at the beginning
  of parsing the robots.txt file

- extend the robots parser API and add a method to pass agent names
  as a collection following the RFC 9309 with no splitting of the
  names into words/tokens.

- deprecate "old" method which splits the robot name into tokens and
  performs prefix matching

- by default user agent names are matched literally but case-insensitive
  following RFC 9309. Add method to "restore" the prefix matching:
  "setExactUserAgentMatching(false)"

- BaseRobotRulesParser: move the documented details about how
  user-agent names are matched into SimpleRobotRulesParser

- unit tests: add tests for issues described in crawler-commons#192, configure exact
  user-agent matching if required
- match user-agent product token at beginning of user-agent
  line/statement followed by ignored non-token characters,
  e.g. "foo" is matched in "User-agent: foo/1.2"
- match user-agent product tokens followed by ignored characters
  also in legacy prefix matching mode, e.g. match "butterfly" in
  "User-agent: Butterfly/1.0"
- refactor prefix matching: switch inner and outer loop, handle
  check for (common) wild-card user-agent outside of loop
@sebastian-nagel sebastian-nagel force-pushed the cc-192-robotstxt-user-agent-string-matching branch from 97718ef to c57d716 on April 22, 2023 18:49
- make exact user-agent matching the default in unit tests,
  explicitly pass flag for legacy prefix user-agent matching
  in unit tests where needed
  - names not following the ua pattern in the specification "[a-zA-Z_-]+"
  - user-agent lines with multiple user-agent names
- make the method to handle prefix/partial user-agent product token
  matches protected, so that it can be overridden to match non-standard
  user-agent product tokens, e.g. "Go!zilla"
@sebastian-nagel sebastian-nagel marked this pull request as ready for review April 23, 2023 15:43
@sebastian-nagel (Contributor, Author) commented:

A "prefix" match up to the first non-product-token character is now implemented. I.e. the product token "foo" is matched in User-agent: Foo/1.2. To handle partial matches of special product tokens ("Go!Zilla", etc.) the method "userAgentProductTokenPartialMatch" can be overridden.

@jnioche (Contributor) commented Apr 24, 2023:

LGTM @sebastian-nagel, nice one! Feel free to merge

@sebastian-nagel sebastian-nagel merged commit d8a6126 into crawler-commons:master Apr 24, 2023
2 checks passed
@sebastian-nagel (Contributor, Author) commented:

Thanks, @jnioche and @rzo1!
