
[Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks #362

Conversation

@sebastian-nagel sebastian-nagel marked this pull request as draft February 21, 2022 21:49
@sebastian-nagel (Contributor, Author) commented:

Thanks, @rzo1! But I should also add a fix. The (failing) unit tests are just an example showing that the prefix matching is somewhat counter-intuitive, in addition to contradicting both the new and the old RFC.

@sebastian-nagel sebastian-nagel modified the milestones: 1.3, 1.4 Jul 14, 2022
@sebastian-nagel sebastian-nagel force-pushed the cc-192-robotstxt-user-agent-string-matching branch from e6b08e3 to 1a40b4a on August 11, 2022 12:16
@sebastian-nagel sebastian-nagel force-pushed the cc-192-robotstxt-user-agent-string-matching branch from 1a40b4a to 97718ef on October 20, 2022 13:08
@sebastian-nagel sebastian-nagel marked this pull request as ready for review October 20, 2022 13:11
@sebastian-nagel (Contributor, Author) commented Oct 20, 2022:

Updated the PR...

To summarize the issues around matching user agent names:

  1. the new RFC 9309 is pretty clear: both the crawler (robot, user-agent) name and the name in the user-agent line are required to be single word tokens (regex [a-zA-Z_-]+), and matching is done literally but case-insensitively on the entire word token.

  2. the first RFC proposal only requires that the user-agent token is a substring of the User-Agent line in the robots.txt:

    The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.

  3. our current implementation is different from 1 and 2:

    a. the user-agent line is split at [ \t,] into tokens
    b. the configured user-agent name is also split into tokens at white space (U+0020)
    c. a user-agent line matches if one of the line tokens (from a) is a prefix of one of the name tokens (from b)
    d. the first matched non-wildcard block of allow/disallow rules is selected

    In short, the current implementation allows crawler developers to lazily reuse the HTTP User-Agent string for the robots.txt parser as well. It does not cover the case where the HTTP User-Agent string is used in the robots.txt file itself.

  4. the current robots.txt parser API allows passing
    a. multiple agent names in a comma-separated string
    b. and/or compound agent names such as "Mozilla Crawlerbot-super 1.0", where every whitespace-separated word is matched

Issues with the current implementation are caused by conflicts when robot names share a common prefix, e.g. googlebot and googlebot-news. But the real trouble (which is what #192 is about) starts when both the user-agent line and the configured name consist of multiple tokens and include common parts such as "Mozilla" (the sketch below contrasts the two matching behaviors).
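
To make the difference concrete, here is a minimal, self-contained sketch (illustrative only, not the project's code; class and method names are made up) contrasting RFC 9309 matching with the legacy prefix matching described in item 3:

```java
public class UserAgentMatchDemo {

    /** RFC 9309: robot name and User-agent line value are single word
     *  tokens, compared literally but case-insensitively. */
    static boolean rfc9309Match(String uaLineValue, String robotName) {
        return uaLineValue.equalsIgnoreCase(robotName);
    }

    /** Legacy behavior (simplified): split the User-agent line at [ \t,]
     *  and the robot name at spaces, then report a match if any line token
     *  is a prefix of any name token. */
    static boolean legacyPrefixMatch(String uaLineValue, String robotName) {
        for (String lineToken : uaLineValue.toLowerCase().split("[ \t,]+")) {
            for (String nameToken : robotName.toLowerCase().split(" ")) {
                if (!lineToken.isEmpty() && nameToken.startsWith(lineToken)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A "User-agent: googlebot" block checked against a robot named "googlebot-news":
        System.out.println(rfc9309Match("googlebot", "googlebot-news"));      // false
        System.out.println(legacyPrefixMatch("googlebot", "googlebot-news")); // true (prefix conflict)
    }
}
```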

Decisions to resolve these issues:

  1. extend the robots parser API and add a method to pass agent names as a collection, following RFC 9309, with no splitting of the names into words/tokens (see the usage sketch after this list).

    This still allows for multiple robot names, including variants which include a version string or similar ("webcrawler/3.0").

  2. by default, user-agent names are matched literally but case-insensitively, following RFC 9309

  3. the old behavior can be restored by calling "setExactUserAgentMatching(false)"

  4. the method which splits the robot name into tokens and performs prefix matching is deprecated but kept for backward-compatibility.

  5. BaseRobotRulesParser: move the documented details about how user-agent names are matched into SimpleRobotRulesParser. The abstract base class shouldn't already define the behavior without a way to actually enforce it.
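
A short usage sketch of these decisions follows. The existence of a collection-based method and of setExactUserAgentMatching is stated above; the exact parseContent overload shown here is an assumption, not quoted from the PR:

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsParsingSketch {
    public static void main(String[] args) {
        byte[] robotsTxt = String.join("\n",
                "User-agent: WebCrawler/3.0",
                "Disallow: /private/").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

        // Decision 2: by default, names are matched literally but case-insensitively.
        // Decision 1 (signature assumed): agent names are passed as a collection,
        // without splitting them into words/tokens.
        BaseRobotRules rules = parser.parseContent("https://example.org/robots.txt",
                robotsTxt, "text/plain", Collections.singleton("webcrawler/3.0"));
        System.out.println(rules.isAllowed("https://example.org/private/page")); // false

        // Decision 3: legacy prefix matching can be restored explicitly.
        parser.setExactUserAgentMatching(false);
    }
}
```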

@sebastian-nagel (Contributor, Author) commented:

Following the discussion in google/robotstxt#56, the behavior should be changed so that a robot name "foo" also matches all of the following (a minimal sketch is given below):

User-agent: Foo/1.2
User-agent: Foo Bar

Eventually, make the user-agent matching configurable and implement it in a method which can be overridden if a user is stuck with an agent name containing characters outside [a-zA-Z_-], e.g. "Mail.Ru", "Go!Zilla", "360Spider".
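
A minimal standalone sketch of the behavior proposed above (not the PR's implementation): take the leading run of RFC 9309 product-token characters from the User-agent line value and compare it case-insensitively against the robot name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingTokenMatch {
    private static final Pattern PRODUCT_TOKEN = Pattern.compile("^[a-zA-Z_-]+");

    /** True if the leading product token of the User-agent line value
     *  equals the robot name (case-insensitively). */
    static boolean matches(String uaLineValue, String robotName) {
        Matcher m = PRODUCT_TOKEN.matcher(uaLineValue);
        return m.find() && m.group().equalsIgnoreCase(robotName);
    }

    public static void main(String[] args) {
        System.out.println(matches("Foo/1.2", "foo")); // true
        System.out.println(matches("Foo Bar", "foo")); // true
        System.out.println(matches("Foobar", "foo"));  // false: no partial prefix match
    }
}
```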

- add unit test to verify that the rule with the completely
  matched user-agent name is selected, and no partial prefix match
  is preferred (cf. also crawler-commons#192)
- refactor agent name matching and move splitting robotNames string
  at comma into a separate method to be called once at the beginning
  of parsing the robots.txt file

- extend the robots parser API and add a method to pass agent names
  as a collection following the RFC 9309 with no splitting of the
  names into words/tokens.

- deprecate "old" method which splits the robot name into tokens and
  performs prefix matching

- by default user agent names are matched literally but case-insensitive
  following RFC 9309. Add method to "restore" the prefix matching:
  "setExactUserAgentMatching(false)"

- BaseRobotRulesParser: move the documented details about how
  user-agent names are matched into SimpleRobotRulesParser

- unit tests: add tests for issues described in crawler-commons#192, configure exact
  user-agent matching if required
- match user-agent product token at beginning of user-agent
  line/statement followed by ignored non-token characters,
  e.g. "foo" is matched in "User-agent: foo/1.2"
- match user-agent product tokens followed by ignored characters
  also in legacy prefix matching mode, e.g. match "butterfly" in
  "User-agent: Butterfly/1.0"
- refactor prefix matching: switch inner and outer loop, handle
  check for (common) wild-card user-agent outside of loop
@sebastian-nagel sebastian-nagel force-pushed the cc-192-robotstxt-user-agent-string-matching branch from 97718ef to c57d716 on April 22, 2023 18:49
- make exact user-agent matching the default in unit tests,
  explicitly pass flag for legacy prefix user-agent matching
  in unit tests where needed
  - names not following the ua pattern in the specification "[a-zA-Z_-]+"
  - user-agent lines with multiple user-agent names
- make the method to handle prefix/partial user-agent product token
  matches protected, so that it can be overridden to match non-standard
  user-agent product tokens, e.g. "Go!zilla"
@sebastian-nagel sebastian-nagel marked this pull request as ready for review April 23, 2023 15:43
@sebastian-nagel (Contributor, Author) commented:

A "prefix" match up to the first non-product-token character is now implemented. I.e. the product token "foo" is matched in User-agent: Foo/1.2. To handle partial matches of special product tokens ("Go!Zilla", etc.) the method "userAgentProductTokenPartialMatch" can be overridden.

@jnioche (Contributor) commented Apr 24, 2023:

LGTM @sebastian-nagel, nice one! Feel free to merge

@sebastian-nagel sebastian-nagel merged commit d8a6126 into crawler-commons:master Apr 24, 2023
2 checks passed
@sebastian-nagel (Contributor, Author) commented:

Thanks, @jnioche and @rzo1!
