[Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks #362
sebastian-nagel commented on Feb 20, 2022
- add unit test to verify that the rule with the completely matched user-agent name is selected, and no partial prefix match is preferred (cf. also [Robots.txt] Matching user-agent names does not conform to robots.txt RFC #192)
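The conflict that this unit test guards against can be sketched as follows. This is a minimal, self-contained illustration and not the crawler-commons code: the class `RuleBlockSelection`, both method names, and the map of block names to rule strings are hypothetical, and the prefix matcher is deliberately naive.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: rule blocks are modeled as a map from the
// user-agent name of the block to an opaque rule string.
final class RuleBlockSelection {

    /** Selects the block whose name equals the robot name (case-insensitive), else the "*" block. */
    static String selectExact(Map<String, String> blocks, String robotName) {
        String wanted = robotName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : blocks.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).equals(wanted)) {
                return e.getValue();
            }
        }
        return blocks.get("*");
    }

    /** Naive prefix selection: the first block whose name is a prefix of the robot name wins. */
    static String selectFirstPrefix(Map<String, String> blocks, String robotName) {
        String wanted = robotName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : blocks.entrySet()) {
            if (wanted.startsWith(e.getKey().toLowerCase(Locale.ROOT))) {
                return e.getValue();
            }
        }
        return blocks.get("*");
    }
}
```

With blocks named "foo" and "foobot" (in that order), the naive prefix selection hands the "foo" rules to a robot called "foobot", whereas the exact selection picks the "foobot" block.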
Updated the PR. To summarize the issues around matching user-agent names:
Issues with the current implementation are caused by conflicts when robot names share a common prefix, e.g.
Decisions to resolve these issues:
Following the discussion in google/robotstxt#56, the behavior should be changed so that a robot name "foo" also matches all of:
Eventually, make the user-agent matching configurable and implement it in a method which can be overridden if a user is stuck with an agent name containing any characters except
- add unit test to verify that the rule with the completely matched user-agent name is selected, and no partial prefix match is preferred (cf. also crawler-commons#192)
- refactor agent name matching and move splitting robotNames string at comma into a separate method to be called once at the beginning of parsing the robots.txt file
- extend the robots parser API and add a method to pass agent names as a collection following the RFC 9309 with no splitting of the names into words/tokens.
- deprecate "old" method which splits the robot name into tokens and performs prefix matching
- by default user agent names are matched literally but case-insensitive following RFC 9309. Add method to "restore" the prefix matching: "setExactUserAgentMatching(false)"
- BaseRobotRulesParser: move the documented details about how user-agent names are matched into SimpleRobotRulesParser
- unit tests: add tests for issues described in crawler-commons#192, configure exact user-agent matching if required
- match user-agent product token at beginning of user-agent line/statement followed by ignored non-token characters, e.g. "foo" is matched in "User-agent: foo/1.2"
- match user-agent product tokens followed by ignored characters also in legacy prefix matching mode, e.g. match "butterfly" in "User-agent: Butterfly/1.0"
- refactor prefix matching: switch inner and outer loop, handle check for (common) wild-card user-agent outside of loop
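The product-token matching described in the commit messages above could look roughly like the sketch below. It is an illustration under stated assumptions, not the parser's implementation: the class `ProductTokenMatch` and its methods are hypothetical, and the set of token characters is taken from the pattern "[a-zA-Z_-]+" quoted elsewhere in this PR.

```java
// Hypothetical sketch of matching a robot name against the value of a
// "User-agent:" line, ignoring anything after the leading product token.
final class ProductTokenMatch {

    /** True for characters allowed in a product token, assuming the pattern [a-zA-Z_-]+. */
    static boolean isTokenChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '-';
    }

    /** Returns the longest leading run of token characters, possibly empty. */
    static String leadingProductToken(String value) {
        int end = 0;
        while (end < value.length() && isTokenChar(value.charAt(end))) {
            end++;
        }
        return value.substring(0, end);
    }

    /** Case-insensitive comparison of the robot name with the leading product token. */
    static boolean matches(String robotName, String userAgentValue) {
        String token = leadingProductToken(userAgentValue.trim());
        return !token.isEmpty() && token.equalsIgnoreCase(robotName);
    }
}
```

Under these assumptions, `matches("foo", "foo/1.2")` and `matches("butterfly", "Butterfly/1.0")` succeed, while `matches("foo", "foobot")` does not, because the whole token "foobot" is compared rather than a prefix of it.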
- make exact user-agent matching the default in unit tests, explicitly pass flag for legacy prefix user-agent matching in unit tests where needed
- names not following the ua pattern in the specification "[a-zA-Z_-]+"
- user-agent lines with multiple user-agent names
- make the method to handle prefix/partial user-agent product token matches protected, so that it can be overridden to match non-standard user-agent product tokens, e.g. "Go!zilla"
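To illustrate why a protected, overridable matching hook helps with names such as "Go!zilla", here is a small sketch. The classes `TokenMatcher` and `LenientTokenMatcher` and the hook `isTokenChar` are hypothetical stand-ins; the actual method name and signature in the parser are not shown in this conversation and may differ.

```java
// Hypothetical base matcher: only [a-zA-Z_-] counts as a product-token character.
class TokenMatcher {

    /** Overridable hook deciding which characters belong to a product token. */
    protected boolean isTokenChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '-';
    }

    /** Compares the robot name with the leading token of the user-agent value, ignoring case. */
    public boolean matches(String robotName, String userAgentValue) {
        int end = 0;
        while (end < userAgentValue.length() && isTokenChar(userAgentValue.charAt(end))) {
            end++;
        }
        return end > 0 && userAgentValue.substring(0, end).equalsIgnoreCase(robotName);
    }
}

// Hypothetical subclass that additionally accepts '!' so that "Go!zilla" is one token.
class LenientTokenMatcher extends TokenMatcher {
    @Override
    protected boolean isTokenChar(char c) {
        return c == '!' || super.isTokenChar(c);
    }
}
```

In this sketch the base matcher reads only "Go" from "Go!zilla/2.0" and therefore does not match the robot name "Go!zilla", while the subclass does.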
A "prefix" match up to the first non-product-token character is now implemented. I.e. the product token "foo" is matched in
LGTM @sebastian-nagel, nice one! Feel free to merge