Section 2.2.2 of RFC 9309 requires that paths in allow/disallow directives be percent-encoded before comparison:
Octets in the URI and robots.txt paths outside the range of the ASCII coded character set, and those in the reserved range defined by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to comparison.
... under the requirement that robots.txt files are UTF-8 (section 2.3).
Currently, SimpleRobotRulesParser reads the robots.txt file as ASCII: Java then replaces bytes outside the ASCII range with � (U+FFFD). In addition, the parser does not percent-encode the paths in allow/disallow directives before matching.
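A minimal sketch of the required normalization (not the actual crawler-commons code, and covering only the non-ASCII case, not the reserved range defined by RFC 3986): percent-encode the UTF-8 octets of a path so that raw Unicode paths and already-encoded paths compare equal.

```java
import java.nio.charset.StandardCharsets;

public class PathEncoder {
    /** Percent-encode octets outside the ASCII range, as RFC 9309
     *  Sec. 2.2.2 requires prior to comparison. Hypothetical helper,
     *  for illustration only. */
    public static String percentEncode(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c > 0x7F) {
                // octet outside the ASCII range: emit %XX
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("/foo/bar/ツ")); // /foo/bar/%E3%83%84
    }
}
```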
Unicode characters, fixes crawler-commons#389
- use UTF-8 as default input encoding of robots.txt files
- add unit test
- test matching of Unicode paths in allow/disallow directives
- test for proper matching of ASCII paths if encoding is not UTF-8 (and no byte order mark present)
The reference parser includes a unit test for the example path /foo/bar/ツ and its percent-encoded form /foo/bar/%E3%83%84.
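The equivalence that unit test checks can be sketched as follows (the `normalize` helper is hypothetical, not the crawler-commons API): once both sides are percent-encoded, a rule written either way matches the same URL paths by prefix comparison.

```java
import java.nio.charset.StandardCharsets;

public class MatchSketch {
    /** Hypothetical helper: percent-encode octets outside the ASCII
     *  range so the raw Unicode spelling and the pre-encoded spelling
     *  of a path become identical strings. */
    static String normalize(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c > 0x7F) {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A URL path matches a disallow rule by prefix comparison after
        // normalization, whether the rule was written raw or encoded.
        String urlPath = normalize("/foo/bar/ツ/page.html");
        boolean disallowed = urlPath.startsWith(normalize("/foo/bar/%E3%83%84"));
        System.out.println(disallowed); // true
    }
}
```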