[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters #389

Closed
sebastian-nagel opened this issue Oct 20, 2022 · 0 comments · Fixed by #401

@sebastian-nagel
Contributor

Section 2.2.2 of RFC 9309 requires that paths in allow/disallow directives be percent-encoded before comparison:

Octets in the URI and robots.txt paths outside the range of the ASCII coded character set, and those in the reserved range defined by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to comparison.

... under the requirement that robots.txt files are UTF-8 encoded (Section 2.3).

The reference parser includes a unit test for the example path /foo/bar/ツ, i.e. /foo/bar/%E3%83%84 in its percent-encoded form.
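
A minimal sketch (not the reference parser's actual code) of how octets outside the ASCII range can be percent-encoded prior to comparison, turning /foo/bar/ツ into /foo/bar/%E3%83%84:

```java
import java.nio.charset.StandardCharsets;

public class PathEncoder {

    // Percent-encode every octet of the path's UTF-8 representation that
    // lies outside the ASCII range. Reserved ASCII characters and already
    // existing %-escapes are left untouched in this simplified sketch.
    static String percentEncodeNonAscii(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet >= 0x80) {
                sb.append(String.format("%%%02X", octet));
            } else {
                sb.append((char) octet);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeNonAscii("/foo/bar/ツ")); // /foo/bar/%E3%83%84
    }
}
```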

Currently, SimpleRobotRulesParser reads the robots.txt file as ASCII: Java's ASCII decoder replaces every byte outside the ASCII range with the replacement character � (U+FFFD). The parser also does not percent-encode the paths in the allow/disallow directives before matching.
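
A small demonstration of the decoding difference (assuming the robots.txt bytes are UTF-8, as the RFC requires):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        byte[] robotsTxt = "Disallow: /foo/bar/ツ\n".getBytes(StandardCharsets.UTF_8);

        // Decoding as US-ASCII maps each of the three UTF-8 bytes of ツ to U+FFFD:
        System.out.println(new String(robotsTxt, StandardCharsets.US_ASCII));
        // Disallow: /foo/bar/���

        // Decoding as UTF-8 preserves the path as written in the robots.txt:
        System.out.println(new String(robotsTxt, StandardCharsets.UTF_8));
        // Disallow: /foo/bar/ツ
    }
}
```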

@sebastian-nagel sebastian-nagel added this to the 1.4 milestone Oct 20, 2022
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Apr 24, 2023
[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters, fixes crawler-commons#389
- use UTF-8 as default input encoding of robots.txt files
- add unit test
  - test matching of Unicode paths in allow/disallow directives
  - test for proper matching of ASCII paths if encoding is not
    UTF-8 (and no byte order mark present)
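
A hedged sketch of what such a unit test could look like, assuming the parseContent(url, content, contentType, robotNames) and isAllowed(url) methods of SimpleRobotRulesParser/BaseRobotRules; the actual test added by the commit may differ:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class UnicodeDisallowSketch {
    public static void main(String[] args) {
        byte[] robotsTxt = ("User-agent: *\n"
                + "Disallow: /foo/bar/ツ\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent("https://example.com/robots.txt",
                robotsTxt, "text/plain", "mybot");

        // With UTF-8 decoding and percent-encoded matching, both forms
        // of the path should be disallowed (expected output: false, false).
        System.out.println(rules.isAllowed("https://example.com/foo/bar/%E3%83%84"));
        System.out.println(rules.isAllowed("https://example.com/foo/bar/ツ"));
    }
}
```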