[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters #389

Closed
sebastian-nagel opened this issue Oct 20, 2022 · 0 comments · Fixed by #401

@sebastian-nagel
Contributor

Section 2.2.2 of RFC 9309 requires that paths in allow/disallow directives be percent-encoded before comparison:

Octets in the URI and robots.txt paths outside the range of the ASCII coded character set, and those in the reserved range defined by [RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to comparison.

... under the requirement that robots.txt files are UTF-8 encoded (Section 2.3).

The reference parser includes a unit test for the example path /foo/bar/ツ, i.e. /foo/bar/%E3%83%84 in its percent-encoded form.
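
A minimal sketch (not the reference parser's actual code) of how octets outside the ASCII range can be percent-encoded prior to comparison, turning /foo/bar/ツ into /foo/bar/%E3%83%84:

```java
import java.nio.charset.StandardCharsets;

public class PathEncoder {

    // Percent-encode every octet of the path's UTF-8 representation that
    // lies outside the ASCII range. Reserved ASCII characters and already
    // existing %-escapes are left untouched in this simplified sketch.
    static String percentEncodeNonAscii(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet >= 0x80) {
                sb.append(String.format("%%%02X", octet));
            } else {
                sb.append((char) octet);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeNonAscii("/foo/bar/ツ")); // /foo/bar/%E3%83%84
    }
}
```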

Currently, SimpleRobotRulesParser reads the robots.txt file as ASCII: Java's ASCII decoder replaces every byte outside the ASCII range with the replacement character � (U+FFFD). The parser also does not percent-encode the paths in the allow/disallow directives before matching.
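
A small demonstration of the decoding difference (assuming the robots.txt bytes are UTF-8, as the RFC requires):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        byte[] robotsTxt = "Disallow: /foo/bar/ツ\n".getBytes(StandardCharsets.UTF_8);

        // Decoding as US-ASCII maps each of the three UTF-8 bytes of ツ to U+FFFD:
        System.out.println(new String(robotsTxt, StandardCharsets.US_ASCII));
        // Disallow: /foo/bar/���

        // Decoding as UTF-8 preserves the path as written in the robots.txt:
        System.out.println(new String(robotsTxt, StandardCharsets.UTF_8));
        // Disallow: /foo/bar/ツ
    }
}
```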

@sebastian-nagel sebastian-nagel added this to the 1.4 milestone Oct 20, 2022
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this issue Apr 24, 2023
[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters, fixes crawler-commons#389
- use UTF-8 as default input encoding of robots.txt files
- add unit test
  - test matching of Unicode paths in allow/disallow directives
  - test for proper matching of ASCII paths if encoding is not
    UTF-8 (and no byte order mark present)
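
A hedged sketch of what such a unit test could look like, assuming the parseContent(url, content, contentType, robotNames) and isAllowed(url) methods of SimpleRobotRulesParser/BaseRobotRules; the actual test added by the commit may differ:

```java
import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class UnicodeDisallowSketch {
    public static void main(String[] args) {
        byte[] robotsTxt = ("User-agent: *\n"
                + "Disallow: /foo/bar/ツ\n").getBytes(StandardCharsets.UTF_8);

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent("https://example.com/robots.txt",
                robotsTxt, "text/plain", "mybot");

        // With UTF-8 decoding and percent-encoded matching, both forms
        // of the path should be disallowed (expected output: false, false).
        System.out.println(rules.isAllowed("https://example.com/foo/bar/%E3%83%84"));
        System.out.println(rules.isAllowed("https://example.com/foo/bar/ツ"));
    }
}
```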