[Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters #401

sebastian-nagel

fixes #389

  • use UTF-8 as default input encoding of robots.txt files
  • add unit test
    • test matching of Unicode paths in allow/disallow directives
    • test for proper matching of ASCII paths if encoding is not UTF-8 (and no byte order mark present)
  • add link to RFC 9309 to Javadoc class description
  • fix line wrapping in comments
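
For reviewers who want to see the intended behavior in isolation, here is a minimal, self-contained sketch. It is not the parser code from this PR: the class `RobotsUnicodeSketch` and its methods are hypothetical names made up for illustration. It assumes the robots.txt bytes are decoded as UTF-8 (skipping a UTF-8 byte order mark if one is present) and that non-ASCII characters are percent-encoded before a simple prefix comparison, which is the normalization RFC 9309 asks for. Wildcards, `$` anchors, and case normalization of existing percent-escapes are deliberately left out.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical and not part of crawler-commons.
public class RobotsUnicodeSketch {

    private static final String HEX = "0123456789ABCDEF";

    /** Decode robots.txt content as UTF-8, skipping a leading UTF-8 byte order mark. */
    public static String decodeUtf8(byte[] content) {
        int offset = 0;
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            offset = 3; // skip the BOM bytes EF BB BF
        }
        return new String(content, offset, content.length - offset, StandardCharsets.UTF_8);
    }

    /** Percent-encode the UTF-8 bytes of every non-ASCII character; ASCII is left as-is. */
    public static String escapeNonAscii(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                sb.append((char) v);
            } else {
                sb.append('%').append(HEX.charAt(v >> 4)).append(HEX.charAt(v & 0x0F));
            }
        }
        return sb.toString();
    }

    /** Simplified prefix match of a URL path against a disallow rule path. */
    public static boolean disallows(String rulePath, String urlPath) {
        return escapeNonAscii(urlPath).startsWith(escapeNonAscii(rulePath));
    }

    public static void main(String[] args) {
        // The rule contains an unescaped "ü" (written as \u00fc to keep this source ASCII-only).
        byte[] robotsTxt = "Disallow: /b\u00fccher/\n".getBytes(StandardCharsets.UTF_8);
        String line = decodeUtf8(robotsTxt).trim();
        String rulePath = line.substring("Disallow:".length()).trim();

        // Both the escaped and the unescaped form of the URL path should match the rule.
        System.out.println(disallows(rulePath, "/b%C3%BCcher/liste.html"));
        System.out.println(disallows(rulePath, "/b\u00fccher/liste.html"));
        System.out.println(disallows(rulePath, "/buecher/liste.html"));
    }
}
```

The point of escaping both sides is that a rule written with raw Unicode and a URL that arrives already percent-encoded end up in the same form before they are compared.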

sebastian-nagel added this to the 1.4 milestone on Apr 24, 2023
rzo1 self-requested a review on April 24, 2023 16:00
@aecio aecio left a comment

LGTM

sebastian-nagel merged commit 79bef97 into crawler-commons:master on May 11, 2023
2 checks passed
@sebastian-nagel

Thanks @rzo1 and @aecio for the reviews!

sebastian-nagel deleted the cc-389-allow-disallow-unicode-paths branch on June 16, 2023 15:38