
[Robots.txt] Deduplicate robots rules before matching #416

Merged

Conversation

sebastian-nagel
Contributor

This PR deduplicates the (dis)allow rules during sorting, before matching. This also covers duplicates resulting from equivalent statements with and without percent-encoding, which are now normalized to a canonical form, cf. #389.

The Javadocs of SimpleRobotRules are also updated with references to RFC 9309.
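A minimal sketch of the idea: sort the rules (longest path prefix first, matching RFC 9309 precedence), then drop duplicates that end up adjacent after sorting. The `Rule` class and method names here are hypothetical illustrations, not the actual crawler-commons API, and the percent-encoding normalization from #389 is assumed to have happened before this step.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RuleDedup {

    // Hypothetical stand-in for a parsed robots.txt rule; crawler-commons
    // uses its own internal representation.
    record Rule(String prefix, boolean allow) {}

    // Sort by prefix length (longest first, per RFC 9309 precedence),
    // breaking ties deterministically so equal rules become adjacent,
    // then skip any rule identical to its predecessor.
    static List<Rule> sortAndDedup(List<Rule> rules) {
        List<Rule> sorted = new ArrayList<>(rules);
        sorted.sort(Comparator.comparingInt((Rule r) -> r.prefix().length())
                .reversed()
                .thenComparing(Rule::prefix)
                .thenComparing(Rule::allow));
        List<Rule> result = new ArrayList<>();
        Rule prev = null;
        for (Rule r : sorted) {
            if (!r.equals(prev)) {
                result.add(r);
            }
            prev = r;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule("/private", false),
                new Rule("/private", false), // duplicate statement
                new Rule("/", true));
        // Two distinct rules remain after deduplication.
        System.out.println(sortAndDedup(rules).size());
    }
}
```

Deduplicating after sorting keeps the pass linear over the sorted list, since any duplicates are guaranteed to be neighbors.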

@sebastian-nagel sebastian-nagel added this to the 1.4 milestone Jun 8, 2023
@jnioche jnioche merged commit 6c0d91e into crawler-commons:master Jun 9, 2023
2 of 3 checks passed
@jnioche
Contributor

jnioche commented Jun 9, 2023

Thanks @sebastian-nagel

sebastian-nagel added a commit that referenced this pull request Jun 13, 2023
2 participants