Disallow rule not working #209

Closed
kox-solid opened this issue Aug 8, 2024 · 9 comments

@kox-solid

robotspy==0.8.0

import robots

content = """
User-agent: mozilla/5
Disallow: /
"""

check_url = "https://example.com"
user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"

parser = robots.RobotsParser.from_string(content)

print(parser.can_fetch(user_agent, check_url))  # True  (expected False)
print(parser.is_agent_valid(user_agent))        # False (expected False)

This returns True, False instead of the expected False, False.

@andreburgaud
Owner

Thank you, @kox-solid, for raising this issue. Let me look into it as soon as possible.

@andreburgaud
Owner

andreburgaud commented Aug 25, 2024

@kox-solid, I analyzed your example. It correctly returns True and False, rather than the False, False you expected, for the following reasons:

  • In RFC 9309 (Robots Exclusion Protocol), the term user-agent is also called the product token and represents the crawler's name (for example, Googlebot). The User-agent line in your robots.txt is not valid because a product token only accepts uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-"), per the RFC https://www.rfc-editor.org/rfc/rfc9309#name-the-user-agent-line. mozilla/5 includes a slash (/), which is incompatible with the spec.
  • As a result of the point above, the library will discard the rule Disallow: /.
  • Also, the user_agent value passed as the first parameter to parser.can_fetch() is intended to be a crawler's name, to match one of the possible groups in the robots.txt. In your example, user_agent is a User-Agent HTTP request header. Although they are related, the function only expects a crawler's name.

So parser.is_agent_valid(user_agent) returns False because the User-Agent HTTP request header includes characters that are not valid for a product token, and parser.can_fetch(user_agent, check_url) returns True because 1) there is no valid group in the robots.txt (mozilla/5 is not a valid product token), and 2) the given user_agent has no match in the robots.txt; therefore, the disallow rule does not apply.
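To illustrate the character restriction, here is a minimal sketch of that strict reading of the product-token rule (this is not robotspy's actual implementation, and the wildcard "*" group defined by the RFC is left out):

import re

# Allowed characters for a product token under the strict reading of RFC 9309:
# letters, underscores, and hyphens only.
PRODUCT_TOKEN_RE = re.compile(r"^[A-Za-z_-]+$")

def is_valid_product_token(token: str) -> bool:
    """Return True if the token contains only the characters the RFC allows."""
    return bool(PRODUCT_TOKEN_RE.match(token))

print(is_valid_product_token("Googlebot"))  # True
print(is_valid_product_token("mozilla/5"))  # False: "/" is outside the allowed set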

The following example would have returned False, True:

import robots

content = """
User-agent: mozilla
Disallow: /
"""

check_url = "https://example.com"
user_agent = "Mozilla"

parser = robots.RobotsParser.from_string(content)

print(parser.can_fetch(user_agent, check_url))  # False (Disallow)
print(parser.is_agent_valid(user_agent))        # True (Mozilla is a valid user-agent)

I understand that the term user-agent, used both for the User-Agent HTTP request header and for the User-agent line in robots.txt, sounds confusing. The latest RFC https://www.rfc-editor.org/rfc/rfc9309 provides some clarity (for example, by using the term product token instead of user-agent), but in robots.txt, the line will remain User-agent.

Other libraries or tools like https://github.com/google/robotstxt or https://github.com/jimsmart/grobotstxt follow the same approach as robotspy and share similar validation tests. Nevertheless, I could have missed something, been more precise, or provided better error logging to guide the library's users. Let me know how I could improve it or whether my explanation offers some clarification.

Thank you again for raising your concern.

@kox-solid
Author

In RFC 9309 (Robots Exclusion Protocol), the term user-agent is also called the product token and represents the crawler's name (for example, Googlebot). The User-agent line in your robots.txt is not valid because a product token only accepts uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-"), per the RFC https://www.rfc-editor.org/rfc/rfc9309#name-the-user-agent-line. mozilla/5 includes a slash (/), which is incompatible with the spec.

This is a question of interpretation. The RFC says, "The product token MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-")". But it does not say that, when a forbidden character is encountered, no token should be created at all. Some parsers build the token up to the first invalid character, and in that case "mozilla/5" is cut down to "mozilla".
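As an illustration of that alternative reading (a sketch only, not taken from any particular parser), the token can be collected up to the first character outside the allowed set:

import string

def extract_product_token(value: str) -> str:
    """Collect letters, underscores, and hyphens until the first invalid character."""
    token = []
    for ch in value:
        if ch in string.ascii_letters or ch in "_-":
            token.append(ch)
        else:
            break  # stop at the first character outside the allowed set
    return "".join(token)

print(extract_product_token("mozilla/5"))  # "mozilla"
print(extract_product_token("Googlebot"))  # "Googlebot"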
If you rely on Google, look at the tests: https://github.com/google/robotstxt-spec-test. For example, in https://github.com/google/robotstxt-spec-test/blob/master/src/main/resources/CTC/stress/369883.textproto, the last 2 tests do not pass with your parser.

So parser.is_agent_valid(user_agent) returns False because the User-Agent HTTP request header includes characters not valid for a product token ...

Why? The RFC says, "the product token SHOULD be a substring in the User-Agent header". Where is the restriction on characters in the User-Agent header?
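For reference, the relationship the RFC describes can be sketched as a plain case-insensitive substring check between a product token and a crawler's identification string (illustrative only; the User-Agent value below is just an example):

def token_in_identification(product_token: str, user_agent_header: str) -> bool:
    """Check whether a product token appears as a substring of a User-Agent header."""
    return product_token.lower() in user_agent_header.lower()

header = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(token_in_identification("Googlebot", header))  # True
print(token_in_identification("Bingbot", header))    # False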

@andreburgaud
Owner

Thank you, @kox-solid, for pointing to the 369883 tests. I initially did not include these tests in robotspy, so this will allow me to dig deeper. I tested it with the Google C++ version, which behaves as you stated. The Go version behaves like robotspy. Again, I will investigate further, and I appreciate your insight.

You are correct regarding the last point related to the User-Agent header; I did not question that. I meant that you passed the User-Agent header to an internal function of robotspy intended to parse a crawler name, not a User-Agent header. Indeed, there is no restriction on characters in the User-Agent header, and the function is_agent_valid was not intended to parse a User-Agent header. As you pointed out, it returned False, as expected.
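To make the distinction concrete, here is a small sketch based on the behavior reported earlier in this thread (the comments reflect the results observed above, not guaranteed output):

import robots

parser = robots.RobotsParser.from_string("User-agent: mozilla\nDisallow: /\n")

crawler_name = "Mozilla"  # a product token / crawler name
full_header = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"

print(parser.is_agent_valid(crawler_name))  # True: a plain crawler name
print(parser.is_agent_valid(full_header))   # False: a full User-Agent header is not a product token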

Out of curiosity, was there a particular scenario you faced where you needed a robots.txt parser? My starting point with robotspy was to fix a bug in the robots parser of the Python standard library, but I'm curious about concrete use cases I could leverage in my tests.

@andreburgaud
Owner

Hi @kox-solid, thanks again for raising this issue. As you suggested, I updated the parser to behave like Google's robots parser. The two cases you pointed out in 369883.textproto now behave like Google's parser. The new 0.9.0 version is available at https://pypi.org/project/robotspy/.

Your code example will work if you set user_agent to Mozilla or other case-insensitive variants. It behaves like https://github.com/google/robotstxt or https://github.com/jimsmart/grobotstxt.
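For reference, here is your original example with that change; the expected results below follow from the new behavior described above (robotspy >= 0.9.0):

import robots

content = """
User-agent: mozilla/5
Disallow: /
"""

check_url = "https://example.com"
user_agent = "Mozilla"  # a crawler name instead of a full User-Agent header

parser = robots.RobotsParser.from_string(content)

# With the 0.9.0 behavior, the "mozilla/5" line is handled like Google's parser,
# so the group matches "Mozilla" and the Disallow rule applies.
print(parser.can_fetch(user_agent, check_url))  # False (disallowed)
print(parser.is_agent_valid(user_agent))        # True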

Thank you again for pointing out this anomaly.

@andreburgaud andreburgaud self-assigned this Aug 25, 2024
@kox-solid
Author

Hello @andreburgaud, thanks for your work and quick reply. I tested robotspy==0.9.0 against all of the Google "stress" tests and got the following results.
It did not pass these tests:
https://github.com/google/robotstxt-spec-test/blob/master/src/main/resources/CTC/stress/541230.textproto (4, 5, 8, 12, 14)
https://github.com/google/robotstxt-spec-test/blob/master/src/main/resources/CTC/stress/308278.textproto (6, 8, 10)
Of course, these issues are not related to this topic; they are just for your information.

I need a robots.txt parser for an SEO audit crawler that will interpret robots.txt the same as Google, or close to it. At the moment, there is no such parser for Python; all of them have certain flaws.

@andreburgaud
Owner

Hi @kox-solid, thank you for pointing to the failing tests and sharing your motivation for finding a robots.txt parser. This is super helpful. I can't promise robotspy will meet your expectations, but I welcome any suggestions. I will integrate the Google stress tests into the robotspy pytest suite to understand and possibly bridge the gaps. I value your contribution and interest in robotspy 🙌

@andreburgaud
Owner

@kox-solid, I just released robotspy version 0.10.0 with fixes for the bugs you pointed out (Google stress tests 541230 and 308278, in particular). I will continue to integrate the other Google tests, and I'm sure I will discover more bugs. Thank you for your contribution! I will close this issue but don't hesitate to ping me with any other concerns or suggestions.

@andreburgaud
Owner

I'm closing this according to my previous comments.
