New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLI: Do not treat as URL if path exists in PathOrUrl
parameter type
#6360
CLI: Do not treat as URL if path exists in PathOrUrl
parameter type
#6360
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber thanks for taking this on! The change looks like a strict improvement, although I think further improvement is possible.
For example, when I make a typo in a filename, I will still get error about invalid url.
How about we first try to parse the input to see if it looks like an URL? If not, we assume user meant local path. If yes, then we can try to request it. WDYT?
Something like urllib.parse.urlparse might be good for this. https://stackoverflow.com/a/38020041
@@ -81,6 +82,8 @@ def convert(self, value, param, ctx): | |||
try: | |||
return super().convert(value, param, ctx) | |||
except click.exceptions.BadParameter: | |||
if pathlib.Path(value).exists(): | |||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's very nice to vomit the whole stacktrace to users terminal if they make a mistake. Is it possible to use click.echo.error or something like that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is caught and just printed as:
Usage: verdi archive import [OPTIONS] [--] [ARCHIVES]...
Try 'verdi archive import --help' for help.
Error: Invalid value for '[ARCHIVES]...': Path 'test.txt' is not readable.
Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D
Fair enough, the fix I added is indeed a bit barebones.
I thought about inversing the priority but I thought it was going to be difficult to detect the intention. I am not sure if there are hard rules on what constitutes a valid URL so we could check for that. For example, the
So we would have to start checking if there is not scheme defined and make other assumptions, but I am not sure if that would reject some valid URLs. Another possibility is to simply call In [1]: import urllib.request
In [2]: urllib.request.urlopen('/tmp/text.txt')
ValueError: unknown url type: '/tmp/text.txt' If it is a valid URL, it would be In [3]: urllib.request.urlopen('http://localhost/tmp/text.txt')
URLError: <urlopen error [Errno 111] Connection refused> So we can catch those separately. |
@danielhollas had a stab at the alternate implementation. If this is better, we should probably also likely change the implementation of |
3199740
to
d23a964
Compare
import socket | ||
import urllib.error | ||
import urllib.request | ||
|
||
try: | ||
with urllib.request.urlopen(url, timeout=self.timeout_seconds): | ||
with urllib.request.urlopen(value, timeout=self.timeout_seconds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, are there any performance implications of calling urlopen first? How quickly does it fail for Paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my original reason for not changing the order of checking, but I quickly checked and it seems really negligible.
In [1]: import urllib.request
In [2]: def call_invalid():
...: try:
...: urllib.request.urlopen('some/path.txt')
...: except ValueError:
...: pass
In [3]: %timeit call_invalid
8.4 ns ± 0.231 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
As your investigation below showed, this is because it does a string match operation to check for a protocol which should be really fast and since filepaths will not match protocols, it will almost immediately error out and we move on to treating it as a local path.
Of course then the problem becomes the inverse. If the user meant a URL but just forgot the protocol or had a typo, then they would get an error about the file not existing. I guess no matter how you turn it, it is going to be difficult to guess the intention. Maybe the best we can do is be explicit in when we tried both options and that we couldn't make it work either as a path nor a URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an idea if one is likely than the other (path/URL), based where these options are used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameters is mostly used for the --config
option. I think most users may have the config file locally, but support was added for URLs exactly because this makes sense for computer and code configurations which can be on servers, such as provided by the aiida-code-registry
. I think I may revert the order to the original state and just improve the error message. If the local path validation fails, we still try the remote URL but if that fails we make sure the error message is clear to state that it was tried as local path and then URL and both failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a good plan. I think you can still special-case the message if the path actually exists but we failed to read from it, as in your original fix before I started making a fuzz. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D
Hahaha, sorry. 👼 😊 TBH I am very happy to merge a quick fix for the specific issue in #6360 for 2.6 and leave the further improvements for another time since this indeed seems like a can of worms.
That said, not sure if I like the new approach of trying to open URL first, as I am not sure about possible performance implications.
urlparse
indeed does not seem very helpful. I looked a little bit into the urlopen
implementation. Specifically, trying to open an url without a scheme results in an error:
>>> from urllib.request import urlopen
>>> urlopen('google.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/urllib/request.py", line 499, in open
req = Request(fullurl, data)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/urllib/request.py", line 318, in __init__
self.full_url = url
^^^^^^^^^^^^^
File "/usr/lib64/python3.12/urllib/request.py", line 344, in full_url
self._parse()
File "/usr/lib64/python3.12/urllib/request.py", line 373, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
Tracing where that error comes from:
def _parse(self):
self.type, rest = _splittype(self._full_url)
if self.type is None:
raise ValueError("unknown url type: %r" % self.full_url)
where _splittype
is imported from urllib/parse.py
def _splittype(url):
"""splittype('type:opaquestring') --> 'type', 'opaquestring'."""
global _typeprog
if _typeprog is None:
_typeprog = re.compile('([^/:]+):(.*)', re.DOTALL)
match = _typeprog.match(url)
if match:
scheme, data = match.groups()
return scheme.lower(), data
return None, url
So this would suggest that urlopen needs a scheme? In which case we could simply catch this specific ValueError and print an error that we got either wrong URL or path.
7588af5
to
ec53d09
Compare
Ok @danielhollas I reverted the original order of checking but then simply tried to improve the error-handling and messaging. Now if a value fails to be converted to either a local path or a URL, the error message will explicitly mention that it was tried as both. Hopefully this should be clear enough for the user. Also deduplicated and cleaned the code a bit. |
Wow, CI is taken a huge dump on everything:
|
312e708
to
e532d25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D
Well done Mr. Huber, you are free to go merge. 👮 🚓
:param int timeout_seconds: Maximum timeout accepted for URL response. | ||
Must be an integer in the range [0;60]. | ||
:param timeout_seconds: Maximum timeout accepted for URL response. Must be an integer in the range [0;60]. | ||
:returns: The path or URL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if URL_TIMEOUT_SECONDS=10
is too generous?
If the user provides plausible URL with a typo (http://googl.com) will it fail earlier or will it block the CLI for 10s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it will fail sooner with DNS error?
check_timeout_seconds(invalid) | ||
|
||
|
||
def test_fail_non_existing_path(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the new tests!
The `PathOrUrl` and `FileOrUrl` parameter types extend click's `Path` and `File` parameter types, respectively, by adding support for URLs. If the conversion the base type fails, the provided value is tried as a URL. The problem was that if the provided value was intended to be a local file or path, but conversion failed, the conversion to URL would almost certainly also fail and that would be the returned error message. This would be confusing for users that tried to specify a local path, but it contained a typo or was not readable for other reasons, such as file permissions. When the base type conversion fails, now the parameter types first check whether the provided file or path exists on the local disk. If so, it is assumed that the value was intended to be a local path and an error is returned that the path could not be read. Only then is the path tried as a URL. Here the `URLError` and `socket.timeout` exceptions are caught separately in which case we are almost certainly dealing with a URL which just couldn't be reached (in time). The `urlopen` method can also throw a `ValueError` which is usually when no protocol is defined. In this case an error is returned that explicitly states that the provided value does not correspond to a readable path nor to a reachable URL.
e532d25
to
5e3b659
Compare
Fixes #6313
The
PathOrUrl
parameter type extends click'sPath
type to add support for remote paths defined by a URL. The value is first converted by the parents implementation, and if that raises an exception the value is considered to be a URL. However, if the initial conversion fails not because the path is not a file on the local file system, but there was another error, for example the file is not readable, it does not make sense to try and treat it as a URL as that is most likely not what was intended by the user. This would lead to confusing error messages claiming that the path is not a valid URL whereas the real problem is that it is not readable.The solution is to add a check that if the initial local path conversion fails, to only check the path as a URL if the path does not exist on the local file system. If it does, the original exception is simply reraised.