Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI: Do not treat as URL if path exists in PathOrUrl parameter type #6360

Merged
merged 1 commit into from Apr 25, 2024

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Apr 20, 2024

Fixes #6313

The PathOrUrl parameter type extends click's Path type to add support for remote paths defined by a URL. The value is first converted by the parents implementation, and if that raises an exception the value is considered to be a URL. However, if the initial conversion fails not because the path is not a file on the local file system, but there was another error, for example the file is not readable, it does not make sense to try and treat it as a URL as that is most likely not what was intended by the user. This would lead to confusing error messages claiming that the path is not a valid URL whereas the real problem is that it is not readable.

The solution is to add a check that if the initial local path conversion fails, to only check the path as a URL if the path does not exist on the local file system. If it does, the original exception is simply reraised.

Copy link
Collaborator

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber thanks for taking this on! The change looks like a strict improvement, although I think further improvement is possible.

For example, when I make a typo in a filename, I will still get error about invalid url.

How about we first try to parse the input to see if it looks like an URL? If not, we assume user meant local path. If yes, then we can try to request it. WDYT?

Something like urllib.parse.urlparse might be good for this. https://stackoverflow.com/a/38020041

@@ -81,6 +82,8 @@ def convert(self, value, param, ctx):
try:
return super().convert(value, param, ctx)
except click.exceptions.BadParameter:
if pathlib.Path(value).exists():
raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's very nice to vomit the whole stacktrace to users terminal if they make a mistake. Is it possible to use click.echo.error or something like that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is caught and just printed as:

Usage: verdi archive import [OPTIONS] [--] [ARCHIVES]...
Try 'verdi archive import --help' for help.

Error: Invalid value for '[ARCHIVES]...': Path 'test.txt' is not readable.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 20, 2024

Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D

For example, when I make a typo in a filename, I will still get error about invalid url.

Fair enough, the fix I added is indeed a bit barebones.

How about we first try to parse the input to see if it looks like an URL? If not, we assume user meant local path. If yes, then we can try to request it. WDYT?
Something like urllib.parse.urlparse might be good for this. https://stackoverflow.com/a/38020041

I thought about inversing the priority but I thought it was going to be difficult to detect the intention. I am not sure if there are hard rules on what constitutes a valid URL so we could check for that. For example, the urllib.parse.urlparse pretty much swallows anything:

In [1]: from urllib.parse import urlparse

In [2]: urlparse('file.txt')
Out[2]: ParseResult(scheme='', netloc='', path='file.txt', params='', query='', fragment='')

In [3]: urlparse('test/file.txt')
Out[3]: ParseResult(scheme='', netloc='', path='test/file.txt', params='', query='', fragment='')

In [4]: urlparse('/path/test/file.txt')
Out[4]: ParseResult(scheme='', netloc='', path='/path/test/file.txt', params='', query='', fragment='')

So we would have to start checking if there is not scheme defined and make other assumptions, but I am not sure if that would reject some valid URLs.

Another possibility is to simply call urllib.request.urlopen(value) which will raise a ValueError if it doesn't think it is a valid URL:

In [1]: import urllib.request

In [2]: urllib.request.urlopen('/tmp/text.txt')

ValueError: unknown url type: '/tmp/text.txt'

If it is a valid URL, it would be

In [3]: urllib.request.urlopen('http://localhost/tmp/text.txt')

URLError: <urlopen error [Errno 111] Connection refused>

So we can catch those separately.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 20, 2024

@danielhollas had a stab at the alternate implementation. If this is better, we should probably also likely change the implementation of FileOrUrl which is essentially the same.

@sphuber sphuber force-pushed the fix/6313/cli-path-type-not-readable branch from 3199740 to d23a964 Compare April 20, 2024 20:45
import socket
import urllib.error
import urllib.request

try:
with urllib.request.urlopen(url, timeout=self.timeout_seconds):
with urllib.request.urlopen(value, timeout=self.timeout_seconds):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, are there any performance implications of calling urlopen first? How quickly does it fail for Paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my original reason for not changing the order of checking, but I quickly checked and it seems really negligible.

In [1]: import urllib.request

In [2]: def call_invalid():
   ...:     try:
   ...:         urllib.request.urlopen('some/path.txt')
   ...:     except ValueError:
   ...:         pass

In [3]: %timeit call_invalid
8.4 ns ± 0.231 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

As your investigation below showed, this is because it does a string match operation to check for a protocol which should be really fast and since filepaths will not match protocols, it will almost immediately error out and we move on to treating it as a local path.

Of course then the problem becomes the inverse. If the user meant a URL but just forgot the protocol or had a typo, then they would get an error about the file not existing. I guess no matter how you turn it, it is going to be difficult to guess the intention. Maybe the best we can do is be explicit in when we tried both options and that we couldn't make it work either as a path nor a URL.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an idea if one is likely than the other (path/URL), based where these options are used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameters is mostly used for the --config option. I think most users may have the config file locally, but support was added for URLs exactly because this makes sense for computer and code configurations which can be on servers, such as provided by the aiida-code-registry. I think I may revert the order to the original state and just improve the error message. If the local path validation fails, we still try the remote URL but if that fails we make sure the error message is clear to state that it was tried as local path and then URL and both failed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good plan. I think you can still special-case the message if the path actually exists but we failed to read from it, as in your original fix before I started making a fuzz. 😅

Copy link
Collaborator

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D

Hahaha, sorry. 👼 😊 TBH I am very happy to merge a quick fix for the specific issue in #6360 for 2.6 and leave the further improvements for another time since this indeed seems like a can of worms.

That said, not sure if I like the new approach of trying to open URL first, as I am not sure about possible performance implications.

urlparse indeed does not seem very helpful. I looked a little bit into the urlopen implementation. Specifically, trying to open an url without a scheme results in an error:

>>> from urllib.request import urlopen
>>> urlopen('google.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/request.py", line 499, in open
    req = Request(fullurl, data)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/request.py", line 318, in __init__
    self.full_url = url
    ^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/request.py", line 344, in full_url
    self._parse()
  File "/usr/lib64/python3.12/urllib/request.py", line 373, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)

Tracing where that error comes from:

    def _parse(self):
        self.type, rest = _splittype(self._full_url)
        if self.type is None:
            raise ValueError("unknown url type: %r" % self.full_url)

where _splittype is imported from urllib/parse.py

def _splittype(url):
    """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
    global _typeprog
    if _typeprog is None:
        _typeprog = re.compile('([^/:]+):(.*)', re.DOTALL)
    
    match = _typeprog.match(url)
    if match:
        scheme, data = match.groups()
        return scheme.lower(), data
    return None, url

So this would suggest that urlopen needs a scheme? In which case we could simply catch this specific ValueError and print an error that we got either wrong URL or path.

@sphuber sphuber force-pushed the fix/6313/cli-path-type-not-readable branch 2 times, most recently from 7588af5 to ec53d09 Compare April 24, 2024 09:58
@sphuber
Copy link
Contributor Author

sphuber commented Apr 24, 2024

Ok @danielhollas I reverted the original order of checking but then simply tried to improve the error-handling and messaging. Now if a value fails to be converted to either a local path or a URL, the error message will explicitly mention that it was tried as both. Hopefully this should be clear enough for the user. Also deduplicated and cleaned the code a bit.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 24, 2024

Wow, CI is taken a huge dump on everything:

 E: Failed to fetch https://packages.microsoft.com/ubuntu/22.04/prod/dists/jammy/InRelease  Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'https://packages.microsoft.com/ubuntu/22.04/prod jammy InRelease' is no longer signed.

@sphuber sphuber force-pushed the fix/6313/cli-path-type-not-readable branch 3 times, most recently from 312e708 to e532d25 Compare April 24, 2024 15:46
Copy link
Collaborator

@danielhollas danielhollas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D

Well done Mr. Huber, you are free to go merge. 👮 🚓

:param int timeout_seconds: Maximum timeout accepted for URL response.
Must be an integer in the range [0;60].
:param timeout_seconds: Maximum timeout accepted for URL response. Must be an integer in the range [0;60].
:returns: The path or URL.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if URL_TIMEOUT_SECONDS=10 is too generous?
If the user provides plausible URL with a typo (http://googl.com) will it fail earlier or will it block the CLI for 10s?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it will fail sooner with DNS error?

check_timeout_seconds(invalid)


def test_fail_non_existing_path():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the new tests!

The `PathOrUrl` and `FileOrUrl` parameter types extend click's `Path`
and `File` parameter types, respectively, by adding support for URLs.
If the conversion the base type fails, the provided value is tried as a
URL.

The problem was that if the provided value was intended to be a local
file or path, but conversion failed, the conversion to URL would almost
certainly also fail and that would be the returned error message. This
would be confusing for users that tried to specify a local path, but it
contained a typo or was not readable for other reasons, such as file
permissions.

When the base type conversion fails, now the parameter types first check
whether the provided file or path exists on the local disk. If so, it is
assumed that the value was intended to be a local path and an error is
returned that the path could not be read. Only then is the path tried as
a URL. Here the `URLError` and `socket.timeout` exceptions are caught
separately in which case we are almost certainly dealing with a URL
which just couldn't be reached (in time). The `urlopen` method can also
throw a `ValueError` which is usually when no protocol is defined. In
this case an error is returned that explicitly states that the provided
value does not correspond to a readable path nor to a reachable URL.
@sphuber sphuber force-pushed the fix/6313/cli-path-type-not-readable branch from e532d25 to 5e3b659 Compare April 25, 2024 07:45
@sphuber sphuber merged commit ffc6e4f into aiidateam:main Apr 25, 2024
19 checks passed
@sphuber sphuber deleted the fix/6313/cli-path-type-not-readable branch April 25, 2024 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve error handling of verdi archive import archive.aiida
2 participants