CLI: Do not treat as URL if path exists in `PathOrUrl` parameter type #6360

sphuber · 2024-04-20T14:01:27Z

The PathOrUrl parameter type extends click's Path type to add support for remote paths defined by a URL. The value is first converted by the parents implementation, and if that raises an exception the value is considered to be a URL. However, if the initial conversion fails not because the path is not a file on the local file system, but there was another error, for example the file is not readable, it does not make sense to try and treat it as a URL as that is most likely not what was intended by the user. This would lead to confusing error messages claiming that the path is not a valid URL whereas the real problem is that it is not readable.

The solution is to add a check that if the initial local path conversion fails, to only check the path as a URL if the path does not exist on the local file system. If it does, the original exception is simply reraised.

danielhollas

@sphuber thanks for taking this on! The change looks like a strict improvement, although I think further improvement is possible.

For example, when I make a typo in a filename, I will still get error about invalid url.

How about we first try to parse the input to see if it looks like an URL? If not, we assume user meant local path. If yes, then we can try to request it. WDYT?

Something like urllib.parse.urlparse might be good for this. https://stackoverflow.com/a/38020041

danielhollas · 2024-04-20T16:19:22Z

src/aiida/cmdline/params/types/path.py

@@ -81,6 +82,8 @@ def convert(self, value, param, ctx):
        try:
            return super().convert(value, param, ctx)
        except click.exceptions.BadParameter:
+            if pathlib.Path(value).exists():
+                raise


I don't think it's very nice to vomit the whole stacktrace to users terminal if they make a mistake. Is it possible to use click.echo.error or something like that here?

This is caught and just printed as:

Usage: verdi archive import [OPTIONS] [--] [ARCHIVES]... Try 'verdi archive import --help' for help. Error: Invalid value for '[ARCHIVES]...': Path 'test.txt' is not readable.

sphuber · 2024-04-20T17:35:15Z

Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D

For example, when I make a typo in a filename, I will still get error about invalid url.

Fair enough, the fix I added is indeed a bit barebones.

How about we first try to parse the input to see if it looks like an URL? If not, we assume user meant local path. If yes, then we can try to request it. WDYT?
Something like urllib.parse.urlparse might be good for this. https://stackoverflow.com/a/38020041

I thought about inversing the priority but I thought it was going to be difficult to detect the intention. I am not sure if there are hard rules on what constitutes a valid URL so we could check for that. For example, the urllib.parse.urlparse pretty much swallows anything:

In [1]: from urllib.parse import urlparse

In [2]: urlparse('file.txt')
Out[2]: ParseResult(scheme='', netloc='', path='file.txt', params='', query='', fragment='')

In [3]: urlparse('test/file.txt')
Out[3]: ParseResult(scheme='', netloc='', path='test/file.txt', params='', query='', fragment='')

In [4]: urlparse('/path/test/file.txt')
Out[4]: ParseResult(scheme='', netloc='', path='/path/test/file.txt', params='', query='', fragment='')

So we would have to start checking if there is not scheme defined and make other assumptions, but I am not sure if that would reject some valid URLs.

Another possibility is to simply call urllib.request.urlopen(value) which will raise a ValueError if it doesn't think it is a valid URL:

In [1]: import urllib.request

In [2]: urllib.request.urlopen('/tmp/text.txt')

ValueError: unknown url type: '/tmp/text.txt'

If it is a valid URL, it would be

In [3]: urllib.request.urlopen('http://localhost/tmp/text.txt')

URLError: <urlopen error [Errno 111] Connection refused>

So we can catch those separately.

sphuber · 2024-04-20T18:17:36Z

@danielhollas had a stab at the alternate implementation. If this is better, we should probably also likely change the implementation of FileOrUrl which is essentially the same.

danielhollas · 2024-04-22T14:54:36Z

src/aiida/cmdline/params/types/path.py

        import socket
        import urllib.error
        import urllib.request

        try:
-            with urllib.request.urlopen(url, timeout=self.timeout_seconds):
+            with urllib.request.urlopen(value, timeout=self.timeout_seconds):


Just wondering, are there any performance implications of calling urlopen first? How quickly does it fail for Paths?

This was my original reason for not changing the order of checking, but I quickly checked and it seems really negligible.

In [1]: import urllib.request In [2]: def call_invalid(): ...: try: ...: urllib.request.urlopen('some/path.txt') ...: except ValueError: ...: pass In [3]: %timeit call_invalid 8.4 ns ± 0.231 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

As your investigation below showed, this is because it does a string match operation to check for a protocol which should be really fast and since filepaths will not match protocols, it will almost immediately error out and we move on to treating it as a local path.

Of course then the problem becomes the inverse. If the user meant a URL but just forgot the protocol or had a typo, then they would get an error about the file not existing. I guess no matter how you turn it, it is going to be difficult to guess the intention. Maybe the best we can do is be explicit in when we tried both options and that we couldn't make it work either as a path nor a URL.

Do we have an idea if one is likely than the other (path/URL), based where these options are used?

This parameters is mostly used for the --config option. I think most users may have the config file locally, but support was added for URLs exactly because this makes sense for computer and code configurations which can be on servers, such as provided by the aiida-code-registry. I think I may revert the order to the original state and just improve the error message. If the local path validation fails, we still try the remote URL but if that fails we make sure the error message is clear to state that it was tried as local path and then URL and both failed.

That sounds like a good plan. I think you can still special-case the message if the path actually exists but we failed to read from it, as in your original fix before I started making a fuzz. 😅

danielhollas

Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D

Hahaha, sorry. 👼 😊 TBH I am very happy to merge a quick fix for the specific issue in #6360 for 2.6 and leave the further improvements for another time since this indeed seems like a can of worms.

That said, not sure if I like the new approach of trying to open URL first, as I am not sure about possible performance implications.

urlparse indeed does not seem very helpful. I looked a little bit into the urlopen implementation. Specifically, trying to open an url without a scheme results in an error:

>>> from urllib.request import urlopen
>>> urlopen('google.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/request.py", line 499, in open
    req = Request(fullurl, data)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/request.py", line 318, in __init__
    self.full_url = url
    ^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/urllib/request.py", line 344, in full_url
    self._parse()
  File "/usr/lib64/python3.12/urllib/request.py", line 373, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)

Tracing where that error comes from:

    def _parse(self):
        self.type, rest = _splittype(self._full_url)
        if self.type is None:
            raise ValueError("unknown url type: %r" % self.full_url)

where _splittype is imported from urllib/parse.py

def _splittype(url):
    """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
    global _typeprog
    if _typeprog is None:
        _typeprog = re.compile('([^/:]+):(.*)', re.DOTALL)
    
    match = _typeprog.match(url)
    if match:
        scheme, data = match.groups()
        return scheme.lower(), data
    return None, url

So this would suggest that urlopen needs a scheme? In which case we could simply catch this specific ValueError and print an error that we got either wrong URL or path.

sphuber · 2024-04-24T10:00:30Z

Ok @danielhollas I reverted the original order of checking but then simply tried to improve the error-handling and messaging. Now if a value fails to be converted to either a local path or a URL, the error message will explicitly mention that it was tried as both. Hopefully this should be clear enough for the user. Also deduplicated and cleaned the code a bit.

sphuber · 2024-04-24T10:08:02Z

Wow, CI is taken a huge dump on everything:

 E: Failed to fetch https://packages.microsoft.com/ubuntu/22.04/prod/dists/jammy/InRelease  Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'https://packages.microsoft.com/ubuntu/22.04/prod jammy InRelease' is no longer signed.

danielhollas

Damn, here I was just trying to get away with the bare minimum and I get the quality police on my case :D

Well done Mr. Huber, you are free to go merge. 👮 🚓

danielhollas · 2024-04-24T17:51:34Z

src/aiida/cmdline/params/types/path.py

-    :param int timeout_seconds: Maximum timeout accepted for URL response.
-        Must be an integer in the range [0;60].
+    :param timeout_seconds: Maximum timeout accepted for URL response. Must be an integer in the range [0;60].
+    :returns: The path or URL.


Just wondering if URL_TIMEOUT_SECONDS=10 is too generous?
If the user provides plausible URL with a typo (http://googl.com) will it fail earlier or will it block the CLI for 10s?

I guess it will fail sooner with DNS error?

danielhollas · 2024-04-24T18:06:40Z

tests/cmdline/params/types/test_path.py

+            check_timeout_seconds(invalid)
+
+
+def test_fail_non_existing_path():


Thanks for all the new tests!

The `PathOrUrl` and `FileOrUrl` parameter types extend click's `Path` and `File` parameter types, respectively, by adding support for URLs. If the conversion the base type fails, the provided value is tried as a URL. The problem was that if the provided value was intended to be a local file or path, but conversion failed, the conversion to URL would almost certainly also fail and that would be the returned error message. This would be confusing for users that tried to specify a local path, but it contained a typo or was not readable for other reasons, such as file permissions. When the base type conversion fails, now the parameter types first check whether the provided file or path exists on the local disk. If so, it is assumed that the value was intended to be a local path and an error is returned that the path could not be read. Only then is the path tried as a URL. Here the `URLError` and `socket.timeout` exceptions are caught separately in which case we are almost certainly dealing with a URL which just couldn't be reached (in time). The `urlopen` method can also throw a `ValueError` which is usually when no protocol is defined. In this case an error is returned that explicitly states that the provided value does not correspond to a readable path nor to a reachable URL.

sphuber requested a review from danielhollas April 20, 2024 14:01

danielhollas reviewed Apr 20, 2024

View reviewed changes

danielhollas suggested changes Apr 20, 2024

View reviewed changes

sphuber requested a review from danielhollas April 20, 2024 18:16

sphuber force-pushed the fix/6313/cli-path-type-not-readable branch from 3199740 to d23a964 Compare April 20, 2024 20:45

danielhollas reviewed Apr 22, 2024

View reviewed changes

sphuber force-pushed the fix/6313/cli-path-type-not-readable branch 2 times, most recently from 7588af5 to ec53d09 Compare April 24, 2024 09:58

sphuber requested a review from danielhollas April 24, 2024 09:59

sphuber force-pushed the fix/6313/cli-path-type-not-readable branch 3 times, most recently from 312e708 to e532d25 Compare April 24, 2024 15:46

danielhollas approved these changes Apr 24, 2024

View reviewed changes

sphuber force-pushed the fix/6313/cli-path-type-not-readable branch from e532d25 to 5e3b659 Compare April 25, 2024 07:45

sphuber merged commit ffc6e4f into aiidateam:main Apr 25, 2024
19 checks passed

sphuber deleted the fix/6313/cli-path-type-not-readable branch April 25, 2024 09:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLI: Do not treat as URL if path exists in `PathOrUrl` parameter type #6360

CLI: Do not treat as URL if path exists in `PathOrUrl` parameter type #6360

sphuber commented Apr 20, 2024

danielhollas left a comment •

edited

danielhollas Apr 20, 2024

sphuber Apr 20, 2024

sphuber commented Apr 20, 2024

sphuber commented Apr 20, 2024

danielhollas Apr 22, 2024

sphuber Apr 22, 2024

danielhollas Apr 22, 2024

sphuber Apr 22, 2024

danielhollas Apr 23, 2024

danielhollas left a comment •

edited

sphuber commented Apr 24, 2024

sphuber commented Apr 24, 2024

danielhollas left a comment

danielhollas Apr 24, 2024

danielhollas Apr 24, 2024

danielhollas Apr 24, 2024

		check_timeout_seconds(invalid)


		def test_fail_non_existing_path():

CLI: Do not treat as URL if path exists in PathOrUrl parameter type #6360

CLI: Do not treat as URL if path exists in PathOrUrl parameter type #6360

Conversation

sphuber commented Apr 20, 2024

danielhollas left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sphuber commented Apr 20, 2024

sphuber commented Apr 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danielhollas left a comment • edited

Choose a reason for hiding this comment

sphuber commented Apr 24, 2024

sphuber commented Apr 24, 2024

danielhollas left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CLI: Do not treat as URL if path exists in `PathOrUrl` parameter type #6360

CLI: Do not treat as URL if path exists in `PathOrUrl` parameter type #6360

danielhollas left a comment •

edited

danielhollas left a comment •

edited