Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocalFileSystem restore _strip_protocol signature #1567

Merged
merged 12 commits into from
May 7, 2024

Conversation

ap--
Copy link
Contributor

@ap-- ap-- commented Apr 7, 2024

Hello,

I noticed that the LocalFileSystem._strip_protocol signature changed in #1477, when running the universal_pathlib test-suite against the current fsspec version.

To me it seems that the intention of #1477 was initially to prevent "C:/" or "C:\\" to be reduced to "C:" by _strip_protocol, but in addition it introduced function signature changes in fsspec.implementations.local and minor changes in fsspec.mapping.

I created this PR as a basis for discussion if the signature change could be avoided. In this PR I reverted the changes to LocalFileSystem and make_path_posix but kept the changes to the tests (first commit) and then provide an alternative implementation that avoids the function signature change.

I would also be happy to try to add more tests around the LocalFileSystem._parent, LocalFileSystem._strip_protocol and make_path_posix behavior for edge-cases if there is interest. But windows is not my main OS for daily work, so I am very likely not aware of most of them.

Cheers,
Andreas

@martindurant
Copy link
Member

@fleming79 , this undoes much of your code, but maintains the tests for correctness, while reimplementing consistency for _strip_protocol. Do you have thoughts?

@fleming79
Copy link
Contributor

I have no issue with removing the option 'remove_trailing_slash' if it isn't required. It was provided because there were .rstrip('/') calls scattered about.

The question I pose is - which code is more maintainable/performant/correct?

@ap--
Copy link
Contributor Author

ap-- commented Apr 10, 2024

Thanks for the quick response!

Regarding maintainability, I would prefer if the method signatures would agree with AbstractFileSystem. For example, enforcing #1104 (where possible) would make downstream development easier.

Regarding performance, I'm not a big fan of micro-benchmarks anymore, but I ran one (checking 10000 generated URIs) and the implementation in this PR seems to be equally fast/slow (within margin of error) as 2024.3.1 and 2024.3.0 for _strip_protocol. It is ~2% slower for _parent. This was on an M2max arm macbook, only tested with py3.11 and only run a few times. Variation between runs is larger than the differences within a run.

[click to expand] benchmark for `_strip_protocol` and `_parent`
❯ asv run
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.11
·· Building 518984d4 <master> for virtualenv-py3.11...
·· Installing 518984d4 <master> into virtualenv-py3.11.
· Running 6 total benchmarks (3 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For filesystem_spec commit 518984d4 <master>:
[ 0.00%] ·· Benchmarking virtualenv-py3.11
[ 8.33%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[25.00%] ··· bench_strip_protocol.Suite.time_parent                        207±0.8ms
[33.33%] ··· bench_strip_protocol.Suite.time_split_protocol                  205±2ms
[33.33%] · For filesystem_spec commit 4bd16f64 <2024.3.0^0>:
[33.33%] ·· Building for virtualenv-py3.11.
[33.33%] ·· Benchmarking virtualenv-py3.11
[41.67%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[58.33%] ··· bench_strip_protocol.Suite.time_parent                        207±0.7ms
[66.67%] ··· bench_strip_protocol.Suite.time_split_protocol                  204±2ms
[66.67%] · For filesystem_spec commit c9eaf46e <local-fs-restore-strip-protocol>:
[66.67%] ·· Building for virtualenv-py3.11....
[66.67%] ·· Benchmarking virtualenv-py3.11
[75.00%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[91.67%] ··· bench_strip_protocol.Suite.time_parent                          211±1ms
[100.00%] ··· bench_strip_protocol.Suite.time_split_protocol                 204±2ms
# pip install asv

import itertools
import random
import string
from fsspec import get_filesystem_class

random.seed(0)

def _make_random_path():
    # build a random length path
    pth_len = random.randint(5, 40)
    return "".join(random.sample(string.ascii_letters + "/", k=pth_len))

def _make_uris(n):
    # create uris with and without protocols and with and without netlocs
    it_proto_netloc = itertools.cycle(
        itertools.product(
            ["file", "local", "wrong", None],
            ["netloc", ""]
        )
    )

    for _ in range(n):
        proto, netloc = next(it_proto_netloc)
        pth = _make_random_path()
        if proto and netloc:
            yield f"{proto}://{netloc}/{pth}"
        elif proto:
            yield f"{proto}:/{pth}"
        else:
            yield f"{netloc}/{pth}"

uris = list(_make_uris(10000))


class Suite:

    def setup(self):
        self.fs_cls = get_filesystem_class("file")
        self.uris = uris

    def teardown(self):
        del self.uris

    def time_split_protocol(self):
        for uri in self.uris:
            self.fs_cls._strip_protocol(uri)

    def time_parent(self):
        for uri in self.uris:
            self.fs_cls._parent(uri)

To be able to answer if this implementation is more or less correct would require more tests. I'll try to find some time to come up with more cases for the test-suite.

Cheers,
Andreas

@martindurant
Copy link
Member

I don't think we can trust timing differences <5%; in these cases it can even matter which order you run them (whether the CPU is warm, etc). The interesting thing would be to check on Windows (but they have lots of path styles).

@fleming79
Copy link
Contributor

The original make_path_posix contains quite a lot of strip protocol code in it. It also includes 'sep' as an argument implying it can work for an alternate os.
I'd prefer to see the newer code tweaked to remove the signature.

Here is a starting point.





def make_path_posix(path):
    """Make path generic for current OS"""
    if not isinstance(path, str):
        if isinstance(path, (list, set, tuple)):
            return type(path)(make_path_posix(p) for p in path)
        else:
            path = str(stringify_path(path))
    if os.sep == "/":
        # Native posix
        if path.startswith("/"):
            # most common fast case for posix
            return path.rstrip("/") or "/"
        elif path.startswith("~"):
            return make_path_posix(osp.expanduser(path))
        elif path.startswith("./"):
            path = path[2:]
            path = f"{os.getcwd()}/{path}"
            return path.rstrip("/") or "/"
        return f"{os.getcwd()}/{path}"
    else:
        # NT handling
        if len(path) > 1:
            if path[1] == ":":
                # windows full path like "C:\\local\\path"
                if len(path) <= 3:
                    # nt root (something like c:/)
                    return path[0] + ":/"
                path = path.replace("\\", "/").replace("//", "/")
                return path
            elif path[0] == "~":
                return make_path_posix(osp.expanduser(path))
            elif path.startswith(("\\\\", "//")):
                # windows UNC/DFS-style paths
                return "//" + path[2:].replace("\\", "/").replace("//", "/")
        return make_path_posix(osp.abspath(path))



    @classmethod
    def _strip_protocol(cls, path):
        path = stringify_path(path)
        if path.startswith(("file:",'local')):
            if os.sep == '\\' and path.startswith("file:///"):
                path = path[8:]
            elif path.startswith("file://"):
                path = path[7:]
            elif path.startswith("file:"):
                path = path[5:]
            elif path.startswith("local://"):
                path = path[8:]
            elif path.startswith("local:"):
                path = path[6:]
        drive, path = osp.splitdrive(make_path_posix(path))
        path = path.rstrip("/") or cls.root_marker
        return drive + path

@ap--
Copy link
Contributor Author

ap-- commented Apr 15, 2024

Thank you @fleming79 for the proposed implementations. I'll try them out shortly, when I find some more time.

I've been collecting more test cases for _strip_protocol (still need to do _parent). In the test cases I basically try to document current behavior.

So far I noticed that between v2024.3.0 and v2024.3.1 there are the following differences in LocalFileSystem._strip_protocol behavior (note, that a lot of the test cases are probably just uncommon edge-cases):

Outdated. todo: update

on posix py310

input 2024.3.0 2024.3.1 this PR proposed variant (todo)
./ /cwd /cwd/ /cwd
file:// /cwd /cwd/ /cwd
file://./ /cwd /cwd/ /cwd

on windows py311

input 2024.3.0 2024.3.1 this PR proposed variant (todo)
file:/// C: C:/cwd C:/
file://///remote/share/path ///remote/share/path C:/cwd/remote/share/path ///remote/share/path
file:////remote/share/path //remote/share/path C:/cwd/remote/share/path //remote/share/path
file:/foo/bar C:/foo/bar C:/cwd/foo/bar C:/foo/bar
local:/path C:/path C:/cwd/path C:/path
s3://bucket/key s3://bucket/key C:/cwd/s3:/bucket/key s3://bucket/key

I will write more tests for _parent. (done)

(Update: the failing windows test succeeds locally on a windows python==3.11. I'll investigate a bit further...)

@ap--
Copy link
Contributor Author

ap-- commented Apr 15, 2024

Performance comparison on windows python==3.11, shows that the 2023.3.1 implementation is the fastest and the implementation in this PR is significantly worse.

I'll update this benchmark once I test the implementation provided in the comment above.

outdated benchmark
(venv) PS C:\User\user\development\filesystem_spec> asv run
· Creating environments.........
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.11
·· Building 4bd16f64 <2024.3.0^0> for virtualenv-py3.11...........
·· Installing 4bd16f64 <2024.3.0^0> into virtualenv-py3.11.
· Running 6 total benchmarks (3 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For filesystem_spec commit 4bd16f64 <2024.3.0^0>:
[ 0.00%] ·· Benchmarking virtualenv-py3.11
[ 8.33%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[25.00%] ··· bench_strip_protocol.Suite.time_parent                                   22.5±0.3ms
[33.33%] ··· bench_strip_protocol.Suite.time_split_protocol                           20.3±0.2ms
[33.33%] · For filesystem_spec commit 516b0626 <local-fs-restore-strip-protocol>:
[33.33%] ·· Building for virtualenv-py3.11........
[33.33%] ·· Benchmarking virtualenv-py3.11
[41.67%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[58.33%] ··· bench_strip_protocol.Suite.time_parent                                   43.3±0.3ms
[66.67%] ··· bench_strip_protocol.Suite.time_split_protocol                           25.7±0.3ms
[66.67%] · For filesystem_spec commit 47b445ae <2024.3.1^0>:
[66.67%] ·· Building for virtualenv-py3.11...........
[66.67%] ·· Benchmarking virtualenv-py3.11
[75.00%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[91.67%] ··· bench_strip_protocol.Suite.time_parent                                   19.5±0.3ms
[100.00%] ··· bench_strip_protocol.Suite.time_split_protocol                          16.5±0.2ms

@ap-- ap-- marked this pull request as draft April 28, 2024 12:39
@ap--
Copy link
Contributor Author

ap-- commented May 1, 2024

Hello @martindurant and @fleming79

I found some time to continue with this PR, and it would be great to get your feedback.

NOTE: the failing tests seem to be unrelated, and sporadic? related to a bug in setuptools_scm when using the newest version of git

Summarizing the changes:

  • added tests for as many (edge) cases I could think of for
    • make_posix_path
    • LocalFileSystem._strip_protocol
    • LocalFileSystem._parent
  • the implementation of _strip_protocol is now based on @fleming79's suggestion from the comment
    above with a few minor adjustments:
    • restored trailing slash handling to 2023.3.0 state
    • special handling of "." as a path
    • removed the if len(path) > 1 check
  • the implementation of _parent is now using a code-path on windows that is a stripped down version
    of ntpath.splitdrive, avoiding many checks for cases (byte paths, etc.) that don't seem to be required
    here.

Performance:

I ran a benchmark on 10000 randomly generated uri's comparing performance of _strip_protocol and _parent
on windows and linux (running in WSL2). The results are as follows:

click to show code for generating the uris
import itertools
import random
import string

from fsspec import get_filesystem_class

random.seed(0)

def _make_random_path(style):
    pth_len = random.randint(5, 40)
    if style == "posix":
        chars = string.ascii_letters + "/"
        prefix = ""
    elif style == "win":
        chars = string.ascii_letters + "\\"
        prefix = "c:\\"
    elif style == "win-posix":
        chars = string.ascii_letters + "/"
        prefix = "c:/"
    else:
        raise ValueError(f"Unknown style {style}")
    return prefix + "".join(random.sample(chars, k=pth_len))


def _make_uris(n):
    it_proto_netloc_style = itertools.cycle(
        itertools.product(
            ["file", "local", "wrong", None],
            ["netloc", ""],
            ["posix", "win", "win-posix"],
        )
    )

    for _ in range(n):
        proto, netloc, style = next(it_proto_netloc_style)
        pth = _make_random_path(style)
        if proto and netloc:
            yield f"{proto}://{netloc}/{pth}"
        elif proto:
            yield f"{proto}:/{pth}"
        else:
            yield f"{netloc}/{pth}"


uris = list(_make_uris(10000))


class Suite:
    def setup(self):
        self.fs_cls = get_filesystem_class("file")
        self.uris = uris

    def teardown(self):
        del self.uris

    def time_split_protocol(self):
        for uri in self.uris:
            self.fs_cls._strip_protocol(uri)

    def time_parent(self):
        for uri in self.uris:
            self.fs_cls._parent(uri)
Ubuntu under WSL2

_strip_protocol

python version 2024.3.0 2024.3.1 this PR
3.8 7.32 +- 0.1 ms 7.15 +- 0.08 ms 7.37 +- 0.1 ms
3.9 6.93 +- 0.1 ms 7.39 +- 0.3 ms 7.37 +- 0.6 ms
3.10 7.29 +- 0.08 ms 6.94 +- 0.1 ms 6.89 +- 0.3 ms
3.11 6.33 +- 0.2 ms 5.92 +- 0.06 ms 5.97 +- 0.07 ms
3.12 6.62 +- 0.1 ms 5.98 +- 0.03 ms 6.42 +- 0.05 ms

_parent

python version 2024.3.0 2024.3.1 this PR
3.8 9.58 +- 0.3 ms 9.79 +- 0.2 ms 9.31 +- 0.3 ms
3.9 8.61 +- 0.2 ms 10.3 +- 0.8 ms 8.44 +- 0.4 ms
3.10 8.83 +- 0.2 ms 9.59 +- 0.4 ms 9.08 +- 0.6 ms
3.11 8.25 +- 0.06 ms 8.33 +- 0.4 ms 7.51 +- 0.08 ms
3.12 8.05 +- 0.07 ms 7.93 +- 0.09 ms 8.14 +- 0.05 ms
Windows 11

_strip_protocol

python version 2024.3.0 2024.3.1 this PR
3.8 27.1 ± 0.7 ms 32.2 ± 0.3 ms 16.9 ± 0.9 ms
3.9 27.7 ± 0.3 ms 33.1 ± 0.2 ms 15.2 ± 0.2 ms
3.10 27.3 ± 0.6 ms 35.0 ± 0.2 ms 15.3 ± 0.2 ms
3.11 20.9 ± 0.4 ms 18.4 ± 0.7 ms 14.7 ± 0.3 ms
3.12 20.8 ± 0.2 ms 16.9 ± 0.2 ms 15.0 ± 1 ms

_parent

python version 2024.3.0 2024.3.1 this PR
3.8 31.1 ± 1 ms 37.6 ± 1 ms 19.3 ± 0.6 ms
3.9 31.1 ± 2 ms 36.2 ± 0.7 ms 17.9 ± 0.6 ms
3.10 30.5 ± 0.4 ms 38.8 ± 0.2 ms 19.0 ± 1 ms
3.11 23.7 ± 0.2 ms 20.9 ± 2 ms 17.6 ± 0.2 ms
3.12 23.7 ± 0.7 ms 20.4 ± 0.3 ms 16.8 ± 0.3 ms

Some open questions:

  • PR LocalFileSystem fix _strip_protocol for root directory #1477 added
    string coercion to make_path_posix, causing the newly added tests here to
    fail. I marked them as xfail for now. It would be great to get your feedback
    on this. My opinion is that these cases should raise an error. On current master
    make_posix_path(None) will return '/cwd/None'.

  • To document the behavior of some uncommon edge cases I added some xfail tests
    for legacy dos uris (e.g. file:///c|/foo/bar), rfc8089 uris with
    localhost (file://localhost/foo/bar), and windows relative paths with drives
    (c:Path). I don't think it's worth supporting them but if you think
    otherwise, I can make the changes to add support.

Have a great day!
Andreas

@ap-- ap-- marked this pull request as ready for review May 1, 2024 15:38
@ap--
Copy link
Contributor Author

ap-- commented May 1, 2024

Note: the failing CI runs seem to be caused by a new git release? pypa/setuptools_scm#1038

@fleming79
Copy link
Contributor

@ap-- It looks like you've made some good changes.

PR #1477 added
string coercion to make_path_posix, causing the newly added tests here to
fail. I marked them as xfail for now. It would be great to get your feedback
on this. My opinion is that these cases should raise an error. On current master
make_posix_path(None) will return '/cwd/None'.

I agree those tests are useful and an error should probably be raised for invalid types somehow. I guess removing the coercion would suffice, but maybe it would be better to also check the return type from the call to stringify_path.

@martindurant
Copy link
Member

causing the newly added tests here to
fail. I marked them as xfail for now.

Do those fail with a different exception type now, or just return garbage pathstrings?

the failing CI runs seem to be caused by a new git release? pypa/setuptools_scm#1038

This is for the s3fs build, right? We could pin its requirement on setuptools_scm, wait for a new release, or move that build to hatch (as fsspec has been).

@ap--
Copy link
Contributor Author

ap-- commented May 2, 2024

Do those fail with a different exception type now, or just return garbage pathstrings?

Current master returns garbage:

>>> from fsspec.implementations.local import make_path_posix
>>> make_path_posix(object())
'/Users/poehlmann/development/filesystem_spec/<object object at 0x100664b80>'

Possible fixes are:

  1. change path = str(stringify_path(path)) to path = stringify_path(path) which will then cause an AttributeError to be raised due to the missing attribute ".startswith". (This is asserted in the test)
  2. check the instance type of stringify_path(path) and raise a TypeError if not str
  3. raise a TypeError in the fallthrough case in stringify_path:
    else:
    return filepath # type: ignore[return-value]

This is for the s3fs build, right? We could pin its requirement on setuptools_scm, wait for a new release, or move that build to hatch (as fsspec has been).

I think this happens when during the s3fs install fsspec gets installed. The traceback shows it failing in hatchling, which uses the hatch-vcs plugin to get the current version via setuptools_scm. setuptools_scm in turn has a compatibility issue with the newest git version which seems to be used on the Github Actions runner already via the conda test-env environment. I'll see if downgrading git will fix the issue.

@martindurant
Copy link
Member

I'll see if downgrading git will fix the issue.

A specific version of git could be pinned in the CI conda environment file

@ap--
Copy link
Contributor Author

ap-- commented May 2, 2024

That made it work ☺️ The py3.11 test failed because the smb tests failed.

@martindurant
Copy link
Member

rerunning ... I have yet to come up with a way to get SMB to reliably pass every time.

@martindurant
Copy link
Member

I'm happy with this. @fleming79, any comments?

@ap--
Copy link
Contributor Author

ap-- commented May 2, 2024

Awesome.

So we just need a decision regarding make_path_posix behavior:

  1. make_path_posix(None) returns /cwd/None (fsspec>=2024.3.1 and this PR)
  2. make_path_posix(None) raises TypeError (fsspec<=2024.3.0)
  3. make_path_posix(None) raises TypeError (might be most intuitive for users?)

Update: I double checked, and <=2024.3.0 raised TypeError because of if "~" in path.

@martindurant
Copy link
Member

2. seems fine to me, but happy to catch the error and convert to TypeError too (no strong feeling)

@fleming79
Copy link
Contributor

I'm happy with 2 .

In stringify_path I suggest changing from isinstance(filepath, pathlib.Path) to isinstance(filepath, pathlib.PurePath).Pathis a subclass ofPurePath. and would mean PurePosixPathandPureWindowsPath` are also converted to strings (which they should). Perhaps some tests involving PurePaths could be added?

@ap--
Copy link
Contributor Author

ap-- commented May 2, 2024

I made the changes and added tests for stringify_path.

In stringify_path I suggest changing from isinstance(filepath, pathlib.Path) to isinstance(filepath, pathlib.PurePath).Pathis a subclass ofPurePath. and would mean PurePosixPathandPureWindowsPath` are also converted to strings (which they should). Perhaps some tests involving PurePaths could be added?

PurePath already implements the __fspath__ method, which made the isinstance check after hasattr('fspath') redundant.

@martindurant martindurant merged commit da77548 into fsspec:master May 7, 2024
11 checks passed
gutzbenj pushed a commit to earthobservations/filesystem_spec that referenced this pull request May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants