Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocalFileSystem fix _strip_protocol for root directory #1477

Merged
merged 34 commits into from
Mar 18, 2024

Conversation

fleming79
Copy link
Contributor

This modifies LocalFileSystem to fix #945.

All local tests pass on Windows.

@nicholasjng
Copy link
Contributor

I'd like to see this go into the AbstractFileSystem._strip_protocol base class method as well - I've had trouble with it in the case of lakeFS, where stripping away trailing slashes has some implications on URI parsing and API call results.

@martindurant
Copy link
Member

I'd like to see this go into the AbstractFileSystem._strip_protocol base class

The trouble here is, that for real filesystems (local, ftp, ssh, ...), files cannot e d with "/", even though directory names are sometimes printed with the extra character. So for those, stripping is always the right thing to do. Similarly, in s3 and gcs, "/" is the "separation" character used with the remote API and also has a special place, if not exactly the same. So I think I would for now recommend that implementations that need it should provide their own _strip_protocol rather than changing the default one.

@nicholasjng
Copy link
Contributor

nicholasjng commented Dec 21, 2023

The trouble here is, that for real filesystems (local, ftp, ssh, ...), files cannot e d with "/", even though directory names are sometimes printed with the extra character. So for those, stripping is always the right thing to do. Similarly, in s3 and gcs, "/" is the "separation" character used with the remote API and also has a special place, if not exactly the same. So I think I would for now recommend that implementations that need it should provide their own _strip_protocol rather than changing the default one.

I see your point, but on the other hand, trailing slashes are also the defining hint of directories in many implementations - in a different world, in those implementations, all file-ops could raise IsADirectoryErrors on trailing-slash filenames if they are not applied recursively.

In my case, I am working with URIs similiar to S3, of the structure lakefs://<bucket>/<version>/<resource>, where an empty resource marks the root of the bucket (i.e. / as root marker). I cannot just simply return the root marker there since the bucket and version are still in front, so the filename or cls.root_marker check will not do what I want.

But since the subclass implementation works fine in our case, I see no harm in keeping that in place for now.

@fleming79
Copy link
Contributor Author

My apologies if the updates have been noisy. I'm a novice when it comes to git and pull requests.

The other history has been removed when I switched to using a nested function so that remove_trailing_slash is always respected.

I note there is an error for test_fsmap_error_on_protocol_keys.

def test_fsmap_access_with_suffix(tmp_path):
    tmp_path.joinpath("b").mkdir()
    tmp_path.joinpath("b", "a").write_bytes(b"data")
    m = fsspec.get_mapper(f"file://{tmp_path}")

    with pytest.raises(KeyError):
        _ = m["b/"]

    assert m["b/a/"] == b"data"

I haven't used FSMap before but wonder if the line in the test that is failing:

assert m["b/a/"] == b"data"`

should be without the trailing separator?

assert m["b/a"] == b"data"'

return path.rstrip("/") if remove_trailing_slash else path


def _make_path_posix(path, sep=os.sep):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why split up the function? The windows-root check is a rare one and should ideally be checked after more common things. This would be a second case within "# windows full path"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning was that the original make_path_posix function has several return statements. Wrapping the original function enables detection of special edge cases but also adds support for stripping the trailing suffix where appropriate.

path = _make_path_posix(path, sep)
if os.name == "nt" and len(path) == 3 and path[1] == ":":
return path # windows root
return path.rstrip("/") if remove_trailing_slash else path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably this i what breaks the mapper test; the key in question I think would have hit "if path.startswith("/"):" previously.

Copy link
Contributor Author

@fleming79 fleming79 Dec 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trouble here is, that for real filesystems (local, ftp, ssh, ...), files cannot e d with "/", even though directory names are sometimes printed with the extra character.

The filename ends with a trailing '/'. Is it supposed to?

image

I should mention that I also added that check in open, however the issue still occurs if the check is disabled.

The issue can be resolved by stripping the trailing suffix in mapping.py->__getitem__. Doing doesn't appear to cause any regression.

k = self._key_to_str(key).rstrip("/")

…th_posix.

make_path_posix with some optimisation tweaks such as change re.match to a string comparison.
Should also properly format concatenated windows paths as posix.
@fleming79
Copy link
Contributor Author

make_path_posix has been divided into a native posix version and nt (windows) version. It has also been ordered with assumptions made about the most common types of paths.
A test, test_parent for testing the results of the method: _parent has also been added. Do you think _parent of a root should return the root, or should it raise an error?

fsspec/implementations/local.py Outdated Show resolved Hide resolved
fsspec/implementations/local.py Outdated Show resolved Hide resolved
fsspec/implementations/local.py Show resolved Hide resolved
self.root = fs._strip_protocol(root).rstrip("/")
try:
self.root = fs._strip_protocol(root, remove_trailing_slash=True)
except TypeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate; in what case does rstrip fail? In this case, I think we might be explicitly joining on the separator anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Windows if the root is something like c:/, rstrip will cause a root of c: which is treated like '.', the current working directory instead of the root.
On posix rstrip will make / change to .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place that the special case for LocalFileSystem leaks, and I don't like it :|. I think it may be better to let through strange behaviour if the user thinks they want to may their whole C drive (and gets only cwd files instead).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But without .rstrip("/")

@martindurant
Copy link
Member

py38 does not support removeprefix

AttributeError: 'str' object has no attribute 'removeprefix'

@fleming79
Copy link
Contributor Author

fleming79 commented Jan 24, 2024

I think sep probably shouldn't be an argument for the function make_path_posix since having it there advertises that it is capable of producing posix for either system, but it relies on system specific functions for much of its behaviour. The associated tests can be made os specific (some previously were already).

image

Alan Fleming and others added 2 commits January 25, 2024 08:36
@fleming79
Copy link
Contributor Author

The latest changes remove sep accordingly.

@@ -38,6 +38,7 @@ def jupyter(tmpdir):
P.terminate()


@pytest.mark.skipif(WIN, reason="Subprocess gets stuck in a loop")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice this before. Any idea what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the problem - Something to do with the temp dir passed to Jupyterlab in the launch call showing up as an invalid path. Surrounding the path with "" fixed it. The test now passes when I run it on Windows.

cmd = f'jupyter notebook --notebook-dir="{tmpdir}" --no-browser --port=5566'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was to be removed according to the thread above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite right - I made the changes but forgot to push them. It is done now.

@martindurant
Copy link
Member

please merge from master to pick up the constraint on pytest that should fix CI

@fleming79
Copy link
Contributor Author

Did you mean something other than to sync the fork?

@martindurant
Copy link
Member

That's exactly what I meant - sorry I didn't get a notification

@fleming79
Copy link
Contributor Author

No problem.
I fixed the Ruff lint issue, but there were also errors with 'dbfs'. I'm not sure how they could be related to this PR.

@martindurant
Copy link
Member

martindurant commented Mar 2, 2024

@nils-braun , @a24lorie , any idea why DBFS should start failing, but only for some python versions? It is also happening on other PRs.

@fleming79 , you may wish to take the commit from #1533 , which should at least squash the intermittent SMB failures.

@a24lorie
Copy link
Contributor

a24lorie commented Mar 4, 2024

@nils-braun , @a24lorie , any idea why DBFS should start failing, but only for some python versions? It is also happening on other PRs.

@martindurant I have found this to be an issue with the pytest-vcr and urllib3 version 2.x with python 3.10 and 3.11.
Downgrading urllib3 to version 1.26.18 in Python 3.10 removes this error

@martindurant
Copy link
Member

Please update your fork following #1533

@fleming79
Copy link
Contributor Author

I synced the fork with GitHub.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this has been taking so long...

I wonder, since we now are essentially calling _strip_protocol from within LocalFileSystem, would it be easy to split off the optional remove_trailing_slash= into a separate method, to maintain the previous signature? I am still not sure what the best default is, since in most places we think of paths not having terminal sep characters even when they refer to directories.

fsspec/implementations/local.py Show resolved Hide resolved
fsspec/implementations/local.py Show resolved Hide resolved
fsspec/implementations/local.py Show resolved Hide resolved
@@ -38,6 +38,7 @@ def jupyter(tmpdir):
P.terminate()


@pytest.mark.skipif(WIN, reason="Subprocess gets stuck in a loop")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was to be removed according to the thread above

fsspec/implementations/tests/test_local.py Show resolved Hide resolved
self.root = fs._strip_protocol(root).rstrip("/")
try:
self.root = fs._strip_protocol(root, remove_trailing_slash=True)
except TypeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place that the special case for LocalFileSystem leaks, and I don't like it :|. I think it may be better to let through strange behaviour if the user thinks they want to may their whole C drive (and gets only cwd files instead).

@fleming79
Copy link
Contributor Author

The fork has been synced, with no other changes.

@martindurant martindurant merged commit 1653552 into fsspec:master Mar 18, 2024
10 checks passed
@martindurant
Copy link
Member

:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LocalFileSystem inconsistent with Windows paths
4 participants