Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isdir/info method works incorrectly #574

Open
TSienki opened this issue Aug 14, 2023 · 20 comments
Open

isdir/info method works incorrectly #574

TSienki opened this issue Aug 14, 2023 · 20 comments

Comments

@TSienki
Copy link

TSienki commented Aug 14, 2023

Hello,
I've found a strange behavior of the isdir method (digging deeper also with info method). It returns incorrect values. These values seem to be returned randomly.

I use Python 3.10.12 and I've tested this behavior on gscfs=2022.3.0, and the latest version gscfs=2023.6.0

I've prepared a helper function to show what is happening here:

from gcsfs import GCSFileSystem

fs = GCSFileSystem()


def check_is_dir(path):
    is_dir = fs.isdir(path)
    info_type = fs.info(path)["type"]

    print(path, is_dir, info_type)

Problem example

An exemplary run:

check_is_dir('gs://my/super')
check_is_dir('gs://my/super/secret')
check_is_dir('gs://my/super/secret/gcs')
check_is_dir('gs://my/super/secret/gcs/directory')
check_is_dir('gs://my/super/secret/gcs/directory/file.json')

Results:

gs://my/super False director  # the first string is a path, the first boolean is a value returned by isdir method, and the second string is 'type' value in the dictionary returned by fs.info(path)
gs://my/super/secret True directory
gs://my/super/secret/gcs False directory
gs://my/super/secret/gcs/directory True directory
gs://my/super/secret/gcs/directory/file.json False file

As you can see, some directories are incorrectly treated as files. So more, values returned by the info and isdir methods are inconsistent.

Another insight

Changing the order of calling these methods, like in the snippet below:

def check_is_dir(path):
    info_type = fs.info(path)["type"]
    is_dir = fs.isdir(path)

    print(path, is_dir, info_type)

makes is_dir contains a correct value, but info_type incorrect one. Like here:

gs://my/super True file  # the first string is a path, the first boolean is a value returned by isdir method, and the second string is 'type' value in the dictionary returned by fs.info(path)
gs://my/super/secret True directory
gs://my/super/secret/gcs True file
gs://my/super/secret/gcs/directory True directory
gs://my/super/secret/gcs/directory/file.json False file

Workaround

For now, my workaround is to run isdir method two times:

def check_is_dir(path):
    is_dir = fs.isdir(path)
    is_dir = fs.isdir(path)
    info_type = fs.info(path)["type"]

    print(path, is_dir, info_type)

It works:

gs://my/super True directory  # the first string is a path, the first boolean is a value returned by isdir method, and the second string is 'type' value in the dictionary returned by fs.info(path)
gs://my/super/secret True directory
gs://my/super/secret/gcs True directory
gs://my/super/secret/gcs/directory True directory
gs://my/super/secret/gcs/directory/file.json False file

But I want to work with this library without such workaround ;)

@slevang
Copy link
Contributor

slevang commented Aug 16, 2023

I don't have an MCVE but can say I've also run into this before. The hack to run isdir twice does work, so I've used that since I first saw this and didn't do any further investigation.

@TSienki
Copy link
Author

TSienki commented Aug 17, 2023

It seems that I've found the root cause of this problem.
Maybe It's a bug in the GCS file system. When I used the ls method on this directory, it listed the paths of all files and the directory path itself. The expected behavior would be to list only files inside this directory.

I decided to remove this directory and upload it again, and that solved the whole problem.

@tomecki
Copy link

tomecki commented Nov 29, 2023

AFAIK this was tackled in #313, but for some reason never merged to main branch.

Motivating example using a public bucket gs://hadoop-lib/gcs <- this should be recognized as a folder. (link to console UI)

import fsspec
path = "gs://hadoop-lib/gcs"
fs = fsspec.filesystem("gs")

With python 3.7 and fsspec 2023.1.0:

fs.info(path) # {'kind': 'storage#object', ... , 'type': 'file'}
fs.info(path) # {..., 'type': 'directory'} <--- THIS IS TRUE

With python 3.9 and fsspec 2023.10.0:

fs.info(path) # {'kind': 'storage#object', ... , 'type': 'file'}
fs.info(path) # {'kind': 'storage#object', ... , 'type': 'file'}

(at least it's consistently wrong)

@martindurant
Copy link
Member

There is indeed an empty file: 'hadoop-lib/gcs/', so . With #313 this is shown as a directory?

@tomecki
Copy link

tomecki commented Nov 29, 2023

hadoop-lib/gcs is a folder, containing other files. The library incorrectly recognizes it as a file.
#313 never got merged, I haven't tested the behaviour with this PR.

@martindurant
Copy link
Member

hadoop-lib/gcs is a folder, containing other files

it is also a file, though. In pathstring-land, the final "/" is immaterial.

@martindurant
Copy link
Member

#313 appears abandoned, maybe it should be resurrected.

@hmeyer
Copy link

hmeyer commented Dec 13, 2023

I'm affected by this as well:

>>> gcsfs.__version__
'2023.12.1'
>>> fs.info('henning-test/empty_folder')['type']
'directory'  # Good.
>>> fs.info('henning-test/empty_folder/')['type']
'directory'  # Good.
>>> fs.info('henning-test/folder/test.txt')['type']
'file'  # Good.
>>> fs.info('henning-test/folder/test.txt/')['type']
'directory'  # ? I would expect file!
>>> fs.info('henning-test/folder')['type']
'directory'  # Good.
>>> fs.info('henning-test/folder/')['type']
'directory'  # Good.

>>> fs.ls('henning-test/folder')
['henning-test/folder/test.txt']  # Good!
>>> fs.ls('henning-test/folder/test.txt')
['henning-test/folder/test.txt']  # ? I would expect NotADirectoryError!
>>> fs.ls('henning-test/empty_folder')
['henning-test/empty_folder/']  # ? I would expect []!

@martindurant
Copy link
Member

I am guessing "henning-test/empty_folder/" is a zero-length-file?
Keeping that in mind, maybe most of this makes sense.

It would be best to show your complete set of files (as returned by find() ) to understand better.

@hmeyer
Copy link

hmeyer commented Dec 14, 2023

>>> fs.find('henning-test')
['henning-test/empty_folder/', 'henning-test/folder/test.txt']

Which still does not explain this:

>>> fs.ls('henning-test/folder/test.txt')
['henning-test/folder/test.txt']  # ? I would expect NotADirectoryError!
>>> fs.ls('henning-test/empty_folder')
['henning-test/empty_folder/']  # ? I would expect []!

Or does it?

@martindurant
Copy link
Member

Or does it?

:)

For the first case, it is conventional that ls(file-path) should return [file-path]. This in bash:

$ ls README.md
README.md

For the second, the return is showing you the zero-length-file, which has the same prefix as the path you asked to list.

Yes, it's complicated, because GCS doesn't have folders, only "common prefixes". Sometimes it's convenient to assume these (optional) weird placeholders are the same as directories, but sometimes you need to know they are there.

@hmeyer
Copy link

hmeyer commented Dec 14, 2023

But it doesn't really match the behavior of LocalFileSystem (and also it does not match the documentation of fsspec).

>>> from fsspec.implementations.local import LocalFileSystem
>>> lfs = LocalFileSystem()
>>> lfs.ls('/tmp/fsspec-test/empty_folder')
[]
>>> lfs.ls('/tmp/fsspec-test/folder')
['/tmp/fsspec-test/folder/test.txt]']
>>> lfs.ls('/tmp/fsspec-test/folder/test.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/venv/lib/python3.10/site-packages/fsspec/implementations/local.py", line 66, in ls
    return [posixpath.join(path, f) for f in os.listdir(path)]
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/fsspec-test/folder/test.txt'

@martindurant
Copy link
Member

does not match the documentation of fsspec

What specifically?

@hmeyer
Copy link

hmeyer commented Dec 14, 2023

After reading the documentation again, I realize it might also compatible with returning the path to a file itself, when called on a file. Although this seems awkward to me.
Also it is inconsistent, that when calling ls on an empty folder or a file, we get the empty folder or file back, but when calling ls on a folder with files or subfolders, we don't get the folder itself back. Isn't it?

@martindurant
Copy link
Member

Also it is inconsistent, that when calling ls on an empty folder or a file, we get the empty folder or file back, but when calling ls on a folder with files or subfolders, we don't get the folder itself back. Isn't it?

I agree that it would be good to sketch out all the possible cases, and then write a bunch of tests which can be applied to any filesystem and thus ensure consistency (much as was done for get/put/rm). This is a fair amount of work!

@Conchylicultor
Copy link

Conchylicultor commented Dec 20, 2023

+1 to what @hmeyer said. GCS and local file-system should be consistent! And also consistent with the rest of Python ecosystem (os.listdir() and pathlib.Path().iterdir())

empty_folder = '/.../empty_folder/'

assert os.listdir(empty_folder) == []
assert list(pathlib.Path(empty_folder).iterdir()) == []
assert local_fs.ls(empty_folder) == []
assert gcs_fs.ls(empty_folder) == ['/.../empty_folder/']  # <<<< Inconsistent and very surprising

@martindurant
Copy link
Member

Let me point out, again, that in local filesystems, having a file and a directory with the same name is not allowed, and having a file with "/" as the last character is not allowed. This is a real difference we can't pretend doesn't exist. So, gcsfs is not_wrong in what it says. Since batch operations (e.g., rm() ) depend on the output of ls/find, we can't simply throw away these not-directory-files. We could patch ls() specifically, and I am happy to see such a PR.

@hmeyer
Copy link

hmeyer commented Dec 20, 2023

Actually it is perfectly ok to create a file with a trailing "/" locally:

>>> pathlib.Path('./bar\/').touch()
>>> pathlib.Path('./bar').mkdir()
>>> list(pathlib.Path('.').iterdir())
[PosixPath('bar\\'), PosixPath('bar')]

And I think following the principle of least surprise, it would be good if gcsfs would behave similar to gsutil or gcloud storage (both of which treat zero size files with trailing "/" as empty directories).

@martindurant
Copy link
Member

PosixPath('bar\\')

Am I being dim, or is this not the path you had asked for?

@hmeyer
Copy link

hmeyer commented Dec 20, 2023

It is the path I asked for, but curiously pathlib (and also os.listdir()) have an interesting way of showing this.
Speaking about a PR - I started #598.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants