
[Python] Different get_file_info behaviour between pyarrow.fs.S3FileSystem and s3fs #36983

Closed
benhc opened this issue Aug 1, 2023 · 7 comments · Fixed by #37768

Comments


benhc commented Aug 1, 2023

I need to use s3fs as the filesystem in the dataset constructor due to performance considerations raised in #33169. However, when I try to do so I get the following stack trace:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[14], line 1
----> 1 ds.dataset(
      2     "my_bucket/my_dataset_path",
      3     filesystem=fs,
      4     partitioning=None,
      5     format="parquet"
      6 )

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/dataset.py:763, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    752 kwargs = dict(
    753     schema=schema,
    754     filesystem=filesystem,
   (...)
    759     selector_ignore_prefixes=ignore_prefixes
    760 )
    762 if _is_path_like(source):
--> 763     return _filesystem_dataset(source, **kwargs)
    764 elif isinstance(source, (tuple, list)):
    765     if all(_is_path_like(elem) for elem in source):

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/dataset.py:456, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    448 options = FileSystemFactoryOptions(
    449     partitioning=partitioning,
    450     partition_base_dir=partition_base_dir,
    451     exclude_invalid_files=exclude_invalid_files,
    452     selector_ignore_prefixes=selector_ignore_prefixes
    453 )
    454 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 456 return factory.finish(schema)

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2752, in pyarrow._dataset.DatasetFactory.finish()

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File ~/blah/.venv/lib/python3.10/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
    421 from pyarrow import PythonFile
    423 if not self.fs.isfile(path):
--> 424     raise FileNotFoundError(path)
    426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: my_bucket/my_dataset_path/

This is due to inconsistencies in the result of get_file_info from the s3fs filesystem when mapped into a PyFileSystem vs the pyarrow S3 filesystem.

Code to demonstrate:

from boto3 import Session
import s3fs
import pyarrow.dataset as ds
import pyarrow as pa
from pyarrow.fs import PyFileSystem, FSSpecHandler, FileSelector

session = Session()
credentials = session.get_credentials()
s3_fs = s3fs.S3FileSystem(
    anon=False,
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
)
pa_s3_fs = PyFileSystem(FSSpecHandler(s3_fs))

pa_fs = pa.fs.S3FileSystem(
    access_key=credentials.access_key,
    secret_key=credentials.secret_key,
    session_token=credentials.token,
)

selector = FileSelector("my_bucket/my_dataset_path/", recursive=True)
pa_s3_fs.get_file_info(selector)
>>> [<FileInfo for 'my_bucket/my_dataset_path/': type=FileType.File, size=0>,
 <FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]
pa_fs.get_file_info(selector)
>>> [<FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]

This 0-length file is created by S3 when the folder is created, and can be seen by calling:

s3_fs.find(selector.base_dir, maxdepth=None, withdirs=True, detail=True)
>>> {'my_bucket/my_dataset_path': {'Key': 'my_bucket/my_dataset_path',
  'LastModified': datetime.datetime(2022, 11, 3, 15, 3, 4, tzinfo=tzutc()),
  'ETag': '"d41d8cd92342344e9800998ecf8427e"',
  'Size': 0,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 0,
  'name': 'my_bucket/my_dataset_path'},
 'my_bucket/my_dataset_path/part-0.parquet': {'Key': 'my_bucket/my_dataset_path/part-0.parquet',
  'LastModified': datetime.datetime(2023, 8, 1, 9, 8, 21, tzinfo=tzutc()),
  'ETag': '"ce6df3f4234f5e078431949620e203c8-33"',
  'Size': 1707552354,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 1707552354,
  'name': 'my_bucket/my_dataset_path/part-0.parquet'}}

I assume that somewhere in the pyarrow S3 filesystem code these 0-length files are ignored, though I cannot find where. The filtering behaviour should be the same for fsspec filesystems.
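As an illustration of a possible client-side workaround (not part of this issue; the helper name and the heuristic are assumptions), one could drop placeholder entries from an fsspec-style detail listing before building the dataset:

```python
def drop_directory_placeholders(infos):
    """Remove S3 "folder" placeholder objects from a detail listing.

    `infos` is a dict shaped like the output of
    s3fs.S3FileSystem.find(..., detail=True): name -> metadata dict.
    A placeholder is, heuristically, a 0-byte entry whose name is a
    directory prefix of some other entry. Caveat: a genuine empty file
    whose name happens to prefix another key would also be dropped.
    """
    names = set(infos)
    kept = {}
    for name, info in infos.items():
        prefix = name.rstrip("/") + "/"
        is_placeholder = info.get("size", 0) == 0 and any(
            other != name and other.startswith(prefix) for other in names
        )
        if not is_placeholder:
            kept[name] = info
    return kept
```

The filtered dict can then be used to collect explicit file paths to pass to `ds.dataset(...)` instead of a bare directory path.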

Component(s)

Python

@benhc benhc added the Type: bug label Aug 1, 2023
@raulcd raulcd changed the title Different get_file_info behaviour between pyarrow.fs.S3FileSystem and s3fs [Python] Different get_file_info behaviour between pyarrow.fs.S3FileSystem and s3fs Aug 1, 2023
@pitrou
Member

pitrou commented Aug 22, 2023

Ok, so here is what the Python docstring says:

    def get_file_info(self, paths_or_selector):
        """
        Get info for the given files.

        [...]

        Parameters
        ----------
        paths_or_selector : FileSelector, path-like or list of path-likes
            Either a selector object, a path-like object or a list of
            path-like objects. The selector's base directory will not be
            part of the results, even if it exists. If it doesn't exist,
            use `allow_not_found`.

Quoting the relevant part above:

The selector's base directory will not be part of the results, even if it exists.

So with selector = FileSelector("my_bucket/my_dataset_path/", recursive=True), the FileInfo for 'my_bucket/my_dataset_path/' should not be returned as part of the results. FSSpecHandler should probably be fixed to ensure that.

@danepitkin
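A standalone sketch of the base-directory filtering described above (the real fix belongs inside FSSpecHandler; this helper and its name are hypothetical):

```python
def exclude_base_dir(base_dir, paths):
    """Drop the selector's base directory itself from a listing.

    Paths are compared with trailing slashes stripped, since the base
    directory may be given as 'bucket/path' or 'bucket/path/'.
    """
    norm_base = base_dir.rstrip("/")
    return [p for p in paths if p.rstrip("/") != norm_base]
```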

@danepitkin
Member

Hmm potentially related issue? #37555

@AlenkaF
Member

AlenkaF commented Sep 13, 2023

Hmm potentially related issue? #37555

Correct! Will link this issue to the PR that will solve this: #37558 and will also add a test.

@AlenkaF
Member

AlenkaF commented Sep 18, 2023

The issue with the FileInfo results including the base directory has been fixed with #37558. But there is another issue (as reported here) where, in the case of fsspec s3fs, 0-length files are created for every subdirectory. If the subdirectory is empty, the only thing that seems to be created is the 0-length file, with no info item of directory type.

S3FileSystem only includes the item with the directory type, whether the dir is empty or not (ignoring the base directory).

I guess we want the behaviour in FSSpecHandler to match S3FileSystem?

@pitrou
Member

pitrou commented Sep 18, 2023

This is tricky to work around, because a genuine 0-length file could have been created by the user.

I would suggest simply ignoring the issue in the tests.

@coolacid

pa_s3_fs.get_file_info(selector)
>>> [<FileInfo for 'my_bucket/my_dataset_path/': type=FileType.File, size=0>,
 <FileInfo for 'my_bucket/my_dataset_path/part-0.parquet': type=FileType.File, size=1707552354>]

One thing to note: I believe these "directories" carry a trailing / as well as size=0. Would that not work as a filter to remove these items?
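The suggested filter could be sketched like this (the function name is an assumption; it encodes only the trailing-slash-plus-zero-size heuristic from the comment above):

```python
def is_directory_marker(path, size):
    """True for entries that look like S3 directory placeholders:
    a trailing '/' combined with size 0."""
    return path.endswith("/") and size == 0
```

Applied to the listing shown earlier in the issue, this keeps the parquet file and drops the placeholder, while leaving genuine 0-length files (no trailing slash) alone.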

@pitrou
Member

pitrou commented Sep 28, 2023

One thing to note: I believe these "directories" carry a trailing / as well as size=0. Would that not work as a filter to remove these items?

You're right, that would probably work indeed.

jorisvandenbossche added a commit that referenced this issue Oct 10, 2023
…fs.S3FileSystem and s3fs (#37768)

### What changes are included in this PR?

Update `test_get_file_info_with_selector` to check that the base directory is not included in the results for either the `S3FileSystem` or the `s3fs` filesystem. Also remove old comments and flexible checks.

* Closes: #36983

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 14.0.0 milestone Oct 10, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024