
[Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix #38794

Closed
yf-yang opened this issue Nov 20, 2023 · 20 comments · Fixed by #40147

Comments

@yf-yang

yf-yang commented Nov 20, 2023

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow == 14.0.1

How the parquet dataset was created:

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

s3fs = s3fs.S3FileSystem()
df = pl.DataFrame()  # presumably contains 'set' and 'subset' string columns (here with values 'abc' and 'def')
ds.write_dataset(
    df.to_arrow(),
    "s3://bucket/parquet_root",
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.partitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])),
    existing_data_behavior='delete_matching'
)

After this call, the path parquet_root/abc/def/part-0.parquet exists in the bucket.
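For illustration, one way to confirm that layout is with the same s3fs client (a minimal sketch; the bucket name is a placeholder and default credentials are assumed):

import s3fs

fs = s3fs.S3FileSystem()
# List every object under the dataset root written above
print(fs.find("bucket/parquet_root"))
# expected to include 'bucket/parquet_root/abc/def/part-0.parquet'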

Trying to access the parquet dataset:

NOTE: the same API call changes behavior after I call s3fs.isdir in between, which is also weird.

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs

s3fs = s3fs.S3FileSystem()
pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok

pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: Could not open Parquet input source 's3://bucket/parquet_root/': Parquet file size is 0 bytes

# after I manually call s3fs.isdir, things change; I suspect this is another bug
s3fs.isdir('s3://bucket/parquet_root/') # True

# repeat the call
pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'

# another try, the same error
dataset = ds.dataset(
  's3://bucket/parquet_root/',
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'

Component(s)

Python

@yf-yang yf-yang changed the title Fail to open partitioned parquet with s3fs + pyarrow [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow Nov 20, 2023
@yf-yang yf-yang changed the title [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix Nov 20, 2023
@yf-yang
Author

yf-yang commented Nov 20, 2023

I did see #26864; it seems they are similar issues, but this one is not fixed.

@mapleFU
Member

mapleFU commented Nov 20, 2023

Would the info here ( #26864 (comment) ) help?

@yf-yang
Author

yf-yang commented Nov 20, 2023

@mapleFU As I learned from some Stack Overflow answers, I did try different prefixes with the path, for example:

without s3://

pq.ParquetDataset('bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# FileNotFoundError: bucket/parquet_root/

or, with a leading / only

pq.ParquetDataset('/bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir '/bucket/parquet_root/'

#26864 (comment) works because, as I showed in this issue:

pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok

Both approaches are OK, so the call with the flag use_pandas_metadata=True is also OK. However, I still do not know what to do with a partitioned dataset.


I also tried to reproduce #26864 (comment)

from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler
fs = PyFileSystem(FSSpecHandler(s3fs))
selector = FileSelector("s3://bucket/parquet_root/", recursive=True)
fs.get_file_info(selector)

It returns

[
  <FileInfo for 'bucket/parquet_root': type=FileType.Directory>, 
  <FileInfo for 'bucket/parquet_root/': type=FileType.File, size=0>, 
  <FileInfo for 'bucket/parquet_root/abc': type=FileType.Directory>, 
  <FileInfo for 'bucket/parquet_root/abc/': type=FileType.File, size=0>, 
  <FileInfo for 'bucket/parquet_root/abc/def': type=FileType.Directory>, 
  <FileInfo for 'bucket/parquet_root/abc/def/': type=FileType.File, size=0>, 
  <FileInfo for 'bucket/parquet_root/abc/def/part-0.parquet': type=FileType.File, size=1078092>
]

Not sure if it helps

@yf-yang
Author

yf-yang commented Nov 21, 2023

Additional info:
Using s3fs-fuse to mount the bucket at /mnt/s3, then calling:

dataset = ds.dataset(
  '/mnt/s3/parquet_root/',
  format='parquet', 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)

Things work as expected.


Therefore, it should be something related to the interaction between the s3fs Python package's filesystem interface and pyarrow.

@mapleFU
Member

mapleFU commented Nov 21, 2023

Yeah, can you use that as a workaround for now? I'm a bit busy these days and may dive into it this weekend. I guess this is s3fs discovery-related behavior.

@yf-yang
Author

yf-yang commented Nov 22, 2023

Sure, thank you for your attention!

@mapleFU
Member

mapleFU commented Nov 26, 2023

@pitrou

#35440

I think the reason here is that FileListerTask::ToFileInfos doesn't add the s3:// prefix. Should we add the s3 prefix in FileListerTask, or should we wrap GetFileInfo(const Selector&) with a prefix?

@yf-yang
Author

yf-yang commented Jan 23, 2024

@mapleFU @pitrou Any updates? I guess this one is not so difficult to solve

@pitrou
Member

pitrou commented Jan 23, 2024

Well, the interactions between PyArrow and the third-party s3fs project are complicated. You could use PyArrow's built-in S3 filesystem instead.

cc @jorisvandenbossche
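A minimal sketch of that suggestion, using PyArrow's built-in S3 filesystem (the region and bucket are placeholders, and credentials are assumed to come from the environment):

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

# PyArrow's native S3 filesystem instead of the third-party s3fs package
s3 = fs.S3FileSystem(region="us-east-1")  # hypothetical region

dataset = ds.dataset(
    "bucket/parquet_root",  # no "s3://" prefix when a filesystem is passed explicitly
    format="parquet",
    filesystem=s3,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
print(dataset.head(1))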

@yf-yang
Author

yf-yang commented Jan 24, 2024

Sure, I wasn't aware of that API; let me give it a try.

@yf-yang
Author

yf-yang commented Jan 24, 2024

PyArrow's native S3 filesystem works fine.

@yf-yang yf-yang closed this as completed Jan 24, 2024
@yf-yang
Author

yf-yang commented Jan 24, 2024

Reopening because it still does not work.

To reproduce:

dataset = ds.dataset(
  'bucket/parquet_root/', # with slash at the end
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1) # OSError: Not a regular file: 'bucket/parquet_root/'
dataset = ds.dataset(
  'bucket/parquet_root', # remove slash
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1) # pyarrow.lib.ArrowInvalid: Could not open Parquet input source 'bucket/parquet_root': Parquet file size is 0 bytes

When reading a single file, it is OK

dataset = ds.dataset(
  'bucket/parquet_root/abc/def/part-0.parquet',
  format='parquet',
  filesystem=s3fs,
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1)  # ok

As the doc of ds.dataset says:
[screenshot of the pyarrow.dataset.dataset docstring]

When calling s3.get_file_info('bucket/parquet_root'), it returns:

<FileInfo for 'bucket/parquet_root': type=FileType.File, size=0>

I think that is the reason: a directory in S3 is reported as File type, not Directory type.
Related code:

FileObjectToInfo(outcome.GetResult(), &info);

void FileObjectToInfo(const S3Model::HeadObjectResult& obj, FileInfo* info) {
  info->set_type(FileType::File);
  info->set_size(static_cast<int64_t>(obj.GetContentLength()));
  info->set_mtime(FromAwsDatetime(obj.GetLastModified()));
}

If I got it right, in Arrow's S3 filesystem implementation, directories are always treated as files. I did try calling HeadObject with boto3, and its ContentType correctly shows that it is a folder (application/x-directory; not sure what that value is in the C++ S3 client). Should the S3 filesystem implementation be improved?
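For reference, a rough sketch of that boto3 check (the bucket and key are placeholders, and whatever wrote the directory marker must have set the content type for it to appear):

import boto3

s3_client = boto3.client("s3")
# HEAD the zero-byte "directory marker" object behind the prefix
resp = s3_client.head_object(Bucket="bucket", Key="parquet_root/abc/def/")
print(resp["ContentLength"])  # 0
print(resp["ContentType"])    # 'application/x-directory' for directory-like entries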

@yf-yang yf-yang reopened this Jan 24, 2024
@pitrou
Member

pitrou commented Jan 24, 2024

If I got it right, in Arrow's S3 filesystem implementation, directories are always treated as files. I did try calling HeadObject with boto3, and its ContentType correctly shows that it is a folder (application/x-directory; not sure what that value is in the C++ S3 client).

Oh, interesting. That must be something recent? We definitely should detect such cases, and also use that content-type when actually writing out directories.
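For illustration, writing a directory marker with that content type looks roughly like this with boto3 (a sketch with a placeholder bucket and key; Arrow's C++ code would do the equivalent through the AWS SDK):

import boto3

s3_client = boto3.client("s3")
# A "directory" in S3 is just a zero-byte object whose key ends in '/';
# setting this content type lets other tools recognize it as a directory.
s3_client.put_object(
    Bucket="bucket",
    Key="parquet_root/new_dir/",
    Body=b"",
    ContentType="application/x-directory",
)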

@pitrou
Member

pitrou commented Jan 24, 2024

@yf-yang Do you have a public S3 bucket that we can reproduce this issue with?

@yf-yang
Author

yf-yang commented Jan 24, 2024

Nope, and actually I am using neither the AWS S3 nor the MinIO implementation, so I am not sure whether my cloud infra provider's implementation is correct. (However, for this specific case, it seems to be.)

I'll try to borrow one and give you a full reproduction if you need it. I'll come back after a while.

@yf-yang
Author

yf-yang commented Jan 24, 2024

FYI, I found this SO link where several people mention the content type, even in a 2017 answer.

@pitrou
Member

pitrou commented Jan 25, 2024

What I would like to know is what the following returns:

s3.get_file_info(pyarrow.fs.FileSelector('bucket/parquet_root', recursive=True))

@pitrou
Member

pitrou commented Jan 25, 2024

I'll try to borrow one and give you a full reproduction if you need it.

A full reproduction is definitely not needed. An empty dataset with one dummy file that can be accessed remotely would be sufficient if it allows reproducing the issue.

@jorisvandenbossche
Member

What I would like to know is what the following returns:

There is an older comment that shows the output of that: #38794 (comment), although that's still with fsspec's s3fs, not our filesystem (the fact that each directory also shows up as a duplicate zero-size file entry seems a bit suspicious).

@yf-yang
Author

yf-yang commented Jan 29, 2024

What I would like to know is what the following returns:

It is

[
<FileInfo for 'bucket/parquet_root/abc': type=FileType.Directory>,
<FileInfo for 'bucket/parquet_root/abc/def': type=FileType.Directory>,
<FileInfo for 'bucket/parquet_root/abc/def/part-0.parquet': type=FileType.File, size=xxxx>,
]

pitrou added a commit to pitrou/arrow that referenced this issue Feb 19, 2024
…ories

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

Unfortunately, this cannot be tested for MinIO, as it apparently ignores the content-type set on directories (as opposed to files).
pitrou added a commit to pitrou/arrow that referenced this issue Feb 20, 2024
…ories

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

Unfortunately, this cannot be tested for MinIO, as it apparently ignores the content-type set on directories (as opposed to files).
pitrou added a commit to pitrou/arrow that referenced this issue Mar 7, 2024
…ories

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

Unfortunately, this cannot be tested for MinIO, as it apparently ignores the content-type set on directories (as opposed to files).
pitrou added a commit that referenced this issue Mar 7, 2024
…40147)

### Rationale for this change

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

This PR does two things:
1) set the object's content-type to "application/x-directory" when the user explicitly creates a directory
2) when a 0-sized object with content-type starting with "application/x-directory" is encountered, consider it a directory

### Are these changes tested?

Unfortunately, this cannot be tested with MinIO, as it seems to ignore the content-type set on directories (as opposed to regular files).

### Are there any user-facing changes?

Hopefully better compatibility with existing S3 filesystem hierarchies.

* Closes: #38794
* GitHub Issue: #38794

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Mar 7, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…ories (apache#40147)
