
[Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix #38794

Closed
yf-yang opened this issue Nov 20, 2023 · 20 comments · Fixed by #40147

Comments

@yf-yang

yf-yang commented Nov 20, 2023

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow == 14.0.1

How the parquet dataset was created:

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

s3fs = s3fs.S3FileSystem()
df = pl.DataFrame()  # presumably contains 'set' and 'subset' string columns (here with values 'abc' and 'def')
ds.write_dataset(
    df.to_arrow(),
    "s3://bucket/parquet_root",
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.partitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])),
    existing_data_behavior='delete_matching'
)

After this call, the path parquet_root/abc/def/part-0.parquet exists in the bucket.
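For illustration, one way to confirm that layout is with the same s3fs client (a minimal sketch; the bucket name is a placeholder and default credentials are assumed):

import s3fs

fs = s3fs.S3FileSystem()
# List every object under the dataset root written above
print(fs.find("bucket/parquet_root"))
# expected to include 'bucket/parquet_root/abc/def/part-0.parquet'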

Trying to access the parquet dataset:

NOTE: the same API call changes behavior after I call s3fs.isdir in between, which is also weird.

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs

s3fs = s3fs.S3FileSystem()
pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok

pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: Could not open Parquet input source 's3://bucket/parquet_root/': Parquet file size is 0 bytes

# after I manually call s3fs.isdir, things change; I suspect this is another bug
s3fs.isdir('s3://bucket/parquet_root/') # True

# repeat the call
pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'

# another try, the same error
dataset = ds.dataset(
  's3://bucket/parquet_root/',
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'

Component(s)

Python

@yf-yang yf-yang changed the title Fail to open partitioned parquet with s3fs + pyarrow [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow Nov 20, 2023
@yf-yang yf-yang changed the title [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix Nov 20, 2023
@yf-yang
Author

yf-yang commented Nov 20, 2023

I did see #26864; it seems they are similar issues, but this one is not fixed.

@mapleFU
Member

mapleFU commented Nov 20, 2023

Would the info here ( #26864 (comment) ) help?

@yf-yang
Author

yf-yang commented Nov 20, 2023

@mapleFU As I learned from some Stack Overflow answers, I did try different prefixes with the path, for example:

without s3://

pq.ParquetDataset('bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# FileNotFoundError: bucket/parquet_root/

or, with a leading / only

pq.ParquetDataset('/bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir '/bucket/parquet_root/'

#26864 (comment) works because, as I showed in this issue:

pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok

Both approaches are OK, so the call with the flag use_pandas_metadata=True is also OK. However, I still do not know what to do with a partitioned dataset.


I also tried to reproduce #26864 (comment)

from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler
fs = PyFileSystem(FSSpecHandler(s3fs))
selector = FileSelector("s3://bucket/parquet_root/", recursive=True)
fs.get_file_info(selector)

It returns

[
  <FileInfo for 'bucket/parquet_root': type=FileType.Directory>, 
  <FileInfo for 'bucket/parquet_root/': type=FileType.File, size=0>, 
  <FileInfo for 'bucket/parquet_root/abc': type=FileType.Directory>, 
  <FileInfo for 'bucket/parquet_root/abc/': type=FileType.File, size=0>, 
  <FileInfo for 'bucket/parquet_root/abc/def': type=FileType.Directory>, 
  <FileInfo for 'bucket/parquet_root/abc/def/': type=FileType.File, size=0>, 
  <FileInfo for 'bucket/parquet_root/abc/def/part-0.parquet': type=FileType.File, size=1078092>
]

Not sure if it helps

@yf-yang
Author

yf-yang commented Nov 21, 2023

Additional info:
Using s3fs-fuse to mount the bucket at /mnt/s3, then calling:

dataset = ds.dataset(
  '/mnt/s3/parquet_root/',
  format='parquet', 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)

Things work as expected.


Therefore, it should be something related to the interaction between the s3fs Python package's filesystem interface and pyarrow.

@mapleFU
Member

mapleFU commented Nov 21, 2023

Yeah, can you use that as a workaround for now? I'm a bit busy these days and may dive into it this weekend. I guess this is s3fs discovery-related behavior.

@yf-yang
Author

yf-yang commented Nov 22, 2023

Sure, thank you for your attention!

@mapleFU
Member

mapleFU commented Nov 26, 2023

@pitrou

#35440

I think the reason here is that FileListerTask::ToFileInfos doesn't add the s3:// prefix. Should we add the s3 prefix in FileListerTask, or should we wrap GetFileInfo(const Selector&) with a prefix?

@yf-yang
Author

yf-yang commented Jan 23, 2024

@mapleFU @pitrou Any updates? I guess this one is not so difficult to solve

@pitrou
Member

pitrou commented Jan 23, 2024

Well, the interactions between PyArrow and the third-party s3fs project are complicated. You could use PyArrow's built-in S3 filesystem instead.

cc @jorisvandenbossche
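A minimal sketch of that suggestion, using PyArrow's built-in S3 filesystem (the region and bucket are placeholders, and credentials are assumed to come from the environment):

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

# PyArrow's native S3 filesystem instead of the third-party s3fs package
s3 = fs.S3FileSystem(region="us-east-1")  # hypothetical region

dataset = ds.dataset(
    "bucket/parquet_root",  # no "s3://" prefix when a filesystem is passed explicitly
    format="parquet",
    filesystem=s3,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
print(dataset.head(1))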

@yf-yang
Author

yf-yang commented Jan 24, 2024

Sure, I wasn't aware of that API; let me give it a try.

@yf-yang
Author

yf-yang commented Jan 24, 2024

PyArrow's native S3 filesystem works fine.

@yf-yang yf-yang closed this as completed Jan 24, 2024
@yf-yang
Author

yf-yang commented Jan 24, 2024

Reopening because it still does not work.

To reproduce:

dataset = ds.dataset(
  'bucket/parquet_root/', # with slash at the end
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1) # OSError: Not a regular file: 'bucket/parquet_root/'
dataset = ds.dataset(
  'bucket/parquet_root', # remove slash
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1) # pyarrow.lib.ArrowInvalid: Could not open Parquet input source 'bucket/parquet_root': Parquet file size is 0 bytes

When reading a single file, it is OK

dataset = ds.dataset(
  'bucket/parquet_root/abc/def/part-0.parquet',
  format='parquet',
  filesystem=s3fs,
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1)  # ok

As the doc of ds.dataset says:
[screenshot of the pyarrow.dataset.dataset docstring]

When calling s3.get_file_info('bucket/parquet_root'), it returns:

<FileInfo for 'bucket/parquet_root': type=FileType.File, size=0>

I think that is the reason: a directory in S3 is reported as File type, not Directory type.
Related code:

FileObjectToInfo(outcome.GetResult(), &info);

void FileObjectToInfo(const S3Model::HeadObjectResult& obj, FileInfo* info) {
  info->set_type(FileType::File);
  info->set_size(static_cast<int64_t>(obj.GetContentLength()));
  info->set_mtime(FromAwsDatetime(obj.GetLastModified()));
}

If I got it right, in Arrow's S3 filesystem implementation, directories are always treated as files. I did try calling HeadObject with boto3, and its ContentType correctly shows that it is a folder (application/x-directory; not sure what that value is in the C++ S3 client). Should the S3 filesystem implementation be improved?
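For reference, a rough sketch of that boto3 check (the bucket and key are placeholders, and whatever wrote the directory marker must have set the content type for it to appear):

import boto3

s3_client = boto3.client("s3")
# HEAD the zero-byte "directory marker" object behind the prefix
resp = s3_client.head_object(Bucket="bucket", Key="parquet_root/abc/def/")
print(resp["ContentLength"])  # 0
print(resp["ContentType"])    # 'application/x-directory' for directory-like entries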

@yf-yang yf-yang reopened this Jan 24, 2024
@pitrou
Member

pitrou commented Jan 24, 2024

If I got it right, in Arrow's S3 filesystem implementation, directories are always treated as files. I did try calling HeadObject with boto3, and its ContentType correctly shows that it is a folder (application/x-directory; not sure what that value is in the C++ S3 client).

Oh, interesting. That must be something recent? We definitely should detect such cases, and also use that content-type when actually writing out directories.
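For illustration, writing a directory marker with that content type looks roughly like this with boto3 (a sketch with a placeholder bucket and key; Arrow's C++ code would do the equivalent through the AWS SDK):

import boto3

s3_client = boto3.client("s3")
# A "directory" in S3 is just a zero-byte object whose key ends in '/';
# setting this content type lets other tools recognize it as a directory.
s3_client.put_object(
    Bucket="bucket",
    Key="parquet_root/new_dir/",
    Body=b"",
    ContentType="application/x-directory",
)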

@pitrou
Member

pitrou commented Jan 24, 2024

@yf-yang Do you have a public S3 bucket that we can reproduce this issue with?

@yf-yang
Author

yf-yang commented Jan 24, 2024

Nope, and actually I am using neither the AWS S3 nor the MinIO implementation, so I am not sure whether my cloud infra provider's implementation is correct. (However, for this specific case, it seems to be.)

I'll try to borrow one and give you a full reproduction if you need it. I'll come back after a while.

@yf-yang
Author

yf-yang commented Jan 24, 2024

FYI, I found this SO link where several people mention the content type, even in a 2017 answer.

@pitrou
Member

pitrou commented Jan 25, 2024

What I would like to know is what the following returns:

s3.get_file_info(pyarrow.fs.FileSelector('bucket/parquet_root', recursive=True))

@pitrou
Member

pitrou commented Jan 25, 2024

I'll try to borrow one and give you a full reproduction if you need it.

A full reproduction is definitely not needed. An empty dataset with one dummy file that can be accessed remotely would be sufficient if it allows reproducing the issue.

@jorisvandenbossche
Member

What I would like to know is what the following returns:

There is an older comment that shows the output of that: #38794 (comment), although that's still with fsspec's s3fs, not our filesystem (the fact that each directory also shows up as a duplicate zero-size file entry seems a bit suspicious).

@yf-yang
Author

yf-yang commented Jan 29, 2024

What I would like to know is what the following returns:

It is

[
<FileInfo for 'bucket/parquet_root/abc': type=FileType.Directory>,
<FileInfo for 'bucket/parquet_root/abc/def': type=FileType.Directory>,
<FileInfo for 'bucket/parquet_root/abc/def/part-0.parquet': type=FileType.File, size=xxxx>,
]

pitrou added a commit to pitrou/arrow that referenced this issue Feb 19, 2024
…ories

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

Unfortunately, this cannot be tested for MinIO, as it apparently ignores the content-type set on directories (as opposed to files).
pitrou added a commit to pitrou/arrow that referenced this issue Feb 20, 2024
…ories

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

Unfortunately, this cannot be tested for MinIO, as it apparently ignores the content-type set on directories (as opposed to files).
pitrou added a commit to pitrou/arrow that referenced this issue Mar 7, 2024
…ories

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

Unfortunately, this cannot be tested for MinIO, as it apparently ignores the content-type set on directories (as opposed to files).
pitrou added a commit that referenced this issue Mar 7, 2024
…40147)

### Rationale for this change

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries.

This PR does two things:
1) set the object's content-type to "application/x-directory" when the user explicitly creates a directory
2) when a 0-sized object with content-type starting with "application/x-directory" is encountered, consider it a directory

### Are these changes tested?

Unfortunately, this cannot be tested with MinIO, as it seems to ignore the content-type set on directories (as opposed to regular files).

### Are there any user-facing changes?

Hopefully better compatibility with existing S3 filesystem hierarchies.

* Closes: #38794
* GitHub Issue: #38794

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Mar 7, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…ories (apache#40147)
