[Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix #38794
I did see #26864; it seems they are similar issues, but this one is not fixed.
Would the info here (#26864 (comment)) help?
@mapleFU As I learned from some Stack Overflow answers, I did try different prefixes with the path, for example, without the `s3://` prefix.
Additional info: when reading through a local mount,

```python
dataset = ds.dataset(
    '/mnt/s3/parquet_root/',
    format='parquet',
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
```

things work as expected. Therefore, it should be something related to the `s3fs` filesystem.
Yeah, can you use that as a workaround for now? I'm a bit busy these days, and I may dive into it this weekend. I guess this is s3fs discovery-related behavior.
Sure, thank you for your attention!
Well, the interactions between PyArrow and the third-party `s3fs` are not fully under our control. Can you try PyArrow's native S3 filesystem instead?
Sure, I don't know that API; let me have a try.
PyArrow's native S3 filesystem is fine.
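For reference, a minimal sketch of the working call with PyArrow's built-in S3 filesystem; the bucket name, region, and credential handling here are assumptions, not from the thread:

```python
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

# Placeholder region; credentials are resolved from the environment.
fs = S3FileSystem(region='us-east-1')

dataset = ds.dataset(
    'bucket/parquet_root',  # placeholder bucket/prefix
    format='parquet',
    filesystem=fs,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
print(dataset.head(1))
```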
Reopening because it still does not work. To reproduce:

```python
dataset = ds.dataset(
    'bucket/parquet_root/',  # with slash at the end
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1)  # OSError: Not a regular file: 'bucket/parquet_root/'
```

```python
dataset = ds.dataset(
    'bucket/parquet_root',  # remove slash
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1)  # pyarrow.lib.ArrowInvalid: Could not open Parquet input source 'bucket/parquet_root': Parquet file size is 0 bytes
```

When reading a single file, it is OK:

```python
dataset = ds.dataset(
    'bucket/parquet_root/abc/def/part-0.parquet',
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
dataset.head(1)
```

As the doc of `dataset` says, the source path is inspected via the filesystem. Calling `get_file_info` on the base path returns:

```
<FileInfo for 'bucket/parquet_root': type=FileType.File, size=0>
```

I think that is the reason: the directory in S3 is reported as a 0-size file. See:

arrow/cpp/src/arrow/filesystem/s3fs.cc, line 2644 in df83e50
arrow/cpp/src/arrow/filesystem/s3fs.cc, lines 1746 to 1750 in df83e50

If I got it right, in Arrow's S3 filesystem implementation, all directories are always treated as files. Actually, I did inspect the objects: the directory entries in my bucket are 0-sized objects with the content-type `application/x-directory`.
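A hypothetical way to confirm that content-type with fsspec's `s3fs`; the bucket and key are placeholders, and the exact metadata keys may vary by store:

```python
import s3fs as s3fs_module

s3 = s3fs_module.S3FileSystem()
# HEAD the zero-byte "directory" object and inspect its metadata.
info = s3.info('bucket/parquet_root/abc')
print(info)
# e.g. {'Key': 'bucket/parquet_root/abc', 'size': 0,
#       'ContentType': 'application/x-directory', ...}
```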
Oh, interesting. That must be something recent? We definitely should detect such cases, and also use that content-type when actually writing out directories.
@yf-yang Do you have a public S3 bucket that we can reproduce this issue with?
Nope, and actually I am using neither AWS S3 nor a MinIO implementation, so I am not sure whether my cloud infra provider's implementation is correct. (However, for this specific case, it seems it is.) I'll try to borrow one and give you a full reproduction if you need it. I'll come back after a while.
FYI, I found this SO link, and several people mention the content-type, even in a 2017 answer.
What I would like to know is what the following returns:
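The snippet itself was lost in extraction; judging from the reply below, it was presumably a recursive listing along these lines (a sketch — the wrapper and selector arguments are assumptions):

```python
import s3fs as s3fs_module
from pyarrow.fs import FileSelector, FSSpecHandler, PyFileSystem

# Wrap the fsspec filesystem the way pyarrow.dataset does internally.
fs = PyFileSystem(FSSpecHandler(s3fs_module.S3FileSystem()))
for info in fs.get_file_info(FileSelector('bucket/parquet_root', recursive=True)):
    print(info)
```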
A full reproduction is definitely not needed. An empty dataset with one dummy file that can be accessed remotely would be sufficient, if it allows reproducing the issue.
There is an older comment that shows the output of that: #38794 (comment), although that's still with fsspec's s3fs, not our filesystem (the fact that there is each time a duplicate file and directory entry for a directory seems a bit suspicious).
It is:

```
[
    <FileInfo for 'bucket/parquet_root/abc': type=FileType.Directory>,
    <FileInfo for 'bucket/parquet_root/abc/def': type=FileType.Directory>,
    <FileInfo for 'bucket/parquet_root/abc/def/part-0.parquet': type=FileType.File, size=xxxx>,
]
```
…40147)

### Rationale for this change

Some AWS-related tools write and expect the content-type "application/x-directory" for directory-like entries. This PR does two things:

1) set the object's content-type to "application/x-directory" when the user explicitly creates a directory
2) when a 0-sized object with a content-type starting with "application/x-directory" is encountered, consider it a directory

### Are these changes tested?

Unfortunately, this cannot be tested with MinIO, as it seems to ignore the content-type set on directories (as opposed to regular files).

### Are there any user-facing changes?

Hopefully better compatibility with existing S3 filesystem hierarchies.

* Closes: #38794
* GitHub Issue: #38794

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
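For illustration, the directory-marker convention the PR detects can be produced with, e.g., boto3 (a sketch; the bucket and key are placeholders):

```python
import boto3

s3 = boto3.client('s3')
# A zero-byte object with this content-type is what "directory-aware"
# S3 tools write as a directory marker.
s3.put_object(
    Bucket='bucket',
    Key='parquet_root/abc/',  # trailing slash marks a "directory"
    Body=b'',
    ContentType='application/x-directory',
)
```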
Describe the bug, including details regarding any error messages, version, and platform.
pyarrow == 14.0.1
How the parquet is created:
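The original snippet did not survive extraction; given the resulting path `parquet_root/abc/def/part-0.parquet`, it plausibly resembled the following (a guess — the dummy table and bucket name are made up):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs as s3fs_module

s3 = s3fs_module.S3FileSystem()
# Dummy row whose partition columns yield the abc/def directory levels.
table = pa.table({"set": ["abc"], "subset": ["def"], "value": [1]})
ds.write_dataset(
    table,
    'bucket/parquet_root',
    format='parquet',
    filesystem=s3,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
```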
After the action, the path `parquet_root/abc/def/part-0.parquet` exists in the bucket.

Try to access the parquet:
NOTE: the same API call exhibits a behavior change after I call `s3fs.isdir` in between; that is also weird.

Component(s)
Python