[Python] ArrowIOError: Invalid Parquet file size is 0 bytes on reading from S3 #24093

Open
asfimport opened this issue Feb 16, 2020 · 2 comments

@asfimport

I'm not sure whether this issue belongs here or in the s3fs library.

The error occurs when reading a partitioned Parquet dataset from S3 if the "root folder" of the dataset was created manually before the data was written there.

Steps to reproduce:

 

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

# 1. Create "folder" s3://bucket.name/data.parquet in e.g. the Cyberduck app

# 2. Write (df is any pandas DataFrame)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 's3://bucket.name/data.parquet', partition_cols=[], filesystem=s3fs.S3FileSystem())

# 3. Read
pq.read_table('s3://bucket.name/data.parquet', filesystem=s3fs.S3FileSystem())
# ArrowIOError: Invalid Parquet file size is 0 bytes

When the table is partitioned by a non-empty set of columns, the error instead reads: "ValueError: Found files in an intermediate directory".

This is likely because S3 does not have "folders" per se, and various tools mimic the creation of an empty folder by writing an empty (zero-size) object to S3. The Parquet reader then appears to mistake this marker object for part of the dataset's actual contents.
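To illustrate why such a marker cannot be read as Parquet: an unencrypted Parquet file starts and ends with the 4-byte magic `PAR1` and stores a 4-byte footer length just before the trailing magic, so it is never shorter than 12 bytes. The following is a minimal sketch of that size-and-magic check. `looks_like_parquet` is a hypothetical helper introduced here for illustration, not a pyarrow function, and it does not validate the footer itself.

```python
PARQUET_MAGIC = b"PAR1"
# Leading magic (4) + footer length (4) + trailing magic (4).
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data: bytes) -> bool:
    """Cheap sanity check: long enough and framed by the PAR1 magic.

    Hypothetical helper for illustration only; a real reader also
    parses the footer metadata.
    """
    if len(data) < MIN_PARQUET_SIZE:
        return False
    return data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC


# A "folder" created by an S3 client is typically a zero-byte object.
folder_marker = b""
print(looks_like_parquet(folder_marker))
```

A zero-byte object fails the length check immediately, which is consistent with the "file size is 0 bytes" message in the report.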

At the same time, the s3fs library correctly identifies the key as a folder:

s3fs.S3FileSystem().isdir('s3://bucket.name/data.parquet')  # Returns True
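One possible workaround on the reading side is to list the objects under the prefix, drop the zero-byte ones, and pass the remaining paths explicitly. The sketch below assumes a recent fsspec-style filesystem whose `find(path, detail=True)` returns a mapping from path to an info dict with a `size` entry. `list_data_files` is a hypothetical helper, and the approach has not been verified against the exact setup in this report.

```python
def list_data_files(fs, root):
    """Return paths under `root` that hold data, skipping zero-byte objects.

    `fs` is assumed to be an fsspec-style filesystem (for example
    s3fs.S3FileSystem) whose find(..., detail=True) returns
    {path: info_dict} with a "size" key. Hypothetical helper.
    """
    infos = fs.find(root, detail=True)
    return sorted(
        path for path, info in infos.items() if (info.get("size") or 0) > 0
    )


# Hypothetical usage (needs pyarrow, s3fs and S3 credentials):
#   import pyarrow.parquet as pq
#   import s3fs
#   fs = s3fs.S3FileSystem()
#   files = list_data_files(fs, "bucket.name/data.parquet")
#   table = pq.ParquetDataset(files, filesystem=fs).read()
```

Note that passing an explicit file list may bypass directory-based partition discovery, so this is more of a diagnostic aid than a fix for partitioned datasets.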

 

Reporter: Vladimir

Note: This issue was originally created as ARROW-7867. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
Seems that the pyarrow.parquet.ParquetDataset logic is fooled by the Parquet-file-like directory name. I suspect this would be relatively easy to test and fix.

@asfimport

Wes McKinney / @wesm:
cc @jorisvandenbossche
