I'm not sure whether this issue belongs here or in the s3fs library.
The error occurs when reading a partitioned Parquet dataset from S3 if the "root folder" of the dataset was created manually before the data was written there.
Steps to reproduce:
# 1. Create "folder" s3://bucket.name/data.parquet in e.g. cyberduck app
# 2. Write
table = pa.Table.from_pandas(df)
pq.write_table(table, 's3://bucket.name/data.parquet', partition_cols=[], filesystem=s3fs.S3FileSystem())
# 3. Read
pq.read_table('s3://bucket.name/data.parquet', filesystem=s3fs.S3FileSystem())
# ArrowIOError: Invalid Parquet file size is 0 bytes
When the table is partitioned by a non-empty set of columns, the error instead reads: "ValueError: Found files in an intermediate directory".
This is likely because S3 does not have "folders" per se, and various tools mimic the creation of an empty folder by writing an empty (zero-size) object to S3. The Parquet reader then confuses this object with the actual contents of the Parquet file.
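To illustrate the suspected cause, here is a minimal sketch that works on a hand-written listing instead of a real bucket. The helper `drop_folder_markers`, the `Key`/`Size` dictionary layout, and the rule for spotting a marker are assumptions made for illustration only; none of this is pyarrow or s3fs code.

```python
from typing import Dict, List


def drop_folder_markers(objects: List[Dict], root: str) -> List[Dict]:
    """Return the listed objects that are not zero-byte "folder" markers.

    Hypothetical helper: `objects` imitates an S3 listing, where each entry
    has a "Key" and a "Size". An entry is treated as a folder marker when it
    is empty and its key either ends with "/" or equals the dataset root.
    """
    root = root.rstrip("/")
    kept = []
    for obj in objects:
        key = obj["Key"]
        is_marker = obj["Size"] == 0 and (
            key.endswith("/") or key.rstrip("/") == root
        )
        if not is_marker:
            kept.append(obj)
    return kept


# Listing as it might look after the "folder" was created manually and a
# single data file was then written under it (made-up example values).
listing = [
    {"Key": "data.parquet/", "Size": 0},
    {"Key": "data.parquet/part-0.parquet", "Size": 1024},
]

data_files = drop_folder_markers(listing, "data.parquet")
print([obj["Key"] for obj in data_files])
```

A reader that skipped the empty marker in this way would only try to open the real data file, which is consistent with the "file size is 0 bytes" error coming from the marker object.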
At the same time, the s3fs library correctly identifies the key as a folder:
Wes McKinney / @wesm:
Seems that the pyarrow.parquet.ParquetDataset logic is fooled by the Parquet-file-like directory name. I suspect this would be relatively easy to test and fix.
Reporter: Vladimir
Note: This issue was originally created as ARROW-7867. Please see the migration documentation for further details.