Footer parsing fails for very large parquet file. #4592
Comments
Looking at the parquet-mr repository, the closest thing to an authoritative parquet implementation, the footer length is an i32: https://github.com/apache/parquet-mr/blob/d2c3c6d2e761a17b75d56fd356a37a2f754072f7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1342C30-L1342C30 This is consistent with parquet in general, which, as a result of its Java pedigree, tends to use signed quantities everywhere. I suspect arrow-cpp should probably not let you write such a file.
parquet-cpp uses u32, so pyarrow also uses u32. I didn't face any warnings or errors when writing such a file with pyarrow.
parquet-cpp has been moved into arrow, and pyarrow shares the same underlying implementation as parquet-cpp. I'm OK with changing it to u32, but I guess 44 GiB is too large for a parquet file; you may also need a huge thrift container for that.
It's really a huge file 🤣, and we have hundreds of such files to query. pyarrow needs a larger thrift container buffer size to be specified. Both pyarrow and arrow-rs need a long time to parse the metadata.
Describe the bug
The footer parser of the parquet reader incorrectly uses i32 to read the size of the footer.
To Reproduce
It's not that easy to generate such a large parquet file. I have a 44G parquet file generated by pyarrow.
Expected behavior
It should be opened successfully because it was generated by pyarrow :)
Additional context
arrow-rs/parquet/src/file/footer.rs
Lines 106 to 112 in 16744e5
It should be u32 here, and we don't need the error below.
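To illustrate the difference, here is a minimal sketch, not the actual arrow-rs code from footer.rs. It assumes only the standard Parquet file tail: a 4-byte little-endian metadata length followed by the magic bytes PAR1. The helper names footer_len_signed and footer_len_unsigned are hypothetical. It shows how a metadata length above i32::MAX is rejected when decoded as i32 but accepted when decoded as u32.

```rust
/// Magic bytes that terminate a Parquet file.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper (not the arrow-rs API): decodes the metadata length
/// from the 8-byte file tail as a signed 32-bit integer.
fn footer_len_signed(tail: &[u8; 8]) -> Result<usize, String> {
    if tail[4..] != PARQUET_MAGIC[..] {
        return Err("corrupt footer: bad magic".to_string());
    }
    let len = i32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if len < 0 {
        // Any length above i32::MAX wraps to a negative value and lands here.
        return Err(format!("invalid metadata length: {len}"));
    }
    Ok(len as usize)
}

/// Hypothetical helper: same decoding, but unsigned, so no negative check
/// is needed.
fn footer_len_unsigned(tail: &[u8; 8]) -> Result<usize, String> {
    if tail[4..] != PARQUET_MAGIC[..] {
        return Err("corrupt footer: bad magic".to_string());
    }
    let len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    Ok(len as usize)
}

fn main() {
    // Build a fake file tail announcing 3 GiB of metadata (above i32::MAX).
    let metadata_len: u32 = 3 * 1024 * 1024 * 1024;
    let mut tail = [0u8; 8];
    tail[..4].copy_from_slice(&metadata_len.to_le_bytes());
    tail[4..].copy_from_slice(&PARQUET_MAGIC);

    // The signed decoder rejects the length; the unsigned one accepts it.
    assert!(footer_len_signed(&tail).is_err());
    assert_eq!(footer_len_unsigned(&tail), Ok(3_221_225_472));
}
```

Whether readers should accept such lengths at all is the open question in this thread, since parquet-mr treats the footer length as a signed value.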