Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Footer parsing fails for very large parquet file. #4592

Closed
Berrysoft opened this issue Jul 30, 2023 · 5 comments · Fixed by #4599
Closed

Footer parsing fails for very large parquet file. #4592

Berrysoft opened this issue Jul 30, 2023 · 5 comments · Fixed by #4599
Labels
bug parquet Changes to the parquet crate

Comments

@Berrysoft
Copy link
Contributor

Describe the bug
The footer parser of parquet reader uses i32 incorrectly to get the size of footer.

To Reproduce
It's not that easy to generate such a large parquet file. I have a 44G parquet file generated by pyarrow.

Expected behavior
It should be opened successfully because it is generated by pyarrow:)

Additional context

let metadata_len = i32::from_le_bytes(slice[..4].try_into().unwrap());
metadata_len.try_into().map_err(|_| {
general_err!(
"Invalid Parquet file. Metadata length is less than zero ({})",
metadata_len
)
})

It should be u32 here, and we don't need the error below.

@Berrysoft Berrysoft added the bug label Jul 30, 2023
@tustvold
Copy link
Contributor

tustvold commented Jul 30, 2023

Looking at the parquet-mr repository, the closest thing to an authoritative parquet implementation, the footer length is an i32 - https://github.com/apache/parquet-mr/blob/d2c3c6d2e761a17b75d56fd356a37a2f754072f7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1342C30-L1342C30

This is consistent with parquet in general, which as a result of its Java pedigree tends to use signed quantities everywhere. I suspect arrow-cpp should probably not let you write such a file

@Berrysoft
Copy link
Contributor Author

@mapleFU
Copy link
Member

mapleFU commented Jul 31, 2023

parquet-cpp is moved to arrow, and pyarrow shares the same underlying implemention with parquet cpp.

I'm ok for change it to u32, but I guess a 44Gib is too large for a parquet file, you may also need a huge thrift container for that

@Berrysoft
Copy link
Contributor Author

It's really a huge file🤣, and we have hundreds of such files to query.

pyarrow needs to specify a larger thrift container buffer size. Both pyarrow and arrow-rs needs a long time to parse the metadata.

@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'parquet'} from #4599

@tustvold tustvold added the parquet Changes to the parquet crate label Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants