Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In general, this crate should return an error on invalid data rather than panic:
https://github.com/apache/arrow-rs?tab=readme-ov-file#guidelines-for-panic-vs-result
> For those caused by invalid user input, however, we prefer to report that invalidity gracefully as an error result instead of panicking. In general, invalid input should result in an Error as soon as possible.
However, we keep hitting various code paths in parquet that panic.
Since these paths are only reached with a corrupt or invalid data source, it is hard to write tests for them.
For example, here is a test that @xuzifu666 added for one such error: 0bb9942
However, I think it would be hard to maintain over the long run, as the programmatic generation of bad data will be brittle (if we change how the Thrift is written, for example, the truncation may go down a different path).
Describe the solution you'd like
I think we should consider some sort of parquet fuzzer that generates randomly corrupted data and ensures that the reader returns an error (rather than panicking). It would be nice if it created some parquet files and then applied common kinds of data corruption:
- Truncate the data (remove bytes from end of the file)
- Truncate the data (remove bytes from the start of the file)
- Switch a random bit
- Set a random range of the file to all zeros
There are probably other good corruptions we could add.
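To make the idea concrete, here is a rough sketch of the four corruptions listed above, plus a loop that counts panics. It is only an illustration: `XorShift`, `Corruption`, `corrupt`, and `count_panics` are hypothetical names, not existing arrow-rs APIs, and the `read` closure is a stand-in for whatever parquet reader entry point would actually be exercised. It uses only the standard library and a tiny seeded PRNG so that failures are reproducible.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Tiny deterministic xorshift64 PRNG so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state
        XorShift(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in `0..bound`; `bound` must be non-zero.
    fn below(&mut self, bound: usize) -> usize {
        (self.next() % bound as u64) as usize
    }
}

/// The corruptions proposed in this issue (hypothetical enum).
#[derive(Debug, Clone, Copy)]
enum Corruption {
    TruncateEnd,
    TruncateStart,
    FlipBit,
    ZeroRange,
}

/// Returns a corrupted copy of `data`; empty input is returned unchanged.
fn corrupt(data: &[u8], kind: Corruption, rng: &mut XorShift) -> Vec<u8> {
    let mut out = data.to_vec();
    if out.is_empty() {
        return out;
    }
    let len = out.len();
    match kind {
        Corruption::TruncateEnd => {
            let n = 1 + rng.below(len); // remove 1..=len bytes
            out.truncate(len - n);
        }
        Corruption::TruncateStart => {
            let n = 1 + rng.below(len); // remove 1..=len bytes
            out.drain(..n);
        }
        Corruption::FlipBit => {
            let i = rng.below(len);
            out[i] ^= 1u8 << rng.below(8);
        }
        Corruption::ZeroRange => {
            let start = rng.below(len);
            let end = start + 1 + rng.below(len - start); // start+1..=len
            out[start..end].fill(0);
        }
    }
    out
}

/// Runs `read` (a stand-in for the real reader) on corrupted copies of
/// `valid` and returns how many runs panicked. `Ok` and `Err` are both
/// acceptable outcomes; only a panic counts as a failure.
fn count_panics<F>(valid: &[u8], iterations: usize, seed: u64, read: F) -> usize
where
    F: Fn(&[u8]) -> Result<(), String>,
{
    let kinds = [
        Corruption::TruncateEnd,
        Corruption::TruncateStart,
        Corruption::FlipBit,
        Corruption::ZeroRange,
    ];
    let mut rng = XorShift::new(seed);
    let mut panics = 0;
    for i in 0..iterations {
        let bad = corrupt(valid, kinds[i % kinds.len()], &mut rng);
        let outcome = catch_unwind(AssertUnwindSafe(|| {
            let _ = read(&bad);
        }));
        if outcome.is_err() {
            panics += 1;
        }
    }
    panics
}
```

A test could write a small valid parquet file into a `Vec<u8>`, pass it to `count_panics` with a closure that opens and fully reads the bytes, and assert that the result is `0`. Note that `catch_unwind` only works when panics unwind, so this approach would not apply to builds using `panic = "abort"`.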
Describe alternatives you've considered
Additional context
Related to