Expose SortingColumn when reading and writing parquet metadata
#3090
Labels: enhancement, good first issue, parquet
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.
The parquet file format contains a way to encode the sortedness of the data stored there, using a "SortingColumn" in the format:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698
This is then referenced in the RowGroup metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
However, I did not find any code to read or write this metadata in the parquet crate yet:
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard
Describe the solution you'd like
I would like some way to provide the parquet writer with the SortingColumn list when creating RowGroupMetaData. Perhaps we could add something to WriterProperties:
https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html
Likewise, I would like a way to get the relevant SortingColumn list from RowGroupMetaData:
https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html
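To make the request concrete, here is a rough sketch of what such an API could look like. Everything in it is a stand-in written for illustration: the `WriterProperties`, `WriterPropertiesBuilder`, and `RowGroupMetaData` types below are simplified models, not the parquet crate's real types, and the `set_sorting_columns`, `from_properties`, and `sorting_columns` methods are hypothetical names. Only the three `SortingColumn` fields come from the Thrift definition linked above.

```rust
/// Mirrors the fields of the Thrift `SortingColumn` struct in parquet.thrift.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column within the row group.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls come before non-null values.
    pub nulls_first: bool,
}

/// Simplified stand-in for the writer properties (hypothetical, not the real type).
#[derive(Debug, Clone, Default)]
pub struct WriterProperties {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterProperties {
    pub fn builder() -> WriterPropertiesBuilder {
        WriterPropertiesBuilder::default()
    }

    /// The sort order the caller declared, if any.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}

#[derive(Debug, Default)]
pub struct WriterPropertiesBuilder {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesBuilder {
    /// Hypothetical setter: the caller asserts the data is sorted this way.
    pub fn set_sorting_columns(mut self, cols: Option<Vec<SortingColumn>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    pub fn build(self) -> WriterProperties {
        WriterProperties { sorting_columns: self.sorting_columns }
    }
}

/// Simplified stand-in for the row group metadata (hypothetical, not the real type).
#[derive(Debug, Clone)]
pub struct RowGroupMetaData {
    num_rows: i64,
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaData {
    /// What a writer might do when closing a row group: copy the declared order.
    pub fn from_properties(num_rows: i64, props: &WriterProperties) -> Self {
        Self {
            num_rows,
            sorting_columns: props.sorting_columns().map(|c| c.to_vec()),
        }
    }

    pub fn num_rows(&self) -> i64 {
        self.num_rows
    }

    /// Hypothetical read-side accessor; `None` means no order was recorded.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}
```

In this sketch the writer never inspects the data; it simply records what the caller declared. Keeping the field optional matches the Thrift definition, where `sorting_columns` on `RowGroup` is optional, so files written without the property would read back as `None`.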
Describe alternatives you've considered
It might be worth considering having the parquet writer determine automatically whether the data is sorted (maybe this would be better than making the caller verify it). However, verifying in the writer would likely be a significant performance hit.
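As a rough illustration of where that cost would come from, a verifying writer would need something like the check below for every declared sorting column: one comparison per adjacent pair of rows. This is only a sketch under simplifying assumptions (a single nullable `i64` column held in memory); `SortOrder`, `compare`, and `is_sorted_by_order` are hypothetical names, not part of the parquet crate.

```rust
use std::cmp::Ordering;

/// Sort order for one column: the same two flags as the Thrift `SortingColumn`.
#[derive(Debug, Clone, Copy)]
pub struct SortOrder {
    pub descending: bool,
    pub nulls_first: bool,
}

/// Compare two nullable values under the given order.
fn compare(a: Option<i64>, b: Option<i64>, order: SortOrder) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        // A null sorts before any value only when `nulls_first` is set.
        (None, Some(_)) => {
            if order.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if order.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // For descending order, swap the operands of the comparison.
        (Some(x), Some(y)) => {
            if order.descending { y.cmp(&x) } else { x.cmp(&y) }
        }
    }
}

/// Returns true if every adjacent pair of values respects `order`.
/// This touches every row once, which is the overhead a verifying writer would pay.
pub fn is_sorted_by_order(values: &[Option<i64>], order: SortOrder) -> bool {
    values
        .windows(2)
        .all(|w| compare(w[0], w[1], order) != Ordering::Greater)
}
```

With multiple sorting columns the check becomes a lexicographic comparison across columns, and the values may need to be decoded first, so the real cost would likely be higher than this single-column version suggests.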
Additional context
DataFusion is getting more sophisticated in its ability to track and use sortedness information (e.g. apache/datafusion#4122). If this metadata were included in the parquet file, DataFusion might be able to take more advantage of it: apache/datafusion#4177.
There is more discussion of this topic in apache/datafusion#4169 (comment).