[C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files #35331
Comments
@ei-grad thanks for opening the issue! I think that would be a welcome enhancement. Note that this is also not yet exposed in the C++
I can help with the C++ part. Will finish it tonight.
Drafted a PR, will test it later: https://github.com/apache/arrow/pull/35351/files
…35351)
### Rationale for this change
Allow reading and setting SortColumns in C++ Parquet. Note that sort columns are currently not checked, so the user should ensure that records don't violate the declared order.
### What changes are included in this PR?
For RowGroupMetadata, add a SortColumns interface.
### Are these changes tested?
* [x] tests
### Are there any user-facing changes?
Users will be able to read sort columns in the future.
* Closes: #35331
Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 Hi Will, I only solved the C++ part. Should we reuse this issue, or should I create another issue for C++/Python and use that one?
Ah sorry, I didn't notice that. I think we generally want one-to-one correspondence between issues and pull requests, so I should have created a separate sub-issue for the C++ part. I think it's fine for now if we re-use the issue, right @jorisvandenbossche?
Yes, let's just re-use it.
I'm not familiar with the Python part. Would you mind doing this, or pointing me to code I can use for reference?
I can work on that soon.
Please don't assign this to me =_=
@judahrand would you mind replying here? We can only assign the issue to someone who has commented on it.
### Rationale for this change
Picking up where #35453 left off. Closes #35331
This PR builds on top of #37469
### What changes are included in this PR?
### Are these changes tested?
### Are there any user-facing changes?
* Closes: #35331
Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Sure!
Summary
Currently, the `pyarrow.parquet.RowGroupMetaData` class does not expose the `sorting_columns` information available in the Parquet format's `RowGroup` struct. This information is useful for users who need to understand the local sorting order of columns within each RowGroup. It would be beneficial to expose this information in the `RowGroupMetaData` class.
Details
The Parquet format includes an optional `sorting_columns` field in the `RowGroup` struct, which stores information about the sorting order of columns within the RowGroup. Each entry is a `SortingColumn` struct, defined in the `parquet.thrift` file, with three required fields: `column_idx` (the column index within the row group), `descending` (whether the column is sorted in descending order), and `nulls_first` (whether nulls come before non-null values). In the `RowGroup` struct, `sorting_columns` is declared as an optional list of `SortingColumn` entries.
However, the `pyarrow.parquet.RowGroupMetaData` class does not expose this information. As a result, users cannot access the local sorting information of columns within RowGroups.
Proposal
I propose adding a new method or property in the `RowGroupMetaData` class to expose the `sorting_columns` information. This could be implemented as a new method, such as `get_sorting_columns()`, or as a property, such as `sorting_columns`. The output should include the column index, sorting order (ascending or descending), and whether null values appear first or last in the sorted order.
Use Case
Users working with sorted Parquet files can benefit from understanding the local sorting order of columns within RowGroups. This information is particularly useful when analyzing large datasets or performing operations that require knowledge of the sort order, such as range queries, filtering, or merging.
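To make the merging case concrete, here is a minimal sketch in plain Python; it is not pyarrow code. `SortingColumn` below is a hypothetical stand-in that mirrors the three pieces of information the proposal asks for (column index, descending flag, nulls-first flag), and `merge_row_groups` is an illustrative helper name. The sketch assumes each row group is a list of row tuples, that every row group declares the same single-column order, and that the sort column contains no nulls.

```python
import heapq
from dataclasses import dataclass
from operator import itemgetter


@dataclass(frozen=True)
class SortingColumn:
    """Hypothetical stand-in mirroring the fields of Parquet's SortingColumn."""
    column_idx: int
    descending: bool = False
    nulls_first: bool = False


def merge_row_groups(row_groups):
    """Merge row groups that all declare the same single-column sort order.

    row_groups: list of (rows, sorting_columns) pairs, where rows is a list
    of tuples and sorting_columns is a tuple of SortingColumn entries.
    Assumes the sort column contains no nulls (illustration only).
    """
    declared = {sorting for _, sorting in row_groups}
    if len(declared) != 1:
        raise ValueError("row groups do not share one declared sort order")
    (sorting,) = declared
    if len(sorting) != 1:
        raise ValueError("this sketch only handles a single sorting column")
    col = sorting[0]
    # Each input is already sorted, so a streaming k-way merge is enough;
    # no full re-sort of the combined rows is needed.
    return list(
        heapq.merge(
            *(rows for rows, _ in row_groups),
            key=itemgetter(col.column_idx),
            reverse=col.descending,
        )
    )


order = (SortingColumn(column_idx=0),)
rg_a = ([(1, "a"), (4, "d")], order)
rg_b = ([(2, "b"), (3, "c")], order)
merged = merge_row_groups([rg_a, rg_b])
```

Without the declared order, a reader would have to either trust out-of-band knowledge or re-sort all rows; with it, the merge can stream through the row groups.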
By exposing the `sorting_columns` information in the `RowGroupMetaData` class, users can more easily work with sorted Parquet files and perform advanced data processing operations.
Component(s)
Python