Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files #35331

Closed
ei-grad opened this issue Apr 25, 2023 · 11 comments · Fixed by #35351 or #37665
Closed

[C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files #35331

ei-grad opened this issue Apr 25, 2023 · 11 comments · Fixed by #35351 or #37665

Comments

@ei-grad
Copy link

ei-grad commented Apr 25, 2023

Summary

Currently, the pyarrow.parquet.RowGroupMetaData class does not expose the sorting_columns information available in the Parquet format's RowGroup struct. This information is useful for users who need to understand the local sorting order of columns within each RowGroup. It would be beneficial to expose this information in the RowGroupMetaData class.

Details

The Parquet format includes an optional sorting_columns field in the RowGroup struct, which stores information about the sorting order of columns within the RowGroup. This information is defined in the SortingColumn struct in the parquet.thrift file:

struct SortingColumn {
  1: required int32 column_idx;
  2: required bool descending;
  3: optional bool nulls_first;
}

In the RowGroup struct, the sorting_columns field is defined as follows:

struct RowGroup {
  1: required list<ColumnChunk> columns;
  2: required i64 total_byte_size;
  3: required i64 num_rows;
  4: optional list<SortingColumn> sorting_columns;
}

However, the pyarrow.parquet.RowGroupMetaData class does not expose this information. As a result, users cannot access the local sorting information of columns within RowGroups.

Proposal

I propose adding a new method or property in the RowGroupMetaData class to expose the sorting_columns information. This could be implemented as a new method, such as get_sorting_columns(), or as a property, such as sorting_columns. The output should include the column index, sorting order (ascending or descending), and whether null values appear first or last in the sorted order.

Use Case

Users working with sorted Parquet files can benefit from understanding the local sorting order of columns within RowGroups. This information is particularly useful when analyzing large datasets or performing operations that require knowledge of the sort order, such as range queries, filtering, or merging.

By exposing the sorting_columns information in the RowGroupMetaData class, users can more easily work with sorted Parquet files and perform advanced data processing operations.

Component(s)

Python

@ei-grad ei-grad changed the title [pyarrow] Expose sorting_columns in RowGroupMetaData for Parquet files [Python] Expose sorting_columns in RowGroupMetaData for Parquet files Apr 25, 2023
@jorisvandenbossche jorisvandenbossche changed the title [Python] Expose sorting_columns in RowGroupMetaData for Parquet files [C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files Apr 26, 2023
@jorisvandenbossche
Copy link
Member

@ei-grad thanks for opening the issue! I think that would be a welcome enhancement.

Note that this is also not yet exposed in the C++ parquet::RowGroupMetaData (which wraps the raw parquet::format::RowGroup, which is the auto-generated equivalent of the thrift struct you mentioned). So a first step would be to expose it in C++, and then the Python bindings can expose it as well.

@mapleFU
Copy link
Member

mapleFU commented Apr 27, 2023

I can help with the C++ part. Will finish it tonight

@mapleFU
Copy link
Member

mapleFU commented Apr 27, 2023

Draft an issue, will test it later: https://github.com/apache/arrow/pull/35351/files

wjones127 pushed a commit that referenced this issue May 1, 2023
…35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: #35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 13.0.0 milestone May 1, 2023
@mapleFU
Copy link
Member

mapleFU commented May 1, 2023

@wjones127 Hi will, I only solve the C++ part. Should we reuse this issue, or I should create another issue for C++/Python, and use that issue?

@wjones127
Copy link
Member

Ah sorry I didn't notice that. I think we generally want one-to-one correspondence with issues and pull requests, so i should have created a separate sub-issue for the C++ part. I think its fine for now if we re-use the issue, right @jorisvandenbossche ?

@jorisvandenbossche
Copy link
Member

Yes, let's just re-use it

@mapleFU
Copy link
Member

mapleFU commented May 1, 2023

I'm not familiar with Python part, would you mind do this, or tell me the code I can take use for reference?

@wjones127
Copy link
Member

I can work on that soon.

@wjones127 wjones127 self-assigned this May 2, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…umns (apache#35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: apache#35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…umns (apache#35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: apache#35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…umns (apache#35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: apache#35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@raulcd raulcd modified the milestones: 13.0.0, 14.0.0 Jul 7, 2023
@raulcd raulcd removed this from the 14.0.0 milestone Oct 10, 2023
@mapleFU
Copy link
Member

mapleFU commented Dec 20, 2023

Please not assign to me =_=

@mapleFU mapleFU removed their assignment Dec 20, 2023
@mapleFU
Copy link
Member

mapleFU commented Dec 20, 2023

@judahrand would you mind reply here? we can only assign to the one replied here

AlenkaF pushed a commit that referenced this issue Dec 20, 2023
### Rationale for this change

Picking up where #35453 left off.

Closes #35331

This PR builds on top of #37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #35331

Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
@AlenkaF AlenkaF added this to the 15.0.0 milestone Dec 20, 2023
@judahrand
Copy link
Contributor

Sure!

clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
### Rationale for this change

Picking up where apache#35453 left off.

Closes apache#35331

This PR builds on top of apache#37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#35331

Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
### Rationale for this change

Picking up where apache#35453 left off.

Closes apache#35331

This PR builds on top of apache#37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#35331

Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment