GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351
I was thinking if we can reuse But it slightly differs with The null placement for In most cases null placement should be consistent in the same engine, so I think we can simply reuse WDYT? @mapleFU @wjones127 @pitrou |
For now I just want a lightweight wrapper around the Parquet Thrift SortingColumn. Oops, it seems parquet-testing doesn't have a file with sorting columns, so even testing the read path requires constructing some data. Should I just generate a file with parquet-mr, or implement the building logic in the builder?
What about adding the writer support as well? Then we can do a round-trip test.
OK, I'll go forward now.
I think something that directly corresponds with the Parquet notion makes sense. If we want we can define conversions to the compute notion of sorting. |
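As a rough illustration of that suggestion: a plain struct can mirror the three fields of the Parquet Thrift SortingColumn (column_idx, descending, nulls_first), with a separate helper converting it to a compute-style sort key. The names ComputeSortKey, SortOrder, NullPlacement, and ToComputeSortKey below are hypothetical stand-ins for illustration, not the actual Arrow compute API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Mirrors the fields of the Parquet Thrift SortingColumn struct.
struct SortingColumn {
  int32_t column_idx;  // index of the column within the row group
  bool descending;     // true if the column is sorted in descending order
  bool nulls_first;    // true if nulls come before non-null values
};

// Hypothetical compute-side notion of a sort key (illustrative only).
enum class SortOrder { Ascending, Descending };
enum class NullPlacement { AtStart, AtEnd };

struct ComputeSortKey {
  std::string field_name;
  SortOrder order;
  NullPlacement null_placement;
};

// Hypothetical conversion: resolves the column index against a list of
// column names. Throws std::out_of_range if the index is invalid.
ComputeSortKey ToComputeSortKey(const SortingColumn& col,
                                const std::vector<std::string>& column_names) {
  return ComputeSortKey{
      column_names.at(static_cast<std::size_t>(col.column_idx)),
      col.descending ? SortOrder::Descending : SortOrder::Ascending,
      col.nulls_first ? NullPlacement::AtStart : NullPlacement::AtEnd};
}
```

Keeping the Parquet-side struct separate means the metadata round-trips exactly as written, and any mismatch with the compute notion (for example per-key versus global null placement) stays confined to the conversion helper.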
This implementation seems good to me. @wgtmac, any comments?
I guess we need a mapping layer if we want By the way, I think currently it's a bit unsafe to use By the way, ci failed because:
No idea why it failed :-(
That's a fair point. I think we should say in the docstring that this is not verified and is meant to be a low-level API.
/// Define the sorting columns.
/// Default empty.
///
/// If sorting columns are set, the user should ensure that records
/// are sorted by the sorting columns. Otherwise, the stored data
/// will be inconsistent with the sorting_columns metadata.
I added this, but you can edit it directly, since I'm not good at documenting...
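Since the writer does not verify the ordering, a caller could check it before setting the metadata. Below is a minimal sketch, assuming rows are held in memory as nullable int64 values. RowsAreSorted, CompareValues, Row, and Value are hypothetical helper names, not part of the Parquet C++ API, and the SortingColumn struct is redeclared here only to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Same three fields as the Parquet Thrift SortingColumn.
struct SortingColumn {
  int32_t column_idx;
  bool descending;
  bool nulls_first;
};

using Value = std::optional<int64_t>;  // std::nullopt represents a null
using Row = std::vector<Value>;

// Returns <0, 0, or >0 if `a` orders before, equal to, or after `b`
// for one sorting column. Assumption: null placement is controlled only
// by nulls_first and is independent of `descending`.
int CompareValues(const Value& a, const Value& b, const SortingColumn& col) {
  if (!a.has_value() || !b.has_value()) {
    if (!a.has_value() && !b.has_value()) return 0;
    const bool a_is_null = !a.has_value();
    return (a_is_null == col.nulls_first) ? -1 : 1;
  }
  if (*a == *b) return 0;
  const bool a_less = *a < *b;
  return (a_less != col.descending) ? -1 : 1;
}

// Checks that consecutive rows respect the lexicographic order defined by
// sorting_columns. Throws std::out_of_range on an invalid column index.
bool RowsAreSorted(const std::vector<Row>& rows,
                   const std::vector<SortingColumn>& sorting_columns) {
  for (std::size_t i = 1; i < rows.size(); ++i) {
    for (const SortingColumn& col : sorting_columns) {
      const std::size_t idx = static_cast<std::size_t>(col.column_idx);
      const int cmp = CompareValues(rows[i - 1].at(idx), rows[i].at(idx), col);
      if (cmp > 0) return false;  // previous row orders after the current one
      if (cmp < 0) break;         // strictly ordered; later keys do not matter
    }
  }
  return true;
}
```

A check like this is linear in the number of rows, so it may be reasonable in tests or debug builds even if it is too costly to run on every write.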
That CI failure is unrelated. I also see it in the |
Overall LGTM. Thanks @mapleFU and @wjones127
Benchmark runs are scheduled for baseline = 3b48834 and contender = da6dbd4. da6dbd4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.
…umns (apache#35351) ### Rationale for this change Allow reading/setting SortColumns in C++ Parquet. Note that sort columns are currently not checked, so the user should ensure that records do not violate the order. ### What changes are included in this PR? For RowGroupMetadata, add a SortColumns interface ### Are these changes tested? * [x] tests ### Are there any user-facing changes? Users will be able to read sort columns * Closes: apache#35331 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Will Jones <willjones127@gmail.com>
Rationale for this change
Allow reading/setting SortColumns in C++ Parquet. Note that sort columns are currently not checked, so the user should ensure
that records do not violate the order.
What changes are included in this PR?
For RowGroupMetadata, add a SortColumns interface
Are these changes tested?
* [x] tests
Are there any user-facing changes?
Users will be able to read sort columns.
sorting_columns in RowGroupMetaData for Parquet files #35331