GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351

mapleFU · 2023-04-27T08:25:58Z

Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

Are these changes tested?

tests

Are there any user-facing changes?

User can read sort columns in the future

Closes: [C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files #35331

github-actions · 2023-04-27T08:26:23Z

Closes: [C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files #35331

github-actions · 2023-04-27T08:26:26Z

⚠️ GitHub issue #35331 has been automatically assigned in GitHub to PR creator.

wgtmac · 2023-04-27T14:41:41Z

I was thinking if we can reuse arrow::compute::Ordering: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/ordering.h#L61-L117

But it slightly differs with parquet::format::SortingColumn: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L682-L692

The null placement for arrow::compute::Ordering is the same for all sort keys but that of parquet can vary among columns.

In most cases null placement should be consistent in the same engine, so I think we can simply reuse arrow::compute::Ordering and do not return sorting columns if that in the RowGroupMetadata indicates different null placement from columns.

WDYT? @mapleFU @wjones127 @pitrou

mapleFU · 2023-04-28T15:15:21Z

Currently I just want to wrap parquet thrift SortingColumn lightweight.

Ooops, seems parquet-testing not have file with sorting-columns, so just reading it need construct some data. Should I just generate a file with parquet-mr, or implement building logic in builder?

wgtmac · 2023-04-28T15:17:51Z

Currently I just want to wrap parquet thrift SortingColumn lightweight.

Ooops, seems parquet-testing not have file with sorting-columns, so just reading it need construct some data. Should I just generate a file with parquet-mr, or implement building logic in builder?

What about adding the writer support as well? Then we can do round trip test.

mapleFU · 2023-04-28T15:20:56Z

Ok I'll go forward now

wjones127 · 2023-04-28T17:27:03Z

I was thinking if we can reuse arrow::compute::Ordering

I think something that directly corresponds with the Parquet notion makes sense. If we want we can define conversions to the compute notion of sorting.

wjones127

This implementation seems good to me. @wgtmac, any comments?

mapleFU · 2023-04-28T17:38:37Z

I guess we need a mapping layer if we want arrow::compute::Order, because things like unordered would be tricky.

By the way, I think currently it's a bit unsafe to use set_sorting_order, because no checking is executed here. I go through the code of arrow-rs and parquet-mr. They also do not check the code. Maybe we can use Statistic to check, but it's too expansive?

By the way, ci failed because:

=================================== FAILURES ===================================
___________________ test_pandas_assertion_error_large_string ___________________

    @pytest.mark.large_memory
    @pytest.mark.pandas
    def test_pandas_assertion_error_large_string():
        # Test AssertionError as pandas does not support "U" type strings
        if Version(pd.__version__) < Version("1.5.0"):
            pytest.skip("__dataframe__ added to pandas in 1.5.0")
    
        data = np.array([b'x'*1024]*(3*1024**2), dtype='object')  # 3GB bytes data
        arr = pa.array(data, type=pa.large_string())
        table = pa.table([arr], names=["large_string"])
    
        from pandas.api.interchange import (
            from_dataframe as pandas_from_dataframe
        )
    
>       with pytest.raises(AssertionError):
E       Failed: DID NOT RAISE <class 'AssertionError'>

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:294: Failed
=============================== warnings summary ===============================
tests/test_pandas.py::TestConvertListTypes::test_to_list_of_maps_pandas
tests/test_pandas.py::TestConvertListTypes::test_to_list_of_maps_pandas_sliced
tests/test_pandas.py::test_roundtrip_nested_map_array_with_pydicts_sliced
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/dtypes/missing.py:571: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
    if np.any(np.asarray(left_value != right_value)):

tests/test_substrait.py::test_named_table_invalid_table_name
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/unraisableexception.py:78: PytestUnraisableExceptionWarning: Exception ignored in: 'pyarrow._substrait._create_named_table_provider'
  
  Traceback (most recent call last):
    File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/tests/test_substrait.py", line 250, in table_provider
      raise Exception("Unrecognized table name")
  Exception: Unrecognized table name
  
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

tests/parquet/test_datetime.py::test_list_of_datetime_time_roundtrip
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/tests/parquet/test_datetime.py:361: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
    times = pd.to_datetime(['09:00', '09:30', '10:00', '10:30', '11:00',

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================

No idea why it failed :-(

wjones127 · 2023-04-28T17:42:03Z

because no checking is executed here

That's a fair point. I think we should say in the doc string that this is not verified and is meant to be a low-level API.

mapleFU · 2023-04-28T17:44:24Z

    /// Define the sorting columns.
    /// Default empty.
    ///
    /// If sorting columns are set, user should ensure that records
    /// are sorted by sorting columns. Otherwise, the storing data
    /// will be inconsistent with sorting_columns metadata.

I add this but you can edit it directly. Since I'm not good at documenting...

wjones127 · 2023-04-28T17:49:42Z

That CI failure is unrelated. I also see it in the main branch. Since the release of pandas 2.0, some of our integration tests have started failing.

wgtmac

Overall LGTM. Thanks @mapleFU and @wjones127

cpp/src/parquet/metadata.cc

ursabot · 2023-05-01T21:24:02Z

Benchmark runs are scheduled for baseline = 3b48834 and contender = da6dbd4. da6dbd4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.9% ⬆️0.2%] test-mac-arm
[Failed ⬇️0.52% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] da6dbd48 ec2-t3-xlarge-us-east-2
[Failed] da6dbd48 test-mac-arm
[Failed] da6dbd48 ursa-i9-9960x
[Finished] da6dbd48 ursa-thinkcentre-m75q
[Finished] 3b488349 ec2-t3-xlarge-us-east-2
[Failed] 3b488349 test-mac-arm
[Failed] 3b488349 ursa-i9-9960x
[Finished] 3b488349 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-05-01T21:25:48Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

…umns (apache#35351) ### Rationale for this change Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure that records don't violates the order ### What changes are included in this PR? For RowGroupMetadata, add a SortColumns interface ### Are these changes tested? * [x] tests ### Are there any user-facing changes? User can read sort columns in the future * Closes: apache#35331 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Will Jones <willjones127@gmail.com>

Parquet Export Footer metadata SortColumns

c149adb

github-actions bot added Component: C++ Component: Parquet awaiting review Awaiting review labels Apr 27, 2023

mapleFU added 3 commits April 28, 2023 23:21

using __isset in RowGroup

faa3af7

Merge branch 'main' into parquet/export-sorting-columns

acd8f38

impl logic for setting sorting_columns

32fab52

mapleFU marked this pull request as ready for review April 28, 2023 17:05

mapleFU requested a review from wjones127 as a code owner April 28, 2023 17:05

wjones127 approved these changes Apr 28, 2023

View reviewed changes

github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 28, 2023

wgtmac reviewed Apr 29, 2023

View reviewed changes

cpp/src/parquet/metadata.cc Show resolved Hide resolved

Using ToThrift and FromThrift to unify casting

8836cd7

wjones127 approved these changes May 1, 2023

View reviewed changes

wjones127 merged commit da6dbd4 into apache:main May 1, 2023
28 of 31 checks passed

mapleFU deleted the parquet/export-sorting-columns branch May 1, 2023 14:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351

GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351

mapleFU commented Apr 27, 2023 •

edited

github-actions bot commented Apr 27, 2023

github-actions bot commented Apr 27, 2023

wgtmac commented Apr 27, 2023 •

edited

mapleFU commented Apr 28, 2023

wgtmac commented Apr 28, 2023

mapleFU commented Apr 28, 2023

wjones127 commented Apr 28, 2023

wjones127 left a comment

mapleFU commented Apr 28, 2023

wjones127 commented Apr 28, 2023

mapleFU commented Apr 28, 2023

wjones127 commented Apr 28, 2023

wgtmac left a comment

ursabot commented May 1, 2023

ursabot commented May 1, 2023

GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351

GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351

Conversation

mapleFU commented Apr 27, 2023 • edited

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot commented Apr 27, 2023

github-actions bot commented Apr 27, 2023

wgtmac commented Apr 27, 2023 • edited

mapleFU commented Apr 28, 2023

wgtmac commented Apr 28, 2023

mapleFU commented Apr 28, 2023

wjones127 commented Apr 28, 2023

wjones127 left a comment

Choose a reason for hiding this comment

mapleFU commented Apr 28, 2023

wjones127 commented Apr 28, 2023

mapleFU commented Apr 28, 2023

wjones127 commented Apr 28, 2023

wgtmac left a comment

Choose a reason for hiding this comment

ursabot commented May 1, 2023

ursabot commented May 1, 2023

mapleFU commented Apr 27, 2023 •

edited

wgtmac commented Apr 27, 2023 •

edited