Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35331: [C++][Parquet] Parquet Export Footer metadata SortColumns #35351

Merged
merged 5 commits into from
May 1, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Apr 27, 2023

Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

Are these changes tested?

  • tests

Are there any user-facing changes?

User can read sort columns in the future

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #35331 has been automatically assigned in GitHub to PR creator.

@wgtmac
Copy link
Member

wgtmac commented Apr 27, 2023

I was thinking if we can reuse arrow::compute::Ordering: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/ordering.h#L61-L117

But it slightly differs with parquet::format::SortingColumn: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L682-L692

The null placement for arrow::compute::Ordering is the same for all sort keys but that of parquet can vary among columns.

In most cases null placement should be consistent in the same engine, so I think we can simply reuse arrow::compute::Ordering and do not return sorting columns if that in the RowGroupMetadata indicates different null placement from columns.

WDYT? @mapleFU @wjones127 @pitrou

@mapleFU
Copy link
Member Author

mapleFU commented Apr 28, 2023

Currently I just want to wrap parquet thrift SortingColumn lightweight.

Ooops, seems parquet-testing not have file with sorting-columns, so just reading it need construct some data. Should I just generate a file with parquet-mr, or implement building logic in builder?

@wgtmac
Copy link
Member

wgtmac commented Apr 28, 2023

Currently I just want to wrap parquet thrift SortingColumn lightweight.

Ooops, seems parquet-testing not have file with sorting-columns, so just reading it need construct some data. Should I just generate a file with parquet-mr, or implement building logic in builder?

What about adding the writer support as well? Then we can do round trip test.

@mapleFU
Copy link
Member Author

mapleFU commented Apr 28, 2023

Ok I'll go forward now

@mapleFU mapleFU marked this pull request as ready for review April 28, 2023 17:05
@mapleFU mapleFU requested a review from wjones127 as a code owner April 28, 2023 17:05
@wjones127
Copy link
Member

I was thinking if we can reuse arrow::compute::Ordering

I think something that directly corresponds with the Parquet notion makes sense. If we want we can define conversions to the compute notion of sorting.

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation seems good to me. @wgtmac, any comments?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 28, 2023
@mapleFU
Copy link
Member Author

mapleFU commented Apr 28, 2023

I guess we need a mapping layer if we want arrow::compute::Order, because things like unordered would be tricky.

By the way, I think currently it's a bit unsafe to use set_sorting_order, because no checking is executed here. I go through the code of arrow-rs and parquet-mr. They also do not check the code. Maybe we can use Statistic to check, but it's too expansive?

By the way, ci failed because:

=================================== FAILURES ===================================
___________________ test_pandas_assertion_error_large_string ___________________

    @pytest.mark.large_memory
    @pytest.mark.pandas
    def test_pandas_assertion_error_large_string():
        # Test AssertionError as pandas does not support "U" type strings
        if Version(pd.__version__) < Version("1.5.0"):
            pytest.skip("__dataframe__ added to pandas in 1.5.0")
    
        data = np.array([b'x'*1024]*(3*1024**2), dtype='object')  # 3GB bytes data
        arr = pa.array(data, type=pa.large_string())
        table = pa.table([arr], names=["large_string"])
    
        from pandas.api.interchange import (
            from_dataframe as pandas_from_dataframe
        )
    
>       with pytest.raises(AssertionError):
E       Failed: DID NOT RAISE <class 'AssertionError'>

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:294: Failed
=============================== warnings summary ===============================
tests/test_pandas.py::TestConvertListTypes::test_to_list_of_maps_pandas
tests/test_pandas.py::TestConvertListTypes::test_to_list_of_maps_pandas_sliced
tests/test_pandas.py::test_roundtrip_nested_map_array_with_pydicts_sliced
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/dtypes/missing.py:571: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
    if np.any(np.asarray(left_value != right_value)):

tests/test_substrait.py::test_named_table_invalid_table_name
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/unraisableexception.py:78: PytestUnraisableExceptionWarning: Exception ignored in: 'pyarrow._substrait._create_named_table_provider'
  
  Traceback (most recent call last):
    File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/tests/test_substrait.py", line 250, in table_provider
      raise Exception("Unrecognized table name")
  Exception: Unrecognized table name
  
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

tests/parquet/test_datetime.py::test_list_of_datetime_time_roundtrip
  /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/tests/parquet/test_datetime.py:361: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
    times = pd.to_datetime(['09:00', '09:30', '10:00', '10:30', '11:00',

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================

No idea why it failed :-(

@wjones127
Copy link
Member

because no checking is executed here

That's a fair point. I think we should say in the doc string that this is not verified and is meant to be a low-level API.

@mapleFU
Copy link
Member Author

mapleFU commented Apr 28, 2023

    /// Define the sorting columns.
    /// Default empty.
    ///
    /// If sorting columns are set, user should ensure that records
    /// are sorted by sorting columns. Otherwise, the storing data
    /// will be inconsistent with sorting_columns metadata.

I add this but you can edit it directly. Since I'm not good at documenting...

@wjones127
Copy link
Member

That CI failure is unrelated. I also see it in the main branch. Since the release of pandas 2.0, some of our integration tests have started failing.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Thanks @mapleFU and @wjones127

cpp/src/parquet/metadata.cc Show resolved Hide resolved
@wjones127 wjones127 merged commit da6dbd4 into apache:main May 1, 2023
28 of 31 checks passed
@mapleFU mapleFU deleted the parquet/export-sorting-columns branch May 1, 2023 14:55
@ursabot
Copy link

ursabot commented May 1, 2023

Benchmark runs are scheduled for baseline = 3b48834 and contender = da6dbd4. da6dbd4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.9% ⬆️0.2%] test-mac-arm
[Failed ⬇️0.52% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] da6dbd48 ec2-t3-xlarge-us-east-2
[Failed] da6dbd48 test-mac-arm
[Failed] da6dbd48 ursa-i9-9960x
[Finished] da6dbd48 ursa-thinkcentre-m75q
[Finished] 3b488349 ec2-t3-xlarge-us-east-2
[Failed] 3b488349 test-mac-arm
[Failed] 3b488349 ursa-i9-9960x
[Finished] 3b488349 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented May 1, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…umns (apache#35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: apache#35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…umns (apache#35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: apache#35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…umns (apache#35351)

### Rationale for this change

Allow read/set SortColumns in C++ parquet. Node that currently we didn't check sort columns, so user should ensure
that records don't violates the order

### What changes are included in this PR?

For RowGroupMetadata, add a SortColumns interface

### Are these changes tested?

* [x] tests

### Are there any user-facing changes?

User can read sort columns in the future

* Closes: apache#35331

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files
4 participants