Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35331: [Python] Expose Parquet sorting metadata #37665

Merged
merged 18 commits into from
Dec 20, 2023

Conversation

judahrand
Copy link
Contributor

@judahrand judahrand commented Sep 11, 2023

Rationale for this change

Picking up where #35453 left off.

Closes #35331

This PR builds on top of #37469

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@judahrand
Copy link
Contributor Author

@mapleFU Could you also take a look at this PR when you get a chance?

@mapleFU
Copy link
Member

mapleFU commented Sep 20, 2023

Ooops seems CI failed.

I'm a little busy this week, but I'll take a glance later this week :-) Maybe you can call other for review if it's hurry

@judahrand
Copy link
Contributor Author

Ooops seems CI failed.

I'm a little busy this week, but I'll take a glance later this week :-) Maybe you can call other for review if it's hurry

Sorry! Will do.

Notes
-----

Column indices are zero-based, refer only to leaf fields, and are in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, in current implemention, parquet itself will not check the sorting_order is satisfied in user input. So we'd better:

  1. Maybe only enable writing for test?
  2. Or tell the user this is so dangerous and user should set this order if he/she cannot not gurantee ordering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is made clear in the docstrings of the Parquet Writer:

sorting_columns : Sequence of SortingColumn, default None
Specify the sort order of the data being written. The writer does not sort
the data nor does it verify that the data is sorted. The sort order is
written to the row group metadata, which can then be used by readers.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 20, 2023
@judahrand judahrand force-pushed the GH-35331-parquet-sorting-python branch from 5db4e8e to 65a5798 Compare October 2, 2023 11:15
@judahrand
Copy link
Contributor Author

A rebase seems to have fixed the failing tests and now it is just AppVeyor failing (which I think is unrelated to these changes). Could someone please review this when they get the chance?

@judahrand judahrand requested a review from mapleFU October 2, 2023 14:01
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@judahrand judahrand force-pushed the GH-35331-parquet-sorting-python branch from 65a5798 to c162438 Compare October 23, 2023 19:26
@judahrand
Copy link
Contributor Author

cc @AlenkaF @jorisvandenbossche

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question otherwise LGTM 👍

return out


def _sort_keys_to_sorting_columns(sort_keys, null_placement, Schema schema):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is not used anywhere - do you know, is it meant to be as a possible helper function? Looks like a duplicate of from_sort_order to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do look to do exactly the same thing so almost certainly just accidentially left in when the classmethod was added.

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AppVeyor failure is not connected. Thanks again @judahrand !

@jorisvandenbossche would you like to have a view before we merge?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! I am bit hesitant about the custom to_sort_order / from_sort_order API ..

python/pyarrow/_parquet.pxd Outdated Show resolved Hide resolved
python/pyarrow/_parquet.pyx Show resolved Hide resolved
python/pyarrow/_parquet.pyx Outdated Show resolved Hide resolved
Comment on lines +2043 to +2061
out = dict()

cdef SchemaDescriptor* parquet_schema = sp_parquet_schema.get()

for i in range(parquet_schema.num_columns()):
name = frombytes(parquet_schema.Column(i).path().get().ToDotString())
out[name] = i

return out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be done with existing ParquetSchema cython class (parquet_schema[idx].path instead of creating the dict + lookup), but I suppose the problem is that we currently don't allow to create a ParquetSchema from scratch from a pyarrow schema, but only from an actual FileMetaData object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A from_arrow_schema classmethod could be introduced on ParquetSchema but that feels a bit out of scope of these changes.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting committer review Awaiting committer review awaiting review Awaiting review awaiting changes Awaiting changes labels Oct 27, 2023
@judahrand judahrand force-pushed the GH-35331-parquet-sorting-python branch from e76aeaa to fcfd65b Compare October 30, 2023 11:56
@judahrand judahrand force-pushed the GH-35331-parquet-sorting-python branch from 6b86bcd to 7c8849f Compare December 10, 2023 10:59
@mapleFU
Copy link
Member

mapleFU commented Dec 10, 2023

Ping @jorisvandenbossche

@mapleFU
Copy link
Member

mapleFU commented Dec 13, 2023

Gentle ping @jorisvandenbossche

@mapleFU
Copy link
Member

mapleFU commented Dec 18, 2023

@AlenkaF would you like to move it forward? Or lets wait after January

@AlenkaF
Copy link
Member

AlenkaF commented Dec 18, 2023

Will wait till tomorrow for any comments otherwise I think this could get merged 👍

@AlenkaF
Copy link
Member

AlenkaF commented Dec 19, 2023

Just one small comment before we merge: #37665 (comment)

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! +1

@AlenkaF AlenkaF merged commit cc9e649 into apache:main Dec 20, 2023
14 checks passed
@AlenkaF AlenkaF removed the awaiting committer review Awaiting committer review label Dec 20, 2023
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit cc9e649.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
### Rationale for this change

Picking up where apache#35453 left off.

Closes apache#35331

This PR builds on top of apache#37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#35331

Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change

Picking up where apache#35453 left off.

Closes apache#35331

This PR builds on top of apache#37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#35331

Lead-authored-by: Judah Rand <17158624+judahrand@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Python] Expose sorting_columns in RowGroupMetaData for Parquet files
6 participants