Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34884: [Python]: Support pickling pyarrow.dataset Partitioning subclasses #36462

Merged
merged 2 commits into from
Jul 7, 2023

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jul 4, 2023

Rationale for this change

Add support for pickling Directory/Hive/FilenamePartitioning objects.

Does not yet actually fix the issue #34884, because this PR only addresses the actual Partitioning subclasses, and not the PartitioningFactory subclasses.

Are these changes tested?

Yes

Are there any user-facing changes?

Only new support for pickling and == operation.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!


cdef cppclass CSegmentEncoding" arrow::dataset::SegmentEncoding":
pass
bint operator==(CSegmentEncoding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own curiosity, is bint different than c_bool?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c_bool comes from from libcpp cimport bool as c_bool, while bint is a built-in cython type (for C boolean values, i.e. int with 0/non-0 values for False/True). Not fully sure in which case which should be used (or if it matters). I think generally where we expect to pass a Python boolean (eg in signatures) we use bint, while c_bool is mostly used when interfacing with the C++ declared methods. Although my expectation is that cython will cast the one to the other if needed.

In this case I copied it from another place in the pxd file that was using the same pattern.

python/pyarrow/tests/test_dataset.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jul 6, 2023
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jul 7, 2023
@jorisvandenbossche jorisvandenbossche merged commit dd36705 into apache:main Jul 7, 2023
31 of 32 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting changes Awaiting changes label Jul 7, 2023
@jorisvandenbossche jorisvandenbossche deleted the gh-34884 branch July 7, 2023 12:47
@github-actions github-actions bot added the awaiting review Awaiting review label Jul 7, 2023
jorisvandenbossche added a commit that referenced this pull request Jul 10, 2023
…ory objects (#36550)

### Rationale for this change

#36462 already added support for pickling Partitioning objects, but not yet the PartitioningFactory objects.

The problem for PartitioningFactory is that we currently don't really expose the full class hierarchy in python, just the base class PartitioningFactory. We also don't expose creating those factory objects, except through the `discover` methods of the Partitioning classes. 
I think it would be nice to keep this minimal binding, but that means if we want to make them serializable with pickle, we need another way to do that (and if we don't want to add custom code for serialization on the C++ side). 

In this PR, I went for the route of essentially storing the constructor (the discover static method) and the arguments that were passed to the constructor, on the factory object, so we can use this info for pickling. Not the nicest code, but the simplest solution I could think of.

### Are these changes tested?

Yes
* Closes: #34884

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche added a commit that referenced this pull request Jul 13, 2023
…ring with other type (#36661)

### Rationale for this change

Ensure that `part == other` doesn't crash with `other` is not a Partitioning instance

Small follow-up on #36462

* Closes: #36659

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
raulcd pushed a commit that referenced this pull request Jul 13, 2023
…ring with other type (#36661)

### Rationale for this change

Ensure that `part == other` doesn't crash with `other` is not a Partitioning instance

Small follow-up on #36462

* Closes: #36659

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit dd367058.

There were 64 benchmark results indicating a performance regression:

The full Conbench report has more details.

chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request Jul 20, 2023
… comparing with other type (apache#36661)

### Rationale for this change

Ensure that `part == other` doesn't crash with `other` is not a Partitioning instance

Small follow-up on apache#36462

* Closes: apache#36659

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
… comparing with other type (apache#36661)

### Rationale for this change

Ensure that `part == other` doesn't crash with `other` is not a Partitioning instance

Small follow-up on apache#36462

* Closes: apache#36659

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants