Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Dataset PartitioningFactory cannot be serialized in Python #34884

Closed
rjzamora opened this issue Apr 4, 2023 · 3 comments · Fixed by #36550
Closed

[Python] Dataset PartitioningFactory cannot be serialized in Python #34884

rjzamora opened this issue Apr 4, 2023 · 3 comments · Fixed by #36550

Comments

@rjzamora
Copy link
Contributor

rjzamora commented Apr 4, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I would like to be able to serialize a dictionary of pyarrow.dataset.dataset key-word arguments (for parallel processing in Dask). However, it is not possible to do this when one of those arguments contains a Partitioning/PartitioningFactory object, because those objects cannot be serialized in Python.

Reproducer:

In [1]: import pyarrow.dataset as ds
   ...: import pickle
   ...: 
   ...: partitioning = ds.partitioning(flavor="hive")
   ...: pickle.dumps(partitioning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 5
      2 import pickle
      4 partitioning = ds.partitioning(flavor="hive")
----> 5 pickle.dumps(partitioning)

File stringsource:2, in pyarrow._dataset.PartitioningFactory.__reduce_cython__()

TypeError: self.factory,self.wrapped cannot be converted to a Python object for pickling

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title Dataset PartitioningFactory cannot be serialized in Python [Python ]Dataset PartitioningFactory cannot be serialized in Python Apr 4, 2023
@jorisvandenbossche jorisvandenbossche changed the title [Python ]Dataset PartitioningFactory cannot be serialized in Python [Python] Dataset PartitioningFactory cannot be serialized in Python Apr 4, 2023
@westonpace
Copy link
Member

All the partitioning objects we have today can boil down to a schema (which can be saved as an empty parquet/arrow file) and a string denoting the type (e.g. "dictionary" or "hive" or "filename"). I'm not sure if this is helpful or not since I suspect the goal is automatic serialization.

@rjzamora
Copy link
Contributor Author

Thanks @westonpace ! The current workaround in Dask is indeed to allow the user to specify a dictionary like {"flavor": "hive", "schema": ...}. This works fine, but the Dask UX would certainly be cleaner if the user could pass in something like an initialized HivePartitioning object.

@jorisvandenbossche jorisvandenbossche added this to the 13.0.0 milestone May 10, 2023
@jorisvandenbossche jorisvandenbossche self-assigned this May 17, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jul 4, 2023
jorisvandenbossche added a commit that referenced this issue Jul 7, 2023
…classes (#36462)

### Rationale for this change

Add support for pickling Directory/Hive/FilenamePartitioning objects.

Does not yet actually fix the issue #34884, because this PR only addresses the actual Partitioning subclasses, and not the PartitioningFactory subclasses.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Only new support for pickling and `==` operation.
* Issue: #34884

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jul 7, 2023
jorisvandenbossche added a commit that referenced this issue Jul 10, 2023
…ory objects (#36550)

### Rationale for this change

#36462 already added support for pickling Partitioning objects, but not yet the PartitioningFactory objects.

The problem for PartitioningFactory is that we currently don't really expose the full class hierarchy in python, just the base class PartitioningFactory. We also don't expose creating those factory objects, except through the `discover` methods of the Partitioning classes. 
I think it would be nice to keep this minimal binding, but that means if we want to make them serializable with pickle, we need another way to do that (and if we don't want to add custom code for serialization on the C++ side). 

In this PR, I went for the route of essentially storing the constructor (the discover static method) and the arguments that were passed to the constructor, on the factory object, so we can use this info for pickling. Not the nicest code, but the simplest solution I could think of.

### Are these changes tested?

Yes
* Closes: #34884

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@rjzamora
Copy link
Contributor Author

Thanks @jorisvandenbossche !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants