[Python] Allow usage of field_names in partitioning when saving datasets #29385

Closed
asfimport opened this issue Aug 25, 2021 · 1 comment

Comments

@asfimport

When loading a dataset back, you can specify just the names of the columns the data was partitioned by:

partitioning=pyarrow.dataset.partitioning(field_names=["year"])

This is convenient because it is quicker and easier than providing the whole schema, which can still be autodetected from the loaded data.

On the other hand, this is not supported when saving data. If you provide field_names instead of a schema, you get a ValueError:

pyarrow/dataset.py in _ensure_write_partitioning(scheme)
    684     if not isinstance(scheme, Partitioning):
    685         # TODO support passing field names, and get types from schema
--> 686         raise ValueError("partitioning needs to be actual Partitioning object")
    687     return scheme
    688 

It would be convenient to allow passing only field_names when saving as well, since the schema can be detected automatically from the table being saved.

Reporter: Alessandro Molina / @amol-
Assignee: Alessandro Molina / @amol-

PRs and other links:

Note: This issue was originally created as ARROW-13755. Please see the migration documentation for further details.

@asfimport
Author

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 11008
#11008

@asfimport asfimport added this to the 6.0.0 milestone Jan 11, 2023