
[Python] Ability to create partitions when writing to Parquet #15457

Closed
asfimport opened this issue Aug 23, 2017 · 3 comments

Comments

I'm fairly new to pyarrow, so I apologize if this is already a feature, but I couldn't find a solution in the documentation or an existing issue. Basically, I'm trying to export pandas DataFrames to .parquet files with partitions. pyarrow.parquet can read partitioned .parquet datasets, but there's no indication that it can write them. E.g., it would be nice if pyarrow.parquet.write_table() took a list of columns to partition the table by, similar to the "partitionBy" parameter of spark.write.parquet in the pyspark implementation.
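For reference, the pyspark pattern being alluded to looks roughly like this (the data and output path are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(2017, 8, 1.0), (2017, 9, 2.0)], ["year", "month", "value"]
    )
    # Spark writes one subdirectory per distinct key combination,
    # e.g. year=2017/month=8/part-....parquet
    df.write.parquet("/tmp/dataset", partitionBy=["year", "month"])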

Referenced links:
https://arrow.apache.org/docs/python/parquet.html
https://arrow.apache.org/docs/python/parquet.html?highlight=pyarrow%20parquet%20partition

Environment: Mac OS Sierra 10.12.6
Reporter: Safyre Anderson / @saffrydaffry
Assignee: Safyre Anderson / @saffrydaffry

Note: This issue was originally created as ARROW-1400. Please see the migration documentation for further details.


Wes McKinney / @wesm:
This would be a very useful feature. The simplest way to do this in the short term is to generate the partition scheme from a pandas.DataFrame, using pandas operations to split the object into pieces. We should add a function in pyarrow.parquet that enables data to be "inserted" into a directory containing a standard Hive-like partition schema. So you could do something like (just spitballing here):

pq.write_table_to_dataset(dataset_path, partition_keys=keys, **options)

Here dataset_path is a directory; if partition_keys is not None, this writes a new Parquet file in the appropriate location in the subdirectory structure.

A patch would be welcome. I will mark this issue for 0.7.0.
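The short-term workaround described above (split the DataFrame with pandas, then write each piece into a Hive-style directory) might look like this sketch; the column names and paths are illustrative:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pathlib import Path

    df = pd.DataFrame({
        "year": [2017, 2017, 2018],
        "month": [8, 9, 1],
        "value": [1.0, 2.0, 3.0],
    })

    root = Path("/tmp/dataset")
    for (year, month), piece in df.groupby(["year", "month"]):
        # Hive-style layout: one key=value directory per partition column
        part_dir = root / f"year={year}" / f"month={month}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # Drop the partition columns; their values are encoded in the path
        table = pa.Table.from_pandas(piece.drop(columns=["year", "month"]))
        pq.write_table(table, str(part_dir / "part-0.parquet"))

Reading the root directory back (e.g. with pq.read_table("/tmp/dataset")) reconstructs year and month as partition columns from the directory names.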


Safyre Anderson / @saffrydaffry:
Submitted a pull request for a hot fix: #991.


Wes McKinney / @wesm:
Issue resolved by pull request #991.
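For what it's worth, the function that ultimately landed in pyarrow.parquet is write_to_dataset (the name differs slightly from the one spitballed above). A minimal usage sketch, with illustrative data and path:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"year": [2017, 2017, 2018], "value": [1.0, 2.0, 3.0]})
    # Creates year=2017/ and year=2018/ subdirectories under the root,
    # each containing a Parquet file with the remaining columns
    pq.write_to_dataset(table, root_path="/tmp/dataset", partition_cols=["year"])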
