[Python][Parquet] support partitioning by Pandas DataFrame index

In a Pandas `DataFrame` with a `MultiIndex` whose "outer" index level varies slowly, one might want to partition by that index level when saving the data frame to Parquet format. This is currently not possible: you have to reset the index manually before writing and restore it after reading. It would be very useful to be able to pass the name of an index level to `partition_cols` instead of (or, ideally, in addition to) a data column name.

I originally posted this on the Pandas issue tracker (<https://github.com/pandas-dev/pandas/issues/47797>). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.

**Reporter**: [Gregory Werbin](https://issues.apache.org/jira/browse/ARROW-17200)
#### Externally tracked issue: [https://github.com/pandas-dev/pandas/issues/47797](https://github.com/pandas-dev/pandas/issues/47797)

<sub>**Note**: *This issue was originally created as [ARROW-17200](https://issues.apache.org/jira/browse/ARROW-17200). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
