[Python][Parquet] support partitioning by Pandas DataFrame index #32492

In a Pandas DataFrame with a multi-index whose "outer" index level varies slowly, one might want to partition by that index level when saving the DataFrame to Parquet format. This is currently not possible: you need to manually reset the index before writing, and restore the index after reading. It would be very useful if you could supply the name of an index level to partition_cols instead of (or ideally in addition to) a data column name.

I originally posted this on the Pandas issue tracker (pandas-dev/pandas#47797). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.

Reporter: Gregory Werbin

Externally tracked issue: pandas-dev/pandas#47797

Note: This issue was originally created as ARROW-17200. Please see the migration documentation for further details.
