Description
In a Pandas DataFrame with a MultiIndex whose "outer" index level varies slowly, one might want to partition by that index level when saving the DataFrame to Parquet format. This is currently not possible: you have to reset the index manually before writing and restore it after reading. It would be very useful if partition_cols accepted the name of an index level instead of (or, ideally, in addition to) a data column name.
I originally posted this on the Pandas issue tracker (pandas-dev/pandas#47797). Matthew Roeschke looked at the code and determined that the partitioning functionality is implemented entirely in PyArrow, so the change would need to happen within PyArrow itself.
Reporter: Gregory Werbin
Externally tracked issue: pandas-dev/pandas#47797
Note: This issue was originally created as ARROW-17200. Please see the migration documentation for further details.