[Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories) #23870
Comments
Vladimir: It is not saving data, just creating empty partitions for all categories. This may still be undesired behavior, but it is not the bug as originally reported.
Joris Van den Bossche / @jorisvandenbossche: It's indeed writing all partitions. This is because all the unique categories are still present in the type (the categories in pandas, the dictionary items in Arrow), and then because of pandas' […]
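The comment above is cut off, so the following is only a sketch of one plausible mechanism, assuming the legacy write path grouped rows by the partition column with a pandas `groupby`. The frame and column names are hypothetical. It shows that a categorical grouper yields one group per declared category unless `observed=True` is passed:

```python
import pandas as pd

# Hypothetical data for illustration; not taken from the issue.
df = pd.DataFrame({"Year": [1990, 1990, 2000, 2010],
                   "value": [1.0, 2.0, 3.0, 4.0]})
df["Year"] = df["Year"].astype("category")

# Keep only the rows for 1990; the dtype still declares all three categories.
subset = df[df["Year"] == 1990]

# observed=False: one entry per declared category, including empty ones.
sizes_all = subset.groupby("Year", observed=False).size()

# observed=True: only categories that actually occur in the data.
sizes_observed = subset.groupby("Year", observed=True).size()

print(sizes_all)
print(sizes_observed)
```

If a writer iterates over the first kind of grouping, it would visit the empty categories as well, which is consistent with the empty partitions described in this thread.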
Joris Van den Bossche / @jorisvandenbossche: |
Antoine Pitrou / @pitrou: |
Andrew Wieteska / @arw2019: |
…lues don't create empty files in parquet.write_to_dataset
…ty files for non-observed dictionary (category) values (#36465)

### What changes are included in this PR?

If we partition on a categorical variable with "unobserved" categories (values present in the dictionary, but not in the actual data), the legacy path in `pq.write_to_dataset` currently creates empty files. The new dataset-based path already has the preferred behavior, and this PR fixes it for the legacy path and adds a test for both as well. This also fixes one of the pandas deprecation warnings listed in #36412.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, this no longer creates a hive-style directory with one empty file (parquet file with 0 rows) when users have unobserved categories. However, this aligns the legacy path with the new and default dataset-based path.

* Closes: #23870

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Hello,
it looks like views that select along a categorical column are not properly respected.
For the following dummy dataframe:
The slice by Year is saved to partitioned parquet properly:
However, if we convert Year to pandas.Categorical, it will save the whole original dataframe, not only the slice with Year=1990:
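The reporter's original snippets are not preserved here. As a stand-in, the sketch below uses a hypothetical frame (the values and the `Value` column are assumptions) to show what the slice looks like on the pandas side: the rows are filtered correctly, but the categorical dtype still declares every year. It also shows one possible workaround, dropping unused categories before writing.

```python
import pandas as pd

# Hypothetical stand-in for the dummy dataframe in the report.
df = pd.DataFrame({
    "Year": [1990, 1990, 2000, 2010],
    "Value": [10, 20, 30, 40],
})
df["Year"] = pd.Categorical(df["Year"])

# Select a single year, as in the report.
view = df[df["Year"] == 1990]

rows_in_view = len(view)                               # rows kept by the filter
observed_years = view["Year"].unique().tolist()        # values actually present
declared_years = view["Year"].cat.categories.tolist()  # categories kept by the dtype

# Possible workaround: drop categories that no longer occur before writing.
trimmed = view.assign(Year=view["Year"].cat.remove_unused_categories())
trimmed_years = trimmed["Year"].cat.categories.tolist()

print(rows_in_view, observed_years, declared_years, trimmed_years)
```

The slice itself contains only the 1990 rows; the extra partitions come from the declared categories, which matches the clarification in the comments that no data from other years is written.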
Reporter: Vladimir
Note: This issue was originally created as ARROW-7617. Please see the migration documentation for further details.