
[Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories) #23870

Closed
asfimport opened this issue Jan 20, 2020 · 5 comments · Fixed by #36465

Comments

@asfimport
Collaborator

Hello,

it looks like views with a selection along a categorical column are not properly respected.

For the following dummy dataframe:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

The slice by Year is saved to a partitioned parquet dataset properly:

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'])

However, if we convert Year to pandas.Categorical, it saves the whole original dataframe, not only the Year==1990 slice:

x['Year'] = x['Year'].astype('category')

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'])
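A possible workaround at the time (a sketch, not part of the original report): drop unobserved categories from the partition column before building the Arrow table, so the dictionary only contains values actually present in the slice.

```python
import pandas as pd

# Slicing a categorical series keeps the full category set in the dtype,
# which is what leads to the empty partitions.
s = pd.Series([1990, 1990, 1991]).astype('category')
sliced = s[s == 1990]
print(list(sliced.cat.categories))   # both 1990 and 1991 survive the slice

# Dropping unobserved categories before pq.write_to_dataset avoids them:
trimmed = sliced.cat.remove_unused_categories()
print(list(trimmed.cat.categories))  # only 1990 remains
```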

Reporter: Vladimir

Note: This issue was originally created as ARROW-7617. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Vladimir:
Apologies for the false alarm.

It is not saving extra data, just creating empty partitions for all categories. That may still be undesired behavior, but it is not the bug as originally reported.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
[~Filimonov] thanks for the report!

It's indeed writing all partitions. This happens because all the unique categories are still present in the type (the categories in pandas, the dictionary items in Arrow), combined with pandas' groupby default of preserving all categories when grouping, which pq.write_to_dataset uses internally to split the table by partition column. Pandas has an observed=True option in groupby (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) to ensure that only categories actually present in the data are used to group, which could be used here. A PR is certainly welcome!
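The groupby behaviour described above can be seen directly in pandas (a minimal sketch; the 1990/1991 values are illustrative, not from the library internals):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': pd.Categorical([1990, 1990, 1990], categories=[1990, 1991]),
    'val': [1.0, 2.0, 3.0],
})

# observed=False (the default at the time): one group per declared category,
# including the empty 1991 group -- hence the empty partition directories.
sizes_all = df.groupby('Year', observed=False).size()
print(len(sizes_all))       # 2 groups

# observed=True: only categories present in the data form groups.
sizes_observed = df.groupby('Year', observed=True).size()
print(len(sizes_observed))  # 1 group
```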

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Reopened and changed the title to reflect the actual issue

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@arw2019 Do you still want to be assigned this issue?

@asfimport
Collaborator Author

Andrew Wieteska / @arw2019:
@pitrou I unassigned myself.

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jul 4, 2023
…lues don't create empty files in parquet.write_to_dataset
jorisvandenbossche added a commit that referenced this issue Jul 5, 2023
…ty files for non-observed dictionary (category) values (#36465)

### What changes are included in this PR?

If we partition on a categorical variable with "unobserved" categories (values present in the dictionary, but not in the actual data), the legacy path in `pq.write_to_dataset` currently creates empty files. The new dataset-based path already has the preferred behavior; this PR fixes the legacy path and adds tests for both paths.
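To make "unobserved" concrete (a small illustration, not code from the PR itself): a categorical dtype can declare categories that never occur in the data.

```python
import pandas as pd

# 'c' is declared in the dtype but never appears in the values,
# so it is an "unobserved" category (a dictionary item with no rows).
col = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
print(list(col.categories))  # ['a', 'b', 'c']
print(set(col))              # only {'a', 'b'} are observed
```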

This also fixes one of the pandas deprecation warnings listed in #36412

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, this no longer creates a hive-style directory containing a single empty file (a parquet file with 0 rows) when users have unobserved categories. This aligns the legacy path with the new, default dataset-based path.
* Closes: #23870

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 13.0.0 milestone Jul 5, 2023