
[Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories) #23870

Closed
asfimport opened this issue Jan 20, 2020 · 5 comments · Fixed by #36465

Comments

@asfimport
Collaborator

Hello,

it looks like views with a selection along a categorical column are not properly respected.

For the following dummy dataframe:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

The slice by Year is saved to a partitioned parquet dataset properly:

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'])

However, if we convert Year to pandas.Categorical, it saves the whole original dataframe, not only the Year==1990 slice:

x['Year'] = x['Year'].astype('category')

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'])
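A possible workaround at the time (a sketch, not part of the original report): drop unobserved categories from the partition column before building the Arrow table, so the dictionary only contains values actually present in the slice.

```python
import pandas as pd

# Slicing a categorical series keeps the full category set in the dtype,
# which is what leads to the empty partitions.
s = pd.Series([1990, 1990, 1991]).astype('category')
sliced = s[s == 1990]
print(list(sliced.cat.categories))   # both 1990 and 1991 survive the slice

# Dropping unobserved categories before pq.write_to_dataset avoids them:
trimmed = sliced.cat.remove_unused_categories()
print(list(trimmed.cat.categories))  # only 1990 remains
```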

Reporter: Vladimir

Note: This issue was originally created as ARROW-7617. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Vladimir:
Apologies for the false alarm.

It is not saving extra data, just creating empty partitions for all categories. That may still be undesired behavior, but it is not the bug as originally reported.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
[~Filimonov] thanks for the report!

It's indeed writing all partitions. This happens because all the unique categories are still present in the type (the categories in pandas, the dictionary items in Arrow), combined with pandas' groupby default of preserving all categories when grouping, which pq.write_to_dataset uses internally to split the table by partition column. Pandas has an observed=True option in groupby (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) to ensure that only categories actually present in the data are used to group, which could be used here. A PR is certainly welcome!
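The groupby behaviour described above can be seen directly in pandas (a minimal sketch; the 1990/1991 values are illustrative, not from the library internals):

```python
import pandas as pd

df = pd.DataFrame({
    'Year': pd.Categorical([1990, 1990, 1990], categories=[1990, 1991]),
    'val': [1.0, 2.0, 3.0],
})

# observed=False (the default at the time): one group per declared category,
# including the empty 1991 group -- hence the empty partition directories.
sizes_all = df.groupby('Year', observed=False).size()
print(len(sizes_all))       # 2 groups

# observed=True: only categories present in the data form groups.
sizes_observed = df.groupby('Year', observed=True).size()
print(len(sizes_observed))  # 1 group
```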

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Reopened and changed the title to reflect the actual issue

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@arw2019 Do you still want to be assigned this issue?

@asfimport
Collaborator Author

Andrew Wieteska / @arw2019:
@pitrou I unassigned myself.

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Jul 4, 2023
…lues don't create empty files in parquet.write_to_dataset
jorisvandenbossche added a commit that referenced this issue Jul 5, 2023
…ty files for non-observed dictionary (category) values (#36465)

### What changes are included in this PR?

If we partition on a categorical variable with "unobserved" categories (values present in the dictionary, but not in the actual data), the legacy path in `pq.write_to_dataset` currently creates empty files. The new dataset-based path already has the preferred behavior; this PR fixes the legacy path and adds tests for both paths.
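To make "unobserved" concrete (a small illustration, not code from the PR itself): a categorical dtype can declare categories that never occur in the data.

```python
import pandas as pd

# 'c' is declared in the dtype but never appears in the values,
# so it is an "unobserved" category (a dictionary item with no rows).
col = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
print(list(col.categories))  # ['a', 'b', 'c']
print(set(col))              # only {'a', 'b'} are observed
```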

This also fixes one of the pandas deprecation warnings listed in #36412

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, this no longer creates a hive-style directory containing a single empty file (a parquet file with 0 rows) when users have unobserved categories. This aligns the legacy path with the new, default dataset-based path.
* Closes: #23870

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 13.0.0 milestone Jul 5, 2023