Fix nullable-dtype error when writing partitioned parquet data#8400
Conversation
    filename,
    partition_cols,
    fs,
    preserve_index,
presumably we don't need to worry about the order changing since this is a private function?
Right - I assumed explicit arguments were probably better since the function is private. However, I have no problem using kwargs if you think this is an unnecessary risk
It looks like there are some genuine CI failures on this.
I am having a lot of trouble reproducing the failures locally, but I'll create the same ubuntu environment and see if I can reproduce :/

Huh... Still cannot reproduce those errors. CI is passing, but I would certainly like to understand why those tests were failing.
Hi @rjzamora, have you had a chance to look into this more?
rerun tests |
…a/dask into fix-nullable-partitioned
Thanks for the ping @scharlottej13! The CI errors we were originally seeing have not popped up again recently, and I have not been able to reproduce them locally. I wish I could say that I am sure the root cause was fixed elsewhere, but I'm just unsure :/ Overall, my suggestion is that we merge this. My gut tells me that the changes in this PR should be "correct", and that the
I plan to merge this tomorrow afternoon if there are no complaints (cc @jsignell - In case you wanted to take a final look) |
Avoids duplicated pandas `DataFrame` objects within `_write_partitioned` (used for directory-partitioned `to_parquet` operations for the "pyarrow" engine). The current logic converts the original `DataFrame` partition into a pyarrow `Table` before writing, but then converts the `Table` back into a `DataFrame` for "hive" partitioning. This round-trip approach is inefficient when the original `DataFrame` is still available for the groupby operation. The current approach also exposes a risk of losing nullable-dtype information (which can be a serious problem for appended writes - see #8373).

Note that this PR will fix the dtype information stored in the metadata of partitioned-parquet files. However, appending will still break for older partitioned datasets that were produced with earlier versions of Dask. We will need to improve the "pyarrow" engine to coerce compatible-but-different dtypes to allow users to append to these older files.
`pre-commit run --all-files`