Fix nullable-dtype error when writing partitioned parquet data#8400
Conversation
    filename,
    partition_cols,
    fs,
    preserve_index,
presumably we don't need to worry about the order changing since this is a private function?
Right - I assumed explicit arguments were probably better since the function is private. However, I have no problem using kwargs if you think this is an unnecessary risk
It looks like there are some genuine CI failures on this.
I am having a lot of trouble reproducing the failures locally, but I'll create the same ubuntu environment and see if I can reproduce :/

Huh... Still cannot reproduce those errors. CI is passing, but I would certainly like to understand why those tests were failing.
Hi @rjzamora, have you had a chance to look into this more?
rerun tests |
…a/dask into fix-nullable-partitioned
Thanks for the ping @scharlottej13! The CI errors we were originally seeing have not popped up again recently, and I have not been able to reproduce them locally. I wish I could say that I am sure the root cause was fixed elsewhere, but I'm just unsure :/ Overall, my suggestion is that we merge this. My gut tells me that the changes in this PR should be "correct", and that the
I plan to merge this tomorrow afternoon if there are no complaints (cc @jsignell - In case you wanted to take a final look) |
Avoids duplicated pandas `DataFrame` objects within `_write_partitioned` (used for directory-partitioned `to_parquet` operations for the "pyarrow" engine). The current logic converts the original `DataFrame` partition into a pyarrow `Table` before writing, but then converts the `Table` back into a `DataFrame` for "hive" partitioning. This round-trip approach is inefficient when the original `DataFrame` is still available for the groupby operation. The current approach also exposes a risk of losing nullable-dtype information (which can be a serious problem for appended writes - see #8373).

Note that this PR will fix the dtype information stored in the metadata of partitioned-parquet files. However, appending will still break for older partitioned datasets that were produced with earlier versions of Dask. We will need to improve the "pyarrow" engine to coerce compatible-but-different dtypes to allow users to append to these older files.
`pre-commit run --all-files`