Avoid collecting parquet metadata in pyarrow when write_metadata_file=False #8906
Nice, thanks Rick!
This is a pretty simple change, so I'll plan to merge EOD tomorrow if there are no other comments. cc @dask/io
Hi @rjzamora! Thanks for this change. I have a couple of questions, but I wouldn't consider them blockers to merging, more for my own information:
Good questions @bryanwweber
My understanding is that we do need to finish the graph with a single task that depends on all partitions being written. We always call this task
Yes. In the interest of having less code to maintain, it still makes sense to replace this function with the pyarrow version. However, we definitely do some things differently in the Dask version (mostly related to index/pandas-metadata preservation). Therefore, I would suggest leaving this possibly tricky task for a stand-alone PR.
Yeah, I think that would be good. I don't think you need to address it here though 😄
Agreed, thanks for the clarification!
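To illustrate the point above about finishing the graph with a single task that depends on all partitions being written, here is a minimal toy sketch. It is not Dask's scheduler or its `to_parquet` implementation: the dict-based graph, the evaluator, and every name in it (`write_part`, `finalize`, the `"write"` and `"final"` keys) are hypothetical and chosen only for illustration.

```python
# Toy illustration only: NOT Dask's scheduler or its to_parquet code.
# All names (write_part, finalize, "write"/"final" keys) are hypothetical.


def write_part(i, sink):
    """Pretend to write partition i by recording a file name in `sink`."""
    sink.add(f"part.{i}.parquet")
    return None


def finalize(results):
    """Final task: does no real work, but depends on every partition write."""
    return None


def build_graph(npartitions, sink):
    # Graph as a dict mapping key -> (callable, *args).
    graph = {("write", i): (write_part, i, sink) for i in range(npartitions)}
    # One terminal key whose argument lists every write key.
    graph[("final", 0)] = (finalize, [("write", i) for i in range(npartitions)])
    return graph


def get(graph, key):
    """Tiny recursive evaluator: resolve keys (also inside lists), then call."""

    def resolve(arg):
        if isinstance(arg, list):
            return [resolve(a) for a in arg]
        if isinstance(arg, tuple) and arg in graph:
            return get(graph, arg)
        return arg

    func, *args = graph[key]
    return func(*[resolve(a) for a in args])


written = set()
graph = build_graph(3, written)
# Asking for the single final key is enough to trigger every partition write.
result = get(graph, ("final", 0))
```

In this sketch, requesting only `("final", 0)` forces all three hypothetical partition writes to run, which is the role the final task plays in the discussion above, even when it has no metadata to write.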
Such a simple and related change. Call it whatever the likes of to_zarr, to_csv etc. have (which do not have a metadata write at the end).
I agree that changing the task name is very easy, but we do need to agree on a name. I don't think it's quite as simple as looking at what other IO functions do, because everything else is still using There is no task/layer used to tie the Do you think |
Sounds good to me
… is not being written
Okay - Changed the name of the final task to @bryanwweber - Note that I did test out the case where we do not add a final task to tie everything together. I did this by converting the
Thanks @rjzamora, it might be worth adding a comment to the code to that effect, to clarify for future readers.
Thanks @rjzamora, this looks good to me.
This should address #7977 for the case that write_metadata_file=False (which may become the default soon anyway). There may still be an upstream issue in pyarrow, but this change should allow Dask users to avoid it. cc @jcrist