
Avoid collecting parquet metadata in pyarrow when write_metadata_file=False #8906

Merged

merged 5 commits into main from no-meta-collect-arrow on Apr 13, 2022

Conversation

rjzamora
Member

@rjzamora rjzamora commented Apr 8, 2022

This should address #7977 for the case that write_metadata_file=False (which may become the default soon anyway). There may still be an upstream issue in pyarrow, but this change should allow Dask users to avoid it.

cc @jcrist
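
A rough usage sketch of the case this PR targets (not code from the PR; the path and data below are made up):

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"a": range(100), "b": range(100)}),
    npartitions=4,
)

# Skipping the global _metadata file is what lets the pyarrow engine avoid
# collecting per-partition metadata back on the client.
df.to_parquet("out.parquet", engine="pyarrow", write_metadata_file=False)
```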

@rjzamora rjzamora changed the title avoid collecting metadata in pyarrow write when write_metadata_file i… Avoid collecting parquet metadata in pyarrow when write_metadata_file=False Apr 8, 2022
Member

@jcrist jcrist left a comment


Nice, thanks Rick!

dask/dataframe/io/parquet/arrow.py (outdated review comment, resolved)
@rjzamora rjzamora requested a review from jcrist April 12, 2022 17:08
@rjzamora
Member Author

This is a pretty simple change, so I'll plan to merge EOD tomorrow if there are no other comments. cc @dask/io

@bryanwweber
Contributor

Hi @rjzamora! Thanks for this change. I have a couple of questions, but I wouldn't consider them blockers to merging, more for my own information:

  1. df.to_parquet(..., write_metadata_file=False) still generates a metadata-to-parquet task, which is effectively a no-op, but it's mildly confusing that the task is even generated at all. I think this PR avoids even collecting the metadata, which is an improvement, but do you suppose there's a way to avoid generating that task?
  2. The docstring for _write_partitioned mentions that the method can be removed once ARROW-8224 is resolved. As that issue is now resolved, should the method be removed? I think it's out of scope for this PR, but just curious.

@rjzamora
Member Author

Good questions @bryanwweber

df.to_parquet(..., write_metadata_file=False) still generates a metadata-to-parquet task, which is effectively a no-op, but it's mildly confusing that the task is even generated at all. I think this PR avoids even collecting the metadata, which is an improvement, but do you suppose there's a way to avoid generating that task?

My understanding is that we do need to finish the graph with a single task that depends on all partitions being written. For now, we always call this task "metadata-<token>", but we could use a different label when there is no metadata being written (like "finish-" or "barrier-"). Does this option interest you at all?
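
To illustrate what I mean (a hand-written toy graph, not the actual dask.dataframe implementation): the final key only has to depend on every write task, so the scheduler can treat it as a barrier.

```python
from dask.threaded import get

def write_partition(i):
    # stand-in for the real per-partition parquet write
    return f"wrote part {i}"

def finish(parts):
    # barrier task: runs only after every write task has completed
    return None

graph = {("write", i): (write_partition, i) for i in range(4)}
# The final key is currently named "metadata-<token>"; it could just as well
# be "finish-<token>" or "barrier-<token>" when no _metadata file is written.
graph["metadata-toy-token"] = (finish, [("write", i) for i in range(4)])

# Computing the single final key forces all of the writes to run first.
get(graph, "metadata-toy-token")
```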

The docstring for _write_partitioned mentions that the method can be removed once ARROW-8224 is resolved. As that issue is now resolved, should the method be removed? I think it's out of scope for this PR, but just curious.

Yes. In the interest of having less code to maintain, it still makes sense to replace this function with the pyarrow version. However, we definitely do some things differently in the Dask version (mostly related to index/pandas-metadata preservation). Therefore, I would definitely suggest that this possibly-tricky task be left for a stand-alone PR.

@bryanwweber
Contributor

but we could also use a different label when there is no metadata being written (like "finish-" or "barrier-"). Does this option interest you at all?

Yeah, I think that would be good. I don't think you need to address it here though 😄

Therefore, I would definitely suggest that this possibly-tricky task be left for a stand-alone PR.

Agreed, thanks for the clarification!

@martindurant
Member

I don't think you need to address it here though

Such a simple and related change. Call it whatever the likes of to_zarr, to_csv etc. have (which do not have a metadata write at the end).

@rjzamora
Member Author

Such a simple and related change. Call it whatever the likes of to_zarr, to_csv etc. have (which do not have a metadata write at the end).

I agree that changing the task name is very easy, but we do need to agree on a name. I don't think it's quite as simple as looking at what other IO functions do, because everything else is still using Delayed and needs to be changed anyway :)

There is no task/layer used to tie the Delayed to_csv tasks together (it looks like it just passes back a list of Delayed objects). However, Zarr does seem to tie everything together with a "store-" layer when the user sets compute=False.

Do you think "store-" makes sense, @martindurant?

@martindurant
Member

Sounds good to me

@rjzamora
Member Author

Okay - I changed the name of the final task to "store-" when metadata is not being written.

@bryanwweber - Note that I did test out the case where we do not add a final task to tie everything together. I did this by converting the data_write collection to a Bag collection. I don't think this gains us anything, because it is much slower to call compute on a multi-partition collection than it is to call compute on a Scalar collection with an extra layer. This is because computing a multi-partition collection requires the transfer and concatenation of every partition on the client. Even if there is nothing to transfer/concatenate, it is faster to just define a barrier task in the graph.
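
A hedged sketch of that trade-off using plain dask.delayed (illustrative only; the function names and return values here are made up):

```python
import dask

@dask.delayed
def write_partition(i):
    # stand-in for one partition write
    return f"md-{i}"

writes = [write_partition(i) for i in range(4)]

# Computing the per-partition results directly pulls every result back to the
# client (analogous to computing a multi-partition Bag of write results).
print(dask.compute(*writes))

# A single barrier task that depends on all of the writes returns one tiny
# value, so nothing meaningful has to be transferred or concatenated.
barrier = dask.delayed(lambda parts: None)(writes)
print(barrier.compute())
```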

@bryanwweber
Contributor

Thanks @rjzamora, it might be worth adding a comment to the code to that effect to clarify for future readers?

Member

@jcrist jcrist left a comment


Thanks @rjzamora, this looks good to me.

@jcrist jcrist merged commit 0ccad39 into dask:main Apr 13, 2022
@rjzamora rjzamora deleted the no-meta-collect-arrow branch April 13, 2022 19:53