
Avoid collecting parquet metadata in pyarrow when write_metadata_file=False #8906

Merged

merged 5 commits into main from no-meta-collect-arrow on Apr 13, 2022

Conversation

rjzamora
Member

@rjzamora rjzamora commented Apr 8, 2022

This should address #7977 for the case that write_metadata_file=False (which may become the default soon anyway). There may still be an upstream issue in pyarrow, but this change should allow Dask users to avoid it.

cc @jcrist
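
A rough usage sketch of the case this PR targets (not code from the PR; the path and data below are made up):

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"a": range(100), "b": range(100)}),
    npartitions=4,
)

# Skipping the global _metadata file is what lets the pyarrow engine avoid
# collecting per-partition metadata back on the client.
df.to_parquet("out.parquet", engine="pyarrow", write_metadata_file=False)
```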

@rjzamora rjzamora changed the title avoid collecting metadata in pyarrow write when write_metadata_file i… Avoid collecting parquet metadata in pyarrow when write_metadata_file=False Apr 8, 2022
Member

@jcrist jcrist left a comment


Nice, thanks Rick!

dask/dataframe/io/parquet/arrow.py (outdated review comment, resolved)
@rjzamora rjzamora requested a review from jcrist April 12, 2022 17:08
@rjzamora
Member Author

This is a pretty simple change, so I'll plan to merge EOD tomorrow if there are no other comments. cc @dask/io

@bryanwweber
Contributor

Hi @rjzamora! Thanks for this change. I have a couple of questions, but I wouldn't consider them blockers to merging, more for my own information:

  1. df.to_parquet(..., write_metadata_file=False) still generates a metadata-to-parquet task, which is effectively a no-op, but it's mildly confusing that the task is even generated at all. I think this PR avoids even collecting the metadata, which is an improvement, but do you suppose there's a way to avoid generating that task?
  2. The docstring for _write_partitioned mentions that the method can be removed once ARROW-8224 is resolved. As that issue is now resolved, should the method be removed? I think it's out of scope for this PR, but just curious.

@rjzamora
Member Author

Good questions @bryanwweber

df.to_parquet(..., write_metadata_file=False) still generates a metadata-to-parquet task, which is effectively a no-op, but it's mildly confusing that the task is even generated at all. I think this PR avoids even collecting the metadata, which is an improvement, but do you suppose there's a way to avoid generating that task?

My understanding is that we do need to finish the graph with a single task that depends on all partitions being written. For now, we always call this task "metadata-<token>", but we could use a different label when there is no metadata being written (like "finish-" or "barrier-"). Does this option interest you at all?
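
To illustrate what I mean (a hand-written toy graph, not the actual dask.dataframe implementation): the final key only has to depend on every write task, so the scheduler can treat it as a barrier.

```python
from dask.threaded import get

def write_partition(i):
    # stand-in for the real per-partition parquet write
    return f"wrote part {i}"

def finish(parts):
    # barrier task: runs only after every write task has completed
    return None

graph = {("write", i): (write_partition, i) for i in range(4)}
# The final key is currently named "metadata-<token>"; it could just as well
# be "finish-<token>" or "barrier-<token>" when no _metadata file is written.
graph["metadata-toy-token"] = (finish, [("write", i) for i in range(4)])

# Computing the single final key forces all of the writes to run first.
get(graph, "metadata-toy-token")
```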

The docstring for _write_partitioned mentions that the method can be removed once ARROW-8224 is resolved. As that issue is now resolved, should the method be removed? I think it's out of scope for this PR, but just curious.

Yes. In the interest of having less code to maintain, it still makes sense to replace this function with the pyarrow version. However, we definitely do some things differently in the Dask version (mostly related to index/pandas-metadata preservation). Therefore, I would definitely suggest that this possibly-tricky task be left for a stand-alone PR.

@bryanwweber
Contributor

but we could also use a different label when there is no metadata being written (like "finish-" or "barrier-"). Does this option interest you at all?

Yeah, I think that would be good. I don't think you need to address it here though 😄

Therefore, I would definitely suggest that this possibly-tricky task be left for a stand-alone PR.

Agreed, thanks for the clarification!

@martindurant
Member

I don't think you need to address it here though

Such a simple and related change. Call it whatever the likes of to_zarr, to_csv etc. have (which do not have a metadata write at the end).

@rjzamora
Member Author

Such a simple and related change. Call it whatever the likes of to_zarr, to_csv etc. have (which do not have a metadata write at the end).

I agree that changing the task name is very easy, but we do need to agree on a name. I don't think it's quite as simple as looking at what other IO functions do, because everything else is still using Delayed and needs to be changed anyway :)

There is no task/layer used to tie the Delayed to_csv tasks together (it looks like it just passes back a list of Delayed objects). However, Zarr does seem to tie everything together with a "store-" layer when the user sets compute=False.

Do you think "store-" makes sense, @martindurant?

@martindurant
Member

Sounds good to me

@rjzamora
Member Author

Okay - I changed the name of the final task to "store-" when metadata is not being written.

@bryanwweber - Note that I did test out the case where we do not add a final task to tie everything together. I did this by converting the data_write collection to a Bag collection. I don't think this gains us anything, because it is much slower to call compute on a multi-partition collection than it is to call compute on a Scalar collection with an extra layer. This is because computing a multi-partition collection requires the transfer and concatenation of every partition on the client. Even if there is nothing to transfer/concatenate, it is faster to just define a barrier task in the graph.
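
A hedged sketch of that trade-off using plain dask.delayed (illustrative only; the function names and return values here are made up):

```python
import dask

@dask.delayed
def write_partition(i):
    # stand-in for one partition write
    return f"md-{i}"

writes = [write_partition(i) for i in range(4)]

# Computing the per-partition results directly pulls every result back to the
# client (analogous to computing a multi-partition Bag of write results).
print(dask.compute(*writes))

# A single barrier task that depends on all of the writes returns one tiny
# value, so nothing meaningful has to be transferred or concatenated.
barrier = dask.delayed(lambda parts: None)(writes)
print(barrier.compute())
```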

@bryanwweber
Contributor

Thanks @rjzamora, it might be worth adding a comment to the code to that effect to clarify for future readers?

Member

@jcrist jcrist left a comment


Thanks @rjzamora, this looks good to me.

@jcrist jcrist merged commit 0ccad39 into dask:main Apr 13, 2022
@rjzamora rjzamora deleted the no-meta-collect-arrow branch April 13, 2022 19:53