Issue with pipeline resilience when using `to_parquet` and preemptible workers
#10463
Labels: dataframe, io, needs attention, parquet
Hello everyone, and first of all thanks a lot for Dask :D
This feature request stems from a conversation on the Dask Discourse forum.
Most of the information is in that post, but I will try to synthesize it here.
When using preemptible workers with Dask, `to_parquet` can be an issue because the last task, `store-to-parquet`, waits for all partitions to be saved. So your workers hold results in memory for parts of the graph that have already been computed and saved. But since preemptible workers can be restarted at any time, even results already saved to memory may have to be recomputed when a worker is killed.
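The memory effect described above can be illustrated with a minimal pure-Python sketch (this is not Dask's actual code; `write`, `partitions`, and the task names are illustrative). A final barrier-style task must keep a reference to every upstream result until it runs, whereas independent per-partition writes let each result be released immediately:

```python
def write(part):
    """Pretend to persist one partition and return a small token."""
    return f"written:{len(part)}"

partitions = [list(range(10_000)) for _ in range(4)]

# Pattern A: a final "store-to-parquet"-style task depends on every write,
# so all tokens (and, in a real graph, upstream state) stay referenced
# until the barrier runs.
tokens = [write(p) for p in partitions]
barrier_result = tuple(tokens)  # the last task sees everything at once

# Pattern B: each partition is released as soon as it is written, which is
# what returning the per-partition results directly would allow.
results = []
for i in range(len(partitions)):
    results.append(write(partitions[i]))
    partitions[i] = None  # this partition's memory can be freed now
```

On a preemptible cluster, Pattern A means a killed worker can force recomputation of partitions that were already written, because their results were still pinned by the barrier.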
On the one hand, tying things together is neat and avoids the overhead of concatenating results on the client.
On the other hand, when using preemptible workers you would want to return the last `map_partitions` result instead, so that memory can be released as soon as possible if you wish. I made a PR on my own fork there and would be happy to submit a PR to this repository if this is deemed good to go.
In the PR I made returning the `map_partitions` result the default when we are not writing the metadata file, but it could be made optional instead. Thanks a lot in advance!
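The proposed control flow can be sketched as follows (a hedged sketch only; `to_parquet_sketch` and its return shapes are illustrative, not Dask's actual internals or API):

```python
def to_parquet_sketch(partitions, write_metadata_file=True):
    # "map_partitions" step: one independent write task per partition.
    written = [f"part-{i}.parquet" for i, _ in enumerate(partitions)]
    if write_metadata_file:
        # Current behaviour: a final metadata task depends on every write,
        # so nothing can be released until all partitions have finished.
        return {"metadata": "_metadata", "parts": written}
    # Proposed behaviour: hand back the per-partition results directly,
    # so each one can be released as soon as it completes.
    return written
```

With `write_metadata_file=False` the caller gets the per-partition results and the scheduler is free to drop each one as soon as it is done, which is the resilience win on preemptible workers.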