Issue with pipeline resilience when using to_parquet and preemptible workers #10463

Open
hyenal opened this issue Aug 25, 2023 · 0 comments
Labels: dataframe, io, needs attention, parquet

hyenal commented Aug 25, 2023

Hello everyone, and first of all thanks a lot for Dask :D

This feature request stems from a conversation on the Dask Discourse.

Most of the information is in that post, but I will try to summarize it here.

When using preemptible workers with Dask, to_parquet can be a problem because the final store-to-parquet task waits for every partition to be saved. Until it runs, the workers keep holding in memory the results for the parts of the graph that have already been computed and saved.
But preemptible workers can be restarted at any time, so even though those results were held in memory, you may have to recompute them if the worker holding them is killed.
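
To make the cost concrete, here is a rough toy model in plain Python. It is not Dask's scheduler: `simulate`, `kill_after` and `final_barrier` are made-up names, and it assumes a single worker that writes partitions in order and is preempted exactly once.

```python
def simulate(n_partitions, kill_after, final_barrier):
    """Toy model of one worker writing partitions and being preempted once.

    Returns (peak_held, recomputed): the largest number of finished write
    results held in memory at once, and how many finished writes were lost
    with the preempted worker and had to run again.
    """
    held = set()      # finished write results still kept in worker memory
    peak_held = 0
    recomputed = 0
    for part in range(n_partitions):
        if part == kill_after:
            # Everything still held dies with the worker and is rerun,
            # after which it is held again (so `held` is unchanged).
            recomputed = len(held)
        held.add(part)                 # the write for this partition finished
        peak_held = max(peak_held, len(held))
        if not final_barrier:
            held.discard(part)         # no final task needs it: release now
    return peak_held, recomputed


# Assumed scenario: 8 partitions, worker preempted just before partition 6.
with_barrier = simulate(8, kill_after=6, final_barrier=True)      # (8, 6)
without_barrier = simulate(8, kill_after=6, final_barrier=False)  # (1, 0)
```

In this model the final task forces all 8 results to stay around and 6 finished writes are rerun after the preemption, while independent writes hold one result at a time and lose nothing.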

On the one hand, tying everything together in one final task is neat and avoids the overhead of concatenating results on the client.
On the other hand, when using preemptible workers you would want to_parquet to return the last map_partitions result instead, so that memory can be released as soon as possible.

I opened a PR on my own fork there and would be happy to submit one to this repository if the approach is deemed good to go.

In the PR, returning the map_partitions result is the default whenever the metadata file is not written, but I could make it opt-in instead.
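
Roughly, the return shape I have in mind looks like the sketch below. This is not the actual patch and it does not call Dask: `plan_writes`, `WriteTask` and `FinalTask` are hypothetical names in plain Python, and the sketch assumes that only writing the metadata file needs to see every partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteTask:
    """Hypothetical stand-in for one per-partition write task."""
    path: str


@dataclass(frozen=True)
class FinalTask:
    """Hypothetical stand-in for the final task that depends on every write."""
    depends_on: tuple


def plan_writes(n_partitions, out_dir, write_metadata_file):
    """Return what the caller would get back in this toy model.

    With a metadata file, one final task has to see every write, so a
    single FinalTask is returned. Without one, the independent
    per-partition writes are returned directly, so each can be released
    as soon as it is done.
    """
    writes = tuple(
        WriteTask(f"{out_dir}/part.{i}.parquet") for i in range(n_partitions)
    )
    if write_metadata_file:
        return FinalTask(depends_on=writes)
    return writes


independent = plan_writes(3, "out", write_metadata_file=False)
tied = plan_writes(3, "out", write_metadata_file=True)
```

Making this opt-in would just mean replacing the `write_metadata_file` check with an explicit flag, whose name would still have to be decided.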

Thanks a lot in advance

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Aug 25, 2023
@hyenal hyenal changed the title Improving pipeline resilience when using to_parquet and preemptible workers Issue with pipeline resilience when using to_parquet and preemptible workers Aug 25, 2023
@rjzamora rjzamora added dataframe io parquet and removed needs triage Needs a response from a contributor labels Aug 26, 2023
@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Sep 25, 2023