Retry operations on network issues #3294
Thank you @byjott. In principle this seems fine to me. I'm glad to have you all making sure that Dask operates well in situations where network connections are unstable. What do you think about adding something like await self.scheduler.foo(..., retry=True)? (Not a requirement; it's an honest question.)
Thanks for looking into this.
Actually, I did this as the first implementation. It would also work and be more concise at the calling site.
I'm wondering if there are any scenarios where we would not want to retry.
I thought about that, but as I wrote above: "only some operations are re-tried, as for some operations, triggering it twice may have undesired effects". To give an example: I'm not sure whether we really want to retry a scheduler "rebalance" operation, as this could lead to two "rebalance" operations being active at the same time, and I'm not sure the code was written to make this safe. Similar concerns apply to other operations.
distributed/distributed.yaml
retry_operations:    # some operations (such as gathering data) are subject to re-tries with the below parameters
  max_retries: 0     # the maximum number to retry an operation in case of a connection problem
  base_delay: 1s     # the first non-zero delay between re-tries
  max_delay: 20s     # the maximum delay between re-tries
I recommend the following names:
retry:
  count: 0
  delay:
    min: 1s
    max: 20s
Also, if you want to keep multi-word names, then I recommend hyphens over underscores. My reasoning here is mostly consistency with the names that are already in the file.
Alright, I just changed it to what you proposed. Thanks for the quick review!
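For readers following this thread, here is a minimal sketch of the kind of retry the settings above (count, delay min, delay max) could control. It is not the code from this PR: the helper names retry_operation and backoff_delays, the doubling backoff, and the choice to retry only on OSError subclasses are all assumptions made for illustration.

```python
import asyncio


def backoff_delays(count, delay_min, delay_max):
    """Seconds to wait before each of `count` retries.

    Assumption: the delay doubles on every retry and is capped at delay_max.
    """
    return [min(delay_min * 2 ** i, delay_max) for i in range(count)]


async def retry_operation(operation, *, count=0, delay_min=1.0, delay_max=20.0,
                          retry_on=(OSError,), sleep=asyncio.sleep):
    """Await `operation()` and retry up to `count` times on connection errors.

    `operation` is a zero-argument callable returning a fresh awaitable,
    because a coroutine object cannot be awaited twice.
    Hypothetical helper, not part of any library API.
    """
    delays = backoff_delays(count, delay_min, delay_max)
    for attempt in range(count + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == count:
                raise  # out of retries: surface the original error
            await sleep(delays[attempt])


if __name__ == "__main__":
    attempts = 0

    async def flaky_gather():
        # Simulates two TCP aborts followed by a successful call.
        global attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionResetError("simulated TCP abort")
        return "data"

    result = asyncio.run(
        retry_operation(flaky_gather, count=3, delay_min=0.01, delay_max=0.05)
    )
    print(result, "after", attempts, "attempts")
```

With count: 0, as in the proposed default, the sketch makes a single attempt and re-raises the first error, so retrying stays opt-in. Wrapping only selected calls at the call site also matches the concern raised earlier about operations such as "rebalance" that may not be safe to trigger twice.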
Other than the small comment about the configuration names I'm happy to merge what's here. Thanks @byjott for your efforts here.
Thank you for your effort here @byjott. This is in.
We operate distributed in the cloud and see TCP connection aborts. Unfortunately, distributed often does not recover cleanly from such situations, although a simple retry would have helped in most cases. This PR proposes to add a more generic retry to some operations.
Notes: